[olsr-dev] Improvements in the algorithm, and bugs that are fixed in the upcoming release
Tue Nov 15 22:39:30 CET 2005
Hi all -
let me introduce you to some thoughts and off-list discussions that we
had here in Berlin about the occurance of routing loops in a mesh with
I stated the hypothesis that routing loops occur rather local than from
end-to-end in a mesh cloud. Olsrd with LQ-Extension attempts by design
to know all the best routes all over the whole mesh cloud, but it is
likely that it never will be able to achieve this in a mesh that has
more than a handful of nodes. TC informations are likely to be lost on
there way while they propagate over the mesh. And this likelyhood
increases with the number of nodes that forward TC information.
But this fact doesnt necessarily harm the routing of traffic if it
travels a long multihop path. A node at one end of a mesh cloud may have
the illusion to know the exact and best path its packages go when
sending payload to a node that is several hops away. But this
information may be pretty outdated and incomplete.
In fact all we can and have to achieve with the algorithm is a
reasonable choice for the next two or three hops. If our path is 8 hops,
for example, nodes that are more hops away from the node initiating
traffic and closer to the destination have better and more accurate
information about the best path. They dont know what a node that
initiates traffic thinks about the path its packets have to take - they
have more accurate routing information and will look into their routing
table and make a choice based on their knowledge.
If I send a snailmail parcel to india I dont have to write the name of
the postman on the paket that finally hands it over, or the brand of
bicycle he rides. I dont have to decide the way he has to drive in the
village of the destination. He knows better than me.
Btw. this is the attempt of source routing - quite stupid - isn't it?
So my conclusion is that we have the challenge to keep information of
TCs in sync in a local area - say 2-3 hops wide. And we need to have a
vague idea how the meshcloud looks alike many hops away. But we dont
have to introduce a high overhead trying to achieve the (almost)
impossible and completely sync routing tables all over the mesh.
People are happy with their olsrd-meshclouds as long as they only share
a few internet connections, because their uplinks are usually a
bottleneck that limits traffic and therefore interference between
transmissions. As someone or something starts to saturate a multihop
link 0.4.9 and 0.4.8 go mad (i dont speak about the rfc-compatible
mode). The simple reason is that the routing tables get out of sync -
because Hellos, TCs, HNAs *everything that the protocol has to transfer
sucessfully in order to work* gets lost until the route breaks down that
the traffic saturating the link used. As the traffic causing the
interference breaks down olsrd becomes stable again until the link gets
saturated again. And so on and so on...
So here are the conclusions:
The routing information in an area of 2-3 hops radius around a node has
to be in sync. The same applies to all other message types that olsrd
uses. Especially TCs and MPR selections. MIDs should be reasonably
reliable with long validity timeouts. The problem here are lost packages
if the radio channel is saturated. The quick fix is to send TC and MPR
selection messages more often than Hellos, so TCs are more redundant.
Maybe two times more frequent than Hellos is a good value, maybe more.
The best thing IMHO would be a algorithm that will make sure the
neighbours have actually received the TC packets. We don't have to
bother about overhead here. If the protocol cant deal with a payload
that starts saturating bandwidth it is far from optimal, pardon
0.4.9 and .8 had a bug in the algorithm - a node learns about a change
in the topology by Hellos and changes his routing table and routing
decisions accordingly *before* telling its neighburs with TCs about its
decisions. Depending on how often the TCs are send in relation to Hellos
this can cause trouble. Now Thomas has implemented a fix that makes sure
a node changes its routing table *after* it told others with a TC-message.
But so far we do nothing to really make sure our precious routing
information is in sync in a local area. I assume this is the next
important thing to do in the olsrd algorithm.
More information about the Olsr-dev