[Olsr-dev] pre-Xmas Bug Hunting and other stuff

Bernd Petrovitsch (spam-protected)
Fri Dec 21 12:23:19 CET 2007

Hi all!

I have been quite quiet lately, mainly because my day job also needs
attention (and the pre-Xmas season in Vienna means various Xmas parties
and meeting people for a beer ;-).

More seriously:

We (at least Hannes Gredler and /me) *thought* that it was a good point
(and time) for a release (and several others didn't disagree :-).
Sven-Ola Tücke has some patches with build fixes and cleanups for the
Windows port which can/should IMHO go in before that.

We are also in the process of migrating from CVS (on sf.net) to
Mercurial (on sf.net). BTW, this was driven primarily by Hannes.
The benefits are:
- since Mercurial is one of these modern distributed SCMs, it is easier
  for developers to move changesets around.
- since anonymous access to the main repository[0] will go through a
  CGI script on the sf.net server, there shouldn't be any delay (of
  several hours) between a commit mail and the actual change showing up
  in the publicly visible repository.
- Mercurial automagically provides an RSS feed for a repository.
- Since I'm personally quite mail-centric, we will also send emails
  about changes to the main repository.
  To minimize changes and effort on all sides, I intend to keep the
  existing (spam-protected) mailing list for commit mails.

For a smooth transition, we need to update the documentation on
http://www.olsr.org/, including a simple introduction for people who
know CVS (WTH, I'm such a person). This should happen over the Xmas
holidays.

Why the emphasis on *thought* above:
The FunkFeuer net in Vienna started upgrading several nodes to 0.5.4,
including the gateway to the rest of the Internet. Afterwards, however,
we experienced route flaps in the net.

Summarizing from the internal FunkFeuer core list (which is in the Cc:
and is otherwise German-speaking):

It turned out that once in a while (the "while" can AFAIK range from a
few minutes to many minutes) olsrd decides that a neighbor is not
reachable (read: ETX == 0) and drops all routes to it. After a while,
the connection is back, all routes are reinstalled, and everything is
as before.
And that is quite noticeable if the "dropped" neighbor is the main link
to the Internet gateway.
This also happens on OpenVPN-tunneled connections, which usually sit at
about ETX == 1.00.

BTW, the thread on http://www.freifunk-bno.de/forum/index.php?topic=930.0
(in German, found via Google) seems to describe a similar problem.

Reverting the Internet gateway to 0.5.0 solved the problem (or at least
seems to have). So it seems to be related to the olsrd version, hence
the decision to postpone a release until the cause of this is clear.

ATM no one knows (to the best of my knowledge) whether this is a new bug
in the implementation, a problem with combining (vastly?) different
versions, something hidden that is only now coming up, or something
completely different.

The main question IMHO is: Why does olsrd decide at some point that
ETX == 0, even on a stable link?

Answering that requires adding debug code to a known br0ken version (the
thread above indicates that e.g. CVS-HEAD is one) and finding the cause
on a node that reliably shows the issue.
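To make the question concrete, here is a minimal, purely illustrative sketch (in Python, not olsrd's C code) of the usual ETX definition, ETX = 1 / (LQ * NLQ), assuming the convention this mail implies: a value of 0 is used as a sentinel for "not reachable". The threshold MIN_LINK_QUALITY and the helper names are hypothetical. The debug wrapper records the raw LQ and NLQ whenever the sentinel is produced, which is the kind of information debug code on a flapping node would need to capture.

```python
from typing import List

# Hypothetical threshold; olsrd's real constant (if any) may differ.
MIN_LINK_QUALITY = 0.01


def etx(lq: float, nlq: float) -> float:
    """Return ETX = 1 / (LQ * NLQ), or 0.0 as an assumed 'unreachable'
    sentinel when either direction's delivery ratio is below the threshold."""
    if lq < MIN_LINK_QUALITY or nlq < MIN_LINK_QUALITY:
        return 0.0
    return 1.0 / (lq * nlq)


def etx_with_debug(neighbor: str, lq: float, nlq: float, log: List[str]) -> float:
    """Same as etx(), but records the raw inputs whenever the sentinel is hit,
    so a log from a flapping node shows which direction collapsed."""
    value = etx(lq, nlq)
    if value == 0.0:
        log.append(f"{neighbor}: ETX dropped to 0 (LQ={lq:.2f}, NLQ={nlq:.2f})")
    return value
```

Under these assumptions, a perfect tunnel (LQ = NLQ = 1.0) yields ETX 1.0, matching what the OpenVPN links usually show, while a single collapsed direction is enough to flip the result to 0.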

If you are not that into programming and debugging:
It would also help if someone who experiences this problem reliably and
quickly could find the point in time in CVS where it started to happen,
or at least two points: one where it is definit^Wmost certainly not in
yet (at or after 0.5.0) and a later one where it is there (at or before
0.5.4). Then everyone can look at the code in between.

That means checking out some in-between version from CVS, trying it out,
and seeing whether the problem occurs. Write the result down and take an
earlier version (if it occurs) or a later one (if it doesn't). Ideally,
one takes the middle of the remaining interval to minimize the number of
tries.
Repeat until you feel you know the above result.
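The procedure above is a plain binary search. As a minimal sketch, assume the snapshots (e.g. CVS checkout dates) are ordered oldest to newest, the first one is good, the last one is bad, and the bug stays once it has appeared. The predicate is_bad is a hypothetical stand-in for the manual step of building a snapshot, running it on the affected node, and watching for route flaps.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def first_bad(snapshots: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first snapshot that shows the bug.

    Assumes snapshots are ordered oldest to newest, snapshots[0] is known
    good (e.g. 0.5.0), snapshots[-1] is known bad (e.g. 0.5.4), and the bug
    is present in every snapshot after the one that introduced it.
    """
    if len(snapshots) < 2:
        raise ValueError("need at least one good and one bad snapshot")
    good, bad = 0, len(snapshots) - 1  # invariant: good is OK, bad shows the bug
    while bad - good > 1:
        mid = (good + bad) // 2  # test the middle of the remaining interval
        if is_bad(snapshots[mid]):
            bad = mid   # bug already present: keep searching the earlier half
        else:
            good = mid  # still fine: keep searching the later half
    return snapshots[bad]


if __name__ == "__main__":
    # Hypothetical simulation: 100 snapshots, bug introduced at index 37.
    tried: List[int] = []

    def simulated_test(snapshot: int) -> bool:
        tried.append(snapshot)
        return snapshot >= 37

    print(first_bad(list(range(100)), simulated_test), len(tried))
```

Because each test halves the remaining interval, 100 snapshots need at most 7 tests instead of up to 98 when stepping through them one by one.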

Further helpful information is of course also welcome, including
corrections of the above if I missed or misunderstood something.


[0]: I have to admit that I forgot to ask Hannes what the official term
for that is in Mercurial speak ;-)
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services
