[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync

Tue Sep 2 14:42:58 CEST 2008

Hannes-

Thanks for the quick reply.

I had some time to run this under strace this morning, and the culprit
appears to be the return value of the times() system call. When the daemon
starts shortly after boot on my box, times() returns values that begin at
about -32768 at bootup and count up towards the overflow at 0. olsrd goes
into its ~40 second "coma" while the value is close to zero; once the
times() return values go positive, all is well.

See the following strace log which I think tells the whole story:

http://www.malknet.org/strace_olsrd_0.5.6_uclibc.log

Towards the end of the file you can see pretty clearly what is going on --
a variety of odd errno values, and possibly a bad pointer or address
being passed into times() for a bit -- but it recovers.

From the manpage for times() it appears one needs to deal w/ the return
value overflowing the range of clock_t -- I guess my clock_t is signed 16
bit - haven't looked:

RETURN VALUE
       times() returns the number of clock ticks that have elapsed since
       an arbitrary point in the past.  For Linux 2.4 and earlier this
       point is the moment the system was booted.  Since Linux 2.6, this
       point is (2^32/HZ) - 300 (i.e., about 429 million) seconds before
       system boot time.  The return value may overflow the possible
       range of type clock_t.  On error, (clock_t) -1 is returned, and
       errno is set appropriately.
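
For illustration, one common way to cope with a tick counter that wraps is
to do the arithmetic on unsigned values, so that the difference of two
readings stays correct across the wrap. This is only a minimal sketch in C,
assuming a 32-bit counter; the helper names ticks_elapsed() and
deadline_reached() are made up for the example and are not taken from
olsrd.

```c
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two readings of a counter
 * that is assumed to be 32 bits wide and to wrap modulo 2^32. Unsigned
 * subtraction is defined modulo 2^32, so the result is correct even when
 * 'now' has wrapped past zero since 'start' was taken. */
static uint32_t ticks_elapsed(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}

/* Hypothetical helper: non-zero once 'now' has reached 'deadline'.
 * Valid as long as the two values are less than 2^31 ticks apart: a
 * difference in the lower half of the range means the deadline has been
 * reached or passed, one in the upper half means it is still ahead. */
static int deadline_reached(uint32_t now, uint32_t deadline)
{
    return (uint32_t)(now - deadline) < UINT32_C(0x80000000);
}
```

With these helpers, a timer armed 100 ticks before the wrap with a 200-tick
timeout is not reported as expired by deadline_reached(), whereas a plain
"now >= deadline" comparison on the same unsigned values would fire
immediately.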

If you'd like, I'd be happy to test the timer processing patch below --
but I have to go work now!!  :(  So I'll try that as soon as I get home
this evening here on the east coast of the US.

Thanks again for looking into this -- I'm thinking we're well on our way
to a fix here.

-Eric

> eric,
>
> would you mind testing the following fix to timer processing.
>
> http://gredler.at/hg/olsrd/rev/ace7c0970eca
>
> /hannes
>
> On Tue, Sep 02, 2008 at 12:13:38AM -0400, Eric Malkowski wrote:
> | One more datapoint -- I ran this same test on the box that does dyn_gw
> | and during the 40 second "coma" of the main OLSR thread, the thread that
> | does the pings keeps going and prints out Pings are OK and that the
> | HNA[0000000.00000000] route is still good every 5 seconds (that works
> | off its own nanosleep() call).
>