[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync

Wed Sep 3 06:31:29 CEST 2008

Hannes-

I've tested the patch http://gredler.at/hg/olsrd/rev/ace7c0970eca w/ my 
two node setup and everything appears to be fine.
The only difference I've perceived is that it seems the links want to 
stay at perfect 1.0 quality / 1.0 ETX more often than they did in the 
past.  I don't know if this has anything to do w/ the timer update below 
(perhaps a timer lapsing by 2.54 seconds led to some "loss" before the 
above patch was in -- you would know better).

I haven't tested extensively (like removing a node and re-adding the 
node / more than 2 nodes).  Hopefully I can try that out tomorrow.

In regards to the times() call and rollover issues -- I've done the 
following work-around for my setup:

- Wrote a small program to check for when times() starts returning 
positive numbers (takes about 4 mins of uptime).
- In the OLSR init script -- I poll w/ my program until times() is 
returning positive numbers and then start OLSRd.
- This guarantees no timers get "lost" or other rogue behavior that 
could result from the rollover - you guys would know better than me, but 
I figure having it start w/ times() being positive for the life of the 
node is the best for stability.
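
The work-around above could be sketched roughly as follows.  This is not 
the program actually used on these nodes, just a minimal illustration in 
C.  The function names tick_count_is_positive() and 
wait_for_positive_times() and the one-second polling interval are 
assumptions made for the example.

```c
/* Ask for POSIX declarations (nanosleep, struct timespec) in strict modes. */
#define _POSIX_C_SOURCE 200809L

#include <sys/times.h>
#include <time.h>

/* Returns 1 if a tick count is strictly positive, 0 otherwise.
 * Note that times() reports an error as (clock_t)-1, which this check
 * also treats as "not positive yet". */
int tick_count_is_positive(clock_t ticks)
{
    return ticks > 0;
}

/* Polls times() about once per second (interval chosen for illustration).
 * Returns 0 as soon as times() is positive, or -1 if it is still not
 * positive after max_polls attempts. */
int wait_for_positive_times(int max_polls)
{
    struct tms unused;                      /* CPU-time fields are not needed */
    struct timespec one_second = { 1, 0 };

    for (int i = 0; i < max_polls; i++) {
        if (tick_count_is_positive(times(&unused)))
            return 0;
        nanosleep(&one_second, NULL);       /* an interrupted sleep just polls sooner */
    }
    return -1;
}
```

An init script could run a tiny main() that calls 
wait_for_positive_times() with a generous limit and starts OLSRd only 
when it returns 0.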

This will get me w/ 100hz kernel up to 500+ days uptime before rollover 
w/ the only side effect being the 4 minute wait at powerup.

I compiled a 250hz kernel and noticed the initial value returned by 
times() changes to match the math the manpage talks about (which would 
give me something like 48 days of uptime, since those numbers are 
positive from the get-go and 490 million ticks or whatever away from 
rollover).
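
For reference, the uptime arithmetic can be sketched as below.  This is 
only an illustration: it assumes a 32-bit tick counter advancing at 100 
ticks per second, and the helper names ticks_to_days() and 
print_rollover_estimates() are made up for the example.  Whether the 
relevant span is 2^32 ticks (full wrap) or 2^31 ticks (positive range of 
a signed value) depends on how the counter is stored and compared.

```c
#include <stdio.h>

/* Convert a number of clock ticks into days at a given tick rate. */
double ticks_to_days(double ticks, double ticks_per_second)
{
    const double seconds_per_day = 86400.0;
    return ticks / ticks_per_second / seconds_per_day;
}

/* Print how long a 32-bit counter lasts at 100 ticks per second. */
void print_rollover_estimates(void)
{
    const double full_range = 4294967296.0;      /* 2^32 ticks */
    const double positive_range = 2147483648.0;  /* 2^31 ticks */

    /* Prints about 497.1 days for the full range. */
    printf("2^32 ticks at 100/s: %.1f days\n",
           ticks_to_days(full_range, 100.0));
    /* Prints about 248.6 days for the positive half. */
    printf("2^31 ticks at 100/s: %.1f days\n",
           ticks_to_days(positive_range, 100.0));
}
```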

One thing I noticed that I think is different than what Bernd had 
mentioned -- w/ 250HZ kernel, times() still runs at 100 ticks per second 
and sysconf(_SC_CLK_TCK) still returned 100.  I also noticed this w/ my 
desktop fedora box running a 2.6.20ish kernel at 1000HZ and times() was 
still 100 ticks per second.  So this 42 second original "coma" we're 
talking about when rollover happens I would think would still be 42 
seconds even on varying HZ kernels unless I'm missing something?  i.e. 
42 seconds on 100Hz and Bernd's thought was 4.2 seconds on 1000HZ.  
Perhaps other platforms are what Bernd was thinking of?  I don't know 
what can influence the return value of sysconf(_SC_CLK_TCK) -- perhaps a 
sysctl setting or kernel config param.
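
A check along the lines of the one described above could look like this. 
It is a minimal sketch, not the exact test that was run; the function 
names reported_ticks_per_second() and measured_ticks_over_one_second() 
are assumptions for the example.  It compares what 
sysconf(_SC_CLK_TCK) reports against how far times() actually advances 
across a one-second sleep.

```c
/* Ask for POSIX declarations (nanosleep, sysconf) in strict modes. */
#define _POSIX_C_SOURCE 200809L

#include <sys/times.h>
#include <time.h>
#include <unistd.h>

/* Tick rate that the system reports for times(), or -1 on error. */
long reported_ticks_per_second(void)
{
    return sysconf(_SC_CLK_TCK);
}

/* How many ticks times() advances across a roughly one-second sleep.
 * The result is approximate: the sleep may run long on a busy system
 * or end early if a signal interrupts it. */
long measured_ticks_over_one_second(void)
{
    struct tms unused;                      /* CPU-time fields are not needed */
    struct timespec one_second = { 1, 0 };

    clock_t start = times(&unused);
    nanosleep(&one_second, NULL);
    clock_t end = times(&unused);

    return (long)(end - start);
}
```

Printing both values on kernels built w/ different HZ settings would 
show whether the rate seen by userspace really changes.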

Anyway -- until you guys get a fix in for the rollover problems, I'll 
run as I've described and keep your patch for the "timers getting 
skipped - the above http link" in my running setup.  If the 4 minutes of 
waiting for OLSR to start really starts to bug me, I can simply rebuild 
my kernel w/ 250HZ -- this will cover me fine for my 3 day outdoor event 
and if someone reboots off watchdog or has some "other" problem, said 
node can chime in w/o waiting the 4 minutes.

If there's anything I can do to help w/ testing patches etc., just say 
the word.

Thanks for looking into this -- seemed to stir up quite a bit of good 
conversation.

-Eric

Hannes Gredler wrote:
> eric,
>
> would you mind testing the following fix to timer processing.
>
> http://gredler.at/hg/olsrd/rev/ace7c0970eca
>
> /hannes
>
> On Tue, Sep 02, 2008 at 12:13:38AM -0400, Eric Malkowski wrote:
> | One more datapoint -- I ran this same test on the box that does dyn_gw 
> | and during the 40 second "coma" of the main OLSR thread, the thread that 
> | does the pings keeps going and prints out Pings are OK and that the 
> | HNA[0000000.00000000] route is still good every 5 seconds (that works 
> | off its own nanosleep() call).
>