[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync
Eric Malkowski
(spam-protected)
Wed Sep 3 06:31:29 CEST 2008
Hannes-
I've tested the patch http://gredler.at/hg/olsrd/rev/ace7c0970eca w/ my
two node setup and everything appears to be fine.
The only difference I've perceived is that it seems the links want to
stay at perfect 1.0 quality / 1.0 ETX more often than they did in the
past. I don't know if this has anything to do w/ the timer update below
(perhaps a timer lapsing by 2.54 seconds led to some "loss" before the
above patch was in -- you would know better).
I haven't tested extensively (like removing a node and re-adding the
node / more than 2 nodes). Hopefully I can try that out tomorrow.
In regards to the times() call and rollover issues -- I've done the
following work-around for my setup:
- Wrote a small program to check for when times() starts returning
positive numbers (takes about 4 mins of uptime).
- In the OLSR init script -- I poll w/ my program until times() is
returning positive numbers and then start OLSRd.
- This guarantees no timers get "lost" or other rogue behavior that
could result from the rollover - you guys would know better than me, but
I figure having it start w/ times() being positive for the life of the
node is the best for stability.
This will get me, w/ a 100Hz kernel, up to 500+ days of uptime before
rollover, w/ the only side effect being the 4 minute wait at powerup.
I compiled a 250Hz kernel and noticed the initial value returned by
times() changes to the math the manpage talks about (which would give me
something like 48 days of uptime, since those numbers are positive from
the get-go and 490 million ticks or whatever away from rollover).
One thing I noticed that I think is different than what Bernd had
mentioned -- w/ 250HZ kernel, times() still runs at 100 ticks per second
and sysconf(_SC_CLK_TCK) still returned 100. I also noticed this w/ my
desktop fedora box running a 2.6.20ish kernel at 1000HZ and times() was
still 100 ticks per second. So this 42 second original "coma" we're
talking about when rollover happens I would think would still be 42
seconds even on varying HZ kernels unless I'm missing something? i.e.
42 seconds on 100Hz and Bernd's thought was 4.2 seconds on 1000HZ.
Perhaps other platforms is what Bernd was thinking of? I don't know
what can influence the return value of sysconf(_SC_CLK_TCK) -- perhaps a
sysctl setting or kernel config param.
Anyway -- until you guys get a fix in for the rollover problems, I'll
run as I've described and keep your patch for the "timers getting
skipped - the above http link" in my running setup. If the 4 minutes of
waiting for OLSR to start really starts to bug me, I can simply rebuild
my kernel w/ 250HZ -- this will cover me fine for my 3 day outdoor event
and if someone reboots off watchdog or has some "other" problem, said
node can chime in w/o waiting the 4 minutes.
If there's anything I can do to help w/ testing patches etc., just say
the word.
Thanks for looking into this -- seemed to stir up quite a bit of good
conversation.
-Eric
Hannes Gredler wrote:
> eric,
>
> would you mind testing the following fix to timer processing.
>
> http://gredler.at/hg/olsrd/rev/ace7c0970eca
>
> /hannes
>
> On Tue, Sep 02, 2008 at 12:13:38AM -0400, Eric Malkowski wrote:
> | One more datapoint -- I ran this same test on the box that does dyn_gw
> | and during the 40 second "coma" of the main OLSR thread, the thread that
> | does the pings keeps going and prints out Pings are OK and that the
> | HNA[0000000.00000000] route is still good every 5 seconds (that works
> | off its own nanosleep() call).
>