[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync

Eric Malkowski (spam-protected)
Wed Sep 3 06:47:46 CEST 2008

As I re-read this I realize I forgot to mention my motivation for 
keeping a 100Hz kernel: it was for older WRAP and Soekris net4826 
hardware, which I figured would be using the PIT timer, and I didn't 
want too much timer overhead.  But alas, w/ the 2.6 kernels I've been 
using lately, both WRAP and Soekris have support for the Geode GX 
27MHz clock as a hi-res timer source, they use the TSC just fine, and 
if I remember correctly they can run "tickless".  So I suppose I 
could just go w/ 250Hz and avoid the 4 minute wait on all of my 
hardware.

One other quick question, totally off topic -- I may want to limit 
bandwidth for some of my AP subnets in my setup, and even though I've 
been doing a lot w/ linux / networking etc. over the years, I've 
never had to limit bandwidth.  Can you guys recommend your favorite 
mechanism the kernel has for doing this (and the userspace tools to 
configure it)?  I'm thinking iptables could probably use a match rule 
to match the subnet traffic that needs to be limited and jump to a 
target that takes the packets through a queue of some type that adds 
the necessary delay / lag.  The idea is to slow low-priority traffic 
down before it goes out an internet gateway so our 3-4 Mbps backhaul 
doesn't get flooded.  I'm not too concerned about traffic w/in the 
small mesh, since I'm getting 20 megabits of throughput on the 54 
Mbps wireless links involved (at least for TCP downloads from one of 
my web servers right in my house).  Sorry to go off topic, but the 
"bandwidth limiting" howto I always find when googling seems to be 
way outdated, and I think Alan Cox's original "traffic shaper" device 
driver is gone.  I just don't really know what the latest / most 
robust mechanism is to accomplish this.  Any ideas appreciated, or 
feel free to ignore.
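
From what I can tell, the mechanism that replaced the old shaper 
device is the kernel's traffic control layer, configured w/ tc from 
iproute2.  A sketch of the kind of HTB setup I have in mind -- the 
interface name eth0, the rates, and the 10.1.2.0/24 subnet are all 
made-up placeholders, so correct me if this is the wrong approach:

```shell
# Sketch only: shape egress on the WAN-facing interface (eth0 here).
# Root HTB qdisc; unclassified traffic falls into class 1:20.
tc qdisc add dev eth0 root handle 1: htb default 20

# Parent class at roughly the backhaul rate (3 Mbps).
tc class add dev eth0 parent 1: classid 1:1 htb rate 3mbit

# Low-priority class: guaranteed 512 kbit, may borrow up to 3 Mbps.
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 512kbit ceil 3mbit

# Default class for everything else.
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 2mbit ceil 3mbit

# Steer the AP subnet's traffic into the limited class.
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
    match ip src 10.1.2.0/24 flowid 1:10
```

That would sidestep iptables entirely, though I gather an iptables 
MARK target plus a tc fw filter works too if the match rules get 
more complicated.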

On one last note -- did you have a moment to look at the patch in my 
post just before this big thread, about the "interval" parameter not 
working for the dyn_gw plugin?  No rush on that one -- it's trivial, 
and most people don't seem to change the default 5 second ping interval.

Anyway -- thanks again.


Eric Malkowski wrote:
> Hannes-
> I've tested the patch http://gredler.at/hg/olsrd/rev/ace7c0970eca w/ 
> my two node setup and everything appears to be fine.
> The only difference I've perceived is that it seems the links want to 
> stay at perfect 1.0 quality / 1.0 ETX more often than they did in the 
> past.  I don't know if this has anything to do w/ the timer update 
> below (perhaps a timer lapsing by 2.54 seconds led to some "loss" 
> before the above patch was in -- you would know better).
> I haven't tested extensively (like removing a node and re-adding 
> the node / more than 2 nodes).  Hopefully I can try that out tomorrow.
> In regards to the times() call and rollover issues -- I've done the 
> following work-around for my setup:
> - Wrote a small program that checks for when times() starts 
> returning positive numbers (takes about 4 mins of uptime).
> - In the OLSR init script -- I poll w/ my program until times() is 
> returning positive numbers and then start OLSRd.
> - This guarantees no timers get "lost" or other rogue behavior that 
> could result from the rollover - you guys would know better than me, 
> but I figure having it start w/ times() being positive for the life of 
> the node is the best for stability.
> This will get me, w/ a 100Hz kernel, up to 500+ days of uptime 
> before rollover, w/ the only side effect being the 4 minute wait at 
> powerup.
> I compiled a 250Hz kernel and noticed the initial value returned by 
> times() changes per the math the manpage talks about (which would 
> give me something like 48 days of uptime, since those numbers are 
> positive from the get-go and 490 million ticks or whatever away 
> from rollover).
> One thing I noticed that I think is different than what Bernd had 
> mentioned -- w/ 250HZ kernel, times() still runs at 100 ticks per 
> second and sysconf(_SC_CLK_TCK) still returned 100.  I also noticed 
> this w/ my desktop fedora box running a 2.6.20ish kernel at 1000HZ and 
> times() was still 100 ticks per second.  So the original 42 second 
> "coma" we're talking about when rollover happens would, I'd think, 
> still be 42 seconds even on varying HZ kernels, unless I'm missing 
> something?  i.e. 42 seconds on 100Hz, whereas Bernd's thought was 
> 4.2 seconds on 1000HZ.  Perhaps Bernd was thinking of other 
> platforms?  I don't know what can influence the return value of 
> sysconf(_SC_CLK_TCK) -- perhaps a sysctl setting or kernel config 
> param.
> Anyway -- until you guys get a fix in for the rollover problems, I'll 
> run as I've described and keep your patch for the "timers getting 
> skipped - the above http link" in my running setup.  If the 4 minutes 
> of waiting for OLSR to start really starts to bug me, I can simply 
> rebuild my kernel w/ 250HZ -- this will cover me fine for my 3 day 
> outdoor event and if someone reboots off watchdog or has some "other" 
> problem, said node can chime in w/o waiting the 4 minutes.
> If there's anything I can do to help w/ testing patches etc., just say 
> the word.
> Thanks for looking into this -- seemed to stir up quite a bit of good 
> conversation.
> -Eric
> Hannes Gredler wrote:
>> eric,
>> would you mind testing the following fix to timer processing.
>> http://gredler.at/hg/olsrd/rev/ace7c0970eca
>> /hannes
>> On Tue, Sep 02, 2008 at 12:13:38AM -0400, Eric Malkowski wrote:
>> | One more datapoint -- I ran this same test on the box that does
>> | dyn_gw and during the 40 second "coma" of the main OLSR thread,
>> | the thread that does the pings keeps going and prints out Pings
>> | are OK and that the HNA[0000000.00000000] route is still good
>> | every 5 seconds (that works off its own nanosleep() call).
