[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync

Tue Sep 2 06:13:38 CEST 2008

One more datapoint -- I ran this same test on the box that does dyn_gw 
and during the 40 second "coma" of the main OLSR thread, the thread that 
does the pings keeps going and prints out Pings are OK and that the 
HNA[0000000.00000000] route is still good every 5 seconds (that works 
off its own nanosleep() call).
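
The ping thread's independent 5 second cadence suggests one way to tell a 
blocked main loop apart from a stepped system clock: compare wall-clock and 
monotonic elapsed time on every tick.  What follows is a minimal Python 
sketch of that idea, not olsrd code; classify_tick, watch and the slack 
tolerance are hypothetical names chosen for illustration.

```python
import time


def classify_tick(wall_delta, mono_delta, interval, slack=1.0):
    """Classify one loop iteration that was meant to take `interval` seconds.

    wall_delta: elapsed time according to the wall clock (time.time()).
    mono_delta: elapsed time according to the monotonic clock.
    slack:      tolerance in seconds (an assumed value, tune as needed).
    """
    if abs(wall_delta - mono_delta) > slack:
        # The two clocks disagree, so the wall clock was stepped.
        return "clock-step"
    if mono_delta > interval + slack:
        # Both clocks agree the tick was late, so the loop was blocked.
        return "stall"
    return "ok"


def watch(interval=5.0, ticks=3, sleep=time.sleep):
    """Sleep `ticks` times and classify each wake-up."""
    results = []
    wall, mono = time.time(), time.monotonic()
    for _ in range(ticks):
        sleep(interval)
        new_wall, new_mono = time.time(), time.monotonic()
        results.append(classify_tick(new_wall - wall, new_mono - mono, interval))
        wall, mono = new_wall, new_mono
    return results
```

A 40 second hang with no clock change would show up as "stall", while an 
NTP step from 2000 to the current date would show up as "clock-step".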

So this could be a bug in OLSR timers that is exacerbated by the 
environment I'm running in...  I'm doubtful of that and will spend some 
more time tomorrow on it.  So far it appears I can safely ignore the 
problem since it does this just once at 4 minutes uptime on my box and 
never again.
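
To check that the gap really occurs only once, a debug log captured to a 
file can be scanned for gaps between consecutive [ENC] timestamp values.  A 
minimal sketch in Python, assuming the log contains lines in the form quoted 
below ("[ENC] timestamp: 946703476"); find_stalls and the 10 second default 
threshold are made up for illustration and are not part of olsrd.

```python
import re

# Matches both spellings quoted in this thread:
# "[ENC] timestamp: 946703476" and "[ENC]timestamp: 946703570".
ENC_TS = re.compile(r"\[ENC\]\s*timestamp:\s*(\d+)")


def find_stalls(lines, threshold=10):
    """Return (previous, current, gap) for each gap >= threshold seconds."""
    stalls = []
    prev = None
    for line in lines:
        match = ENC_TS.search(line)
        if match is None:
            continue  # not a timestamp line, e.g. TIMER output
        ts = int(match.group(1))
        if prev is not None and ts - prev >= threshold:
            stalls.append((prev, ts, ts - prev))
        prev = ts
    return stalls
```

Calling find_stalls(open("olsrd-debug.log")) (a hypothetical file name) 
would list every gap as the timestamp before it, the timestamp after it, 
and the length of the gap in seconds.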

Eric Malkowski wrote:
> I did some more investigation to find that after about 4 mins of 
> uptime even a single box w/ OLSR running, no dyn_gw plugin (so single 
> threaded), if I run it in the foreground with debug level 9 it simply 
> hangs for about 40 seconds.
>
> There are several "TIMER: fire unknown timer 0x8XXXXXX, ctx (nil) at 
> clocktick 1193:02:47.2" messages along w/ [ENC] timestamp: 946703476 
> and [ENC] Message signed messages.  All of this just ceases for 40 
> seconds, and then picks up and keeps going.  It's like the process gets 
> blocked on something -- then it's apparently fine after that.
>
> There's no jump in time on this box -- it's not doing any time syncing 
> or anything of the sort.
> In fact - when it starts going again after the hang, the 
> [ENC]timestamp: 946703570 last message is 40 seconds ahead like it 
> should be.
> Other processes on the box appear to be fine while olsr is stuck.
>
> My only guess is possibly the kernel does something w/ timers or clock 
> etc. 4 minutes after bootup.
>
> This kernel is latest 2.6.26.3 with the ALIX MFGPT (general purpose 
> timer hardware from 5536 chip) enabled and hi-res timer support turned 
> on -- also it's set for "tickless" operation.  I can build up fairly 
> easily a kernel w/ traditional PIT timer, no hi-res support and no 
> MFGPT support just to see if that makes a difference.
>
> Another thing I can try is to run it inside strace to see what system 
> call the process last made, to get a clue of what's going on.  I didn't 
> run "top" to see if it was "spinning" in a loop on the CPU when "stuck" 
> and not outputting anything -- I'll check for that.  I may also try 
> running it in the debugger and hitting CTRL-C while it's "stuck", and 
> also try sending it a signal while stuck.
>
> After the box has been up and this has happened once, I can CTRL-C and 
> start OLSR again and no problem whatsoever...  so it could be my 
> environment (probably IS my environment) and not a real problem w/ OLSR.
>
> Any insight would be appreciated, but it appears I'll be on my own w/ 
> this one.
>
> Eric Malkowski wrote:
>> Hi all-
>>
>> Forgive me if this is already known - I'm new to OLSR and searched 
>> around a bit and didn't find anything that sounded like what I'm seeing.
>>
>> I'm using the latest OLSR 0.5.6 with the link quality extension 
>> default config olsrd.conf.default.lq along with secure plugin and 
>> dyn_gw on one of the nodes for an "internet" default route where he's 
>> doing NAT firewall for the mesh.
>>
>> The hardware for my test is two ALIX 3c2 boards with Ubiquiti XR5 in 
>> adhoc mode (madwifi 0.9.4.1, linux kernel 2.6.26.3 on a uclibc 0.9.29 
>> buildroot setup) and an Engenius EMP 8602+S for 2.4ghz AP mode as an 
>> HNA network for each mesh node.  The one running dyn_gw has static 
>> default route out wired network and dyn_gw injects the 0.0.0.0 HNA 
>> just fine for me.  The httpinfo plugin is really cool too by the way.
>>
>> So it's:  ALIX #1 w/ dyn_gw --> 5ghz adhoc subnet --> ALIX #2, no dyn_gw
>> And both have 2.4 ghz HNA networks for "AP services".
>>
>> These boards have no clock batteries so when I power them up 
>> simultaneously, their clocks are in the best sync they can be (but 
>> January 1, 2000 obviously).  After about 4 minutes of uptime, both 
>> OLSRs delete all of the routes, then they come back in a minute or so.
>>
>> I've found that killing them off and running w/ debug level 2 in the 
>> foreground w/ TCP dump going on both 5 ghz links watching packets 
>> going out, everything works fine -- no problem.  This is after the 
>> "bootup" disappearing routes.  My theory is that the clocks are 
>> already drifted apart enough since I saw a lot of chatter early in 
>> this 2nd run w/ debug 2 in the foreground complaining about clock 
>> skew.  They seem to "get it out of their system" early on and then 
>> happily work together.
>>
>> I reproduced the problem w/ debug level 2 on by powering up both 
>> boxes simultaneously, and hastily getting OLSR and tcpdumps going via 
>> SSH sessions -- everything comes up and builds the mesh just fine.  I 
>> can see perfect 1.0 ETX and link qualities where the DIJKSTRA / LINKS 
>> / TOPO output doesn't happen (LQ doesn't run -- not needed since 
>> stuff is perfect is what I assume) and then I see a bit of that 
>> output where ETX and link qualities are very close to 1.0  (0.99X 
>> link, 1.005 ETX for instance).  Then all of a sudden at about 4 
>> minutes into the run, they both get upset about clock skew, both stop 
>> outputting packets altogether on the 5ghz link, and no output from 
>> OLSR at all -- I thought they were hung on TTY stuff or something for 
>> a moment w/ exception of 5 second dyn_gw pings from the dyn_gw ping 
>> thread.  Then after about 30 seconds, they were all happy again and 
>> everything is perfect since.  There were some clock skew complaints 
>> (I'm fairly certain), but I'm pretty sure the clearscreens had wiped 
>> them when each appeared hung and stopped putting out packets on the 5 
>> ghz.  I also saw for a moment an ETX of INFINITY on one of them.  
>> Also after the 4 minute route bounce I see "Received timestamp 
>> 946705387 diff: 2" messages I wasn't seeing (I'm pretty sure) in the 
>> first run after boot and it appears I'll have timestamp diff 
>> complaints forever at this point.
>>
>> So I think what happens is after 4 minutes on my hardware when 
>> powered up simultaneously, the clocks drift enough that OLSRs get 
>> upset about skew, idle for 30 seconds (or do some other odd thing I 
>> was hoping you guys could explain) and come back.
>>
>> If I spread out the boot time to create skew at OLSR startup (haven't 
>> tried that w/ no foreground run by just deliberately powering up one, 
>> waiting a few seconds, and powering up the other) -- they get it out 
>> of the way and there's never an interruption.
>>
>> Am I going to be OK w/ this w/ additional nodes in my mesh? (I've got 
>> a total of 5 we plan to use for a temporary outdoor mesh at the Head 
>> of the Charles Regatta in Boston coming up in October).  We may use 
>> even more if we need the coverage -- I like the approach of "add 
>> nodes" to "mesh in more reliability".
>>
>> Should I have my boxes sync w/ NTP as soon as possible and deal with 
>> the resulting route disappear / reappear that might happen depending 
>> upon internet state when everything is initially brought up (clock 
>> will jump from 2000 to current time - I really don't want to solder 
>> batteries onto the boards)?
>>
>> Am I on the right track?
>>
>> I can attach logs (w/o clearscreens!) and my actual configs in use if 
>> that helps.
>>
>> What I'd like to do is just have each node chime into the mesh w/ 
>> obvious big clock skew (since at the event they will not be able to 
>> be powered up all simultaneously), have the skew get worked out as 
>> each node comes in and just leave it.  I don't really care about the 
>> clocks on these nodes -- they run w/ just a kernel and ramdisk booted 
>> from flash and the flash is only mounted read-only so I don't have to 
>> worry about filesystem damage -- they just get power cycled and logs 
>> and such in ramdisk are gone -- I don't really care about timestamps 
>> in that data.  Hoping OLSR will run fine like this.  Any advice would 
>> be appreciated.
>>
>> Also -- as a first timer w/ OLSR who comes from a traditional "AP / 
>> Station backhaul" setup (dating from when ad-hoc wasn't working in 
>> MADWIFI, 2005 timeframe) with OSPF managing all of the routed AP / 
>> backhaul subnets everywhere, using OLSR to do a mesh is really nice 
>> and better -- 
>> especially with the link quality extension -- looks fantastic -- very 
>> nice work.  I'm excited to try out the real outdoor setup w/ more 
>> nodes as long as I don't have any real problems to deal with on the 
>> bench.
>>
>> Thanks,
>>
>> -- 
>> Eric Malkowski
>> BVWireless, LLC
>> Northbridge, MA  USA
>>
>>