[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync

Tue Sep 2 05:27:08 CEST 2008

I did some more investigation and found that after about 4 mins of uptime, 
even a single box running OLSR with no dyn_gw plugin (so single 
threaded), run in the foreground with debug level 9, simply 
hangs for about 40 seconds.

There are several "TIMER: fire unknown timer 0x8XXXXXX, ctx (nil) at 
clocktick 1193:02:47.2" messages along w/ [ENC] timestamp: 946703476 and 
[ENC] Message signed messages.  All of this just ceases for 40 seconds, 
and then picks up and keeps going.  It's like the process gets blocked on 
something -- then it's apparently fine after that.
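
One thing worth checking about the clocktick value in those messages: read as hours:minutes:seconds, 1193:02:47.2 is only about 96 ms short of 2^32 milliseconds. The short Python sketch below just does that arithmetic. Whether olsrd's clocktick really is a 32-bit millisecond counter is an assumption here, not something confirmed from the source, and `clocktick_to_ms` is a helper name made up for this check.

```python
# Convert an olsrd-style "H:MM:SS.f" clocktick string to milliseconds and
# compare it with 2**32 ms. ASSUMPTION: the field layout is hours:min:sec.

def clocktick_to_ms(text):
    hours, minutes, seconds = text.split(":")
    whole, _, frac = seconds.partition(".")
    frac_ms = int((frac + "000")[:3])  # pad/truncate fraction to milliseconds
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 1000 + frac_ms

WRAP_MS = 2 ** 32  # where an unsigned 32-bit millisecond counter rolls over

tick_ms = clocktick_to_ms("1193:02:47.2")
print(tick_ms, WRAP_MS - tick_ms)
```

If that is not a coincidence, a counter rolling over a few minutes after boot might fit a stall that happens once and does not come back after a restart, but that remains a guess.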

There's no jump in time on this box -- it's not doing any time syncing 
or anything of the sort.
In fact, when it starts going again after the hang, the latest 
"[ENC] timestamp: 946703570" message is 40 seconds ahead, as it should be.
Other processes on the box appear to be fine while olsr is stuck.
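
To pin down the length of the stall from a saved debug log rather than by eye, the gaps between successive [ENC] timestamp values can be computed. A minimal Python sketch, assuming the log lines look like the "[ENC] timestamp: N" messages quoted above. The helper name `find_gaps`, the 10-second threshold, and the sample value 946703530 are invented for illustration (only 946703570 comes from the actual output).

```python
import re

# Matches "[ENC] timestamp: 946703476", with or without a space after "]".
ENC_RE = re.compile(r"\[ENC\]\s*timestamp:\s*(\d+)")

def find_gaps(lines, threshold=10):
    """Return (previous, current, delta) for each jump larger than threshold seconds."""
    gaps, prev = [], None
    for line in lines:
        match = ENC_RE.search(line)
        if not match:
            continue  # ignore lines that carry no [ENC] timestamp
        stamp = int(match.group(1))
        if prev is not None and stamp - prev > threshold:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps

# Sample input: 946703530 is a made-up "last value before the hang".
sample = [
    "[ENC] timestamp: 946703530",
    "[ENC] Message signed",
    "[ENC]timestamp: 946703570",
]
print(find_gaps(sample))
```

Run against a real log, each tuple would mark a stretch where the timestamped messages stopped for longer than the threshold.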

My only guess is possibly the kernel does something w/ timers or clock 
etc. 4 minutes after bootup.

This kernel is latest 2.6.26.3 with the ALIX MFGPT (general purpose 
timer hardware from 5536 chip) enabled and hi-res timer support turned 
on -- also it's set for "tickless" operation.  I can build up fairly 
easily a kernel w/ traditional PIT timer, no hi-res support and no MFGPT 
support just to see if that makes a difference.

Another thing I can try is to run it under strace to see what system call 
the process last made, to get a clue about what's going on.  I didn't run 
"top" to see if it was "spinning" in a loop on the CPU when "stuck" and not 
outputting anything -- I'll check for that.  I may also try running it 
in the debugger and hitting CTRL-C while it's "stuck", and also try sending 
it a signal while stuck.
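
For the "spinning or blocked" question, on Linux the process state and CPU counters can also be read from /proc/<pid>/stat without attaching anything to the process. A rough Python sketch, assuming the field layout documented in proc(5) (state is field 3, utime and stime are fields 14 and 15, in clock ticks). `parse_stat` is a hypothetical helper and the sample line is fabricated, not taken from the ALIX box.

```python
def parse_stat(stat_line):
    """Extract pid, state, utime and stime from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST ")" before splitting on whitespace.
    head, _, tail = stat_line.rpartition(")")
    fields = tail.split()          # fields[0] is field 3 (state)
    return {
        "pid": int(head.split("(", 1)[0]),
        "state": fields[0],        # "R" running, "S" sleeping, "D" uninterruptible wait
        "utime": int(fields[11]),  # field 14: user-mode ticks
        "stime": int(fields[12]),  # field 15: kernel-mode ticks
    }

# Fabricated example line (only the first 15 fields are shown).
sample = "1234 (olsrd) S 1 1234 1234 0 -1 4194560 120 0 0 0 57 31"
print(parse_stat(sample))
```

Sampling this twice a few seconds apart during the stall should help: state "R" with utime/stime climbing would point at a busy loop, while "S" or "D" with flat counters would point at a blocked call.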

After the box has been up and this has happened once, I can CTRL-C and 
start OLSR again and no problem whatsoever...  so it could be my 
environment (probably IS my environment) and not a real problem w/ OLSR.

Any insight would be appreciated, but it appears I'll be on my own w/ 
this one.

Eric Malkowski wrote:
> Hi all-
>
> Forgive me if this is already known - I'm new to OLSR and searched 
> around a bit and didn't find anything that sounded like what I'm seeing.
>
> I'm using the latest OLSR 0.5.6 with the link quality extension 
> default config olsrd.conf.default.lq along with secure plugin and 
> dyn_gw on one of the nodes for an "internet" default route where he's 
> doing NAT firewall for the mesh.
>
> The hardware for my test is two ALIX 3c2 boards with Ubiquiti XR5 in 
> adhoc mode (madwifi 0.9.4.1, linux kernel 2.6.26.3 on a uclibc 0.9.29 
> buildroot setup) and an Engenius EMP 8602+S for 2.4ghz AP mode as an 
> HNA network for each mesh node.  The one running dyn_gw has static 
> default route out wired network and dyn_gw injects the 0.0.0.0 HNA 
> just fine for me.  The httpinfo plugin is really cool too by the way.
>
> So it's:  ALIX #1 w/ dyn_gw --> 5ghz adhoc subnet --> ALIX #2, no dyn_gw
> And both have 2.4 ghz HNA networks for "AP services".
>
> These boards have no clock batteries so when I power them up 
> simultaneously, their clocks are in the best sync they can be (but 
> January 1, 2000 obviously).  After about 4 minutes of uptime, both 
> OLSRs delete all of the routes, then they come back in a minute or so.
>
> I've found that killing them off and running w/ debug level 2 in the 
> foreground w/ tcpdump going on both 5 ghz links watching packets 
> going out, everything works fine -- no problem.  This is after the 
> "bootup" disappearing routes.  My theory is that the clocks are 
> already drifted apart enough since I saw a lot of chatter early in 
> this 2nd run w/ debug 2 in the foreground complaining about clock 
> skew.  They seem to "get it out of their system" early on and then 
> happily work together.
>
> I reproduced the problem w/ debug level 2 on by powering up both boxes 
> simultaneously, and hastily getting OLSR and tcpdumps going via SSH 
> sessions -- everything comes up and builds the mesh just fine.  I can 
> see perfect 1.0 ETX and link qualities where the DIJKSTRA / LINKS / 
> TOPO output doesn't happen (LQ doesn't run -- not needed since stuff 
> is perfect is what I assume) and then I see a bit of that output where 
> ETX and link qualities are very close to 1.0  (0.99X link, 1.005 ETX 
> for instance).  Then all of a sudden at about 4 minutes into the run, 
> they both get upset about clock skew, both stop outputting packets 
> altogether on the 5ghz link, and no output from OLSR at all -- I thought 
> they were hung on TTY stuff or something for a moment w/ exception of 
> 5 second dyn_gw pings from the dyn_gw ping thread.  Then after about 
> 30 seconds, they were all happy again and everything has been perfect 
> since.  There were some clock skew complaints (I'm fairly certain), 
> but I'm pretty sure the clearscreens had wiped them when each appeared 
> hung and stopped putting out packets on the 5 ghz.  I also saw for a 
> moment an ETX of INFINITY on one of them.  Also after the 4 minute 
> route bounce I see "Received timestamp 946705387 diff: 2" messages I 
> wasn't seeing (I'm pretty sure) in the first run after boot and it 
> appears I'll have timestamp diff complaints forever at this point.
>
> So I think what happens is after 4 minutes on my hardware when powered 
> up simultaneously, the clocks drift enough that OLSRs get upset about 
> skew, idle for 30 seconds (or do some other odd thing I was hoping you 
> guys could explain) and come back.
>
> If I spread out the boot time to create skew at OLSR startup (haven't 
> tried that w/ no foreground run by just deliberately powering up one, 
> waiting a few seconds, and powering up the other) -- they get it out 
> of the way and there's never an interruption.
>
> Am I going to be OK w/ this w/ additional nodes in my mesh? (I've got 
> a total of 5 we plan to use for a temporary outdoor mesh at the Head 
> of the Charles Regatta in Boston coming up in October).  We may use 
> even more if we need the coverage -- I like the approach of "add 
> nodes" to "mesh in more reliability".
>
> Should I have my boxes sync w/ NTP as soon as possible and deal with 
> the resulting route disappear / reappear that might happen depending 
> upon internet state when everything is initially brought up (clock 
> will jump from 2000 to current time - I really don't want to solder 
> batteries onto the boards)?
>
> Am I on the right track?
>
> I can attach logs (w/o clearscreens!) and my actual configs in use if 
> that helps.
>
> What I'd like to do is just have each node chime into the mesh w/ 
> obvious big clock skew (since at the event they will not be able to be 
> powered up all simultaneously), have the skew get worked out as each 
> node comes in and just leave it.  I don't really care about the clocks 
> on these nodes -- they run w/ just a kernel and ramdisk booted from 
> flash and the flash is only mounted read-only so I don't have to worry 
> about filesystem damage -- they just get power cycled and logs and 
> such in ramdisk are gone -- I don't really care about timestamps in 
> that data.  Hoping OLSR will run fine like this.  Any advice would be 
> appreciated.
>
> Also -- as a first timer w/ OLSR, coming from a traditional "AP / Station 
> backhaul" setup (from back when Ad-hoc wasn't working in MADWIFI, 2005 
> timeframe) with OSPF to manage all of the routed AP / backhaul 
> subnets everywhere, using OLSR to do a mesh is really nice / and Better! -- 
> especially with the link quality extension -- looks fantastic -- very 
> nice work.  I'm excited to try out the real outdoor setup w/ more 
> nodes as long as I don't have any real problems to deal with on the 
> bench.
>
> Thanks,
>
> -- 
> Eric Malkowski
> BVWireless, LLC
> Northbridge, MA  USA
>
>