[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync
Eric Malkowski
(spam-protected)
Tue Sep 2 06:13:38 CEST 2008
One more datapoint -- I ran this same test on the box that does dyn_gw,
and during the 40 second "coma" of the main OLSR thread, the thread that
does the pings keeps going and prints out Pings are OK and that the
HNA[0000000.00000000] route is still good every 5 seconds (that works
off its own nanosleep() call).
So this could be a bug in OLSR timers that is exacerbated by the
environment I'm running in... I'm doubtful of that and will spend some
more time tomorrow on it. So far it appears I can safely ignore the
problem since it does this just once at 4 minutes uptime on my box and
never again.
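
To tell a process stall apart from a wall-clock jump, one option is to compare
the wall clock against a monotonic clock on every pass of a loop. The following
is only a minimal sketch of that idea, not anything taken from OLSR itself; the
names classify_gap and watch and the 1 second tolerance are my own assumptions.

```python
import time


def classify_gap(wall_delta, mono_delta, expected, tolerance=1.0):
    """Classify one loop pass from its wall-clock and monotonic deltas (seconds).

    expected is the intended sleep interval; tolerance is an assumed slack.
    """
    # The monotonic clock cannot be set, so a long monotonic delta means the
    # loop really was away for that long (blocked, stopped, or starved).
    stalled = mono_delta > expected + tolerance
    # If the wall clock moved by a different amount than the monotonic clock,
    # somebody stepped the wall clock (NTP step, date command, and so on).
    jumped = abs(wall_delta - mono_delta) > tolerance
    if stalled and jumped:
        return "stall+clock-jump"
    if stalled:
        return "stall"
    if jumped:
        return "clock-jump"
    return "ok"


def watch(interval=1.0, iterations=5):
    """Sleep in a loop and classify each pass; returns the list of verdicts."""
    last_wall, last_mono = time.time(), time.monotonic()
    verdicts = []
    for _ in range(iterations):
        time.sleep(interval)
        wall, mono = time.time(), time.monotonic()
        verdicts.append(classify_gap(wall - last_wall, mono - last_mono, interval))
        last_wall, last_mono = wall, mono
    return verdicts
```

A 40 second hang with a steady clock would come out as "stall", while a clock
stepped from 2000 to the current date would come out as "clock-jump".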
Eric Malkowski wrote:
> I did some more investigation to find that after about 4 mins of
> uptime even a single box w/ OLSR running, no dyn_gw plugin (so single
> threaded), if I run it in the foreground with debug level 9 it simply
> hangs for about 40 seconds.
>
> There are several "TIMER: fire unknown timer 0x8XXXXXX, ctx (nil) at
> clocktick 1193:02:47.2" messages along w/ [ENC] timestamp: 946703476
> and [ENC] Message signed messages. All of this just ceases for 40
> seconds, and then picks up and keeps going. It's like the process gets
> blocked on something -- then it's apparently fine after that.
>
> There's no jump in time on this box -- it's not doing any time syncing
> or anything of the sort.
> In fact - when it starts going again after the hang, the
> [ENC]timestamp: 946703570 last message is 40 seconds ahead like it
> should be.
> Other processes on the box appear to be fine while olsr is stuck.
>
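
For reference, the [ENC] timestamp values above look like plain Unix epoch
seconds (that is my assumption, I have not checked the secure plugin source).
A small sketch to decode them, with decode being just a helper name I made up:

```python
from datetime import datetime, timezone


def decode(ts):
    """Format Unix epoch seconds as a UTC date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


# The two timestamps quoted from the debug output.
for ts in (946703476, 946703570):
    print(ts, "->", decode(ts))
# 946703476 -> 2000-01-01 05:11:16 UTC
# 946703570 -> 2000-01-01 05:12:50 UTC
```

Both land on January 1, 2000, which fits boards with no clock battery.
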
> My only guess is possibly the kernel does something w/ timers or clock
> etc. 4 minutes after bootup.
>
> This kernel is latest 2.6.26.3 with the ALIX MFGPT (general purpose
> timer hardware from 5536 chip) enabled and hi-res timer support turned
> on -- also it's set for "tickless" operation. I can build up fairly
> easily a kernel w/ traditional PIT timer, no hi-res support and no
> MFGPT support just to see if that makes a difference.
>
> Another thing I can try is to run it inside strace to see what system
> call the process last made, to get a clue of what's going on. I didn't
> run "top" to see if it was "spinning" in a loop on the CPU when "stuck"
> and not outputting anything -- I'll check for that. I may also try
> running it in the debugger and hitting CTRL-C while it's "stuck", and
> also try sending it a signal while stuck.
>
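
As an alternative to watching "top", the spinning-or-blocked question can be
answered by sampling /proc/<pid>/stat twice and comparing CPU time. This is a
rough sketch under my own assumptions: the function names and the 50 tick
threshold are made up, and the field positions follow the proc(5) man page.

```python
def parse_proc_stat(line):
    """Return (state, utime_ticks, stime_ticks) from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split on the last ')'.
    rest = line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return rest[0], int(rest[11]), int(rest[12])


def looks_like_spinning(before, after, min_ticks=50):
    """True if user+system CPU time grew by at least min_ticks between samples.

    Ticks are in units of os.sysconf("SC_CLK_TCK") per second; min_ticks is an
    assumed threshold to tune for the sampling interval in use.
    """
    used = (after[1] + after[2]) - (before[1] + before[2])
    return used >= min_ticks


def sample(pid):
    """Read and parse the stat line for a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_stat(f.read())
```

Taking sample(pid) at the start of the hang and again a few seconds later, a
state of "R" with growing CPU time would point at a busy loop, while "S" or "D"
with flat CPU time would point at the process being blocked.
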
> After the box has been up and this has happened once, I can CTRL-C and
> start OLSR again and no problem whatsoever... so it could be my
> environment (probably IS my environment) and not a real problem w/ OLSR.
>
> Any insight would be appreciated, but it appears I'll be on my own w/
> this one.
>
> Eric Malkowski wrote:
>> Hi all-
>>
>> Forgive me if this is already known - I'm new to OLSR and searched
>> around a bit and didn't find anything that sounded like what I'm seeing.
>>
>> I'm using the latest OLSR 0.5.6 with the link quality extension
>> default config olsrd.conf.default.lq along with secure plugin and
>> dyn_gw on one of the nodes for an "internet" default route where he's
>> doing NAT firewall for the mesh.
>>
>> The hardware for my test is two ALIX 3c2 boards with Ubiquiti XR5 in
>> adhoc mode (madwifi 0.9.4.1, linux kernel 2.6.26.3 on a uclibc 0.9.29
>> buildroot setup) and an Engenius EMP 8602+S for 2.4ghz AP mode as an
>> HNA network for each mesh node. The one running dyn_gw has static
>> default route out wired network and dyn_gw injects the 0.0.0.0 HNA
>> just fine for me. The httpinfo plugin is really cool too by the way.
>>
>> So it's: ALIX #1 w/ dyn_gw --> 5ghz adhoc subnet --> ALIX #2, no dyn_gw
>> And both have 2.4 ghz HNA networks for "AP services".
>>
>> These boards have no clock batteries so when I power them up
>> simultaneously, their clocks are in the best sync they can be (but
>> January 1, 2000 obviously). After about 4 minutes of uptime, both
>> OLSRs delete all of the routes, then they come back in a minute or so.
>>
>> I've found that killing them off and running w/ debug level 2 in the
>> foreground w/ tcpdump going on both 5 ghz links watching packets
>> going out, everything works fine -- no problem. This is after the
>> "bootup" disappearing routes. My theory is that the clocks are
>> already drifted apart enough since I saw a lot of chatter early in
>> this 2nd run w/ debug 2 in the foreground complaining about clock
>> skew. They seem to "get it out of their system" early on and then
>> happily work together.
>>
>> I reproduced the problem w/ debug level 2 on by powering up both
>> boxes simultaneously, and hastily getting OLSR and tcpdumps going via
>> SSH sessions -- everything comes up and builds the mesh just fine. I
>> can see perfect 1.0 ETX and link qualities where the DIJKSTRA / LINKS
>> / TOPO output doesn't happen (LQ doesn't run -- not needed since
>> stuff is perfect is what I assume) and then I see a bit of that
>> output where ETX and link qualities are very close to 1.0 (0.99X
>> link, 1.005 ETX for instance). Then all of a sudden at about 4
>> minutes into the run, they both get upset about clock skew, both stop
>> outputting packets altogether on the 5ghz link, and no output from
>> OLSR at all -- I thought they were hung on TTY stuff or something for
>> a moment w/ exception of 5 second dyn_gw pings from the dyn_gw ping
>> thread. Then after about 30 seconds, they were all happy again and
>> everything has been perfect since. There were some clock skew complaints
>> (I'm fairly certain), but I'm pretty sure the clearscreens had wiped
>> them when each appeared hung and stopped putting out packets on the 5
>> ghz. I also saw for a moment an ETX of INFINITY on one of them.
>> Also after the 4 minute route bounce I see "Received timestamp
>> 946705387 diff: 2" messages I wasn't seeing (I'm pretty sure) in the
>> first run after boot and it appears I'll have timestamp diff
>> complaints forever at this point.
>>
>> So I think what happens is after 4 minutes on my hardware when
>> powered up simultaneously, the clocks drift enough that OLSRs get
>> upset about skew, idle for 30 seconds (or do some other odd thing I
>> was hoping you guys could explain) and come back.
>>
>> If I spread out the boot time to create skew at OLSR startup (haven't
>> tried that w/ no foreground run by just deliberately powering up one,
>> waiting a few seconds, and powering up the other) -- they get it out
>> of the way and there's never an interruption.
>>
>> Am I going to be OK w/ this w/ additional nodes in my mesh? (I've got
>> a total of 5 we plan to use for a temporary outdoor mesh at the Head
>> of the Charles Regatta in Boston coming up in October). We may use
>> even more if we need the coverage -- I like the approach of "add
>> nodes" to "mesh in more reliability".
>>
>> Should I have my boxes sync w/ NTP as soon as possible and deal with
>> the resulting route disappear / reappear that might happen depending
>> upon internet state when everything is initially brought up (clock
>> will jump from 2000 to current time - I really don't want to solder
>> batteries onto the boards)?
>>
>> Am I on the right track?
>>
>> I can attach logs (w/o clearscreens!) and my actual configs in use if
>> that helps.
>>
>> What I'd like to do is just have each node chime into the mesh w/
>> obvious big clock skew (since at the event they will not be able to
>> be powered up all simultaneously), have the skew get worked out as
>> each node comes in and just leave it. I don't really care about the
>> clocks on these nodes -- they run w/ just a kernel and ramdisk booted
>> from flash and the flash is only mounted read-only so I don't have to
>> worry about filesystem damage -- they just get power cycled and logs
>> and such in ramdisk are gone -- I don't really care about timestamps
>> in that data. Hoping OLSR will run fine like this. Any advice would
>> be appreciated.
>>
>> Also -- being a first timer w/ OLSR, and coming from a traditional
>> "AP / Station backhaul" setup (from back when ad-hoc was not working
>> in MADWIFI, 2005 timeframe) with OSPF to manage all of the routed AP /
>> backhaul subnets everywhere, OLSR to do a mesh is really nice, and better! --
>> especially with the link quality extension -- looks fantastic -- very
>> nice work. I'm excited to try out the real outdoor setup w/ more
>> nodes as long as I don't have any real problems to deal with on the
>> bench.
>>
>> Thanks,
>>
>> --
>> Eric Malkowski
>> BVWireless, LLC
>> Northbridge, MA USA
>>
>>