[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync

Eric Malkowski (spam-protected)
Tue Sep 2 04:16:53 CEST 2008


Hi all-

Forgive me if this is already known - I'm new to OLSR and searched 
around a bit and didn't find anything that sounded like what I'm seeing.

I'm using the latest OLSR 0.5.6 with the link quality extension default 
config olsrd.conf.default.lq along with the secure plugin, plus dyn_gw on 
one of the nodes for an "internet" default route, where that node does 
NAT / firewall for the mesh.

The hardware for my test is two ALIX 3c2 boards with Ubiquiti XR5 in 
adhoc mode (madwifi 0.9.4.1, linux kernel 2.6.26.3 on a uclibc 0.9.29 
buildroot setup) and an Engenius EMP 8602+S for 2.4ghz AP mode as an HNA 
network for each mesh node.  The one running dyn_gw has a static default 
route out the wired network and dyn_gw injects the 0.0.0.0 HNA just fine 
for me.  The httpinfo plugin is really cool too by the way.

So it's:  ALIX #1 w/ dyn_gw --> 5ghz adhoc subnet --> ALIX #2, no dyn_gw
And both have 2.4 ghz HNA networks for "AP services".

These boards have no clock batteries so when I power them up 
simultaneously, their clocks are in the best sync they can be (but set to 
January 1, 2000 obviously).  After about 4 minutes of uptime, both OLSRs 
delete all of the routes, then they come back in a minute or so.

I've found that killing them off and running w/ debug level 2 in the 
foreground w/ TCP dump going on both 5 ghz links watching packets going 
out, everything works fine -- no problem.  This is after the "bootup" 
disappearing routes.  My theory is that the clocks are already drifted 
apart enough since I saw a lot of chatter early in this 2nd run w/ debug 
2 in the foreground complaining about clock skew.  They seem to "get it 
out of their system" early on and then happily work together.

I reproduced the problem w/ debug level 2 on by powering up both boxes 
simultaneously, and hastily getting OLSR and tcpdumps going via SSH 
sessions -- everything comes up and builds the mesh just fine.  I can 
see perfect 1.0 ETX and link qualities where the DIJKSTRA / LINKS / TOPO 
output doesn't happen (LQ doesn't run -- not needed since stuff is 
perfect is what I assume) and then I see a bit of that output where ETX 
and link qualities are very close to 1.0  (0.99X link, 1.005 ETX for 
instance).  Then all of a sudden at about 4 minutes into the run, they 
both get upset about clock skew, both stop outputting packets altogether 
on the 5ghz link, and there's no output from OLSR at all -- I thought 
they were hung on TTY stuff or something for a moment w/ exception of 5 
second dyn_gw pings from the dyn_gw ping thread.  Then after about 30 
seconds, they were all happy again and everything has been perfect since.  
There were some clock skew complaints (I'm fairly certain), but I'm 
pretty sure the clearscreens had wiped them when each appeared hung and 
stopped putting out packets on the 5 ghz.  I also saw for a moment an 
ETX of INFINITY on one of them.  Also after the 4 minute route bounce I 
see "Received timestamp 946705387 diff: 2" messages I wasn't seeing (I'm 
pretty sure) in the first run after boot and it appears I'll have 
timestamp diff complaints forever at this point.
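
To put the number in that message in context, here is a minimal sketch 
that decodes it.  It assumes the value is Unix epoch seconds (UTC), which 
fits boards that boot with their clocks at January 1, 2000; the variable 
names are just for illustration.

```python
from datetime import datetime, timezone

# Timestamp value copied from the "Received timestamp ... diff: 2" message.
received_ts = 946705387

# Assumption: the value is seconds since the Unix epoch, in UTC.
as_utc = datetime.fromtimestamp(received_ts, tz=timezone.utc)

# The boards have no clock battery, so they start at 2000-01-01 00:00:00.
epoch_2000 = datetime(2000, 1, 1, tzinfo=timezone.utc)

print(as_utc.isoformat())      # wall-clock reading on the sending node
print(as_utc - epoch_2000)     # time elapsed since the year-2000 default
```

Under that assumption the timestamp is a few hours past midnight on 
January 1, 2000, i.e. roughly the sending node's uptime since power-on.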

So I think what happens is after 4 minutes on my hardware when powered 
up simultaneously, the clocks drift enough that OLSRs get upset about 
skew, idle for 30 seconds (or do some other odd thing I was hoping you 
guys could explain) and come back.
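
The drift theory above can be sketched as a toy model.  This is NOT the 
secure plugin's actual code: the slack value, the `Neighbor` record and 
both function names are hypothetical, and the drift rate in the usage 
lines is picked only to make the effect visible.  It just shows how a 
receiver that records a clock offset once and then tolerates a fixed 
slack would start rejecting timestamps after the clocks drift apart.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tolerance in whole seconds; not taken from the olsrd source.
SLACK_SECONDS = 1


@dataclass
class Neighbor:
    # Neighbor clock minus local clock, recorded once when the link came up.
    offset: int


def accept(neighbor: Neighbor, received_ts: int, local_now: int,
           slack: int = SLACK_SECONDS) -> bool:
    """Accept a timestamp only if it is within `slack` of the expected value."""
    expected = local_now + neighbor.offset
    return abs(received_ts - expected) <= slack


def first_rejection(drift_ppm: int, offset_at_start: int = 0,
                    slack: int = SLACK_SECONDS,
                    horizon: int = 3600) -> Optional[int]:
    """Return the local second at which the drifting neighbor is first
    rejected, or None if that never happens within `horizon` seconds."""
    neighbor = Neighbor(offset=offset_at_start)
    for t in range(horizon):
        # Remote clock runs fast by drift_ppm parts per million
        # (integer arithmetic, timestamps are whole seconds).
        remote = t + (t * drift_ppm) // 1_000_000 + offset_at_start
        if not accept(neighbor, remote, t, slack):
            return t
    return None


# A large but constant offset is harmless once it has been recorded...
print(first_rejection(drift_ppm=0, offset_at_start=30))
# ...while accumulated drift eventually exceeds the slack.
print(first_rejection(drift_ppm=10_000))
```

In this model a big skew at startup does no harm because it is absorbed 
into the recorded offset; only drift after that point trips the check.  
Whether the real plugin behaves this way is exactly the open question.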

If I spread out the boot time to create skew at OLSR startup (haven't 
tried that w/ no foreground run by just deliberately powering up one, 
waiting a few seconds, and powering up the other) -- they get it out of 
the way and there's never an interruption.

Am I going to be OK w/ this w/ additional nodes in my mesh? (I've got a 
total of 5 we plan to use for a temporary outdoor mesh at the Head of 
the Charles Regatta in Boston coming up in October).  We may use even 
more if we need the coverage -- I like the approach of "add nodes" to 
"mesh in more reliability".

Should I have my boxes sync w/ NTP as soon as possible and deal with the 
resulting route disappear / reappear that might happen depending upon 
internet state when everything is initially brought up (clock will jump 
from 2000 to current time - I really don't want to solder batteries onto 
the boards)?

Am I on the right track?

I can attach logs (w/o clearscreens!) and my actual configs in use if 
that helps.

What I'd like to do is just have each node chime into the mesh w/ 
obvious big clock skew (since at the event they will not be able to be 
powered up all simultaneously), have the skew get worked out as each 
node comes in and just leave it.  I don't really care about the clocks 
on these nodes -- they run w/ just a kernel and ramdisk booted from 
flash and the flash is only mounted read-only so I don't have to worry 
about filesystem damage -- they just get power cycled and logs and such 
in ramdisk are gone -- I don't really care about timestamps in that 
data.  Hoping OLSR will run fine like this.  Any advice would be 
appreciated.

Also -- as a first timer w/ OLSR, coming from a traditional "AP / Station 
backhaul" setup (from back when Ad-hoc wasn't working in MADWIFI, 2005 
timeframe) with OSPF managing all of the routed AP / backhaul subnets 
everywhere, using OLSR to do a mesh is really nice and better -- 
especially with the link quality extension -- looks fantastic -- very 
nice work.  I'm 
excited to try out the real outdoor setup w/ more nodes as long as I 
don't have any real problems to deal with on the bench.

Thanks,

--
Eric Malkowski
BVWireless, LLC
Northbridge, MA  USA




