[Olsr-users] 0.5.6 Routes disappear after 4 mins of uptime, then all OK - suspect clock sync
Eric Malkowski
Tue Sep 2 04:16:53 CEST 2008
Hi all-
Forgive me if this is already known - I'm new to OLSR and searched
around a bit and didn't find anything that sounded like what I'm seeing.
I'm using the latest OLSR 0.5.6 with the link quality extension default
config olsrd.conf.default.lq along with secure plugin and dyn_gw on one
of the nodes for an "internet" default route, where that node does NAT /
firewall for the mesh.
The hardware for my test is two ALIX 3c2 boards with Ubiquiti XR5 in
adhoc mode (madwifi 0.9.4.1, linux kernel 2.6.26.3 on a uclibc 0.9.29
buildroot setup) and an Engenius EMP 8602+S for 2.4ghz AP mode as an HNA
network for each mesh node. The one running dyn_gw has static default
route out wired network and dyn_gw injects the 0.0.0.0 HNA just fine for
me. The httpinfo plugin is really cool too by the way.
So it's: ALIX #1 w/ dyn_gw --> 5ghz adhoc subnet --> ALIX #2, no dyn_gw
And both have 2.4 ghz HNA networks for "AP services".
These boards have no clock batteries so when I power them up
simultaneously, their clocks are in the best sync they can be (but
January 1, 2000 obviously). After about 4 minutes of uptime, both OLSRs
delete all of the routes, then they come back in a minute or so.
I've found that killing them off and running w/ debug level 2 in the
foreground w/ tcpdump going on both 5 ghz links watching packets going
out, everything works fine -- no problem. This is after the "bootup"
disappearing routes. My theory is that the clocks are already drifted
apart enough since I saw a lot of chatter early in this 2nd run w/ debug
2 in the foreground complaining about clock skew. They seem to "get it
out of their system" early on and then happily work together.
I reproduced the problem w/ debug level 2 on by powering up both boxes
simultaneously, and hastily getting OLSR and tcpdumps going via SSH
sessions -- everything comes up and builds the mesh just fine. I can
see perfect 1.0 ETX and link qualities where the DIJKSTRA / LINKS / TOPO
output doesn't happen (LQ doesn't run -- not needed since stuff is
perfect is what I assume) and then I see a bit of that output where ETX
and link qualities are very close to 1.0 (0.99X link, 1.005 ETX for
instance). Then all of a sudden at about 4 minutes into the run, they
both get upset about clock skew, both stop outputting packets
altogether on the 5ghz link, and no output from OLSR at all -- I thought
they were hung on TTY stuff or something for a moment, w/ the exception of
the 5 second dyn_gw pings from the dyn_gw ping thread. Then after about 30
seconds, they were all happy again and everything has been perfect since.
There were some clock skew complaints (I'm fairly certain), but I'm
pretty sure the clearscreens had wiped them when each appeared hung and
stopped putting out packets on the 5 ghz. I also saw for a moment an
ETX of INFINITY on one of them. Also after the 4 minute route bounce I
see "Received timestamp 946705387 diff: 2" messages I wasn't seeing (I'm
pretty sure) in the first run after boot and it appears I'll have
timestamp diff complaints forever at this point.
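To put numbers on the ETX values mentioned above, here is a minimal sketch assuming the usual definition used with the link quality extension, ETX = 1 / (LQ * NLQ). The function name and the 0.995 figure are just illustrative (I only noted "0.99X" at the time), and treating a zero product as infinite is my assumption about how INFINITY comes about:

```python
import math


def etx(lq: float, nlq: float) -> float:
    """Expected transmission count for a link, assuming ETX = 1 / (LQ * NLQ).

    lq  -- fraction of the neighbor's packets that we receive (0.0 to 1.0)
    nlq -- fraction of our packets that the neighbor receives (0.0 to 1.0)
    """
    product = lq * nlq
    if product <= 0.0:
        # Assumption: nothing gets through in at least one direction,
        # so the link cost is treated as infinite.
        return math.inf
    return 1.0 / product


print(round(etx(1.0, 1.0), 3))    # 1.0   (perfect link)
print(round(etx(0.995, 1.0), 3))  # 1.005 (one direction slightly lossy)
print(etx(0.0, 1.0))              # inf   (no packets received at all)
```

So a 0.99X link quality with a 1.005 ETX is what a nearly perfect link should look like, and an ETX of INFINITY would fit a stretch where one side heard nothing from the other.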
So I think what happens is after 4 minutes on my hardware when powered
up simultaneously, the clocks drift enough that OLSRs get upset about
skew, idle for 30 seconds (or do some other odd thing I was hoping you
guys could explain) and come back.
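As a quick sanity check on the clock theory, the value in the "Received timestamp" message can be decoded. This is only a sketch and assumes the number is plain Unix epoch seconds, which I have not confirmed against the plugin source; the helper name is made up for illustration:

```python
from datetime import datetime, timezone


def decode_epoch(seconds: int) -> datetime:
    """Interpret an integer as Unix epoch seconds (assumption) in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


stamp = 946705387  # value from the "Received timestamp ... diff: 2" message
print(decode_epoch(stamp).isoformat())  # 2000-01-01T05:43:07+00:00

# Seconds elapsed since midnight UTC on January 1, 2000
y2k = datetime(2000, 1, 1, tzinfo=timezone.utc)
print(int((decode_epoch(stamp) - y2k).total_seconds()))  # 20587
```

Under that assumption the stamp falls on January 1, 2000, which matches boards that boot with no battery-backed clock.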
If I spread out the boot time to create skew at OLSR startup, they get it
out of the way and there's never an interruption. (I haven't tried that
w/o a foreground run, i.e. by just deliberately powering up one, waiting
a few seconds, and powering up the other.)
Am I going to be OK w/ this w/ additional nodes in my mesh? (I've got a
total of 5 we plan to use for a temporary outdoor mesh at the Head of
the Charles Regatta in Boston coming up in October). We may use even
more if we need the coverage -- I like the approach of "add nodes" to
"mesh in more reliability".
Should I have my boxes sync w/ NTP as soon as possible and deal with the
resulting route disappear / reappear that might happen depending upon
internet state when everything is initially brought up (clock will jump
from 2000 to current time - I really don't want to solder batteries onto
the boards)?
Am I on the right track?
I can attach logs (w/o clearscreens!) and my actual configs in use if
that helps.
What I'd like to do is just have each node chime into the mesh w/
obvious big clock skew (since at the event they will not be able to be
powered up all simultaneously), have the skew get worked out as each
node comes in and just leave it. I don't really care about the clocks
on these nodes -- they run w/ just a kernel and ramdisk booted from
flash and the flash is only mounted read-only so I don't have to worry
about filesystem damage -- they just get power cycled and logs and such
in ramdisk are gone -- I don't really care about timestamps in that
data. Hoping OLSR will run fine like this. Any advice would be
appreciated.
Also -- being a first timer w/ OLSR and coming from a traditional "AP /
Station backhaul" setup (from the 2005 timeframe, when ad-hoc wasn't
working in MADWIFI) with OSPF to manage all of the routed AP / backhaul
subnets everywhere, using OLSR to do a mesh is really nice and better! --
especially with the link quality extension -- looks fantastic -- very nice work. I'm
excited to try out the real outdoor setup w/ more nodes as long as I
don't have any real problems to deal with on the bench.
Thanks,
--
Eric Malkowski
BVWireless, LLC
Northbridge, MA USA