[Olsr-dev] segfault in -r3 reproduced

Mon Dec 8 14:19:04 CET 2008

On Mon, Dec 8, 2008 at 7:51 AM, Sven-Ola Tuecke <(spam-protected)> wrote:

> Markus,
>
> the olsr.org routing daemon has issues with ifup/down as well as with any
> combination of changing IP, netmask, MTU, queue-config since the beginning
> of olsrd. For example: the daemon and the kernel both maintain a table of
> routing entries. If some braindead admin or script executes "ip l set dev X

ok, this script is a bit braindead, and may look very braindead if you
don't know the background for using it, but ... (see below)

>
> down;ip l set dev X up", the kernel routes are removed while not being

as you point out, this should only result in missing kernel routes, but in
reality you also get wrong kernel routes ...
.
this is reproducible with a braindead ifdownup script,
but it can also be found occasionally on routers where nobody logged in and
olsr runs only on quite normal interfaces, e.g. routers running
freifunkfirmware (#)
# where nobody ever logged into the shell, and just some wireless settings
were changed and ip4broadcast was configured on the webif.
imho it's far less braindead to use such a script to get olsrd to reproduce
the effects below than to waste time on reading some parts of your email ...

1. missing routes (not the biggest problem; it has to be expected, as they
get deleted from the kernel)
but in fact olsr usually does quite well at inserting them again
2. wrong routes: yes, you end up with routes going out on the wrong
interface, and they stay in the kernel quite long.
mostly these are intermediate routes created while one interface is down,
which are never "moved" back to the "correct" interface after it is up again

3. crashing olsr (if you run an ifupdown script too long against an 0.5.6
olsrd)
this was not the goal of this test script, as it prevents testing of the
above effects, but it revealed bugs in code that is only executed after
netlink errors on route updates.

> removed in olsrd.

> Same if you manually fiddle with routing entries
> (especially with the default route). This is what I denote "well known".
>
> With this in mind, you may remove your admin password from devices to
> prevent
> yourself from doing things in an uncontrolled way. As an alternative,
> replace
> the "ip" and "ifconfig" commands with a version that restarts olsrd.

doing so will still crash olsrd regularly on about 10 routers in vienna,
once some braindead device owner updates their olsrd (or complete firmware)
from 0.5.5, which currently runs there (at least without crashing).

Why will this happen?
openvpn runs on these routers, and its devices go up/down dynamically.
(why openvpn is configured this way is another (long and sad olsr-related)
story, but you may also have dynamic wds interfaces you want olsr to run on,
or whatever)

so i didn't invent this braindead test script yesterday just for fun; it's
just a way to reproduce the cause of the olsrd 0.5.6 crashes on these routers
(and of 0.5.5 and 0.5.6 producing wrong routes)

>
>
> Yes - the situation is worse in 0.5.6, because 0.5.5 has handled this
> better

ACK

>
> (but not perfectly to my knowledge).

but good enough to run for weeks without crashing on our vpn-server
.
in fact i cannot remember it ever crashing; only very seldom were there
some wrong/missing routes.
but this happened as seldom as it does on other routers with a similar
number of interfaces and routing changes that do not have "dynamic"
interfaces.
.
0.5.6 may crash within minutes/seconds there, and with much luck it runs
there for some hours, which is unacceptable, so we still have 0.5.5 there

>
> Why? Because there is no easy fix with the current stable code IMO. A
> reasonable fix includes changing critical code paths - which should not
> happen in a stable branch.

imo this branch may not be called stable (-;
(see at the end why)

> Because it will induce new bugs - bugging the
> majority which do not change their ifconfig every now and then. Something
> like this should take place in the development branch IMO.

partly ACK, as i also plan better handling for this and related problems.
at the moment i have a (quite small) patch to make olsr use its own proto
tag in the kernel table,
preventing olsrd from deleting routes it never made, and enabling it to
flush its own routes (or those of a previously crashed instance) safely (at
startup/shutdown)
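
(A minimal sketch of the proto-tag idea, not the actual patch. It models the
kernel table as a plain list instead of talking to rtnetlink. The three
RTPROT_* constants are the values from linux/rtnetlink.h; RTPROT_OLSR_EXAMPLE,
Route and both helper functions are hypothetical names, and the number 100 is
an assumption chosen only for illustration.)

```python
from dataclasses import dataclass

# Protocol tags as defined in linux/rtnetlink.h.
RTPROT_KERNEL = 2   # route installed by the kernel
RTPROT_BOOT = 3     # route installed during boot
RTPROT_STATIC = 4   # route installed by the administrator

# Hypothetical tag for olsrd; the value is an assumption for this sketch.
RTPROT_OLSR_EXAMPLE = 100


@dataclass(frozen=True)
class Route:
    dst: str      # destination prefix, e.g. "10.0.1.0/24"
    gateway: str  # next hop, "" for on-link routes
    dev: str      # outgoing interface
    proto: int    # protocol tag stored with the route


def own_routes(table, own_proto=RTPROT_OLSR_EXAMPLE):
    """Routes that carry our tag, i.e. the only ones we may delete."""
    return [r for r in table if r.proto == own_proto]


def flush_own_routes(table, own_proto=RTPROT_OLSR_EXAMPLE):
    """Table as it would look after flushing only our tagged routes."""
    return [r for r in table if r.proto != own_proto]


kernel_table = [
    Route("0.0.0.0/0", "192.168.1.1", "eth0", RTPROT_STATIC),
    Route("192.168.1.0/24", "", "eth0", RTPROT_KERNEL),
    Route("10.0.1.0/24", "10.0.0.2", "wlan0", RTPROT_OLSR_EXAMPLE),
    Route("10.0.2.0/24", "10.0.0.3", "tun0", RTPROT_OLSR_EXAMPLE),
]

# Leftovers of a crashed instance are found by tag alone; the admin's
# default route and the kernel's connected route are never touched.
stale = own_routes(kernel_table)
remaining = flush_own_routes(kernel_table)
```

Filtering on the tag alone is what makes the startup flush safe: no state
from the crashed instance is needed to know which routes were its own.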
.
the bigger goal is to get rid of a well known olsr problem: inconsistent olsr
and kernel routing tables, which unfortunately happen quite often (even on
devices no admin ever logs into) and which are the main reason for
permanent routing problems in our network (which tend to stay for
hours/days)
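
(A rough sketch of what detecting such an inconsistency could look like,
assuming both tables are available as mappings from destination prefix to
(gateway, device). The function name reconcile and the sample addresses are
made up for illustration; this is not code from olsrd.)

```python
def reconcile(desired, kernel):
    """Compare olsrd's view with the kernel's view of olsrd-owned routes.

    Both arguments map destination prefix -> (gateway, device).
    Returns three dicts: routes to add, to replace, and to delete.
    """
    # Missing routes: olsrd wants them, the kernel lost them.
    to_add = {d: v for d, v in desired.items() if d not in kernel}
    # Wrong routes: present in the kernel, but via another gateway/device.
    to_replace = {
        d: v for d, v in desired.items() if d in kernel and kernel[d] != v
    }
    # Stale routes: in the kernel, but olsrd no longer wants them.
    to_delete = {d: v for d, v in kernel.items() if d not in desired}
    return to_add, to_replace, to_delete


desired = {
    "10.0.1.0/24": ("10.0.0.2", "wlan0"),
    "10.0.2.0/24": ("10.0.0.3", "tun0"),
}
kernel = {
    # Still points at the interface used while tun0 was down.
    "10.0.2.0/24": ("10.0.0.2", "wlan0"),
    "10.0.9.0/24": ("10.0.0.2", "wlan0"),
}

to_add, to_replace, to_delete = reconcile(desired, kernel)
```

Here to_replace holds exactly the kind of route described above: one that
was created while an interface was down and never moved back.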
.
i think the key to the fix lies in handling rtnetlink errors properly, and
reacting appropriately to the "destination unreachable" and "interface is
down" replies you get.
maybe we should also consider letting the kernel (via rtnetlink messages)
inform olsrd about external route updates.
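
(A hedged sketch of the first idea: decoding an rtnetlink error reply and
deciding what to do with it. The header layout and NLMSG_ERROR value follow
linux/netlink.h; the function names and the chosen reactions are assumptions
for illustration, not olsrd behaviour.)

```python
import errno
import struct

NLMSG_ERROR = 0x2  # message type from linux/netlink.h

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSGHDR = struct.Struct("=IHHII")
# struct nlmsgerr starts with a signed int: 0 for an ACK, else -errno.
NLMSGERR_CODE = struct.Struct("=i")


def parse_netlink_error(data):
    """Return (seq, errno) for an NLMSG_ERROR message, else None."""
    if len(data) < NLMSGHDR.size + NLMSGERR_CODE.size:
        return None
    _length, msg_type, _flags, seq, _pid = NLMSGHDR.unpack_from(data, 0)
    if msg_type != NLMSG_ERROR:
        return None
    (code,) = NLMSGERR_CODE.unpack_from(data, NLMSGHDR.size)
    return seq, -code


def action_for(err):
    """Map an errno to a reaction; the policy itself is an assumption."""
    if err == 0:
        return "ok"
    if err == errno.EEXIST:
        return "already-present"   # add failed, route is already there
    if err == errno.ESRCH:
        return "already-gone"      # delete failed, route no longer exists
    if err in (errno.ENETDOWN, errno.ENETUNREACH):
        return "retry-later"       # interface down or next hop unreachable
    return "log-and-drop"


# Hand-built reply to request number 7: "Network is down".
reply = NLMSGHDR.pack(20, NLMSG_ERROR, 0, 7, 0) + NLMSGERR_CODE.pack(
    -errno.ENETDOWN
)
seq, err = parse_netlink_error(reply)
decision = action_for(err)
```

The sequence number ties the error back to the route update that caused it,
so the daemon can keep that route marked as not installed instead of
assuming it reached the kernel. The second idea, being told about external
route changes, would mean listening on rtnetlink's route multicast groups
and is not shown here.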
.
BUT: i still wish olsr 0.5.6 "stable" releases to run stable on routers
where 0.5.5 did run stable !!
i hope this "wish" is reasonable,..
.
Markus