[olsr-dev] Question about implementation internals

Sun Jan 16 12:17:11 CET 2005

> Your question is an interesting one, and I'll have to look more
> into the issue before deciding if it is something that should
> be changed.
> Quick answer to your bottom line question: yes and no. Message
> generation is fired off by the scheduler, which is just an endless loop
> that sleeps for a certain poll rate at the end of every run. However, to
> make the sleep time more accurate, a timestamp is fetched at the start and
> stop of every scheduler poll and the (stop - start) interval is
> subtracted from the poll interval before calling nanosleep. So if the
> system time was changed while the scheduler was polling this could have
> a certain effect, but I guess what decides if this could have any real
> effect is whether the timestamp values are signed or not. But still -
> fixing this should be a minor issue.

Sorry, what does "signed" mean here?
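
My guess, and it is only a guess, is that you mean whether (stop - start)
can come out negative when the clock is moved back during a poll. A minimal
sketch of that arithmetic follows; the function names and the microsecond
units are my own assumptions and are not taken from the olsrd sources:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: microseconds left to sleep after one scheduler poll.
 * start_us/stop_us are the timestamps taken around the poll. */
int64_t remaining_sleep_us(int64_t start_us, int64_t stop_us, int64_t poll_us)
{
    int64_t elapsed = stop_us - start_us;   /* signed: may be negative */

    assert(poll_us >= 0);
    if (elapsed < 0)         /* system time was moved back during the poll */
        elapsed = 0;
    if (elapsed > poll_us)   /* poll overran, or system time moved forward */
        elapsed = poll_us;
    return poll_us - elapsed;
}

/* The same subtraction on unsigned values: a backwards step wraps around
 * to a huge number instead of going negative. */
uint64_t elapsed_unsigned_us(uint64_t start_us, uint64_t stop_us)
{
    return stop_us - start_us;
}
```

With signed values a backwards step is at least detectable and can be
clamped; with unsigned values it silently turns into an enormous interval.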

> But when it comes to timeouts of registered entries there can be
> problems. The entries in the various tables (linkset, topology etc.)
> need to be very dynamic, i.e. they have to be polled for timeouts
> very rapidly. Therefore, to avoid CPU hogging they use timestamps
> that are compared to current time to decide validity. A change in
> system time could AFAIK cause real trouble here.

Correct me if I'm wrong: for a certain type of table, you insert new records
at the end, meaning that they are always sorted and the first one is the one
with the earliest expiration (assuming every record in the table gets the
same lifetime). Then, for each check, you just have to start at the first
entry and proceed removing entries until you find one whose expiration time
is later than the current time.
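
To make sure we are talking about the same thing, here is a rough sketch of
the scheme I have in mind. The types and function names are hypothetical
(olsrd's real tables surely look different), and it only works if every
record in a table gets the same lifetime:

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical record type, for illustration only. */
struct entry {
    time_t expires;          /* absolute expiration time */
    struct entry *next;
};

struct table {
    struct entry *head;      /* earliest expiration */
    struct entry *tail;      /* latest expiration */
};

/* Append a record. The list stays sorted only because every record gets
 * the same lifetime, so a later insertion always expires later. */
int table_add(struct table *t, time_t now, time_t lifetime)
{
    struct entry *e;

    assert(t != NULL);
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->expires = now + lifetime;
    e->next = NULL;
    if (t->tail != NULL)
        t->tail->next = e;
    else
        t->head = e;
    t->tail = e;
    return 0;
}

/* Drop expired records from the front and stop at the first live one.
 * Returns the number of records removed. */
int table_expire(struct table *t, time_t now)
{
    int removed = 0;

    assert(t != NULL);
    while (t->head != NULL && t->head->expires <= now) {
        struct entry *dead = t->head;
        t->head = dead->next;
        if (t->head == NULL)
            t->tail = NULL;
        free(dead);
        removed++;
    }
    return removed;
}
```

Note that `now` is compared directly against stored timestamps, which is
exactly the comparison that goes wrong when the system time is changed.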

> One could of course consider using kernel uptime counters like
> polling /proc/uptime or checking the return value of times etc. on
> Linux - but then again that would break cross platform design.
> I'll have to do some poking around in the POSIX system calls... anyway,
> I'll put it on my to-check-out list :)

I'm no POSIX guru, so please correct me where I'm wrong:

When not using pthreads, the timing mechanisms provided by Linux are sleep,
nanosleep, select, pselect, poll, and gettimeofday. Apart from gettimeofday,
all of these calls take a relative timeout (a lapse) as parameter, which is
preferable, since taking an absolute system time as parameter would lead to
race conditions upon system time changes. But of course, to get a current
time reference, all you have is gettimeofday, which returns system time. So
if you use gettimeofday as a starting point to calculate timeout lapses, you
are, in fact, getting the same race conditions. If you are not using
calculations but comparisons (as OLSRD does according to your description),
the problem is even worse, since you are almost guaranteed to be in trouble
upon a system time change, as opposed to a race condition, which is supposed
to happen very seldom. (Now that I think about it, race conditions are in
fact WORSE, since they are harder to reproduce and thus to chase down.)

The funny thing is that all this would be solved if only there was a system
call that would return some sort of MONOTONIC clock, such as the uptime you
mention. By the way, /proc/uptime is not only non-portable, but also lacks
the resolution needed for certain applications.

Another thing to note is that some behaviours cannot be implemented using
those calls. For example, say you want a task executed every 10 seconds. You
don't care if execution time varies by +-1 second, but you want each
execution to happen within its time window, that is, execution N must happen
sometime between 10*N-1 and 10*N+1 seconds after the start.

You need a call that takes absolute time to do that. If you use a call that
takes a timeout lapse, the task execution time and loop overhead, no matter
how small they are, will keep adding up, and by the time you are at iteration
10000000, this accumulated error may be way larger than the +-1 second
window. Of course you can try to measure the task execution time and loop
overhead and correct the timeout lapse, but then you'll be using gettimeofday
and introducing race conditions when system time changes, and even if you
ignore that, there is something you'll never be able to measure: the
execution time of gettimeofday itself. And that will add up again to cause
the same problem eventually.
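
To put numbers on the accumulation, here is a small simulation sketch. The
function names are hypothetical and the times are plain milliseconds from
whatever clock you have. It also shows one variation that, as far as I can
see, stops the error from adding up: keep the absolute target (start +
N * period) inside the program and derive each lapse from it. This still
depends on reading a clock that does not jump, so it does not remove the
system time problem.

```c
#include <assert.h>
#include <stdint.h>

/* Lapse (ms) to sleep before run number n, derived from an absolute
 * target instead of from the previous wake-up. */
int64_t delay_fixed_schedule(int64_t start_ms, int64_t now_ms,
                             int64_t period_ms, int64_t n)
{
    int64_t deadline = start_ms + n * period_ms;   /* absolute target */
    int64_t delay = deadline - now_ms;

    return delay > 0 ? delay : 0;    /* already late: run at once */
}

/* Simulate `runs` iterations in which every iteration costs overhead_ms
 * on top of the sleep, and return how late the last run starts.
 * With fixed_schedule == 0 the loop always sleeps one full period. */
int64_t simulated_lateness(int64_t period_ms, int64_t overhead_ms,
                           int64_t runs, int fixed_schedule)
{
    int64_t now = 0;

    assert(period_ms > 0 && overhead_ms >= 0 && runs > 0);
    for (int64_t n = 1; n <= runs; n++) {
        now += overhead_ms;          /* task execution + loop overhead */
        if (fixed_schedule)
            now += delay_fixed_schedule(0, now, period_ms, n);
        else
            now += period_ms;
    }
    return now - runs * period_ms;
}
```

With a 10 second period and 1 ms of overhead per run, the plain loop is
already a full second late after 1000 runs, while the fixed schedule keeps
each run on its target as long as the overhead is shorter than the period.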

So here comes pthreads. You have pthread_cond_timedwait, which takes an
absolute time. Is everything solved? No. In the first place, it takes time
as SYSTEM time, thus introducing race conditions again. For example, say you
want to wait for 1 second for a condition to happen:

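1- gettimeofday(&now, NULL);               /* now: struct timeval          */
2- T.tv_sec  = now.tv_sec + 1;             /* T: struct timespec, absolute */
   T.tv_nsec = now.tv_usec * 1000;
3- pthread_cond_timedwait(&cond, &mutex, &T);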

If someone moves the system time back one hour between steps 1 and 3, you'll
end up waiting for one hour and one second. And moving back the system time
happens here at least once a year (daylight savings).
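
Plugging numbers into the three steps above makes the effect concrete. This
is only the arithmetic, not a real wait; the function name is made up and I
assume the wait simply lasts until the system clock reaches T:

```c
#include <assert.h>
#include <time.h>

/* Whole seconds a waiter really blocks, given the absolute deadline T it
 * passed in and the system time at the moment the wait starts. */
long effective_wait_sec(struct timespec deadline, struct timespec clock_at_wait)
{
    long lapse;

    assert(deadline.tv_nsec >= 0 && deadline.tv_nsec < 1000000000L);
    lapse = (long)(deadline.tv_sec - clock_at_wait.tv_sec);
    return lapse > 0 ? lapse : 0;    /* deadline already passed: no wait */
}
```

If T was computed as 1001 and the clock is then set back from 1000 to
1000 - 3600, the wait becomes 3601 seconds. If the clock is set forward
instead, the wait returns at once.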

And, if you wanna see something funny, have a look at the
pthread_cond_timedwait implementation. It takes T, calls gettimeofday() and
subtracts to get a timeout lapse, and then calls nanosleep. Fantastic,
another race condition.

When I realized all of this, it seemed unbelievable, because it means every
program that uses pthread_cond_timedwait out there has race conditions when
time changes. Yeah, sure, the system time usually does not change, or is NTP
adjusted by a few seconds, which should be no problem (otherwise you should
be using a true real-time system), but again, at least once a year the
time is set back one hour, and even a one-in-a-gazillion chance that some
processes in my server "freeze" for one hour when that happens is simply
unacceptable.

I ended up doing the following: at program startup, get the current time and
save it as a global value T. Each time I call gettimeofday(), I subtract T
in order to get a relative time, and use it throughout the program. Of
course, whenever I must call pthread_cond_timedwait, I add T to the relative
value. The catch here is that system time changes are done ALWAYS through my
program, which means I know when that happens and can adjust T accordingly
so everything works fine. This allows me to work with relative times, but
does not solve the race conditions. And, of course, having all system time
changes go through a certain process is feasible only in very specific
situations, mostly embedded systems.
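
In code, the idea looks roughly like this. It is a simplified sketch and not
my actual program: the names are invented here, times are kept as
microseconds in a 64-bit integer rather than as struct timeval, and there is
no locking around T.

```c
#include <assert.h>
#include <stdint.h>

/* The global T from the text: system time captured at program startup. */
static int64_t base_us;

void reltime_init(int64_t system_now_us)
{
    assert(system_now_us >= 0);
    base_us = system_now_us;
}

/* System time (e.g. from gettimeofday) -> program-relative time. */
int64_t reltime_from_system(int64_t system_us)
{
    return system_us - base_us;
}

/* Program-relative time -> absolute system time, for calls such as
 * pthread_cond_timedwait that insist on system time. */
int64_t reltime_to_system(int64_t relative_us)
{
    return relative_us + base_us;
}

/* Called by the one place that changes the system clock, so that the
 * relative time keeps running as if nothing had happened. */
void reltime_clock_stepped(int64_t old_system_us, int64_t new_system_us)
{
    base_us += new_system_us - old_system_us;
}
```

The relative time stays continuous across a clock change only because
reltime_clock_stepped is called for every change, which is the "all changes
go through my program" condition described above.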

Anyway, I casually checked the latest glibc distributed by Red Hat and was
much pleased to see that a new clock has been implemented for
pthread_cond_timedwait. It is CLOCK_MONOTONIC (as specified in POSIX
1003.1j) and it is exactly what I wanted. However, I believe kernel 2.6.x is
necessary (since you need that clock to be implemented in the kernel) and
I'm stuck with 2.4.x in my application. And, of course, this is a highly
non-portable solution.
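
For reference, this is roughly how I understand the new clock is meant to be
used. It is an untested-on-2.4 sketch that assumes a system where
pthread_condattr_setclock and clock_gettime(CLOCK_MONOTONIC) are available;
the function name is mine, and nobody ever signals the condition, so the
call is expected to time out:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Wait up to wait_ms milliseconds on a condition, timing the wait against
 * CLOCK_MONOTONIC. Returns the pthread_cond_timedwait result, which is
 * ETIMEDOUT here because nothing signals the condition. */
int monotonic_timed_wait(long wait_ms)
{
    pthread_mutex_t mutex;
    pthread_condattr_t attr;
    pthread_cond_t cond;
    struct timespec T;
    int rc;

    assert(wait_ms >= 0);

    /* Tell the condition variable to measure T on the monotonic clock. */
    pthread_condattr_init(&attr);
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc != 0) {
        pthread_condattr_destroy(&attr);
        return rc;                   /* clock not supported */
    }
    pthread_cond_init(&cond, &attr);
    pthread_condattr_destroy(&attr);
    pthread_mutex_init(&mutex, NULL);

    /* Absolute deadline on the same clock: T = now + wait_ms. */
    clock_gettime(CLOCK_MONOTONIC, &T);
    T.tv_sec += wait_ms / 1000;
    T.tv_nsec += (wait_ms % 1000) * 1000000L;
    if (T.tv_nsec >= 1000000000L) {
        T.tv_sec++;
        T.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&mutex);
    do {
        rc = pthread_cond_timedwait(&cond, &mutex, &T);
    } while (rc == 0);               /* spurious wakeup: wait again */
    pthread_mutex_unlock(&mutex);

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mutex);
    return rc;
}
```

Because both the deadline and the wait use the monotonic clock, setting the
system time between computing T and waiting should no longer matter.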

I'll be eagerly waiting for your analysis of the problem in OLSRD.

Nacho.
