[Olsr-dev] info plugin still send blocking

Joe Ayers (spam-protected)
Wed Dec 6 16:52:05 CET 2017


The tests with this patch are successful.

Before:   4+ sequential failures reproduced in less than an hour each time

After:  same test setup, plus an additional on-node
"echo /all | nc localhost 2006" repeating every 15 seconds to further
stress test.  Ran for over 11 hours with no failures, and it is still running.

Good to go from my perspective.

Joe AE6XE

On Mon, Dec 4, 2017 at 11:59 PM, Henning Rogge <(spam-protected)> wrote:

> Hi,
>
> could you test the following patch?
>
> Henning
>
> On Mon, Dec 4, 2017 at 7:53 PM, Joe Ayers <(spam-protected)> wrote:
> > Correction, full strace log file URL is:
> >
> > https://drive.google.com/file/d/1TGW5VFpcKppbd82eT72qf6TqTtgqy-0j/view?usp=sharing
> >
> >
> >
> > On Mon, Dec 4, 2017 at 10:36 AM, Joe Ayers <(spam-protected)> wrote:
> >> In reference to:
> >>
> >> " * A timer was added and each time it expires each non-empty buffer
> >>   * in this structure will try to write data into a "non-blocking"
> >>   * socket until all data is sent, so that no blocking occurs."
> >>
> >> A blocking event can be reliably reproduced in olsr 0.9.6.2 in OpenWRT
> >> Chaos Calmer.  The node drops off the mesh and stops responding (not a
> >> good thing when it's remote on a tower :) ).
> >>
> >> Test scenario:
> >>
> >> - Node_A LAN laptop, "echo /all | nc node_B 9090"  sleeps 2 seconds
> >> and repeats (reproduces more quickly, but could be a single hit)
> >> - Node_A has RF link to Node_B with ~70% LQ/NLQ (maybe marginal LQ/NLQ
> >> is a non-factor?)
> >> - Node_B has "echo /nei | nc localhost 2006" sleeps 5 seconds and repeats
> >>
> >> After a few minutes, Node_B blocks on send in strace,  subsequently
> >> waited ~10 min and SIGTERM'd  (search for SIGTERM to find in full log
> >> file URL below).
> >>
> >> clock_gettime(CLOCK_MONOTONIC, {281, 819282014}) = 0
> >> accept(13, {sa_family=AF_INET, sin_port=htons(32965),
> >> sin_addr=inet_addr("10.34.163.239")}, [16]) = 17
> >> _newselect(18, [17], NULL, NULL, {0, 20000}) = 1 (in [17], left {0, 19984})
> >> recv(17, "/all\n", 1024, MSG_DONTWAIT)  = 5
> >> time(NULL)                              = 1512345153
> >> clock_gettime(CLOCK_MONOTONIC, {281, 825309152}) = 0
> >> ...
> >> clock_gettime(CLOCK_MONOTONIC, {282, 416388829}) = 0
> >> _newselect(18, NULL, [17], NULL, {0, 0}) = 1 (out [17], left {0, 0})
> >> send(17, "{\"pid\": 1726,\"systemTime\": 15123"..., 386040, 0) = 141904
> >>
> >> OLSR no longer functions at this point in time, remaining TCP
> >> connections go to CLOSE_WAIT until socket listen limit reached.
> >> Shouldn't there be a MSG_DONTWAIT flag in the send to be non-blocking?
> >>
> >> Full strace log:
> >> https://drive.google.com/file/d/0B2bEy75HhwWhVE1BQ3BUdHY3azg/view?usp=sharing
> >>
> >> Joe AE6XE
> >
>
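
On the MSG_DONTWAIT question quoted above: the strace shows select()
reporting fd 17 as writable, followed by a send() of 386040 bytes with
flags 0. Writability only means some buffer space is free, so a send() of
that size on a blocking socket can still stall until the peer drains it.
Below is a minimal C sketch of the timer-driven approach the quoted code
comment describes: keep an offset per buffer, send with MSG_DONTWAIT, and
stop on EAGAIN until the next timer tick. It is not the patch tested in
this thread; struct out_buf and flush_nonblocking() are hypothetical names
used for illustration, not olsrd identifiers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical per-connection output buffer (not olsrd's actual struct). */
struct out_buf {
  const char *data; /* bytes queued for this connection */
  size_t len;       /* total number of bytes in data */
  size_t off;       /* number of bytes already handed to the kernel */
};

/*
 * Push as much pending data as the kernel accepts right now, never blocking.
 * Returns  1 when the whole buffer has been sent,
 *          0 when the socket buffer is full (call again on the next tick),
 *         -1 on a real error (errno is set by send()).
 */
static int flush_nonblocking(int fd, struct out_buf *buf)
{
  assert(buf->off <= buf->len);

  while (buf->off < buf->len) {
    ssize_t n = send(fd, buf->data + buf->off, buf->len - buf->off,
                     MSG_DONTWAIT);
    if (n > 0) {
      buf->off += (size_t)n; /* partial sends are normal; keep the offset */
      continue;
    }
    if (n == 0) {
      return 0;              /* nothing accepted; try again later */
    }
    if (errno == EINTR) {
      continue;              /* interrupted by a signal; retry at once */
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
      return 0;              /* would block; leave the rest for later */
    }
    return -1;               /* e.g. ECONNRESET or EPIPE */
  }
  return 1;
}
```

A caller would invoke flush_nonblocking() from its timer for every buffer
with off < len, free the buffer once it returns 1, and close the
connection when it returns -1.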