[Olsr-dev] info plugin still send blocking

Henning Rogge (spam-protected)
Tue Dec 5 12:36:48 CET 2017


On Tue, Dec 5, 2017 at 12:29 PM, Ferry Huberts <(spam-protected)> wrote:
> Yes, sorry for the slow reply.
> That patch should probably work well.
> To be fair, this was already present in the plugins before the conversion to
> shared info; I just traced it there.

Yes, most likely...

It is quite difficult to grasp why a send() could block after select()
told you "you can send"... but it can happen.
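
One plausible explanation, offered as an assumption rather than a confirmed
diagnosis of this bug: select() only reports that some data can be written
without blocking, whereas send() on a blocking socket with flags 0 waits
until the whole buffer (386040 bytes in the quoted strace) has been queued.
A minimal sketch of the non-blocking alternative follows. The struct
pending_out and the function flush_pending() are hypothetical names for
illustration and are not taken from the olsrd sources or from the patch
under test; MSG_DONTWAIT is the flag Joe asked about.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical per-connection output buffer (not an olsrd structure). */
struct pending_out {
  const char *data; /* start of the data to send */
  size_t len;       /* total number of bytes in data */
  size_t sent;      /* bytes already accepted by the kernel */
};

/*
 * Push as much pending data as the kernel accepts right now, never blocking.
 * Returns 1 when everything has been sent, 0 when data remains (retry on the
 * next timer tick or when select() reports the socket writable), and -1 on a
 * real socket error.
 */
static int flush_pending(int fd, struct pending_out *out)
{
  while (out->sent < out->len) {
    ssize_t n = send(fd, out->data + out->sent, out->len - out->sent,
                     MSG_DONTWAIT);
    if (n < 0) {
      if (errno == EINTR) {
        continue; /* interrupted by a signal: simply retry */
      }
      if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return 0; /* socket buffer is full: keep the rest for later */
      }
      return -1; /* genuine error: caller should close the connection */
    }
    out->sent += (size_t)n; /* partial sends are normal: advance the offset */
  }
  return 1;
}
```

The caller keeps the struct around between calls, so a partial send only
advances the offset instead of stalling the main loop.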

Henning

> On 05/12/17 08:59, Henning Rogge wrote:
>>
>> Hi,
>>
>> could you test the following patch?
>>
>> Henning
>>
>> On Mon, Dec 4, 2017 at 7:53 PM, Joe Ayers <(spam-protected)> wrote:
>>>
>>> Correction, full strace log file URL is:
>>>
>>>
>>> https://drive.google.com/file/d/1TGW5VFpcKppbd82eT72qf6TqTtgqy-0j/view?usp=sharing
>>>
>>>
>>>
>>> On Mon, Dec 4, 2017 at 10:36 AM, Joe Ayers <(spam-protected)> wrote:
>>>>
>>>> In reference to:
>>>>
>>>> " * A timer was added and each time it expires each non-empty buffer
>>>>    * in this structure will try to write data into a "non-blocking"
>>>>    * socket until all data is sent, so that no blocking occurs."
>>>>
>>>> A blocking event can be reliably reproduced in olsr 0.9.6.2 in OpenWRT
>>>> Chaos Calmer.  The node drops off the mesh and stops responding (not a
>>>> good thing when it's remote on a tower :) ).
>>>>
>>>> Test scenario:
>>>>
>>>> - Node_A LAN laptop, "echo /all | nc node_B 9090"  sleeps 2 seconds
>>>> and repeats (reproduces more quickly, but could be a single hit)
>>>> - Node_A has RF link to Node_B with ~70% LQ/NLQ (maybe marginal LQ/NLQ
>>>> is a non-factor?)
>>>> - Node_B has "echo /nei | nc localhost 2006" sleeps 5 seconds and
>>>> repeats
>>>>
>>>> After a few minutes, strace shows Node_B blocked in send(). I then
>>>> waited ~10 min and sent SIGTERM (search for SIGTERM in the full log
>>>> file linked below).
>>>>
>>>> clock_gettime(CLOCK_MONOTONIC, {281, 819282014}) = 0
>>>> accept(13, {sa_family=AF_INET, sin_port=htons(32965),
>>>> sin_addr=inet_addr("10.34.163.239")}, [16]) = 17
>>>> _newselect(18, [17], NULL, NULL, {0, 20000}) = 1 (in [17], left {0,
>>>> 19984})
>>>> recv(17, "/all\n", 1024, MSG_DONTWAIT)  = 5
>>>> time(NULL)                              = 1512345153
>>>> clock_gettime(CLOCK_MONOTONIC, {281, 825309152}) = 0
>>>> ...
>>>> clock_gettime(CLOCK_MONOTONIC, {282, 416388829}) = 0
>>>> _newselect(18, NULL, [17], NULL, {0, 0}) = 1 (out [17], left {0, 0})
>>>> send(17, "{\"pid\": 1726,\"systemTime\": 15123"..., 386040, 0) = 141904
>>>>
>>>> OLSR no longer functions at this point; the remaining TCP
>>>> connections go to CLOSE_WAIT until the socket listen limit is reached.
>>>> Shouldn't the send() carry the MSG_DONTWAIT flag to be non-blocking?
>>>>
>>>> Full strace log:
>>>>
>>>> https://drive.google.com/file/d/0B2bEy75HhwWhVE1BQ3BUdHY3azg/view?usp=sharing
>>>>
>>>> Joe AE6XE
>>>
>>>
>
> --
> Ferry Huberts
