<div dir="ltr">The tests with this patch are successful. <div><br></div><div>Before: 4+ consecutive runs each reproduced the failure in less than an hour.</div><div><br></div><div>After: same test setup, plus an on-node "echo /all | nc localhost 2006" repeating every 15 seconds as an additional stress test. It has run for over 11 hours with no failures and is still running.</div><div><br></div><div>Good to go from my perspective.</div><div><br></div><div>Joe AE6XE</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Dec 4, 2017 at 11:59 PM, Henning Rogge <span dir="ltr"><<a href="mailto:hrogge@gmail.com" target="_blank">hrogge@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
could you test the following patch?<br>
<br>
Henning<br>
<br>
On Mon, Dec 4, 2017 at 7:53 PM, Joe Ayers <<a href="mailto:joe@ayerscasa.com">joe@ayerscasa.com</a>> wrote:<br>
> Correction, full strace log file URL is:<br>
><br>
> <a href="https://drive.google.com/file/d/1TGW5VFpcKppbd82eT72qf6TqTtgqy-0j/view?usp=sharing" rel="noreferrer" target="_blank">https://drive.google.com/file/<wbr>d/<wbr>1TGW5VFpcKppbd82eT72qf6TqTtgqy<wbr>-0j/view?usp=sharing</a><br>
><br>
><br>
><br>
> On Mon, Dec 4, 2017 at 10:36 AM, Joe Ayers <<a href="mailto:joe@ayerscasa.com">joe@ayerscasa.com</a>> wrote:<br>
>> In reference to:<br>
>><br>
>> " * A timer was added and each time it expires each non-empty buffer<br>
>> * in this structure will try to write data into a "non-blocking"<br>
>> * socket until all data is sent, so that no blocking occurs."<br>
>><br>
>> A blocking event can be reliably reproduced in olsr 0.9.6.2 in OpenWRT<br>
>> Chaos Calmer. The node drops off the mesh and stops responding (not a<br>
>> good thing when it's remote on a tower :) ).<br>
>><br>
>> Test scenario:<br>
>><br>
>> - Node_A LAN laptop, "echo /all | nc node_B 9090" sleeps 2 seconds<br>
>> and repeats (reproduces more quickly, but could be a single hit)<br>
>> - Node_A has RF link to Node_B with ~70% LQ/NLQ (maybe marginal LQ/NLQ<br>
>> is a non-factor?)<br>
>> - Node_B has "echo /nei | nc localhost 2006" sleeps 5 seconds and repeats<br>
>><br>
>> After a few minutes, strace shows Node_B blocking on send. I then<br>
>> waited ~10 min and sent SIGTERM (search for SIGTERM in the full log<br>
>> file linked below).<br>
>><br>
>> clock_gettime(CLOCK_MONOTONIC, {281, 819282014}) = 0<br>
>> accept(13, {sa_family=AF_INET, sin_port=htons(32965),<br>
>> sin_addr=inet_addr("10.34.163.<wbr>239")}, [16]) = 17<br>
>> _newselect(18, [17], NULL, NULL, {0, 20000}) = 1 (in [17], left {0, 19984})<br>
>> recv(17, "/all\n", 1024, MSG_DONTWAIT) = 5<br>
>> time(NULL) = 1512345153<br>
>> clock_gettime(CLOCK_MONOTONIC, {281, 825309152}) = 0<br>
>> ...<br>
>> clock_gettime(CLOCK_MONOTONIC, {282, 416388829}) = 0<br>
>> _newselect(18, NULL, [17], NULL, {0, 0}) = 1 (out [17], left {0, 0})<br>
>> send(17, "{\"pid\": 1726,\"systemTime\": 15123"..., 386040, 0) = 141904<br>
>><br>
>> OLSR no longer functions at this point; remaining TCP connections go<br>
>> to CLOSE_WAIT until the socket listen limit is reached. Shouldn't the<br>
>> send use the MSG_DONTWAIT flag to be non-blocking?<br>
>><br>
>> Full strace log:<br>
>> <a href="https://drive.google.com/file/d/0B2bEy75HhwWhVE1BQ3BUdHY3azg/view?usp=sharing" rel="noreferrer" target="_blank">https://drive.google.com/file/<wbr>d/<wbr>0B2bEy75HhwWhVE1BQ3BUdHY3azg/<wbr>view?usp=sharing</a><br>
>><br>
>> Joe AE6XE<br>
<span class="HOEnZb"><font color="#888888">><br>
> --<br>
> Olsr-dev mailing list<br>
> <a href="mailto:Olsr-dev@lists.olsr.org">Olsr-dev@lists.olsr.org</a><br>
> <a href="https://lists.olsr.org/mailman/listinfo/olsr-dev" rel="noreferrer" target="_blank">https://lists.olsr.org/<wbr>mailman/listinfo/olsr-dev</a><br>
</font></span></blockquote></div><br></div>
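<div><br></div><div>For readers following the MSG_DONTWAIT question in the quoted report, here is a minimal Python sketch of a send that never blocks. It is not the olsrd patch, which is not shown in this thread. The names send_nonblocking and demo are hypothetical, and a local socketpair stands in for the TCP connection between olsrd and a slow client. It assumes a Linux or other Unix system where socket.MSG_DONTWAIT is available.</div>
<pre>
```python
import socket


def send_nonblocking(sock, payload):
    """Send as much of payload as the kernel accepts right now.

    Returns the number of bytes accepted. Never waits for buffer space.
    """
    sent = 0
    while sent != len(payload):
        try:
            # MSG_DONTWAIT makes this single call non-blocking, even
            # though the socket itself is still in blocking mode.
            sent += sock.send(payload[sent:], socket.MSG_DONTWAIT)
        except BlockingIOError:
            # EAGAIN / EWOULDBLOCK: the send buffer is full. Stop here and
            # let the caller retry later, for example from a timer or
            # after select() reports the socket writable again.
            break
    return sent


def demo():
    # Assumption: a connected local socket pair stands in for the
    # olsrd-to-client connection; the reader never calls recv().
    writer, reader = socket.socketpair()
    try:
        # Much larger than a default socket buffer, so a plain blocking
        # send of the whole payload could not complete here.
        payload = b"x" * (4 * 1024 * 1024)
        return send_nonblocking(writer, payload), len(payload)
    finally:
        writer.close()
        reader.close()
```
</pre>
<div>Calling demo() returns the number of bytes the kernel accepted and the payload size. With default buffer sizes the first number should be smaller than the second, which mirrors the partial send in the strace output above (141904 of 386040 bytes), except that the call returns instead of blocking on the remainder.</div>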