[Olsr-dev] [PATCH v1 1/1] lq_packet: only report a neighbour once in a hello message

Conrad Lara (spam-protected)
Thu Jan 19 23:00:14 CET 2017


---- Ferry Huberts <(spam-protected)> wrote: 
>> 
>
>On 19/01/17 21:15, (spam-protected) wrote:
>> Has it been determined why 3d2fd73a5528a9c7cdccd088f2dcca80b37e66b9
>> broke this?
>>
>
>It dropped the check for UNSPEC
>
>That's why I asked Henning to look into it, since he knows best why he
>did that.
>
>> If it is indeed the cause and we are in a rush to solve this why not
>> just revert 3d2fd73a5528a9c7cdccd088f2dcca80b37e66b9?
>>
>
>
>That is also a possibility I've considered.
>I want to hear from Henning first though.
>
>In the meantime we can continue to investigate alternatives; we now have
>2 patches to deal with it on the TX side, plus the revert, so 3 options.
>

Its fair to investigate alternatives, but that doesn't mean the alternative needs to be pushed into mainline and a release package made before the full review is done.

>> This is broken but let's not rush fixes to resolve it.
>>
>> If there is a need for a drastic quick fix revert
>> 3d2fd73a5528a9c7cdccd088f2dcca80b37e66b9 that attempt to handle
>> fragmented packets and make sure to solve that issue correctly
>> without impacting the rest of the code (BTW: the RFC says we should
>> [its a should not a must]  avoid having fragments in the first place
>> )
>
>Yes, well Henning didn't do his change for no good reason, duh.
>If you go over (I think) 62 neighbours then you _will_ have fragmented
>HELLOs. No avoiding that at all.
>

Fair.

Point is this is the commit that has caused all the issues since 0.6.7.
Generaly in my years of coding when something breaks after a change it is the breaking change that gets revisited not a change elsewhere until that change proves its the only way to do it.

>
>
>>
>> There is no point to push a new release that is still knowingly
>> broken, even if this goes out with the TX fix this hour it will not
>> fix anyone out there that does not upgrade all the impacted devices
>> in the network.
>
>And how is it broken after my fix?
>Only when mixing versions, true.
>
It is also not fully resolved ("We need to make a patch on the RX side as well".) I belive that makes it fair to say it is broken still.
It may be mitigated, but it isn't fully resolved (which is where I normaly move support tickets from broken to fixed). Untill the RX patch is made it isn't fully known how this will be all handled, perhaps this isn't even needed with the RX patch and perhaps it is.

This issue continues to be discovered just how deep it is. Some solutiosn are not viable (changing the hellow message format to add indexs) others are less known.  

Its broken right now and I'm sure everyone wants to see it fixed, I know I do. I haven't hit it yet in production and do not want to hit it in production, but I also belive in caution and taking time when working around signficantly complex issues.

>>
>> It's taken since 0.6.7 to even find out this exists,  its likely not
>> the most frequency occurring problem impacting users. The small
>
>The effects can be invisible in some network configurations.
>We only saw it because we were digging in a development setup.
>In other setups it can be quite visible. It all depends.
>

Seems to cause this you have to have 2 diffrent interfaces on the same L2 medium (EG 2x wired network interface  or 2x wifi cards on the same SSID on the same RF channel)
(been doing 1x wifi and 1x wired to two devices for ages on 0.6.7 with it working flawlessly)
I don't have any hard data as to how often  a setup like that occurs, I would think the WIFI be more common (as the wired would be a wiring fault IMHO but we still have to handle it gracefully) In the case of WIFI using same RF channel on two devices isn't the best setup anyways.  I'm sure it happens but it is subject to a lot of interference.

>
>> group it's impacting can surely experiment with a development patch
>> manually if they absolutely need it.
>>
>> There are also other temporary resolutions available (avoiding two
>> interfaces on the same L2 broadcast domain, blocking packets, etc)
>>
>
>
>PS. would you mind introducing yourself? I think I've not seen you on
>this mailing list before.

Sorry, too many email accounts on the phone and still in the process of transfering everything over to the personal domain name (long process)

>
>
>> Sent from my iPhone
>>
>>> On Jan 19, 2017, at 11:46 AM, Ferry Huberts <(spam-protected)>
>>> wrote:
>>>
>>>
>>>
>>>> On 19/01/17 16:36, Teco Boot wrote: I'm still testing. I want to
>>>> see a 100% fix on both TX (eliminate duplicates that are not
>>>> correct or relevant) and on RX (don't use LQ where
>>>> link-type=UNSPEC. The RFC is my guide.
>>>
>>> Eliminating (these kinds of) duplicates is NOT ALLOWED per the RFC.
>>> The RFC very clearly states that the ENTIRE neighbourhood must be
>>> sent, and neighbours not reachable over the sending interface must
>>> be UNSPEC. In the example setup that results in one SYM neighbour
>>> and one UNSPEC neighbour, which is entirely correct.
>>>
>>> There is not enough information in the messages to fully fix this,
>>> accept it. This can only be fixed if the hellos contain the
>>> interface indexes with every record; then these duplicates will
>>> cease to be duplicates. That would require an RFC change, and we
>>> will not go there.
>>>
>>>
>>>>
>>>> I'm facing lots of SPFs where topology is stable. The glitches
>>>> bite.
>>>>
>>>> I tested a bit with fragments. I want to make sure that the RX
>>>> patch doesn't break something.
>>>
>>>
>>> There is no RX patch (yet), so what are your talking about?
>>>
>>>
>>>>
>>>> I don't see a reason to hurry.
>>>
>>> There is VERY good reason to hurry: this has been broken since
>>> 3d2fd73a5528a9c7cdccd088f2dcca80b37e66b9, which is every release
>>> since - and including - v0.6.7.
>>>
>>> The oneliner patch solves it, or at least mitigates it.
>>>
>>> We need to make a patch on the RX side as well, as we already
>>> agreed and as Henning indicated earlier today.
>>>
>>>
>>> I am going to push my TX fix tomorrow and I will work on the RX
>>> patch after next week.
>>>
>>>
>>> I will also put the fix on our release branch, and coordinate with
>>> Henning to do a release of that branch.
>>>
>>>
>>> F
>>>
>>>>
>>>> Teco
>>>>
>>>>
>>>>> Op 19 jan. 2017, om 15:00 heeft Henning Rogge
>>>>> <(spam-protected)> het volgende geschreven:
>>>>>
>>>>> On Thu, Jan 19, 2017 at 2:57 PM, Ferry Huberts
>>>>> <(spam-protected)> wrote:
>>>>>
>>>>>
>>>>> On 19/01/17 13:33, Henning Rogge wrote: Hi,
>>>>>
>>>>> back from (two) business trips...
>>>>>
>>>>> On Mon, Jan 16, 2017 at 6:14 PM, Ferry Huberts
>>>>> <(spam-protected) <mailto:(spam-protected)>> wrote:
>>>>>
>>>>> I think I may have stumbled onto a MUCH easier fix to this
>>>>> problem.
>>>>>
>>>>>
>>>>> The actual problem is quite involved (with fragmented hello
>>>>> messages etc) but it all boils down to a later UNSPEC link
>>>>> overwriting a previous SYM/ASYM/other link on the receiving
>>>>> end.
>>>>>
>>>>> Since the neighbours in a hello message are already ordered
>>>>> when the hello message is sent out, the following patch also
>>>>> solves it.
>>>>>
>>>>> Opinions?
>>>>>
>>>>>
>>>>> I like the idea of solve the problem by sorting, but I worry
>>>>> that doing so on the TX side is a bit "brittle"... its easy to
>>>>> break because there is nothing on the TX side that needs this
>>>>> order. And there will always be older Olsrd installations on
>>>>> many community networks.
>>>>>
>>>>> How difficult would it be to do the sorting on the RX side in
>>>>> deserialize_hello() (process_package.c: 318..)?
>>>>>
>>>>> Instead of building them into a single linked list, have an
>>>>> array with MAX_LINK lists, one for each link type... and then
>>>>> concatenate this lists before parsing in the right order.
>>>>>
>>>>> This way we also fix the problem appearing again because of an
>>>>> outdated Olsrd sending bad data, right?
>>>>>
>>>>>
>>>>>
>>>>> If we do that as a follow-up patch? Would that be ok?
>>>>>
>>>>> Sounds good... quick fix now to solve the problem for the
>>>>> current code, cleanup later to make it future proof.
>>>>>
>>>>> Henning
>>>>
>>>
>>> -- Ferry Huberts
>>>
>>> -- Olsr-dev mailing list (spam-protected)
>>> https://lists.olsr.org/mailman/listinfo/olsr-dev
>>
>>
>> On 19/01/17 16:36, Teco Boot wrote:
>>> I'm still testing. I want to see a 100% fix on both TX (eliminate
>>> duplicates that are not correct or relevant) and on RX (don't use
>>> LQ where link-type=UNSPEC. The RFC is my guide.
>>
>> Eliminating (these kinds of) duplicates is NOT ALLOWED per the RFC.
>> The RFC very clearly states that the ENTIRE neighbourhood must be
>> sent, and neighbours not reachable over the sending interface must
>> be UNSPEC. In the example setup that results in one SYM neighbour
>> and one UNSPEC neighbour, which is entirely correct.
>>
>> There is not enough information in the messages to fully fix this,
>> accept it. This can only be fixed if the hellos contain the
>> interface indexes with every record; then these duplicates will cease
>> to be duplicates. That would require an RFC change, and we will not
>> go there.
>>
>>
>>>
>>> I'm facing lots of SPFs where topology is stable. The glitches
>>> bite.
>>>
>>> I tested a bit with fragments. I want to make sure that the RX
>>> patch doesn't break something.
>>
>>
>> There is no RX patch (yet), so what are your talking about?
>>
>>
>>>
>>> I don't see a reason to hurry.
>>
>> There is VERY good reason to hurry: this has been broken since
>> 3d2fd73a5528a9c7cdccd088f2dcca80b37e66b9, which is every release
>> since - and including - v0.6.7.
>>
>> The oneliner patch solves it, or at least mitigates it.
>>
>> We need to make a patch on the RX side as well, as we already agreed
>> and as Henning indicated earlier today.
>>
>>
>> I am going to push my TX fix tomorrow and I will work on the RX
>> patch after next week.
>>
>>
>> I will also put the fix on our release branch, and coordinate with
>> Henning to do a release of that branch.
>>
>>
>> F
>>
>>>
>>> Teco
>>>
>>>
>>>> Op 19 jan. 2017, om 15:00 heeft Henning Rogge <(spam-protected)>
>>>> het volgende geschreven:
>>>>
>>>> On Thu, Jan 19, 2017 at 2:57 PM, Ferry Huberts
>>>> <(spam-protected)> wrote:
>>>>
>>>>
>>>> On 19/01/17 13:33, Henning Rogge wrote: Hi,
>>>>
>>>> back from (two) business trips...
>>>>
>>>> On Mon, Jan 16, 2017 at 6:14 PM, Ferry Huberts
>>>> <(spam-protected) <mailto:(spam-protected)>> wrote:
>>>>
>>>> I think I may have stumbled onto a MUCH easier fix to this
>>>> problem.
>>>>
>>>>
>>>> The actual problem is quite involved (with fragmented hello
>>>> messages etc) but it all boils down to a later UNSPEC link
>>>> overwriting a previous SYM/ASYM/other link on the receiving end.
>>>>
>>>> Since the neighbours in a hello message are already ordered when
>>>> the hello message is sent out, the following patch also solves
>>>> it.
>>>>
>>>> Opinions?
>>>>
>>>>
>>>> I like the idea of solve the problem by sorting, but I worry
>>>> that doing so on the TX side is a bit "brittle"... its easy to
>>>> break because there is nothing on the TX side that needs this
>>>> order. And there will always be older Olsrd installations on
>>>> many community networks.
>>>>
>>>> How difficult would it be to do the sorting on the RX side in
>>>> deserialize_hello() (process_package.c: 318..)?
>>>>
>>>> Instead of building them into a single linked list, have an
>>>> array with MAX_LINK lists, one for each link type... and then
>>>> concatenate this lists before parsing in the right order.
>>>>
>>>> This way we also f
>> ix the problem appearing again because of an outdated
>>>> Olsrd sending bad data, right?
>>>>
>>>>
>>>>
>>>> If we do that as a follow-up patch? Would that be ok?
>>>>
>>>> Sounds good... quick fix now to solve the problem for the
>>>> current code, cleanup later to make it future proof.
>>>>
>>>> Henning
>>>
>>
>
>-- 
>Ferry Huberts
>
>-- 
>Olsr-dev mailing list
>(spam-protected)
>https://lists.olsr.org/mailman/listinfo/olsr-dev
>



More information about the Olsr-dev mailing list