1. 03 9月, 2016 2 次提交
    • J
      tipc: rate limit broadcast retransmissions · 7c4a54b9
      Jon Paul Maloy 提交于
      As cluster sizes grow, so does the amount of identical or overlapping
      broadcast NACKs generated by the packet receivers. This often leads to
      'NACK crunches' resulting in huge numbers of redundant retransmissions
      of the same packet ranges.
      
      In this commit, we introduce rate control of broadcast retransmissions,
      so that a retransmitted range cannot be retransmitted again until after
      at least 10 ms. This reduces the frequency of duplicate, redundant
      retransmissions by an order of magnitude, while having a significant
      positive impact on overall throughput and scalability.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c4a54b9
    • J
      tipc: transfer broadcast nacks in link state messages · 02d11ca2
      Jon Paul Maloy 提交于
      When we send broadcasts in clusters of more 70-80 nodes, we sometimes
      see the broadcast link resetting because of an excessive number of
      retransmissions. This is caused by a combination of two factors:
      
      1) A 'NACK crunch", where loss of broadcast packets is discovered
         and NACK'ed by several nodes simultaneously, leading to multiple
         redundant broadcast retransmissions.
      
      2) The fact that the NACKS as such also are sent as broadcast, leading
         to excessive load and packet loss on the transmitting switch/bridge.
      
      This commit deals with the latter problem, by moving sending of
      broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
      to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
      bits in word 8 of the said message for this purpose, and introduce a
      new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
      backwards compatible.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02d11ca2
  2. 27 8月, 2016 7 次提交
  3. 26 8月, 2016 1 次提交
  4. 24 8月, 2016 1 次提交
  5. 19 8月, 2016 3 次提交
    • R
      tipc: add peer removal functionality · b3404022
      Richard Alpe 提交于
      Add TIPC_NL_PEER_REMOVE netlink command. This command can remove
      an offline peer node from the internal data structures.
      
      This will be supported by the tipc user space tool in iproute2.
      Signed-off-by: NRichard Alpe <richard.alpe@ericsson.com>
      Reviewed-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3404022
    • J
      tipc: ensure that link congestion and wakeup use same criteria · 5a0950c2
      Jon Paul Maloy 提交于
      When a link is attempted woken up after congestion, it uses a different,
      more generous criteria than when it was originally declared congested.
      This has the effect that the link, and the sending process, sometimes
      will be woken up unnecessarily, just to immediately return to congestion
      when it turns out there is not not enough space in its send queue to
      host the pending message. This is a waste of CPU cycles.
      
      We now change the function link_prepare_wakeup() to use exactly the same
      criteria as tipc_link_xmit(). However, since we are now excluding the
      window limit from the wakeup calculation, and the current backlog limit
      for the lowest level is too small to house even a single maximum-size
      message, we have to expand this limit. We do this by evaluating an
      alternative, minimum value during the setting of the importance limits.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a0950c2
    • J
      tipc: make bearer packet filtering generic · 0d051bf9
      Jon Paul Maloy 提交于
      In commit 5b7066c3 ("tipc: stricter filtering of packets in bearer
      layer") we introduced a method of filtering out messages while a bearer
      is being reset, to avoid that links may be re-created and come back in
      working state while we are still in the process of shutting them down.
      
      This solution works well, but is limited to only work with L2 media, which
      is insufficient with the increasing use of UDP as carrier media.
      
      We now replace this solution with a more generic one, by introducing a
      new flag "up" in the generic struct tipc_bearer. This field will be set
      and reset at the same locations as with the previous solution, while
      the packet filtering is moved to the generic code for the sending side.
      On the receiving side, the filtering is still done in media specific
      code, but now including the UDP bearer.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d051bf9
  6. 16 8月, 2016 1 次提交
    • V
      tipc: fix NULL pointer dereference in shutdown() · d2fbdf76
      Vegard Nossum 提交于
      tipc_msg_create() can return a NULL skb and if so, we shouldn't try to
      call tipc_node_xmit_skb() on it.
      
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          CPU: 3 PID: 30298 Comm: trinity-c0 Not tainted 4.7.0-rc7+ #19
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          task: ffff8800baf09980 ti: ffff8800595b8000 task.ti: ffff8800595b8000
          RIP: 0010:[<ffffffff830bb46b>]  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
          RSP: 0018:ffff8800595bfce8  EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003023b0e0
          RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffff83d12580
          RBP: ffff8800595bfd78 R08: ffffed000b2b7f32 R09: 0000000000000000
          R10: fffffbfff0759725 R11: 0000000000000000 R12: 1ffff1000b2b7f9f
          R13: ffff8800595bfd58 R14: ffffffff83d12580 R15: dffffc0000000000
          FS:  00007fcdde242700(0000) GS:ffff88011af80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fcddde1db10 CR3: 000000006874b000 CR4: 00000000000006e0
          DR0: 00007fcdde248000 DR1: 00007fcddd73d000 DR2: 00007fcdde248000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
          Stack:
           0000000000000018 0000000000000018 0000000041b58ab3 ffffffff83954208
           ffffffff830bb400 ffff8800595bfd30 ffffffff8309d767 0000000000000018
           0000000000000018 ffff8800595bfd78 ffffffff8309da1a 00000000810ee611
          Call Trace:
           [<ffffffff830c84a3>] tipc_shutdown+0x553/0x880
           [<ffffffff825b4a3b>] SyS_shutdown+0x14b/0x170
           [<ffffffff8100334c>] do_syscall_64+0x19c/0x410
           [<ffffffff83295ca5>] entry_SYSCALL64_slow_path+0x25/0x25
          Code: 90 00 b4 0b 83 c7 00 f1 f1 f1 f1 4c 8d 6d e0 c7 40 04 00 00 00 f4 c7 40 08 f3 f3 f3 f3 48 89 d8 48 c1 e8 03 c7 45 b4 00 00 00 00 <80> 3c 30 00 75 78 48 8d 7b 08 49 8d 75 c0 48 b8 00 00 00 00 00
          RIP  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
           RSP <ffff8800595bfce8>
          ---[ end trace 57b0484e351e71f1 ]---
      
      I feel like we should maybe return -ENOMEM or -ENOBUFS, but I'm not sure
      userspace is equipped to handle that. Anyway, this is better than a GPF
      and looks somewhat consistent with other tipc_msg_create() callers.
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2fbdf76
  7. 11 8月, 2016 1 次提交
  8. 31 7月, 2016 1 次提交
  9. 27 7月, 2016 5 次提交
  10. 12 7月, 2016 3 次提交
    • J
      tipc: reset all unicast links when broadcast send link fails · 1fc07f3e
      Jon Paul Maloy 提交于
      In test situations with many nodes and a heavily stressed system we have
      observed that the transmission broadcast link may fail due to an
      excessive number of retransmissions of the same packet. In such
      situations we need to reset all unicast links to all peers, in order to
      reset and re-synchronize the broadcast link.
      
      In this commit, we add a new function tipc_bearer_reset_all() to be used
      in such situations. The function scans across all bearers and resets all
      their pertaining links.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1fc07f3e
    • J
      tipc: ensure correct broadcast send buffer release when peer is lost · a71eb720
      Jon Paul Maloy 提交于
      After a new receiver peer has been added to the broadcast transmission
      link, we allow immediate transmission of new broadcast packets, trusting
      that the new peer will not accept the packets until it has received the
      previously sent unicast broadcast initialiation message. In the same
      way, the sender must not accept any acknowledges until it has itself
      received the broadcast initialization from the peer, as well as
      confirmation of the reception of its own initialization message.
      
      Furthermore, when a receiver peer goes down, the sender has to produce
      the missing acknowledges from the lost peer locally, in order ensure
      correct release of the buffers that were expected to be acknowledged by
      the said peer.
      
      In a highly stressed system we have observed that contact with a peer
      may come up and be lost before the above mentioned broadcast initial-
      ization and confirmation have been received. This leads to the locally
      produced acknowledges being rejected, and the non-acknowledged buffers
      to linger in the broadcast link transmission queue until it fills up
      and the link goes into permanent congestion.
      
      In this commit, we remedy this by temporarily setting the corresponding
      broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
      state to true before we issue the local acknowledges. This ensures that
      those acknowledges will always be accepted. The mentioned state values
      are restored immediately afterwards when the link is reset.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a71eb720
    • J
      tipc: extend broadcast link initialization criteria · 2d18ac4b
      Jon Paul Maloy 提交于
      At first contact between two nodes, an endpoint might sometimes have
      time to send out a LINK_PROTOCOL/STATE packet before it has received
      the broadcast initialization packet from the peer, i.e., before it has
      received a valid broadcast packet number to add to the 'bc_ack' field
      of the protocol message.
      
      This means that the peer endpoint will receive a protocol packet with an
      invalid broadcast acknowledge value of 0. Under unlucky circumstances
      this may lead to the original, already received acknowledge value being
      overwritten, so that the whole broadcast link goes stale after a while.
      
      We fix this by delaying the setting of the link field 'bc_peer_is_up'
      until we know that the peer really has received our own broadcast
      initialization message. The latter is always sent out as the first
      unicast message on a link, and always with seqeunce number 1. Because
      of this, we only need to look for a non-zero unicast acknowledge value
      in the arriving STATE messages, and once that is confirmed we know we
      are safe and can set the mentioned field. Before this moment, we must
      ignore all broadcast acknowledges from the peer.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d18ac4b
  11. 02 7月, 2016 1 次提交
    • R
      tipc: fix nl compat regression for link statistics · 55e77a3e
      Richard Alpe 提交于
      Fix incorrect use of nla_strlcpy() where the first NLA_HDRLEN bytes
      of the link name where left out.
      
      Making the output of tipc-config -ls look something like:
      Link statistics:
      dcast-link
      1:data0-1.1.2:data0
      1:data0-1.1.3:data0
      
      Also, for the record, the patch that introduce this regression
      claims "Sending the whole object out can cause a leak". Which isn't
      very likely as this is a compat layer, where the data we are parsing
      is generated by us and we know the string to be NULL terminated. But
      you can of course never be to secure.
      
      Fixes: 5d2be142 (tipc: fix an infoleak in tipc_nl_compat_link_dump)
      Signed-off-by: NRichard Alpe <richard.alpe@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      55e77a3e
  12. 29 6月, 2016 2 次提交
  13. 27 6月, 2016 1 次提交
  14. 23 6月, 2016 1 次提交
    • J
      tipc: unclone unbundled buffers before forwarding · 27777daa
      Jon Paul Maloy 提交于
      When extracting an individual message from a received "bundle" buffer,
      we just create a clone of the base buffer, and adjust it to point into
      the right position of the linearized data area of the latter. This works
      well for regular message reception, but during periods of extremely high
      load it may happen that an extracted buffer, e.g, a connection probe, is
      reversed and forwarded through an external interface while the preceding
      extracted message is still unhandled. When this happens, the header or
      data area of the preceding message will be partially overwritten by a
      MAC header, leading to unpredicatable consequences, such as a link
      reset.
      
      We now fix this by ensuring that the msg_reverse() function never
      returns a cloned buffer, and that the returned buffer always contains
      sufficient valid head and tail room to be forwarded.
      Reported-by: NErik Hugne <erik.hugne@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27777daa
  15. 18 6月, 2016 2 次提交
    • J
      tipc: fix socket timer deadlock · f1d048f2
      Jon Paul Maloy 提交于
      We sometimes observe a 'deadly embrace' type deadlock occurring
      between mutually connected sockets on the same node. This happens
      when the one-hour peer supervision timers happen to expire
      simultaneously in both sockets.
      
      The scenario is as follows:
      
      CPU 1:                          CPU 2:
      --------                        --------
      tipc_sk_timeout(sk1)            tipc_sk_timeout(sk2)
        lock(sk1.slock)                 lock(sk2.slock)
        msg_create(probe)               msg_create(probe)
        unlock(sk1.slock)               unlock(sk2.slock)
        tipc_node_xmit_skb()            tipc_node_xmit_skb()
          tipc_node_xmit()                tipc_node_xmit()
            tipc_sk_rcv(sk2)                tipc_sk_rcv(sk1)
              lock(sk2.slock)                 lock((sk1.slock)
              filter_rcv()                    filter_rcv()
                tipc_sk_proto_rcv()             tipc_sk_proto_rcv()
                  msg_create(probe_rsp)           msg_create(probe_rsp)
                  tipc_sk_respond()               tipc_sk_respond()
                    tipc_node_xmit_skb()            tipc_node_xmit_skb()
                      tipc_node_xmit()                tipc_node_xmit()
                        tipc_sk_rcv(sk1)                tipc_sk_rcv(sk2)
                          lock((sk1.slock)                lock((sk2.slock)
                          ===> DEADLOCK                   ===> DEADLOCK
      
      Further analysis reveals that there are three different locations in the
      socket code where tipc_sk_respond() is called within the context of the
      socket lock, with ensuing risk of similar deadlocks.
      
      We now solve this by passing a buffer queue along with all upcalls where
      sk_lock.slock may potentially be held. Response or rejected message
      buffers are accumulated into this queue instead of being sent out
      directly, and only sent once we know we are safely outside the slock
      context.
      Reported-by: NGUNA <gbalasun@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1d048f2
    • D
      tipc: potential shift wrapping bug in map_set() · 0350cb48
      Dan Carpenter 提交于
      "up_map" is a u64 type but we're not using the high 32 bits.
      
      Fixes: 35c55c98 ('tipc: add neighbor monitoring framework')
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0350cb48
  16. 16 6月, 2016 3 次提交
    • Y
      tipc: eliminate uninitialized variable warning · c91522f8
      Ying Xue 提交于
      net/tipc/link.c: In function ‘tipc_link_timeout’:
      net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]
      
      Fixes: 42b18f60 ("tipc: refactor function tipc_link_timeout()")
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c91522f8
    • Y
      tipc: fix suspicious RCU usage · 66d95b67
      Ying Xue 提交于
      When run tipcTS&tipcTC test suite, the following complaint appears:
      
      [   56.926168] ===============================
      [   56.926169] [ INFO: suspicious RCU usage. ]
      [   56.926171] 4.7.0-rc1+ #160 Not tainted
      [   56.926173] -------------------------------
      [   56.926174] net/tipc/bearer.c:408 suspicious rcu_dereference_protected() usage!
      [   56.926175]
      [   56.926175] other info that might help us debug this:
      [   56.926175]
      [   56.926177]
      [   56.926177] rcu_scheduler_active = 1, debug_locks = 1
      [   56.926179] 3 locks held by swapper/4/0:
      [   56.926180]  #0:  (((&req->timer))){+.-...}, at: [<ffffffff810e79b5>] call_timer_fn+0x5/0x340
      [   56.926203]  #1:  (&(&req->lock)->rlock){+.-...}, at: [<ffffffffa000c29b>] disc_timeout+0x1b/0xd0 [tipc]
      [   56.926212]  #2:  (rcu_read_lock){......}, at: [<ffffffffa00055e0>] tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
      [   56.926218]
      [   56.926218] stack backtrace:
      [   56.926221] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.7.0-rc1+ #160
      [   56.926222] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      [   56.926224]  0000000000000000 ffff880016803d28 ffffffff813c4423 ffff8800154252c0
      [   56.926227]  0000000000000001 ffff880016803d58 ffffffff810b7512 ffff8800124d8120
      [   56.926230]  ffff880013f8a160 ffff8800132b5ccc ffff8800124d8120 ffff880016803d88
      [   56.926234] Call Trace:
      [   56.926235]  <IRQ>  [<ffffffff813c4423>] dump_stack+0x67/0x94
      [   56.926250]  [<ffffffff810b7512>] lockdep_rcu_suspicious+0xe2/0x120
      [   56.926256]  [<ffffffffa00051f1>] tipc_l2_send_msg+0x131/0x1c0 [tipc]
      [   56.926261]  [<ffffffffa000567c>] tipc_bearer_xmit_skb+0x14c/0x2e0 [tipc]
      [   56.926266]  [<ffffffffa00055e0>] ? tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
      [   56.926273]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
      [   56.926278]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
      [   56.926283]  [<ffffffffa000c2d6>] disc_timeout+0x56/0xd0 [tipc]
      [   56.926288]  [<ffffffff810e7a68>] call_timer_fn+0xb8/0x340
      [   56.926291]  [<ffffffff810e79b5>] ? call_timer_fn+0x5/0x340
      [   56.926296]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
      [   56.926300]  [<ffffffff810e8f4a>] run_timer_softirq+0x23a/0x390
      [   56.926306]  [<ffffffff810f89ff>] ? clockevents_program_event+0x7f/0x130
      [   56.926316]  [<ffffffff819727c3>] __do_softirq+0xc3/0x4a2
      [   56.926323]  [<ffffffff8106ba5a>] irq_exit+0x8a/0xb0
      [   56.926327]  [<ffffffff81972456>] smp_apic_timer_interrupt+0x46/0x60
      [   56.926331]  [<ffffffff81970a49>] apic_timer_interrupt+0x89/0x90
      [   56.926333]  <EOI>  [<ffffffff81027fda>] ? default_idle+0x2a/0x1a0
      [   56.926340]  [<ffffffff81027fd8>] ? default_idle+0x28/0x1a0
      [   56.926342]  [<ffffffff810289cf>] arch_cpu_idle+0xf/0x20
      [   56.926345]  [<ffffffff810adf0f>] default_idle_call+0x2f/0x50
      [   56.926347]  [<ffffffff810ae145>] cpu_startup_entry+0x215/0x3e0
      [   56.926353]  [<ffffffff81040ad9>] start_secondary+0xf9/0x100
      
      The warning appears as rtnl_dereference() is wrongly used in
      tipc_l2_send_msg() under RCU read lock protection. Instead the proper
      usage should be that rcu_dereference_rtnl() is called here.
      
      Fixes: 5b7066c3 ("tipc: stricter filtering of packets in bearer layer")
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66d95b67
    • J
      tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy 提交于
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the member of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying  unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35c55c98
  17. 09 6月, 2016 2 次提交
    • J
      tipc: change node timer unit from jiffies to ms · 5ca509fc
      Jon Paul Maloy 提交于
      The node keepalive interval is recalculated at each timer expiration
      to catch any changes in the link tolerance, and stored in a field in
      struct tipc_node. We use jiffies as unit for the stored value.
      
      This is suboptimal, because it makes the calculation unnecessary
      complex, including two unit conversions. The conversions also lead to
      a rounding error that causes the link "abort limit" to be 3 in the
      normal case, instead of 4, as intended. This again leads to unnecessary
      link resets when the network is pushed close to its limit, e.g., in an
      environment with hundreds of nodes or namesapces.
      
      In this commit, we do instead let the keepalive value be calculated and
      stored in milliseconds, so that there is only one conversion and the
      rounding error is eliminated.
      
      We also remove a redundant "keepalive" field in struct tipc_link. This
      is remnant from the previous implementation.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ca509fc
    • J
      tipc: correct error in node fsm · c4282ca7
      Jon Paul Maloy 提交于
      commit 88e8ac70 ("tipc: reduce transmission rate of reset messages
      when link is down") revealed a flaw in the node FSM, as defined in
      the log of commit 66996b6c ("tipc: extend node FSM").
      
      We see the following scenario:
      1: Node B receives a RESET message from node A before its link endpoint
         is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
         event will not change the node FSM state, but the (distinct) link FSM
         will move to state RESETTING.
      2: As an effect of the previous event, the local endpoint on B will
         declare node A lost, and post the event SELF_DOWN to the its node
         FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
         that no messages will be accepted from A until it receives another
         RESET message that confirms that A's endpoint has been reset. This
         is  wasteful, since we know this as a fact already from the first
         received RESET, but worse is that the link instance's FSM has not
         wasted this information, but instead moved on to state ESTABLISHING,
         meaning that it repeatedly sends out ACTIVATE messages to the reset
         peer A.
      3: Node A will receive one of the ACTIVATE messages, move its link FSM
         to state ESTABLISHED, and start repeatedly sending out STATE messages
         to node B.
      4: Node B will consistently drop these messages, since it can only accept
         accept a RESET according to its node FSM.
      5: After four lost STATE messages node A will reset its link and start
         repeatedly sending out RESET messages to B.
      6: Because of the reduced send rate for RESET messages, it is very
         likely that A will receive an ACTIVATE (which is sent out at a much
         higher frequency) before it gets the chance to send a RESET, and A
         may hence quickly move back to state ESTABLISHED and continue sending
         out STATE messages, which will again be dropped by B.
      7: GOTO 5.
      8: After having repeated the cycle 5-7 a number of times, node A will
         by chance get in between with sending a RESET, and the situation is
         resolved.
      
      Unfortunately, we have seen that it may take a substantial amount of
      time before this vicious loop is broken, sometimes in the order of
      minutes.
      
      We correct this by making a small correction to the node FSM: When a
      node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
      moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
      SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
      need to wait for RESET confirmation from of an endpoint that we alread
      know has been reset. It also means that node B in the scenario above
      will not be dropping incoming STATE messages, and the link can come up
      immediately.
      
      Finally, a symmetry comparison reveals that the  FSM has a similar
      error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
      Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
      to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
      of this logical error, we choose fix this one, too.
      
      The node FSM looks as follows after those changes:
      
                                 +----------------------------------------+
                                 |                           PEER_DOWN_EVT|
                                 |                                        |
        +------------------------+----------------+                       |
        |SELF_DOWN_EVT           |                |                       |
        |                        |                |                       |
        |              +-----------+          +-----------+               |
        |              |NODE_      |          |NODE_      |               |
        |   +----------|FAILINGOVER|<---------|SYNCHING   |-----------+   |
        |   |SELF_     +-----------+ FAILOVER_+-----------+   PEER_   |   |
        |   |DOWN_EVT   |          A BEGIN_EVT  A         |   DOWN_EVT|   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |FAILOVER_ |FAILOVER_   |SYNCH_   |SYNCH_     |   |
        |   |           |END_EVT   |BEGIN_EVT   |BEGIN_EVT|END_EVT    |   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |         +--------------+        |           |   |
        |   |           +-------->|   SELF_UP_   |<-------+           |   |
        |   |   +-----------------|   PEER_UP    |----------------+   |   |
        |   |   |SELF_DOWN_EVT    +--------------+   PEER_DOWN_EVT|   |   |
        |   |   |                    A        A                   |   |   |
        |   |   |                    |        |                   |   |   |
        |   |   |         PEER_UP_EVT|        |SELF_UP_EVT        |   |   |
        |   |   |                    |        |                   |   |   |
        V   V   V                    |        |                   V   V   V
      +------------+       +-----------+    +-----------+       +------------+
      |SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
      |PEER_LEAVING|       |PEER_COMING|    |SELF_COMING|       |SELF_LEAVING|
      +------------+       +-----------+    +-----------+       +------------+
             |               |       A        A       |                |
             |               |       |        |       |                |
             |       SELF_   |       |SELF_   |PEER_  |PEER_           |
             |       DOWN_EVT|       |UP_EVT  |UP_EVT |DOWN_EVT        |
             |               |       |        |       |                |
             |               |       |        |       |                |
             |               |    +--------------+    |                |
             |PEER_DOWN_EVT  +--->|  SELF_DOWN_  |<---+   SELF_DOWN_EVT|
             +------------------->|  PEER_DOWN   |<--------------------+
                                  +--------------+
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4282ca7
  18. 03 6月, 2016 1 次提交
  19. 26 5月, 2016 1 次提交
  20. 20 5月, 2016 1 次提交