1. 27 5月, 2020 2 次提交
    • T
      tipc: enable broadcast retrans via unicast · a91d55d1
      Tuong Lien 提交于
      In some environment, broadcast traffic is suppressed at high rate (i.e.
      a kind of bandwidth limit setting). When it is applied, TIPC broadcast
      can still run successfully. However, when it comes to a high load, some
      packets will be dropped first and TIPC tries to retransmit them but the
      packet retransmission is intentionally broadcast too, so making things
      worse and not helpful at all.
      
      This commit enables the broadcast retransmission via unicast which only
      retransmits packets to the specific peer that has really reported a gap
      i.e. not broadcasting to all nodes in the cluster, so will prevent from
      being suppressed, and also reduce some overheads on the other peers due
      to duplicates, finally improve the overall TIPC broadcast performance.
      
      Note: the functionality can be turned on/off via the sysctl file:
      
      echo 1 > /proc/sys/net/tipc/bc_retruni
      echo 0 > /proc/sys/net/tipc/bc_retruni
      
      Default is '0', i.e. the broadcast retransmission still works as usual.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a91d55d1
    • T
      tipc: introduce Gap ACK blocks for broadcast link · d7626b5a
      Tuong Lien 提交于
      As achieved through commit 9195948f ("tipc: improve TIPC throughput
      by Gap ACK blocks"), we apply the same mechanism for the broadcast link
      as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
      consist of two parts built for both the broadcast and unicast types:
      
       31                       16 15                        0
      +-------------+-------------+-------------+-------------+
      |  bgack_cnt  |  ugack_cnt  |            len            |
      +-------------+-------------+-------------+-------------+  -
      |            gap            |            ack            |   |
      +-------------+-------------+-------------+-------------+    > bc gacks
      :                           :                           :   |
      +-------------+-------------+-------------+-------------+  -
      |            gap            |            ack            |   |
      +-------------+-------------+-------------+-------------+    > uc gacks
      :                           :                           :   |
      +-------------+-------------+-------------+-------------+  -
      
      which is "automatically" backward-compatible.
      
      We also increase the max number of Gap ACK blocks to 128, allowing upto
      64 blocks per type (total buffer size = 516 bytes).
      
      Besides, the 'tipc_link_advance_transmq()' function is refactored which
      is applicable for both the unicast and broadcast cases now, so some old
      functions can be removed and the code is optimized.
      
      With the patch, TIPC broadcast is more robust regardless of packet loss
      or disorder, latency, ... in the underlying network. Its performance is
      boost up significantly.
      For example, experiment with a 5% packet loss rate results:
      
      $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
      real    0m 42.46s
      user    0m 1.16s
      sys     0m 17.67s
      
      Without the patch:
      
      $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
      real    8m 27.94s
      user    0m 0.55s
      sys     0m 2.38s
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7626b5a
  2. 11 12月, 2019 2 次提交
    • T
      tipc: fix potential hanging after b/rcast changing · dca4a17d
      Tuong Lien 提交于
      In commit c55c8eda ("tipc: smooth change between replicast and
      broadcast"), we allow instant switching between replicast and broadcast
      by sending a dummy 'SYN' packet on the last used link to synchronize
      packets on the links. The 'SYN' message is an object of link congestion
      also, so if that happens, a 'SOCK_WAKEUP' will be scheduled to be sent
      back to the socket...
      However, in that commit, we simply use the same socket 'cong_link_cnt'
      counter for both the 'SYN' & normal payload message sending. Therefore,
      if both the replicast & broadcast links are congested, the counter will
      be not updated correctly but overwritten by the latter congestion.
      Later on, when the 'SOCK_WAKEUP' messages are processed, the counter is
      reduced one by one and eventually overflowed. Consequently, further
      activities on the socket will only wait for the false congestion signal
      to disappear but never been met.
      
      Because sending the 'SYN' message is vital for the mechanism, it should
      be done anyway. This commit fixes the issue by marking the message with
      an error code e.g. 'TIPC_ERR_NO_PORT', so its sending should not face a
      link congestion, there is no need to touch the socket 'cong_link_cnt'
      either. In addition, in the event of any error (e.g. -ENOBUFS), we will
      purge the entire payload message queue and make a return immediately.
      
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dca4a17d
    • J
      tipc: introduce variable window congestion control · 16ad3f40
      Jon Maloy 提交于
      We introduce a simple variable window congestion control for links.
      The algorithm is inspired by the Reno algorithm, covering both 'slow
      start', 'congestion avoidance', and 'fast recovery' modes.
      
      - We introduce hard lower and upper window limits per link, still
        different and configurable per bearer type.
      
      - We introduce a 'slow start theshold' variable, initially set to
        the maximum window size.
      
      - We let a link start at the minimum congestion window, i.e. in slow
        start mode, and then let is grow rapidly (+1 per rceived ACK) until
        it reaches the slow start threshold and enters congestion avoidance
        mode.
      
      - In congestion avoidance mode we increment the congestion window for
        each window-size number of acked packets, up to a possible maximum
        equal to the configured maximum window.
      
      - For each non-duplicate NACK received, we drop back to fast recovery
        mode, by setting the both the slow start threshold to and the
        congestion window to (current_congestion_window / 2).
      
      - If the timeout handler finds that the transmit queue has not moved
        since the previous timeout, it drops the link back to slow start
        and forces a probe containing the last sent sequence number to the
        sent to the peer, so that this can discover the stale situation.
      
      This change does in reality have effect only on unicast ethernet
      transport, as we have seen that there is no room whatsoever for
      increasing the window max size for the UDP bearer.
      For now, we also choose to keep the limits for the broadcast link
      unchanged and equal.
      
      This algorithm seems to give a 50-100% throughput improvement for
      messages larger than MTU.
      Suggested-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16ad3f40
  3. 23 11月, 2019 1 次提交
  4. 09 11月, 2019 1 次提交
    • T
      tipc: introduce TIPC encryption & authentication · fc1b6d6d
      Tuong Lien 提交于
      This commit offers an option to encrypt and authenticate all messaging,
      including the neighbor discovery messages. The currently most advanced
      algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
      encryption/decryption is done at the bearer layer, just before leaving
      or after entering TIPC.
      
      Supported features:
      - Encryption & authentication of all TIPC messages (header + data);
      - Two symmetric-key modes: Cluster and Per-node;
      - Automatic key switching;
      - Key-expired revoking (sequence number wrapped);
      - Lock-free encryption/decryption (RCU);
      - Asynchronous crypto, Intel AES-NI supported;
      - Multiple cipher transforms;
      - Logs & statistics;
      
      Two key modes:
      - Cluster key mode: One single key is used for both TX & RX in all
      nodes in the cluster.
      - Per-node key mode: Each nodes in the cluster has one specific TX key.
      For RX, a node requires its peers' TX key to be able to decrypt the
      messages from those peers.
      
      Key setting from user-space is performed via netlink by a user program
      (e.g. the iproute2 'tipc' tool).
      
      Internal key state machine:
      
                                       Attach    Align(RX)
                                           +-+   +-+
                                           | V   | V
              +---------+      Attach     +---------+
              |  IDLE   |---------------->| PENDING |(user = 0)
              +---------+                 +---------+
                 A   A                   Switch|  A
                 |   |                         |  |
                 |   | Free(switch/revoked)    |  |
           (Free)|   +----------------------+  |  |Timeout
                 |              (TX)        |  |  |(RX)
                 |                          |  |  |
                 |                          |  v  |
              +---------+      Switch     +---------+
              | PASSIVE |<----------------| ACTIVE  |
              +---------+       (RX)      +---------+
              (user = 1)                  (user >= 1)
      
      The number of TFMs is 10 by default and can be changed via the procfs
      'net/tipc/max_tfms'. At this moment, as for simplicity, this file is
      also used to print the crypto statistics at runtime:
      
      echo 0xfff1 > /proc/sys/net/tipc/max_tfms
      
      The patch defines a new TIPC version (v7) for the encryption message (-
      backward compatibility as well). The message is basically encapsulated
      as follows:
      
         +----------------------------------------------------------+
         | TIPCv7 encryption  | Original TIPCv2    | Authentication |
         | header             | packet (encrypted) | Tag            |
         +----------------------------------------------------------+
      
      The throughput is about ~40% for small messages (compared with non-
      encryption) and ~9% for large messages. With the support from hardware
      crypto i.e. the Intel AES-NI CPU instructions, the throughput increases
      upto ~85% for small messages and ~55% for large messages.
      
      By default, the new feature is inactive (i.e. no encryption) until user
      sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
      in the kernel configuration to enable/disable the new code when needed.
      
      MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc1b6d6d
  5. 19 8月, 2019 1 次提交
    • J
      tipc: clean up skb list lock handling on send path · e654f9f5
      Jon Maloy 提交于
      The policy for handling the skb list locks on the send and receive paths
      is simple.
      
      - On the send path we never need to grab the lock on the 'xmitq' list
        when the destination is an exernal node.
      
      - On the receive path we always need to grab the lock on the 'inputq'
        list, irrespective of source node.
      
      However, when transmitting node local messages those will eventually
      end up on the receive path of a local socket, meaning that the argument
      'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in  the
      function tipc_sk_rcv(). This has been handled by always initializing
      the spinlock of the 'xmitq' list at message creation, just in case it
      may end up on the receive path later, and despite knowing that the lock
      in most cases never will be used.
      
      This approach is inaccurate and confusing, and has also concealed the
      fact that the stated 'no lock grabbing' policy for the send path is
      violated in some cases.
      
      We now clean up this by never initializing the lock at message creation,
      instead doing this at the moment we find that the message actually will
      enter the receive path. At the same time we fix the four locations
      where we incorrectly access the spinlock on the send/error path.
      
      This patch also reverts commit d12cffe9 ("tipc: ensure head->lock
      is initialised") which has now become redundant.
      
      CC: Eric Dumazet <edumazet@google.com>
      Reported-by: NChris Packham <chris.packham@alliedtelesis.co.nz>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e654f9f5
  6. 09 8月, 2019 1 次提交
    • J
      tipc: add loopback device tracking · 6c9081a3
      John Rutherford 提交于
      Since node internal messages are passed directly to the socket, it is not
      possible to observe those messages via tcpdump or wireshark.
      
      We now remedy this by making it possible to clone such messages and send
      the clones to the loopback interface.  The clones are dropped at reception
      and have no functional role except making the traffic visible.
      
      The feature is enabled if network taps are active for the loopback device.
      pcap filtering restrictions require the messages to be presented to the
      receiving side of the loopback device.
      
      v3 - Function dev_nit_active used to check for network taps.
         - Procedure netif_rx_ni used to send cloned messages to loopback device.
      Signed-off-by: NJohn Rutherford <john.rutherford@dektech.com.au>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c9081a3
  7. 26 6月, 2019 1 次提交
  8. 05 4月, 2019 1 次提交
  9. 27 3月, 2019 1 次提交
  10. 22 3月, 2019 1 次提交
    • H
      tipc: fix a null pointer deref · 08e046c8
      Hoang Le 提交于
      In commit c55c8eda ("tipc: smooth change between replicast and
      broadcast") we introduced new method to eliminate the risk of message
      reordering that happen in between different nodes.
      Unfortunately, we forgot checking at receiving side to ignore intra node.
      
      We fix this by checking and returning if arrived message from intra node.
      
      syzbot report:
      
      ==================================================================
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS Google 01/01/2011
      RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
      Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
       24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
       <80> 3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
      RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
      RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
      RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
      RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
      R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
      R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
      FS:  000000000106a880(0000) GS:ffff8880ae800000(0000)
       knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
       tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
       tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
       tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
       tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
       tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
       __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
       tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:661
       ___sys_sendmsg+0x806/0x930 net/socket.c:2260
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2298
       __do_sys_sendmsg net/socket.c:2307 [inline]
       __se_sys_sendmsg net/socket.c:2305 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401c9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
       48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
       <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
      RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
      R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
      Modules linked in:
      ---[ end trace ba79875754e1708f ]---
      
      Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      08e046c8
  11. 20 3月, 2019 2 次提交
    • H
      tipc: smooth change between replicast and broadcast · c55c8eda
      Hoang Le 提交于
      Currently, a multicast stream may start out using replicast, because
      there are few destinations, and then it should ideally switch to
      L2/broadcast IGMP/multicast when the number of destinations grows beyond
      a certain limit. The opposite should happen when the number decreases
      below the limit.
      
      To eliminate the risk of message reordering caused by method change,
      a sending socket must stick to a previously selected method until it
      enters an idle period of 5 seconds. Means there is a 5 seconds pause
      in the traffic from the sender socket.
      
      If the sender never makes such a pause, the method will never change,
      and transmission may become very inefficient as the cluster grows.
      
      With this commit, we allow such a switch between replicast and
      broadcast without any need for a traffic pause.
      
      Solution is to send a dummy message with only the header, also with
      the SYN bit set, via broadcast or replicast. For the data message,
      the SYN bit is set and sending via replicast or broadcast (inverse
      method with dummy).
      
      Then, at receiving side any messages follow first SYN bit message
      (data or dummy message), they will be held in deferred queue until
      another pair (dummy or data message) arrived in other link.
      
      v2: reverse christmas tree declaration
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c55c8eda
    • H
      tipc: support broadcast/replicast configurable for bc-link · 02ec6caf
      Hoang Le 提交于
      Currently, a multicast stream uses either broadcast or replicast as
      transmission method, based on the ratio between number of actual
      destinations nodes and cluster size.
      
      However, when an L2 interface (e.g., VXLAN) provides pseudo
      broadcast support, this becomes very inefficient, as it blindly
      replicates multicast packets to all cluster/subnet nodes,
      irrespective of whether they host actual target sockets or not.
      
      The TIPC multicast algorithm is able to distinguish real destination
      nodes from other nodes, and hence provides a smarter and more
      efficient method for transferring multicast messages than
      pseudo broadcast can do.
      
      Because of this, we now make it possible for users to force
      the broadcast link to permanently switch to using replicast,
      irrespective of which capabilities the bearer provides,
      or pretend to provide.
      Conversely, we also make it possible to force the broadcast link
      to always use true broadcast. While maybe less useful in
      deployed systems, this may at least be useful for testing the
      broadcast algorithm in small clusters.
      
      We retain the current AUTOSELECT ability, i.e., to let the broadcast link
      automatically select which algorithm to use, and to switch back and forth
      between broadcast and replicast as the ratio between destination
      node number and cluster size changes. This remains the default method.
      
      Furthermore, we make it possible to configure the threshold ratio for
      such switches. The default ratio is now set to 10%, down from 25% in the
      earlier implementation.
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02ec6caf
  12. 04 9月, 2018 1 次提交
  13. 28 7月, 2018 1 次提交
  14. 08 3月, 2018 1 次提交
  15. 02 12月, 2017 1 次提交
    • J
      tipc: fall back to smaller MTU if allocation of local send skb fails · 4c94cc2d
      Jon Maloy 提交于
      When sending node local messages the code is using an 'mtu' of 66060
      bytes to avoid unnecessary fragmentation. During situations of low
      memory tipc_msg_build() may sometimes fail to allocate such large
      buffers, resulting in unnecessary send failures. This can easily be
      remedied by falling back to a smaller MTU, and then reassemble the
      buffer chain as if the message were arriving from a remote node.
      
      At the same time, we change the initial MTU setting of the broadcast
      link to a lower value, so that large messages always are fragmented
      into smaller buffers even when we run in single node mode. Apart from
      obtaining the same advantage as for the 'fallback' solution above, this
      turns out to give a significant performance improvement. This can
      probably be explained with the __pskb_copy() operation performed on the
      buffer for each recipient during reception. We found the optimal value
      for this, considering the most relevant skb pool, to be 3744 bytes.
      Acked-by: NYing Xue <ying.xue@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c94cc2d
  16. 13 10月, 2017 1 次提交
  17. 09 10月, 2017 1 次提交
  18. 21 1月, 2017 4 次提交
  19. 04 1月, 2017 1 次提交
    • J
      tipc: reduce risk of user starvation during link congestion · 365ad353
      Jon Paul Maloy 提交于
      The socket code currently handles link congestion by either blocking
      and trying to send again when the congestion has abated, or just
      returning to the user with -EAGAIN and let him re-try later.
      
      This mechanism is prone to starvation, because the wakeup algorithm is
      non-atomic. During the time the link issues a wakeup signal, until the
      socket wakes up and re-attempts sending, other senders may have come
      in between and occupied the free buffer space in the link. This in turn
      may lead to a socket having to make many send attempts before it is
      successful. In extremely loaded systems we have observed latency times
      of several seconds before a low-priority socket is able to send out a
      message.
      
      In this commit, we simplify this mechanism and reduce the risk of the
      described scenario happening. When a message is attempted sent via a
      congested link, we now let it be added to the link's backlog queue
      anyway, thus permitting an oversubscription of one message per source
      socket. We still create a wakeup item and return an error code, hence
      instructing the sender to block or stop sending. Only when enough space
      has been freed up in the link's backlog queue do we issue a wakeup event
      that allows the sender to continue with the next message, if any.
      
      The fact that a socket now can consider a message sent even when the
      link returns a congestion code means that the sending socket code can
      be simplified. Also, since this is a good opportunity to get rid of the
      obsolete 'mtu change' condition in the three socket send functions, we
      now choose to refactor those functions completely.
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      365ad353
  20. 30 10月, 2016 1 次提交
    • J
      tipc: fix broadcast link synchronization problem · 06bd2b1e
      Jon Paul Maloy 提交于
      In commit 2d18ac4b ("tipc: extend broadcast link initialization
      criteria") we tried to fix a problem with the initial synchronization
      of broadcast link acknowledge values. Unfortunately that solution is
      not sufficient to solve the issue.
      
      We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
      non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
      initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
      broadcast acknowledge numbers, leading to premature opening of the
      broadcast link. When the bypassed packets finally arrive, they are
      inadvertently accepted, and the already correctly initialized
      acknowledge number in the broadcast receive link is overwritten by
      the invalid (zero) value of the said packets. After this the broadcast
      link goes stale.
      
      We now fix this by marking the packets where we know the acknowledge
      value is or may be invalid, and then ignoring the acks from those.
      
      To this purpose, we claim an unused bit in the header to indicate that
      the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
      synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
      packets, plus those LINK_PROTOCOL packets sent out before the broadcast
      links are fully synchronized.
      
      This minor protocol update is fully backwards compatible.
      Reported-by: NJohn Thompson <thompa.atl@gmail.com>
      Tested-by: NJohn Thompson <thompa.atl@gmail.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06bd2b1e
  21. 03 9月, 2016 1 次提交
    • J
      tipc: transfer broadcast nacks in link state messages · 02d11ca2
      Jon Paul Maloy 提交于
      When we send broadcasts in clusters of more 70-80 nodes, we sometimes
      see the broadcast link resetting because of an excessive number of
      retransmissions. This is caused by a combination of two factors:
      
      1) A 'NACK crunch", where loss of broadcast packets is discovered
         and NACK'ed by several nodes simultaneously, leading to multiple
         redundant broadcast retransmissions.
      
      2) The fact that the NACKS as such also are sent as broadcast, leading
         to excessive load and packet loss on the transmitting switch/bridge.
      
      This commit deals with the latter problem, by moving sending of
      broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
      to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
      bits in word 8 of the said message for this purpose, and introduce a
      new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
      backwards compatible.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02d11ca2
  22. 07 3月, 2016 1 次提交
  23. 21 11月, 2015 1 次提交
  24. 24 10月, 2015 11 次提交
    • J
      tipc: clean up unused code and structures · 2af5ae37
      Jon Paul Maloy 提交于
      After the previous changes in this series, we can now remove some
      unused code and structures, both in the broadcast, link aggregation
      and link code.
      
      There are no functional changes in this commit.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2af5ae37
    • J
      tipc: eliminate link's reference to owner node · c72fa872
      Jon Paul Maloy 提交于
      With the recent commit series, we have established a one-way dependency
      between the link aggregation (struct tipc_node) instances and their
      pertaining tipc_link instances. This has enabled quite significant code
      and structure simplifications.
      
      In this commit, we eliminate the field 'owner', which points to an
      instance of struct tipc_node, from struct tipc_link, and replace it with
      a pointer to struct net, which is the only external reference now needed
      by a link instance.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c72fa872
    • J
      tipc: introduce jumbo frame support for broadcast · 959e1781
      Jon Paul Maloy 提交于
      Until now, we have only been supporting a fix MTU size of 1500 bytes
      for all broadcast media, irrespective of their actual capability.
      
      We now make the broadcast MTU adaptable to the carrying media, i.e.,
      we use the smallest MTU supported by any of the interfaces attached
      to TIPC.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      959e1781
    • J
      tipc: simplify bearer level broadcast · b06b281e
      Jon Paul Maloy 提交于
      Until now, we have been keeping track of the exact set of broadcast
      destinations though the help structure tipc_node_map. This leads us to
      have to maintain a whole infrastructure for supporting this, including
      a pseudo-bearer and a number of functions to manipulate both the bearers
      and the node map correctly. Apart from the complexity, this approach is
      also limiting, as struct tipc_node_map only can support cluster local
      broadcast if we want to avoid it becoming excessively large. We want to
      eliminate this limitation, in order to enable introduction of scoped
      multicast in the future.
      
      A closer analysis reveals that it is unnecessary maintaining this "full
      set" overview; it is sufficient to keep a counter per bearer, indicating
      how many nodes can be reached via this bearer at the moment. The protocol
      is now robust enough to handle transitional discrepancies between the
      nominal number of reachable destinations, as expected by the broadcast
      protocol itself, and the number which is actually reachable at the
      moment. The initial broadcast synchronization, in conjunction with the
      retransmission mechanism, ensures that all packets will eventually be
      acknowledged by the correct set of destinations.
      
      This commit introduces these changes.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b06b281e
    • J
      tipc: let broadcast packet reception use new link receive function · 52666986
      Jon Paul Maloy 提交于
      The code path for receiving broadcast packets is currently distinct
      from the unicast path. This leads to unnecessary code and data
      duplication, something that can be avoided with some effort.
      
      We now introduce separate per-peer tipc_link instances for handling
      broadcast packet reception. Each receive link keeps a pointer to the
      common, single, broadcast link instance, and can hence handle release
      and retransmission of send buffers as if they belonged to the own
      instance.
      
      Furthermore, we let each unicast link instance keep a reference to both
      the pertaining broadcast receive link, and to the common send link.
      This makes it possible for the unicast links to easily access data for
      broadcast link synchronization, as well as for carrying acknowledges for
      received broadcast packets.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52666986
    • J
      tipc: introduce capability bit for broadcast synchronization · fd556f20
      Jon Paul Maloy 提交于
      Until now, we have tried to support both the newer, dedicated broadcast
      synchronization mechanism along with the older, less safe, RESET_MSG/
      ACTIVATE_MSG based one. The latter method has turned out to be a hazard
      in a highly dynamic cluster, so we find it safer to disable it completely
      when we find that the former mechanism is supported by the peer node.
      
      For this purpose, we now introduce a new capabability bit,
      TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
      syncronization is supported by the present node. The new bit is conveyed
      between peers in the 'capabilities' field of neighbor discovery messages.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd556f20
    • J
      tipc: let broadcast transmission use new link transmit function · 2f566124
      Jon Paul Maloy 提交于
      This commit simplifies the broadcast link transmission function, by
      leveraging previous changes to the link transmission function and the
      broadcast transmission link life cycle.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f566124
    • J
      tipc: use explicit allocation of broadcast send link · 32301906
      Jon Paul Maloy 提交于
      The broadcast link instance (struct tipc_link) used for sending is
      currently aggregated into struct tipc_bclink. This means that we cannot
      use the regular tipc_link_create() function for initiating the link, but
      do instead have to initiate numerous fields directly from the
      bcast_init() function.
      
      We want to reduce dependencies between the broadcast functionality
      and the inner workings of tipc_link. In this commit, we introduce
      a new function tipc_bclink_create() to link.c, and allocate the
      instance of the link separately using this function.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32301906
    • J
      tipc: create broadcast transmission link at namespace init · 5fd9fd63
      Jon Paul Maloy 提交于
      The broadcast transmission link is currently instantiated when the
      network subsystem is started, i.e., on order from user space via netlink.
      
      This forces the broadcast transmission code to do unnecessary tests for
      the existence of the transmission link, as well in single mode node as
      in network mode.
      
      In this commit, we do instead create the link during initialization of
      the name space, and remove it when it is stopped. The fact that the
      transmission link now has a guaranteed longer life cycle than any of its
      potential clients paves the way for further code simplifcations
      and optimizations.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5fd9fd63
    • J
      tipc: move broadcast link lock to struct tipc_net · 0043550b
      Jon Paul Maloy 提交于
      The broadcast lock will need to be acquired outside bcast.c in a later
      commit. For this reason, we move the lock to struct tipc_net. Consistent
      with the changes in the previous commit, we also introducee two new
      functions tipc_bcast_lock() and tipc_bcast_unlock(). The code that is
      currently using tipc_bclink_lock()/unlock() will be phased out during
      the coming commits in this series.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0043550b
    • J
      tipc: move bcast definitions to bcast.c · 6beb19a6
      Jon Paul Maloy 提交于
      Currently, a number of structure and function definitions related
      to the broadcast functionality are unnecessarily exposed in the file
      bcast.h. This obscures the fact that the external interface towards
      the broadcast link in fact is very narrow, and causes unnecessary
      recompilations of other files when anything changes in those
      definitions.
      
      In this commit, we move as many of those definitions as is currently
      possible to the file bcast.c.
      
      We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
      since the name does not correctly describe the contents of this
      struct, and will do so even less in the future, and because we want
      to use the term 'link' more appropriately in the functionality
      introduced later in this series.
      
      Finally, we rename a couple of functions, such as tipc_bclink_xmit()
      and others that will be kept in the future, to include the term 'bcast'
      instead.
      
      There are no functional changes in this commit.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6beb19a6