1. 14 Jul, 2020 1 commit
  2. 09 Jul, 2020 1 commit
    • H
      tipc: fix retransmission on unicast links · a34f8291
      Hamish Martin committed
      A scenario has been observed where a 'bc_init' message for a link is not
      retransmitted if it fails to be received by the peer. This leads to the
      peer never establishing the link fully and it discarding all other data
      received on the link. In this scenario the message is lost in transit to
      the peer.
      
      The issue is traced to the 'nxt_retr' field of the skb not being
      initialised for links that aren't a bc_sndlink. This leads to the
      comparison in tipc_link_advance_transmq() that gates whether to attempt
      retransmission of a message performing in an undesirable way.
      Depending on the relative value of 'jiffies', this comparison:
          time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr)
      may return true or false given that 'nxt_retr' remains at the
      uninitialised value of 0 for non bc_sndlinks.
      
      This is most noticeable shortly after boot when jiffies is initialised
      to a high value (to flush out rollover bugs) and we compare a jiffies of,
      say, 4294940189 to zero. In that case time_before returns 'true' leading
      to the skb not being retransmitted.
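The wraparound effect described above can be reproduced with a small sketch. It assumes a 32-bit tick counter (the kernel's own time_before() operates on unsigned long), and the names `time_before32_sketch` and `retransmit_is_gated` are illustrative, not taken from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is before b" on a 32-bit tick counter (assumption:
 * 32-bit jiffies). The unsigned difference is reinterpreted as signed, so
 * two values more than 2^31 ticks apart compare "backwards". */
static bool time_before32_sketch(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical gate: retransmission is skipped while 'now' is still
 * before the skb's next-retransmit time. */
static bool retransmit_is_gated(uint32_t now, uint32_t nxt_retr)
{
	return time_before32_sketch(now, nxt_retr);
}
```

With a counter value of 4294940189 and an uninitialised 'nxt_retr' of 0, the gate reports "too early", which is the stalled case from the message; once the counter has wrapped to a small value, the same zero no longer blocks retransmission.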
      
      The fix is to ensure that all skbs have a valid 'nxt_retr' time set for
      them and this is achieved by refactoring the setting of this value into
      a central function.
      With this fix, transmission losses of 'bc_init' messages do not stall
      the link establishment forever because the 'bc_init' message is
      retransmitted and the link eventually establishes correctly.
      
      Fixes: 382f598f ("tipc: reduce duplicate packets for unicast traffic")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 20 Jun, 2020 1 commit
  4. 17 Jun, 2020 1 commit
    • H
      tipc: update a binding service via broadcast · cad2929d
      Hoang Huu Le committed
      Currently, binding table updates (adding a service binding to the
      name table or withdrawing a service binding) are sent over replicast.
      However, when scaling clusters up to > 100 nodes/containers this
      method is less efficient because it loops through the nodes in a
      cluster one by one.
      
      It is worthwhile to use broadcast to update a binding service. This way,
      the binding table can be updated on all peer nodes in one shot.
      
      Broadcast is used when all peer nodes, as indicated by a new capability
      flag TIPC_NAMED_BCAST, support reception of this message type.
      
      Four problems need to be considered when introducing this feature.
      1) When establishing a link to a new peer node we still update this by a
      unicast 'bulk' update. This may lead to race conditions, where a later
      broadcast publication/withdrawal bypasses the 'bulk', resulting in
      disordered publications, or even a withdrawal arriving before the
      corresponding publication. We solve this by adding an 'is_last_bulk' bit
      in the last bulk message so that it can be distinguished from all other
      messages. Only when this message has arrived do we open up for reception
      of broadcast publications/withdrawals.
      
      2) When a first legacy node is added to the cluster all distribution
      will switch over to use the legacy 'replicast' method, while the
      opposite happens when the last legacy node leaves the cluster. This
      entails another risk of message disordering that has to be handled. We
      solve this by adding a sequence number to the broadcast/replicast
      messages, so that disordering can be discovered and corrected. Note
      however that we don't need to consider potential message loss or
      duplication at this protocol level.
      
      3) Bulk messages don't contain any sequence numbers, and will always
      arrive in order. Hence we must exempt those from the sequence number
      control and deliver them unconditionally. We solve this by adding a new
      'is_bulk' bit in those messages so that they can be recognized.
      
      4) Legacy messages, which don't contain any new bits or sequence
      numbers, but neither can arrive out of order, also need to be exempt
      from the initial synchronization and sequence number check, and
      delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
      to all new messages so that those can be distinguished from legacy
      messages and the latter delivered directly.
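The four rules above can be read as a single receive-side decision. The following is only a sketch under stated assumptions: the struct layouts, the names `named_peer_state`, `named_msg` and `named_classify`, the 16-bit sequence number width, and the choice to hold back messages that are ahead of sequence are all hypothetical. It also assumes the last bulk message carries both the 'is_bulk' and 'is_last_bulk' bits.

```c
#include <stdbool.h>
#include <stdint.h>

enum named_verdict { NAMED_DELIVER, NAMED_DEFER, NAMED_DROP };

/* Hypothetical per-peer receive state. */
struct named_peer_state {
	bool bulk_done;   /* set once the 'is_last_bulk' message has arrived */
	uint16_t rcv_nxt; /* next expected broadcast/replicast sequence number */
};

/* Hypothetical view of the bits described in points 1) to 4). */
struct named_msg {
	bool is_not_legacy;
	bool is_bulk;
	bool is_last_bulk;
	uint16_t seqno;
};

static enum named_verdict named_classify(struct named_peer_state *p,
					 const struct named_msg *m)
{
	if (!m->is_not_legacy)
		return NAMED_DELIVER;        /* 4) legacy: deliver directly */
	if (m->is_bulk) {
		if (m->is_last_bulk)
			p->bulk_done = true; /* 1) open up for broadcast updates */
		return NAMED_DELIVER;        /* 3) bulk: exempt from seqno check */
	}
	if (!p->bulk_done)
		return NAMED_DEFER;          /* 1) wait for the last bulk message */
	if (m->seqno == p->rcv_nxt) {
		p->rcv_nxt++;
		return NAMED_DELIVER;        /* 2) in sequence */
	}
	/* 2) out of order: older than expected is dropped, newer is held back */
	return (int16_t)(uint16_t)(m->seqno - p->rcv_nxt) < 0 ? NAMED_DROP
							      : NAMED_DEFER;
}
```

A caller would keep deferred messages in a queue and re-run the classification whenever `rcv_nxt` advances or `bulk_done` becomes true.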
      
      v1->v2:
       - fix warning issue reported by kbuild test robot <lkp@intel.com>
       - add sanity check to drop the publication message with a sequence
      number that is lower than the agreed synch point
      Signed-off-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 27 May, 2020 4 commits
    • T
      tipc: add support for broadcast rcv stats dumping · 03b6fefd
      Tuong Lien committed
      This commit enables dumping the statistics of a broadcast-receiver link
      like the traditional 'broadcast-link' one (which is for broadcast-
      sender). The link dumping can be triggered via netlink (e.g. the
      iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
      indicator.
      
      The name of a broadcast-receiver link of a specific peer will be in the
      format: 'broadcast-link:<peer-id>'.
      
      For example:
      
      Link <broadcast-link:1001002>
        Window:50 packets
        RX packets:7841 fragments:2408/440 bundles:0/0
        TX packets:0 fragments:0/0 bundles:0/0
        RX naks:0 defs:124 dups:0
        TX naks:21 acks:0 retrans:0
        Congestion link:0  Send queue max:0 avg:0
      
      In addition, the broadcast-receiver link statistics can be reset in the
      usual way via netlink by specifying that link name in the command.
      
      Note: 'tipc_link_name_ext()' is removed because the link name can
      now be retrieved simply via 'l->name'.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      tipc: enable broadcast retrans via unicast · a91d55d1
      Tuong Lien committed
      In some environments, broadcast traffic is suppressed at high rates
      (i.e. a kind of bandwidth limit setting). When this is applied, TIPC
      broadcast can still run successfully. However, under high load some
      packets will be dropped first and TIPC tries to retransmit them, but the
      packet retransmission is intentionally broadcast too, which makes things
      worse and is not helpful at all.
      
      This commit enables broadcast retransmission via unicast, which only
      retransmits packets to the specific peer that has actually reported a
      gap, i.e. without broadcasting to all nodes in the cluster. This prevents
      the retransmissions from being suppressed, reduces overhead on the other
      peers due to duplicates, and improves the overall TIPC broadcast
      performance.
      
      Note: the functionality can be turned on/off via the sysctl file:
      
      echo 1 > /proc/sys/net/tipc/bc_retruni
      echo 0 > /proc/sys/net/tipc/bc_retruni
      
      Default is '0', i.e. the broadcast retransmission still works as usual.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      tipc: add back link trace events · c6ed7a5c
      Tuong Lien committed
      In the previous commit ("tipc: add Gap ACK blocks support for broadcast
      link"), we have removed the following link trace events due to the code
      changes:
      
      - tipc_link_bc_ack
      - tipc_link_retrans
      
      This commit adds them back along with some minor changes to adapt to
      the new code.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      tipc: introduce Gap ACK blocks for broadcast link · d7626b5a
      Tuong Lien committed
      As achieved through commit 9195948f ("tipc: improve TIPC throughput
      by Gap ACK blocks"), we apply the same mechanism for the broadcast link
      as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
      consist of two parts built for both the broadcast and unicast types:
      
       31                       16 15                        0
      +-------------+-------------+-------------+-------------+
      |  bgack_cnt  |  ugack_cnt  |            len            |
      +-------------+-------------+-------------+-------------+  -
      |            gap            |            ack            |   |
      +-------------+-------------+-------------+-------------+    > bc gacks
      :                           :                           :   |
      +-------------+-------------+-------------+-------------+  -
      |            gap            |            ack            |   |
      +-------------+-------------+-------------+-------------+    > uc gacks
      :                           :                           :   |
      +-------------+-------------+-------------+-------------+  -
      
      which is "automatically" backward-compatible.
      
      We also increase the max number of Gap ACK blocks to 128, allowing up to
      64 blocks per type (total buffer size = 516 bytes).
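The 516-byte figure follows from the layout in the diagram: one 32-bit header row plus 128 rows of 32 bits each. A minimal sketch, assuming plain host-order fields: the struct and function names here are illustrative, and the in-memory field order within each 32-bit row is an assumption (on the wire the fields would be big-endian).

```c
#include <stddef.h>
#include <stdint.h>

#define GACK_MAX_BLOCKS 128 /* up to 64 broadcast + 64 unicast blocks */

/* One 32-bit row of the diagram: | gap | ack | */
struct gap_ack {
	uint16_t ack;
	uint16_t gap;
};

/* First row of the diagram: | bgack_cnt | ugack_cnt | len | */
struct gap_ack_blks {
	uint16_t len;
	uint8_t ugack_cnt;
	uint8_t bgack_cnt;
	struct gap_ack gacks[GACK_MAX_BLOCKS];
};

/* Size in bytes of a record carrying n blocks in total (bc + uc). */
static size_t gap_ack_blks_size(unsigned int n)
{
	return offsetof(struct gap_ack_blks, gacks) + n * sizeof(struct gap_ack);
}
```

With all 128 blocks present this gives 4 + 128 * 4 = 516 bytes, matching the total buffer size quoted above.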
      
      In addition, the 'tipc_link_advance_transmq()' function is refactored so
      that it applies to both the unicast and broadcast cases, so some old
      functions can be removed and the code is optimized.
      
      With the patch, TIPC broadcast is more robust regardless of packet loss
      or disorder, latency, ... in the underlying network. Its performance is
      boosted significantly.
      For example, an experiment with a 5% packet loss rate gives:
      
      $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
      real    0m 42.46s
      user    0m 1.16s
      sys     0m 17.67s
      
      Without the patch:
      
      $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
      real    8m 27.94s
      user    0m 0.55s
      sys     0m 2.38s
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 16 Apr, 2020 1 commit
    • T
      tipc: fix incorrect increasing of link window · edadedf1
      Tuong Lien committed
      In commit 16ad3f40 ("tipc: introduce variable window congestion
      control"), we allowed the link window to change with the congestion
      avoidance algorithm. However, there is a bug: if packet retransmission
      occurs during slow start, the link enters the fast-recovery phase and
      sets its window to the 'ssthresh', which is never less than 300, so the
      link window suddenly increases to that limit instead of decreasing.
      
      Consequently, two issues have been observed:
      
      - For broadcast-link: it can leave a gap between the link queues, so that
      a new packet is inserted and sent before the previous ones, i.e. not
      in-order.
      
      - For unicast: the algorithm does not work as expected; the link window
      jumps to the slow-start threshold when packet retransmission occurs.
      
      This commit fixes the issues by avoiding such a link window increase,
      while still decreasing the window if the 'ssthresh' is lowered.
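The intended behaviour can be sketched as "lower the threshold, then only ever shrink the window". This is an illustration, not the kernel code: `win_state`, `on_retransmit` and `SSTHRESH_FLOOR` are hypothetical names, and deriving the threshold as half the window with a floor of 300 is an assumption combining this message with the description of commit 16ad3f40.

```c
#include <stdint.h>

#define SSTHRESH_FLOOR 300 /* "never less than 300" per the message above */

/* Hypothetical per-link window state. */
struct win_state {
	uint16_t window;
	uint16_t ssthresh;
};

/* React to a packet retransmission (fast recovery). */
static void on_retransmit(struct win_state *w)
{
	uint16_t half = w->window / 2;

	w->ssthresh = half > SSTHRESH_FLOOR ? half : SSTHRESH_FLOOR;
	/* Only ever shrink: a small slow-start window must not jump up
	 * to the threshold. */
	if (w->window > w->ssthresh)
		w->window = w->ssthresh;
}
```

A link still in slow start with a window of 50 keeps that window after a retransmission, whereas a link with a window of 800 drops to 400.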
      
      Fixes: 16ad3f40 ("tipc: introduce variable window congestion control")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 18 Dec, 2019 1 commit
    • J
      tipc: don't send gap blocks in ACK messages · b7ffa045
      Jon Maloy committed
      In the commit referred to below we eliminated sending of the 'gap'
      indicator in regular ACK messages, reserving this for explicit NACK
      messages.
      
      Unfortunately we failed to also eliminate building of the 'gap block'
      area in ACK messages. This area is meant to report gaps in the
      received packet sequence following the initial gap, so that lost
      packets can be retransmitted earlier and received out-of-sequence
      packets can be released earlier. However, the interpretation of those
      blocks is dependent on a complete and correct sequence of gaps and
      acks. Hence, when the initial gap indicator is missing a single gap
      block will be interpreted as an acknowledgment of all preceding
      packets. This may lead to packets being released prematurely from the
      sender's transmit queue, with easily predictable consequences.
      
      We now fix this by not building any gap block area if there is no
      initial gap to report.
      
      Fixes: commit 02288248 ("tipc: eliminate gap indicator from ACK messages")
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 11 Dec, 2019 3 commits
    • J
      tipc: introduce variable window congestion control · 16ad3f40
      Jon Maloy committed
      We introduce a simple variable window congestion control for links.
      The algorithm is inspired by the Reno algorithm, covering the 'slow
      start', 'congestion avoidance', and 'fast recovery' modes.
      
      - We introduce hard lower and upper window limits per link, still
        different and configurable per bearer type.
      
      - We introduce a 'slow start threshold' variable, initially set to
        the maximum window size.
      
      - We let a link start at the minimum congestion window, i.e. in slow
        start mode, and then let it grow rapidly (+1 per received ACK) until
        it reaches the slow start threshold and enters congestion avoidance
        mode.
      
      - In congestion avoidance mode we increment the congestion window for
        each window-size number of acked packets, up to a possible maximum
        equal to the configured maximum window.
      
      - For each non-duplicate NACK received, we drop back to fast recovery
        mode, by setting both the slow start threshold and the
        congestion window to (current_congestion_window / 2).
      
      - If the timeout handler finds that the transmit queue has not moved
        since the previous timeout, it drops the link back to slow start
        and forces a probe containing the last sent sequence number to be
        sent to the peer, so that this can discover the stale situation.
      
      This change does in reality have effect only on unicast ethernet
      transport, as we have seen that there is no room whatsoever for
      increasing the window max size for the UDP bearer.
      For now, we also choose to keep the limits for the broadcast link
      unchanged and equal.
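The modes listed above can be sketched as a small state update. This is a simplified model under stated assumptions, not the kernel implementation: the names `cc_link`, `cc_init`, `cc_on_ack`, `cc_on_nack` and `cc_on_stale_timeout` are hypothetical, clamping the halved window to the lower limit is an assumption, and the probe sent by the timeout handler is left out.

```c
#include <stdint.h>

/* Hypothetical per-link congestion state. */
struct cc_link {
	uint16_t min_win, max_win; /* hard lower and upper window limits */
	uint16_t ssthresh;         /* slow start threshold */
	uint16_t window;           /* current congestion window */
	uint16_t acked_in_ca;      /* acked packets counted in congestion avoidance */
};

static void cc_init(struct cc_link *l, uint16_t min_win, uint16_t max_win)
{
	l->min_win = min_win;
	l->max_win = max_win;
	l->ssthresh = max_win; /* initially the maximum window size */
	l->window = min_win;   /* start in slow start */
	l->acked_in_ca = 0;
}

/* Called per received ACK that acknowledges 'acked_pkts' packets. */
static void cc_on_ack(struct cc_link *l, uint16_t acked_pkts)
{
	if (l->window < l->ssthresh) {
		l->window++;               /* slow start: +1 per received ACK */
		return;
	}
	l->acked_in_ca += acked_pkts;      /* congestion avoidance */
	if (l->acked_in_ca >= l->window) {
		l->acked_in_ca = 0;
		if (l->window < l->max_win)
			l->window++;       /* +1 per window-size of acked packets */
	}
}

/* Called per non-duplicate NACK: fast recovery. */
static void cc_on_nack(struct cc_link *l)
{
	uint16_t half = l->window / 2;

	if (half < l->min_win)
		half = l->min_win;         /* assumption: respect the lower limit */
	l->ssthresh = half;
	l->window = half;
	l->acked_in_ca = 0;
}

/* Called when the transmit queue has not moved since the last timeout. */
static void cc_on_stale_timeout(struct cc_link *l)
{
	l->window = l->min_win;            /* back to slow start */
	l->acked_in_ca = 0;
}
```

With limits of 50 and 500, for example, the window climbs from 50 to 500 over 450 ACKs, halves to 250 on a NACK, and then grows by one for every 250 acknowledged packets.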
      
      This algorithm seems to give a 50-100% throughput improvement for
      messages larger than MTU.
      Suggested-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      tipc: eliminate more unnecessary nacks and retransmissions · d3b09995
      Jon Maloy committed
      When we increase the link transmit window we often observe the following
      scenario:
      
      1) A STATE message bypasses a sequence of traffic packets and arrives
         far ahead of those to the receiver. STATE messages contain a
         'peers_snd_nxt' field to indicate which was the last packet sent
         from the peer. This mechanism is intended as a last resort for the
         receiver to detect missing packets, e.g., during very low traffic
         when there is no packet flow to help early loss detection.
      2) The receiving link compares the 'peers_snd_nxt' field to its own
         'rcv_nxt', finds that there is a gap, and immediately sends a
         NACK message back to the peer.
      3) When this NACK arrives at the sender, all the requested
         retransmissions are performed, since it is a first-time request.
      
      Just like in the scenario described in the previous commit this leads
      to many redundant retransmissions, with decreased throughput as a
      consequence.
      
      We fix this by adding two more conditions before we send a NACK in
      this situation. First, the deferred queue must be empty, so we cannot
      assume that the potential packet loss has already been detected by
      other means. Second, we check the 'peers_snd_nxt' field only in probe/
      probe_reply messages, thus turning this into a true mechanism of last
      resort as it was really meant to be.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      tipc: eliminate gap indicator from ACK messages · 02288248
      Jon Maloy committed
      When we increase the link send window we sometimes observe the
      following scenario:
      
      1) A packet #N arrives out of order far ahead of a sequence of older
         packets which are still under way. The packet is added to the
         deferred queue.
      2) The missing packets arrive in sequence, and for each 16th of them
         an ACK is sent back to the sender, as it should be.
      3) When building those ACK messages, it is checked if there is a gap
         between the link's 'rcv_nxt' and the first packet in the deferred
         queue. This is always the case until packet number #N-1 arrives, and
         a 'gap' indicator is added, effectively turning them into NACK
         messages.
      4) When those NACKs arrive at the sender, all the requested
         retransmissions are done, since it is a first-time request.
      
      This sometimes leads to a huge amount of redundant retransmissions,
      causing a drop in max throughput. This problem gets worse when we
      in a later commit introduce variable window congestion control,
      since it drops the link back to 'fast recovery' much more often
      than necessary.
      
      We now fix this by not sending any 'gap' indicator in regular ACK
      messages. We already have a mechanism for sending explicit NACKs
      in place, and this is sufficient to keep up the packet flow.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 23 Nov, 2019 1 commit
  10. 09 Nov, 2019 1 commit
    • T
      tipc: introduce TIPC encryption & authentication · fc1b6d6d
      Tuong Lien committed
      This commit offers an option to encrypt and authenticate all messaging,
      including the neighbor discovery messages. The currently most advanced
      algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
      encryption/decryption is done at the bearer layer, just before leaving
      or after entering TIPC.
      
      Supported features:
      - Encryption & authentication of all TIPC messages (header + data);
      - Two symmetric-key modes: Cluster and Per-node;
      - Automatic key switching;
      - Key-expired revoking (sequence number wrapped);
      - Lock-free encryption/decryption (RCU);
      - Asynchronous crypto, Intel AES-NI supported;
      - Multiple cipher transforms;
      - Logs & statistics;
      
      Two key modes:
      - Cluster key mode: One single key is used for both TX & RX in all
      nodes in the cluster.
      - Per-node key mode: Each node in the cluster has one specific TX key.
      For RX, a node requires its peers' TX key to be able to decrypt the
      messages from those peers.
      
      Key setting from user-space is performed via netlink by a user program
      (e.g. the iproute2 'tipc' tool).
      
      Internal key state machine:
      
                                       Attach    Align(RX)
                                           +-+   +-+
                                           | V   | V
              +---------+      Attach     +---------+
              |  IDLE   |---------------->| PENDING |(user = 0)
              +---------+                 +---------+
                 A   A                   Switch|  A
                 |   |                         |  |
                 |   | Free(switch/revoked)    |  |
           (Free)|   +----------------------+  |  |Timeout
                 |              (TX)        |  |  |(RX)
                 |                          |  |  |
                 |                          |  v  |
              +---------+      Switch     +---------+
              | PASSIVE |<----------------| ACTIVE  |
              +---------+       (RX)      +---------+
              (user = 1)                  (user >= 1)
      
      The number of TFMs is 10 by default and can be changed via the procfs
      'net/tipc/max_tfms'. At this moment, for simplicity, this file is
      also used to print the crypto statistics at runtime:
      
      echo 0xfff1 > /proc/sys/net/tipc/max_tfms
      
      The patch defines a new TIPC version (v7) for the encryption message
      (with backward compatibility as well). The message is basically
      encapsulated as follows:
      
         +----------------------------------------------------------+
         | TIPCv7 encryption  | Original TIPCv2    | Authentication |
         | header             | packet (encrypted) | Tag            |
         +----------------------------------------------------------+
      
      The throughput is about ~40% for small messages (compared with
      non-encryption) and ~9% for large messages. With support from hardware
      crypto, i.e. the Intel AES-NI CPU instructions, the throughput increases
      up to ~85% for small messages and ~55% for large messages.
      
      By default, the new feature is inactive (i.e. no encryption) until user
      sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
      in the kernel configuration to enable/disable the new code when needed.
      
      MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 07 Nov, 2019 2 commits
    • T
      tipc: eliminate the dummy packet in link synching · d0d605c5
      Tuong Lien committed
      When preparing tunnel packets for the link failover or synchronization,
      as for the safe algorithm, we added a dummy packet on the pair link but
      never sent it out. In the case of failover, the pair link will be reset
      anyway. But for link synching, it will always result in retransmission
      of the dummy packet after that.
      We have also observed that such a retransmission at the early stage,
      when a new node joins a large cluster, takes some time and is hard
      to complete, leading to repeated retransmit failures and a link
      reset.
      
      Since in commit 4929a932 ("tipc: optimize link synching mechanism")
      we have already built a dummy 'TUNNEL_PROTOCOL' message on the new link
      for the synchronization, there is no need for the dummy on the pair one;
      this commit skips it when the new mechanism is in place. In case
      nothing exists in the pair link's transmq, the link synching will just
      start and stop shortly on the peer side.
      
      The patch is backward compatible.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Tested-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      tipc: reduce sensitive to retransmit failures · 426071f1
      Hoang Le committed
      In a huge cluster (e.g. >200 nodes), the flow
      gap -> retransmit packet -> acked takes time when STATE_MSGs are
      dropped or delayed because of heavy traffic. With the 1.5 sec tolerance
      value as the criterion, a link fails easily around the 2nd or 3rd failed
      retransmission attempt.
      
      Instead of re-introducing the criterion of 99 failed retransmissions to
      fix the issue, we increase the failure detection timer to ten times the
      tolerance value.
      
      Fixes: 77cf8edb ("tipc: simplify stale link failure criteria")
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Acked-by: Jon
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 04 Nov, 2019 1 commit
    • T
      tipc: improve message bundling algorithm · 06e7c70c
      Tuong Lien committed
      As mentioned in commit e95584a8 ("tipc: fix unlimited bundling of
      small messages"), the current message bundling algorithm is inefficient
      in that it can generate bundles of only one payload message, which causes
      unnecessary overhead for both the sender and receiver.
      
      This commit re-designs the 'tipc_msg_make_bundle()' function (now named
      'tipc_msg_try_bundle()'), so that when a first message arrives, we
      just check it and keep a reference to it if the message is
      suitable for bundling. The message buffer will be put into the link
      backlog queue and processed as normal. Later on, when another one comes
      we will make a bundle with the first message if possible and so on...
      This way, a bundle, if really needed, will always consist of at least two
      payload messages. Otherwise, we let the first buffer go its way without
      any need of bundling, reducing the overhead to zero.
      
      Moreover, since we now have both messages in hand, we can even
      optimize the 'tipc_msg_bundle()' function and make a bundle of a very
      large (size ~ MSS) message and a small one, which is not possible with
      the current algorithm, e.g. [1400-byte message] + [10-byte message]
      (MTU = 1500).
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 29 Oct, 2019 1 commit
  14. 02 Oct, 2019 1 commit
    • T
      tipc: fix unlimited bundling of small messages · e95584a8
      Tuong Lien committed
      We have identified a problem with the "oversubscription" policy in the
      link transmission code.
      
      When small messages are transmitted, and the sending link has reached
      the transmit window limit, those messages will be bundled and put into
      the link backlog queue. However, bundles of data messages are counted
      at the 'CRITICAL' level, so that the counter for that level, instead of
      the counter for the real, bundled message's level is the one being
      increased.
      Subsequent, to-be-bundled data messages at non-CRITICAL levels continue
      to be tested against the unchanged counter for their own level, while
      contributing to an unrestrained increase at the CRITICAL backlog level.
      
      This leaves a gap in congestion control algorithm for small messages
      that can result in starvation for other users or a "real" CRITICAL
      user. Even that eventually can lead to buffer exhaustion & link reset.
      
      We fix this by keeping a 'target_bskb' buffer pointer for each level,
      so that when bundling, we only bundle messages at the same importance
      level. This way, we know exactly how many slots a certain level
      has occupied in the queue, and so can manage level congestion accurately.
      
      By bundling messages at the same level, we get even more benefits. Let
      us consider this:
      - One socket sends 64-byte messages at the 'CRITICAL' level;
      - Another sends 4096-byte messages at the 'LOW' level;
      
      When a 64-byte message comes and is bundled the first time, we put the
      overhead of message bundle to it (+ 40-byte header, data copy, etc.)
      for later use, but the next message can be a 4096-byte one that cannot
      be bundled to the previous one. This means the last bundle carries only
      one payload message which is totally inefficient, as for the receiver
      also! Later on, another 64-byte message comes, now we make a new bundle
      and the same story repeats...
      
      With the new bundling algorithm, this will not happen, the 64-byte
      messages will be bundled together even when the 4096-byte message(s)
      comes in between. However, if the 4096-byte messages are sent at the
      same level i.e. 'CRITICAL', the bundling algorithm will again cause the
      same overhead.
      
      Also, the same will happen even with only one socket sending small
      messages at a rate close to the link transmit's one, so that, when one
      message is bundled, it's transmitted shortly. Then, another message
      comes, a new bundle is created and so on...
      
      We will solve this issue radically by another patch.
      
      Fixes: 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 19 Aug, 2019 1 commit
    • J
      tipc: clean up skb list lock handling on send path · e654f9f5
      Jon Maloy committed
      The policy for handling the skb list locks on the send and receive paths
      is simple.
      
      - On the send path we never need to grab the lock on the 'xmitq' list
        when the destination is an external node.
      
      - On the receive path we always need to grab the lock on the 'inputq'
        list, irrespective of source node.
      
      However, when transmitting node local messages those will eventually
      end up on the receive path of a local socket, meaning that the argument
      'xmitq' in tipc_node_xmit() will become the 'inputq' argument in the
      function tipc_sk_rcv(). This has been handled by always initializing
      the spinlock of the 'xmitq' list at message creation, just in case it
      may end up on the receive path later, and despite knowing that the lock
      in most cases never will be used.
      
      This approach is inaccurate and confusing, and has also concealed the
      fact that the stated 'no lock grabbing' policy for the send path is
      violated in some cases.
      
      We now clean up this by never initializing the lock at message creation,
      instead doing this at the moment we find that the message actually will
      enter the receive path. At the same time we fix the four locations
      where we incorrectly access the spinlock on the send/error path.
      
      This patch also reverts commit d12cffe9 ("tipc: ensure head->lock
      is initialised") which has now become redundant.
      
      CC: Eric Dumazet <edumazet@google.com>
      Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e654f9f5
  16. 17 Aug, 2019 1 commit
    • T
      tipc: fix false detection of retransmit failures · 71204231
      Tuong Lien authored
      This commit eliminates the use of the link 'stale_limit' & 'prev_from'
      variables (besides the already removed 'stale_cnt') in the detection
      of repeated retransmit failures, as there is no proper way to
      initialize them that avoids a false detection, i.e. one that is not
      really a retransmission failure but is caused by garbage values in
      the variables.
      
      Instead, a jiffies variable will be added to individual skbs (like the
      way we restrict the skb retransmissions) in order to mark the first skb
      retransmit time. Later on, at the next retransmissions, the timestamp
      will be checked to see if the skb in the link transmq is "too stale",
      that is, the link tolerance time has passed, so that a link reset will
      be ordered. Note, just checking on the first skb in the queue is fine
      enough since it must be the oldest one.
      A counter is also added to keep track of the actual number of skb
      retransmissions, for later checking when the failure happens.
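
      The per-skb timestamp check described above can be sketched in plain
      userspace C. This is only an illustration: 'struct pkt_retr_state',
      'tick_after()' and 'retransmit_failed()' are hypothetical names, a
      32-bit tick counter stands in for jiffies, and the real code keeps
      these fields in skb->cb[].

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is later than b" for a free-running 32-bit tick
 * counter, in the spirit of the kernel's time_after(). */
static bool tick_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Hypothetical per-packet bookkeeping (stand-in for fields in skb->cb[]). */
struct pkt_retr_state {
    uint32_t first_retr; /* tick of the first retransmission */
    uint16_t retr_cnt;   /* retransmissions so far; 0 means never */
};

/* Record one retransmission attempt of the head-of-queue packet at tick
 * 'now'. Returns true when the packet has been under retransmission for
 * longer than 'tolerance' ticks, i.e. the link should be reset. */
static bool retransmit_failed(struct pkt_retr_state *p, uint32_t now,
                              uint32_t tolerance)
{
    if (p->retr_cnt == 0) {
        /* First retransmission: just stamp the packet. */
        p->first_retr = now;
        p->retr_cnt = 1;
        return false;
    }
    if (tick_after(now, p->first_retr + tolerance))
        return true; /* "too stale": tolerance time has passed */
    p->retr_cnt++;
    return false;
}
```

      Because the timestamp is written on the first retransmission itself,
      there is no link-level variable left that could hold a stale or
      uninitialized value.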
      
      The downside of this approach is that the skb->cb[] buffer is close to
      being exhausted; however, it is always possible to allocate another
      memory area and keep a reference to it when needed.
      
      Fixes: 77cf8edb ("tipc: simplify stale link failure criteria")
      Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71204231
  17. 02 Aug, 2019 1 commit
    • J
      tipc: reduce risk of wakeup queue starvation · 7c5b4205
      Jon Maloy authored
      In commit 365ad353 ("tipc: reduce risk of user starvation during
      link congestion") we allowed senders to add exactly one list of extra
      buffers to the link backlog queues during link congestion (aka
      "oversubscription"). However, the criterion for when to stop adding
      wakeup messages to the input queue when the overload abates is
      inaccurate, and may cause starvation problems during very high load.
      
      Currently, we stop adding wakeup messages after 10 total failed attempts
      where we find that there is no space left in the backlog queue for a
      certain importance level. The counter for this is accumulated across all
      levels, which may lead the algorithm to leave the loop prematurely,
      although there may still be plenty of space available at some levels.
      The result is sometimes that messages near the wakeup queue tail are not
      added to the input queue as they should be.
      
      We now introduce a more exact algorithm, where we keep adding wakeup
      messages to a level as long as the backlog queue has free slots for
      the corresponding level, and stop at the moment there are no more such
      slots or when there are no more wakeup messages to dequeue.
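
      A minimal sketch of the per-level selection described above, using
      plain arrays instead of skb queues. 'NUM_LEVELS' and
      'select_wakeups()' are hypothetical names chosen for this
      illustration, not the kernel's.

```c
#include <stddef.h>

#define NUM_LEVELS 5 /* assumed number of importance levels */

/* Pick the pending wakeups that may be moved to the input queue.
 * wakeup_level[i] is the importance level of the i-th queued wakeup,
 * oldest first. avail[] holds the free backlog slots per level and is
 * consumed as wakeups are picked. Indices of the picked wakeups are
 * written to picked[]; the return value is how many were picked. */
static size_t select_wakeups(const int *wakeup_level, size_t n,
                             int avail[NUM_LEVELS], size_t *picked)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        int imp = wakeup_level[i];

        /* A full level is skipped, but the walk continues: entries
         * further down the queue may belong to levels with room. */
        if (imp < 0 || imp >= NUM_LEVELS || avail[imp] <= 0)
            continue;
        avail[imp]--;
        picked[count++] = i;
    }
    return count;
}
```

      The important difference from a single shared failure counter is that
      a full level never ends the walk early, so wakeups near the queue
      tail are still considered.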
      
      Fixes: 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reported-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c5b4205
  18. 26 Jul, 2019 2 commits
    • T
      tipc: fix changeover issues due to large packet · 2320bcda
      Tuong Lien authored
      In conjunction with changing the interfaces' MTU (e.g. especially in
      the case of a bonding) where the TIPC links are brought up and down
      in a short time, a couple of issues were detected with the current link
      changeover mechanism:
      
      1) When one link is up but immediately forced down again, the failover
      procedure will be carried out in order to fail over all the messages in
      the link's transmq queue onto the other working link. The link and node
      state is also set to FAILINGOVER as part of the process. Each message
      will be transmitted in the form of a FAILOVER_MSG, so its size grows by
      40 bytes (= the message header size). There is no problem if the
      original message size is not larger than the link's MTU - 40, and indeed
      this is the max size of a normal payload message. However, in the
      situation above, because the link has just come up, the messages in the
      link's transmq are almost all SYNCH_MSGs generated by the link synching
      procedure, so their size might already have reached the max value!
      When a FAILOVER_MSG is built on top of such a SYNCH_MSG, its size
      will exceed the link's MTU. As a result, the messages are dropped
      silently and the failover procedure never ends; the link is not able
      to exit the FAILINGOVER state, so it cannot be re-established.
      
      2) The same scenario can happen more easily when the MTUs of the links
      are set differently or are being changed. In that case, as long as a
      large message in the failed link's transmq queue was built and
      fragmented with its link's MTU > the other link's MTU, the issue will
      happen (no prior link synching is needed).
      
      3) The link synching procedure also faces the same issue, but since
      link synching is only started upon receipt of a SYNCH_MSG, dropping
      the message will not result in a state deadlock; it is still not the
      designed behaviour.
      
      Issues 1) & 3) are resolved by the previous commit, in which only a
      dummy SYNCH_MSG (i.e. without data) is generated at link synching, so
      the size of any resulting FAILOVER_MSG will never exceed the link's MTU.
      
      For issue 2), the only solution is to fragment the messages in the
      failed link's transmq queue according to the working link's MTU so
      that they can then be failed over. A new function is added to
      accomplish this. The result is still a TUNNEL PROTOCOL/FAILOVER MSG,
      but if the original message is too large, it is fragmented and then
      reassembled at the receiving side.
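
      The size condition behind these issues can be written down as a small
      check. This is a sketch only: 'TUNNEL_HDR_SIZE' and 'tunnel_fits()'
      are hypothetical names, and the 40-byte figure is the header size
      quoted in the text above.

```c
#include <stdbool.h>
#include <stddef.h>

#define TUNNEL_HDR_SIZE 40u /* extra header bytes added by tunnelling */

/* True when a message of 'msg_size' bytes, wrapped in a tunnel header,
 * still fits in a single packet on a link whose MTU is 'mtu'. When this
 * is false the message must be fragmented for the working link. */
static bool tunnel_fits(size_t msg_size, size_t mtu)
{
    return mtu >= TUNNEL_HDR_SIZE && msg_size <= mtu - TUNNEL_HDR_SIZE;
}
```

      For example, with a working-link MTU of 1500 a 1460-byte message still
      fits, whereas a message already built at the full 1500 bytes, or one
      fragmented for a larger MTU on the failed link, does not.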
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2320bcda
    • T
      tipc: optimize link synching mechanism · 4929a932
      Tuong Lien authored
      This commit, along with the next one, resolves the issues with the
      link changeover mechanism. See that commit for details.
      
      Basically, for the link synching, from now on, we will send only one
      single ("dummy") SYNCH message to peer. The SYNCH message does not
      contain any data, just a header conveying the synch point to the peer.
      
      A new node capability flag ("TIPC_TUNNEL_ENHANCED") is introduced for
      backward compatibility.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4929a932
  19. 02 Jul, 2019 1 commit
  20. 26 Jun, 2019 3 commits
  21. 19 Jun, 2019 1 commit
    • T
      tipc: fix issues with early FAILOVER_MSG from peer · d0f84d08
      Tuong Lien authored
      It appears that a FAILOVER_MSG can come from peer even when the failure
      link is resetting (i.e. just after the 'node_write_unlock()'...). This
      means the failover procedure on the node has not been started yet.
      The situation is as follows:
      
               node1                                node2
        linkb          linka                  linka        linkb
          |              |                      |            |
          |              |                      x failure    |
          |              |                  RESETTING        |
          |              |                      |            |
          |              x failure            RESET          |
          |          RESETTING             FAILINGOVER       |
          |              |   (FAILOVER_MSG)     |            |
          |<-------------------------------------------------|
          | *FAILINGOVER |                      |            |
          |              | (dummy FAILOVER_MSG) |            |
          |------------------------------------------------->|
          |            RESET                    |            | FAILOVER_END
          |         FAILINGOVER               RESET          |
          .              .                      .            .
          .              .                      .            .
          .              .                      .            .
      
      Once this happens, the link failover procedure will be triggered
      wrongly on the receiving node, since the node isn't in FAILINGOVER
      state, and then another link failover will be carried out.
      The consequences are:
      
      1) A peer might get stuck in FAILINGOVER state because the 'sync_point'
      was set, reset and set again incorrectly, so the criteria to end the
      failover would not be met; it could keep waiting for a message that
      has already been received.
      
      2) The early FAILOVER_MSG(s) could be queued in the link failover
      deferdq but would be purged or not pulled out because the 'drop_point'
      was not set correctly.
      
      3) The early FAILOVER_MSG(s) could be dropped too.
      
      4) The dummy FAILOVER_MSG could make the peer leave FAILINGOVER state
      shortly, but later on it would be restarted.
      
      The same situation can also happen when the link is in PEER_RESET state
      and a FAILOVER_MSG arrives.
      
      The commit resolves the issues by forcing the link down immediately, so
      the failover procedure will be started normally (which is the same as
      when receiving a FAILOVER_MSG and the link is in up state).
      
      Also, the function "tipc_node_link_failover()" is toughened to prevent
      such a situation from happening.
      Acked-by: Jon Maloy <jon.maloy@ericsson.se>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d0f84d08
  22. 18 Jun, 2019 1 commit
  23. 04 May, 2019 1 commit
    • T
      tipc: fix missing Name entries due to half-failover · c0b14a08
      Tuong Lien authored
      A TIPC link can temporarily fall into a "half-established" state in
      which only one of the link endpoints is ESTABLISHED and starts to send
      traffic and PROTOCOL messages, whereas the other link endpoint is not
      up (e.g. immediately when the endpoint receives ACTIVATE_MSG, the
      network interface goes down...).
      
      This is a normal situation and will be settled because the link
      endpoint will be eventually brought down after the link tolerance time.
      
      However, the situation becomes worse when the second link is
      established before the first link endpoint goes down.
      For example:
      
         1. Both links <1A-2A>, <1B-2B> down
         2. Link endpoint 2A up, but 1A still down (e.g. due to network
            disturbance, wrong session, etc.)
         3. Link <1B-2B> up
         4. Link endpoint 2A down (e.g. due to link tolerance timeout)
         5. Node B starts failover onto link <1B-2B>
      
         ==> Node A never starts link failover.
      
      When the "half-failover" situation happens, two consequences have been
      observed:
      
      a) Peer link/node gets stuck in FAILINGOVER state;
      b) Traffic or user messages that peer node is trying to failover onto
      the second link can be partially or completely dropped by this node.
      
      Consequence a) was actually solved by commit c140eb16 ("tipc:
      fix failover problem"), but that commit didn't cover b). This is due
      to the fact that the tunnel link endpoint has never been prepared for a
      failover, so 'l->drop_point' (and the other data...) is not set
      correctly. When a TUNNEL_MSG from the peer node arrives on the link,
      depending on the inner message's seqno and the current 'l->drop_point'
      value, the message can be dropped (treated as a duplicate message) or
      processed.
      At this early stage, the traffic messages from the peer are likely to be
      NAME_DISTRIBUTORs, which means some name table entries will be missing
      on the node forever!
      
      The commit resolves the issue by starting the FAILOVER process on this
      node as well. Another benefit from this solution is that we ensure the
      link will not be re-established until the failover ends.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0b14a08
  24. 28 Apr, 2019 2 commits
    • J
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg authored
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
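
      The split into independent strictness options can be modelled as a
      bit mask. The sketch below is illustrative only: the 'VALIDATE_*'
      names and both helper functions are hypothetical and merely mirror
      the four options listed above; they are not the kernel's
      definitions.

```c
#include <stdbool.h>

/* Illustrative flag names modelled on the options in the text. */
enum validation_flags {
    VALIDATE_LIBERAL      = 0,
    VALIDATE_TRAILING     = 1u << 0, /* no trailing data after attributes */
    VALIDATE_MAXTYPE      = 1u << 1, /* reject attrs > max known type */
    VALIDATE_UNSPEC       = 1u << 2, /* reject NLA_UNSPEC policy entries */
    VALIDATE_STRICT_ATTRS = 1u << 3  /* strictly validate attribute size */
};

/* What the old *_strict() parsers amount to, per the text. */
#define VALIDATE_DEPRECATED_STRICT (VALIDATE_TRAILING | VALIDATE_MAXTYPE)

/* "Everything", the intended default for future users. */
#define VALIDATE_EVERYTHING                                    \
    (VALIDATE_TRAILING | VALIDATE_MAXTYPE | VALIDATE_UNSPEC |  \
     VALIDATE_STRICT_ATTRS)

/* An attribute beyond the known maximum is rejected only with MAXTYPE. */
static bool rejects_unknown_type(unsigned int validate, int type, int maxtype)
{
    return type > maxtype && (validate & VALIDATE_MAXTYPE) != 0;
}

/* Length check: exact with STRICT_ATTRS, "at least expected" otherwise. */
static bool length_ok(unsigned int validate, int attrlen, int expected)
{
    if (validate & VALIDATE_STRICT_ATTRS)
        return attrlen == expected;
    return attrlen >= expected;
}
```

      With this shape, a caller can opt in to a single check (for instance
      only the UNSPEC one) without taking on the others.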
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8cb08174
    • M
      netlink: make nla_nest_start() add NLA_F_NESTED flag · ae0be8de
      Michal Kubecek authored
      Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
      netlink based interfaces (including recently added ones) are still not
      setting it in kernel generated messages. Without the flag, message parsers
      not aware of attribute semantics (e.g. wireshark dissector or libmnl's
      mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
      the structure of their contents.
      
      Unfortunately we cannot just add the flag everywhere as there may be
      userspace applications which check nlattr::nla_type directly rather than
      through a helper masking out the flags. Therefore the patch renames
      nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
      as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
      are rewritten to use nla_nest_start().
      
      Except for changes in include/net/netlink.h, the patch was generated using
      this semantic patch:
      
      @@ expression E1, E2; @@
      -nla_nest_start(E1, E2)
      +nla_nest_start_noflag(E1, E2)
      
      @@ expression E1, E2; @@
      -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
      +nla_nest_start(E1, E2)
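
      The userspace compatibility concern above comes down to whether a
      parser masks the flag bits before comparing attribute types. A small
      sketch, with the flag values copied from the Linux uapi netlink
      header but given local 'X_' names (an assumption made so that the
      snippet builds without kernel headers):

```c
#include <stdint.h>

/* Same bit positions as NLA_F_NESTED, NLA_F_NET_BYTEORDER and
 * NLA_TYPE_MASK in the uapi header; local names for a standalone build. */
#define X_NLA_F_NESTED        (1u << 15)
#define X_NLA_F_NET_BYTEORDER (1u << 14)
#define X_NLA_TYPE_MASK       (~(X_NLA_F_NESTED | X_NLA_F_NET_BYTEORDER))

/* Attribute type with the flag bits masked out (the safe way to read it). */
static uint16_t attr_type(uint16_t nla_type)
{
    return (uint16_t)(nla_type & X_NLA_TYPE_MASK);
}

/* Whether the sender marked the attribute as a nest. */
static int attr_is_nested(uint16_t nla_type)
{
    return (nla_type & X_NLA_F_NESTED) != 0;
}
```

      A parser that compares the raw nla_type field against an expected
      type would stop matching once the flag is set, which is why existing
      callers were moved to nla_nest_start_noflag() rather than changed in
      place.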
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae0be8de
  25. 17 Apr, 2019 1 commit
    • T
      tipc: fix link established but not in session · f7a93780
      Tuong Lien authored
      According to the link FSM, when a link endpoint gets a RESET_MSG (a
      traditional one without the stopping bit) from its peer, it moves to
      PEER_RESET state and raises a LINK_DOWN event which then resets the
      link itself. Its state will become ESTABLISHING after the reset event
      and the link will be re-established soon after this endpoint starts to
      send ACTIVATE_MSG to the peer.
      
      There is no problem with this mechanism; however, the link reset has
      cleared the link 'in_session' flag (along with other important link
      data such as the link 'mtu') that was correctly set up at the 1st step
      (i.e. when this endpoint received the peer RESET_MSG). As a result, the
      link will become ESTABLISHED, but the 'in_session' flag is not set, and
      all STATE_MSGs from its peer will be dropped in link_validate_msg().
      This means the link is not synced and will sooner or later fail.
      
      Since the link reset action is obviously needed for a new link session
      (this is also true in the other situations), the problem here is that
      the link is re-established a bit too early when the link endpoints are
      not really in-sync yet. The commit forces a resync as already done in
      the previous commit 91986ee1 ("tipc: fix link session and
      re-establish issues") by simply varying the link 'peer_session' value
      at the link_reset().
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7a93780
  26. 05 Apr, 2019 3 commits
    • T
      tipc: adapt link failover for new Gap-ACK algorithm · 58ee86b8
      Tuong Lien authored
      In commit 0ae955e2656d ("tipc: improve TIPC throughput by Gap ACK
      blocks"), we enhance the link transmq by releasing as many packets as
      possible with the multi-ACKs from peer node. This also means the queue
      is now non-linear and the peer link deferdq becomes vital.
      
      However, in the case of link failover, all messages in the link transmq
      need to be transmitted as tunnel messages in such a way that message
      sequentiality and cardinality per sender is preserved. This requires us
      to maintain the link deferdq somehow, so that when the tunnel messages
      arrive, the inner user messages along with the ones in the deferdq will
      be delivered to upper layer correctly.
      
      The commit accomplishes this by defining a new queue in the TIPC link
      structure to hold the old link deferdq when link failover happens and
      process it upon receipt of tunnel messages.
      
      Also, in the case of link syncing, the link deferdq will not be purged
      to avoid unnecessary retransmissions that in the worst case will fail
      because the packets might have been freed on the sending side.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58ee86b8
    • T
      tipc: reduce duplicate packets for unicast traffic · 382f598f
      Tuong Lien authored
      For unicast transmission, the current NACK sending algorithm is
      over-active: it forces the sending side to retransmit a packet that is
      not really lost but has just arrived at the receiving side with some
      delay, or even to retransmit the same packets that have already been
      retransmitted before. As a result, many duplicates are observed even
      under normal conditions, i.e. without packet loss.
      
      One example case is: node1 transmits 1 2 3 4 10 5 6 7 8 9. When node2
      receives packet #10, it puts it into the deferdq. When packet #5
      arrives, it sends a NACK with gap [6 - 9]. However, shortly after that,
      when packet #6 arrives, it pulls packet #10 out of the deferdq, but it
      is still out of order, so it makes another NACK with gap [7 - 9], and so
      on... Finally, node1 has to retransmit packets 5 6 7 8 9 a number of
      times, although none of the packets was lost at all, so duplicates!
      
      This commit reduces duplicates by changing the condition for sending a
      NACK, and by restricting retransmissions of individual packets via a
      timer of about 1ms. However, it must also be said that too tricky a
      condition for NACKs or too long a timeout value for retransmissions
      will reduce performance! The criteria in this commit are found to be
      effective in meeting both requirements: reducing duplicates without
      affecting performance.
      
      tipc_link_rcv() is also improved to only dequeue an skb from the link
      deferdq if it is expected (i.e. its seqno <= rcv_nxt).
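
      Both restrictions can be sketched in a few lines of standalone C.
      The names 'struct pkt', 'try_retransmit()' and 'seq_less_eq()' are
      hypothetical, the tick counter stands in for jiffies, and 16-bit
      sequence numbers that wrap around are an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is earlier than b" for a 32-bit tick counter. */
static bool tick_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Hypothetical per-packet state: earliest tick for the next retransmit. */
struct pkt {
    uint32_t nxt_retr;
};

/* Retransmit at most once per 'min_interval' ticks. Returns true when
 * the packet may be sent now, and then arms the timer again. */
static bool try_retransmit(struct pkt *p, uint32_t now, uint32_t min_interval)
{
    if (tick_before(now, p->nxt_retr))
        return false; /* retransmitted too recently, skip */
    p->nxt_retr = now + min_interval;
    return true;
}

/* Wraparound-safe "left <= right" for 16-bit sequence numbers. */
static bool seq_less_eq(uint16_t left, uint16_t right)
{
    return (uint16_t)(right - left) < 32768u;
}

/* Only pull the head of the deferred queue when it is actually due. */
static bool deferdq_head_ready(uint16_t head_seqno, uint16_t rcv_nxt)
{
    return seq_less_eq(head_seqno, rcv_nxt);
}
```

      Note that 'nxt_retr' must hold a valid time before the first check;
      comparing the current tick against a leftover zero can give either
      answer depending on the counter value.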
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      382f598f
    • T
      tipc: improve TIPC throughput by Gap ACK blocks · 9195948f
      Tuong Lien authored
      During unicast link transmission, it is observed very often that because
      of one or a few lost/disordered packets, the sending side will quickly
      reach the send window limit and must wait for the packets to arrive
      at the receiving side or, in the worst case, a retransmission must be
      done first. The sending side cannot release a lot of subsequent packets
      in its transmq even though all of them might have already been received
      by the receiving side.
      That is, one or two packets are disordered/lost and dozens of packets
      have to wait; this obviously reduces the overall throughput!
      
      This commit introduces an algorithm to overcome this by using "Gap ACK
      blocks". Basically, a Gap ACK block consists of <ack, gap> numbers
      that describe the link deferdq, where packets have been received by
      the receiving side but with gaps, for example:
      
            link deferdq: [1 2 3 4      10 11      13 14 15       20]
      --> Gap ACK blocks:       <4, 5>,   <11, 1>,      <15, 4>, <20, 0>
      
      The Gap ACK blocks will be sent to the sending side along with the
      traditional ACK or NACK message. Immediately when receiving the message
      the sending side will now not only release from its transmq the packets
      ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
      be enqueued and transmitted.
      In addition, the sending side can now do "multi-retransmissions"
      according to the Gaps reported in the Gap ACK blocks.
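
      How the <ack, gap> pairs in the example are derived can be sketched
      as follows. This is an illustration under simplifying assumptions:
      'struct gap_ack' and 'build_gap_acks()' are hypothetical names, the
      input is a sorted array of sequence numbers, and wraparound is
      ignored.

```c
#include <stddef.h>

/* One block: 'ack' is the last seqno of a contiguous run, 'gap' is the
 * number of missing packets between this run and the next one. */
struct gap_ack {
    unsigned ack;
    unsigned gap;
};

/* Walk the sorted, duplicate-free seqnos in seq[0..n-1] and emit one
 * block per contiguous run, up to max_blocks. Returns the block count. */
static size_t build_gap_acks(const unsigned *seq, size_t n,
                             struct gap_ack *out, size_t max_blocks)
{
    size_t cnt = 0;

    for (size_t i = 0; i < n; i++) {
        int last = (i + 1 == n);

        if (!last && seq[i + 1] == seq[i] + 1)
            continue; /* still inside a contiguous run */
        if (cnt == max_blocks)
            break;    /* out of room: remaining runs go unreported */
        out[cnt].ack = seq[i];
        out[cnt].gap = last ? 0 : seq[i + 1] - seq[i] - 1;
        cnt++;
    }
    return cnt;
}
```

      Fed with the deferdq from the example (1 2 3 4 10 11 13 14 15 20),
      this yields the four blocks shown there: <4, 5>, <11, 1>, <15, 4>
      and <20, 0>.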
      
      The new algorithm has been verified to greatly improve the TIPC
      throughput, especially under packet loss conditions.
      
      So far, a maximum of 32 blocks is quite enough, without any "Too few Gap
      ACK blocks" reports at a 5.0% packet loss rate; however, this number
      can be increased in the future if needed.
      
      Also, the patch is backward compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9195948f
  27. 20 Mar, 2019 1 commit
    • H
      tipc: support broadcast/replicast configurable for bc-link · 02ec6caf
      Hoang Le authored
      Currently, a multicast stream uses either broadcast or replicast as
      transmission method, based on the ratio between the number of actual
      destination nodes and the cluster size.
      
      However, when an L2 interface (e.g., VXLAN) provides pseudo
      broadcast support, this becomes very inefficient, as it blindly
      replicates multicast packets to all cluster/subnet nodes,
      irrespective of whether they host actual target sockets or not.
      
      The TIPC multicast algorithm is able to distinguish real destination
      nodes from other nodes, and hence provides a smarter and more
      efficient method for transferring multicast messages than
      pseudo broadcast can do.
      
      Because of this, we now make it possible for users to force
      the broadcast link to permanently switch to using replicast,
      irrespective of which capabilities the bearer provides,
      or pretends to provide.
      Conversely, we also make it possible to force the broadcast link
      to always use true broadcast. While maybe less useful in
      deployed systems, this may at least be useful for testing the
      broadcast algorithm in small clusters.
      
      We retain the current AUTOSELECT ability, i.e., to let the broadcast link
      automatically select which algorithm to use, and to switch back and forth
      between broadcast and replicast as the ratio between destination
      node number and cluster size changes. This remains the default method.
      
      Furthermore, we make it possible to configure the threshold ratio for
      such switches. The default ratio is now set to 10%, down from 25% in the
      earlier implementation.
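
      The three modes and the ratio threshold can be sketched as below.
      All names here are hypothetical, and the sketch assumes that
      AUTOSELECT picks replicast when the destinations are few relative
      to the cluster size; the exact threshold formula used by the
      kernel is not given in this message and may differ.

```c
/* Hypothetical mode and method names for this illustration. */
enum bc_mode { BC_AUTOSELECT, BC_FORCE_BCAST, BC_FORCE_RCAST };
enum xmit_method { XMIT_BROADCAST, XMIT_REPLICAST };

/* Choose the transmission method for one multicast send.
 * 'dests' is the number of destination nodes, 'cluster_size' the number
 * of nodes in the cluster, 'ratio_pct' the configured threshold in
 * percent (10 by default according to the text above). */
static enum xmit_method select_method(enum bc_mode mode, unsigned dests,
                                      unsigned cluster_size,
                                      unsigned ratio_pct)
{
    if (mode == BC_FORCE_BCAST)
        return XMIT_BROADCAST;
    if (mode == BC_FORCE_RCAST)
        return XMIT_REPLICAST;

    /* AUTOSELECT (assumed direction): replicate to a few destinations,
     * broadcast once the share of destinations exceeds the ratio. */
    if (dests * 100u <= cluster_size * ratio_pct)
        return XMIT_REPLICAST;
    return XMIT_BROADCAST;
}
```

      Under these assumptions, in a 100-node cluster with the default 10%
      ratio a message for 2 nodes is replicated, while a message for 50
      nodes is broadcast.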
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02ec6caf
  28. 12 Feb, 2019 1 commit
    • T
      tipc: fix link session and re-establish issues · 91986ee1
      Tuong Lien authored
      When a link endpoint is re-created (e.g. after a node reboot or
      interface reset), the link session number is randomized, and the peer
      endpoint will be synced with this new session number before the link is
      re-established.
      
      However, there is a shortcoming in this mechanism that can lead to the
      link never being re-established, or to a later failure. It happens when
      the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as
      well as the 'in_session' flag have been set, but suddenly this link
      endpoint leaves. When it comes back with a random session number, there
      are two possible situations:
      
      1/ If the random session number is larger than (or equal to) the
      previous one, the peer endpoint will be updated with this new session
      upon receipt of a RESET_MSG from this endpoint, and the link can be
      re-established as normal. Otherwise, all the RESET_MSGs from this endpoint
      will be rejected by the peer. In turn, when this link endpoint receives
      one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start
      to send STATE_MSGs, but again these messages will be dropped by the
      peer due to wrong session.
      The peer link endpoint can still become ESTABLISHED after receiving a
      traffic message from this endpoint (e.g. a BCAST_PROTOCOL or
      NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link
      will be forced down sooner or later!
      
      Even in case the random session number is larger than the previous one,
      it can be that the ACTIVATE_MSG from the peer arrives first, and this
      link endpoint moves quickly to ESTABLISHED without sending out any
      RESET_MSG yet. Consequently, the peer link will not be updated with the
      new session number, and the same link failure scenario as above will
      happen.
      
      2/ Another situation can be that the peer link endpoint was reset for
      any reason in the meantime; its link state was set to RESET from
      ESTABLISHING but it is still in session, i.e. the 'in_session' flag is
      not reset...
      Now, if the random session number from this endpoint is less than the
      previous one, all the RESET_MSGs from this endpoint will be rejected by
      the peer. In the other direction, when this link endpoint receives a
      RESET_MSG from the peer, it moves to ESTABLISHING and starts to send
      ACTIVATE_MSGs, but all these messages will be rejected by the peer too.
      As a result, the link cannot be re-established but gets stuck with this
      link endpoint in state ESTABLISHING and the peer in RESET!
      
      Solution:
      ===========
      
      This link endpoint should not go directly to ESTABLISHED when getting
      an ACTIVATE_MSG from the peer, which may belong to the old session if
      the link was re-created. To ensure the session is correct before the
      link is re-established, the peer endpoint in ESTABLISHING state will
      send back the last session number in its ACTIVATE_MSG for verification
      at this endpoint. Then, if needed, a new and more appropriate session
      number will be regenerated to force a re-synch first.
      
      In addition, when a link in ESTABLISHING state is reset, its state will
      move to RESET according to the link FSM, along with resetting the
      'in_session' flag (and the other data) as a normal link reset, it will
      also be deleted if requested.
      
      The solution is backward compatible.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91986ee1