1. 19 9月, 2020 2 次提交
    • T
      tipc: add automatic session key exchange · 1ef6f7c9
      Tuong Lien 提交于
      With support from the master key option in the previous commit, it
      becomes easy to make frequent updates/exchanges of session keys between
      authenticated cluster nodes.
      Basically, there are two situations where the key exchange will take in
      place:
      
      - When a new node joins the cluster (with the master key), it will need
        to get its peer's TX key, so that be able to decrypt further messages
        from that peer.
      
      - When a new session key is generated (by either user manual setting or
        later automatic rekeying feature), the key will be distributed to all
        peer nodes in the cluster.
      
      A key to be exchanged is encapsulated in the data part of a 'MSG_CRYPTO
      /KEY_DISTR_MSG' TIPC v2 message, then xmit-ed as usual and encrypted by
      using the master key before sending out. Upon receipt of the message it
      will be decrypted in the same way as regular messages, then attached as
      the sender's RX key in the receiver node.
      
      In this way, the key exchange is reliable by the link layer, as well as
      security, integrity and authenticity by the crypto layer.
      
      Also, the forward security will be easily achieved by user changing the
      master key actively but this should not be required very frequently.
      
      The key exchange feature is independent on the presence of a master key
      Note however that the master key still is needed for new nodes to be
      able to join the cluster. It is also optional, and can be turned off/on
      via the sysfs: 'net/tipc/key_exchange_enabled' [default 1: enabled].
      
      Backward compatibility is guaranteed because for nodes that do not have
      master key support, key exchange using master key ie. tx_key = 0 if any
      will be shortly discarded at the message validation step. In other
      words, the key exchange feature will be automatically disabled to those
      nodes.
      
      v2: fix the "implicit declaration of function 'tipc_crypto_key_flush'"
      error in node.c. The function only exists when built with the TIPC
      "CONFIG_TIPC_CRYPTO" option.
      
      v3: use 'info->extack' for a message emitted due to netlink operations
      instead (- David's comment).
      Reported-by: Nkernel test robot <lkp@intel.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ef6f7c9
    • T
      tipc: introduce encryption master key · daef1ee3
      Tuong Lien 提交于
      In addition to the supported cluster & per-node encryption keys for the
      en/decryption of TIPC messages, we now introduce one option for user to
      set a cluster key as 'master key', which is simply a symmetric key like
      the former but has a longer life cycle. It has two purposes:
      
      - Authentication of new member nodes in the cluster. New nodes, having
        no knowledge of current session keys in the cluster will still be
        able to join the cluster as long as they know the master key. This is
        because all neighbor discovery (LINK_CONFIG) messages must be
        encrypted with this key.
      
      - Encryption of session encryption keys during automatic exchange and
        update of those.This is a feature we will introduce in a later commit
        in this series.
      
      We insert the new key into the currently unused slot 0 in the key array
      and start using it immediately once the user has set it.
      After joining, a node only knowing the master key should be fully
      communicable to existing nodes in the cluster, although those nodes may
      have their own session keys activated (i.e. not the master one). To
      support this, we define a 'grace period', starting from the time a node
      itself reports having no RX keys, so the existing nodes will use the
      master key for encryption instead. The grace period can be extended but
      will automatically stop after e.g. 5 seconds without a new report. This
      is also the basis for later key exchanging feature as the new node will
      be impossible to decrypt anything without the support from master key.
      
      For user to set a master key, we define a new netlink flag -
      'TIPC_NLA_NODE_KEY_MASTER', so it can be added to the current 'set key'
      netlink command to specify the setting key to be a master key.
      
      Above all, the traditional cluster/per-node key mechanism is guaranteed
      to work when user comes not to use this master key option. This is also
      compatible to legacy nodes without the feature supported.
      
      Even this master key can be updated without any interruption of cluster
      connectivity but is so is needed, this has to be coordinated and set by
      the user.
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      daef1ee3
  2. 20 6月, 2020 1 次提交
  3. 17 6月, 2020 1 次提交
    • H
      tipc: update a binding service via broadcast · cad2929d
      Hoang Huu Le 提交于
      Currently, updating binding table (add service binding to
      name table/withdraw a service binding) is being sent over replicast.
      However, if we are scaling up clusters to > 100 nodes/containers this
      method is less affection because of looping through nodes in a cluster one
      by one.
      
      It is worth to use broadcast to update a binding service. This way, the
      binding table can be updated on all peer nodes in one shot.
      
      Broadcast is used when all peer nodes, as indicated by a new capability
      flag TIPC_NAMED_BCAST, support reception of this message type.
      
      Four problems need to be considered when introducing this feature.
      1) When establishing a link to a new peer node we still update this by a
      unicast 'bulk' update. This may lead to race conditions, where a later
      broadcast publication/withdrawal bypass the 'bulk', resulting in
      disordered publications, or even that a withdrawal may arrive before the
      corresponding publication. We solve this by adding an 'is_last_bulk' bit
      in the last bulk messages so that it can be distinguished from all other
      messages. Only when this message has arrived do we open up for reception
      of broadcast publications/withdrawals.
      
      2) When a first legacy node is added to the cluster all distribution
      will switch over to use the legacy 'replicast' method, while the
      opposite happens when the last legacy node leaves the cluster. This
      entails another risk of message disordering that has to be handled. We
      solve this by adding a sequence number to the broadcast/replicast
      messages, so that disordering can be discovered and corrected. Note
      however that we don't need to consider potential message loss or
      duplication at this protocol level.
      
      3) Bulk messages don't contain any sequence numbers, and will always
      arrive in order. Hence we must exempt those from the sequence number
      control and deliver them unconditionally. We solve this by adding a new
      'is_bulk' bit in those messages so that they can be recognized.
      
      4) Legacy messages, which don't contain any new bits or sequence
      numbers, but neither can arrive out of order, also need to be exempt
      from the initial synchronization and sequence number check, and
      delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
      to all new messages so that those can be distinguished from legacy
      messages and the latter delivered directly.
      
      v1->v2:
       - fix warning issue reported by kbuild test robot <lkp@intel.com>
       - add santiy check to drop the publication message with a sequence
      number that is lower than the agreed synch point
      Signed-off-by: Nkernel test robot <lkp@intel.com>
      Signed-off-by: NHoang Huu Le <hoang.h.le@dektech.com.au>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cad2929d
  4. 27 5月, 2020 3 次提交
    • T
      tipc: add test for Nagle algorithm effectiveness · 0a3e060f
      Tuong Lien 提交于
      When streaming in Nagle mode, we try to bundle small messages from user
      as many as possible if there is one outstanding buffer, i.e. not ACK-ed
      by the receiving side, which helps boost up the overall throughput. So,
      the algorithm's effectiveness really depends on when Nagle ACK comes or
      what the specific network latency (RTT) is, compared to the user's
      message sending rate.
      
      In a bad case, the user's sending rate is low or the network latency is
      small, there will not be many bundles, so making a Nagle ACK or waiting
      for it is not meaningful.
      For example: a user sends its messages every 100ms and the RTT is 50ms,
      then for each messages, we require one Nagle ACK but then there is only
      one user message sent without any bundles.
      
      In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
      but now the user sends messages in medium size, then there will not be
      any difference at all, that says 3 x 1000-byte data messages if bundled
      will still result in 3 bundles with MTU = 1500.
      
      When Nagle is ineffective, the delay in user message sending is clearly
      wasted instead of sending directly.
      
      Besides, adding Nagle ACKs will consume some processor load on both the
      sending and receiving sides.
      
      This commit adds a test on the effectiveness of the Nagle algorithm for
      an individual connection in the network on which it actually runs.
      Particularly, upon receipt of a Nagle ACK we will compare the number of
      bundles in the backlog queue to the number of user messages which would
      be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
      mode will be kept for further message sending. Otherwise, we will leave
      Nagle and put a 'penalty' on the connection, so it will have to spend
      more 'one-way' messages before being able to re-enter Nagle.
      
      In addition, the 'ack-required' bit is only set when really needed that
      the number of Nagle ACKs will be reduced during Nagle mode.
      
      Testing with benchmark showed that with the patch, there was not much
      difference in throughput for small messages since the tool continuously
      sends messages without a break, so Nagle would still take in effect.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a3e060f
    • T
      tipc: add support for broadcast rcv stats dumping · 03b6fefd
      Tuong Lien 提交于
      This commit enables dumping the statistics of a broadcast-receiver link
      like the traditional 'broadcast-link' one (which is for broadcast-
      sender). The link dumping can be triggered via netlink (e.g. the
      iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
      indicator.
      
      The name of a broadcast-receiver link of a specific peer will be in the
      format: 'broadcast-link:<peer-id>'.
      
      For example:
      
      Link <broadcast-link:1001002>
        Window:50 packets
        RX packets:7841 fragments:2408/440 bundles:0/0
        TX packets:0 fragments:0/0 bundles:0/0
        RX naks:0 defs:124 dups:0
        TX naks:21 acks:0 retrans:0
        Congestion link:0  Send queue max:0 avg:0
      
      In addition, the broadcast-receiver link statistics can be reset in the
      usual way via netlink by specifying that link name in command.
      
      Note: the 'tipc_link_name_ext()' is removed because the link name can
      now be retrieved simply via the 'l->name'.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03b6fefd
    • T
      tipc: introduce Gap ACK blocks for broadcast link · d7626b5a
      Tuong Lien 提交于
      As achieved through commit 9195948f ("tipc: improve TIPC throughput
      by Gap ACK blocks"), we apply the same mechanism for the broadcast link
      as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
      consist of two parts built for both the broadcast and unicast types:
      
       31                       16 15                        0
      +-------------+-------------+-------------+-------------+
      |  bgack_cnt  |  ugack_cnt  |            len            |
      +-------------+-------------+-------------+-------------+  -
      |            gap            |            ack            |   |
      +-------------+-------------+-------------+-------------+    > bc gacks
      :                           :                           :   |
      +-------------+-------------+-------------+-------------+  -
      |            gap            |            ack            |   |
      +-------------+-------------+-------------+-------------+    > uc gacks
      :                           :                           :   |
      +-------------+-------------+-------------+-------------+  -
      
      which is "automatically" backward-compatible.
      
      We also increase the max number of Gap ACK blocks to 128, allowing upto
      64 blocks per type (total buffer size = 516 bytes).
      
      Besides, the 'tipc_link_advance_transmq()' function is refactored which
      is applicable for both the unicast and broadcast cases now, so some old
      functions can be removed and the code is optimized.
      
      With the patch, TIPC broadcast is more robust regardless of packet loss
      or disorder, latency, ... in the underlying network. Its performance is
      boost up significantly.
      For example, experiment with a 5% packet loss rate results:
      
      $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
      real    0m 42.46s
      user    0m 1.16s
      sys     0m 17.67s
      
      Without the patch:
      
      $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
      real    8m 27.94s
      user    0m 0.55s
      sys     0m 2.38s
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7626b5a
  5. 27 3月, 2020 1 次提交
  6. 09 11月, 2019 1 次提交
    • T
      tipc: introduce TIPC encryption & authentication · fc1b6d6d
      Tuong Lien 提交于
      This commit offers an option to encrypt and authenticate all messaging,
      including the neighbor discovery messages. The currently most advanced
      algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
      encryption/decryption is done at the bearer layer, just before leaving
      or after entering TIPC.
      
      Supported features:
      - Encryption & authentication of all TIPC messages (header + data);
      - Two symmetric-key modes: Cluster and Per-node;
      - Automatic key switching;
      - Key-expired revoking (sequence number wrapped);
      - Lock-free encryption/decryption (RCU);
      - Asynchronous crypto, Intel AES-NI supported;
      - Multiple cipher transforms;
      - Logs & statistics;
      
      Two key modes:
      - Cluster key mode: One single key is used for both TX & RX in all
      nodes in the cluster.
      - Per-node key mode: Each nodes in the cluster has one specific TX key.
      For RX, a node requires its peers' TX key to be able to decrypt the
      messages from those peers.
      
      Key setting from user-space is performed via netlink by a user program
      (e.g. the iproute2 'tipc' tool).
      
      Internal key state machine:
      
                                       Attach    Align(RX)
                                           +-+   +-+
                                           | V   | V
              +---------+      Attach     +---------+
              |  IDLE   |---------------->| PENDING |(user = 0)
              +---------+                 +---------+
                 A   A                   Switch|  A
                 |   |                         |  |
                 |   | Free(switch/revoked)    |  |
           (Free)|   +----------------------+  |  |Timeout
                 |              (TX)        |  |  |(RX)
                 |                          |  |  |
                 |                          |  v  |
              +---------+      Switch     +---------+
              | PASSIVE |<----------------| ACTIVE  |
              +---------+       (RX)      +---------+
              (user = 1)                  (user >= 1)
      
      The number of TFMs is 10 by default and can be changed via the procfs
      'net/tipc/max_tfms'. At this moment, as for simplicity, this file is
      also used to print the crypto statistics at runtime:
      
      echo 0xfff1 > /proc/sys/net/tipc/max_tfms
      
      The patch defines a new TIPC version (v7) for the encryption message (-
      backward compatibility as well). The message is basically encapsulated
      as follows:
      
         +----------------------------------------------------------+
         | TIPCv7 encryption  | Original TIPCv2    | Authentication |
         | header             | packet (encrypted) | Tag            |
         +----------------------------------------------------------+
      
      The throughput is about ~40% for small messages (compared with non-
      encryption) and ~9% for large messages. With the support from hardware
      crypto i.e. the Intel AES-NI CPU instructions, the throughput increases
      upto ~85% for small messages and ~55% for large messages.
      
      By default, the new feature is inactive (i.e. no encryption) until user
      sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
      in the kernel configuration to enable/disable the new code when needed.
      
      MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc1b6d6d
  7. 04 11月, 2019 1 次提交
    • T
      tipc: improve message bundling algorithm · 06e7c70c
      Tuong Lien 提交于
      As mentioned in commit e95584a8 ("tipc: fix unlimited bundling of
      small messages"), the current message bundling algorithm is inefficient
      that can generate bundles of only one payload message, that causes
      unnecessary overheads for both the sender and receiver.
      
      This commit re-designs the 'tipc_msg_make_bundle()' function (now named
      as 'tipc_msg_try_bundle()'), so that when a message comes at the first
      place, we will just check & keep a reference to it if the message is
      suitable for bundling. The message buffer will be put into the link
      backlog queue and processed as normal. Later on, when another one comes
      we will make a bundle with the first message if possible and so on...
      This way, a bundle if really needed will always consist of at least two
      payload messages. Otherwise, we let the first buffer go its way without
      any need of bundling, so reduce the overheads to zero.
      
      Moreover, since now we have both the messages in hand, we can even
      optimize the 'tipc_msg_bundle()' function, make bundle of a very large
      (size ~ MSS) and small messages which is not with the current algorithm
      e.g. [1400-byte message] + [10-byte message] (MTU = 1500).
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06e7c70c
  8. 31 10月, 2019 1 次提交
    • J
      tipc: add smart nagle feature · c0bceb97
      Jon Maloy 提交于
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue is transmitted when one of
      the following events or conditions occur.
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
      
      This new feature gives a 50-100% improvement of throughput for small
      (i.e., less than MTU size) messages, while it might add up to one RTT
      to latency time when the socket is in nagle mode.
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0bceb97
  9. 30 10月, 2019 1 次提交
    • H
      tipc: improve throughput between nodes in netns · f73b1281
      Hoang Le 提交于
      Currently, TIPC transports intra-node user data messages directly
      socket to socket, hence shortcutting all the lower layers of the
      communication stack. This gives TIPC very good intra node performance,
      both regarding throughput and latency.
      
      We now introduce a similar mechanism for TIPC data traffic across
      network namespaces located in the same kernel. On the send path, the
      call chain is as always accompanied by the sending node's network name
      space pointer. However, once we have reliably established that the
      receiving node is represented by a namespace on the same host, we just
      replace the namespace pointer with the receiving node/namespace's
      ditto, and follow the regular socket receive patch though the receiving
      node. This technique gives us a throughput similar to the node internal
      throughput, several times larger than if we let the traffic go though
      the full network stacks. As a comparison, max throughput for 64k
      messages is four times larger than TCP throughput for the same type of
      traffic.
      
      To meet any security concerns, the following should be noted.
      
      - All nodes joining a cluster are supposed to have been be certified
      and authenticated by mechanisms outside TIPC. This is no different for
      nodes/namespaces on the same host; they have to auto discover each
      other using the attached interfaces, and establish links which are
      supervised via the regular link monitoring mechanism. Hence, a kernel
      local node has no other way to join a cluster than any other node, and
      have to obey to policies set in the IP or device layers of the stack.
      
      - Only when a sender has established with 100% certainty that the peer
      node is located in a kernel local namespace does it choose to let user
      data messages, and only those, take the crossover path to the receiving
      node/namespace.
      
      - If the receiving node/namespace is removed, its namespace pointer
      is invalidated at all peer nodes, and their neighbor link monitoring
      will eventually note that this node is gone.
      
      - To ensure the "100% certainty" criteria, and prevent any possible
      spoofing, received discovery messages must contain a proof that the
      sender knows a common secret. We use the hash mix of the sending
      node/namespace for this purpose, since it can be accessed directly by
      all other namespaces in the kernel. Upon reception of a discovery
      message, the receiver checks this proof against all the local
      namespaces'hash_mix:es. If it finds a match, that, along with a
      matching node id and cluster id, this is deemed sufficient proof that
      the peer node in question is in a local namespace, and a wormhole can
      be opened.
      
      - We should also consider that TIPC is intended to be a cluster local
      IPC mechanism (just like e.g. UNIX sockets) rather than a network
      protocol, and hence we think it can justified to allow it to shortcut the
      lower protocol layers.
      
      Regarding traceability, we should notice that since commit 6c9081a3
      ("tipc: add loopback device tracking") it is possible to follow the node
      internal packet flow by just activating tcpdump on the loopback
      interface. This will be true even for this mechanism; by activating
      tcpdump on the involved nodes' loopback interfaces their inter-name
      space messaging can easily be tracked.
      
      v2:
      - update 'net' pointer when node left/rejoined
      v3:
      - grab read/write lock when using node ref obj
      v4:
      - clone traffics between netns to loopback
      Suggested-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f73b1281
  10. 17 8月, 2019 1 次提交
    • T
      tipc: fix false detection of retransmit failures · 71204231
      Tuong Lien 提交于
      This commit eliminates the use of the link 'stale_limit' & 'prev_from'
      (besides the already removed - 'stale_cnt') variables in the detection
      of repeated retransmit failures as there is no proper way to initialize
      them to avoid a false detection, i.e. it is not really a retransmission
      failure but due to a garbage values in the variables.
      
      Instead, a jiffies variable will be added to individual skbs (like the
      way we restrict the skb retransmissions) in order to mark the first skb
      retransmit time. Later on, at the next retransmissions, the timestamp
      will be checked to see if the skb in the link transmq is "too stale",
      that is, the link tolerance time has passed, so that a link reset will
      be ordered. Note, just checking on the first skb in the queue is fine
      enough since it must be the oldest one.
      A counter is also added to keep track the actual skb retransmissions'
      number for later checking when the failure happens.
      
      The downside of this approach is that the skb->cb[] buffer is about to
      be exhausted, however it is always able to allocate another memory area
      and keep a reference to it when needed.
      
      Fixes: 77cf8edb ("tipc: simplify stale link failure criteria")
      Reported-by: NHoang Le <hoang.h.le@dektech.com.au>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      71204231
  11. 26 7月, 2019 2 次提交
    • T
      tipc: fix changeover issues due to large packet · 2320bcda
      Tuong Lien 提交于
      In conjunction with changing the interfaces' MTU (e.g. especially in
      the case of a bonding) where the TIPC links are brought up and down
      in a short time, a couple of issues were detected with the current link
      changeover mechanism:
      
      1) When one link is up but immediately forced down again, the failover
      procedure will be carried out in order to failover all the messages in
      the link's transmq queue onto the other working link. The link and node
      state is also set to FAILINGOVER as part of the process. The message
      will be transmited in form of a FAILOVER_MSG, so its size is plus of 40
      bytes (= the message header size). There is no problem if the original
      message size is not larger than the link's MTU - 40, and indeed this is
      the max size of a normal payload messages. However, in the situation
      above, because the link has just been up, the messages in the link's
      transmq are almost SYNCH_MSGs which had been generated by the link
      synching procedure, then their size might reach the max value already!
      When the FAILOVER_MSG is built on the top of such a SYNCH_MSG, its size
      will exceed the link's MTU. As a result, the messages are dropped
      silently and the failover procedure will never end up, the link will
      not be able to exit the FAILINGOVER state, so cannot be re-established.
      
      2) The same scenario above can happen more easily in case the MTU of
      the links is set differently or when changing. In that case, as long as
      a large message in the failure link's transmq queue was built and
      fragmented with its link's MTU > the other link's one, the issue will
      happen (there is no need of a link synching in advance).
      
      3) The link synching procedure also faces with the same issue but since
      the link synching is only started upon receipt of a SYNCH_MSG, dropping
      the message will not result in a state deadlock, but it is not expected
      as design.
      
      The 1) & 3) issues are resolved by the last commit that only a dummy
      SYNCH_MSG (i.e. without data) is generated at the link synching, so the
      size of a FAILOVER_MSG if any then will never exceed the link's MTU.
      
      For the 2) issue, the only solution is trying to fragment the messages
      in the failure link's transmq queue according to the working link's MTU
      so they can be failovered then. A new function is made to accomplish
      this, it will still be a TUNNEL PROTOCOL/FAILOVER MSG but if the
      original message size is too large, it will be fragmented & reassembled
      at the receiving side.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2320bcda
    • T
      tipc: optimize link synching mechanism · 4929a932
      Tuong Lien 提交于
      This commit along with the next one are to resolve the issues with the
      link changeover mechanism. See that commit for details.
      
      Basically, for the link synching, from now on, we will send only one
      single ("dummy") SYNCH message to peer. The SYNCH message does not
      contain any data, just a header conveying the synch point to the peer.
      
      A new node capability flag ("TIPC_TUNNEL_ENHANCED") is introduced for
      backward compatible!
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Suggested-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4929a932
  12. 26 6月, 2019 1 次提交
  13. 05 4月, 2019 2 次提交
    • T
      tipc: reduce duplicate packets for unicast traffic · 382f598f
      Tuong Lien 提交于
      For unicast transmission, the current NACK sending althorithm is over-
      active that forces the sending side to retransmit a packet that is not
      really lost but just arrived at the receiving side with some delay, or
      even retransmit same packets that have already been retransmitted
      before. As a result, many duplicates are observed also under normal
      condition, ie. without packet loss.
      
      One example case is: node1 transmits 1 2 3 4 10 5 6 7 8 9, when node2
      receives packet #10, it puts into the deferdq. When the packet #5 comes
      it sends NACK with gap [6 - 9]. However, shortly after that, when
      packet #6 arrives, it pulls out packet #10 from the deferfq, but it is
      still out of order, so it makes another NACK with gap [7 - 9] and so on
      ... Finally, node1 has to retransmit the packets 5 6 7 8 9 a number of
      times, but in fact all the packets are not lost at all, so duplicates!
      
      This commit reduces duplicates by changing the condition to send NACK,
      also restricting the retransmissions on individual packets via a timer
      of about 1ms. However, it also needs to say that too tricky condition
      for NACKs or too long timeout value for retransmissions will result in
      performance reducing! The criterias in this commit are found to be
      effective for both the requirements to reduce duplicates but not affect
      performance.
      
      The tipc_link_rcv() is also improved to only dequeue skb from the link
      deferdq if it is expected (ie. its seqno <= rcv_nxt).
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      382f598f
    • T
      tipc: improve TIPC throughput by Gap ACK blocks · 9195948f
      Tuong Lien 提交于
      During unicast link transmission, it's observed very often that because
      of one or a few lost/dis-ordered packets, the sending side will fastly
      reach the send window limit and must wait for the packets to be arrived
      at the receiving side or in the worst case, a retransmission must be
      done first. The sending side cannot release a lot of subsequent packets
      in its transmq even though all of them might have already been received
      by the receiving side.
      That is, one or two packets dis-ordered/lost and dozens of packets have
      to wait, this obviously reduces the overall throughput!
      
      This commit introduces an algorithm to overcome this by using "Gap ACK
      blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
      that describes the link deferdq where packets have been got by the
      receiving side but with gaps, for example:
      
            link deferdq: [1 2 3 4      10 11      13 14 15       20]
      --> Gap ACK blocks:       <4, 5>,   <11, 1>,      <15, 4>, <20, 0>
      
      The Gap ACK blocks will be sent to the sending side along with the
      traditional ACK or NACK message. Immediately when receiving the message
      the sending side will now not only release from its transmq the packets
      ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
      be enqueued and transmitted.
      In addition, the sending side can now do "multi-retransmissions"
      according to the Gaps reported in the Gap ACK blocks.
      
      The new algorithm as verified helps greatly improve the TIPC throughput
      especially under packet loss condition.
      
      So far, a maximum of 32 blocks is quite enough without any "Too few Gap
      ACK blocks" reports with a 5.0% packet loss rate, however this number
      can be increased in the furture if needed.
      
      Also, the patch is backward compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9195948f
  14. 20 3月, 2019 1 次提交
    • H
      tipc: smooth change between replicast and broadcast · c55c8eda
      Hoang Le 提交于
      Currently, a multicast stream may start out using replicast, because
      there are few destinations, and then it should ideally switch to
      L2/broadcast IGMP/multicast when the number of destinations grows beyond
      a certain limit. The opposite should happen when the number decreases
      below the limit.
      
      To eliminate the risk of message reordering caused by method change,
      a sending socket must stick to a previously selected method until it
      enters an idle period of 5 seconds. Means there is a 5 seconds pause
      in the traffic from the sender socket.
      
      If the sender never makes such a pause, the method will never change,
      and transmission may become very inefficient as the cluster grows.
      
      With this commit, we allow such a switch between replicast and
      broadcast without any need for a traffic pause.
      
      Solution is to send a dummy message with only the header, also with
      the SYN bit set, via broadcast or replicast. For the data message,
      the SYN bit is set and sending via replicast or broadcast (inverse
      method with dummy).
      
      Then, at receiving side any messages follow first SYN bit message
      (data or dummy message), they will be held in deferred queue until
      another pair (dummy or data message) arrived in other link.
      
      v2: reverse christmas tree declaration
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c55c8eda
  15. 12 2月, 2019 1 次提交
    • T
      tipc: fix link session and re-establish issues · 91986ee1
      Tuong Lien 提交于
      When a link endpoint is re-created (e.g. after a node reboot or
      interface reset), the link session number is varied by random, the peer
      endpoint will be synced with this new session number before the link is
      re-established.
      
      However, there is a shortcoming in this mechanism that can lead to the
      link never re-established or faced with a failure then. It happens when
      the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as
      well as the 'in_session' flag have been set, but suddenly this link
      endpoint leaves. When it comes back with a random session number, there
      are two situations possible:
      
      1/ If the random session number is larger than (or equal to) the
      previous one, the peer endpoint will be updated with this new session
      upon receipt of a RESET_MSG from this endpoint, and the link can be re-
      established as normal. Otherwise, all the RESET_MSGs from this endpoint
      will be rejected by the peer. In turn, when this link endpoint receives
      one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start
      to send STATE_MSGs, but again these messages will be dropped by the
      peer due to wrong session.
      The peer link endpoint can still become ESTABLISHED after receiving a
      traffic message from this endpoint (e.g. a BCAST_PROTOCOL or
      NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link
      will be forced down sooner or later!
      
      Even in case the random session number is larger than the previous one,
      it can be that the ACTIVATE_MSG from the peer arrives first, and this
      link endpoint moves quickly to ESTABLISHED without sending out any
      RESET_MSG yet. Consequently, the peer link will not be updated with the
      new session number, and the same link failure scenario as above will
      happen.
      
      2/ Another situation can be that, the peer link endpoint was reset due
      to any reasons in the meantime, its link state was set to RESET from
      ESTABLISHING but still in session, i.e. the 'in_session' flag is not
      reset...
      Now, if the random session number from this endpoint is less than the
      previous one, all the RESET_MSGs from this endpoint will be rejected by
      the peer. In the other direction, when this link endpoint receives a
      RESET_MSG from the peer, it moves to ESTABLISHING and starts to send
      ACTIVATE_MSGs, but all these messages will be rejected by the peer too.
      As a result, the link cannot be re-established but gets stuck with this
      link endpoint in state ESTABLISHING and the peer in RESET!
      
      Solution:
      
      ===========
      
      This link endpoint should not go directly to ESTABLISHED when getting
      ACTIVATE_MSG from the peer which may belong to the old session if the
      link was re-created. To ensure the session to be correct before the
      link is re-established, the peer endpoint in ESTABLISHING state will
      send back the last session number in ACTIVATE_MSG for a verification at
      this endpoint. Then, if needed, a new and more appropriate session
      number will be regenerated to force a re-synch first.
      
      In addition, when a link in ESTABLISHING state is reset, its state will
      move to RESET according to the link FSM, along with resetting the
      'in_session' flag (and the other data) as a normal link reset, it will
      also be deleted if requested.
      
      The solution is backward compatible.
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91986ee1
  16. 12 11月, 2018 1 次提交
  17. 30 9月, 2018 2 次提交
  18. 24 3月, 2018 1 次提交
    • J
      tipc: handle collisions of 32-bit node address hash values · 25b0b9c4
      Jon Maloy 提交于
      When a 32-bit node address is generated from a 128-bit identifier,
      there is a risk of collisions which must be discovered and handled.
      
      We do this as follows:
      - We don't apply the generated address immediately to the node, but do
        instead initiate a 1 sec trial period to allow other cluster members
        to discover and handle such collisions.
      
      - During the trial period the node periodically sends out a new type
        of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
        to all the other nodes in the cluster.
      
      - When a node is receiving such a message, it must check that the
        presented 32-bit identifier either is unused, or was used by the very
        same peer in a previous session. In both cases it accepts the request
        by not responding to it.
      
      - If it finds that the same node has been up before using a different
        address, it responds with a DSC_TRIAL_FAIL_MSG containing that
        address.
      
      - If it finds that the address has already been taken by some other
        node, it generates a new, unused address and returns it to the
        requester.
      
      - During the trial period the requesting node must always be prepared
        to accept a failure message, i.e., a message where a peer suggests a
        different (or equal)  address to the one tried. In those cases it
        must apply the suggested value as trial address and restart the trial
        period.
      
      This algorithm ensures that in the vast majority of cases a node will
      have the same address before and after a reboot. If a legacy user
      configures the address explicitly, there will be no trial period and
      messages, so this protocol addition is completely backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25b0b9c4
  19. 02 12月, 2017 1 次提交
    • J
      tipc: fall back to smaller MTU if allocation of local send skb fails · 4c94cc2d
      Jon Maloy 提交于
      When sending node local messages the code is using an 'mtu' of 66060
      bytes to avoid unnecessary fragmentation. During situations of low
      memory tipc_msg_build() may sometimes fail to allocate such large
      buffers, resulting in unnecessary send failures. This can easily be
      remedied by falling back to a smaller MTU, and then reassemble the
      buffer chain as if the message were arriving from a remote node.
      
      At the same time, we change the initial MTU setting of the broadcast
      link to a lower value, so that large messages always are fragmented
      into smaller buffers even when we run in single node mode. Apart from
      obtaining the same advantage as for the 'fallback' solution above, this
      turns out to give a significant performance improvement. This can
      probably be explained with the __pskb_copy() operation performed on the
      buffer for each recipient during reception. We found the optimal value
      for this, considering the most relevant skb pool, to be 3744 bytes.
      Acked-by: NYing Xue <ying.xue@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c94cc2d
  20. 16 11月, 2017 1 次提交
    • J
      tipc: enforce valid ratio between skb truesize and contents · d618d09a
      Jon Maloy 提交于
      The socket level flow control is based on the assumption that incoming
      buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
      where the latter value is rounded off upwards to the nearest 1k number.
      This does empirically hold true for the device drivers we know, but we
      cannot trust that it will always be so, e.g., in a system with jumbo
      frames and very small packets.
      
      We now introduce a check for this condition at packet arrival, and if
      we find it to be false, we copy the packet to a new, smaller buffer,
      where the condition will be true. We expect this to affect only a small
      fraction of all incoming packets, if at all.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d618d09a
  21. 11 11月, 2017 1 次提交
    • J
      tipc: improve link resiliency when rps is activated · 8d6e79d3
      Jon Maloy 提交于
      Currently, the TIPC RPS dissector is based only on the incoming packets'
      source node address, hence steering all traffic from a node to the same
      core. We have seen that this makes the links vulnerable to starvation
      and unnecessary resets when we turn down the link tolerance to very low
      values.
      
      To reduce the risk of this happening, we exempt probe and probe replies
      packets from the convergence to one core per source node. Instead, we do
      the opposite, - we try to diverge those packets across as many cores as
      possible, by randomizing the flow selector key.
      
      To make such packets identifiable to the dissector, we add a new
      'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
      set both for PROBE and PROBE_REPLY messages, and only for those.
      
      It should be noted that these packets are not part of any flow anyway,
      and only constitute a minuscule fraction of all packets sent across a
      link. Hence, there is no risk that this will affect overall performance.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d6e79d3
  22. 13 10月, 2017 10 次提交
    • J
      tipc: add multipoint-to-point flow control · 04d7b574
      Jon Maloy 提交于
      We already have point-to-multipoint flow control within a group. But
      we even need the opposite; -a scheme which can handle that potentially
      hundreds of sources may try to send messages to the same destination
      simultaneously without causing buffer overflow at the recipient. This
      commit adds such a mechanism.
      
      The algorithm works as follows:
      
      - When a member detects a new, joining member, it initially set its
        state to JOINED and advertises a minimum window to the new member.
        This window is chosen so that the new member can send exactly one
        maximum sized message, or several smaller ones, to the recipient
        before it must stop and wait for an additional advertisement. This
        minimum window ADV_IDLE is set to 65 1kB blocks.
      
      - When a member receives the first data message from a JOINED member,
        it changes the state of the latter to ACTIVE, and advertises a larger
        window ADV_ACTIVE = 12 x ADV_IDLE blocks to the sender, so it can
        continue sending with minimal disturbances to the data flow.
      
      - The active members are kept in a dedicated linked list. Each time a
        message is received from an active member, it will be moved to the
        tail of that list. This way, we keep a record of which members have
        been most (tail) and least (head) recently active.
      
      - There is a maximum number (16) of permitted simultaneous active
        senders per receiver. When this limit is reached, the receiver will
        not advertise anything immediately to a new sender, but instead put
        it in a PENDING state, and add it to a corresponding queue. At the
        same time, it will pick the least recently active member, send it an
        advertisement RECLAIM message, and set this member to state
        RECLAIMING.
      
      - The reclaimee member has to respond with a REMIT message, meaning that
        it goes back to a send window of ADV_IDLE, and returns its unused
        advertised blocks beyond that value to the reclaiming member.
      
      - When the reclaiming member receives the REMIT message, it unlinks
        the reclaimee from its active list, resets its state to JOINED, and
        notes that it is now back at ADV_IDLE advertised blocks to that
        member. If there are still unread data messages sent out by
        reclaimee before the REMIT, the member goes into an intermediate
        state REMITTED, where it stays until the said messages have been
        consumed.
      
      - The returned advertised blocks can now be re-advertised to the
        pending member, which is now set to state ACTIVE and added to
        the active member list.
      
      - To be proactive, i.e., to minimize the risk that any member will
        end up in the pending queue, we start reclaiming resources already
        when the number of active members exceeds 3/4 of the permitted
        maximum.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04d7b574
    • J
      tipc: guarantee that group broadcast doesn't bypass group unicast · 2f487712
      Jon Maloy 提交于
      We need a mechanism guaranteeing that group unicasts sent out from a
      socket are not bypassed by later sent broadcasts from the same socket.
      We do this as follows:
      
      - Each time a unicast is sent, we set a the broadcast method for the
        socket to "replicast" and "mandatory". This forces the first
        subsequent broadcast message to follow the same network and data path
        as the preceding unicast to a destination, hence preventing it from
        overtaking the latter.
      
      - In order to make the 'same data path' statement above true, we let
        group unicasts pass through the multicast link input queue, instead
        of as previously through the unicast link input queue.
      
      - In the first broadcast following a unicast, we set a new header flag,
        requiring all recipients to immediately acknowledge its reception.
      
      - During the period before all the expected acknowledges are received,
        the socket refuses to accept any more broadcast attempts, i.e., by
        blocking or returning EAGAIN. This period should typically not be
        longer than a few microseconds.
      
      - When all acknowledges have been received, the sending socket will
        open up for subsequent broadcasts, this time giving the link layer
        freedom to itself select the best transmission method.
      
      - The forced and/or abrupt transmission method changes described above
        may lead to broadcasts arriving out of order to the recipients. We
        remedy this by introducing code that checks and if necessary
        re-orders such messages at the receiving end.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f487712
    • J
      tipc: introduce group multicast messaging · 5b8dddb6
      Jon Maloy 提交于
      The previously introduced message transport to all group members is
      based on the tipc multicast service, but is logically a broadcast
      service within the group, and that is what we call it.
      
      We now add functionality for sending messages to all group members
      having a certain identity. Correspondingly, we call this feature 'group
      multicast'. The service is using unicast when only one destination is
      found, otherwise it will use the bearer broadcast service to transfer
      the messages. In the latter case, the receiving members filter arriving
      messages by looking at the intended destination instance. If there is
      no match, the message will be dropped, while still being considered
      received and read as seen by the flow control mechanism.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b8dddb6
    • J
      tipc: introduce group unicast messaging · 27bd9ec0
      Jon Maloy 提交于
      We now make it possible to send connectionless unicast messages
      within a communication group. To send a message, the sender can use
      either a direct port address, aka port identity, or an indirect port
      name to be looked up.
      
      This type of messages are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27bd9ec0
    • J
      tipc: introduce flow control for group broadcast messages · b7d42635
      Jon Maloy 提交于
      We introduce an end-to-end flow control mechanism for group broadcast
      messages. This ensures that no messages are ever lost because of
      destination receive buffer overflow, with minimal impact on performance.
      For now, the algorithm is based on the assumption that there is only one
      active transmitter at any moment in time.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7d42635
    • J
      tipc: receive group membership events via member socket · ae236fb2
      Jon Maloy 提交于
      Like with any other service, group members' availability can be
      subscribed for by connecting to be topology server. However, because
      the events arrive via a different socket than the member socket, there
      is a real risk that membership events my arrive out of synch with the
      actual JOIN/LEAVE action. I.e., it is possible to receive the first
      messages from a new member before the corresponding JOIN event arrives,
      just as it is possible to receive the last messages from a leaving
      member after the LEAVE event has already been received.
      
      Since each member socket is internally also subscribing for membership
      events, we now fix this problem by passing those events on to the user
      via the member socket. We leverage the already present member synch-
      ronization protocol to guarantee correct message/event order. An event
      is delivered to the user as an empty message where the two source
      addresses identify the new/lost member. Furthermore, we set the MSG_OOB
      bit in the message flags to mark it as an event. If the event is an
      indication about a member loss we also set the MSG_EOR bit, so it can
      be distinguished from a member addition event.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae236fb2
    • J
      tipc: add second source address to recvmsg()/recvfrom() · 31c82a2d
      Jon Maloy 提交于
      With group communication, it becomes important for a message receiver to
      identify not only from which socket (identfied by a node:port tuple) the
      message was sent, but also the logical identity (type:instance) of the
      sending member.
      
      We fix this by adding a second instance of struct sockaddr_tipc to the
      source address area when a message is read. The extra address struct
      is filled in with data found in the received message header (type,) and
      in the local member representation struct (instance.)
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31c82a2d
    • J
      tipc: introduce communication groups · 75da2163
      Jon Maloy 提交于
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as an logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and if wants to be eligible as destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be member of one group at a time.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75da2163
    • J
      tipc: refactor function filter_rcv() · 64ac5f59
      Jon Maloy 提交于
      In the following commits we will need to handle multiple incoming and
      rejected/returned buffers in the function socket.c::filter_rcv().
      As a preparation for this, we generalize the function by handling
      buffer queues instead of individual buffers. We also introduce a
      help function tipc_skb_reject(), and rename filter_rcv() to
      tipc_sk_filter_rcv() in line with other functions in socket.c.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64ac5f59
    • J
      tipc: add ability to order and receive topology events in driver · 14c04493
      Jon Maloy 提交于
      As preparation for introducing communication groups, we add the ability
      to issue topology subscriptions and receive topology events from kernel
      space. This will make it possible for group member sockets to keep track
      of other group members.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14c04493
  23. 21 1月, 2017 1 次提交
  24. 17 1月, 2017 1 次提交
  25. 04 1月, 2017 1 次提交
    • J
      tipc: reduce risk of user starvation during link congestion · 365ad353
      Jon Paul Maloy 提交于
      The socket code currently handles link congestion by either blocking
      and trying to send again when the congestion has abated, or just
      returning to the user with -EAGAIN and let him re-try later.
      
      This mechanism is prone to starvation, because the wakeup algorithm is
      non-atomic. During the time the link issues a wakeup signal, until the
      socket wakes up and re-attempts sending, other senders may have come
      in between and occupied the free buffer space in the link. This in turn
      may lead to a socket having to make many send attempts before it is
      successful. In extremely loaded systems we have observed latency times
      of several seconds before a low-priority socket is able to send out a
      message.
      
      In this commit, we simplify this mechanism and reduce the risk of the
      described scenario happening. When a message is attempted sent via a
      congested link, we now let it be added to the link's backlog queue
      anyway, thus permitting an oversubscription of one message per source
      socket. We still create a wakeup item and return an error code, hence
      instructing the sender to block or stop sending. Only when enough space
      has been freed up in the link's backlog queue do we issue a wakeup event
      that allows the sender to continue with the next message, if any.
      
      The fact that a socket now can consider a message sent even when the
      link returns a congestion code means that the sending socket code can
      be simplified. Also, since this is a good opportunity to get rid of the
      obsolete 'mtu change' condition in the three socket send functions, we
      now choose to refactor those functions completely.
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      365ad353