1. 19 9月, 2020 1 次提交
    • T
      tipc: add automatic session key exchange · 1ef6f7c9
      Tuong Lien 提交于
      With support from the master key option in the previous commit, it
      becomes easy to make frequent updates/exchanges of session keys between
      authenticated cluster nodes.
      Basically, there are two situations where the key exchange will take in
      place:
      
      - When a new node joins the cluster (with the master key), it will need
        to get its peer's TX key, so that be able to decrypt further messages
        from that peer.
      
      - When a new session key is generated (by either user manual setting or
        later automatic rekeying feature), the key will be distributed to all
        peer nodes in the cluster.
      
      A key to be exchanged is encapsulated in the data part of a 'MSG_CRYPTO
      /KEY_DISTR_MSG' TIPC v2 message, then xmit-ed as usual and encrypted by
      using the master key before sending out. Upon receipt of the message it
      will be decrypted in the same way as regular messages, then attached as
      the sender's RX key in the receiver node.
      
      In this way, the key exchange is reliable by the link layer, as well as
      security, integrity and authenticity by the crypto layer.
      
      Also, the forward security will be easily achieved by user changing the
      master key actively but this should not be required very frequently.
      
      The key exchange feature is independent on the presence of a master key
      Note however that the master key still is needed for new nodes to be
      able to join the cluster. It is also optional, and can be turned off/on
      via the sysfs: 'net/tipc/key_exchange_enabled' [default 1: enabled].
      
      Backward compatibility is guaranteed because for nodes that do not have
      master key support, key exchange using master key ie. tx_key = 0 if any
      will be shortly discarded at the message validation step. In other
      words, the key exchange feature will be automatically disabled to those
      nodes.
      
      v2: fix the "implicit declaration of function 'tipc_crypto_key_flush'"
      error in node.c. The function only exists when built with the TIPC
      "CONFIG_TIPC_CRYPTO" option.
      
      v3: use 'info->extack' for a message emitted due to netlink operations
      instead (- David's comment).
      Reported-by: Nkernel test robot <lkp@intel.com>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ef6f7c9
  2. 17 6月, 2020 1 次提交
    • H
      tipc: update a binding service via broadcast · cad2929d
      Hoang Huu Le 提交于
      Currently, updating binding table (add service binding to
      name table/withdraw a service binding) is being sent over replicast.
      However, if we are scaling up clusters to > 100 nodes/containers this
      method is less affection because of looping through nodes in a cluster one
      by one.
      
      It is worth to use broadcast to update a binding service. This way, the
      binding table can be updated on all peer nodes in one shot.
      
      Broadcast is used when all peer nodes, as indicated by a new capability
      flag TIPC_NAMED_BCAST, support reception of this message type.
      
      Four problems need to be considered when introducing this feature.
      1) When establishing a link to a new peer node we still update this by a
      unicast 'bulk' update. This may lead to race conditions, where a later
      broadcast publication/withdrawal bypass the 'bulk', resulting in
      disordered publications, or even that a withdrawal may arrive before the
      corresponding publication. We solve this by adding an 'is_last_bulk' bit
      in the last bulk messages so that it can be distinguished from all other
      messages. Only when this message has arrived do we open up for reception
      of broadcast publications/withdrawals.
      
      2) When a first legacy node is added to the cluster all distribution
      will switch over to use the legacy 'replicast' method, while the
      opposite happens when the last legacy node leaves the cluster. This
      entails another risk of message disordering that has to be handled. We
      solve this by adding a sequence number to the broadcast/replicast
      messages, so that disordering can be discovered and corrected. Note
      however that we don't need to consider potential message loss or
      duplication at this protocol level.
      
      3) Bulk messages don't contain any sequence numbers, and will always
      arrive in order. Hence we must exempt those from the sequence number
      control and deliver them unconditionally. We solve this by adding a new
      'is_bulk' bit in those messages so that they can be recognized.
      
      4) Legacy messages, which don't contain any new bits or sequence
      numbers, but neither can arrive out of order, also need to be exempt
      from the initial synchronization and sequence number check, and
      delivered unconditionally. Therefore, we add another 'is_not_legacy' bit
      to all new messages so that those can be distinguished from legacy
      messages and the latter delivered directly.
      
      v1->v2:
       - fix warning issue reported by kbuild test robot <lkp@intel.com>
       - add santiy check to drop the publication message with a sequence
      number that is lower than the agreed synch point
      Signed-off-by: Nkernel test robot <lkp@intel.com>
      Signed-off-by: NHoang Huu Le <hoang.h.le@dektech.com.au>
      Acked-by: NJon Maloy <jmaloy@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cad2929d
  3. 09 11月, 2019 3 次提交
    • T
      tipc: add support for AEAD key setting via netlink · e1f32190
      Tuong Lien 提交于
      This commit adds two netlink commands to TIPC in order for user to be
      able to set or remove AEAD keys:
      - TIPC_NL_KEY_SET
      - TIPC_NL_KEY_FLUSH
      
      When the 'KEY_SET' is given along with the key data, the key will be
      initiated and attached to TIPC crypto. On the other hand, the
      'KEY_FLUSH' command will remove all existing keys if any.
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1f32190
    • T
      tipc: introduce TIPC encryption & authentication · fc1b6d6d
      Tuong Lien 提交于
      This commit offers an option to encrypt and authenticate all messaging,
      including the neighbor discovery messages. The currently most advanced
      algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
      encryption/decryption is done at the bearer layer, just before leaving
      or after entering TIPC.
      
      Supported features:
      - Encryption & authentication of all TIPC messages (header + data);
      - Two symmetric-key modes: Cluster and Per-node;
      - Automatic key switching;
      - Key-expired revoking (sequence number wrapped);
      - Lock-free encryption/decryption (RCU);
      - Asynchronous crypto, Intel AES-NI supported;
      - Multiple cipher transforms;
      - Logs & statistics;
      
      Two key modes:
      - Cluster key mode: One single key is used for both TX & RX in all
      nodes in the cluster.
      - Per-node key mode: Each nodes in the cluster has one specific TX key.
      For RX, a node requires its peers' TX key to be able to decrypt the
      messages from those peers.
      
      Key setting from user-space is performed via netlink by a user program
      (e.g. the iproute2 'tipc' tool).
      
      Internal key state machine:
      
                                       Attach    Align(RX)
                                           +-+   +-+
                                           | V   | V
              +---------+      Attach     +---------+
              |  IDLE   |---------------->| PENDING |(user = 0)
              +---------+                 +---------+
                 A   A                   Switch|  A
                 |   |                         |  |
                 |   | Free(switch/revoked)    |  |
           (Free)|   +----------------------+  |  |Timeout
                 |              (TX)        |  |  |(RX)
                 |                          |  |  |
                 |                          |  v  |
              +---------+      Switch     +---------+
              | PASSIVE |<----------------| ACTIVE  |
              +---------+       (RX)      +---------+
              (user = 1)                  (user >= 1)
      
      The number of TFMs is 10 by default and can be changed via the procfs
      'net/tipc/max_tfms'. At this moment, as for simplicity, this file is
      also used to print the crypto statistics at runtime:
      
      echo 0xfff1 > /proc/sys/net/tipc/max_tfms
      
      The patch defines a new TIPC version (v7) for the encryption message (-
      backward compatibility as well). The message is basically encapsulated
      as follows:
      
         +----------------------------------------------------------+
         | TIPCv7 encryption  | Original TIPCv2    | Authentication |
         | header             | packet (encrypted) | Tag            |
         +----------------------------------------------------------+
      
      The throughput is about ~40% for small messages (compared with non-
      encryption) and ~9% for large messages. With the support from hardware
      crypto i.e. the Intel AES-NI CPU instructions, the throughput increases
      upto ~85% for small messages and ~55% for large messages.
      
      By default, the new feature is inactive (i.e. no encryption) until user
      sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
      in the kernel configuration to enable/disable the new code when needed.
      
      MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc1b6d6d
    • T
      tipc: enable creating a "preliminary" node · 4cbf8ac2
      Tuong Lien 提交于
      When user sets RX key for a peer not existing on the own node, a new
      node entry is needed to which the RX key will be attached. However,
      since the peer node address (& capabilities) is unknown at that moment,
      only the node-ID is provided, this commit allows the creation of a node
      with only the data that we call as “preliminary”.
      
      A preliminary node is not the object of the “tipc_node_find()” but the
      “tipc_node_find_by_id()”. Once the first message i.e. LINK_CONFIG comes
      from that peer, and is successfully decrypted by the own node, the
      actual peer node data will be properly updated and the node will
      function as usual.
      
      In addition, the node timer always starts when a node object is created
      so if a preliminary node is not used, it will be cleaned up.
      
      The later encryption functions will also use the node timer and be able
      to create a preliminary node automatically when needed.
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cbf8ac2
  4. 31 10月, 2019 1 次提交
    • J
      tipc: add smart nagle feature · c0bceb97
      Jon Maloy 提交于
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue is transmitted when one of
      the following events or conditions occur.
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
      
      This new feature gives a 50-100% improvement of throughput for small
      (i.e., less than MTU size) messages, while it might add up to one RTT
      to latency time when the socket is in nagle mode.
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0bceb97
  5. 30 10月, 2019 1 次提交
    • H
      tipc: improve throughput between nodes in netns · f73b1281
      Hoang Le 提交于
      Currently, TIPC transports intra-node user data messages directly
      socket to socket, hence shortcutting all the lower layers of the
      communication stack. This gives TIPC very good intra node performance,
      both regarding throughput and latency.
      
      We now introduce a similar mechanism for TIPC data traffic across
      network namespaces located in the same kernel. On the send path, the
      call chain is as always accompanied by the sending node's network name
      space pointer. However, once we have reliably established that the
      receiving node is represented by a namespace on the same host, we just
      replace the namespace pointer with the receiving node/namespace's
      ditto, and follow the regular socket receive patch though the receiving
      node. This technique gives us a throughput similar to the node internal
      throughput, several times larger than if we let the traffic go though
      the full network stacks. As a comparison, max throughput for 64k
      messages is four times larger than TCP throughput for the same type of
      traffic.
      
      To meet any security concerns, the following should be noted.
      
      - All nodes joining a cluster are supposed to have been be certified
      and authenticated by mechanisms outside TIPC. This is no different for
      nodes/namespaces on the same host; they have to auto discover each
      other using the attached interfaces, and establish links which are
      supervised via the regular link monitoring mechanism. Hence, a kernel
      local node has no other way to join a cluster than any other node, and
      have to obey to policies set in the IP or device layers of the stack.
      
      - Only when a sender has established with 100% certainty that the peer
      node is located in a kernel local namespace does it choose to let user
      data messages, and only those, take the crossover path to the receiving
      node/namespace.
      
      - If the receiving node/namespace is removed, its namespace pointer
      is invalidated at all peer nodes, and their neighbor link monitoring
      will eventually note that this node is gone.
      
      - To ensure the "100% certainty" criteria, and prevent any possible
      spoofing, received discovery messages must contain a proof that the
      sender knows a common secret. We use the hash mix of the sending
      node/namespace for this purpose, since it can be accessed directly by
      all other namespaces in the kernel. Upon reception of a discovery
      message, the receiver checks this proof against all the local
      namespaces'hash_mix:es. If it finds a match, that, along with a
      matching node id and cluster id, this is deemed sufficient proof that
      the peer node in question is in a local namespace, and a wormhole can
      be opened.
      
      - We should also consider that TIPC is intended to be a cluster local
      IPC mechanism (just like e.g. UNIX sockets) rather than a network
      protocol, and hence we think it can justified to allow it to shortcut the
      lower protocol layers.
      
      Regarding traceability, we should notice that since commit 6c9081a3
      ("tipc: add loopback device tracking") it is possible to follow the node
      internal packet flow by just activating tcpdump on the loopback
      interface. This will be true even for this mechanism; by activating
      tcpdump on the involved nodes' loopback interfaces their inter-name
      space messaging can easily be tracked.
      
      v2:
      - update 'net' pointer when node left/rejoined
      v3:
      - grab read/write lock when using node ref obj
      v4:
      - clone traffics between netns to loopback
      Suggested-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f73b1281
  6. 26 7月, 2019 1 次提交
  7. 05 4月, 2019 1 次提交
    • T
      tipc: improve TIPC throughput by Gap ACK blocks · 9195948f
      Tuong Lien 提交于
      During unicast link transmission, it's observed very often that because
      of one or a few lost/dis-ordered packets, the sending side will fastly
      reach the send window limit and must wait for the packets to be arrived
      at the receiving side or in the worst case, a retransmission must be
      done first. The sending side cannot release a lot of subsequent packets
      in its transmq even though all of them might have already been received
      by the receiving side.
      That is, one or two packets dis-ordered/lost and dozens of packets have
      to wait, this obviously reduces the overall throughput!
      
      This commit introduces an algorithm to overcome this by using "Gap ACK
      blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
      that describes the link deferdq where packets have been got by the
      receiving side but with gaps, for example:
      
            link deferdq: [1 2 3 4      10 11      13 14 15       20]
      --> Gap ACK blocks:       <4, 5>,   <11, 1>,      <15, 4>, <20, 0>
      
      The Gap ACK blocks will be sent to the sending side along with the
      traditional ACK or NACK message. Immediately when receiving the message
      the sending side will now not only release from its transmq the packets
      ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
      be enqueued and transmitted.
      In addition, the sending side can now do "multi-retransmissions"
      according to the Gaps reported in the Gap ACK blocks.
      
      The new algorithm as verified helps greatly improve the TIPC throughput
      especially under packet loss condition.
      
      So far, a maximum of 32 blocks is quite enough without any "Too few Gap
      ACK blocks" reports with a 5.0% packet loss rate, however this number
      can be increased in the furture if needed.
      
      Also, the patch is backward compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9195948f
  8. 20 3月, 2019 1 次提交
  9. 20 12月, 2018 1 次提交
    • T
      tipc: enable tracepoints in tipc · b4b9771b
      Tuong Lien 提交于
      As for the sake of debugging/tracing, the commit enables tracepoints in
      TIPC along with some general trace_events as shown below. It also
      defines some 'tipc_*_dump()' functions that allow to dump TIPC object
      data whenever needed, that is, for general debug purposes, ie. not just
      for the trace_events.
      
      The following trace_events are now available:
      
      - trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
        e.g. message type, user, droppable, skb truesize, cloned skb, etc.
      
      - trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
        queues, e.g. TIPC link transmq, socket receive queue, etc.
      
      - trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
        sk state, sk type, connection type, rmem_alloc, socket queues, etc.
      
      - trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
        link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
      
      - trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
        node state, active links, capabilities, link entries, etc.
      
      How to use:
      Put the trace functions at any places where we want to dump TIPC data
      or events.
      
      Note:
      a) The dump functions will generate raw data only, that is, to offload
      the trace event's processing, it can require a tool or script to parse
      the data but this should be simple.
      
      b) The trace_tipc_*_dump() should be reserved for a failure cases only
      (e.g. the retransmission failure case) or where we do not expect to
      happen too often, then we can consider enabling these events by default
      since they will almost not take any effects under normal conditions,
      but once the rare condition or failure occurs, we get the dumped data
      fully for post-analysis.
      
      For other trace purposes, we can reuse these trace classes as template
      but different events.
      
      c) A trace_event is only effective when we enable it. To enable the
      TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
      directory in the 'debugfs' file system. Normally, they are located at:
      
      /sys/kernel/debug/tracing/events/tipc/
      
      For example:
      
      To enable the tipc_link_dump event:
      
      echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
      
      To enable all the TIPC trace_events:
      
      echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
      
      To collect the trace data:
      
      cat trace
      
      or
      
      cat trace_pipe > /trace.out &
      
      To disable all the TIPC trace_events:
      
      echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
      
      To clear the trace buffer:
      
      echo > trace
      
      d) Like the other trace_events, the feature like 'filter' or 'trigger'
      is also usable for the tipc trace_events.
      For more details, have a look at:
      
      Documentation/trace/ftrace.txt
      
      MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Tested-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4b9771b
  10. 30 9月, 2018 1 次提交
  11. 12 7月, 2018 1 次提交
    • J
      tipc: add sequence number check for link STATE messages · 9012de50
      Jon Maloy 提交于
      Some switch infrastructures produce huge amounts of packet duplicates.
      This becomes a problem if those messages are STATE/NACK protocol
      messages, causing unnecessary retransmissions of already accepted
      packets.
      
      We now introduce a unique sequence number per STATE protocol message
      so that duplicates can be identified and ignored. This will also be
      useful when tracing such cases, and to avert replay attacks when TIPC
      is encrypted.
      
      For compatibility reasons we have to introduce a new capability flag
      TIPC_LINK_PROTO_SEQNO to handle this new feature.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9012de50
  12. 27 4月, 2018 1 次提交
  13. 20 4月, 2018 1 次提交
  14. 24 3月, 2018 3 次提交
    • J
      tipc: handle collisions of 32-bit node address hash values · 25b0b9c4
      Jon Maloy 提交于
      When a 32-bit node address is generated from a 128-bit identifier,
      there is a risk of collisions which must be discovered and handled.
      
      We do this as follows:
      - We don't apply the generated address immediately to the node, but do
        instead initiate a 1 sec trial period to allow other cluster members
        to discover and handle such collisions.
      
      - During the trial period the node periodically sends out a new type
        of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
        to all the other nodes in the cluster.
      
      - When a node is receiving such a message, it must check that the
        presented 32-bit identifier either is unused, or was used by the very
        same peer in a previous session. In both cases it accepts the request
        by not responding to it.
      
      - If it finds that the same node has been up before using a different
        address, it responds with a DSC_TRIAL_FAIL_MSG containing that
        address.
      
      - If it finds that the address has already been taken by some other
        node, it generates a new, unused address and returns it to the
        requester.
      
      - During the trial period the requesting node must always be prepared
        to accept a failure message, i.e., a message where a peer suggests a
        different (or equal)  address to the one tried. In those cases it
        must apply the suggested value as trial address and restart the trial
        period.
      
      This algorithm ensures that in the vast majority of cases a node will
      have the same address before and after a reboot. If a legacy user
      configures the address explicitly, there will be no trial period and
      messages, so this protocol addition is completely backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25b0b9c4
    • J
      tipc: add 128-bit node identifier · d50ccc2d
      Jon Maloy 提交于
      We add a 128-bit node identity, as an alternative to the currently used
      32-bit node address.
      
      For the sake of compatibility and to minimize message header changes
      we retain the existing 32-bit address field. When not set explicitly by
      the user, this field will be filled with a hash value generated from the
      much longer node identity, and be used as a shorthand value for the
      latter.
      
      We permit either the address or the identity to be set by configuration,
      but not both, so when the address value is set by a legacy user the
      corresponding 128-bit node identity is generated based on the that value.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d50ccc2d
    • J
      tipc: remove restrictions on node address values · 20263641
      Jon Maloy 提交于
      Nominally, TIPC organizes network nodes into a three-level network
      hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
      hierarchy is reflected in the node address format, - it is sub-divided
      into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id.
      
      However, the 'zone' and 'cluster' levels have in reality never been
      fully implemented,and never will be. The result of this has been
      that the first 20 bits the node identity structure have been wasted,
      and the usable node identity range within a cluster has been limited
      to 12 bits. This is starting to become a problem.
      
      In the following commits, we will need to be able to connect between
      nodes which are using the whole 32-bit value space of the node address.
      We therefore remove the restrictions on which values can be assigned
      to node identity, -it is from now on only a 32-bit integer with no
      assumed internal structure.
      
      Isolation between clusters is now achieved only by setting different
      values for the 'network id' field used during neighbor discovery, in
      practice leading to the latter becoming the new cluster identity.
      
      The rules for accepting discovery requests/responses from neighboring
      nodes now become:
      
      - If the user is using legacy address format on both peers, reception
        of discovery messages is subject to the legacy lookup domain check
        in addition to the cluster id check.
      
      - Otherwise, the discovery request/response is always accepted, provided
        both peers have the same network id.
      
      This secures backwards compatibility for users who have been using zone
      or cluster identities as cluster separators, instead of the intended
      'network id'.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20263641
  15. 15 2月, 2018 1 次提交
  16. 13 10月, 2017 3 次提交
    • J
      tipc: introduce communication groups · 75da2163
      Jon Maloy 提交于
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as an logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and if wants to be eligible as destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be member of one group at a time.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75da2163
    • J
      tipc: add new function for sending multiple small messages · f70d37b7
      Jon Maloy 提交于
      We see an increasing need to send multiple single-buffer messages
      of TIPC_SYSTEM_IMPORTANCE to different individual destination nodes.
      Instead of looping over the send queue and sending each buffer
      individually, as we do now, we add a new help function
      tipc_node_distr_xmit() to do this.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f70d37b7
    • J
      tipc: add ability to obtain node availability status from other files · 38077b8e
      Jon Maloy 提交于
      In the coming commits, functions at the socket level will need the
      ability to read the availability status of a given node. We therefore
      introduce a new function for this purpose, while renaming the existing
      static function currently having the wanted name.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38077b8e
  17. 21 1月, 2017 1 次提交
    • J
      tipc: make replicast a user selectable option · 01fd12bb
      Jon Paul Maloy 提交于
      If the bearer carrying multicast messages supports broadcast, those
      messages will be sent to all cluster nodes, irrespective of whether
      these nodes host any actual destinations socket or not. This is clearly
      wasteful if the cluster is large and there are only a few real
      destinations for the message being sent.
      
      In this commit we extend the eligibility of the newly introduced
      "replicast" transmit option. We now make it possible for a user to
      select which method he wants to be used, either as a mandatory setting
      via setsockopt(), or as a relative setting where we let the broadcast
      layer decide which method to use based on the ratio between cluster
      size and the message's actual number of destination nodes.
      
      In the latter case, a sending socket must stick to a previously
      selected method until it enters an idle period of at least 5 seconds.
      This eliminates the risk of message reordering caused by method change,
      i.e., when changes to cluster size or number of destinations would
      otherwise mandate a new method to be used.
      Reviewed-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01fd12bb
  18. 03 9月, 2016 1 次提交
    • J
      tipc: transfer broadcast nacks in link state messages · 02d11ca2
      Jon Paul Maloy 提交于
      When we send broadcasts in clusters of more 70-80 nodes, we sometimes
      see the broadcast link resetting because of an excessive number of
      retransmissions. This is caused by a combination of two factors:
      
      1) A 'NACK crunch", where loss of broadcast packets is discovered
         and NACK'ed by several nodes simultaneously, leading to multiple
         redundant broadcast retransmissions.
      
      2) The fact that the NACKS as such also are sent as broadcast, leading
         to excessive load and packet loss on the transmitting switch/bridge.
      
      This commit deals with the latter problem, by moving sending of
      broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
      to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
      bits in word 8 of the said message for this purpose, and introduce a
      new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
      backwards compatible.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02d11ca2
  19. 19 8月, 2016 1 次提交
  20. 27 7月, 2016 3 次提交
  21. 04 5月, 2016 2 次提交
    • J
      tipc: redesign connection-level flow control · 10724cc7
      Jon Paul Maloy 提交于
      There are two flow control mechanisms in TIPC; one at link level that
      handles network congestion, burst control, and retransmission, and one
      at connection level which' only remaining task is to prevent overflow
      in the receiving socket buffer. In TIPC, the latter task has to be
      solved end-to-end because messages can not be thrown away once they
      have been accepted and delivered upwards from the link layer, i.e, we
      can never permit the receive buffer to overflow.
      
      Currently, this algorithm is message based. A counter in the receiving
      socket keeps track of number of consumed messages, and sends a dedicated
      acknowledge message back to the sender for each 256 consumed message.
      A counter at the sending end keeps track of the sent, not yet
      acknowledged messages, and blocks the sender if this number ever reaches
      512 unacknowledged messages. When the missing acknowledge arrives, the
      socket is then woken up for renewed transmission. This works well for
      keeping the message flow running, as it almost never happens that a
      sender socket is blocked this way.
      
      A problem with the current mechanism is that it potentially is very
      memory consuming. Since we don't distinguish between small and large
      messages, we have to dimension the socket receive buffer according
      to a worst-case of both. I.e., the window size must be chosen large
      enough to sustain a reasonable throughput even for the smallest
      messages, while we must still consider a scenario where all messages
      are of maximum size. Hence, the current fix window size of 512 messages
      and a maximum message size of 66k results in a receive buffer of 66 MB
      when truesize(66k) = 131k is taken into account. It is possible to do
      much better.
      
      This commit introduces an algorithm where we instead use 1024-byte
      blocks as base unit. This unit, always rounded upwards from the
      actual message size, is used when we advertise windows as well as when
      we count and acknowledge transmitted data. The advertised window is
      based on the configured receive buffer size in such a way that even
      the worst-case truesize/msgsize ratio always is covered. Since the
      smallest possible message size (from a flow control viewpoint) now is
      1024 bytes, we can safely assume this ratio to be less than four, which
      is the value we are now using.
      
      This way, we have been able to reduce the default receive buffer size
      from 66 MB to 2 MB with maintained performance.
      
      In order to keep this solution backwards compatible, we introduce a
      new capability bit in the discovery protocol, and use this throughout
      the message sending/reception path to always select the right unit.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10724cc7
    • J
      tipc: propagate peer node capabilities to socket layer · 60020e18
      Jon Paul Maloy 提交于
      During neighbor discovery, nodes advertise their capabilities as a bit
      map in a dedicated 16-bit field in the discovery message header. This
      bit map has so far only be stored in the node structure on the peer
      nodes, but we now see the need to keep a copy even in the socket
      structure.
      
      This commit adds this functionality.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60020e18
  22. 21 11月, 2015 5 次提交
  23. 24 10月, 2015 3 次提交
  24. 31 7月, 2015 2 次提交