1. 27 May, 2020 1 commit
    • tipc: add support for broadcast rcv stats dumping · 03b6fefd
      Committed by Tuong Lien
      This commit enables dumping the statistics of a broadcast-receiver link
      like the traditional 'broadcast-link' one (which is for broadcast-
      sender). The link dumping can be triggered via netlink (e.g. the
      iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
      indicator.
      
      The name of a broadcast-receiver link of a specific peer will be in the
      format: 'broadcast-link:<peer-id>'.
      
      For example:
      
      Link <broadcast-link:1001002>
        Window:50 packets
        RX packets:7841 fragments:2408/440 bundles:0/0
        TX packets:0 fragments:0/0 bundles:0/0
        RX naks:0 defs:124 dups:0
        TX naks:21 acks:0 retrans:0
        Congestion link:0  Send queue max:0 avg:0
      
      In addition, the broadcast-receiver link statistics can be reset in the
      usual way via netlink by specifying that link name in the command.
      
      Note: the 'tipc_link_name_ext()' is removed because the link name can
      now be retrieved simply via the 'l->name'.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03b6fefd
  2. 15 March, 2020 1 commit
  3. 09 November, 2019 1 commit
    • tipc: introduce TIPC encryption & authentication · fc1b6d6d
      Committed by Tuong Lien
      This commit offers an option to encrypt and authenticate all messaging,
      including the neighbor discovery messages. The currently most advanced
      algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
      encryption/decryption is done at the bearer layer, just before leaving
      or after entering TIPC.
      
      Supported features:
      - Encryption & authentication of all TIPC messages (header + data);
      - Two symmetric-key modes: Cluster and Per-node;
      - Automatic key switching;
      - Key-expired revoking (sequence number wrapped);
      - Lock-free encryption/decryption (RCU);
      - Asynchronous crypto, Intel AES-NI supported;
      - Multiple cipher transforms;
      - Logs & statistics;
      
      Two key modes:
      - Cluster key mode: One single key is used for both TX & RX in all
      nodes in the cluster.
      - Per-node key mode: Each node in the cluster has one specific TX key.
      For RX, a node requires its peers' TX key to be able to decrypt the
      messages from those peers.
      
      Key setting from user-space is performed via netlink by a user program
      (e.g. the iproute2 'tipc' tool).
      
      Internal key state machine:
      
                                       Attach    Align(RX)
                                           +-+   +-+
                                           | V   | V
              +---------+      Attach     +---------+
              |  IDLE   |---------------->| PENDING |(user = 0)
              +---------+                 +---------+
                 A   A                   Switch|  A
                 |   |                         |  |
                 |   | Free(switch/revoked)    |  |
           (Free)|   +----------------------+  |  |Timeout
                 |              (TX)        |  |  |(RX)
                 |                          |  |  |
                 |                          |  v  |
              +---------+      Switch     +---------+
              | PASSIVE |<----------------| ACTIVE  |
              +---------+       (RX)      +---------+
              (user = 1)                  (user >= 1)
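      
      The diagram above can be read as a transition table. The following is
      a minimal Python sketch of that reading, not the kernel code: the
      event names ('attach', 'align_rx', 'switch', 'timeout_rx',
      'switch_rx', 'free_tx', 'free') and the helper next_state() are
      hypothetical labels chosen for the arrows in the diagram.

```python
from enum import Enum


class KeyState(Enum):
    IDLE = "IDLE"
    PENDING = "PENDING"
    ACTIVE = "ACTIVE"
    PASSIVE = "PASSIVE"


# (current state, event) -> next state, as read off the diagram.
# Event names are illustrative labels, not kernel identifiers.
TRANSITIONS = {
    (KeyState.IDLE, "attach"): KeyState.PENDING,
    (KeyState.PENDING, "attach"): KeyState.PENDING,      # self-loop
    (KeyState.PENDING, "align_rx"): KeyState.PENDING,    # self-loop, RX only
    (KeyState.PENDING, "switch"): KeyState.ACTIVE,
    (KeyState.ACTIVE, "timeout_rx"): KeyState.PENDING,   # RX only
    (KeyState.ACTIVE, "switch_rx"): KeyState.PASSIVE,    # RX only
    (KeyState.ACTIVE, "free_tx"): KeyState.IDLE,         # TX: switched or revoked
    (KeyState.PASSIVE, "free"): KeyState.IDLE,
}


def next_state(state, event):
    """Return the state reached from 'state' on 'event', or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}")
```

      For instance, a key that is attached and then switched to goes
      IDLE -> PENDING -> ACTIVE, while a 'switch' event on an IDLE key is
      rejected by this model.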
      
      The number of TFMs is 10 by default and can be changed via the procfs
      file 'net/tipc/max_tfms'. For now, for simplicity, this file is also
      used to print the crypto statistics at runtime:
      
      echo 0xfff1 > /proc/sys/net/tipc/max_tfms
      
      The patch defines a new TIPC version (v7) for the encryption message,
      while keeping backward compatibility. The message is basically
      encapsulated as follows:
      
         +----------------------------------------------------------+
         | TIPCv7 encryption  | Original TIPCv2    | Authentication |
         | header             | packet (encrypted) | Tag            |
         +----------------------------------------------------------+
      
      Compared with non-encrypted traffic, the throughput is about 40% for
      small messages and about 9% for large messages. With support from
      hardware crypto, i.e. the Intel AES-NI CPU instructions, the throughput
      increases up to ~85% for small messages and ~55% for large messages.
      
      By default, the new feature is inactive (i.e. no encryption) until user
      sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
      in the kernel configuration to enable/disable the new code when needed.
      
      MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc1b6d6d
  4. 04 November, 2019 1 commit
    • tipc: improve message bundling algorithm · 06e7c70c
      Committed by Tuong Lien
      As mentioned in commit e95584a8 ("tipc: fix unlimited bundling of
      small messages"), the current message bundling algorithm is
      inefficient: it can generate bundles of only one payload message,
      which causes unnecessary overhead for both the sender and receiver.
      
      This commit redesigns the 'tipc_msg_make_bundle()' function (now named
      'tipc_msg_try_bundle()'), so that when the first message arrives, we
      just check whether it is suitable for bundling and, if so, keep a
      reference to it. The message buffer is put into the link backlog queue
      and processed as normal. Later on, when another message arrives, we
      make a bundle with the first message if possible, and so on. This
      way, a bundle, if really needed, always consists of at least two
      payload messages. Otherwise, we let the first buffer go its way
      without any bundling, reducing the overhead to zero.
      
      Moreover, since we now have both messages in hand, we can even
      optimize the 'tipc_msg_bundle()' function to bundle a very large
      message (size ~ MSS) with a small one, which is not possible with the
      current algorithm, e.g. [1400-byte message] + [10-byte message]
      (MTU = 1500).
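      
      The size check behind the [1400-byte] + [10-byte] example can be
      sketched as follows. This is a simplified Python model, not the
      kernel function: the 40-byte bundle header and the 4-byte alignment
      of inner messages are assumptions, and try_bundle() is a
      hypothetical helper name.

```python
BUNDLE_HDR = 40  # assumed size of the bundle header, in bytes


def align4(n):
    """Round n up to a multiple of 4 (assumed inner-message alignment)."""
    return (n + 3) & ~3


def try_bundle(pending_sizes, new_size, mtu):
    """Try to add a message of new_size bytes to the bundling candidate.

    pending_sizes lists the sizes of the messages already kept for
    bundling. The list is extended and True is returned only if the
    resulting bundle still fits within mtu.
    """
    used = BUNDLE_HDR + sum(align4(s) for s in pending_sizes)
    if used + new_size > mtu:
        return False
    pending_sizes.append(new_size)
    return True
```

      With MTU = 1500, a pending 1400-byte message accepts a 10-byte one
      (40 + 1400 + 10 = 1450 bytes), whereas a further 60-byte message
      would no longer fit in this model.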
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06e7c70c
  5. 31 October, 2019 1 commit
    • tipc: add smart nagle feature · c0bceb97
      Committed by Jon Maloy
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue are transmitted when one
      of the following events or conditions occurs.
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
      
      This new feature gives a 50-100% improvement of throughput for small
      (i.e., less than MTU size) messages, while it might add up to one RTT
      to latency time when the socket is in nagle mode.
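      
      The rules above can be condensed into a small user-space model. The
      Python sketch below is only an approximation under stated
      assumptions: the class NagleModel and its methods are hypothetical,
      the SOCK_WAKEUP and shutdown triggers are omitted, and the
      "more than 4 messages" rule is taken literally.

```python
class NagleModel:
    NAGLE_START = 4           # "more than 4 messages" from the text above
    QUEUE_LIMIT = 64 * 1024   # "more than 64 1k blocks of data"

    def __init__(self, maxnagle):
        self.maxnagle = maxnagle      # 0 disables the nagle property
        self.sent_since_rx = 0        # sent since last data message from peer
        self.ack_outstanding = False  # an 'ack_required' message is in flight
        self.write_queue = []         # sizes of accumulated messages
        self.wire = []                # (size, ack_required) actually sent

    def _nagle_mode(self):
        return self.maxnagle > 0 and self.sent_since_rx > self.NAGLE_START

    def _flush(self, want_ack):
        # Transmit the whole queue; only the last buffer asks for a CONN_ACK.
        last = len(self.write_queue) - 1
        for i, size in enumerate(self.write_queue):
            self.wire.append((size, want_ack and i == last))
        if want_ack and self.write_queue:
            self.ack_outstanding = True
        self.sent_since_rx += len(self.write_queue)
        self.write_queue = []

    def send(self, size):
        nagle = self._nagle_mode()
        self.write_queue.append(size)
        # Hold back only small messages, while an ack is still expected
        # and the queue is below the size limit.
        hold = (nagle and size < self.maxnagle and self.ack_outstanding
                and sum(self.write_queue) <= self.QUEUE_LIMIT)
        if not hold:
            self._flush(want_ack=nagle)

    def on_conn_ack(self):
        self.ack_outstanding = False
        self._flush(want_ack=self._nagle_mode())

    def on_data(self):
        # A data message from the peer ends nagle mode and flushes the queue.
        self.sent_since_rx = 0
        self.ack_outstanding = False
        self._flush(want_ack=False)
```

      In this model the first message sent in nagle mode goes out
      directly with 'ack_required' set, later small messages accumulate,
      and the CONN_ACK releases them as one batch.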
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0bceb97
  6. 02 October, 2019 1 commit
    • tipc: fix unlimited bundling of small messages · e95584a8
      Committed by Tuong Lien
      We have identified a problem with the "oversubscription" policy in the
      link transmission code.
      
      When small messages are transmitted, and the sending link has reached
      the transmit window limit, those messages will be bundled and put into
      the link backlog queue. However, bundles of data messages are counted
      at the 'CRITICAL' level, so that the counter for that level, instead of
      the counter for the real, bundled message's level is the one being
      increased.
      Subsequent, to-be-bundled data messages at non-CRITICAL levels continue
      to be tested against the unchanged counter for their own level, while
      contributing to an unrestrained increase at the CRITICAL backlog level.
      
      This leaves a gap in the congestion control algorithm for small
      messages that can result in starvation for other users or a "real"
      CRITICAL user. Eventually, it can even lead to buffer exhaustion and
      link reset.
      
      We fix this by keeping a 'target_bskb' buffer pointer at each level,
      so that when bundling, we only bundle messages at the same importance
      level. This way, we know exactly how many slots a certain level has
      occupied in the queue, and so can manage level congestion accurately.
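      
      The per-level bookkeeping can be illustrated with a small model. The
      Python sketch below is not the kernel implementation: BacklogModel,
      its fields and the 40-byte bundle header are assumptions for
      illustration, and a message larger than the MTU is simply treated as
      one queue entry that cannot be bundled.

```python
BUNDLE_HDR = 40  # assumed bundle header size, in bytes
LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


class BacklogModel:
    def __init__(self, mtu):
        self.mtu = mtu
        self.queue = []                             # backlog entries, in order
        self.slots = {lvl: 0 for lvl in LEVELS}     # queue slots used per level
        self.target = {lvl: None for lvl in LEVELS} # per-level bundling target

    def enqueue(self, level, size):
        bundle = self.target[level]
        # Bundle only with the target kept for the same importance level.
        if bundle is not None and bundle["bytes"] + size <= self.mtu:
            bundle["msgs"].append(size)
            bundle["bytes"] += size
            return
        # Otherwise the message takes a new slot, counted at its own level.
        entry = {"level": level, "msgs": [size], "bytes": BUNDLE_HDR + size}
        self.queue.append(entry)
        self.slots[level] += 1
        self.target[level] = entry if entry["bytes"] < self.mtu else None
```

      Feeding this model an interleaved stream of 64-byte 'CRITICAL' and
      4096-byte 'LOW' messages yields a single 'CRITICAL' slot holding
      all the small messages, while each large message occupies its own
      'LOW' slot.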
      
      By bundling messages at the same level, we gain even more benefits.
      Consider this:
      - One socket sends 64-byte messages at the 'CRITICAL' level;
      - Another sends 4096-byte messages at the 'LOW' level;
      
      When a 64-byte message comes and is bundled for the first time, it
      incurs the overhead of a message bundle (+ 40-byte header, data copy,
      etc.) for later use, but the next message can be a 4096-byte one that
      cannot be bundled with the previous one. This means the last bundle
      carries only one payload message, which is totally inefficient, for
      the receiver as well! Later on, another 64-byte message comes, we
      make a new bundle, and the same story repeats...
      
      With the new bundling algorithm, this will not happen: the 64-byte
      messages will be bundled together even when 4096-byte message(s)
      come in between. However, if the 4096-byte messages are sent at the
      same level, i.e. 'CRITICAL', the bundling algorithm will again cause
      the same overhead.
      
      Also, the same will happen even with only one socket sending small
      messages at a rate close to the link's transmit rate, so that when one
      message is bundled, it is transmitted shortly afterwards. Then another
      message comes, a new bundle is created, and so on...
      
      We will solve this issue radically by another patch.
      
      Fixes: 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e95584a8
  7. 26 July, 2019 1 commit
    • tipc: fix changeover issues due to large packet · 2320bcda
      Committed by Tuong Lien
      In conjunction with changing the interfaces' MTU (especially in the
      case of bonding), where the TIPC links are brought up and down within
      a short time, a couple of issues were detected with the current link
      changeover mechanism:
      
      1) When one link is up but immediately forced down again, the failover
      procedure is carried out in order to fail over all the messages in
      the link's transmq queue onto the other working link. The link and node
      state is also set to FAILINGOVER as part of the process. Each message
      is transmitted in the form of a FAILOVER_MSG, so its size grows by 40
      bytes (= the message header size). There is no problem if the original
      message size is not larger than the link's MTU - 40, and indeed this is
      the max size of a normal payload message. However, in the situation
      above, because the link has just come up, the messages in the link's
      transmq are almost all SYNCH_MSGs generated by the link synching
      procedure, so their size might already have reached the max value!
      When a FAILOVER_MSG is built on top of such a SYNCH_MSG, its size
      exceeds the link's MTU. As a result, the messages are dropped
      silently and the failover procedure never ends: the link is not able
      to exit the FAILINGOVER state, so it cannot be re-established.
      
      2) The same scenario can happen more easily when the MTUs of the
      links are set differently or are being changed. In that case, as long
      as a large message in the failing link's transmq queue was built and
      fragmented with its link's MTU > the other link's MTU, the issue will
      occur (there is no need for a link synching in advance).
      
      3) The link synching procedure also faces the same issue, but since
      link synching is only started upon receipt of a SYNCH_MSG, dropping
      the message will not result in a state deadlock. It is still not the
      designed behavior.
      
      Issues 1) and 3) are resolved by the previous commit, which generates
      only a dummy SYNCH_MSG (i.e. without data) at link synching, so the
      size of a FAILOVER_MSG, if any, will never exceed the link's MTU.
      
      For issue 2), the only solution is to fragment the messages in the
      failing link's transmq queue according to the working link's MTU so
      that they can be failed over. A new function is added to accomplish
      this. The result is still a TUNNEL PROTOCOL/FAILOVER MSG, but if the
      original message size is too large, it is fragmented and reassembled
      at the receiving side.
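      
      The size arithmetic behind issues 1) and 2) can be checked with a
      short calculation. The Python sketch below is a rough model only: it
      assumes that every tunnelled fragment carries its own 40-byte header,
      and fragments_needed() is a hypothetical helper, not the new kernel
      function.

```python
import math

TUNNEL_HDR = 40  # header added when a message is wrapped in a FAILOVER_MSG


def fragments_needed(msg_size, working_mtu):
    """Number of tunnel packets needed to fail over one message.

    Returns 1 when the wrapped message fits the working link's MTU.
    Otherwise the original message is split so that each piece plus
    its 40-byte header fits (an assumption of this model).
    """
    if msg_size + TUNNEL_HDR <= working_mtu:
        return 1
    per_fragment = working_mtu - TUNNEL_HDR
    return math.ceil(msg_size / per_fragment)
```

      A 1460-byte message still fits a 1500-byte working link after
      wrapping, a full 1500-byte message does not and needs two packets,
      and a 9000-byte message built for a jumbo-frame link needs seven.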
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2320bcda
  8. 30 September, 2018 2 commits
    • tipc: buffer overflow handling in listener socket · 67879274
      Committed by Tung Nguyen
      The default socket receive buffer size for a listener socket is 2 MB.
      For each arriving empty SYN, the Linux kernel allocates a 768-byte
      buffer. This means that a listener socket can serve at most about 2700
      simultaneous empty connection setup requests before it hits a receive
      buffer overflow, and far fewer if the SYN is carrying any significant
      amount of data.
      
      When this happens the setup request is rejected, and the client
      receives an ECONNREFUSED error.
      
      This commit mitigates this problem by letting the client socket try to
      retransmit the SYN message multiple times when it sees it rejected with
      the code TIPC_ERR_OVERLOAD. Retransmission is done at random intervals
      in the range of [100 ms, setup_timeout / 4], as many times as there is
      room for within the setup timeout limit.
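      
      Both the capacity estimate and the retry schedule can be reproduced
      in a few lines. The Python sketch below is an illustration only: the
      helper names are hypothetical, the receive buffer is taken as
      2 * 1024 * 1024 bytes, and the 8000 ms timeout in the usage note is
      just an example value.

```python
import random

RCVBUF_BYTES = 2 * 1024 * 1024  # listener receive buffer from the text above
SYN_BUF_BYTES = 768             # buffer allocated per empty SYN


def max_empty_syns(rcvbuf=RCVBUF_BYTES, per_syn=SYN_BUF_BYTES):
    """Upper bound on queued empty SYNs before the buffer overflows."""
    return rcvbuf // per_syn


def syn_retry_delays(setup_timeout_ms, rng):
    """Draw random retry delays in [100 ms, setup_timeout / 4].

    Delays are drawn until the next one would exceed the setup timeout,
    i.e. "as many times as there is room for".
    """
    lo = 100
    hi = max(lo, setup_timeout_ms // 4)
    delays, elapsed = [], 0
    while True:
        delay = rng.randint(lo, hi)
        if elapsed + delay > setup_timeout_ms:
            break
        delays.append(delay)
        elapsed += delay
    return delays
```

      For example, syn_retry_delays(8000, random.Random()) always yields
      at least four retries in this model, because no single delay can
      exceed a quarter of the timeout.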
      Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67879274
    • tipc: refactor function tipc_msg_reverse() · 5cbdbd1a
      Committed by Jon Maloy
      The function tipc_msg_reverse() is reversing the header of a message
      while reusing the original buffer. We have seen on several occasions
      that this may have unfortunate side effects when the buffer to be
      reversed is a clone.
      
      In one of the following commits we will again need to reverse cloned
      buffers, so this is the right time to permanently eliminate this
      problem. In this commit we let the said function always consume the
      original buffer and replace it with a new one when applicable.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5cbdbd1a
  9. 30 June, 2018 1 commit
    • tipc: eliminate buffer cloning in function tipc_msg_extract() · ef9be755
      Committed by Tung Nguyen
      The function tipc_msg_extract() is using skb_clone() to clone inner
      messages from a message bundle buffer. Although this method is safe,
      it has the undesired effect that each buffer clone inherits the
      true-size of the bundling buffer. As a result, the buffer clone
      almost always ends up being copied anyway by the message
      validation function. This makes the cloning a sub-optimization.
      
      In this commit we take the consequence of this realization, and copy
      each inner message to a separately allocated buffer up front in the
      extraction function.
      
      As a bonus we can now eliminate the two cases where we had to copy
      re-routed packets that may potentially go out on the wire again.
      Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef9be755
  10. 18 March, 2018 1 commit
    • tipc: obsolete TIPC_ZONE_SCOPE · 928df188
      Committed by Jon Maloy
      Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
      aspects handled the same way, both on the publishing node and on the
      receiving nodes.
      
      Despite previous ambitions to the contrary, this is never going to change,
      so we take the consequence of this and obsolete TIPC_ZONE_SCOPE and related
      macros/functions. Whenever a user is doing a bind() or a sendmsg() attempt
      using ZONE_SCOPE we translate this internally to CLUSTER_SCOPE, while we
      remain compatible with users and remote nodes still using ZONE_SCOPE.
      
      Furthermore, the non-formalized scope value 0 has always been permitted
      for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
      We now permit it even as binding scope, but for compatibility reasons we
      choose to not change the value of TIPC_CLUSTER_SCOPE.
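      
      The translation described above amounts to a small normalization
      step. The Python sketch below is a model only: normalize_scope() is a
      hypothetical helper, and the numeric scope values are assumed to
      match those in the <linux/tipc.h> user API header.

```python
# Scope values assumed to match <linux/tipc.h>.
TIPC_ZONE_SCOPE = 1     # obsolete, kept for compatibility
TIPC_CLUSTER_SCOPE = 2
TIPC_NODE_SCOPE = 3


def normalize_scope(scope):
    """Map a user-supplied scope to the scope used internally.

    0 (the non-formalized value) and ZONE_SCOPE are both treated as
    CLUSTER_SCOPE; NODE_SCOPE is left untouched.
    """
    if scope in (0, TIPC_ZONE_SCOPE, TIPC_CLUSTER_SCOPE):
        return TIPC_CLUSTER_SCOPE
    if scope == TIPC_NODE_SCOPE:
        return TIPC_NODE_SCOPE
    raise ValueError(f"unknown scope {scope}")
```

      A bind() using ZONE_SCOPE or 0 thus behaves, in this model, exactly
      like one using CLUSTER_SCOPE.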
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      928df188
  11. 09 February, 2018 1 commit
    • tipc: fix skb truesize/datasize ratio control · 55b3280d
      Committed by Hoang Le
      In commit d618d09a ("tipc: enforce valid ratio between skb truesize
      and contents") we introduced a test for ensuring that the condition
      truesize/datasize <= 4 is true for a received buffer. Unfortunately this
      test has two problems.
      
      - Because of the integer arithmetic, the test
        if (skb->truesize / buf_roundup_len(skb) > 4) will miss all
        ratios [4 < ratio < 5], which was not the intention.
      - The buffer returned by skb_copy() inherits skb->truesize of the
        original buffer, which doesn't help the situation at all.
      
      In this commit, we change the ratio condition and replace skb_copy()
      with a call to skb_copy_expand() to finally get this right.
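      
      The integer-division pitfall in the first bullet is easy to
      reproduce. The Python sketch below is illustrative only:
      roundup_1k() stands in for buf_roundup_len(), and multiplied_check()
      is just one way to express the intended condition, not necessarily
      the form the commit adopted.

```python
def roundup_1k(length):
    """Round a non-zero length up to the nearest multiple of 1024."""
    return ((length + 1023) // 1024) * 1024


def divided_check(truesize, length):
    """The original test: integer division truncates the ratio."""
    return truesize // roundup_1k(length) > 4


def multiplied_check(truesize, length):
    """Same intent without truncation: is truesize/datasize above 4?"""
    return truesize > 4 * roundup_1k(length)
```

      With truesize = 4608 and a 100-byte message the real ratio is 4.5,
      yet the divided form evaluates 4 > 4 and lets the buffer through,
      while the multiplied form flags it.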
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55b3280d
  12. 02 December, 2017 1 commit
    • tipc: fall back to smaller MTU if allocation of local send skb fails · 4c94cc2d
      Committed by Jon Maloy
      When sending node local messages the code is using an 'mtu' of 66060
      bytes to avoid unnecessary fragmentation. During situations of low
      memory tipc_msg_build() may sometimes fail to allocate such large
      buffers, resulting in unnecessary send failures. This can easily be
      remedied by falling back to a smaller MTU, and then reassembling the
      buffer chain as if the message were arriving from a remote node.
      
      At the same time, we change the initial MTU setting of the broadcast
      link to a lower value, so that large messages always are fragmented
      into smaller buffers even when we run in single node mode. Apart from
      obtaining the same advantage as for the 'fallback' solution above, this
      turns out to give a significant performance improvement. This can
      probably be explained with the __pskb_copy() operation performed on the
      buffer for each recipient during reception. We found the optimal value
      for this, considering the most relevant skb pool, to be 3744 bytes.
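      
      The fallback idea can be sketched in user space. The Python model
      below is an assumption-laden illustration: build_chain(),
      build_with_fallback() and low_memory_alloc() are hypothetical
      helpers, each fragment is assumed to carry a 40-byte header, and
      3744 is reused as the fallback MTU purely for the example.

```python
FRAG_HDR = 40  # assumed per-fragment header size, in bytes


def build_chain(msg_size, mtu, alloc):
    """Return the list of buffers needed to carry msg_size bytes.

    alloc(size) models buffer allocation and raises MemoryError on failure.
    """
    if msg_size <= mtu:
        return [alloc(msg_size)]
    per_fragment = mtu - FRAG_HDR
    chain, remaining = [], msg_size
    while remaining > 0:
        chunk = min(remaining, per_fragment)
        chain.append(alloc(chunk + FRAG_HDR))
        remaining -= chunk
    return chain


def build_with_fallback(msg_size, mtu, fallback_mtu, alloc):
    """Try the large MTU first; on allocation failure retry with the small one."""
    try:
        return build_chain(msg_size, mtu, alloc)
    except MemoryError:
        return build_chain(msg_size, fallback_mtu, alloc)


def low_memory_alloc(size):
    """Allocator model for a low-memory system: large buffers fail."""
    if size > 4096:
        raise MemoryError
    return size


chain = build_with_fallback(60000, 66060, 3744, low_memory_alloc)
```

      Here the single 60000-byte allocation fails, and the message is
      instead carried by a chain of buffers no larger than 3744 bytes.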
      Acked-by: Ying Xue <ying.xue@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c94cc2d
  13. 16 November, 2017 1 commit
    • tipc: enforce valid ratio between skb truesize and contents · d618d09a
      Committed by Jon Maloy
      The socket level flow control is based on the assumption that incoming
      buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
      where the latter value is rounded up to the nearest 1k number.
      This does empirically hold true for the device drivers we know, but we
      cannot trust that it will always be so, e.g., in a system with jumbo
      frames and very small packets.
      
      We now introduce a check for this condition at packet arrival, and if
      we find it to be false, we copy the packet to a new, smaller buffer,
      where the condition will be true. We expect this to affect only a small
      fraction of all incoming packets, if at all.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d618d09a
  14. 13 October, 2017 1 commit
  15. 09 October, 2017 1 commit
    • tipc: Unclone message at secondary destination lookup · a9e2971b
      Committed by Jon Maloy
      When a bundling message is received, the function tipc_link_input()
      calls function tipc_msg_extract() to unbundle all inner messages of
      the bundling message before adding them to input queue.
      
      The function tipc_msg_extract() just clones the inner skbs for all
      inner messages from the bundling skb. This means that the skb
      headroom of an inner message overlaps with the data part of the
      preceding message in the bundle.
      
      If the message in question is a name addressed message, it may be
      subject to a secondary destination lookup, and eventually be sent out
      on one of the interfaces again. But, since what is perceived as headroom
      by the device driver in reality is the last bytes of the preceding
      message in the bundle, the latter will be overwritten by the MAC
      addresses of the L2 header. If the preceding message has not yet been
      consumed by the user, it will eventually be delivered with corrupted
      contents.
      
      This commit fixes this by uncloning all messages passing through the
      function tipc_msg_lookup_dest(), hence ensuring that the headroom
      is always valid when the message is passed on.
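      
      The corruption mechanism can be reproduced with plain byte buffers.
      The Python sketch below is an analogy, not skb code: a bytearray
      stands in for the bundle's data area, the helpers are hypothetical,
      and 14 bytes is used as the size of an Ethernet header.

```python
ETH_HLEN = 14  # size of an Ethernet (L2) header


def push_l2_header(buf, msg_offset):
    """Write a dummy L2 header into the bytes just before msg_offset."""
    buf[msg_offset - ETH_HLEN:msg_offset] = b"\xee" * ETH_HLEN


def forward_clone(bundle, msg_offset):
    """Clone case: the inner message still lives inside the bundle buffer,
    so its 'headroom' is really the tail of the preceding message."""
    push_l2_header(bundle, msg_offset)


def forward_uncloned(bundle, msg_offset, msg_len):
    """Uncloned case: copy the message into a private buffer that has
    genuine headroom before the L2 header is written."""
    private = bytearray(ETH_HLEN) + bundle[msg_offset:msg_offset + msg_len]
    push_l2_header(private, ETH_HLEN)
    return private


def make_bundle():
    """Two 32-byte inner messages packed back to back."""
    return bytearray(b"A" * 32 + b"B" * 32)
```

      Forwarding the second message via forward_clone() overwrites the
      last 14 bytes of the first one, whereas forward_uncloned() leaves
      the bundle untouched.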
      Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9e2971b
  16. 01 October, 2017 1 commit
  17. 25 August, 2017 1 commit
  18. 15 August, 2017 1 commit
    • tipc: avoid inheriting msg_non_seq flag when message is returned · 59a361bc
      Committed by Jon Paul Maloy
      In the function msg_reverse(), we reverse the header while trying to
      reuse the original buffer whenever possible. Those rejected/returned
      messages are always transmitted as unicast, but the msg_non_seq field
      is not explicitly set to zero as it should be.
      
      We have seen cases where multicast senders set the message type to
      "NOT dest_droppable", meaning that a multicast message shorter than
      one MTU will be returned, e.g., during receive buffer overflow, by
      reusing the original buffer. This has the effect that even the
      'msg_non_seq' field is inadvertently inherited by the rejected message,
      although it is now sent as a unicast message. This again leads the
      receiving unicast link endpoint to steer the packet toward the broadcast
      link receive function, where it is dropped. The affected unicast link is
      thereafter (after 100 failed retransmissions) declared 'stale' and
      reset.
      
      We fix this by unconditionally setting the 'msg_non_seq' flag to zero
      for all rejected/returned messages.
      Reported-by: Canh Duc Luu <canh.d.luu@dektech.com.au>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59a361bc
  19. 11 June, 2017 1 commit
  20. 21 January, 2017 1 commit
  21. 17 January, 2017 1 commit
  22. 06 December, 2016 1 commit
    • [iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8
      Committed by Al Viro
      copy_from_iter_full(), copy_from_iter_full_nocache() and
      csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
      et al., advancing iterator only in case of successful full copy
      and returning whether it had been successful or not.
      
      Convert some obvious users.  *NOTE* - do not blindly assume that
      something is a good candidate for those unless you are sure that
      not advancing iov_iter in failure case is the right thing in
      this case.  Anything that does short read/short write kind of
      stuff (or is in a loop, etc.) is unlikely to be a good one.
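      
      The difference between the two families can be shown with a toy
      iterator. The Python sketch below is not the kernel API: ToyIter,
      toy_copy() and toy_copy_full() are hypothetical, and a short source
      is used to stand in for any reason a copy might come up short.

```python
class ToyIter:
    """A byte source with a read position, loosely modelling an iov_iter."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def remaining(self):
        return len(self.data) - self.pos


def toy_copy(it, n):
    """Partial-copy semantics: copy up to n bytes and advance by that many."""
    chunk = it.data[it.pos:it.pos + n]
    it.pos += len(chunk)
    return chunk


def toy_copy_full(it, n):
    """All-or-nothing semantics: on failure the iterator is not advanced."""
    if it.remaining() < n:
        return False, b""
    return True, toy_copy(it, n)
```

      After a failed toy_copy_full() the caller can still retry or fall
      back from the same position, which is exactly why short read/write
      loops are poor candidates for the "full" variants.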
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cbbd26b8
  23. 23 June, 2016 1 commit
    • tipc: unclone unbundled buffers before forwarding · 27777daa
      Committed by Jon Paul Maloy
      When extracting an individual message from a received "bundle" buffer,
      we just create a clone of the base buffer, and adjust it to point into
      the right position of the linearized data area of the latter. This works
      well for regular message reception, but during periods of extremely high
      load it may happen that an extracted buffer, e.g., a connection probe, is
      reversed and forwarded through an external interface while the preceding
      extracted message is still unhandled. When this happens, the header or
      data area of the preceding message will be partially overwritten by a
      MAC header, leading to unpredictable consequences, such as a link
      reset.
      
      We now fix this by ensuring that the msg_reverse() function never
      returns a cloned buffer, and that the returned buffer always contains
      sufficient valid head and tail room to be forwarded.
      Reported-by: Erik Hugne <erik.hugne@gmail.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27777daa
  24. 24 October, 2015 2 commits
  25. 22 October, 2015 1 commit
    • tipc: allow non-linear first fragment buffer · 45c8b7b1
      Committed by Jon Paul Maloy
      The current code for message reassembly is erroneously assuming that
      the first arriving fragment buffer always is linear, and then goes
      ahead resetting the fragment list of that buffer in anticipation of
      more arriving fragments.
      
      However, if the buffer already happens to be non-linear, we will
      inadvertently drop the already attached fragment list, and later
      on trigger a BUG() in __pskb_pull_tail().
      
      We see this happen when running fragmented TIPC multicast across UDP,
      something made possible since
      commit d0f91938 ("tipc: add ip/udp media type")
      
      We fix this by not resetting the fragment list when the buffer is
      non-linear, and by initializing our private fragment list tail pointer
      to the tail of the existing fragment list.
      
      Fixes: commit d0f91938 ("tipc: add ip/udp media type")
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45c8b7b1
  26. 16 October, 2015 1 commit
    • tipc: disallow packet duplicates in link deferred queue · 8306f99a
      Committed by Jon Paul Maloy
      After the previous commits, we are guaranteed that no packets
      of type LINK_PROTOCOL or with illegal sequence numbers will be
      added to the link deferred queue. This makes it possible to
      make some simplifications to the sorting algorithm in the function
      tipc_skb_queue_sorted().
      
      We also alter the function so that it will drop packets if one with
      the same sequence number is already present in the queue. This is
      necessary because we have identified weird packet sequences, involving
      duplicate packets, where a legitimate in-sequence packet may advance to
      the head of the queue without being detected and de-queued.
      
      Finally, we move this function out of line, since it will now be called
      only in exceptional cases.
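      
      A sorted insert that rejects duplicates can be sketched as follows.
      This Python model is illustrative only: it works on a plain list of
      sequence numbers, assumes 16-bit sequence numbers compared modulo
      2^16, and the helper names seq_less() and queue_sorted() are
      hypothetical.

```python
def seq_less(a, b):
    """True if sequence number a precedes b, modulo 2**16 (assumed width)."""
    return a != b and ((b - a) & 0xFFFF) < 0x8000


def queue_sorted(queue, seqno):
    """Insert seqno into the sorted queue; return False if it is a duplicate."""
    for i, cur in enumerate(queue):
        if cur == seqno:
            return False           # duplicate: drop the packet
        if seq_less(seqno, cur):
            queue.insert(i, seqno) # first entry that comes after seqno
            return True
    queue.append(seqno)            # seqno is the newest entry
    return True
```

      Inserting 5, 3, 4 and then 3 again leaves the queue as [3, 4, 5],
      with the second 3 dropped; the modular comparison also keeps 65535
      ahead of 0 across a wrap.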
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8306f99a
  27. 21 September, 2015 1 commit
  28. 27 July, 2015 3 commits
    • tipc: clean up socket layer message reception · cda3696d
      Committed by Jon Paul Maloy
      When a message is received in a socket, one of the call chains
      tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
      or
      tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
      are followed. At each of these levels we may encounter situations
      where the message may need to be rejected, or a new message
      produced for transfer back to the sender. Despite recent
      improvements, the current code for doing this is perceived
      as awkward and hard to follow.
      
      Leveraging the two previous commits in this series, we now
      introduce a more uniform handling of such situations. We
      let each of the functions in the chain itself produce/reverse
      the message to be returned to the sender, but also perform the
      actual forwarding. This simplifies the necessary logic within
      each function.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cda3696d
    • J
      tipc: introduce new tipc_sk_respond() function · bcd3ffd4
      Jon Paul Maloy authored
      Currently, we use the code sequence
      
      if (msg_reverse())
         tipc_link_xmit_skb()
      
      at numerous locations in socket.c. The preparation of arguments
      for these calls, as well as the sequence itself, makes the code
      unnecessarily complex.
      
      In this commit, we introduce a new function, tipc_sk_respond(),
      that performs this call combination. We also replace some, but not
      yet all, of these explicit call sequences with calls to the new
      function. Notably, we let the function tipc_sk_proto_rcv() use
      the new function to directly send out PROBE_REPLY messages,
      instead of deferring this to the calling tipc_sk_rcv() function,
      as we do now.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bcd3ffd4
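The "reverse, then transmit" combination can be sketched roughly as below. This is an illustration only: struct demo_msg, demo_msg_reverse(), demo_link_xmit() and demo_sk_respond() are hypothetical stand-ins, not the signatures used in socket.c, and the rule that an already-rejected message is not reversed again is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified message; the real code operates on sk_buffs. */
struct demo_msg {
    uint32_t orig_node, orig_port;
    uint32_t dest_node, dest_port;
    int errcode;
};

int demo_sent_count;    /* messages handed to the stand-in transmit path */

/*
 * Stand-in for the reverse step: swap the endpoints and record the error
 * code. Assumed here: a message that already carries an error code is not
 * reversed again, so rejections cannot bounce back and forth.
 */
bool demo_msg_reverse(struct demo_msg *m, int err)
{
    uint32_t node, port;

    assert(m);
    if (m->errcode)
        return false;

    node = m->orig_node;
    port = m->orig_port;
    m->orig_node = m->dest_node;
    m->orig_port = m->dest_port;
    m->dest_node = node;
    m->dest_port = port;
    m->errcode = err;
    return true;
}

/* Stand-in for the transmit step; it only counts calls. */
void demo_link_xmit(struct demo_msg *m)
{
    (void)m;
    demo_sent_count++;
}

/* One helper replaces the repeated "if (reverse) xmit" call sequence. */
bool demo_sk_respond(struct demo_msg *m, int err)
{
    if (!demo_msg_reverse(m, err))
        return false;
    demo_link_xmit(m);
    return true;
}
```

Callers then need a single call and no longer prepare the arguments for two separate functions at every rejection site.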
    • J
      tipc: let function tipc_msg_reverse() expand header when needed · 29042e19
      Jon Paul Maloy authored
      The shortest TIPC message header, for cluster local CONNECTED messages,
      is 24 bytes long. With this format, the fields "dest_node" and
      "orig_node" are optimized away, since they in reality are redundant
      in this particular case.
      
      However, the absence of these fields leads to code inconsistencies
      that are difficult to handle in some cases, especially when we need
      to reverse or reject messages at the socket layer.
      
      In this commit, we concentrate the handling of the absent fields
      to one place, by letting the function tipc_msg_reverse() reallocate
      the buffer and expand the header to 32 bytes when necessary. This
      means that the socket code now can assume that the two previously
      absent fields are present in the header when a message needs to be
      rejected. This opens the way for some further simplifications of the
      socket code.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29042e19
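A rough sketch of the header expansion, under stated assumptions: the message is modelled as a flat byte buffer (struct flat_msg is hypothetical), the two added words are assumed to sit directly after the 24-byte header, and they are written in host byte order for simplicity. The real tipc_msg_reverse() works on sk_buffs and differs in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { SHORT_HDR = 24, FULL_HDR = 32 };

/* Hypothetical flat message: hdr_len bytes of header, then payload. */
struct flat_msg {
    size_t hdr_len;
    size_t len;         /* total length, header plus payload */
    uint8_t *data;      /* heap-allocated, owned by the message */
};

/*
 * If m uses the short header, reallocate it with a full header and fill
 * the two added words with the given node addresses. Returns 0 on success
 * (including when no expansion was needed) and -1 on allocation failure,
 * in which case m is left unchanged.
 */
int expand_header(struct flat_msg *m, uint32_t orig_node, uint32_t dest_node)
{
    size_t new_len;
    uint8_t *buf;

    assert(m && m->data && m->len >= m->hdr_len);
    if (m->hdr_len != SHORT_HDR)
        return 0;

    new_len = m->len + (FULL_HDR - SHORT_HDR);
    buf = malloc(new_len);
    if (!buf)
        return -1;

    memcpy(buf, m->data, SHORT_HDR);                    /* old header */
    memcpy(buf + SHORT_HDR, &orig_node, 4);             /* added word */
    memcpy(buf + SHORT_HDR + 4, &dest_node, 4);         /* added word */
    memcpy(buf + FULL_HDR, m->data + SHORT_HDR,
           m->len - SHORT_HDR);                         /* payload */

    free(m->data);
    m->data = buf;
    m->hdr_len = FULL_HDR;
    m->len = new_len;
    return 0;
}
```

Code that runs after the expansion can read both node fields unconditionally, which is the simplification the commit is after.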
  29. 15 May, 2015 2 commits
    • J
      tipc: add packet sequence number at instant of transmission · dd3f9e70
      Jon Paul Maloy authored
      Currently, the packet sequence number is updated and added to each
      packet at the moment a packet is added to the link backlog queue.
      This is wasteful, since it forces the code to traverse the send
      packet list packet by packet when adding them to the backlog queue.
      It would be better to just splice the whole packet list into the
      backlog queue when that is the right action to take.
      
      In this commit, we make this change. Also, since the sequence numbers
      can no longer be assigned to the packets at the moment they are added
      to the backlog queue, we instead calculate and add them at the moment
      of transmission, when the backlog queue has to be traversed anyway.
      We do this in the function tipc_link_push_packet().
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd3f9e70
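The idea of splicing the whole send list in one step and stamping sequence numbers only at transmission can be sketched as follows. The types struct txpkt, struct txq and struct txlink and both helpers are hypothetical; the kernel uses sk_buff queues and tipc_link_push_packet().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct txpkt {
    uint16_t seqno;
    struct txpkt *next;
};

struct txq {
    struct txpkt *head, *tail;
};

struct txlink {
    struct txq backlog;
    uint16_t next_out_no;   /* next sequence number to assign */
};

/* Append all of src to dst in O(1), leaving src empty. */
void txq_splice_tail(struct txq *dst, struct txq *src)
{
    assert(dst && src);
    if (!src->head)
        return;
    if (dst->tail)
        dst->tail->next = src->head;
    else
        dst->head = src->head;
    dst->tail = src->tail;
    src->head = src->tail = NULL;
}

/*
 * Dequeue one packet from the backlog and stamp its sequence number at
 * this moment, i.e. at transmission. Returns NULL if the backlog is empty.
 */
struct txpkt *link_push_one(struct txlink *l)
{
    struct txpkt *p;

    assert(l);
    p = l->backlog.head;
    if (!p)
        return NULL;

    l->backlog.head = p->next;
    if (!l->backlog.head)
        l->backlog.tail = NULL;
    p->next = NULL;
    p->seqno = l->next_out_no++;
    return p;
}
```

Enqueueing no longer touches individual packets; the per-packet work happens only where the queue is walked anyway.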
    • J
      tipc: improve link congestion algorithm · f21e897e
      Jon Paul Maloy authored
      The link congestion algorithm used until now has two problems.
      
      - It is too generous towards lower-level messages in situations of high
        load by giving "absolute" bandwidth guarantees to the different
        priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranteed
        20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
        available bandwidth. But, in the absence of higher level traffic, the
        ratio between two distinct levels becomes unreasonable. E.g. if there
        is only LOW and MEDIUM traffic on a system, the former is guaranteed
        1/3 of the bandwidth, and the latter 2/3. This again means that if
        there is e.g. one LOW user and 10 MEDIUM users, the former will have
        33.3% of the bandwidth, and the others will have to compete for the
        remainder, i.e. each will end up with 6.7% of the capacity.
      
      - Packets of type MSG_BUNDLER are created at SYSTEM importance level,
        but only after the packets bundled into it have passed the congestion
        test for their own respective levels. Since bundled packets don't
        result in incrementing the level counter for their own importance,
        only occasionally for the SYSTEM level counter, they do in practice
        obtain SYSTEM level importance. Hence, the current implementation
        provides a gap in the congestion algorithm that in the worst case
        may lead to a link reset.
      
      We now refine the congestion algorithm as follows:
      
      - A message is accepted to the link backlog only if its own level
        counter, and all superior level counters, permit it.
      
      - The importance of a created bundle packet is set according to its
        contents. A bundle packet created from messages at levels LOW to
        CRITICAL is given importance level CRITICAL, while a bundle created
        from a SYSTEM level message is given importance SYSTEM. In the latter
        case only subsequent SYSTEM level messages are allowed to be bundled
        into it.
      
      This solves the first problem described above, by making the bandwidth
      guarantee relative to the total number of users at all levels; only
      the upper limit for each level remains absolute. In the example
      described above, the single LOW user would use 1/11th of the bandwidth,
      the same as each of the ten MEDIUM users, but he still has the same
      guarantee against starvation as the latter ones.
      
      The fix also solves the second problem. If the CRITICAL level is filled
      up by bundle packets of that level, no lower level packets will be
      accepted any more.
      Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f21e897e
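The refined acceptance rule ("own level counter and all superior level counters must permit it") can be sketched as below. The enum, struct and function names are hypothetical, and the limits used in any example are arbitrary; only the rule itself is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

enum importance {
    IMP_LOW, IMP_MEDIUM, IMP_HIGH, IMP_CRITICAL, IMP_SYSTEM, IMP_COUNT
};

struct level {
    unsigned len;       /* packets currently counted at this level */
    unsigned limit;     /* upper limit for this level */
};

struct cong_link {
    struct level backlog[IMP_COUNT];
};

/*
 * Accept a message of importance imp only if its own level and every
 * higher level are still below their limits. On acceptance, only the
 * message's own level counter is incremented.
 */
bool backlog_try_accept(struct cong_link *l, enum importance imp)
{
    assert(l && imp < IMP_COUNT);

    for (int i = imp; i < IMP_COUNT; i++) {
        if (l->backlog[i].len >= l->backlog[i].limit)
            return false;   /* congested at this or a superior level */
    }
    l->backlog[imp].len++;
    return true;
}
```

With this rule, a CRITICAL level filled by bundle packets blocks all lower levels, while SYSTEM traffic is still admitted.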
  30. 03 Apr, 2015 1 commit
    • J
      tipc: eliminate delayed link deletion at link failover · dff29b1a
      Jon Paul Maloy authored
      When a bearer is disabled manually, all its links have to be reset
      and deleted. However, if there is a remaining, parallel link ready
      to take over a deleted link's traffic, we currently delay the deletion
      of the removed link until the failover procedure is finished. This
      is because the remaining link needs to access state from the reset
      link, such as the last received packet number, and any partially
      reassembled buffer, in order to perform a successful failover.
      
      In this commit, we instead move the state data over to the new
      link, so that it can fulfill the procedure autonomously, without
      accessing any data on the old link. This means that we can now
      proceed and delete all pertaining links immediately when a bearer
      is disabled. This saves us from some unnecessary complexity in such
      situations.
      
      We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
      ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
      FAILOVER_MSG and SYNCH_MSG respectively.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dff29b1a
  31. 30 Mar, 2015 1 commit
    • J
      tipc: fix two bugs in secondary destination lookup · d482994f
      Jon Paul Maloy authored
      A message sent to a node after a successful name table lookup may still
      find that the destination socket has disappeared, because distribution
      of name table updates is non-atomic. If so, the message will be rejected
      back to the sender with error code TIPC_ERR_NO_PORT. If the source
      socket of the message has disappeared in the meantime, the message
      should be dropped.
      
      However, in the current code, the message will instead be subject to an
      unwanted tertiary lookup, because the function tipc_msg_lookup_dest()
      doesn't check if there is an error code present in the message before
      performing the lookup. In the worst case, the message may now find the
      old destination again, and be redirected once more, instead of being
      dropped directly as it should be.
      
      A second bug in this function is that the "prev_node" field in the message
      is not updated after successful lookup, something that may have
      unpredictable consequences.
      
      The problems arising from those bugs occur very infrequently.
      
      The third change in this function, the test on msg_reroute_msg_cnt(), is
      purely cosmetic, reflecting that the returned value can never be negative.
      
      This commit corrects the two bugs described above.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d482994f
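The two fixes can be illustrated with a small sketch: no lookup is attempted for a message that already carries an error code, and the previous-node field is updated after a successful lookup. Everything here is hypothetical (struct named_msg, demo_lookup() and its single hard-coded entry, lookup_secondary_dest()); it is not the code of tipc_msg_lookup_dest().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the fields involved. */
struct named_msg {
    int errcode;
    unsigned reroute_cnt;
    uint32_t name;
    uint32_t prev_node, dest_node, dest_port;
};

/*
 * Stand-in name table with a single entry: name 42 is published by port
 * 7777 on node 0x1001002. Returns the port, or 0 if the name is unknown.
 */
uint32_t demo_lookup(uint32_t name, uint32_t *node)
{
    if (name != 42)
        return 0;
    *node = 0x1001002u;
    return 7777u;
}

/* Try one secondary lookup; returns true if the message was redirected. */
bool lookup_secondary_dest(struct named_msg *m, uint32_t own_node)
{
    uint32_t node = 0, port;

    assert(m);
    if (m->errcode)         /* already rejected: never look up again */
        return false;
    if (m->reroute_cnt)     /* already rerouted once */
        return false;

    port = demo_lookup(m->name, &node);
    if (!port)
        return false;

    m->reroute_cnt++;
    m->prev_node = own_node;    /* record the node that redirected it */
    m->dest_node = node;
    m->dest_port = port;
    return true;
}
```

A rejected message returning to a vanished sender therefore fails the first test and can be dropped instead of being looked up a third time.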
  32. 15 Mar, 2015 4 commits
    • J
      tipc: clean up handling of message priorities · e3eea1eb
      Jon Paul Maloy authored
      Messages transferred by TIPC are assigned an "importance priority", an
      integer value indicating how to treat the message when there is link or
      destination socket congestion.
      
      There is no separate header field for this value. Instead, the message
      user values have been chosen in ascending order according to perceived
      importance, so that the message user field can be used for this.
      
      This is not a good solution. First, we have many more users than the
      needed priority levels, so we end up handling more priority
      levels than necessary. Second, the user field cannot always
      accurately reflect the priority of the message. E.g., a message
      fragment packet should really have the priority of the enveloped
      user data message, and not the priority of the MSG_FRAGMENTER user.
      Until now, we have been working around this problem in different ways,
      but it is now time to implement a consistent way of handling such
      priorities, although still within the constraint that we cannot
      allocate any more bits in the regular data message header for this.
      
      In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
      that will be the only one used apart from the four (lower) user data
      levels. All non-data messages map down to this priority. Furthermore,
      we take some free bits from the MSG_FRAGMENTER header and allocate
      them to store the priority of the enveloped message. We then adjust
      the functions msg_importance()/msg_set_importance() so that they
      read/set the correct header fields depending on user type.
      
      This small protocol change is fully compatible, because the code at
      the receiving end of a link currently reads the importance level
      only from user data messages, where there is no change.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3eea1eb
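A sketch of how the accessors might pick the header field depending on user type. The struct prio_hdr layout, the accessor names and the numeric user values are illustrative assumptions; the real msg_importance()/msg_set_importance() operate on header bit fields.

```c
#include <assert.h>
#include <stdint.h>

/* The four user data levels plus the single system level. */
enum { PRIO_LOW, PRIO_MEDIUM, PRIO_HIGH, PRIO_CRITICAL, PRIO_SYSTEM };

/* Illustrative user values for two non-data message users. */
enum { USER_LINK_PROTOCOL = 7, USER_MSG_FRAGMENTER = 12 };

/* Hypothetical header: the user field plus spare bits in fragments. */
struct prio_hdr {
    uint8_t user;
    uint8_t inner_importance;   /* only meaningful for fragments */
};

unsigned prio_get(const struct prio_hdr *h)
{
    assert(h);
    if (h->user <= PRIO_CRITICAL)
        return h->user;                 /* data message: user is priority */
    if (h->user == USER_MSG_FRAGMENTER)
        return h->inner_importance;     /* priority of enveloped message */
    return PRIO_SYSTEM;                 /* all other non-data messages */
}

void prio_set(struct prio_hdr *h, unsigned imp)
{
    assert(h);
    if (h->user == USER_MSG_FRAGMENTER)
        h->inner_importance = (uint8_t)imp;
    else if (h->user <= PRIO_CRITICAL && imp <= PRIO_CRITICAL)
        h->user = (uint8_t)imp;
    /* Other users always map to PRIO_SYSTEM; nothing is stored. */
}
```

Because data messages still carry their priority in the user field, a receiver that reads the level only from data messages sees no difference.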
    • J
      tipc: split link outqueue · 05dcc5aa
      Jon Paul Maloy authored
      struct tipc_link contains one single queue for outgoing packets,
      where both transmitted and waiting packets are queued.
      
      This infrastructure is hard to maintain, because we need
      to keep a number of fields to keep track of which packets are
      sent or unsent, and the number of packets in each category.
      
      A lot of code becomes simpler if we split this queue into a transmission
      queue, where sent/unacknowledged packets are kept, and a backlog queue,
      where we keep the not yet sent packets.
      
      In this commit we do this separation.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05dcc5aa
    • J
      tipc: extract bundled buffers by cloning instead of copying · c1336ee4
      Jon Paul Maloy authored
      When we currently extract a bundled buffer from a message bundle in
      the function tipc_msg_extract(), we allocate a new buffer and explicitly
      copy the linear data area.
      
      This is unnecessary, since we can just clone the buffer and do
      skb_pull() on the clone to move the data pointer to the correct
      position.
      
      This is what we do in this commit.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1336ee4
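The difference between copying and cloning can be sketched with a reference-counted buffer and lightweight views onto it. The types and helpers below are hypothetical analogues: view_clone() plays the role of skb_clone() and view_pull() that of skb_pull(), but none of this is the sk_buff API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared data area with a reference count. */
struct shared_buf {
    int refs;
    uint8_t *bytes;
};

/* A view onto part of a shared buffer. */
struct buf_view {
    struct shared_buf *buf;
    size_t off;
    size_t len;
};

/* New view onto the same data; no bytes are copied. */
struct buf_view view_clone(const struct buf_view *v)
{
    assert(v && v->buf);
    v->buf->refs++;
    return *v;
}

/* Advance the start of the view by n bytes. */
int view_pull(struct buf_view *v, size_t n)
{
    assert(v);
    if (n > v->len)
        return -1;
    v->off += n;
    v->len -= n;
    return 0;
}

const uint8_t *view_data(const struct buf_view *v)
{
    assert(v && v->buf);
    return v->buf->bytes + v->off;
}

/* Extract the inner message at [pos, pos + ilen) without copying it. */
int extract_inner(const struct buf_view *bundle, size_t pos, size_t ilen,
                  struct buf_view *out)
{
    assert(bundle && out);
    if (pos > bundle->len || ilen > bundle->len - pos)
        return -1;

    *out = view_clone(bundle);
    view_pull(out, pos);
    out->len = ilen;    /* trim the view to the inner message */
    return 0;
}
```

The extracted view points into the bundle's own data area, so the cost of extraction no longer depends on the size of the inner message.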
    • J
      tipc: eliminate unnecessary linearization of incoming buffers · 1149557d
      Jon Paul Maloy authored
      Currently, TIPC linearizes all incoming buffers directly at reception
      before passing them upwards in the stack. This is clearly a waste of
      CPU resources, and must be avoided.
      
      In this commit, we eliminate this unnecessary linearization. We still
      ensure that at least the message header is linear, and that the buffer
      is linearized where this is still needed, i.e. when unbundling and when
      reversing messages.
      
      In addition, we ensure that fragmented messages are validated after
      reassembly before delivering them upwards in the stack.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1149557d