1. 17 Jan, 2017 1 commit
  2. 01 Nov, 2016 1 commit
  3. 30 Oct, 2016 1 commit
    • tipc: fix broadcast link synchronization problem · 06bd2b1e
      Committed by Jon Paul Maloy
      In commit 2d18ac4b ("tipc: extend broadcast link initialization
      criteria") we tried to fix a problem with the initial synchronization
      of broadcast link acknowledge values. Unfortunately that solution is
      not sufficient to solve the issue.
      
      We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
      non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
      initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
      broadcast acknowledge numbers, leading to premature opening of the
      broadcast link. When the bypassed packets finally arrive, they are
      inadvertently accepted, and the already correctly initialized
      acknowledge number in the broadcast receive link is overwritten by
      the invalid (zero) value of the said packets. After this the broadcast
      link goes stale.
      
      We now fix this by marking the packets where we know the acknowledge
      value is or may be invalid, and then ignoring the acks from those.
      
      For this purpose, we claim an unused bit in the header to indicate that
      the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
      synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
      packets, plus those LINK_PROTOCOL packets sent out before the broadcast
      links are fully synchronized.
      
      This minor protocol update is fully backwards compatible.
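
A minimal sketch of the receive-side rule described above, assuming a simplified header and link representation. The struct and function names here are hypothetical and are not taken from the TIPC sources; the point is only that an ack marked invalid never overwrites an already initialized value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the header fields relevant here. */
struct pkt_hdr {
	uint16_t bc_ack;      /* broadcast acknowledge number */
	bool bc_ack_invalid;  /* models the newly claimed header bit */
};

/* Hypothetical broadcast receive link state. */
struct bc_rcv_link {
	uint16_t acked;       /* last accepted broadcast ack */
};

/* Apply the ack carried in hdr unless it is marked invalid.
 * Returns true if the ack was applied.
 */
static bool bc_ack_update(struct bc_rcv_link *l, const struct pkt_hdr *hdr)
{
	if (hdr->bc_ack_invalid)
		return false;  /* late "bulk"/sync packet: ignore its ack */
	l->acked = hdr->bc_ack;
	return true;
}
```

With this rule, a bypassed packet that arrives late with a zero ack and the bit set leaves the link's acknowledge number untouched.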
      Reported-by: John Thompson <thompa.atl@gmail.com>
      Tested-by: John Thompson <thompa.atl@gmail.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 03 Sep, 2016 1 commit
    • tipc: transfer broadcast nacks in link state messages · 02d11ca2
      Committed by Jon Paul Maloy
      When we send broadcasts in clusters of more than 70-80 nodes, we sometimes
      see the broadcast link resetting because of an excessive number of
      retransmissions. This is caused by a combination of two factors:
      
      1) A "NACK crunch", where loss of broadcast packets is discovered
         and NACK'ed by several nodes simultaneously, leading to multiple
         redundant broadcast retransmissions.
      
      2) The fact that the NACKs themselves are also sent as broadcast, leading
         to excessive load and packet loss on the transmitting switch/bridge.
      
      This commit deals with the latter problem, by moving sending of
      broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
      to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
      bits in word 8 of the said message for this purpose, and introduce a
      new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
      backwards compatible.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 23 Jun, 2016 1 commit
    • tipc: unclone unbundled buffers before forwarding · 27777daa
      Committed by Jon Paul Maloy
      When extracting an individual message from a received "bundle" buffer,
      we just create a clone of the base buffer, and adjust it to point into
      the right position of the linearized data area of the latter. This works
      well for regular message reception, but during periods of extremely high
      load it may happen that an extracted buffer, e.g., a connection probe, is
      reversed and forwarded through an external interface while the preceding
      extracted message is still unhandled. When this happens, the header or
      data area of the preceding message will be partially overwritten by a
      MAC header, leading to unpredictable consequences, such as a link
      reset.
      
      We now fix this by ensuring that the msg_reverse() function never
      returns a cloned buffer, and that the returned buffer always contains
      sufficient valid head and tail room to be forwarded.
      Reported-by: Erik Hugne <erik.hugne@gmail.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 04 May, 2016 1 commit
    • tipc: redesign connection-level flow control · 10724cc7
      Committed by Jon Paul Maloy
      There are two flow control mechanisms in TIPC; one at link level that
      handles network congestion, burst control, and retransmission, and one
      at connection level whose only remaining task is to prevent overflow
      in the receiving socket buffer. In TIPC, the latter task has to be
      solved end-to-end because messages cannot be thrown away once they
      have been accepted and delivered upwards from the link layer, i.e., we
      can never permit the receive buffer to overflow.
      
      Currently, this algorithm is message based. A counter in the receiving
      socket keeps track of the number of consumed messages, and sends a dedicated
      acknowledge message back to the sender for every 256 consumed messages.
      A counter at the sending end keeps track of the sent, not yet
      acknowledged messages, and blocks the sender if this number ever reaches
      512 unacknowledged messages. When the missing acknowledge arrives, the
      socket is then woken up for renewed transmission. This works well for
      keeping the message flow running, as it almost never happens that a
      sender socket is blocked this way.
      
      A problem with the current mechanism is that it potentially is very
      memory consuming. Since we don't distinguish between small and large
      messages, we have to dimension the socket receive buffer according
      to a worst-case of both. I.e., the window size must be chosen large
      enough to sustain a reasonable throughput even for the smallest
      messages, while we must still consider a scenario where all messages
      are of maximum size. Hence, the current fixed window size of 512 messages
      and a maximum message size of 66k results in a receive buffer of 66 MB
      when truesize(66k) = 131k is taken into account. It is possible to do
      much better.
      
      This commit introduces an algorithm where we instead use 1024-byte
      blocks as base unit. This unit, always rounded upwards from the
      actual message size, is used when we advertise windows as well as when
      we count and acknowledge transmitted data. The advertised window is
      based on the configured receive buffer size in such a way that even
      the worst-case truesize/msgsize ratio always is covered. Since the
      smallest possible message size (from a flow control viewpoint) now is
      1024 bytes, we can safely assume this ratio to be less than four, which
      is the value we are now using.
      
      This way, we have been able to reduce the default receive buffer size
      from 66 MB to 2 MB with maintained performance.
      
      In order to keep this solution backwards compatible, we introduce a
      new capability bit in the discovery protocol, and use this throughout
      the message sending/reception path to always select the right unit.
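
The block-based accounting described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the macro and function names are hypothetical, while the 1024-byte unit and the ratio of four come from the text.

```c
#include <stddef.h>

#define FLOWCTL_BLK_SZ 1024u  /* base unit named in the commit text */
#define FLOWCTL_RATIO  4u     /* assumed bound on truesize/msgsize, per the text */

/* Number of 1024-byte blocks a message is charged for, rounded upwards. */
static unsigned int msg_blocks(size_t msg_size)
{
	return (unsigned int)((msg_size + FLOWCTL_BLK_SZ - 1) / FLOWCTL_BLK_SZ);
}

/* Window, in blocks, that a receiver could advertise for a given
 * receive buffer size while still covering the worst-case ratio.
 */
static unsigned int adv_window_blocks(size_t rcvbuf_bytes)
{
	return (unsigned int)(rcvbuf_bytes / FLOWCTL_RATIO / FLOWCTL_BLK_SZ);
}
```

Under these assumptions a 2 MB receive buffer corresponds to an advertised window of 512 blocks, and a one-byte message is charged the same single block as a 1024-byte message.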
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 16 Apr, 2016 1 commit
    • tipc: guarantee peer bearer id exchange after reboot · 634696b1
      Committed by Jon Paul Maloy
      When a link endpoint is going down locally, e.g., because its interface
      is being stopped, it will spontaneously send out a RESET message to
      its peer, informing it about this fact. This saves the peer from
      detecting the failure via probing, and hence gives both speedier and
      less resource consuming failure detection on the peer side.
      
      According to the link FSM, a receiver of a RESET message, ignoring the
      reason for it, must now consider the sender ready to come back up, and
      start periodically sending out ACTIVATE messages to the peer in order
      to re-establish the link. Also, according to the FSM, the receiver of
      an ACTIVATE message can now go directly to state ESTABLISHED and start
      sending regular traffic packets. This is a well-proven and robust FSM.
      
      However, in the case of a reboot, there is a small possibility that the link
      endpoint on the rebooted node may have been re-created with a new bearer
      identity between the moment it sent its (pre-boot) RESET and the moment
      it receives the ACTIVATE from the peer. The new bearer identity cannot
      be known by the peer according to this scenario, since traffic headers
      don't convey such information. This is a problem, because both endpoints
      need to know the correct value of the peer's bearer id at any moment in
      time in order to be able to produce correct link events for their users.
      
      The only way to guarantee this is to enforce a full setup message
      exchange (RESET + ACTIVATE) even after the reboot, since those messages
      carry the bearer identity in their header.
      
      In this commit we do this by introducing and setting a "stopping" bit in
      the header of the spontaneously generated RESET messages, informing the
      peer that the sender will not be immediately ready to re-establish the
      link. A receiver seeing this bit must act as if this were a locally
      detected connectivity failure, and hence has to go through a full two-
      way setup message exchange before any link can be re-established.
      
      Although never reported, this problem seems to have always been around.
      
      This protocol addition is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 08 Apr, 2016 1 commit
    • tipc: stricter filtering of packets in bearer layer · 5b7066c3
      Committed by Jon Paul Maloy
      Resetting a bearer/interface, with the consequence of resetting all its
      pertaining links, is not an atomic action. This becomes particularly
      evident in very large clusters, where a lot of traffic may happen on the
      remaining links while we are busy shutting them down. In extreme cases,
      we may even see links being re-created and re-established before we are
      finished with the job.
      
      To solve this, we now introduce a solution where we temporarily detach
      the bearer from the interface when the bearer is reset. This inhibits
      all packet reception, while sending still is possible. For the latter,
      we use the fact that the device's user pointer now is zero to filter out
      which packets can be sent during this situation; i.e., outgoing RESET
      messages only.  This filtering serves to speed up the neighbors'
      detection of the loss event, and saves us from unnecessary probing.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 24 Oct, 2015 3 commits
    • tipc: let broadcast packet reception use new link receive function · 52666986
      Committed by Jon Paul Maloy
      The code path for receiving broadcast packets is currently distinct
      from the unicast path. This leads to unnecessary code and data
      duplication, something that can be avoided with some effort.
      
      We now introduce separate per-peer tipc_link instances for handling
      broadcast packet reception. Each receive link keeps a pointer to the
      common, single, broadcast link instance, and can hence handle release
      and retransmission of send buffers as if they belonged to its own
      instance.
      
      Furthermore, we let each unicast link instance keep a reference to both
      the pertaining broadcast receive link, and to the common send link.
      This makes it possible for the unicast links to easily access data for
      broadcast link synchronization, as well as for carrying acknowledges for
      received broadcast packets.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: let broadcast transmission use new link transmit function · 2f566124
      Committed by Jon Paul Maloy
      This commit simplifies the broadcast link transmission function, by
      leveraging previous changes to the link transmission function and the
      broadcast transmission link life cycle.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: make struct tipc_link generic to support broadcast · c1ab3f1d
      Committed by Jon Paul Maloy
      Realizing that unicast is just a special case of broadcast, we also see
      that we can go in the other direction, i.e., that modest changes to the
      current unicast link can make it generic enough to support broadcast.
      
      The following changes are introduced here:
      
      - A new counter ("ackers") in struct tipc_link, to indicate how many
        peers need to ack a packet before it can be released.
      - A corresponding counter in the skb user area, to keep track of how
        many peers are left to ack before a buffer can be released.
      - A new counter ("acked"), to keep persistent track of how far a peer
        has acked at the moment, i.e., where in the transmission queue to
        start updating buffers when the next ack arrives. This is to avoid
        double acknowledgements from a peer, with inadvertent release of
        packets as a result.
      - A more generic tipc_link_retrans() function, where retransmit starts
        from a given sequence number, instead of the first packet in the
        transmission queue. This is to minimize the number of retransmitted
        packets on the broadcast media.
      
      When the new functionality is taken into use in the next commits,
      we expect it to have minimal effect on unicast mode performance.
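
To illustrate how the per-buffer "ackers" counter and the per-peer "acked" marker listed above might interact, here is a small sketch. It uses an array instead of an skb queue, and all type and function names are hypothetical rather than taken from struct tipc_link.

```c
#include <stdint.h>

/* Hypothetical stand-in for a buffer in the broadcast transmit queue. */
struct bc_buf {
	uint16_t seqno;
	int ackers;  /* peers that still have to ack this buffer */
};

/* Signed distance between two 16-bit sequence numbers. */
static int16_t seq_diff(uint16_t a, uint16_t b)
{
	return (int16_t)(a - b);
}

/* Register that one peer has acked everything up to and including 'ack'.
 * '*peer_acked' remembers how far this peer had acked before, so a
 * repeated ack cannot decrement the same buffer twice.
 * Returns the number of buffers that became releasable.
 */
static int peer_ack(struct bc_buf *q, int n, uint16_t *peer_acked, uint16_t ack)
{
	int released = 0;

	for (int i = 0; i < n; i++) {
		if (seq_diff(q[i].seqno, *peer_acked) <= 0)
			continue;  /* already acked by this peer */
		if (seq_diff(q[i].seqno, ack) > 0)
			continue;  /* beyond the new ack */
		if (--q[i].ackers == 0)
			released++;
	}
	if (seq_diff(ack, *peer_acked) > 0)
		*peer_acked = ack;
	return released;
}
```

A buffer only becomes releasable once every peer has acked it, and a duplicate ack from the same peer is a no-op.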
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 16 Oct, 2015 1 commit
    • tipc: disallow packet duplicates in link deferred queue · 8306f99a
      Committed by Jon Paul Maloy
      After the previous commits, we are guaranteed that no attempt will be
      made to add packets of type LINK_PROTOCOL, or packets with illegal
      sequence numbers, to the link deferred queue. This makes it possible to
      make some simplifications to the sorting algorithm in the function
      tipc_skb_queue_sorted().
      
      We also alter the function so that it will drop packets if one with
      the same sequence number is already present in the queue. This is
      necessary because we have identified weird packet sequences, involving
      duplicate packets, where a legitimate in-sequence packet may advance to
      the head of the queue without being detected and de-queued.
      
      Finally, we make this function non-inline, since it will now be called only
      in exceptional cases.
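
The duplicate-dropping sorted insert can be sketched as below. This is not the tipc_skb_queue_sorted() implementation: it uses a small fixed-size array of sequence numbers instead of an skb list, and the names defq and defq_insert are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEFQ_CAP 16  /* arbitrary capacity for this sketch */

/* Hypothetical deferred queue, kept sorted by sequence number. */
struct defq {
	uint16_t seqno[DEFQ_CAP];
	int len;
};

/* Wrap-around aware "a comes before b" for 16-bit sequence numbers. */
static bool seq_less(uint16_t a, uint16_t b)
{
	return (int16_t)(a - b) < 0;
}

/* Insert seqno in sequence order.
 * Returns false if the packet must be dropped, i.e., when the same
 * sequence number is already queued or the queue is full.
 */
static bool defq_insert(struct defq *q, uint16_t seqno)
{
	int i = 0;

	while (i < q->len && seq_less(q->seqno[i], seqno))
		i++;
	if (i < q->len && q->seqno[i] == seqno)
		return false;  /* duplicate */
	if (q->len == DEFQ_CAP)
		return false;  /* no room */
	memmove(&q->seqno[i + 1], &q->seqno[i],
		(size_t)(q->len - i) * sizeof(q->seqno[0]));
	q->seqno[i] = seqno;
	q->len++;
	return true;
}
```

Because duplicates never enter the queue, the head of the queue is always the unique lowest pending sequence number.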
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 15 Oct, 2015 1 commit
    • tipc: move fragment importance field to new header position · dde4b5ae
      Committed by Jon Paul Maloy
      In commit e3eea1eb ("tipc: clean up handling of message priorities")
      we introduced a field in the packet header for keeping track of the
      priority of fragments, since this value is not present in the specified
      protocol header. Since the value so far only is used at the transmitting
      end of the link, we have not yet officially defined it as part of the
      protocol.
      
      Unfortunately, the field we use for keeping this value, bits 13-15 in
      word 5, has turned out to be a poor choice; it is already used by the
      broadcast protocol for carrying the 'network id' field of the sending
      node. Since packet fragments also need to be transported across the
      broadcast protocol, the risk of conflict is obvious, and we see this
      happen when we use network identities larger than 2^13-1. This has
      escaped our testing because we have so far only been using small network
      id values.
      
      We now move this field to bits 0-2 in word 9, a field that is guaranteed
      to be unused by all involved protocols.
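
The conflict can be reproduced with a pair of generic bit-field helpers. This is a sketch in host byte order with hypothetical helper names; it also assumes, for illustration only, that the network id is stored as the plain value of word 5.

```c
#include <stdint.h>

/* Read the field at bit position 'pos' with width mask 'mask' in word 'w'. */
static uint32_t hdr_bits(const uint32_t *hdr, int w, int pos, uint32_t mask)
{
	return (hdr[w] >> pos) & mask;
}

/* Write 'val' into the field at bit position 'pos' in word 'w'. */
static void hdr_set_bits(uint32_t *hdr, int w, int pos, uint32_t mask,
			 uint32_t val)
{
	hdr[w] &= ~(mask << pos);
	hdr[w] |= (val & mask) << pos;
}
```

For example, with a network id of 8192 (2^13) in word 5, calling hdr_set_bits(hdr, 5, 13, 0x7, 3) changes word 5 to 0x6000 and thereby corrupts the id, whereas hdr_set_bits(hdr, 9, 0, 0x7, 3) leaves word 5 untouched.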
      
      Fixes: e3eea1eb ("tipc: clean up handling of message priorities")
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 31 Jul, 2015 3 commits
  13. 27 Jul, 2015 3 commits
    • tipc: clean up socket layer message reception · cda3696d
      Committed by Jon Paul Maloy
      When a message is received in a socket, one of the call chains
      tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
      or
      tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
      is followed. At each of these levels we may encounter situations
      where the message may need to be rejected, or a new message
      produced for transfer back to the sender. Despite recent
      improvements, the current code for doing this is perceived
      as awkward and hard to follow.
      
      Leveraging the two previous commits in this series, we now
      introduce a more uniform handling of such situations. We
      let each of the functions in the chain itself produce/reverse
      the message to be returned to the sender, but also perform the
      actual forwarding. This simplifies the necessary logic within
      each function.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce new tipc_sk_respond() function · bcd3ffd4
      Committed by Jon Paul Maloy
      Currently, we use the code sequence
      
      if (msg_reverse())
         tipc_link_xmit_skb()
      
      at numerous locations in socket.c. The preparation of arguments
      for these calls, as well as the sequence itself, makes the code
      unnecessarily complex.
      
      In this commit, we introduce a new function, tipc_sk_respond(),
      that performs this call combination. We also replace some, but not
      yet all, of these explicit call sequences with calls to the new
      function. Notably, we let the function tipc_sk_proto_rcv() use
      the new function to directly send out PROBE_REPLY messages,
      instead of deferring this to the calling tipc_sk_rcv() function,
      as we do now.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: let function tipc_msg_reverse() expand header when needed · 29042e19
      Committed by Jon Paul Maloy
      The shortest TIPC message header, for cluster local CONNECTED messages,
      is 24 bytes long. With this format, the fields "dest_node" and
      "orig_node" are optimized away, since they in reality are redundant
      in this particular case.
      
      However, the absence of these fields leads to code inconsistencies
      that are difficult to handle in some cases, especially when we need
      to reverse or reject messages at the socket layer.
      
      In this commit, we concentrate the handling of the absent fields
      to one place, by letting the function tipc_msg_reverse() reallocate
      the buffer and expand the header to 32 bytes when necessary. This
      means that the socket code now can assume that the two previously
      absent fields are present in the header when a message needs to be
      rejected. This opens up for some further simplifications of the
      socket code.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 21 Jul, 2015 2 commits
    • tipc: reduce locking scope during packet reception · d999297c
      Committed by Jon Paul Maloy
      We convert packet/message reception according to the same principle
      we have been using for message sending and timeout handling:
      
      We move the function tipc_rcv() to node.c, hence handling the initial
      packet reception at the link aggregation level. The function grabs
      the node lock, selects the receiving link, and accesses it via a new
      call tipc_link_rcv(). This function appends buffers to the input
      queue for delivery upwards, but it may also append outgoing packets
      to the xmit queue, just as we do during regular message sending. The
      latter will happen when buffers are forwarded from the link backlog,
      or when retransmission is requested.
      
      Upon return of this function, and after having released the node lock,
      tipc_rcv() delivers/transmits the contents of those queues, but it may
      also perform actions such as link activation or reset, as indicated by
      the return flags from the link.
      
      This reduces the number of cpu cycles spent inside the node spinlock,
      and reduces contention on that lock.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce node contact FSM · 1a20cc25
      Committed by Jon Paul Maloy
      The logic for determining when a node is permitted to establish
      and maintain contact with its peer node becomes non-trivial in the
      presence of multiple parallel links that may come and go independently.
      
      A known failure scenario is that one endpoint registers both its links
      to the peer lost, cleans up its binding table, and prepares for a table
      update once contact is re-established, while the other endpoint may
      see its links reset and re-established one by one, hence seeing
      no need to re-synchronize the binding table. To avoid this, a node
      must not allow re-establishing contact until it has confirmation that
      even the peer has lost both links.
      
      Currently, the mechanism for handling this consists of setting and
      resetting two state flags from different locations in the code. This
      solution is hard to understand and maintain. A closer analysis even
      reveals that it is not completely safe.
      
      In this commit we instead introduce an FSM that keeps track of
      the conditions for when the node can establish and maintain links.
      It has six states and four events, and is strictly based on explicit
      knowledge about the own node's and the peer node's contact states.
      Only events leading to state change are shown as edges in the figure
      below.
      
                                   +--------------+
                                   | SELF_UP/     |
                 +---------------->| PEER_COMING  |-----------------+
          SELF_  |                 +--------------+                 |PEER_
          ESTBL_ |                        |                         |ESTBL_
          CONTACT|      SELF_LOST_CONTACT |                         |CONTACT
                 |                        v                         |
                 |                 +--------------+                 |
                 |      PEER_      | SELF_DOWN/   |     SELF_       |
                 |      LOST_   +--| PEER_LEAVING |<--+ LOST_       v
      +-------------+   CONTACT |  +--------------+   | CONTACT  +-----------+
      | SELF_DOWN/  |<----------+                     +----------| SELF_UP/  |
      | PEER_DOWN   |<----------+                     +----------| PEER_UP   |
      +-------------+   SELF_   |  +--------------+   | PEER_    +-----------+
                 |      LOST_   +--| SELF_LEAVING/|<--+ LOST_       A
                 |      CONTACT    | PEER_DOWN    |     CONTACT     |
                 |                 +--------------+                 |
                 |                         A                        |
          PEER_  |       PEER_LOST_CONTACT |                        |SELF_
          ESTBL_ |                         |                        |ESTBL_
          CONTACT|                 +--------------+                 |CONTACT
                 +---------------->| PEER_UP/     |-----------------+
                                   | SELF_COMING  |
                                   +--------------+
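
The transitions drawn in the figure can be written down as a pure transition function. The sketch below is transcribed from the figure only; the enum and function names are hypothetical spellings of the figure's labels, and any event not drawn as an edge is assumed to leave the state unchanged.

```c
enum node_state {
	SELF_DOWN_PEER_DOWN,
	SELF_UP_PEER_COMING,
	SELF_COMING_PEER_UP,  /* "PEER_UP/SELF_COMING" in the figure */
	SELF_UP_PEER_UP,
	SELF_DOWN_PEER_LEAVING,
	SELF_LEAVING_PEER_DOWN
};

enum node_event {
	SELF_ESTBL_CONTACT,
	SELF_LOST_CONTACT,
	PEER_ESTBL_CONTACT,
	PEER_LOST_CONTACT
};

/* Return the state that follows 's' when event 'e' occurs. */
static enum node_state node_fsm_next(enum node_state s, enum node_event e)
{
	switch (s) {
	case SELF_DOWN_PEER_DOWN:
		if (e == SELF_ESTBL_CONTACT)
			return SELF_UP_PEER_COMING;
		if (e == PEER_ESTBL_CONTACT)
			return SELF_COMING_PEER_UP;
		break;
	case SELF_UP_PEER_COMING:
		if (e == PEER_ESTBL_CONTACT)
			return SELF_UP_PEER_UP;
		if (e == SELF_LOST_CONTACT)
			return SELF_DOWN_PEER_LEAVING;
		break;
	case SELF_COMING_PEER_UP:
		if (e == SELF_ESTBL_CONTACT)
			return SELF_UP_PEER_UP;
		if (e == PEER_LOST_CONTACT)
			return SELF_LEAVING_PEER_DOWN;
		break;
	case SELF_UP_PEER_UP:
		if (e == SELF_LOST_CONTACT)
			return SELF_DOWN_PEER_LEAVING;
		if (e == PEER_LOST_CONTACT)
			return SELF_LEAVING_PEER_DOWN;
		break;
	case SELF_DOWN_PEER_LEAVING:
		if (e == PEER_LOST_CONTACT)
			return SELF_DOWN_PEER_DOWN;
		break;
	case SELF_LEAVING_PEER_DOWN:
		if (e == SELF_LOST_CONTACT)
			return SELF_DOWN_PEER_DOWN;
		break;
	}
	return s;  /* event does not change the state */
}
```

Note that from SELF_UP/PEER_UP the node can only return to SELF_DOWN/PEER_DOWN after both a self and a peer lost-contact event, which is the confirmation requirement described in the text.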
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 15 May, 2015 3 commits
    • tipc: add packet sequence number at instant of transmission · dd3f9e70
      Committed by Jon Paul Maloy
      Currently, the packet sequence number is updated and added to each
      packet at the moment a packet is added to the link backlog queue.
      This is wasteful, since it forces the code to traverse the send
      packet list packet by packet when adding them to the backlog queue.
      It would be better to just splice the whole packet list into the
      backlog queue when that is the right action to do.
      
      In this commit, we make this change. Also, since the sequence numbers
      can no longer be assigned to the packets at the moment they are added
      to the backlog queue, we instead calculate and add them at the moment
      of transmission, when the backlog queue has to be traversed anyway.
      We do this in the function tipc_link_push_packet().
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: improve link congestion algorithm · f21e897e
      Committed by Jon Paul Maloy
      The link congestion algorithm used until now has two problems.
      
      - It is too generous towards lower-level messages in situations of high
        load by giving "absolute" bandwidth guarantees to the different
        priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranteed
        20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
        available bandwidth. But, in the absence of higher level traffic, the
        ratio between two distinct levels becomes unreasonable. E.g. if there
        is only LOW and MEDIUM traffic on a system, the former is guaranteed
        1/3 of the bandwidth, and the latter 2/3. This again means that if
        there is e.g. one LOW user and 10 MEDIUM users, the former will have
        33.3% of the bandwidth, and the others will have to compete for the
        remainder, i.e. each will end up with 6.7% of the capacity.
      
      - Packets of type MSG_BUNDLER are created at SYSTEM importance level,
        but only after the packets bundled into it have passed the congestion
        test for their own respective levels. Since bundled packets don't
        result in incrementing the level counter for their own importance,
        only occasionally for the SYSTEM level counter, they do in practice
        obtain SYSTEM level importance. Hence, the current implementation
        provides a gap in the congestion algorithm that in the worst case
        may lead to a link reset.
      
      We now refine the congestion algorithm as follows:
      
      - A message is accepted to the link backlog only if its own level
        counter, and all superior level counters, permit it.
      
      - The importance of a created bundle packet is set according to its
        contents. A bundle packet created from messages at levels LOW to
        CRITICAL is given importance level CRITICAL, while a bundle created
        from a SYSTEM level message is given importance SYSTEM. In the latter
        case only subsequent SYSTEM level messages are allowed to be bundled
        into it.
      
      This solves the first problem described above, by making the bandwidth
      guarantee relative to the total number of users at all levels; only
      the upper limit for each level remains absolute. In the example
      described above, the single LOW user would use 1/11th of the bandwidth,
      the same as each of the ten MEDIUM users, but he still has the same
      guarantee against starvation as the latter ones.
      
      The fix also solves the second problem. If the CRITICAL level is filled
      up by bundle packets of that level, no lower level packets will be
      accepted any more.
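
The refined acceptance rule can be sketched as a check against the message's own level and every superior level. The struct, enum, and function names below are hypothetical, and the per-level limits are left to the caller since the text does not give concrete values.

```c
#include <stdbool.h>

enum importance {
	IMP_LOW,
	IMP_MEDIUM,
	IMP_HIGH,
	IMP_CRITICAL,
	IMP_SYSTEM,
	IMP_LEVELS
};

/* Hypothetical per-link backlog accounting. */
struct backlog {
	unsigned int len[IMP_LEVELS];    /* messages currently queued per level */
	unsigned int limit[IMP_LEVELS];  /* absolute upper limit per level */
};

/* Accept a message of importance 'imp' only if its own level counter
 * and all superior level counters permit it. On acceptance, only the
 * counter of the message's own level is incremented.
 */
static bool backlog_accept(struct backlog *b, enum importance imp)
{
	for (int lvl = imp; lvl < IMP_LEVELS; lvl++) {
		if (b->len[lvl] >= b->limit[lvl])
			return false;
	}
	b->len[imp]++;
	return true;
}
```

Under this rule, a CRITICAL level that has been filled up, for instance by bundle packets, blocks LOW, MEDIUM, and HIGH messages as well, while SYSTEM messages are still accepted.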
      Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: simplify packet sequence number handling · e4bf4f76
      Committed by Jon Paul Maloy
      Although the sequence number in the TIPC protocol is 16 bits, we have
      until now stored it internally as an unsigned 32-bit integer.
      We got around this by always doing explicit modulo-65536 operations
      whenever we need to access a sequence number.
      
      We now change the incoming and outgoing sequence numbers to unsigned
      16-bit integers, and remove the modulo operations where applicable.
      
      We also move the arithmetic inline functions for 16 bit integers
      to core.h, and the function buf_seqno() to msg.h, so they can easily
      be accessed from anywhere in the code.
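
A sketch of what wrap-around aware arithmetic on 16-bit sequence numbers can look like once the values are stored as unsigned 16-bit integers. The function names are hypothetical and not necessarily those used in core.h; the comparison relies on the signed 16-bit difference, so it is only meaningful when the two numbers are less than 32768 apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'left' is at or before 'right' in sequence order. */
static bool seq_less_eq(uint16_t left, uint16_t right)
{
	return (int16_t)(right - left) >= 0;
}

/* True if 'left' is strictly before 'right'. */
static bool seq_less(uint16_t left, uint16_t right)
{
	return seq_less_eq(left, right) && left != right;
}

/* True if 'left' is strictly after 'right'. */
static bool seq_more(uint16_t left, uint16_t right)
{
	return !seq_less_eq(left, right);
}

/* Next sequence number; wraps from 65535 to 0 without an explicit modulo. */
static uint16_t seq_next(uint16_t seqno)
{
	return (uint16_t)(seqno + 1);
}
```

With a 16-bit type the wrap-around comes for free from the integer width, which is why the explicit modulo operations become unnecessary.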
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 03 Apr, 2015 1 commit
    • tipc: eliminate delayed link deletion at link failover · dff29b1a
      Committed by Jon Paul Maloy
      When a bearer is disabled manually, all its links have to be reset
      and deleted. However, if there is a remaining, parallel link ready
      to take over a deleted link's traffic, we currently delay the delete
      of the removed link until the failover procedure is finished. This
      is because the remaining link needs to access state from the reset
      link, such as the last received packet number, and any partially
      reassembled buffer, in order to perform a successful failover.
      
      In this commit, we instead move the state data over to the new
      link, so that it can fulfill the procedure autonomously, without
      accessing any data on the old link. This means that we can now
      proceed and delete all pertaining links immediately when a bearer
      is disabled. This saves us from some unnecessary complexity in such
      situations.
      
      We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
      ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
      FAILOVER_MSG and SYNCH_MSG respectively.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
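      The idea of handing the failover state to the surviving link can be
      sketched as below. The struct and its field names are hypothetical
      simplifications chosen for illustration; they are not taken from
      struct tipc_link.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified per-link state. */
struct link_state {
	uint16_t rcv_nxt;         /* next packet number expected on this link */
	void    *reasm_buf;       /* partially reassembled message, if any */
	uint16_t failover_point;  /* last packet received on the failed peer link */
	void    *failover_reasm;  /* reassembly buffer inherited from that link */
};

/* Copy over everything the surviving link needs, so that the failed
 * link can be deleted right away instead of lingering until the
 * failover procedure has finished. */
static void failover_take_state(struct link_state *surviving,
				struct link_state *failed)
{
	surviving->failover_point = (uint16_t)(failed->rcv_nxt - 1);
	surviving->failover_reasm = failed->reasm_buf;
	failed->reasm_buf = NULL; /* ownership has moved to the surviving link */
}
```

      After the call, nothing in the surviving link refers back to the
      failed link, which is what allows immediate deletion.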
  17. 26 Mar, 2015 2 commits
    • tipc: eliminate race condition at dual link establishment · 8b4ed863
      Jon Paul Maloy authored
      Despite recent improvements, the establishment of dual parallel
      links still has a small glitch where messages can bypass each
      other. When the second link in a dual-link configuration is
      established, part of the first link's traffic will be steered over
      to the new link. Although we do have a mechanism to ensure that
      packets sent before and after the establishment of the new link
      arrive in sequence at the destination node, this is not enough.
      The arriving messages will still be delivered upwards in different
      threads, which entails a risk of message disordering during
      the transition phase.
      
      To fix this, we introduce a synchronization mechanism between the
      two parallel links, so that traffic arriving on the new link cannot
      be added to its input queue until we are guaranteed that all
      pre-establishment messages have been delivered on the old, parallel
      link.
      
      This problem seems to always have been around, but its occurrence is
      so rare that it has not been noticed until recent intensive testing.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
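      A minimal sketch of such a gate is shown below, assuming that the new
      link records the number of the last pre-establishment packet on the
      old link. The struct, the field names and new_link_may_deliver() are
      hypothetical and only illustrate the principle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified receive-side state for one link. */
struct rx_link {
	uint16_t rcv_nxt;      /* next in-sequence packet expected */
	uint16_t synch_point;  /* last pre-establishment packet on the old link */
	bool     synching;     /* true while delivery must be held back */
};

/* Wraparound-safe "a >= b" for 16-bit sequence numbers. */
static bool seq_reached(uint16_t a, uint16_t b)
{
	return (uint16_t)(a - b) < 32768u;
}

/* May the new link start adding traffic to its input queue? */
static bool new_link_may_deliver(struct rx_link *new_l,
				 const struct rx_link *old_l)
{
	if (!new_l->synching)
		return true;
	/* rcv_nxt is one past the last packet delivered on the old link. */
	if (seq_reached(old_l->rcv_nxt, (uint16_t)(new_l->synch_point + 1))) {
		new_l->synching = false;
		return true;
	}
	return false;
}
```

      Traffic on the new link stays blocked until the old link has
      delivered every packet up to and including the synch point.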
    • tipc: clean up handling of link congestion · 3127a020
      Jon Paul Maloy authored
      After the recent changes in message importance handling it becomes
      possible to simplify handling of messages and sockets when we
      encounter link congestion.
      
      We merge the function tipc_link_cong() into link_schedule_user(),
      and simplify the code of the latter. The code should now be
      easier to follow, especially regarding return codes and handling
      of the message that caused the situation.
      
      In case the scheduling function is unable to pre-allocate a wakeup
      message buffer, it now returns -ENOBUFS, which is a more correct
      code than the previously used -EHOSTUNREACH.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
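      The return-code behavior described above can be sketched roughly as
      follows. This is a user-space illustration, not the kernel function:
      the types, the injectable allocator and the SKETCH_ELINKCONG value
      returned on the normal path are all assumptions.

```c
#include <errno.h>
#include <stdlib.h>

#define SKETCH_ELINKCONG EAGAIN  /* assumed stand-in for "link congested" */

struct wakeup_msg {
	unsigned int       port;
	struct wakeup_msg *next;
};

struct cong_link {
	struct wakeup_msg *wakeup_list;
};

/* Queue a wakeup message for the congested sender. The allocator is
 * passed in so that allocation failure can be exercised in a test. */
static int schedule_user_sketch(struct cong_link *l, unsigned int port,
				void *(*alloc)(size_t))
{
	struct wakeup_msg *w = alloc(sizeof(*w));

	if (!w)
		return -ENOBUFS;  /* no buffer available for the wakeup message */
	w->port = port;
	w->next = l->wakeup_list;
	l->wakeup_list = w;
	return -SKETCH_ELINKCONG; /* sender is told to wait for the wakeup */
}

/* Allocator that always fails, for demonstration purposes. */
static void *failing_alloc(size_t n)
{
	(void)n;
	return NULL;
}
```

      Reporting -ENOBUFS distinguishes a local memory shortage from an
      unreachable peer, which -EHOSTUNREACH would wrongly suggest.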
  18. 15 Mar, 2015 4 commits
    • tipc: clean up handling of message priorities · e3eea1eb
      Jon Paul Maloy authored
      Messages transferred by TIPC are assigned an "importance priority", an
      integer value indicating how to treat the message when there is link or
      destination socket congestion.
      
      There is no separate header field for this value. Instead, the message
      user values have been chosen in ascending order according to perceived
      importance, so that the message user field can be used for this.
      
      This is not a good solution. First, we have many more users than the
      needed priority levels, so we end up handling more priority
      levels than necessary. Second, the user field cannot always
      accurately reflect the priority of the message. E.g., a message
      fragment packet should really have the priority of the enveloped
      user data message, and not the priority of the MSG_FRAGMENTER user.
      Until now, we have been working around this problem in different ways,
      but it is now time to implement a consistent way of handling such
      priorities, although still within the constraint that we cannot
      allocate any more bits in the regular data message header for this.
      
      In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
      that will be the only one used apart from the four (lower) user data
      levels. All non-data messages map down to this priority. Furthermore,
      we take some free bits from the MSG_FRAGMENTER header and allocate
      them to store the priority of the enveloped message. We then adjust
      the functions msg_importance()/msg_set_importance() so that they
      read/set the correct header fields depending on user type.
      
      This small protocol change is fully compatible, because the code at
      the receiving end of a link currently reads the importance level
      only from user data messages, where there is no change.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
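      The user-dependent accessors can be sketched as below. To keep the
      example short, the header is modeled as a struct with plain fields
      rather than as bit fields in header words. The constant values and
      all names carrying a "sketch"/"SK_" marker are assumptions made for
      illustration.

```c
#include <stdint.h>

/* Four user data levels plus one system level (values assumed). */
enum {
	SK_LOW_IMPORTANCE = 0,
	SK_MEDIUM_IMPORTANCE,
	SK_HIGH_IMPORTANCE,
	SK_CRITICAL_IMPORTANCE,
	SK_SYSTEM_IMPORTANCE
};

#define SK_MSG_FRAGMENTER 12  /* hypothetical user value */

struct msg_sketch {
	uint8_t user;      /* message user; doubles as importance for data */
	uint8_t frag_imp;  /* importance of the enveloped message (fragments) */
};

static int msg_importance_sketch(const struct msg_sketch *m)
{
	if (m->user <= SK_CRITICAL_IMPORTANCE)
		return m->user;            /* user data: unchanged behavior */
	if (m->user == SK_MSG_FRAGMENTER)
		return m->frag_imp;        /* priority of the enveloped message */
	return SK_SYSTEM_IMPORTANCE;       /* all other non-data messages */
}

static void msg_set_importance_sketch(struct msg_sketch *m, int imp)
{
	if (m->user == SK_MSG_FRAGMENTER)
		m->frag_imp = (uint8_t)imp;
	else if (m->user <= SK_CRITICAL_IMPORTANCE &&
		 imp <= SK_CRITICAL_IMPORTANCE)
		m->user = (uint8_t)imp;
	/* other users always map to the system level; nothing to store */
}
```

      Because data messages still carry their importance in the user
      field, a receiver that reads only that field sees no difference.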
    • tipc: split link outqueue · 05dcc5aa
      Jon Paul Maloy authored
      struct tipc_link contains one single queue for outgoing packets,
      where both transmitted and waiting packets are queued.
      
      This infrastructure is hard to maintain, because we need
      to keep a number of fields to keep track of which packets are
      sent or unsent, and the number of packets in each category.
      
      A lot of code becomes simpler if we split this queue into a transmission
      queue, where sent/unacknowledged packets are kept, and a backlog queue,
      where we keep the not yet sent packets.
      
      In this commit we do this separation.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
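      The two-queue arrangement can be sketched as follows. This is a toy
      model under stated assumptions: packets are represented by their
      sequence numbers only, the queues are small fixed-size rings, and
      every name (pktq, txlink, txlink_xmit(), txlink_ack()) is
      hypothetical.

```c
#include <stdint.h>

#define QCAP 16u  /* ring capacity; the window must not exceed it */

struct pktq {
	uint16_t     seq[QCAP];
	unsigned int head, len;
};

static void q_push(struct pktq *q, uint16_t s)
{
	q->seq[(q->head + q->len) % QCAP] = s;
	q->len++;
}

static uint16_t q_pop(struct pktq *q)
{
	uint16_t s = q->seq[q->head];

	q->head = (q->head + 1) % QCAP;
	q->len--;
	return s;
}

struct txlink {
	struct pktq  transmq;   /* sent, not yet acknowledged */
	struct pktq  backlogq;  /* accepted, not yet sent */
	unsigned int window;    /* max number of unacknowledged packets */
	uint16_t     snd_nxt;   /* next sequence number to assign */
};

/* "Send" backlog packets for as long as the window has room. */
static void txlink_advance(struct txlink *l)
{
	while (l->backlogq.len && l->transmq.len < l->window)
		q_push(&l->transmq, q_pop(&l->backlogq));
}

static int txlink_xmit(struct txlink *l)
{
	if (l->backlogq.len == QCAP)
		return -1;  /* backlog full */
	q_push(&l->backlogq, l->snd_nxt++);
	txlink_advance(l);
	return 0;
}

/* Release everything up to and including 'acked', then refill. */
static void txlink_ack(struct txlink *l, uint16_t acked)
{
	while (l->transmq.len &&
	       (uint16_t)(acked - l->transmq.seq[l->transmq.head]) < 32768u)
		q_pop(&l->transmq);
	txlink_advance(l);
}
```

      No per-packet "sent" flag or separate counters are needed here:
      which queue a packet sits in already says whether it has been sent.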
    • tipc: move message validation function to msg.c · cf2157f8
      Jon Paul Maloy authored
      The function link_buf_validate() is in reality re-entrant and context
      independent, and will in later commits be called from several locations.
      Therefore, we move it to msg.c, make it a regular (non-inline) function,
      and rename it to tipc_msg_validate().
      
      We also redesign the function to make proper use of pskb_may_pull().
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
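      The "check that the bytes are there before reading them" pattern
      behind pskb_may_pull() can be sketched in user space as below. The
      helper may_pull() models only the length check, not the buffer
      linearization the kernel call also performs. The header layout
      (version, header size and message size fields in word 0) and the
      size limits are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MIN_H_SIZE 24u  /* assumed minimum header size in bytes */
#define SK_MAX_H_SIZE 60u  /* assumed maximum header size in bytes */
#define SK_VERSION     2u  /* assumed protocol version */

/* Stand-in for pskb_may_pull(): are at least 'n' bytes available? */
static bool may_pull(size_t buf_len, size_t n)
{
	return n <= buf_len;
}

/* Read header word 0, which is in network byte order. */
static uint32_t hdr_word0(const uint8_t *b)
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

static bool msg_validate_sketch(const uint8_t *buf, size_t len)
{
	uint32_t w0, hsz, msz;

	if (!may_pull(len, SK_MIN_H_SIZE))
		return false;              /* too short to hold any header */
	w0 = hdr_word0(buf);
	hsz = ((w0 >> 21) & 0xfu) * 4u;    /* header size, 4-byte units */
	if (hsz < SK_MIN_H_SIZE || hsz > SK_MAX_H_SIZE)
		return false;
	if (!may_pull(len, hsz))
		return false;              /* declared header not fully present */
	if ((w0 >> 29) != SK_VERSION)
		return false;
	msz = w0 & 0x1ffffu;               /* total message size */
	return msz >= hsz && msz <= len;
}
```

      Each field is read only after the bytes holding it are known to be
      present, so a truncated or malformed buffer is rejected safely.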
    • tipc: add framework for node capabilities exchange · 7764d6e8
      Jon Paul Maloy authored
      The TIPC protocol spec has defined a 13 bit capability bitmap in
      the neighbor discovery header, as a means to maintain compatibility
      between different code and protocol generations. Until now this field
      has been unused.
      
      We now introduce the basic framework for exchanging capabilities
      between nodes at first contact. After exchange, a peer node's
      capabilities are stored as a 16 bit bitmap in struct tipc_node.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
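      Storing a 13-bit header field in a 16-bit node member can be
      sketched as follows. The position of the field inside the header
      word (SK_CAPS_SHIFT) and all names are assumptions for illustration;
      only the 13-bit and 16-bit widths come from the text above.

```c
#include <stdint.h>

#define SK_CAPS_MASK  0x1fffu  /* 13 bits */
#define SK_CAPS_SHIFT 0        /* hypothetical position in the header word */

struct node_sketch {
	uint32_t addr;
	uint16_t capabilities;  /* peer capabilities learned at first contact */
};

/* Extract the capability bitmap from a discovery header word. */
static uint16_t hdr_caps(uint32_t word)
{
	return (uint16_t)((word >> SK_CAPS_SHIFT) & SK_CAPS_MASK);
}

/* Write the capability bitmap, leaving the other header bits alone. */
static uint32_t hdr_set_caps(uint32_t word, uint16_t caps)
{
	word &= ~((uint32_t)SK_CAPS_MASK << SK_CAPS_SHIFT);
	return word | (((uint32_t)caps & SK_CAPS_MASK) << SK_CAPS_SHIFT);
}

/* Does the peer advertise every bit in 'cap'? */
static int node_supports(const struct node_sketch *n, uint16_t cap)
{
	return (n->capabilities & cap) == cap;
}
```

      A sender that sets bits outside the 13-bit field has them masked
      off, so unknown bits never leak into neighboring header fields.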
  19. 06 Mar, 2015 1 commit
  20. 28 Feb, 2015 1 commit
  21. 06 Feb, 2015 4 commits
    • tipc: eliminate race condition at multicast reception · cb1b7280
      Jon Paul Maloy authored
      In a previous commit in this series we resolved a race problem during
      unicast message reception.
      
      Here, we resolve the same problem at multicast reception. We apply the
      same technique: an input queue serializing the delivery of arriving
      buffers. The main difference is that here we do it in two steps.
      First, the broadcast link feeds arriving buffers into the tail of an
      arrival queue, whose head is consumed at the socket level, and where
      destination lookup is performed. Second, if the lookup is successful,
      the resulting buffer clones are fed into a second queue, the input
      queue. This queue is consumed at reception in the socket just like
      in the unicast case. Both queues are protected by the same lock, the
      one of the input queue.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: resolve race problem at unicast message reception · c637c103
      Jon Paul Maloy authored
      TIPC handles message cardinality and sequencing at the link layer,
      before passing messages upwards to the destination sockets. During the
      upcall from link to socket no locks are held. It is therefore possible,
      and we see it happen occasionally, that messages arriving in different
      threads and delivered in sequence still bypass each other before they
      reach the destination socket. This must not happen, since it violates
      the sequentiality guarantee.
      
      We solve this by adding a new input buffer queue to the link structure.
      Arriving messages are added safely to the tail of that queue by the
      link, while the head of the queue is consumed, also safely, by the
      receiving socket. Sequentiality is secured per socket by only allowing
      buffers to be dequeued inside the socket lock. Since there may be multiple
      simultaneous readers of the queue, we use a 'filter' parameter to reduce
      the risk that they peek the same buffer from the queue, hence also
      reducing the risk of contention on the receiving socket locks.
      
      This solves the sequentiality problem, and seems to cause no measurable
      performance degradation.
      
      A nice side effect of this change is that lock handling in the functions
      tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
      will enable future simplifications of those functions.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: split up function tipc_msg_eval() · e3a77561
      Jon Paul Maloy authored
      The function tipc_msg_eval() is in reality doing two related, but
      different tasks. First it tries to find a new destination for named
      messages, in case there was no first lookup, or if the first lookup
      failed. Second, it does what its name suggests, evaluating the validity
      of the message and its destination, and returning an appropriate error
      code depending on the result.
      
      This is confusing, and in this commit we choose to break it up into two
      functions. A new function, tipc_msg_lookup_dest(), first attempts to find
      a new destination, if the message is of the right type. If this lookup
      fails, or if the message should not be subject to a second lookup, the
      already existing tipc_msg_reverse() is called. This function
      prepares the message for rejection, if applicable.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: reduce usage of context info in socket and link · c5898636
      Jon Paul Maloy authored
      The most common usage of namespace information is when we fetch the
      own node address from the net structure. This leads to a lot of
      passing around of a parameter of type 'struct net *' between
      functions just to make them able to obtain this address.
      
      However, in many cases this is unnecessary. The own node address
      is readily available as a member of both struct tipc_sock and
      tipc_link, and can be fetched from there instead.
      The fact that the vast majority of functions in socket.c and link.c
      already maintain a pointer to their respective base structures
      makes this option even more compelling.
      
      In this commit, we introduce the inline functions tsk_own_node()
      and link_own_node() to make it easy for functions to fetch the node
      address from those structs instead of having to pass along and
      dereference the namespace struct.
      
      In particular, we make calls to the msg_xx() functions in msg.{h,c}
      context independent by directly passing them the own node address
      as parameter when needed. Those functions should be regarded as
      leaves in the code dependency tree, and it is hence desirable to
      keep them namespace unaware.
      
      Apart from a potential positive effect on cache behavior, these
      changes make it easier to introduce the changes that will follow
      later in this series.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
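      The accessor pattern can be sketched as follows. The structs are
      minimal stand-ins and every name ending in "_sketch" is hypothetical;
      the point is only that the leaf function receives the address as a
      plain value and never sees a namespace pointer.

```c
#include <stdint.h>

/* Minimal stand-ins for the socket and message header structures. */
struct sock_sketch {
	uint32_t own_node;  /* own node address, cached in the socket */
};

struct msg_hdr_sketch {
	uint32_t prevnode;
	uint32_t orignode;
};

/* Accessor in the spirit of tsk_own_node(): no namespace lookup. */
static inline uint32_t sock_own_node_sketch(const struct sock_sketch *tsk)
{
	return tsk->own_node;
}

/* Leaf function: takes the address by value, so it stays unaware of
 * namespaces and can be called from any context. */
static void msg_init_sketch(struct msg_hdr_sketch *m, uint32_t own_node)
{
	m->prevnode = own_node;
	m->orignode = own_node;
}
```

      A caller in socket code would write
      msg_init_sketch(&hdr, sock_own_node_sketch(tsk)) instead of passing
      a 'struct net *' down the call chain.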
  22. 13 Jan, 2015 3 commits