1. 27 3月, 2018 1 次提交
  2. 24 3月, 2018 3 次提交
    • J
      tipc: handle collisions of 32-bit node address hash values · 25b0b9c4
      Jon Maloy 提交于
      When a 32-bit node address is generated from a 128-bit identifier,
      there is a risk of collisions which must be discovered and handled.
      
      We do this as follows:
      - We don't apply the generated address immediately to the node, but do
        instead initiate a 1 sec trial period to allow other cluster members
        to discover and handle such collisions.
      
      - During the trial period the node periodically sends out a new type
        of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
        to all the other nodes in the cluster.
      
      - When a node is receiving such a message, it must check that the
        presented 32-bit identifier either is unused, or was used by the very
        same peer in a previous session. In both cases it accepts the request
        by not responding to it.
      
      - If it finds that the same node has been up before using a different
        address, it responds with a DSC_TRIAL_FAIL_MSG containing that
        address.
      
      - If it finds that the address has already been taken by some other
        node, it generates a new, unused address and returns it to the
        requester.
      
      - During the trial period the requesting node must always be prepared
        to accept a failure message, i.e., a message where a peer suggests a
        different (or equal)  address to the one tried. In those cases it
        must apply the suggested value as trial address and restart the trial
        period.
      
      This algorithm ensures that in the vast majority of cases a node will
      have the same address before and after a reboot. If a legacy user
      configures the address explicitly, there will be no trial period and
      messages, so this protocol addition is completely backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25b0b9c4
    • J
      tipc: add 128-bit node identifier · d50ccc2d
      Jon Maloy 提交于
      We add a 128-bit node identity, as an alternative to the currently used
      32-bit node address.
      
      For the sake of compatibility and to minimize message header changes
      we retain the existing 32-bit address field. When not set explicitly by
      the user, this field will be filled with a hash value generated from the
      much longer node identity, and be used as a shorthand value for the
      latter.
      
      We permit either the address or the identity to be set by configuration,
      but not both, so when the address value is set by a legacy user the
      corresponding 128-bit node identity is generated based on the that value.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d50ccc2d
    • J
      tipc: remove restrictions on node address values · 20263641
      Jon Maloy 提交于
      Nominally, TIPC organizes network nodes into a three-level network
      hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
      hierarchy is reflected in the node address format, - it is sub-divided
      into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id.
      
      However, the 'zone' and 'cluster' levels have in reality never been
      fully implemented,and never will be. The result of this has been
      that the first 20 bits the node identity structure have been wasted,
      and the usable node identity range within a cluster has been limited
      to 12 bits. This is starting to become a problem.
      
      In the following commits, we will need to be able to connect between
      nodes which are using the whole 32-bit value space of the node address.
      We therefore remove the restrictions on which values can be assigned
      to node identity, -it is from now on only a 32-bit integer with no
      assumed internal structure.
      
      Isolation between clusters is now achieved only by setting different
      values for the 'network id' field used during neighbor discovery, in
      practice leading to the latter becoming the new cluster identity.
      
      The rules for accepting discovery requests/responses from neighboring
      nodes now become:
      
      - If the user is using legacy address format on both peers, reception
        of discovery messages is subject to the legacy lookup domain check
        in addition to the cluster id check.
      
      - Otherwise, the discovery request/response is always accepted, provided
        both peers have the same network id.
      
      This secures backwards compatibility for users who have been using zone
      or cluster identities as cluster separators, instead of the intended
      'network id'.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20263641
  3. 15 2月, 2018 1 次提交
  4. 16 1月, 2018 1 次提交
  5. 16 11月, 2017 1 次提交
    • J
      tipc: enforce valid ratio between skb truesize and contents · d618d09a
      Jon Maloy 提交于
      The socket level flow control is based on the assumption that incoming
      buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
      where the latter value is rounded off upwards to the nearest 1k number.
      This does empirically hold true for the device drivers we know, but we
      cannot trust that it will always be so, e.g., in a system with jumbo
      frames and very small packets.
      
      We now introduce a check for this condition at packet arrival, and if
      we find it to be false, we copy the packet to a new, smaller buffer,
      where the condition will be true. We expect this to affect only a small
      fraction of all incoming packets, if at all.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d618d09a
  6. 01 11月, 2017 1 次提交
  7. 13 10月, 2017 2 次提交
  8. 25 8月, 2017 2 次提交
  9. 22 8月, 2017 1 次提交
    • J
      tipc: don't reset stale broadcast send link · 40501f90
      Jon Paul Maloy 提交于
      When the broadcast send link after 100 attempts has failed to
      transfer a packet to all peers, we consider it stale, and reset
      it. Thereafter it needs to re-synchronize with the peers, something
      currently done by just resetting and re-establishing all links to
      all peers. This has turned out to be overkill, with potentially
      unwanted consequences for the remaining cluster.
      
      A closer analysis reveals that this can be done much simpler. When
      this kind of failure happens, for reasons that may lie outside the
      TIPC protocol, it is typically only one peer which is failing to
      receive and acknowledge packets. It is hence sufficient to identify
      and reset the links only to that peer to resolve the situation, without
      having to reset the broadcast link at all. This solution entails a much
      lower risk of negative consequences for the own node as well as for
      the overall cluster.
      
      We implement this change in this commit.
      Reviewed-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40501f90
  10. 10 8月, 2017 1 次提交
    • J
      tipc: remove premature ESTABLISH FSM event at link synchronization · ed43594a
      Jon Paul Maloy 提交于
      When a link between two nodes come up, both endpoints will initially
      send out a STATE message to the peer, to increase the probability that
      the peer endpoint also is up when the first traffic message arrives.
      Thereafter, if the establishing link is the second link between two
      nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
      helping the peer to perform initial synchronization between the two
      links.
      
      However, the initial STATE message may be lost, in which case the SYNCH
      message will be the first one arriving at the peer. This should also
      work, as the SYNCH message itself will be used to take up the link
      endpoint before  initializing synchronization.
      
      Unfortunately the code for this case is broken. Currently, the link is
      brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
      arrives, whereupon __tipc_node_link_up() is called to distribute the
      link slots and take the link into traffic. But, __tipc_node_link_up() is
      itself starting with a test for whether the link is up, and if true,
      returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
      is unnecessary, since tipc_node_link_up() is itself issuing such an
      event, but also harmful, since it inhibits tipc_node_link_up() to
      perform the test of its tasks, and the link endpoint in question hence
      is never taken into traffic.
      
      This problem has been exposed when we set up dual links between pre-
      and post-4.4 kernels, because the former ones don't send out the
      initial STATE message described above.
      
      We fix this by removing the unnecessary event call.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed43594a
  11. 25 4月, 2017 1 次提交
  12. 14 4月, 2017 2 次提交
    • J
      netlink: pass extended ACK struct where available · fe52145f
      Johannes Berg 提交于
      This is an add-on to the previous patch that passes the extended ACK
      structure where it's already available by existing genl_info or extack
      function arguments.
      
      This was done with this spatch (with some manual adjustment of
      indentation):
      
      @@
      expression A, B, C, D, E;
      identifier fn, info;
      @@
      fn(..., struct genl_info *info, ...) {
      ...
      -nlmsg_parse(A, B, C, D, E, NULL)
      +nlmsg_parse(A, B, C, D, E, info->extack)
      ...
      }
      
      @@
      expression A, B, C, D, E;
      identifier fn, info;
      @@
      fn(..., struct genl_info *info, ...) {
      <...
      -nla_parse_nested(A, B, C, D, NULL)
      +nla_parse_nested(A, B, C, D, info->extack)
      ...>
      }
      
      @@
      expression A, B, C, D, E;
      identifier fn, extack;
      @@
      fn(..., struct netlink_ext_ack *extack, ...) {
      <...
      -nlmsg_parse(A, B, C, D, E, NULL)
      +nlmsg_parse(A, B, C, D, E, extack)
      ...>
      }
      
      @@
      expression A, B, C, D, E;
      identifier fn, extack;
      @@
      fn(..., struct netlink_ext_ack *extack, ...) {
      <...
      -nla_parse(A, B, C, D, E, NULL)
      +nla_parse(A, B, C, D, E, extack)
      ...>
      }
      
      @@
      expression A, B, C, D, E;
      identifier fn, extack;
      @@
      fn(..., struct netlink_ext_ack *extack, ...) {
      ...
      -nlmsg_parse(A, B, C, D, E, NULL)
      +nlmsg_parse(A, B, C, D, E, extack)
      ...
      }
      
      @@
      expression A, B, C, D;
      identifier fn, extack;
      @@
      fn(..., struct netlink_ext_ack *extack, ...) {
      <...
      -nla_parse_nested(A, B, C, D, NULL)
      +nla_parse_nested(A, B, C, D, extack)
      ...>
      }
      
      @@
      expression A, B, C, D;
      identifier fn, extack;
      @@
      fn(..., struct netlink_ext_ack *extack, ...) {
      <...
      -nlmsg_validate(A, B, C, D, NULL)
      +nlmsg_validate(A, B, C, D, extack)
      ...>
      }
      
      @@
      expression A, B, C, D;
      identifier fn, extack;
      @@
      fn(..., struct netlink_ext_ack *extack, ...) {
      <...
      -nla_validate(A, B, C, D, NULL)
      +nla_validate(A, B, C, D, extack)
      ...>
      }
      
      @@
      expression A, B, C;
      identifier fn, extack;
      @@
      fn(..., struct netlink_ext_ack *extack, ...) {
      <...
      -nla_validate_nested(A, B, C, NULL)
      +nla_validate_nested(A, B, C, extack)
      ...>
      }
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe52145f
    • J
      netlink: pass extended ACK struct to parsing functions · fceb6435
      Johannes Berg 提交于
      Pass the new extended ACK reporting struct to all of the generic
      netlink parsing functions. For now, pass NULL in almost all callers
      (except for some in the core.)
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fceb6435
  13. 25 2月, 2017 1 次提交
  14. 25 1月, 2017 1 次提交
    • P
      tipc: fix nametbl_lock soft lockup at node/link events · 93f955aa
      Parthasarathy Bhuvaragan 提交于
      We trigger a soft lockup as we grab nametbl_lock twice if the node
      has a pending node up/down or link up/down event while:
      - we process an incoming named message in tipc_named_rcv() and
        perform an tipc_update_nametbl().
      - we have pending backlog items in the name distributor queue
        during a nametable update using tipc_nametbl_publish() or
        tipc_nametbl_withdraw().
      
      The following are the call chain associated:
      tipc_named_rcv() Grabs nametbl_lock
         tipc_update_nametbl() (publish/withdraw)
           tipc_node_subscribe()/unsubscribe()
             tipc_node_write_unlock()
                << lockup occurs if an outstanding node/link event
                   exits, as we grabs nametbl_lock again >>
      
      tipc_nametbl_withdraw() Grab nametbl_lock
        tipc_named_process_backlog()
          tipc_update_nametbl()
            << rest as above >>
      
      The function tipc_node_write_unlock(), in addition to releasing the
      lock processes the outstanding node/link up/down events. To do this,
      we need to grab the nametbl_lock again leading to the lockup.
      
      In this commit we fix the soft lockup by introducing a fast variant of
      node_unlock(), where we just release the lock. We adapt the
      node_subscribe()/node_unsubscribe() to use the fast variants.
      Reported-and-Tested-by: NJohn Thompson <thompa.atl@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93f955aa
  15. 21 1月, 2017 1 次提交
  16. 04 1月, 2017 1 次提交
    • J
      tipc: reduce risk of user starvation during link congestion · 365ad353
      Jon Paul Maloy 提交于
      The socket code currently handles link congestion by either blocking
      and trying to send again when the congestion has abated, or just
      returning to the user with -EAGAIN and let him re-try later.
      
      This mechanism is prone to starvation, because the wakeup algorithm is
      non-atomic. During the time the link issues a wakeup signal, until the
      socket wakes up and re-attempts sending, other senders may have come
      in between and occupied the free buffer space in the link. This in turn
      may lead to a socket having to make many send attempts before it is
      successful. In extremely loaded systems we have observed latency times
      of several seconds before a low-priority socket is able to send out a
      message.
      
      In this commit, we simplify this mechanism and reduce the risk of the
      described scenario happening. When a message is attempted sent via a
      congested link, we now let it be added to the link's backlog queue
      anyway, thus permitting an oversubscription of one message per source
      socket. We still create a wakeup item and return an error code, hence
      instructing the sender to block or stop sending. Only when enough space
      has been freed up in the link's backlog queue do we issue a wakeup event
      that allows the sender to continue with the next message, if any.
      
      The fact that a socket now can consider a message sent even when the
      link returns a congestion code means that the sending socket code can
      be simplified. Also, since this is a good opportunity to get rid of the
      obsolete 'mtu change' condition in the three socket send functions, we
      now choose to refactor those functions completely.
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      365ad353
  17. 30 10月, 2016 1 次提交
    • J
      tipc: fix broadcast link synchronization problem · 06bd2b1e
      Jon Paul Maloy 提交于
      In commit 2d18ac4b ("tipc: extend broadcast link initialization
      criteria") we tried to fix a problem with the initial synchronization
      of broadcast link acknowledge values. Unfortunately that solution is
      not sufficient to solve the issue.
      
      We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
      non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
      initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
      broadcast acknowledge numbers, leading to premature opening of the
      broadcast link. When the bypassed packets finally arrive, they are
      inadvertently accepted, and the already correctly initialized
      acknowledge number in the broadcast receive link is overwritten by
      the invalid (zero) value of the said packets. After this the broadcast
      link goes stale.
      
      We now fix this by marking the packets where we know the acknowledge
      value is or may be invalid, and then ignoring the acks from those.
      
      To this purpose, we claim an unused bit in the header to indicate that
      the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
      synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
      packets, plus those LINK_PROTOCOL packets sent out before the broadcast
      links are fully synchronized.
      
      This minor protocol update is fully backwards compatible.
      Reported-by: NJohn Thompson <thompa.atl@gmail.com>
      Tested-by: NJohn Thompson <thompa.atl@gmail.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06bd2b1e
  18. 03 9月, 2016 1 次提交
    • J
      tipc: transfer broadcast nacks in link state messages · 02d11ca2
      Jon Paul Maloy 提交于
      When we send broadcasts in clusters of more 70-80 nodes, we sometimes
      see the broadcast link resetting because of an excessive number of
      retransmissions. This is caused by a combination of two factors:
      
      1) A 'NACK crunch", where loss of broadcast packets is discovered
         and NACK'ed by several nodes simultaneously, leading to multiple
         redundant broadcast retransmissions.
      
      2) The fact that the NACKS as such also are sent as broadcast, leading
         to excessive load and packet loss on the transmitting switch/bridge.
      
      This commit deals with the latter problem, by moving sending of
      broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
      to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
      bits in word 8 of the said message for this purpose, and introduce a
      new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
      backwards compatible.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02d11ca2
  19. 19 8月, 2016 1 次提交
  20. 27 7月, 2016 3 次提交
  21. 12 7月, 2016 1 次提交
  22. 16 6月, 2016 1 次提交
    • J
      tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy 提交于
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the member of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying  unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35c55c98
  23. 09 6月, 2016 2 次提交
    • J
      tipc: change node timer unit from jiffies to ms · 5ca509fc
      Jon Paul Maloy 提交于
      The node keepalive interval is recalculated at each timer expiration
      to catch any changes in the link tolerance, and stored in a field in
      struct tipc_node. We use jiffies as unit for the stored value.
      
      This is suboptimal, because it makes the calculation unnecessary
      complex, including two unit conversions. The conversions also lead to
      a rounding error that causes the link "abort limit" to be 3 in the
      normal case, instead of 4, as intended. This again leads to unnecessary
      link resets when the network is pushed close to its limit, e.g., in an
      environment with hundreds of nodes or namesapces.
      
      In this commit, we do instead let the keepalive value be calculated and
      stored in milliseconds, so that there is only one conversion and the
      rounding error is eliminated.
      
      We also remove a redundant "keepalive" field in struct tipc_link. This
      is remnant from the previous implementation.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ca509fc
    • J
      tipc: correct error in node fsm · c4282ca7
      Jon Paul Maloy 提交于
      commit 88e8ac70 ("tipc: reduce transmission rate of reset messages
      when link is down") revealed a flaw in the node FSM, as defined in
      the log of commit 66996b6c ("tipc: extend node FSM").
      
      We see the following scenario:
      1: Node B receives a RESET message from node A before its link endpoint
         is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
         event will not change the node FSM state, but the (distinct) link FSM
         will move to state RESETTING.
      2: As an effect of the previous event, the local endpoint on B will
         declare node A lost, and post the event SELF_DOWN to the its node
         FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
         that no messages will be accepted from A until it receives another
         RESET message that confirms that A's endpoint has been reset. This
         is  wasteful, since we know this as a fact already from the first
         received RESET, but worse is that the link instance's FSM has not
         wasted this information, but instead moved on to state ESTABLISHING,
         meaning that it repeatedly sends out ACTIVATE messages to the reset
         peer A.
      3: Node A will receive one of the ACTIVATE messages, move its link FSM
         to state ESTABLISHED, and start repeatedly sending out STATE messages
         to node B.
      4: Node B will consistently drop these messages, since it can only accept
         accept a RESET according to its node FSM.
      5: After four lost STATE messages node A will reset its link and start
         repeatedly sending out RESET messages to B.
      6: Because of the reduced send rate for RESET messages, it is very
         likely that A will receive an ACTIVATE (which is sent out at a much
         higher frequency) before it gets the chance to send a RESET, and A
         may hence quickly move back to state ESTABLISHED and continue sending
         out STATE messages, which will again be dropped by B.
      7: GOTO 5.
      8: After having repeated the cycle 5-7 a number of times, node A will
         by chance get in between with sending a RESET, and the situation is
         resolved.
      
      Unfortunately, we have seen that it may take a substantial amount of
      time before this vicious loop is broken, sometimes in the order of
      minutes.
      
      We correct this by making a small correction to the node FSM: When a
      node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
      moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
      SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
      need to wait for RESET confirmation from of an endpoint that we alread
      know has been reset. It also means that node B in the scenario above
      will not be dropping incoming STATE messages, and the link can come up
      immediately.
      
      Finally, a symmetry comparison reveals that the  FSM has a similar
      error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
      Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
      to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
      of this logical error, we choose fix this one, too.
      
      The node FSM looks as follows after those changes:
      
                                 +----------------------------------------+
                                 |                           PEER_DOWN_EVT|
                                 |                                        |
        +------------------------+----------------+                       |
        |SELF_DOWN_EVT           |                |                       |
        |                        |                |                       |
        |              +-----------+          +-----------+               |
        |              |NODE_      |          |NODE_      |               |
        |   +----------|FAILINGOVER|<---------|SYNCHING   |-----------+   |
        |   |SELF_     +-----------+ FAILOVER_+-----------+   PEER_   |   |
        |   |DOWN_EVT   |          A BEGIN_EVT  A         |   DOWN_EVT|   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |FAILOVER_ |FAILOVER_   |SYNCH_   |SYNCH_     |   |
        |   |           |END_EVT   |BEGIN_EVT   |BEGIN_EVT|END_EVT    |   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |         +--------------+        |           |   |
        |   |           +-------->|   SELF_UP_   |<-------+           |   |
        |   |   +-----------------|   PEER_UP    |----------------+   |   |
        |   |   |SELF_DOWN_EVT    +--------------+   PEER_DOWN_EVT|   |   |
        |   |   |                    A        A                   |   |   |
        |   |   |                    |        |                   |   |   |
        |   |   |         PEER_UP_EVT|        |SELF_UP_EVT        |   |   |
        |   |   |                    |        |                   |   |   |
        V   V   V                    |        |                   V   V   V
      +------------+       +-----------+    +-----------+       +------------+
      |SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
      |PEER_LEAVING|       |PEER_COMING|    |SELF_COMING|       |SELF_LEAVING|
      +------------+       +-----------+    +-----------+       +------------+
             |               |       A        A       |                |
             |               |       |        |       |                |
             |       SELF_   |       |SELF_   |PEER_  |PEER_           |
             |       DOWN_EVT|       |UP_EVT  |UP_EVT |DOWN_EVT        |
             |               |       |        |       |                |
             |               |       |        |       |                |
             |               |    +--------------+    |                |
             |PEER_DOWN_EVT  +--->|  SELF_DOWN_  |<---+   SELF_DOWN_EVT|
             +------------------->|  PEER_DOWN   |<--------------------+
                                  +--------------+
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4282ca7
  24. 13 5月, 2016 1 次提交
    • J
      tipc: eliminate risk of double link_up events · e7142c34
      Jon Paul Maloy 提交于
      When an ACTIVATE or data packet is received in a link in state
      ESTABLISHING, the link does not immediately change state to
      ESTABLISHED, but does instead return a LINK_UP event to the caller,
      which will execute the state change in a different lock context.
      
      This non-atomic approach incurs a low risk that we may have two
      LINK_UP events pending simultaneously for the same link, resulting
      in the final part of the setup procedure being executed twice. The
      only potential harm caused by this it that we may see two LINK_UP
      events issued to subsribers of the topology server, something that
      may cause confusion.
      
      This commit eliminates this risk by checking if the link is already
      up before proceeding with the second half of the setup.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7142c34
  25. 04 5月, 2016 1 次提交
  26. 02 5月, 2016 2 次提交
    • H
      tipc: only process unicast on intended node · efe79050
      Hamish Martin 提交于
      We have observed complete lock up of broadcast-link transmission due to
      unacknowledged packets never being removed from the 'transmq' queue. This
      is traced to nodes having their ack field set beyond the sequence number
      of packets that have actually been transmitted to them.
      Consider an example where node 1 has sent 10 packets to node 2 on a
      link and node 3 has sent 20 packets to node 2 on another link. We
      see examples of an ack from node 2 destined for node 3 being treated as
      an ack from node 2 at node 1. This leads to the ack on the node 1 to node
      2 link being increased to 20 even though we have only sent 10 packets.
      When node 1 does get around to sending further packets, none of the
      packets with sequence numbers less than 21 are actually removed from the
      transmq.
      To resolve this we reinstate some code lost in commit d999297c ("tipc:
      reduce locking scope during packet reception") which ensures that only
      messages destined for the receiving node are processed by that node. This
      prevents the sequence numbers from getting out of sync and resolves the
      packet leakage, thereby resolving the broadcast-link transmission
      lock-ups we observed.
      
      While we are aware that this change only patches over a root problem that
      we still haven't identified, this is a sanity test that it is always
      legitimate to do. It will remain in the code even after we identify and
      fix the real problem.
      Reviewed-by: NChris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: NJohn Thompson <john.thompson@alliedtelesis.co.nz>
      Signed-off-by: NHamish Martin <hamish.martin@alliedtelesis.co.nz>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe79050
    • J
      tipc: set 'active' state correctly for first established link · def22c47
      Jon Paul Maloy 提交于
      When we are displaying statistics for the first link established between
      two peers, it will always be presented as STANDBY although it in reality
      is ACTIVE.
      
      This happens because we forget to set the 'active' flag in the link
      instance at the moment it is established. Although this is a bug, it only
      has impact on the presentation view of the link, not on its actual
      functionality.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      def22c47
  27. 16 4月, 2016 2 次提交
  28. 08 3月, 2016 1 次提交
  29. 07 3月, 2016 1 次提交
  30. 26 2月, 2016 1 次提交
    • J
      tipc: fix crash during node removal · d25a0125
      Jon Paul Maloy 提交于
      When the TIPC module is unloaded, we have identified a race condition
      that allows a node reference counter to go to zero and the node instance
      being freed before the node timer is finished with accessing it. This
      leads to occasional crashes, especially in multi-namespace environments.
      
      The scenario goes as follows:
      
      CPU0:(node_stop)                       CPU1:(node_timeout)  // ref == 2
      
      1:                                          if(!mod_timer())
      2: if (del_timer())
      3:   tipc_node_put()                                        // ref -> 1
      4: tipc_node_put()                                          // ref -> 0
      5:   kfree_rcu(node);
      6:                                               tipc_node_get(node)
      7:                                               // BOOM!
      
      We now clean up this functionality as follows:
      
      1) We remove the node pointer from the node lookup table before we
         attempt deactivating the timer. This way, we reduce the risk that
         tipc_node_find() may obtain a valid pointer to an instance marked
         for deletion; a harmless but undesirable situation.
      
      2) We use del_timer_sync() instead of del_timer() to safely deactivate
         the node timer without any risk that it might be reactivated by the
         timeout handler. There is no risk of deadlock here, since the two
         functions never touch the same spinlocks.
      
      3: We remove a pointless tipc_node_get() + tipc_node_put() from the
         timeout handler.
      Reported-by: NZhijiang Hu <huzhijiang@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d25a0125