1. 13 Oct, 2017 · 2 commits
    • tipc: guarantee that group broadcast doesn't bypass group unicast · 2f487712
      Committed by Jon Maloy
      We need a mechanism guaranteeing that group unicasts sent out from a
      socket are not bypassed by later sent broadcasts from the same socket.
      We do this as follows:
      
      - Each time a unicast is sent, we set the broadcast method for the
        socket to "replicast" and "mandatory". This forces the first
        subsequent broadcast message to follow the same network and data path
        as the preceding unicast to a destination, hence preventing it from
        overtaking the latter.
      
      - In order to make the 'same data path' statement above true, we let
        group unicasts pass through the multicast link input queue, instead
        of as previously through the unicast link input queue.
      
      - In the first broadcast following a unicast, we set a new header flag,
        requiring all recipients to immediately acknowledge its reception.
      
      - During the period before all the expected acknowledges are received,
        the socket refuses to accept any more broadcast attempts, i.e., by
        blocking or returning EAGAIN. This period should typically not be
        longer than a few microseconds.
      
      - When all acknowledges have been received, the sending socket will
        open up for subsequent broadcasts, this time giving the link layer
        freedom to select the best transmission method itself.
      
      - The forced and/or abrupt transmission method changes described above
        may lead to broadcasts arriving out of order to the recipients. We
        remedy this by introducing code that checks and if necessary
        re-orders such messages at the receiving end.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
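The sender-side rules in the commit above can be sketched as a small state machine. This is a toy model in Python, not the kernel code: the class, method names and return values are hypothetical, and only the ordering rules come from the commit message.

```python
class GroupSender:
    """Toy model of the sender-side rules; all names are hypothetical."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.force_replicast = False   # set by every unicast
        self.pending_acks = set()      # peers that still owe an ack

    def send_unicast(self, dest):
        # A unicast forces the next broadcast onto mandatory replicast.
        self.force_replicast = True
        return ("unicast", dest)

    def send_broadcast(self):
        if self.pending_acks:
            # Acks still outstanding: the caller blocks or gets EAGAIN.
            return "EAGAIN"
        if self.force_replicast:
            # First broadcast after a unicast: replicast, ack requested.
            self.force_replicast = False
            self.pending_acks = set(self.peers)
            return ("replicast", "ack_required")
        # Otherwise the link layer is free to choose the method.
        return ("link_layer_choice", "no_ack")

    def receive_ack(self, peer):
        self.pending_acks.discard(peer)
```

In this model a second broadcast is refused until every peer has acknowledged the first one, after which the method choice goes back to the link layer.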
    • tipc: introduce communication groups · 75da2163
      Committed by Jon Maloy
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able to keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as a logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating whether the creator of the socket wants to
      receive a copy of the broadcast or multicast messages it sends via the
      socket, and whether it wants to be eligible as destination for its own
      anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be a member of one group at a time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
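The closed-group and loopback semantics described above can be illustrated with a toy model. It does not use the real socket API: the Group class and its methods are hypothetical, and the 'scope' parameter is left out for brevity.

```python
class Group:
    """Toy model of a closed communication group (hypothetical names)."""

    def __init__(self, group_type):
        self.group_type = group_type      # 'type' identifies the group
        self.members = {}                 # socket name -> (instance, loopback)

    def join(self, sock, instance, loopback=False):
        # 'loopback' stands in for the single member flag in the commit.
        self.members[sock] = (instance, loopback)

    def broadcast(self, sender, payload):
        # The group is closed: non-members cannot send into it.
        if sender not in self.members:
            raise PermissionError("sender has not joined the group")
        wants_copy = self.members[sender][1]
        # Deliver to every member; to the sender only if it asked for a copy.
        return {m: payload for m in self.members
                if m != sender or wants_copy}
```

A sender that joined with the loopback flag appears in its own delivery set; a socket that never joined gets an error instead of a delivery.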
  2. 22 Aug, 2017 · 1 commit
    • tipc: don't reset stale broadcast send link · 40501f90
      Committed by Jon Paul Maloy
      When the broadcast send link after 100 attempts has failed to
      transfer a packet to all peers, we consider it stale, and reset
      it. Thereafter it needs to re-synchronize with the peers, something
      currently done by just resetting and re-establishing all links to
      all peers. This has turned out to be overkill, with potentially
      unwanted consequences for the remaining cluster.
      
      A closer analysis reveals that this can be done much more simply. When
      this kind of failure happens, for reasons that may lie outside the
      TIPC protocol, it is typically only one peer which is failing to
      receive and acknowledge packets. It is hence sufficient to identify
      and reset the links only to that peer to resolve the situation, without
      having to reset the broadcast link at all. This solution entails a much
      lower risk of negative consequences for the local node as well as for
      the overall cluster.
      
      We implement this change in this commit.
      Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
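One way to picture "identify the failing peer" is to compare each peer's last acknowledged broadcast sequence number against the packet that is stuck. The sketch below assumes that approach; the function name and data layout are hypothetical, and sequence number wraparound is ignored.

```python
def find_stale_peers(acked, stuck_seqno):
    """Return the peers that have not yet acknowledged `stuck_seqno`.

    `acked` maps a peer name to the highest broadcast sequence number
    that peer has acknowledged (wraparound is ignored in this sketch).
    """
    return sorted(p for p, ack in acked.items() if ack < stuck_seqno)


# Only n2 is lagging, so only the links to n2 would need a reset.
acked = {"n1": 120, "n2": 97, "n3": 121}
stale = find_stale_peers(acked, stuck_seqno=98)
```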
  3. 14 Apr, 2017 · 1 commit
  4. 21 Jan, 2017 · 2 commits
  5. 17 Jan, 2017 · 1 commit
  6. 04 Jan, 2017 · 1 commit
    • tipc: reduce risk of user starvation during link congestion · 365ad353
      Committed by Jon Paul Maloy
      The socket code currently handles link congestion by either blocking
      and trying to send again when the congestion has abated, or just
      returning to the user with -EAGAIN and letting the caller retry later.
      
      This mechanism is prone to starvation, because the wakeup algorithm is
      non-atomic. During the time the link issues a wakeup signal, until the
      socket wakes up and re-attempts sending, other senders may have come
      in between and occupied the free buffer space in the link. This in turn
      may lead to a socket having to make many send attempts before it is
      successful. In extremely loaded systems we have observed latency times
      of several seconds before a low-priority socket is able to send out a
      message.
      
      In this commit, we simplify this mechanism and reduce the risk of the
      described scenario happening. When a sender attempts to send a message
      via a congested link, we now let it be added to the link's backlog queue
      anyway, thus permitting an oversubscription of one message per source
      socket. We still create a wakeup item and return an error code, hence
      instructing the sender to block or stop sending. Only when enough space
      has been freed up in the link's backlog queue do we issue a wakeup event
      that allows the sender to continue with the next message, if any.
      
      The fact that a socket now can consider a message sent even when the
      link returns a congestion code means that the sending socket code can
      be simplified. Also, since this is a good opportunity to get rid of the
      obsolete 'mtu change' condition in the three socket send functions, we
      now choose to refactor those functions completely.
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 28 Nov, 2016 · 1 commit
    • tipc: fix link statistics counter errors · 95901122
      Committed by Jon Paul Maloy
      In commit e4bf4f76 ("tipc: simplify packet sequence number
      handling") we changed the internal representation of the packet
      sequence number counters from u32 to u16, reflecting what is really
      sent over the wire.
      
      Since then some link statistics counters have been displaying incorrect
      values, partially because the counters meant to be used as sequence
      number snapshots are now used as direct counters, stored as u32, and
      partially because some counter updates are just missing in the code.
      
      In this commit we correct this in two ways. First, we base the
      displayed packet sent/received values on direct counters instead
      of, as previously, on a calculated difference between the current
      sequence number and a snapshot. Second, we add the missing updates
      of the counters.
      
      This change is compatible with the current netlink API, and requires
      no changes to the user space tools.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
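As an illustration of why a direct counter is more robust than a sequence number difference once sequence numbers are 16 bits wide, consider the arithmetic below. It is not necessarily the exact failure fixed by the commit; the variable names and the packet count are made up for the example.

```python
U16 = 1 << 16   # 16-bit sequence numbers wrap at 65536


def sent_by_snapshot(seqno, snapshot):
    # Difference between the current 16-bit sequence number and a snapshot.
    return (seqno - snapshot) % U16


packets = 70000                        # packets actually sent
snapshot = 1                           # sequence number when stats were reset
seqno = (snapshot + packets) % U16     # what the 16-bit field now holds

by_difference = sent_by_snapshot(seqno, snapshot)
direct_counter = packets               # a separate u32 counter keeps counting
```

Here the difference wraps to 4464 while the direct counter still reads 70000.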
  8. 26 Nov, 2016 · 1 commit
    • tipc: fix compatibility bug in link monitoring · f7967556
      Committed by Jon Paul Maloy
      Commit 81729810 ("tipc: fix link priority propagation") introduced a
      compatibility problem between TIPC versions newer than Linux 4.6 and
      those older than Linux 4.4. In versions later than 4.4, link STATE
      messages only contain a non-zero link priority value when the sender
      wants the receiver to change its priority. This has the effect that the
      receiver resets itself in order to apply the new priority. This works
      well, and is consistent with the said commit.
      
      However, in versions older than 4.4 a valid link priority is present in
      all sent link STATE messages, leading to cyclic link establishment and
      reset on the 4.6+ node.
      
      We fix this by adding a test that the received value should not only
      be valid, but also differ from the current value in order to cause the
      receiving link endpoint to reset.
      Reported-by: Amar Nv <amar.nv005@gmail.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
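The added test can be written as a one-line predicate. The function name is hypothetical, and treating 0 as "no priority change requested" follows the commit's description of newer senders.

```python
def should_reset_for_priority(received_prio, current_prio):
    """Reset only if the received priority is valid AND actually differs."""
    # 0 is assumed to mean "no priority change requested".
    return received_prio != 0 and received_prio != current_prio
```

An older peer that repeats the current priority in every STATE message no longer triggers a reset, which breaks the establish/reset cycle.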
  9. 30 Oct, 2016 · 1 commit
    • tipc: fix broadcast link synchronization problem · 06bd2b1e
      Committed by Jon Paul Maloy
      In commit 2d18ac4b ("tipc: extend broadcast link initialization
      criteria") we tried to fix a problem with the initial synchronization
      of broadcast link acknowledge values. Unfortunately that solution is
      not sufficient to solve the issue.
      
      We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
      non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
      initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
      broadcast acknowledge numbers, leading to premature opening of the
      broadcast link. When the bypassed packets finally arrive, they are
      inadvertently accepted, and the already correctly initialized
      acknowledge number in the broadcast receive link is overwritten by
      the invalid (zero) value of the said packets. After this the broadcast
      link goes stale.
      
      We now fix this by marking the packets where we know the acknowledge
      value is or may be invalid, and then ignoring the acks from those.
      
      For this purpose, we claim an unused bit in the header to indicate that
      the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
      synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
      packets, plus those LINK_PROTOCOL packets sent out before the broadcast
      links are fully synchronized.
      
      This minor protocol update is fully backwards compatible.
      Reported-by: John Thompson <thompa.atl@gmail.com>
      Tested-by: John Thompson <thompa.atl@gmail.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
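The effect of the new header bit on the receiving side can be sketched like this. The class and parameter names are hypothetical; the real header layout is not shown.

```python
class BcastRcvLink:
    """Toy model of how the 'ack invalid' bit protects the stored ack."""

    def __init__(self):
        self.acked = 0

    def handle_ack(self, bc_ack, ack_invalid):
        if ack_invalid:
            # Sender marked the value as possibly invalid: ignore it.
            return
        self.acked = bc_ack


link = BcastRcvLink()
link.handle_ack(57, ack_invalid=False)   # correctly initialized value
link.handle_ack(0, ack_invalid=True)     # late, bypassed packet
```

The late packet carrying an invalid zero no longer overwrites the already initialized value.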
  10. 03 Sep, 2016 · 3 commits
    • tipc: send broadcast nack directly upon sequence gap detection · e0a05ebe
      Committed by Jon Paul Maloy
      Because of the risk of an excessive number of NACK messages and
      retransmissions, receivers have until now abstained from sending
      broadcast NACKS directly upon detection of a packet sequence number
      gap. We have instead relied on such gaps being detected by link
      protocol STATE message exchange, something that by necessity delays
      such detection and subsequent retransmissions.
      
      With the introduction of unicast NACK transmission and rate control
      of retransmissions we can now remove this limitation. We now allow
      receiving nodes to send NACKS immediately, while coordinating the
      permission to do so among the nodes in order to avoid NACK storms.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: rate limit broadcast retransmissions · 7c4a54b9
      Committed by Jon Paul Maloy
      As cluster sizes grow, so does the amount of identical or overlapping
      broadcast NACKs generated by the packet receivers. This often leads to
      'NACK crunches' resulting in huge numbers of redundant retransmissions
      of the same packet ranges.
      
      In this commit, we introduce rate control of broadcast retransmissions,
      so that a retransmitted range cannot be retransmitted again until after
      at least 10 ms. This reduces the frequency of duplicate, redundant
      retransmissions by an order of magnitude, while having a significant
      positive impact on overall throughput and scalability.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
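A minimal sketch of the 10 ms rule, assuming retransmissions are tracked per sequence number range. The class, its method and the injected clock value are hypothetical; only the 10 ms minimum interval comes from the commit message.

```python
RETRANS_MIN_INTERVAL_MS = 10   # minimum spacing stated in the commit


class RetransLimiter:
    def __init__(self):
        self.last_sent = {}    # (first, last) seqno range -> time in ms

    def may_retransmit(self, seq_range, now_ms):
        last = self.last_sent.get(seq_range)
        if last is not None and now_ms - last < RETRANS_MIN_INTERVAL_MS:
            return False       # same range was retransmitted too recently
        self.last_sent[seq_range] = now_ms
        return True
```

Duplicate NACKs for the same range that arrive within the interval are simply absorbed instead of causing another retransmission.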
    • tipc: transfer broadcast nacks in link state messages · 02d11ca2
      Committed by Jon Paul Maloy
      When we send broadcasts in clusters of more than 70-80 nodes, we sometimes
      see the broadcast link resetting because of an excessive number of
      retransmissions. This is caused by a combination of two factors:
      
      1) A 'NACK crunch', where loss of broadcast packets is discovered
         and NACK'ed by several nodes simultaneously, leading to multiple
         redundant broadcast retransmissions.
      
      2) The fact that the NACKS as such also are sent as broadcast, leading
         to excessive load and packet loss on the transmitting switch/bridge.
      
      This commit deals with the latter problem, by moving sending of
      broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
      to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
      bits in word 8 of the said message for this purpose, and introduce a
      new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
      backwards compatible.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 19 Aug, 2016 · 1 commit
    • tipc: ensure that link congestion and wakeup use same criteria · 5a0950c2
      Committed by Jon Paul Maloy
      When an attempt is made to wake up a link after congestion, it uses
      different, more generous criteria than when it was originally declared
      congested. This has the effect that the link, and the sending process,
      will sometimes be woken up unnecessarily, only to return immediately to
      congestion when it turns out there is not enough space in its send queue
      to host the pending message. This is a waste of CPU cycles.
      
      We now change the function link_prepare_wakeup() to use exactly the same
      criteria as tipc_link_xmit(). However, since we are now excluding the
      window limit from the wakeup calculation, and the current backlog limit
      for the lowest level is too small to house even a single maximum-size
      message, we have to expand this limit. We do this by evaluating an
      alternative, minimum value during the setting of the importance limits.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 12 Jul, 2016 · 2 commits
    • tipc: ensure correct broadcast send buffer release when peer is lost · a71eb720
      Committed by Jon Paul Maloy
      After a new receiver peer has been added to the broadcast transmission
      link, we allow immediate transmission of new broadcast packets, trusting
      that the new peer will not accept the packets until it has received the
      previously sent unicast broadcast initialization message. In the same
      way, the sender must not accept any acknowledges until it has itself
      received the broadcast initialization from the peer, as well as
      confirmation of the reception of its own initialization message.
      
      Furthermore, when a receiver peer goes down, the sender has to produce
      the missing acknowledges from the lost peer locally, in order to ensure
      correct release of the buffers that were expected to be acknowledged by
      the said peer.
      
      In a highly stressed system we have observed that contact with a peer
      may come up and be lost before the above-mentioned broadcast
      initialization and confirmation have been received. This leads to the locally
      produced acknowledges being rejected, and the non-acknowledged buffers
      to linger in the broadcast link transmission queue until it fills up
      and the link goes into permanent congestion.
      
      In this commit, we remedy this by temporarily setting the corresponding
      broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
      state to true before we issue the local acknowledges. This ensures that
      those acknowledges will always be accepted. The mentioned state values
      are restored immediately afterwards when the link is reset.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: extend broadcast link initialization criteria · 2d18ac4b
      Committed by Jon Paul Maloy
      At first contact between two nodes, an endpoint might sometimes have
      time to send out a LINK_PROTOCOL/STATE packet before it has received
      the broadcast initialization packet from the peer, i.e., before it has
      received a valid broadcast packet number to add to the 'bc_ack' field
      of the protocol message.
      
      This means that the peer endpoint will receive a protocol packet with an
      invalid broadcast acknowledge value of 0. Under unlucky circumstances
      this may lead to the original, already received acknowledge value being
      overwritten, so that the whole broadcast link goes stale after a while.
      
      We fix this by delaying the setting of the link field 'bc_peer_is_up'
      until we know that the peer really has received our own broadcast
      initialization message. The latter is always sent out as the first
      unicast message on a link, and always with sequence number 1. Because
      of this, we only need to look for a non-zero unicast acknowledge value
      in the arriving STATE messages, and once that is confirmed we know we
      are safe and can set the mentioned field. Before this moment, we must
      ignore all broadcast acknowledges from the peer.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 16 Jun, 2016 · 2 commits
    • tipc: eliminate uninitialized variable warning · c91522f8
      Committed by Ying Xue
      net/tipc/link.c: In function ‘tipc_link_timeout’:
      net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]
      
      Fixes: 42b18f60 ("tipc: refactor function tipc_link_timeout()")
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add neighbor monitoring framework · 35c55c98
      Committed by Jon Paul Maloy
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the members of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
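The selection of the local monitoring domain can be sketched as below. Two points are assumptions rather than statements from the commit: N is taken as the cluster size, and sqrt(N) is rounded down. Under those assumptions the figure of 38 monitored neighbors for 400 nodes is consistent with 2 * (sqrt(400) - 1), i.e. 19 local domain members plus 19 remote heads. The function name is hypothetical.

```python
import math


def local_domain(self_id, node_ids):
    """Return the nodes that `self_id` actively monitors in its local domain."""
    ring = sorted(node_ids)               # same ordering on every node
    n = len(ring)                         # N taken as cluster size (assumption)
    m = max(math.isqrt(n) - 1, 0)         # M = sqrt(N) - 1, rounded down
    start = ring.index(self_id)
    # Take the next M nodes downstream, wrapping around the circular list.
    return [ring[(start + i) % n] for i in range(1, m + 1)]


cluster = list(range(1, 401))             # node identities 1..400
domain = local_domain(395, cluster)
full_mesh_neighbors = len(cluster) - 1    # neighbors monitored before the change
```

For node 395 the domain runs through 400 and then wraps around to node 1, which is what makes the list circular.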
  14. 09 Jun, 2016 · 1 commit
    • tipc: change node timer unit from jiffies to ms · 5ca509fc
      Committed by Jon Paul Maloy
      The node keepalive interval is recalculated at each timer expiration
      to catch any changes in the link tolerance, and stored in a field in
      struct tipc_node. We use jiffies as unit for the stored value.
      
      This is suboptimal, because it makes the calculation unnecessarily
      complex, including two unit conversions. The conversions also lead to
      a rounding error that causes the link "abort limit" to be 3 in the
      normal case, instead of 4, as intended. This again leads to unnecessary
      link resets when the network is pushed close to its limit, e.g., in an
      environment with hundreds of nodes or namespaces.
      
      In this commit, we do instead let the keepalive value be calculated and
      stored in milliseconds, so that there is only one conversion and the
      rounding error is eliminated.
      
      We also remove a redundant "keepalive" field in struct tipc_link. This
      is a remnant from the previous implementation.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
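The kind of rounding error described above can be reproduced with simple integer arithmetic. This is a worked example under stated assumptions, not the original kernel calculation: HZ = 250, a round-up conversion from milliseconds to jiffies, a tolerance of 1500 ms and a keepalive interval of one quarter of the tolerance are all assumed, and the helper functions are simplified stand-ins.

```python
def msecs_to_jiffies(ms, hz):
    # Simplified stand-in: convert and round up to whole jiffies.
    return -(-ms * hz // 1000)


def jiffies_to_msecs(j, hz):
    # Simplified stand-in: convert back, rounding down.
    return j * 1000 // hz


HZ = 250                                  # assumed tick rate (4 ms per jiffy)
tolerance_ms = 1500                       # assumed link tolerance
keepalive_ms = tolerance_ms // 4          # 375 ms

# Old style: store jiffies, convert back when computing the abort limit.
keepalive_jiffies = msecs_to_jiffies(keepalive_ms, HZ)            # 94
old_abort_limit = tolerance_ms // jiffies_to_msecs(keepalive_jiffies, HZ)

# New style: keep everything in milliseconds.
new_abort_limit = tolerance_ms // keepalive_ms
```

With these numbers 375 ms becomes 94 jiffies, which converts back to 376 ms, and 1500 // 376 yields 3 rather than the intended 4.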
  15. 25 Apr, 2016 · 1 commit
  16. 16 Apr, 2016 · 4 commits
    • tipc: let first message on link be a state message · 34b9cd64
      Committed by Jon Paul Maloy
      According to the link FSM, a received traffic packet can take a link
      from state ESTABLISHING to ESTABLISHED, but the link can still not be
      fully set up in one atomic operation. This means that even if the
      very first packet on the link is a traffic packet with sequence number
      1 (one), it has to be dropped and retransmitted.
      
      This can be avoided if we let the mentioned packet be preceded by a
      LINK_PROTOCOL/STATE message, which takes up the endpoint before the
      arrival of the traffic.
      
      We add this small feature in this commit.
      
      This is a fully compatible change.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: refactor function tipc_link_timeout() · 42b18f60
      Committed by Jon Paul Maloy
      The function tipc_link_timeout() is unnecessarily complex, and can
      easily be made more readable.
      
      We do that with this commit. The only functional change is that we
      remove a redundant test for whether the broadcast link is up or not.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: reduce transmission rate of reset messages when link is down · 88e8ac70
      Committed by Jon Paul Maloy
      When a link is down, it will continuously try to re-establish contact
      with the peer by sending out a RESET or an ACTIVATE message at each
      timeout interval. The default value for this interval is currently
      375 ms. This is wasteful, and may become a problem in very large
      clusters with dozens or hundreds of nodes being down simultaneously.
      
      We now introduce a simple backoff algorithm for these cases. The
      first five messages are sent at default rate; thereafter a message
      is sent only every 16th timer interval.
      
      This will cover the vast majority of link recycling cases, since the
      endpoint starting last will transmit at the higher speed, and the link
      should normally be established well before the rate needs to be
      reduced.
      
      The only case where we will see a degradation of link re-establishment
      times is when the endpoints remain intact, and a glitch in the
      transmission media is causing the link reset. We will then experience
      a worst-case re-establishing time of 6 seconds, something we deem
      acceptable.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
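The backoff schedule and the 6 second worst case can be checked with a short simulation. The constants come from the commit message (375 ms interval, five full-rate messages, every 16th interval afterwards); the function name and the exact tick accounting are assumptions.

```python
TIMER_INTERVAL_MS = 375    # default timeout interval from the commit
FULL_RATE_MSGS = 5         # messages sent at the default rate
BACKOFF_FACTOR = 16        # afterwards, one message per 16 intervals


def reset_send_ticks(total_ticks):
    """Return the timer ticks (1-based) at which a RESET/ACTIVATE is sent."""
    sent = []
    for tick in range(1, total_ticks + 1):
        if len(sent) < FULL_RATE_MSGS or tick - sent[-1] >= BACKOFF_FACTOR:
            sent.append(tick)
    return sent


ticks = reset_send_ticks(40)
worst_case_gap_ms = BACKOFF_FACTOR * TIMER_INTERVAL_MS
```

After the first five ticks the gap between messages grows to 16 intervals, and 16 * 375 ms gives the 6 seconds quoted as the worst-case re-establishment time.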
    • tipc: guarantee peer bearer id exchange after reboot · 634696b1
      Committed by Jon Paul Maloy
      When a link endpoint is going down locally, e.g., because its interface
      is being stopped, it will spontaneously send out a RESET message to
      its peer, informing it about this fact. This saves the peer from
      detecting the failure via probing, and hence gives both speedier and
      less resource consuming failure detection on the peer side.
      
      According to the link FSM, a receiver of a RESET message, ignoring the
      reason for it, must now consider the sender ready to come back up, and
      starts periodically sending out ACTIVATE messages to the peer in order
      to re-establish the link. Also, according to the FSM, the receiver of
      an ACTIVATE message can now go directly to state ESTABLISHED and start
      sending regular traffic packets. This is a well-proven and robust FSM.
      
      However, in the case of a reboot, there is a small possibility that the
      link endpoint on the rebooted node may have been re-created with a new bearer
      identity between the moment it sent its (pre-boot) RESET and the moment
      it receives the ACTIVATE from the peer. The new bearer identity cannot
      be known by the peer according to this scenario, since traffic headers
      don't convey such information. This is a problem, because both endpoints
      need to know the correct value of the peer's bearer id at any moment in
      time in order to be able to produce correct link events for their users.
      
      The only way to guarantee this is to enforce a full setup message
      exchange (RESET + ACTIVATE) even after the reboot, since those messages
      carry the bearer identity in their header.
      
      In this commit we do this by introducing and setting a "stopping" bit in
      the header of the spontaneously generated RESET messages, informing the
      peer that the sender will not be immediately ready to re-establish the
      link. A receiver seeing this bit must act as if this were a locally
      detected connectivity failure, and hence has to go through a full two-
      way setup message exchange before any link can be re-established.
      
      Although never reported, this problem seems to have always been around.
      
      This protocol addition is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      634696b1
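The receiver-side rule described in commit 634696b1 can be illustrated with a minimal sketch. All type, field, and function names below are hypothetical and chosen for illustration only; they are not the kernel's definitions, and the real FSM has more states and events than are modelled here.

```c
#include <stdbool.h>

/* Hypothetical actions a link endpoint may take on receiving RESET. */
enum rx_action {
    RX_SEND_ACTIVATE, /* peer is ready: start sending ACTIVATE messages */
    RX_FULL_SETUP     /* act as on a locally detected failure: require a
                         full RESET + ACTIVATE exchange before link-up */
};

/* Hypothetical view of the relevant RESET header fields. */
struct reset_hdr {
    bool stopping;      /* set on spontaneously generated RESET messages */
    unsigned bearer_id; /* bearer identity carried by setup messages */
};

/* Decide how to react to a received RESET message. */
static enum rx_action on_reset(const struct reset_hdr *hdr)
{
    if (hdr->stopping)
        return RX_FULL_SETUP;
    return RX_SEND_ACTIVATE;
}
```

In this sketch a RESET without the bit keeps the old behaviour, which is one way to read the commit's claim of backwards compatibility: a peer that never sets the bit is handled exactly as before.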
  17. 08 Mar, 2016 1 commit
  18. 07 Mar, 2016 1 commit
  19. 20 Feb, 2016 1 commit
  20. 17 Feb, 2016 1 commit
  21. 06 Feb, 2016 2 commits
  22. 21 Nov, 2015 7 commits
    • J
      tipc: correct settings of broadcast link state · 9a650838
      Jon Paul Maloy committed
      Since commit 52666986 ("tipc: let broadcast packet
      reception use new link receive function") the broadcast send
      link state was meant to always be set to LINK_ESTABLISHED, since
      we don't need this link to follow the regular link FSM rules. It
      was also the intention that this state anyway shouldn't impact
      the run-time working state of the link, since the latter in
      reality is controlled by the number of registered peers.
      
      We have now discovered that this assumption is not quite correct.
      If the broadcast link is reset because of too many retransmissions,
      its state will inadvertently go to LINK_RESETTING, and never go
      back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
      anticipated. This will work well once, but if it happens a second
      time, the reset on a link in LINK_RESETTING has no effect, and
      neither the broadcast link nor the unicast links will go down as
      they should.
      
      Furthermore, it is confusing that the management tool shows that
      this link is in UP state when that obviously isn't the case.
      
      We now ensure that this state strictly follows the true working
      state of the link. The state is set to LINK_ESTABLISHED when
      the number of peers is non-zero, and to LINK_RESET otherwise.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a650838
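The rule introduced by commit 9a650838 (the state follows the number of registered peers) can be sketched as below. The state names LINK_ESTABLISHED and LINK_RESET are taken from the commit message; the struct and the helper functions are hypothetical and are not the kernel implementation.

```c
/* State names as used in the commit message. */
enum link_state { LINK_RESET, LINK_ESTABLISHED };

/* Hypothetical, simplified broadcast send link. */
struct bc_link {
    unsigned peers;        /* number of registered peers */
    enum link_state state; /* always derived from 'peers' */
};

/* Recompute the state from the peer count. */
static void bc_update_state(struct bc_link *l)
{
    l->state = l->peers ? LINK_ESTABLISHED : LINK_RESET;
}

/* Register a peer and refresh the state. */
static void bc_add_peer(struct bc_link *l)
{
    l->peers++;
    bc_update_state(l);
}

/* Unregister a peer (if any) and refresh the state. */
static void bc_remove_peer(struct bc_link *l)
{
    if (l->peers)
        l->peers--;
    bc_update_state(l);
}
```

Because the state is recomputed from the peer count on every change, no FSM event can leave it stuck in an intermediate state such as LINK_RESETTING.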
    • J
      tipc: eliminate remnants of hungarian notation · 1a90632d
      Jon Paul Maloy committed
      The number of variables with Hungarian notation (l_ptr, n_ptr etc.)
      has been significantly reduced over the last couple of years.
      
      We now root out the last traces of this practice.
      There are no functional changes in this commit.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a90632d
    • J
      tipc: narrow down interface towards struct tipc_link · 38206d59
      Jon Paul Maloy committed
      We move the definition of struct tipc_link from link.h to link.c in
      order to minimize its exposure to the rest of the code.
      
      When needed, we define new functions to make it possible for external
      entities to access and set data in the link.
      
      Apart from the above, there are no functional changes.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      38206d59
    • J
      tipc: narrow down exposure of struct tipc_node · 5be9c086
      Jon Paul Maloy committed
      In our effort to have less code and include dependencies between
      entities such as node, link and bearer, we try to narrow down
      the exposed interface towards the node as much as possible.
      
      In this commit, we move the definition of struct tipc_node, along
      with many of its associated function declarations, from node.h to
      node.c. We also move some function definitions from link.c and
      name_distr.c to node.c, since they access fields in struct tipc_node
      that should not be externally visible. The moved functions are renamed
      according to new location, and made static whenever possible.
      
      There are no functional changes in this commit.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5be9c086
    • J
      tipc: convert node lock to rwlock · 5405ff6e
      Jon Paul Maloy committed
      According to the node FSM a node in state SELF_UP_PEER_UP cannot
      change state inside a lock context, except when a TUNNEL_PROTOCOL
      (SYNCH or FAILOVER) packet arrives. However, the node's individual
      links may still change state.
      
      Since each link now is protected by its own spinlock, we finally have
      the conditions in place to convert the node spinlock to an rwlock_t.
      If the node state and arriving packet type are right, we can let the
      link directly receive the packet under protection of its own spinlock
      and the node lock in read mode. In all other cases we use the node
      lock in write mode. This enables full concurrent execution between
      parallel links during steady-state traffic situations, i.e., 99+ %
      of the time.
      
      This commit implements this change.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5405ff6e
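The lock-mode choice described in commit 5405ff6e can be sketched as a small decision function. This assumes that "the node state and arriving packet type are right" means the node is in SELF_UP_PEER_UP and the packet is not a TUNNEL_PROTOCOL packet, which is how the first paragraph of the message reads. The enum and function names are hypothetical and are not the kernel's.

```c
/* Hypothetical, reduced set of node states. */
enum node_state { SELF_UP_PEER_UP, NODE_OTHER_STATE };

/* Hypothetical, reduced set of packet types. */
enum pkt_type { PKT_DATA, PKT_TUNNEL_PROTOCOL };

/* Mode in which the node's rwlock would be taken. */
enum lock_mode { NODE_LOCK_READ, NODE_LOCK_WRITE };

/*
 * Steady-state traffic cannot change the node state, so the read lock
 * plus the link's own spinlock is enough. Everything else may change
 * the node state and therefore needs the write lock.
 */
static enum lock_mode node_lock_mode(enum node_state s, enum pkt_type t)
{
    if (s == SELF_UP_PEER_UP && t != PKT_TUNNEL_PROTOCOL)
        return NODE_LOCK_READ;
    return NODE_LOCK_WRITE;
}
```

Since several readers can hold an rwlock at once, parallel links on the same node can receive packets concurrently whenever this function would return the read mode.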
    • J
      tipc: introduce per-link spinlock · 2312bf61
      Jon Paul Maloy committed
      As a preparation to allow parallel links to work more independently
      from each other we introduce a per-link spinlock, to be stored in the
      struct node's link entry area. Since the node lock still is a regular
      spinlock there is no increase in parallelism at this stage.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2312bf61
    • J
      tipc: move linearization of buffers to generic code · c7cad0d6
      Jon Paul Maloy committed
      In commit 5cbb28a4 ("tipc: linearize arriving NAME_DISTR
      and LINK_PROTO buffers") we added linearization of NAME_DISTRIBUTOR,
      LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE to the function
      tipc_udp_recv(). The location of the change was selected in order
      to make the commit easily applicable to 'net' and 'stable'.
      
      We now move this linearization to where it should be done, in the
      functions tipc_named_rcv() and tipc_link_proto_rcv() respectively.
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7cad0d6
  23. 25 Oct, 2015 1 commit
  24. 24 Oct, 2015 1 commit