1. 16 6月, 2016 1 次提交
    • J
      tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy 提交于
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the member of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying  unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35c55c98
  2. 06 3月, 2015 1 次提交
  3. 10 2月, 2015 3 次提交
  4. 27 11月, 2014 1 次提交
    • Y
      tipc: remove node subscription infrastructure · a8f48af5
      Ying Xue 提交于
      The node subscribe infrastructure represents a virtual base class, so
      its users, such as struct tipc_port and struct publication, can derive
      its implemented functionalities. However, after the removal of struct
      tipc_port, struct publication is left as its only single user now. So
      defining an abstract infrastructure for one user becomes no longer
      reasonable. If corresponding new functions associated with the
      infrastructure are moved to name_table.c file, the node subscription
      infrastructure can be removed as well.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Reviewed-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8f48af5
  5. 24 8月, 2014 2 次提交
  6. 06 5月, 2014 1 次提交
  7. 18 6月, 2013 2 次提交
    • Y
      tipc: introduce new TIPC server infrastructure · c5fa7b3c
      Ying Xue 提交于
      TIPC has two internal servers, one providing a subscription
      service for topology events, and another providing the
      configuration interface. These servers have previously been running
      in BH context, accessing the TIPC-port (aka native) API directly.
      Apart from these servers, even the TIPC socket implementation is
      partially built on this API.
      
      As this API may simultaneously be called via different paths and in
      different contexts, a complex and costly lock policiy is required
      in order to protect TIPC internal resources.
      
      To eliminate the need for this complex lock policiy, we introduce
      a new, generic service API that uses kernel sockets for message
      passing instead of the native API. Once the toplogy and configuration
      servers are converted to use this new service, all code pertaining
      to the native API can be removed. This entails a significant
      reduction in code amount and complexity, and opens up for a complete
      rework of the locking policy in TIPC.
      
      The new service also solves another problem:
      
      As the current topology server works in BH context, it cannot easily
      be blocked when sending of events fails due to congestion. In such
      cases events may have to be silently dropped, something that is
      unacceptable. Therefore, the new service keeps a dedicated outbound
      queue receiving messages from BH context. Once messages are
      inserted into this queue, we will immediately schedule a work from a
      special workqueue. This way, messages/events from the topology server
      are in reality sent in process context, and the server can block
      if necessary.
      
      Analogously, there is a new workqueue for receiving messages. Once a
      notification about an arriving message is received in BH context, we
      schedule a work from the receive workqueue to do the job of
      receiving the message in process context.
      
      As both sending and receive messages are now finished in processes,
      subscribed events cannot be dropped any more.
      
      As of this commit, this new server infrastructure is built, but
      not actually yet called by the existing TIPC code, but since the
      conversion changes required in order to use it are significant,
      the addition is kept here as a separate commit.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5fa7b3c
    • Y
      tipc: change socket buffer overflow control to respect sk_rcvbuf · cc79dd1b
      Ying Xue 提交于
      As per feedback from the netdev community, we change the buffer
      overflow protection algorithm in receiving sockets so that it
      always respects the nominal upper limit set in sk_rcvbuf.
      
      Instead of scaling up from a small sk_rcvbuf value, which leads to
      violation of the configured sk_rcvbuf limit, we now calculate the
      weighted per-message limit by scaling down from a much bigger value,
      still in the same field, according to the importance priority of the
      received message.
      
      To allow for administrative tunability of the socket receive buffer
      size, we create a tipc_rmem sysctl variable to allow the user to
      configure an even bigger value via sysctl command.  It is a size of
      three (min/default/max) to be consistent with things like tcp_rmem.
      
      By default, the value initialized in tipc_rmem[1] is equal to the
      receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
      This value is also set as the default value of sk_rcvbuf.
      Originally-by: NJon Maloy <jon.maloy@ericsson.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      [Ying: added sysctl variation to Jon's original patch]
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      [PG: don't compile sysctl.c if not config'd; add Documentation]
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc79dd1b
  8. 18 4月, 2013 1 次提交
    • P
      tipc: add InfiniBand media type · a29a194a
      Patrick McHardy 提交于
      Add InfiniBand media type based on the ethernet media type.
      
      The only real difference is that in case of InfiniBand, we need the entire
      20 bytes of space reserved for media addresses, so the TIPC media type ID is
      not explicitly stored in the packet payload.
      
      Sample output of tipc-config:
      
      # tipc-config -v -addr -netid -nt=all -p -m -b -n -ls
      
      node address: <10.1.4>
      current network id: 4711
      Type       Lower      Upper      Port Identity              Publication Scope
      0          167776257  167776257  <10.1.1:1855512577>        1855512578  cluster
                 167776260  167776260  <10.1.4:1216454657>        1216454658  zone
      1          1          1          <10.1.4:1216479235>        1216479236  node
      Ports:
      1216479235: bound to {1,1}
      1216454657: bound to {0,167776260}
      Media:
      eth
      ib
      Bearers:
      ib:ib0
      Nodes known:
      <10.1.1>: up
      Link <broadcast-link>
        Window:20 packets
        RX packets:0 fragments:0/0 bundles:0/0
        TX packets:0 fragments:0/0 bundles:0/0
        RX naks:0 defs:0 dups:0
        TX naks:0 acks:0 dups:0
        Congestion bearer:0 link:0  Send queue max:0 avg:0
      
      Link <10.1.4:ib0-10.1.1:ib0>
        ACTIVE  MTU:2044  Priority:10  Tolerance:1500 ms  Window:50 packets
        RX packets:80 fragments:0/0 bundles:0/0
        TX packets:40 fragments:0/0 bundles:0/0
        TX profile sample:22 packets  average:54 octets
        0-64:100% -256:0% -1024:0% -4096:0% -16384:0% -32768:0% -66000:0%
        RX states:410 probes:213 naks:0 defs:0 dups:0
        TX states:410 probes:197 naks:0 acks:0 dups:0
        Congestion bearer:0 link:0  Send queue max:1 avg:0
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a29a194a
  9. 01 5月, 2012 1 次提交
    • P
      tipc: compress out gratuitous extra carriage returns · 617d3c7a
      Paul Gortmaker 提交于
      Some of the comment blocks are floating in limbo between two
      functions, or between blocks of code.  Delete the extra line
      feeds between any comment and its associated following block
      of code, to be consistent with the majority of the rest of
      the kernel.  Also delete trailing newlines at EOF and fix
      a couple trivial typos in existing comments.
      
      This is a 100% cosmetic change with no runtime impact.  We get
      rid of over 500 lines of non-code, and being blank line deletes,
      they won't even show up as noise in git blame.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      617d3c7a
  10. 02 1月, 2011 4 次提交
  11. 13 1月, 2006 1 次提交