1. 30 Oct 2019, 1 commit
    • tipc: improve throughput between nodes in netns · f73b1281
      Hoang Le authored
      Currently, TIPC transports intra-node user data messages directly
      from socket to socket, thereby shortcutting all the lower layers of
      the communication stack. This gives TIPC very good intra-node
      performance, both regarding throughput and latency.
      
      We now introduce a similar mechanism for TIPC data traffic across
      network namespaces located in the same kernel. On the send path, the
      call chain is, as always, accompanied by the sending node's network
      namespace pointer. However, once we have reliably established that the
      receiving node is represented by a namespace on the same host, we just
      replace the namespace pointer with that of the receiving
      node/namespace, and follow the regular socket receive path through the
      receiving node. This technique gives us a throughput similar to the
      node-internal throughput, several times larger than if we let the
      traffic go through the full network stacks. As a comparison, max
      throughput for 64k messages is four times larger than TCP throughput
      for the same type of traffic.
      
      To meet any security concerns, the following should be noted.
      
      - All nodes joining a cluster are supposed to have been certified
      and authenticated by mechanisms outside TIPC. This is no different for
      nodes/namespaces on the same host; they have to auto-discover each
      other using the attached interfaces, and establish links which are
      supervised via the regular link monitoring mechanism. Hence, a
      kernel-local node has no other way to join a cluster than any other
      node, and has to obey the policies set in the IP or device layers of
      the stack.
      
      - Only when a sender has established with 100% certainty that the peer
      node is located in a kernel local namespace does it choose to let user
      data messages, and only those, take the crossover path to the receiving
      node/namespace.
      
      - If the receiving node/namespace is removed, its namespace pointer
      is invalidated at all peer nodes, and their neighbor link monitoring
      will eventually note that this node is gone.
      
      - To ensure the "100% certainty" criteria, and to prevent any possible
      spoofing, received discovery messages must contain a proof that the
      sender knows a common secret. We use the hash mix of the sending
      node/namespace for this purpose, since it can be accessed directly by
      all other namespaces in the kernel. Upon reception of a discovery
      message, the receiver checks this proof against all the local
      namespaces' hash_mix values. If it finds a match, that, along with a
      matching node id and cluster id, is deemed sufficient proof that
      the peer node in question is in a local namespace, and a wormhole can
      be opened (see the sketch after this list).
      
      - We should also consider that TIPC is intended to be a cluster-local
      IPC mechanism (just like e.g. UNIX sockets) rather than a network
      protocol, and hence we think it is justified to allow it to shortcut
      the lower protocol layers.
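      
      The hash_mix check above lends itself to a compact illustration.
      Below is a hedged sketch, not the actual patch code:
      tipc_find_local_peer_ns() and the 'proof' argument are made-up names,
      while for_each_net() and net_hash_mix() are existing kernel APIs. A
      match would still have to be combined with the node id and cluster id
      checks before a wormhole is opened.
      
	#include <net/net_namespace.h>
	#include <net/netns/hash.h>

	/* Return the local namespace whose hash_mix matches the proof
	 * carried in a received discovery message, or NULL if the peer
	 * is not on this host. Caller must hold the netns list lock.
	 */
	static struct net *tipc_find_local_peer_ns(u32 proof)
	{
		struct net *net;

		for_each_net(net) {
			if (net_hash_mix(net) == proof)
				return net;	/* peer lives in this kernel */
		}
		return NULL;			/* take the full network stack */
	}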
      
      Regarding traceability, we should note that since commit 6c9081a3
      ("tipc: add loopback device tracking") it is possible to follow the
      node-internal packet flow simply by activating tcpdump on the loopback
      interface. This holds for this mechanism too: by activating tcpdump on
      the involved nodes' loopback interfaces, their inter-namespace
      messaging can easily be tracked.
      
      v2:
      - update 'net' pointer when node left/rejoined
      v3:
      - grab read/write lock when using node ref obj
      v4:
      - clone traffic between netns to loopback
      Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 Aug 2019, 1 commit
    • tipc: add loopback device tracking · 6c9081a3
      John Rutherford authored
      Since node internal messages are passed directly to the socket, it is not
      possible to observe those messages via tcpdump or wireshark.
      
      We now remedy this by making it possible to clone such messages and send
      the clones to the loopback interface.  The clones are dropped at reception
      and have no functional role except making the traffic visible.
      
      The feature is enabled if network taps are active for the loopback device.
      pcap filtering restrictions require the messages to be presented to the
      receiving side of the loopback device.
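      
      As a hedged sketch of the cloning step: the helper below is a
      made-up wrapper, but dev_nit_active(), skb_clone() and
      netif_rx_ni() are the real APIs named in this commit.
      
	#include <linux/netdevice.h>
	#include <linux/skbuff.h>
	#include <net/net_namespace.h>

	/* Clone a node-internal message to the loopback device so that
	 * taps (tcpdump, wireshark) can see it.
	 */
	static void tipc_clone_to_loopback(struct net *net, struct sk_buff *skb)
	{
		struct net_device *dev = net->loopback_dev;
		struct sk_buff *clone;

		if (!dev_nit_active(dev))	/* no taps: nothing to do */
			return;
		clone = skb_clone(skb, GFP_ATOMIC);
		if (!clone)
			return;
		clone->dev = dev;		/* present it on loopback */
		netif_rx_ni(clone);		/* dropped again at reception */
	}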
      
      v3 - Function dev_nit_active used to check for network taps.
         - Procedure netif_rx_ni used to send cloned messages to loopback device.
      Signed-off-by: John Rutherford <john.rutherford@dektech.com.au>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 20 Mar 2019, 1 commit
  4. 01 Apr 2018, 1 commit
    • tipc: replace name table service range array with rb tree · 218527fe
      Jon Maloy authored
      The current design of the binding table has an unnecessarily
      memory-consuming and complex data structure. It aggregates the service
      range items into an array, which is expanded by a factor of two every
      time it becomes too small to hold a new item. Furthermore, the arrays
      never shrink when the number of ranges diminishes.
      
      We now replace this array with an RB tree that holds the range
      items as tree nodes, each range directly holding a list of bindings.
      
      This, along with a few name changes, improves readability and reduces
      the volume of the code, as well as reducing memory consumption and,
      hopefully, improving the cache hit rate.
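      
      A hedged sketch of the new layout, assuming non-overlapping ranges
      for simplicity; struct and field names are illustrative, but the
      lookup follows the standard kernel rbtree pattern.
      
	#include <linux/rbtree.h>
	#include <linux/list.h>
	#include <linux/types.h>

	struct service_range {
		struct rb_node tree_node;	/* linkage in the rb tree */
		u32 lower, upper;		/* instance range */
		struct list_head bindings;	/* publications in this range */
	};

	static struct service_range *range_find(struct rb_root *root, u32 instance)
	{
		struct rb_node *n = root->rb_node;

		while (n) {
			struct service_range *sr;

			sr = rb_entry(n, struct service_range, tree_node);
			if (instance < sr->lower)
				n = n->rb_left;
			else if (instance > sr->upper)
				n = n->rb_right;
			else
				return sr;	/* instance inside this range */
		}
		return NULL;
	}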
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 24 Mar 2018, 3 commits
    • tipc: handle collisions of 32-bit node address hash values · 25b0b9c4
      Jon Maloy authored
      When a 32-bit node address is generated from a 128-bit identifier,
      there is a risk of collisions which must be discovered and handled.
      
      We do this as follows:
      - We don't apply the generated address immediately to the node;
        instead, we initiate a 1 sec trial period to allow other cluster
        members to discover and handle such collisions.
      
      - During the trial period the node periodically sends out a new type
        of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
        to all the other nodes in the cluster.
      
      - When a node receives such a message, it must check that the
        presented 32-bit identifier is either unused, or was used by the
        very same peer in a previous session. In both cases it accepts the
        request by not responding to it (see the sketch at the end of this
        message).
      
      - If it finds that the same node has been up before using a different
        address, it responds with a DSC_TRIAL_FAIL_MSG containing that
        address.
      
      - If it finds that the address has already been taken by some other
        node, it generates a new, unused address and returns it to the
        requester.
      
      - During the trial period the requesting node must always be prepared
        to accept a failure message, i.e., a message where a peer suggests a
        different (or equal) address to the one tried. In those cases it
        must apply the suggested value as trial address and restart the
        trial period.
      
      This algorithm ensures that in the vast majority of cases a node will
      have the same address before and after a reboot. If a legacy user
      configures the address explicitly, there will be no trial period and
      no trial messages, so this protocol addition is completely backwards
      compatible.
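      
      A hedged sketch of the receive-side decision described above; all
      names are illustrative, and the caller is assumed to have already
      looked up whether the tried address is taken and which address this
      peer used before.
      
	#include <linux/types.h>

	/* Return the address to suggest in a DSC_TRIAL_FAIL_MSG,
	 * or 0 to accept the trial silently.
	 */
	static u32 trial_check(u32 tried, bool taken_by_other,
			       u32 peers_prev_addr, u32 fresh_addr)
	{
		if (peers_prev_addr && peers_prev_addr != tried)
			return peers_prev_addr;	/* peer was up before: reuse */
		if (!taken_by_other)
			return 0;		/* unused or same peer: accept */
		return fresh_addr;		/* collision: suggest new addr */
	}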
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add 128-bit node identifier · d50ccc2d
      Jon Maloy authored
      We add a 128-bit node identity, as an alternative to the currently used
      32-bit node address.
      
      For the sake of compatibility and to minimize message header changes
      we retain the existing 32-bit address field. When not set explicitly by
      the user, this field will be filled with a hash value generated from the
      much longer node identity, and be used as a shorthand value for the
      latter.
      
      We permit either the address or the identity to be set by
      configuration, but not both, so when the address value is set by a
      legacy user the corresponding 128-bit node identity is generated based
      on that value.
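      
      One plausible way to derive the 32-bit shorthand, sketched under
      the assumption that an XOR fold of the identity's four words is
      acceptable; the actual patch may hash differently.
      
	#include <linux/string.h>
	#include <linux/types.h>

	static u32 addr_from_id(const u8 id[16])
	{
		u32 w[4];

		memcpy(w, id, sizeof(w));	/* avoid alignment issues */
		return w[0] ^ w[1] ^ w[2] ^ w[3];
	}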
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: allow closest-first lookup algorithm when legacy address is configured · b89afb11
      Jon Maloy authored
      The removal of an internal structure of the node address has an unwanted
      side effect.
      - Currently, if a user is sending an anycast message with destination
        domain 0, the tipc_nametbl_translate() function will use the
        'closest-first' algorithm to first look for a node-local
        destination, and only when none is found will it resort to the
        cluster-global 'round-robin' lookup algorithm.
      - Current users can get around this, and enforce unconditional use of
        global round-robin by indicating a destination as Z.0.0 or Z.C.0.
      - This option disappears when we make the node address flat, since the
        lookup algorithm has no way of recognizing this case. So, as long as
        there are node local destinations, the algorithm will always select
        one of those, and there is nothing the sender can do to change this.
      
      We solve this by eliminating the 'closest-first' option for non-legacy
      users, but only for those; it was never a good idea anyway. To
      distinguish between legacy users and non-legacy users we introduce a
      new flag 'legacy_addr_format' in struct tipc_net, to be set when the
      user configures a legacy-style Z.C.N node address. Hence, when a
      legacy user indicates a zero lookup domain, 'closest-first' is
      selected, and in all other cases we use 'round-robin'.
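      
      The resulting selection rule is small enough to sketch; the names
      here are illustrative.
      
	#include <linux/types.h>

	/* Only legacy Z.C.N users asking for lookup domain 0 keep the
	 * old closest-first behaviour; everyone else gets round-robin.
	 */
	static bool use_closest_first(bool legacy_addr_format, u32 domain)
	{
		return legacy_addr_format && !domain;
	}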
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 18 Mar 2018, 1 commit
  7. 17 Feb 2018, 1 commit
  8. 09 Jan 2018, 1 commit
  9. 13 Oct 2017, 1 commit
  10. 18 Nov 2016, 1 commit
    • netns: make struct pernet_operations::id unsigned int · c7d03a00
      Alexey Dobriyan authored
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and is thus
      an unsigned entity. Using a negative value is an out-of-bounds
      access by definition.
      
      2)
      On x86_64, unsigned 32-bit data which are mixed with pointers
      via array indexing, or offsets added to or subtracted from pointers,
      are preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned, because x86_64 zero-extends by default.
      
      Now, there is the net_generic() function which, you guessed it right,
      uses "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately, some functions actually grow bigger.
      This is a seemingly random artefact of code generation, with the
      register allocator being used differently: gcc decides that some
      variable needs to live in the new r8+ registers, and every access now
      requires a REX prefix. Or it is shifted into r12, so the [r12+0]
      addressing mode has to be used, which is longer than [r8].
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 Jun 2016, 1 commit
    • tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy authored
      TIPC-based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases by a factor of N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain" (a sketch of this selection
        follows the list below).
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures
        that cause massive network partitions. If both a lost node and all
        its direct monitoring neighbors are inside the lost partition, the
        nodes in the remaining partition will never receive indications
        about the loss. To overcome this, each node also chooses to actively
        monitor some nodes outside its local domain. Those nodes are called
        remote domain "heads", and are selected in such a way that no node
        in the cluster will be more than two direct monitoring hops away.
        Because of this, each node, apart from monitoring the members of its
        local domain, will also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list,
        such as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
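      
      A hedged sketch of the local-domain selection referenced above:
      'peers' is assumed to be the ascending, circular list of known node
      ids and 'self' this node's index in it; int_sqrt() is the real
      kernel helper, the rest is illustrative.
      
	#include <linux/kernel.h>
	#include <linux/types.h>

	static int select_local_domain(const u32 *peers, int n, int self,
				       u32 *domain)
	{
		int m = int_sqrt(n) - 1;	/* members to monitor actively */
		int i;

		for (i = 0; i < m; i++)		/* next m downstream neighbors */
			domain[i] = peers[(self + 1 + i) % n];
		return m;			/* domain size */
	}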
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 12 Apr 2016, 1 commit
  13. 21 Nov 2015, 1 commit
  14. 24 Oct 2015, 4 commits
    • tipc: clean up unused code and structures · 2af5ae37
      Jon Paul Maloy authored
      After the previous changes in this series, we can now remove some
      unused code and structures, both in the broadcast, link aggregation
      and link code.
      
      There are no functional changes in this commit.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: let broadcast packet reception use new link receive function · 52666986
      Jon Paul Maloy authored
      The code path for receiving broadcast packets is currently distinct
      from the unicast path. This leads to unnecessary code and data
      duplication, something that can be avoided with some effort.
      
      We now introduce separate per-peer tipc_link instances for handling
      broadcast packet reception. Each receive link keeps a pointer to the
      common, single, broadcast link instance, and can hence handle release
      and retransmission of send buffers as if those belonged to the link
      itself.
      
      Furthermore, we let each unicast link instance keep a reference both
      to the pertaining broadcast receive link and to the common send link.
      This makes it possible for the unicast links to easily access data for
      broadcast link synchronization, as well as to carry acknowledgments
      for received broadcast packets.
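      
      A hedged sketch of the reference layout this gives each unicast
      link; the struct and field names are illustrative.
      
	struct tipc_link_sketch {
		/* ... regular unicast link state ... */
		struct tipc_link_sketch *bc_rcvlink;	/* peer's bcast rcv link */
		struct tipc_link_sketch *bc_sndlink;	/* common bcast send link */
	};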
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: move broadcast link lock to struct tipc_net · 0043550b
      Jon Paul Maloy authored
      The broadcast lock will need to be acquired outside bcast.c in a later
      commit. For this reason, we move the lock to struct tipc_net.
      Consistent with the changes in the previous commit, we also introduce
      two new functions, tipc_bcast_lock() and tipc_bcast_unlock(). The code
      that is currently using tipc_bclink_lock()/unlock() will be phased out
      during the coming commits in this series.
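      
      A hedged sketch of the two new helpers, assuming the lock becomes a
      spinlock field (here called 'bclock') in struct tipc_net, reached
      through TIPC's existing tipc_net() per-netns accessor.
      
	void tipc_bcast_lock(struct net *net)
	{
		spin_lock_bh(&tipc_net(net)->bclock);
	}

	void tipc_bcast_unlock(struct net *net)
	{
		spin_unlock_bh(&tipc_net(net)->bclock);
	}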
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: move bcast definitions to bcast.c · 6beb19a6
      Jon Paul Maloy authored
      Currently, a number of structure and function definitions related
      to the broadcast functionality are unnecessarily exposed in the file
      bcast.h. This obscures the fact that the external interface towards
      the broadcast link is in fact very narrow, and causes unnecessary
      recompilations of other files when anything changes in those
      definitions.
      
      In this commit, we move as many of those definitions as is currently
      possible to the file bcast.c.
      
      We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
      because the name does not correctly describe the contents of this
      struct (and will do so even less in the future), and because we want
      to use the term 'link' more appropriately in the functionality
      introduced later in this series.
      
      Finally, we rename a couple of functions, such as tipc_bclink_xmit()
      and others that will be kept in the future, to include the term 'bcast'
      instead.
      
      There are no functional changes in this commit.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 31 Jul 2015, 1 commit
  16. 21 Jul 2015, 1 commit
    • tipc: reduce locking scope during packet reception · d999297c
      Jon Paul Maloy authored
      We convert packet/message reception according to the same principle
      we have been using for message sending and timeout handling:
      
      We move the function tipc_rcv() to node.c, hence handling the initial
      packet reception at the link aggregation level. The function grabs
      the node lock, selects the receiving link, and accesses it via a new
      call tipc_link_rcv(). This function appends buffers to the input
      queue for delivery upwards, but it may also append outgoing packets
      to the xmit queue, just as we do during regular message sending. The
      latter will happen when buffers are forwarded from the link backlog,
      or when retransmission is requested.
      
      Upon return of this function, and after having released the node lock,
      tipc_rcv() delivers/transmits the contents of those queues, but it may
      also perform actions such as link activation or reset, as indicated by
      the return flags from the link.
      
      This reduces the number of CPU cycles spent inside the node spinlock,
      and reduces contention on that lock.
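      
      A hedged sketch of this flow; tipc_rcv() and tipc_link_rcv() are
      the functions named above, but the types and helpers here are
      simplified stand-ins.
      
	#include <linux/skbuff.h>
	#include <linux/spinlock.h>

	struct node_stub {
		spinlock_t lock;
		void *link;			/* selected receiving link */
	};

	/* stand-ins for tipc_link_rcv() and the post-unlock processing */
	int link_rcv(void *l, struct sk_buff *skb,
		     struct sk_buff_head *inputq, struct sk_buff_head *xmitq);
	void deliver_and_xmit(struct sk_buff_head *inputq,
			      struct sk_buff_head *xmitq, int flags);

	static void node_rcv(struct node_stub *n, struct sk_buff *skb)
	{
		struct sk_buff_head inputq, xmitq;
		int rc;

		__skb_queue_head_init(&inputq);
		__skb_queue_head_init(&xmitq);

		spin_lock_bh(&n->lock);		/* short critical section */
		rc = link_rcv(n->link, skb, &inputq, &xmitq);
		spin_unlock_bh(&n->lock);

		/* delivery, transmission and link activation/reset all
		 * happen outside the lock, guided by the returned flags */
		deliver_and_xmit(&inputq, &xmitq, rc);
	}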
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 15 May 2015, 2 commits
    • tipc: simplify packet sequence number handling · e4bf4f76
      Jon Paul Maloy authored
      Although the sequence number in the TIPC protocol is 16 bits, we have
      until now stored it internally as an unsigned 32-bit integer.
      We got around this by always doing explicit modulo-65536 operations
      whenever we need to access a sequence number.
      
      We now make the incoming and outgoing sequence numbers unsigned
      16-bit integers, and remove the modulo operations where applicable.
      
      We also move the arithmetic inline functions for 16 bit integers
      to core.h, and the function buf_seqno() to msg.h, so they can easily
      be accessed from anywhere in the code.
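      
      A hedged sketch of the kind of wrap-safe 16-bit helpers this
      enables; the exact definitions in core.h may differ.
      
	#include <linux/types.h>

	/* true if 'left' precedes 'right' modulo 2^16 */
	static inline int less(u16 left, u16 right)
	{
		return (s16)(left - right) < 0;
	}

	static inline int more(u16 left, u16 right)
	{
		return (s16)(left - right) > 0;
	}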
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: simplify include dependencies · a6bf70f7
      Jon Paul Maloy authored
      When we try to add new inline functions in the code, we sometimes
      run into circular include dependencies.
      
      The main problem is that the file core.h, which really should be at
      the root of the dependency chain, instead is a leaf. I.e., core.h
      includes a number of header files that themselves should be allowed
      to include core.h. In reality this is unnecessary, because core.h does
      not need to know the full definition of any of the structs it refers
      to, only their type declarations, as illustrated below.
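      
      A minimal illustration of the point, with made-up struct names: a
      forward declaration is all core.h needs in order to hold a pointer.
      
	struct tipc_node;			/* declared, not defined */

	struct tipc_net_sketch {
		struct tipc_node *own_node;	/* pointer needs no definition */
	};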
      
      In this commit, we remove all dependencies from core.h towards any
      other tipc header file.
      
      As a consequence of this change, we can now move the function
      tipc_own_addr(net) from addr.c to addr.h, and make it inline.
      
      There are no functional changes in this commit.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 10 Feb 2015, 2 commits
  19. 13 Jan 2015, 11 commits
  20. 09 Jan 2015, 1 commit
    • tipc: convert tipc reference table to use generic rhashtable · 07f6c4bc
      Ying Xue authored
      As the tipc reference table is statically allocated, the memory it
      requests at stack initialization is quite big, even though the maximum
      port number is currently restricted to 8191; yet that number has
      already become insufficient in practice. And if the maximum number of
      ports were allowed to reach its theoretical value of 2^32, the
      consumed memory would become ridiculously large. Apart from this,
      heavy tipc users spend a considerable amount of time in tipc_sk_get()
      due to the read-lock on ref_table_lock.
      
      If the tipc reference table is converted to the generic rhashtable,
      both of the above-mentioned disadvantages are resolved: making use of
      the new resizable hash table avoids locking on lookup, and a smaller
      memory size is required at the initial stage; for example, 256 hash
      bucket slots are requested at the beginning instead of allocating the
      entire 8191 slots as in the old mode. The hash table will grow if
      entries exceed 75% of the table size, up to a total table size of 1M,
      and it will automatically shrink if usage falls below 30%, but the
      minimum table size is 256.
      
      Also converts ref_table_lock to a separate mutex to protect hash table
      mutations on the write side. Lastly, defers the release of the socket
      reference using call_rcu() to allow an RCU read-side protected call
      to rhashtable_lookup().
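      
      A hedged sketch of the conversion. The calls follow the current
      rhashtable API (which differs slightly from the one available at the
      time), while the tipc_sock layout and parameter values are
      illustrative, chosen to match the 256-bucket start, 1M ceiling and
      automatic shrinking described above.
      
	#include <linux/rhashtable.h>

	struct tipc_sock {
		u32 portid;			/* lookup key */
		struct rhash_head node;		/* hash table linkage */
		struct rcu_head rcu;		/* deferred release */
	};

	static const struct rhashtable_params tsk_rht_params = {
		.nelem_hint = 192,		/* ~256 buckets at start */
		.head_offset = offsetof(struct tipc_sock, node),
		.key_offset = offsetof(struct tipc_sock, portid),
		.key_len = sizeof(u32),
		.max_size = 1048576,		/* 1M slot ceiling */
		.min_size = 256,
		.automatic_shrinking = true,
	};

	/* lock-free lookup on the read side; a real caller would take a
	 * reference on the socket before leaving the RCU section */
	static struct tipc_sock *tsk_lookup(struct rhashtable *ht, u32 portid)
	{
		struct tipc_sock *tsk;

		rcu_read_lock();
		tsk = rhashtable_lookup_fast(ht, &portid, tsk_rht_params);
		rcu_read_unlock();
		return tsk;
	}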
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Erik Hugne <erik.hugne@ericsson.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 27 Nov 2014, 1 commit
  22. 22 Nov 2014, 1 commit
    • tipc: add bearer disable/enable to new netlink api · 0655f6a8
      Richard Alpe authored
      A new netlink API for tipc that can disable or enable a tipc bearer.
      
      The new API is separated from the old API because of a bug in the
      user space client (tipc-config). The problem is that older versions
      of tipc-config have a very low receive limit, and adding commands to
      the legacy genl_opts struct causes the ctrl_getfamily() response
      message to grow, subsequently breaking the tool.
      
      The new API utilizes netlink policies for input validation, where the
      top-level netlink attributes are TIPC logical entities, like bearer.
      The top-level entities then contain nested attributes; in this case a
      name, nested link properties and a domain. (A policy sketch follows
      the layouts below.)
      
      Netlink commands implemented in this patch:
      TIPC_NL_BEARER_ENABLE
      TIPC_NL_BEARER_DISABLE
      
      Netlink logical layout of bearer enable message:
      -> bearer
          -> name
          [ -> domain ]
          [
          -> properties
              -> priority
          ]
      
      Netlink logical layout of bearer disable message:
      -> bearer
          -> name
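      
      A hedged sketch of a policy matching the bearer layout above; the
      attribute constants are illustrative stand-ins for the ones defined
      in the actual uapi header.
      
	#include <net/netlink.h>

	enum {					/* assumed attribute ids */
		TIPC_NLA_BEARER_UNSPEC,
		TIPC_NLA_BEARER_NAME,		/* string: bearer name */
		TIPC_NLA_BEARER_PROP,		/* nested: link properties */
		TIPC_NLA_BEARER_DOMAIN,		/* u32: discovery domain */
		__TIPC_NLA_BEARER_MAX
	};
	#define TIPC_NLA_BEARER_MAX (__TIPC_NLA_BEARER_MAX - 1)

	static const struct nla_policy bearer_policy[TIPC_NLA_BEARER_MAX + 1] = {
		[TIPC_NLA_BEARER_NAME]   = { .type = NLA_STRING },
		[TIPC_NLA_BEARER_PROP]   = { .type = NLA_NESTED },
		[TIPC_NLA_BEARER_DOMAIN] = { .type = NLA_U32 },
	};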
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 02 Sep 2014, 1 commit
    • tipc: add name distributor resiliency queue · a5325ae5
      Erik Hugne authored
      TIPC name table updates are distributed asynchronously in a cluster,
      entailing a risk of certain race conditions. E.g., if two nodes
      simultaneously issue conflicting (overlapping) publications, this may
      not be detected until both publications have reached a third node, in
      which case one of the publications will be silently dropped on that
      node. Hence, we end up with an inconsistent name table.
      
      In most cases this conflict is just a temporary race, e.g., one
      node is issuing a publication under the assumption that a previous,
      conflicting, publication has already been withdrawn by the other node.
      However, because of the (rtt related) distributed update delay, this
      may not yet hold true on all nodes. The symptom of this failure is a
      syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".
      
      In this commit we add a resiliency queue at the receiving end of
      the name table distributor. When insertion of an arriving publication
      fails, we retain it in this queue for a short amount of time, assuming
      that another update will arrive very soon and clear the conflict. If
      that happens, we insert the publication; otherwise we drop it.
      
      The (configurable) retention value defaults to 2000 ms. Knowing from
      experience that the situation described above is extremely rare, there
      is no risk that the queue will accumulate any large number of items.
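      
      A hedged sketch of the queue mechanics; all names are illustrative.
      
	#include <linux/list.h>
	#include <linux/jiffies.h>

	struct deferred_publ {
		struct list_head next;		/* linkage in the queue */
		unsigned long expires;		/* jiffies deadline */
		/* publication payload would live here */
	};

	/* Park a publication that could not be inserted; it is retried
	 * when later updates arrive and dropped once 'expires' passes.
	 */
	static void defer_publication(struct list_head *q,
				      struct deferred_publ *p,
				      unsigned long retention_ms)
	{
		p->expires = jiffies + msecs_to_jiffies(retention_ms);
		list_add_tail(&p->next, q);
	}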
      Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>