1. 04 6月, 2019 1 次提交
  2. 26 5月, 2019 2 次提交
  3. 28 3月, 2018 1 次提交
  4. 24 3月, 2018 2 次提交
    • J
      tipc: handle collisions of 32-bit node address hash values · 25b0b9c4
      Jon Maloy 提交于
      When a 32-bit node address is generated from a 128-bit identifier,
      there is a risk of collisions which must be discovered and handled.
      
      We do this as follows:
      - We don't apply the generated address immediately to the node, but do
        instead initiate a 1 sec trial period to allow other cluster members
        to discover and handle such collisions.
      
      - During the trial period the node periodically sends out a new type
        of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
        to all the other nodes in the cluster.
      
      - When a node is receiving such a message, it must check that the
        presented 32-bit identifier either is unused, or was used by the very
        same peer in a previous session. In both cases it accepts the request
        by not responding to it.
      
      - If it finds that the same node has been up before using a different
        address, it responds with a DSC_TRIAL_FAIL_MSG containing that
        address.
      
      - If it finds that the address has already been taken by some other
        node, it generates a new, unused address and returns it to the
        requester.
      
      - During the trial period the requesting node must always be prepared
        to accept a failure message, i.e., a message where a peer suggests a
        different (or equal)  address to the one tried. In those cases it
        must apply the suggested value as trial address and restart the trial
        period.
      
      This algorithm ensures that in the vast majority of cases a node will
      have the same address before and after a reboot. If a legacy user
      configures the address explicitly, there will be no trial period and
      messages, so this protocol addition is completely backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25b0b9c4
    • J
      tipc: add 128-bit node identifier · d50ccc2d
      Jon Maloy 提交于
      We add a 128-bit node identity, as an alternative to the currently used
      32-bit node address.
      
      For the sake of compatibility and to minimize message header changes
      we retain the existing 32-bit address field. When not set explicitly by
      the user, this field will be filled with a hash value generated from the
      much longer node identity, and be used as a shorthand value for the
      latter.
      
      We permit either the address or the identity to be set by configuration,
      but not both, so when the address value is set by a legacy user the
      corresponding 128-bit node identity is generated based on the that value.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d50ccc2d
  5. 13 3月, 2018 1 次提交
  6. 18 11月, 2016 1 次提交
    • A
      netns: make struct pernet_operations::id unsigned int · c7d03a00
      Alexey Dobriyan 提交于
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into an zero based array and
      thus is unsigned entity. Using negative value is out-of-bound
      access by definition.
      
      2)
      On x86_64 unsigned 32-bit data which are mixed with pointers
      via array indexing or offsets added or subtracted to pointers
      are preffered to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is 3 byte instruction which isn't necessary if the variable is
      unsigned because x86_64 is zero extending by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a semmingly random artefact of code generation with register
      allocator being used differently. gcc decides that some variable
      needs to live in new r8+ registers and every access now requires REX
      prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
      used which is longer than [r8]
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d03a00
  7. 16 6月, 2016 1 次提交
    • J
      tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy 提交于
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the member of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying  unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35c55c98
  8. 04 5月, 2016 1 次提交
    • J
      tipc: redesign connection-level flow control · 10724cc7
      Jon Paul Maloy 提交于
      There are two flow control mechanisms in TIPC; one at link level that
      handles network congestion, burst control, and retransmission, and one
      at connection level which' only remaining task is to prevent overflow
      in the receiving socket buffer. In TIPC, the latter task has to be
      solved end-to-end because messages can not be thrown away once they
      have been accepted and delivered upwards from the link layer, i.e, we
      can never permit the receive buffer to overflow.
      
      Currently, this algorithm is message based. A counter in the receiving
      socket keeps track of number of consumed messages, and sends a dedicated
      acknowledge message back to the sender for each 256 consumed message.
      A counter at the sending end keeps track of the sent, not yet
      acknowledged messages, and blocks the sender if this number ever reaches
      512 unacknowledged messages. When the missing acknowledge arrives, the
      socket is then woken up for renewed transmission. This works well for
      keeping the message flow running, as it almost never happens that a
      sender socket is blocked this way.
      
      A problem with the current mechanism is that it potentially is very
      memory consuming. Since we don't distinguish between small and large
      messages, we have to dimension the socket receive buffer according
      to a worst-case of both. I.e., the window size must be chosen large
      enough to sustain a reasonable throughput even for the smallest
      messages, while we must still consider a scenario where all messages
      are of maximum size. Hence, the current fix window size of 512 messages
      and a maximum message size of 66k results in a receive buffer of 66 MB
      when truesize(66k) = 131k is taken into account. It is possible to do
      much better.
      
      This commit introduces an algorithm where we instead use 1024-byte
      blocks as base unit. This unit, always rounded upwards from the
      actual message size, is used when we advertise windows as well as when
      we count and acknowledge transmitted data. The advertised window is
      based on the configured receive buffer size in such a way that even
      the worst-case truesize/msgsize ratio always is covered. Since the
      smallest possible message size (from a flow control viewpoint) now is
      1024 bytes, we can safely assume this ratio to be less than four, which
      is the value we are now using.
      
      This way, we have been able to reduce the default receive buffer size
      from 66 MB to 2 MB with maintained performance.
      
      In order to keep this solution backwards compatible, we introduce a
      new capability bit in the discovery protocol, and use this throughout
      the message sending/reception path to always select the right unit.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10724cc7
  9. 12 4月, 2016 1 次提交
  10. 24 10月, 2015 1 次提交
    • J
      tipc: create broadcast transmission link at namespace init · 5fd9fd63
      Jon Paul Maloy 提交于
      The broadcast transmission link is currently instantiated when the
      network subsystem is started, i.e., on order from user space via netlink.
      
      This forces the broadcast transmission code to do unnecessary tests for
      the existence of the transmission link, as well in single mode node as
      in network mode.
      
      In this commit, we do instead create the link during initialization of
      the name space, and remove it when it is stopped. The fact that the
      transmission link now has a guaranteed longer life cycle than any of its
      potential clients paves the way for further code simplifcations
      and optimizations.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5fd9fd63
  11. 05 5月, 2015 1 次提交
    • Y
      tipc: rename functions defined in subscr.c · 57f1d186
      Ying Xue 提交于
      When a topology server accepts a connection request from its client,
      it allocates a connection instance and a tipc_subscriber structure
      object. The former is used to communicate with client, and the latter
      is often treated as a subscriber which manages all subscription events
      requested from a same client. When a topology server receives a request
      of subscribing name services from a client through the connection, it
      creates a tipc_subscription structure instance which is seen as a
      subscription recording what name services are subscribed. In order to
      manage all subscriptions from a same client, topology server links
      them into the subscrp_list of the subscriber. So subscriber and
      subscription completely represents different meanings respectively,
      but function names associated with them make us so confused that we
      are unable to easily tell which function is against subscriber and
      which is to subscription. So we want to eliminate the confusion by
      renaming them.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Reviewed-by: NJon Maloy <jon.maloy@ericson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57f1d186
  12. 01 4月, 2015 1 次提交
    • Y
      tipc: fix a slab object leak · 7e436905
      Ying Xue 提交于
      When remove TIPC module, there is a warning to remind us that a slab
      object is leaked like:
      
      root@localhost:~# rmmod tipc
      [   19.056226] =============================================================================
      [   19.057549] BUG TIPC (Not tainted): Objects remaining in TIPC on kmem_cache_close()
      [   19.058736] -----------------------------------------------------------------------------
      [   19.058736]
      [   19.060287] INFO: Slab 0xffffea0000519a00 objects=23 used=1 fp=0xffff880014668b00 flags=0x100000000004080
      [   19.061915] INFO: Object 0xffff880014668000 @offset=0
      [   19.062717] kmem_cache_destroy TIPC: Slab cache still has objects
      
      This is because the listening socket of TIPC topology server is not
      closed before TIPC proto handler is unregistered with proto_unregister().
      However, as the socket is closed in tipc_exit_net() which is called by
      unregister_pernet_subsys() during unregistering TIPC namespace operation,
      the warning can be eliminated if calling unregister_pernet_subsys() is
      moved before calling proto_unregister().
      
      Fixes: e05b31f4 ("tipc: make tipc socket support net namespace")
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e436905
  13. 10 2月, 2015 2 次提交
  14. 13 1月, 2015 9 次提交
  15. 09 1月, 2015 1 次提交
    • Y
      tipc: convert tipc reference table to use generic rhashtable · 07f6c4bc
      Ying Xue 提交于
      As tipc reference table is statically allocated, its memory size
      requested on stack initialization stage is quite big even if the
      maximum port number is just restricted to 8191 currently, however,
      the number already becomes insufficient in practice. But if the
      maximum ports is allowed to its theory value - 2^32, its consumed
      memory size will reach a ridiculously unacceptable value. Apart from
      this, heavy tipc users spend a considerable amount of time in
      tipc_sk_get() due to the read-lock on ref_table_lock.
      
      If tipc reference table is converted with generic rhashtable, above
      mentioned both disadvantages would be resolved respectively: making
      use of the new resizable hash table can avoid locking on the lookup;
      smaller memory size is required at initial stage, for example, 256
      hash bucket slots are requested at the beginning phase instead of
      allocating the entire 8191 slots in old mode. The hash table will
      grow if entries exceeds 75% of table size up to a total table size
      of 1M, and it will automatically shrink if usage falls below 30%,
      but the minimum table size is allowed down to 256.
      
      Also converts ref_table_lock to a separate mutex to protect hash table
      mutations on write side. Lastly defers the release of the socket
      reference using call_rcu() to allow using an RCU read-side protected
      call to rhashtable_lookup().
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NErik Hugne <erik.hugne@ericsson.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07f6c4bc
  16. 24 8月, 2014 2 次提交
  17. 15 5月, 2014 1 次提交
    • J
      tipc: decrease connection flow control window · 6163a194
      Jon Paul Maloy 提交于
      Memory overhead when allocating big buffers for data transfer may
      be quite significant. E.g., truesize of a 64 KB buffer turns out
      to be 132 KB, 2 x the requested size.
      
      This invalidates the "worst case" calculation we have been
      using to determine the default socket receive buffer limit,
      which is based on the assumption that 1024x64KB = 67MB buffers
      may be queued up on a socket.
      
      Since TIPC connections cannot survive hitting the buffer limit,
      we have to compensate for this overhead.
      
      We do that in this commit by dividing the fix connection flow
      control window from 1024 (2*512) messages to 512 (2*256). Since
      older version nodes send out acks at 512 message intervals,
      compatibility with such nodes is guaranteed, although performance
      may be non-optimal in such cases.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6163a194
  18. 06 5月, 2014 1 次提交
  19. 28 3月, 2014 1 次提交
  20. 22 2月, 2014 2 次提交
    • Y
      tipc: make bearer set up in module insertion stage · 970122fd
      Ying Xue 提交于
      Accidentally a side effect is involved by commit 6e967adf(tipc:
      relocate common functions from media to bearer). Now tipc stack
      handler of receiving packets from netdevices as well as netdevice
      notification handler are registered when bearer is enabled rather
      than tipc module initialization stage, but the two handlers are
      both unregistered in tipc module exit phase. If tipc module is
      inserted and then immediately removed, the following warning
      message will appear:
      
      "dev_remove_pack: ffffffffa0380940 not found"
      
      This is because in module insertion stage tipc stack packet handler
      is not registered at all, but in module exit phase dev_remove_pack()
      needs to remove it. Of course, dev_remove_pack() cannot find tipc
      protocol handler from the kernel protocol handler list so that the
      warning message is printed out.
      
      But if registering the two handlers is adjusted from enabling bearer
      phase into inserting module stage, the warning message will be
      eliminated. Due to this change, tipc_core_start_net() and
      tipc_core_stop_net() can be deleted as well.
      Reported-by: NWang Weidong <wangweidong1@huawei.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Reviewed-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      970122fd
    • Y
      tipc: remove all enabled flags from all tipc components · 9fe7ed47
      Ying Xue 提交于
      When tipc module is inserted, many tipc components are initialized
      one by one. During the initialization period, if one of them is
      failed, tipc_core_stop() will be called to stop all components
      whatever corresponding components are created or not. To avoid to
      release uncreated ones, relevant components have to add necessary
      enabled flags indicating whether they are created or not.
      
      But in the initialization stage, if one component is unsuccessfully
      created, we will just destroy successfully created components before
      the failed component instead of all components. All enabled flags
      defined in components, in turn, become redundant. Additionally it's
      also unnecessary to identify whether table.types is NULL in
      tipc_nametbl_stop() because name stable has been definitely created
      successfully when tipc_nametbl_stop() is called.
      
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Reviewed-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9fe7ed47
  21. 14 2月, 2014 1 次提交
    • Y
      tipc: remove 'links' list from tipc_bearer struct · c61dd61d
      Ying Xue 提交于
      In our ongoing effort to simplify the TIPC locking structure,
      we see a need to remove the linked list for tipc_links
      in the bearer. This can be explained as follows.
      
      Currently, we have three different ways to access a link,
      via three different lists/tables:
      
      1: Via a node hash table:
         Used by the time-critical outgoing/incoming data paths.
         (e.g. link_send_sections_fast() and tipc_recv_msg() ):
      
      grab net_lock(read)
         find node from node hash table
         grab node_lock
             select link
             grab bearer_lock
                send_msg()
             release bearer_lock
         release node lock
      release net_lock
      
      2: Via a global linked list for nodes:
         Used by configuration commands (link_cmd_set_value())
      
      grab net_lock(read)
         find node and link from global node list (using link name)
         grab node_lock
             update link
         release node lock
      release net_lock
      
      (Same locking order as above. No problem.)
      
      3: Via the bearer's linked link list:
         Used by notifications from interface (e.g. tipc_disable_bearer() )
      
      grab net_lock(write)
         grab bearer_lock
            get link ptr from bearer's link list
            get node from link
            grab node_lock
               delete link
            release node lock
         release bearer_lock
      release net_lock
      
      (Different order from above, but works because we grab the
      outer net_lock in write mode first, excluding all other access.)
      
      The first major goal in our simplification effort is to get rid
      of the "big" net_lock, replacing it with rcu-locks when accessing
      the node list and node hash array. This will come in a later patch
      series.
      
      But to get there we first need to rewrite access methods ##2 and 3,
      since removal of net_lock would introduce three major problems:
      
      a) In access method #2, we access the link before taking the
         protecting node_lock. This will not work once net_lock is gone,
         so we will have to change the access order. We will deal with
         this in a later commit in this series, "tipc: add node lock
         protection to link found by link_find_link()".
      
      b) When the outer protection from net_lock is gone, taking
         bearer_lock and node_lock in opposite order of method 1) and 2)
         will become an obvious deadlock hazard. This is fixed in the
         commit ("tipc: remove bearer_lock from tipc_bearer struct")
         later in this series.
      
      c) Similar to what is described in problem a), access method #3
         starts with using a link pointer that is unprotected by node_lock,
         in order to via that pointer find the correct node struct and
         lock it. Before we remove net_lock, this access order must be
         altered. This is what we do with this commit.
      
      We can avoid introducing problem problem c) by even here using the
      global node list to find the node, before accessing its links. When
      we loop though the node list we use the own bearer identity as search
      criteria, thus easily finding the links that are associated to the
      resetting/disabling bearer. It should be noted that although this
      method is somewhat slower than the current list traversal, it is in
      no way time critical. This is only about resetting or deleting links,
      something that must be considered relatively infrequent events.
      
      As a bonus, we can get rid of the mutual pointers between links and
      bearers. After this commit, pointer dependency go in one direction
      only: from the link to the bearer.
      
      This commit pre-empts introduction of problem c) as described above.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Reviewed-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c61dd61d
  22. 11 12月, 2013 2 次提交
    • Y
      tipc: relocate common functions from media to bearer · 6e967adf
      Ying Xue 提交于
      Currently, registering a TIPC stack handler in the network device layer
      is done twice, once for Ethernet (eth_media) and Infiniband (ib_media)
      repectively. But, as this registration is not media specific, we can
      avoid some code duplication by moving the registering function to
      the generic bearer layer, to the file bearer.c, and call it only once.
      The same is true for the network device event notifier.
      
      As a side effect, the two workqueues we are using for for setting up/
      cleaning up media can now be eliminated. Furthermore, the array for
      storing the specific media type structs, media_array[], can be entirely
      deleted.
      
      Note that the eth_started and ib_started flags were removed during the
      code relocation.  There is now only one call to bearer_setup and
      bearer_cleanup, and these can logically not race against each other.
      
      Despite its size, this cleanup work incurs no functional changes in TIPC.
      In particular, it should be noted that the sequence ordering of received
      packets is unaffected by this change, since packet reception never was
      subject to any work queue handling in the first place.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e967adf
    • J
      tipc: correct the order of stopping services at rmmod · 993b858e
      Jon Paul Maloy 提交于
      The 'signal handler' service in TIPC is a mechanism that makes it
      possible to postpone execution of functions, by launcing them into
      a job queue for execution in a separate tasklet, independent of
      the launching execution thread.
      
      When we do rmmod on the tipc module, this service is stopped after
      the network service. At the same time, the stopping of the network
      service may itself launch jobs for execution, with the risk that these
      functions may be scheduled for execution after the data structures
      meant to be accessed by the job have already been deleted. We have
      seen this happen, most often resulting in an oops.
      
      This commit ensures that the signal handler is the very first to be
      stopped when TIPC is shut down, so there are no surprises during
      the cleanup of the other services.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      993b858e
  23. 18 6月, 2013 3 次提交
    • Y
      tipc: convert configuration server to use new server facility · 7d0ab17b
      Ying Xue 提交于
      As the new socket-based TIPC server infrastructure has been
      introduced, we can now convert the configuration server to use
      it.  Then we can take future steps to simplify the configuration
      server locking policy.
      
      Some minor reordering of initialization is done, due to the
      dependency on having tipc_socket_init completed.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d0ab17b
    • Y
      tipc: convert topology server to use new server facility · 13a2e898
      Ying Xue 提交于
      As the new TIPC server infrastructure has been introduced, we can
      now convert the TIPC topology server to it.  We get two benefits
      from doing this:
      
      1) It simplifies the topology server locking policy.  In the
      original locking policy, we placed one spin lock pointer in the
      tipc_subscriber structure to reuse the lock of the subscriber's
      server port, controlling access to members of tipc_subscriber
      instance.  That is, we only used one lock to ensure both
      tipc_port and tipc_subscriber members were safely accessed.
      
      Now we introduce another spin lock for tipc_subscriber structure
      only protecting themselves, to get a finer granularity locking
      policy.  Moreover, the change will allow us to make the topology
      server code more readable and maintainable.
      
      2) It fixes a bug where sent subscription events may be lost when
      the topology port is congested.  Using the new service, the
      topology server now queues sent events into an outgoing buffer,
      and then wakes up a sender process which has been blocked in
      workqueue context.  The process will keep picking events from the
      buffer and send them to their respective subscribers, using the
      kernel socket interface, until the buffer is empty. Even if the
      socket is congested during transmission there is no risk that
      events may be dropped, since the sender process may block when
      needed.
      
      Some minor reordering of initialization is done, since we now
      have a scenario where the topology server must be started after
      socket initialization has taken place, as the former depends
      on the latter.  And overall, we see a simplification of the
      TIPC subscriber code in making this changeover.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13a2e898
    • Y
      tipc: change socket buffer overflow control to respect sk_rcvbuf · cc79dd1b
      Ying Xue 提交于
      As per feedback from the netdev community, we change the buffer
      overflow protection algorithm in receiving sockets so that it
      always respects the nominal upper limit set in sk_rcvbuf.
      
      Instead of scaling up from a small sk_rcvbuf value, which leads to
      violation of the configured sk_rcvbuf limit, we now calculate the
      weighted per-message limit by scaling down from a much bigger value,
      still in the same field, according to the importance priority of the
      received message.
      
      To allow for administrative tunability of the socket receive buffer
      size, we create a tipc_rmem sysctl variable to allow the user to
      configure an even bigger value via sysctl command.  It is a size of
      three (min/default/max) to be consistent with things like tcp_rmem.
      
      By default, the value initialized in tipc_rmem[1] is equal to the
      receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
      This value is also set as the default value of sk_rcvbuf.
      Originally-by: NJon Maloy <jon.maloy@ericsson.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      [Ying: added sysctl variation to Jon's original patch]
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      [PG: don't compile sysctl.c if not config'd; add Documentation]
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc79dd1b
  24. 18 4月, 2013 1 次提交
    • P
      tipc: add InfiniBand media type · a29a194a
      Patrick McHardy 提交于
      Add InfiniBand media type based on the ethernet media type.
      
      The only real difference is that in case of InfiniBand, we need the entire
      20 bytes of space reserved for media addresses, so the TIPC media type ID is
      not explicitly stored in the packet payload.
      
      Sample output of tipc-config:
      
      # tipc-config -v -addr -netid -nt=all -p -m -b -n -ls
      
      node address: <10.1.4>
      current network id: 4711
      Type       Lower      Upper      Port Identity              Publication Scope
      0          167776257  167776257  <10.1.1:1855512577>        1855512578  cluster
                 167776260  167776260  <10.1.4:1216454657>        1216454658  zone
      1          1          1          <10.1.4:1216479235>        1216479236  node
      Ports:
      1216479235: bound to {1,1}
      1216454657: bound to {0,167776260}
      Media:
      eth
      ib
      Bearers:
      ib:ib0
      Nodes known:
      <10.1.1>: up
      Link <broadcast-link>
        Window:20 packets
        RX packets:0 fragments:0/0 bundles:0/0
        TX packets:0 fragments:0/0 bundles:0/0
        RX naks:0 defs:0 dups:0
        TX naks:0 acks:0 dups:0
        Congestion bearer:0 link:0  Send queue max:0 avg:0
      
      Link <10.1.4:ib0-10.1.1:ib0>
        ACTIVE  MTU:2044  Priority:10  Tolerance:1500 ms  Window:50 packets
        RX packets:80 fragments:0/0 bundles:0/0
        TX packets:40 fragments:0/0 bundles:0/0
        TX profile sample:22 packets  average:54 octets
        0-64:100% -256:0% -1024:0% -4096:0% -16384:0% -32768:0% -66000:0%
        RX states:410 probes:213 naks:0 defs:0 dups:0
        TX states:410 probes:197 naks:0 acks:0 dups:0
        Congestion bearer:0 link:0  Send queue max:1 avg:0
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a29a194a