1. 13 Oct, 2017 6 commits
    • tipc: introduce communication groups · 75da2163
      Committed by Jon Maloy
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able to keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as a logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and if it wants to be eligible as a destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be a member of one group at a time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
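The four join parameters described above can be sketched in user space as follows. This is a minimal illustration, not the kernel's definition: `struct group_req`, `DEMO_GROUP_LOOPBACK`, `make_group_req()` and `join_group()` are hypothetical names, and the real request layout and option numbers come from the kernel's TIPC header. The socket level and option name are passed in by the caller so that no numeric values are assumed here.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical mirror of the four parameters named in the commit text.
 * The real request structure is defined by the kernel's TIPC header. */
struct group_req {
	uint32_t type;      /* group identifier */
	uint32_t instance;  /* logical member identifier */
	uint32_t scope;     /* visibility: node/cluster/zone */
	uint32_t flags;     /* member properties */
};

/* Hypothetical flag value: member wants copies of its own
 * broadcasts/multicasts and is eligible for its own anycasts. */
#define DEMO_GROUP_LOOPBACK 0x1u

/* Fill in a join request from the four parameters. */
static struct group_req make_group_req(uint32_t type, uint32_t instance,
				       uint32_t scope, int loopback)
{
	struct group_req req = { type, instance, scope, 0 };

	if (loopback)
		req.flags |= DEMO_GROUP_LOOPBACK;
	return req;
}

/* Issue the join. 'level' and 'optname' are supplied by the caller
 * (the TIPC socket level and TIPC_GROUP_JOIN on a TIPC-enabled system). */
static int join_group(int fd, int level, int optname,
		      const struct group_req *req)
{
	return setsockopt(fd, level, optname, req, sizeof(*req));
}
```

Since a socket can be a member of only one group at a time, a caller would issue one such join per socket and check the return value before sending.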
    • tipc: improve destination linked list · a80ae530
      Committed by Jon Maloy
      We often see a need for a linked list of destination identities,
      sometimes containing a port number, sometimes a node identity, and
      sometimes both. The currently defined struct u32_list is not generic
      enough to cover all cases, so we extend it to contain two u32 integers
      and rename it to struct tipc_dest_list.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
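A destination list carrying two u32 values per entry could look roughly like the sketch below. It is a user-space illustration only: `struct dest` and the `dest_*()` helpers are hypothetical names, and the singly linked layout is an assumption made for brevity, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical list entry holding both identities named in the commit. */
struct dest {
	struct dest *next;
	uint32_t node;  /* node identity, 0 if unused */
	uint32_t port;  /* port number, 0 if unused */
};

/* Return true if the (node, port) pair is already on the list. */
static bool dest_find(const struct dest *head, uint32_t node, uint32_t port)
{
	for (; head; head = head->next)
		if (head->node == node && head->port == port)
			return true;
	return false;
}

/* Push a pair unless it is already present; true if it was added. */
static bool dest_push(struct dest **head, uint32_t node, uint32_t port)
{
	struct dest *d;

	if (dest_find(*head, node, port))
		return false;
	d = malloc(sizeof(*d));
	if (!d)
		return false;
	d->node = node;
	d->port = port;
	d->next = *head;
	*head = d;
	return true;
}

/* Pop the most recently pushed pair; false if the list is empty. */
static bool dest_pop(struct dest **head, uint32_t *node, uint32_t *port)
{
	struct dest *d = *head;

	if (!d)
		return false;
	*node = d->node;
	*port = d->port;
	*head = d->next;
	free(d);
	return true;
}
```

Keeping both fields in every entry lets one list type serve callers that need only a port, only a node, or both.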
    • tipc: add new function for sending multiple small messages · f70d37b7
      Committed by Jon Maloy
      We see an increasing need to send multiple single-buffer messages
      of TIPC_SYSTEM_IMPORTANCE to different individual destination nodes.
      Instead of looping over the send queue and sending each buffer
      individually, as we do now, we add a new helper function
      tipc_node_distr_xmit() to do this.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: refactor function filter_rcv() · 64ac5f59
      Committed by Jon Maloy
      In the following commits we will need to handle multiple incoming and
      rejected/returned buffers in the function socket.c::filter_rcv().
      As a preparation for this, we generalize the function by handling
      buffer queues instead of individual buffers. We also introduce a
      helper function tipc_skb_reject(), and rename filter_rcv() to
      tipc_sk_filter_rcv() in line with other functions in socket.c.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: improve address sanity check in tipc_connect() · 23998835
      Committed by Jon Maloy
      The address given to tipc_connect() is not completely sanity checked,
      under the assumption that this will be done later in the function
      __tipc_sendmsg() when the address is used there.
      
      However, the latter function will in the next commits serve as caller
      to several other send functions, so we want to move the corresponding
      sanity check there to the beginning of that function, before we possibly
      need to grab the address stored by tipc_connect(). We must therefore
      be able to trust that this address already has been thoroughly checked.
      
      We do this in this commit.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add ability to order and receive topology events in driver · 14c04493
      Committed by Jon Maloy
      As preparation for introducing communication groups, we add the ability
      to issue topology subscriptions and receive topology events from kernel
      space. This will make it possible for group member sockets to keep track
      of other group members.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 25 Aug, 2017 1 commit
    • tipc: Fix tipc_sk_reinit handling of -EAGAIN · 6c7e983b
      Committed by Bob Peterson
      In commit 9dbbfb0a, function tipc_sk_reinit() had additional logic
      added to loop in the event that function rhashtable_walk_next()
      returned -EAGAIN. No worries.
      
      However, if rhashtable_walk_start() returns -EAGAIN, it does "continue",
      and therefore skips the call to rhashtable_walk_stop(). That has
      the effect of calling rcu_read_lock() without its paired call to
      rcu_read_unlock(). Since rcu_read_lock() may be nested, the problem
      may not be apparent for a while, especially since resize events may
      be rare. But the comments to rhashtable_walk_start() state:
      
       * ...Note that we take the RCU lock in all
       * cases including when we return an error.  So you must always call
       * rhashtable_walk_stop to clean up.
      
      This patch replaces the continue with a goto and label to ensure a
      matching call to rhashtable_walk_stop().
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
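The control-flow change described above can be illustrated with a small user-space model. This is a sketch under stated assumptions: `struct walker`, `walk_start()`, `walk_next()`, `walk_stop()` and `visit_all()` are hypothetical stand-ins, not the rhashtable API, and a plain counter models the RCU read-lock nesting. The mock start function fails with -EAGAIN on its first call while still taking the "lock", which matches the behaviour quoted from the rhashtable_walk_start() comment.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical walker state; rcu_depth models read-lock nesting. */
struct walker {
	int rcu_depth;    /* +1 per start, -1 per stop */
	int start_calls;  /* number of walk_start() calls so far */
	int pos;          /* current position in the table */
	int nitems;       /* number of items to visit */
};

/* Takes the "lock" in all cases, including when it returns an error. */
static int walk_start(struct walker *w)
{
	w->rcu_depth++;
	w->pos = 0;
	return (w->start_calls++ == 0) ? -EAGAIN : 0;
}

/* Returns a non-zero item id, or 0 when the walk is complete. */
static int walk_next(struct walker *w)
{
	return (w->pos < w->nitems) ? ++w->pos : 0;
}

static void walk_stop(struct walker *w)
{
	w->rcu_depth--;
}

/* Visit every item, retrying the whole walk after -EAGAIN. */
static int visit_all(struct walker *w)
{
	int visited, rc;

	do {
		visited = 0;
		rc = walk_start(w);
		if (rc == -EAGAIN)
			goto stop;  /* a "continue" here would skip walk_stop() */
		while (walk_next(w))
			visited++;
stop:
		walk_stop(w);       /* always paired with walk_start() */
	} while (rc == -EAGAIN);

	return visited;
}
```

With the goto in place, the nesting counter returns to zero after the retry; with a "continue" it would be left at one for every failed start.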
  3. 01 Jul, 2017 1 commit
  4. 12 May, 2017 1 commit
    • tipc: make macro tipc_wait_for_cond() smp safe · 844cf763
      Committed by Jon Paul Maloy
      The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
      to fulfil its task. The latter, in turn, is evaluating the stated
      condition outside the socket lock context. This is problematic if
      the condition is accessing non-trivial data structures which may be
      altered by incoming interrupts, as is the case with the cong_links()
      linked list, used by socket to keep track of the current set of
      congested links. We sometimes see crashes when this list is accessed
      by a condition function at the same time as a SOCK_WAKEUP interrupt
      is removing an element from the list.
      
      We fix this by expanding selected parts of sk_wait_event() into the
      outer macro, while ensuring that all evaluations of a given condition
      are performed under socket lock protection.
      
      Fixes: commit 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 03 May, 2017 2 commits
  6. 29 Apr, 2017 3 commits
  7. 25 Apr, 2017 2 commits
  8. 14 Apr, 2017 1 commit
  9. 30 Mar, 2017 2 commits
  10. 10 Mar, 2017 1 commit
    • net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      Committed by David Howells
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_tcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 02 Mar, 2017 1 commit
  12. 18 Feb, 2017 1 commit
  13. 16 Feb, 2017 1 commit
  14. 14 Feb, 2017 1 commit
  15. 26 Jan, 2017 1 commit
  16. 21 Jan, 2017 2 commits
  17. 04 Jan, 2017 3 commits
    • tipc: reduce risk of user starvation during link congestion · 365ad353
      Committed by Jon Paul Maloy
      The socket code currently handles link congestion by either blocking
      and trying to send again when the congestion has abated, or just
      returning to the user with -EAGAIN and letting the caller retry later.
      
      This mechanism is prone to starvation, because the wakeup algorithm is
      non-atomic. During the time the link issues a wakeup signal, until the
      socket wakes up and re-attempts sending, other senders may have come
      in between and occupied the free buffer space in the link. This in turn
      may lead to a socket having to make many send attempts before it is
      successful. In extremely loaded systems we have observed latency times
      of several seconds before a low-priority socket is able to send out a
      message.
      
      In this commit, we simplify this mechanism and reduce the risk of the
      described scenario happening. When a send is attempted via a
      congested link, we now let it be added to the link's backlog queue
      anyway, thus permitting an oversubscription of one message per source
      socket. We still create a wakeup item and return an error code, hence
      instructing the sender to block or stop sending. Only when enough space
      has been freed up in the link's backlog queue do we issue a wakeup event
      that allows the sender to continue with the next message, if any.
      
      The fact that a socket now can consider a message sent even when the
      link returns a congestion code means that the sending socket code can
      be simplified. Also, since this is a good opportunity to get rid of the
      obsolete 'mtu change' condition in the three socket send functions, we
      now choose to refactor those functions completely.
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
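The oversubscription rule described above can be modelled in a few lines. This is a simplified sketch, not the kernel code: `struct link`, `struct sender`, `link_send()`, `link_drain()` and the `SEND_*` codes are hypothetical, a single sender stands in for the wakeup list, and the wakeup threshold (backlog below its limit) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return codes for the sketch. */
enum { SEND_OK = 0, SEND_CONGESTED = 1, SEND_BLOCKED = 2 };

struct link {
	int backlog_len;    /* messages currently queued */
	int backlog_limit;  /* congestion threshold */
};

struct sender {
	bool waiting;       /* true while waiting for a wakeup event */
};

/* Queue one message. On congestion the message is accepted anyway
 * (one message of oversubscription per sender), but the sender is
 * told to stop until it has been woken up. */
static int link_send(struct link *l, struct sender *s)
{
	int rc = SEND_OK;

	if (s->waiting)
		return SEND_BLOCKED;      /* must wait for the wakeup */
	if (l->backlog_len >= l->backlog_limit) {
		s->waiting = true;        /* models creating a wakeup item */
		rc = SEND_CONGESTED;
	}
	l->backlog_len++;                 /* queued even when congested */
	return rc;
}

/* Release n queued messages; wake the sender once there is room. */
static void link_drain(struct link *l, struct sender *s, int n)
{
	l->backlog_len = (n < l->backlog_len) ? l->backlog_len - n : 0;
	if (l->backlog_len < l->backlog_limit)
		s->waiting = false;       /* models the wakeup event */
}
```

Because the congested message is already queued, the sender can treat it as sent and only needs to wait before the next one, which is what removes the retry race.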
    • tipc: modify struct tipc_plist to be more versatile · 4d8642d8
      Committed by Jon Paul Maloy
      During multicast reception we currently use a simple linked list with
      push/pop semantics to store port numbers.
      
      We now see a need for a more generic list for storing values of type
      u32. We therefore make some modifications to this list, while replacing
      the prefix 'tipc_plist_' with 'u32_'. We also add a couple of new
      functions which will come to use in the next commits.
      Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: unify tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() functions · 8c44e1af
      Committed by Jon Paul Maloy
      The functions tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() are very
      similar. The latter function is also called from two locations, and
      there will be more in the coming commits, which will all need to test on
      different conditions.
      
      Instead of making yet another duplicate of the function, we now
      introduce a new macro tipc_wait_for_cond() where the wakeup condition
      can be stated as an argument to the call. This macro replaces all
      current and future uses of the two functions, which can now be
      eliminated.
      Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 24 Dec, 2016 1 commit
  19. 26 Nov, 2016 1 commit
    • tipc: resolve connection flow control compatibility problem · 6998cc6e
      Committed by Jon Paul Maloy
      In commit 10724cc7 ("tipc: redesign connection-level flow control")
      we replaced the previous message based flow control with one based on
      1k blocks. In order to ensure backwards compatibility the mechanism
      falls back to using messages as the base unit when it senses that the peer
      doesn't support the new algorithm. The default flow control window,
      i.e., how many units can be sent before the sender blocks and waits
      for an acknowledge (aka advertisement) is 512. This was tested against
      the previous version, which uses an acknowledge frequency of one ack per
      256 received messages, and found to work fine.
      
      However, we missed the fact that versions older than Linux 3.15 use an
      acknowledge frequency of 512, which is exactly the limit where a 4.6+
      sender will stop and wait for acknowledge. This would also work fine if
      it weren't for the fact that if the first sent message on a 4.6+ server
      side is an empty SYNACK, this one is also counted as a sent message,
      while it is not counted as a received message on a legacy 3.15-receiver.
      This leads to the sender always being one step ahead of the receiver, a
      scenario causing the sender to block after 512 sent messages, while the
      receiver only has registered 511 read messages. Hence, the legacy
      receiver is not triggered to send an acknowledge, with a permanently
      blocked sender as a result.
      
      We solve this deadlock by simply allowing the sender to send one more
      message before it blocks, i.e., by making a minimal change to the
      condition used for determining connection congestion.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
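The off-by-one deadlock described above can be reproduced with a small counting model. This is a sketch under stated assumptions: `simulate()`, `WINDOW` and `LEGACY_ACK_EVERY` are hypothetical names, the sender starts with one unacknowledged unit for the empty SYNACK, and the two congestion tests (`>=` for the old behaviour, `>` for the relaxed one) are an interpretation of "allowing the sender to send one more message", not a quote of the kernel condition.

```c
#include <assert.h>
#include <stdbool.h>

#define WINDOW            512  /* sender's flow control window */
#define LEGACY_ACK_EVERY  512  /* legacy receiver acks every 512 messages */

/* Return how many data messages get through before the sender blocks
 * for good, or 'want' if the sender never deadlocks. */
static int simulate(bool allow_one_extra, int want)
{
	int snt_unacked = 1;  /* empty SYNACK is counted by the sender... */
	int rcv_unacked = 0;  /* ...but not by the legacy receiver */
	int delivered = 0;

	while (delivered < want) {
		bool congested = allow_one_extra ? snt_unacked > WINDOW
						 : snt_unacked >= WINDOW;
		if (congested)
			break;  /* no ack is pending, so the block is permanent */

		snt_unacked++;
		rcv_unacked++;
		delivered++;

		if (rcv_unacked == LEGACY_ACK_EVERY) {
			snt_unacked -= rcv_unacked;  /* ack releases the window */
			rcv_unacked = 0;
		}
	}
	return delivered;
}
```

With the strict test the sender stops after 511 data messages, one short of the receiver's acknowledge threshold; with one extra message allowed, the 512th message triggers the acknowledge and the exchange continues.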
  20. 20 Nov, 2016 1 commit
  21. 15 Nov, 2016 1 commit
  22. 01 Nov, 2016 6 commits