1. 10 Jan, 2018 1 commit
    • tipc: create group member event messages when they are needed · 7ad32bcb
      Committed by Jon Maloy
      In the current implementation, a group socket receiving topology
      events about other members just converts the topology event message
      into a group event message and stores it until it reaches the right
      state to issue it to the user. This complicates the code unnecessarily,
      and becomes impractical when, in the coming commits, we will need to
      create and issue membership events independently.
      
      In this commit, we change this so that we just notice the type and
      origin of the incoming topology event, and then drop the buffer. Only
      when it is time to actually send a group event to the user do we
      explicitly create a new message and send it upwards.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 29 Dec, 2017 1 commit
  3. 14 Dec, 2017 1 commit
  4. 11 Dec, 2017 1 commit
    • rhashtable: Change rhashtable_walk_start to return void · 97a6ec4a
      Committed by Tom Herbert
      Most callers of rhashtable_walk_start don't care about a resize event
      which is indicated by a return value of -EAGAIN. So calls to
      rhashtable_walk_start are wrapped with code to ignore -EAGAIN. Something
      like this is common:
      
             ret = rhashtable_walk_start(rhiter);
             if (ret && ret != -EAGAIN)
                     goto out;
      
      Since zero and -EAGAIN are the only possible return values from the
      function this check is pointless. The condition never evaluates to true.
      
      This patch changes rhashtable_walk_start to return void. This simplifies
      code for the callers that ignore -EAGAIN. For the few cases where the
      caller cares about the resize event, particularly where the table can be
      walked in multiple parts for netlink or seq file dump, the function
      rhashtable_walk_start_check has been added that returns -EAGAIN on a
      resize event.
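
      As a rough illustration of why the old check was dead code, the sketch
      below models a walk-start call that can only return 0 or -EAGAIN. The
      names fake_walk_start(), old_check_fails() and check_is_dead_code() are
      hypothetical stand-ins for this example, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the old return contract: 0, or -EAGAIN on resize. */
int fake_walk_start(bool resized)
{
	return resized ? -EAGAIN : 0;
}

/* The caller-side test quoted above, factored out for inspection. */
bool old_check_fails(int ret)
{
	return ret && ret != -EAGAIN;
}

/* Exercise both possible outcomes; neither one takes the error path. */
void check_is_dead_code(void)
{
	assert(!old_check_fails(fake_walk_start(false)));
	assert(!old_check_fails(fake_walk_start(true)));
}
```

      The check would only fire for some third error code, which the old
      function never produced.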
      Signed-off-by: Tom Herbert <tom@quantonium.net>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 01 Nov, 2017 1 commit
  6. 26 Oct, 2017 1 commit
  7. 22 Oct, 2017 1 commit
  8. 21 Oct, 2017 1 commit
  9. 13 Oct, 2017 15 commits
    • tipc: guarantee delivery of UP event before first broadcast · 399574d4
      Committed by Jon Maloy
      The following scenario is possible:
      - A user joins a group, and immediately sends out a broadcast message
        to its members.
      - The broadcast message, following a different data path than the
        initial JOIN message sent out during the joining procedure, arrives
        at a receiver before the latter.
      - The receiver drops the message, since it is not ready to accept any
        messages until the JOIN has arrived.
      
      We avoid this by treating group protocol JOIN messages like unicast
      messages.
      - We let them pass through the recipient's multicast input queue, just
        like ordinary unicasts.
      - We force the first following broadcast to be sent as replicated
        unicast and to be acknowledged by the recipient before accepting
        any more broadcast transmissions.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: guarantee that group broadcast doesn't bypass group unicast · 2f487712
      Committed by Jon Maloy
      We need a mechanism guaranteeing that group unicasts sent out from a
      socket are not bypassed by later sent broadcasts from the same socket.
      We do this as follows:
      
      - Each time a unicast is sent, we set the broadcast method for the
        socket to "replicast" and "mandatory". This forces the first
        subsequent broadcast message to follow the same network and data path
        as the preceding unicast to a destination, hence preventing it from
        overtaking the latter.
      
      - In order to make the 'same data path' statement above true, we let
        group unicasts pass through the multicast link input queue, instead
        of as previously through the unicast link input queue.
      
      - In the first broadcast following a unicast, we set a new header flag,
        requiring all recipients to immediately acknowledge its reception.
      
      - During the period before all the expected acknowledgements are
        received, the socket refuses to accept any more broadcast attempts,
        i.e., by blocking or returning EAGAIN. This period should typically
        not be longer than a few microseconds.
      
      - When all acknowledgements have been received, the sending socket will
        open up for subsequent broadcasts, this time giving the link layer
        freedom to select the best transmission method itself.
      
      - The forced and/or abrupt transmission method changes described above
        may lead to broadcasts arriving out of order to the recipients. We
        remedy this by introducing code that checks and if necessary
        re-orders such messages at the receiving end.
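
      The sender-side gating described in the list above can be sketched as a
      small state machine. This is only a model under stated assumptions:
      struct grp_sender, its fields and the three functions are hypothetical
      names chosen for the example and do not mirror the actual socket code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-socket send state; field names are illustrative only. */
struct grp_sender {
	bool force_replicast;      /* set by a unicast: next broadcast follows its path */
	unsigned int acks_pending; /* acks still expected for the forced broadcast */
	unsigned int members;      /* number of remote group members */
};

/* A unicast makes the next broadcast "replicast" and "mandatory". */
void send_unicast(struct grp_sender *s)
{
	s->force_replicast = true;
}

/* Returns 0 if the broadcast is accepted, -EAGAIN while acks are outstanding. */
int send_broadcast(struct grp_sender *s)
{
	if (s->acks_pending)
		return -EAGAIN;
	if (s->force_replicast) {
		/* Sent as replicated unicast with the ack-required flag set. */
		s->acks_pending = s->members;
		s->force_replicast = false;
	}
	return 0;
}

/* One recipient acknowledged the forced broadcast. */
void recv_ack(struct grp_sender *s)
{
	assert(s->acks_pending > 0);
	s->acks_pending--;
}
```

      In this model, a broadcast following a unicast is accepted once, further
      broadcasts get -EAGAIN until every member has acknowledged, and after
      that the sender is free to broadcast again.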
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: guarantee group unicast doesn't bypass group broadcast · b87a5ea3
      Committed by Jon Maloy
      Group unicast messages don't follow the same path as broadcast messages,
      and there is a high risk that unicasts sent from a socket might bypass
      previously sent broadcasts from the same socket.
      
      We fix this by letting all unicast messages carry the sequence number of
      the next sent broadcast from the same node, but without updating this
      number at the receiver. This way, a receiver can check and if necessary
      re-order such messages before they are added to the socket receive buffer.
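
      A minimal sketch of the receiver-side check, assuming 16-bit sequence
      numbers and a per-sender "next expected broadcast" counter. The names
      member_rx, bc_rcv_nxt, seq_after() and the other functions are
      hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe test: is 16-bit sequence number a later than b? */
bool seq_after(uint16_t a, uint16_t b)
{
	uint16_t diff = (uint16_t)(a - b);

	return diff != 0 && diff < 0x8000;
}

/* Hypothetical receiver-side state kept per sending member. */
struct member_rx {
	uint16_t bc_rcv_nxt; /* next broadcast seqno expected from the sender */
};

/* An in-order broadcast is delivered and advances the counter. */
bool rx_broadcast(struct member_rx *m, uint16_t seq)
{
	if (seq != m->bc_rcv_nxt)
		return false; /* out of order; a real receiver would queue it */
	m->bc_rcv_nxt++;
	return true;
}

/*
 * A unicast carries the sender's next broadcast seqno. It may be delivered
 * only once all earlier broadcasts have arrived, and it never advances
 * bc_rcv_nxt itself.
 */
bool unicast_deliverable(const struct member_rx *m, uint16_t bc_snd_nxt)
{
	return !seq_after(bc_snd_nxt, m->bc_rcv_nxt);
}

void reorder_demo(void)
{
	struct member_rx m = { .bc_rcv_nxt = 5 };

	/* The sender broadcast #5, then sent a unicast tagged with 6. */
	assert(!unicast_deliverable(&m, 6)); /* unicast overtook the broadcast */
	assert(rx_broadcast(&m, 5));
	assert(unicast_deliverable(&m, 6));
	assert(m.bc_rcv_nxt == 6);
}
```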
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce group multicast messaging · 5b8dddb6
      Committed by Jon Maloy
      The previously introduced message transport to all group members is
      based on the tipc multicast service, but is logically a broadcast
      service within the group, and that is what we call it.
      
      We now add functionality for sending messages to all group members
      having a certain identity. Correspondingly, we call this feature 'group
      multicast'. The service uses unicast when only one destination is
      found, otherwise it will use the bearer broadcast service to transfer
      the messages. In the latter case, the receiving members filter arriving
      messages by looking at the intended destination instance. If there is
      no match, the message will be dropped, while still being considered
      received and read as seen by the flow control mechanism.
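
      The receive-side filtering can be pictured roughly as follows. This is
      a sketch under assumptions: struct grp_member, its counters and
      rx_group_mcast() are hypothetical names, and flow control is reduced to
      a single "consumed" counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receiving member; counters stand in for real bookkeeping. */
struct grp_member {
	uint32_t instance;      /* this member's identity within the group */
	unsigned int consumed;  /* messages reported to flow control as read */
	unsigned int delivered; /* messages actually handed to the user */
};

/*
 * Every arriving multicast counts as received and read for flow control,
 * but only a message whose destination instance matches is delivered.
 */
bool rx_group_mcast(struct grp_member *m, uint32_t dst_instance)
{
	m->consumed++;
	if (dst_instance != m->instance)
		return false; /* dropped, yet still accounted for */
	m->delivered++;
	return true;
}

void mcast_filter_demo(void)
{
	struct grp_member m = { .instance = 7 };

	assert(rx_group_mcast(&m, 7));
	assert(!rx_group_mcast(&m, 8));
	assert(m.consumed == 2 && m.delivered == 1);
}
```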
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce group anycast messaging · ee106d7f
      Committed by Jon Maloy
      In this commit, we make it possible to send connectionless unicast
      messages to any member corresponding to the given member identity,
      when there is more than one such member. The sender must use a
      TIPC_ADDR_NAME address to achieve this effect.
      
      We also perform load balancing between the destinations, i.e., we
      primarily select one which has advertised sufficient send window
      to not cause a block/EAGAIN delay, if any. This mechanism is
      overlaid on the always present round-robin selection.
      
      Anycast messages are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
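
      One way to picture window-aware selection layered on round-robin is the
      sketch below. It is an illustration only: struct dest, the cursor
      argument and anycast_select() are hypothetical, and the real selection
      logic may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical destination entry with its advertised send window. */
struct dest {
	unsigned int window;
};

/*
 * Walk the destinations in round-robin order, starting after *cursor, and
 * prefer the first one whose window covers `need`. If none qualifies, fall
 * back to the plain round-robin choice. Updates *cursor to the pick.
 */
size_t anycast_select(const struct dest *d, size_t n, size_t *cursor,
		      unsigned int need)
{
	size_t first, pick;

	assert(n > 0);
	first = (*cursor + 1) % n;
	pick = first;

	for (size_t i = 0; i < n; i++) {
		size_t idx = (first + i) % n;

		if (d[idx].window >= need) {
			pick = idx;
			break;
		}
	}
	*cursor = pick;
	return pick;
}
```

      With three destinations where the middle one has no window left, this
      model alternates between the first and the third. When no destination
      has enough window, it degrades to ordinary round-robin and the sender
      would then block or see EAGAIN.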
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce group unicast messaging · 27bd9ec0
      Committed by Jon Maloy
      We now make it possible to send connectionless unicast messages
      within a communication group. To send a message, the sender can use
      either a direct port address, aka port identity, or an indirect port
      name to be looked up.
      
      Messages of this type are subject to the same start synchronization
      and flow control mechanism as group broadcast messages.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce flow control for group broadcast messages · b7d42635
      Committed by Jon Maloy
      We introduce an end-to-end flow control mechanism for group broadcast
      messages. This ensures that no messages are ever lost because of
      destination receive buffer overflow, with minimal impact on performance.
      For now, the algorithm is based on the assumption that there is only one
      active transmitter at any moment in time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: receive group membership events via member socket · ae236fb2
      Committed by Jon Maloy
      Like with any other service, group members' availability can be
      subscribed for by connecting to the topology server. However, because
      the events arrive via a different socket than the member socket, there
      is a real risk that membership events may arrive out of sync with the
      actual JOIN/LEAVE action. I.e., it is possible to receive the first
      messages from a new member before the corresponding JOIN event arrives,
      just as it is possible to receive the last messages from a leaving
      member after the LEAVE event has already been received.
      
      Since each member socket is internally also subscribing for membership
      events, we now fix this problem by passing those events on to the user
      via the member socket. We leverage the already present member
      synchronization protocol to guarantee correct message/event order. An event
      is delivered to the user as an empty message where the two source
      addresses identify the new/lost member. Furthermore, we set the MSG_OOB
      bit in the message flags to mark it as an event. If the event is an
      indication about a member loss we also set the MSG_EOR bit, so it can
      be distinguished from a member addition event.
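
      On the receiving side, an application could tell events from data by
      inspecting the flags returned by recvmsg(). The sketch below assumes
      only what is stated above (MSG_OOB marks an event, MSG_EOR additionally
      marks a member loss). The enum and function names are hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical classification of what a group member socket delivered. */
enum grp_msg_kind {
	GRP_DATA,        /* ordinary message */
	GRP_MEMBER_UP,   /* membership event: member joined */
	GRP_MEMBER_DOWN  /* membership event: member left */
};

/* msg_flags is the value found in struct msghdr after recvmsg() returns. */
enum grp_msg_kind classify_group_msg(int msg_flags)
{
	if (!(msg_flags & MSG_OOB))
		return GRP_DATA;
	return (msg_flags & MSG_EOR) ? GRP_MEMBER_DOWN : GRP_MEMBER_UP;
}

void classify_demo(void)
{
	assert(classify_group_msg(0) == GRP_DATA);
	assert(classify_group_msg(MSG_OOB) == GRP_MEMBER_UP);
	assert(classify_group_msg(MSG_OOB | MSG_EOR) == GRP_MEMBER_DOWN);
}
```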
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add second source address to recvmsg()/recvfrom() · 31c82a2d
      Committed by Jon Maloy
      With group communication, it becomes important for a message receiver to
      identify not only from which socket (identified by a node:port tuple) the
      message was sent, but also the logical identity (type:instance) of the
      sending member.
      
      We fix this by adding a second instance of struct sockaddr_tipc to the
      source address area when a message is read. The extra address struct
      is filled in with data found in the received message header (type) and
      in the local member representation struct (instance).
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce communication groups · 75da2163
      Committed by Jon Maloy
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able to keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as a logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and if it wants to be eligible as a destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be a member of one group at a time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: improve destination linked list · a80ae530
      Committed by Jon Maloy
      We often see a need for a linked list of destination identities,
      sometimes containing a port number, sometimes a node identity, and
      sometimes both. The currently defined struct u32_list is not generic
      enough to cover all cases, so we extend it to contain two u32 integers
      and rename it to struct tipc_dest_list.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add new function for sending multiple small messages · f70d37b7
      Committed by Jon Maloy
      We see an increasing need to send multiple single-buffer messages
      of TIPC_SYSTEM_IMPORTANCE to different individual destination nodes.
      Instead of looping over the send queue and sending each buffer
      individually, as we do now, we add a new helper function
      tipc_node_distr_xmit() to do this.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: refactor function filter_rcv() · 64ac5f59
      Committed by Jon Maloy
      In the following commits we will need to handle multiple incoming and
      rejected/returned buffers in the function socket.c::filter_rcv().
      As a preparation for this, we generalize the function by handling
      buffer queues instead of individual buffers. We also introduce a
      helper function tipc_skb_reject(), and rename filter_rcv() to
      tipc_sk_filter_rcv() in line with other functions in socket.c.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: improve address sanity check in tipc_connect() · 23998835
      Committed by Jon Maloy
      The address given to tipc_connect() is not completely sanity checked,
      under the assumption that this will be done later in the function
      __tipc_sendmsg() when the address is used there.
      
      However, the latter function will in the next commits serve as caller
      to several other send functions, so we want to move the corresponding
      sanity check there to the beginning of that function, before we possibly
      need to grab the address stored by tipc_connect(). We must therefore
      be able to trust that this address already has been thoroughly checked.
      
      We do this in this commit.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add ability to order and receive topology events in driver · 14c04493
      Committed by Jon Maloy
      As preparation for introducing communication groups, we add the ability
      to issue topology subscriptions and receive topology events from kernel
      space. This will make it possible for group member sockets to keep track
      of other group members.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 25 Aug, 2017 1 commit
    • tipc: Fix tipc_sk_reinit handling of -EAGAIN · 6c7e983b
      Committed by Bob Peterson
      In commit 9dbbfb0a, function tipc_sk_reinit had additional logic added
      to loop in the event that function rhashtable_walk_next() returned
      -EAGAIN. No worries.
      
      However, if rhashtable_walk_start returns -EAGAIN, it does "continue",
      and therefore skips the call to rhashtable_walk_stop(). That has
      the effect of calling rcu_read_lock() without its paired call to
      rcu_read_unlock(). Since rcu_read_lock() may be nested, the problem
      may not be apparent for a while, especially since resize events may
      be rare. But the comments to rhashtable_walk_start() state:
      
       * ...Note that we take the RCU lock in all
       * cases including when we return an error.  So you must always call
       * rhashtable_walk_stop to clean up.
      
      This patch replaces the continue with a goto and label to ensure a
      matching call to rhashtable_walk_stop().
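
      The lock imbalance can be reproduced in a small user-space model. In
      the sketch below, rcu_depth stands in for the RCU read-lock nesting
      level, and walk_start()/walk_stop() are hypothetical stand-ins for the
      rhashtable walker calls of that time. It only illustrates the control
      flow and is not the kernel code.

```c
#include <assert.h>
#include <errno.h>

int rcu_depth; /* models rcu_read_lock() nesting */

/* Takes the "lock" in all cases, including when it returns -EAGAIN. */
int walk_start(int resized)
{
	rcu_depth++;
	return resized ? -EAGAIN : 0;
}

void walk_stop(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

/* Buggy shape: "continue" jumps to the loop test and skips walk_stop(). */
int walk_buggy(int resizes)
{
	int ret;

	rcu_depth = 0;
	do {
		ret = walk_start(resizes-- > 0);
		if (ret)
			continue;
		/* ... visit entries ... */
		walk_stop();
	} while (ret == -EAGAIN);
	return rcu_depth; /* leaked nesting levels */
}

/* Fixed shape: a goto ensures every start is paired with a stop. */
int walk_fixed(int resizes)
{
	int ret;

	rcu_depth = 0;
	do {
		ret = walk_start(resizes-- > 0);
		if (ret)
			goto stop;
		/* ... visit entries ... */
stop:
		walk_stop();
	} while (ret == -EAGAIN);
	return rcu_depth;
}
```

      In this model, each resize event leaks one nesting level in the buggy
      variant, while the fixed variant always returns to zero.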
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 01 Jul, 2017 1 commit
  12. 12 May, 2017 1 commit
    • tipc: make macro tipc_wait_for_cond() smp safe · 844cf763
      Committed by Jon Paul Maloy
      The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
      to fulfil its task. The latter, in turn, is evaluating the stated
      condition outside the socket lock context. This is problematic if
      the condition is accessing non-trivial data structures which may be
      altered by incoming interrupts, as is the case with the cong_links()
      linked list, used by socket to keep track of the current set of
      congested links. We sometimes see crashes when this list is accessed
      by a condition function at the same time as a SOCK_WAKEUP interrupt
      is removing an element from the list.
      
      We fix this by expanding selected parts of sk_wait_event() into the
      outer macro, while ensuring that all evaluations of a given condition
      are performed under socket lock protection.
      
      Fixes: commit 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 03 May, 2017 2 commits
  14. 29 Apr, 2017 3 commits
  15. 25 Apr, 2017 2 commits
  16. 14 Apr, 2017 1 commit
  17. 30 Mar, 2017 2 commits
  18. 10 Mar, 2017 1 commit
    • net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      Committed by David Howells
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set is
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_tcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 02 Mar, 2017 1 commit
  20. 18 Feb, 2017 1 commit
  21. 16 Feb, 2017 1 commit