1. 18 6月, 2016 1 次提交
    • J
      tipc: fix socket timer deadlock · f1d048f2
      Jon Paul Maloy 提交于
      We sometimes observe a 'deadly embrace' type deadlock occurring
      between mutually connected sockets on the same node. This happens
      when the one-hour peer supervision timers happen to expire
      simultaneously in both sockets.
      
      The scenario is as follows:
      
      CPU 1:                          CPU 2:
      --------                        --------
      tipc_sk_timeout(sk1)            tipc_sk_timeout(sk2)
        lock(sk1.slock)                 lock(sk2.slock)
        msg_create(probe)               msg_create(probe)
        unlock(sk1.slock)               unlock(sk2.slock)
        tipc_node_xmit_skb()            tipc_node_xmit_skb()
          tipc_node_xmit()                tipc_node_xmit()
            tipc_sk_rcv(sk2)                tipc_sk_rcv(sk1)
              lock(sk2.slock)                 lock((sk1.slock)
              filter_rcv()                    filter_rcv()
                tipc_sk_proto_rcv()             tipc_sk_proto_rcv()
                  msg_create(probe_rsp)           msg_create(probe_rsp)
                  tipc_sk_respond()               tipc_sk_respond()
                    tipc_node_xmit_skb()            tipc_node_xmit_skb()
                      tipc_node_xmit()                tipc_node_xmit()
                        tipc_sk_rcv(sk1)                tipc_sk_rcv(sk2)
                          lock((sk1.slock)                lock((sk2.slock)
                          ===> DEADLOCK                   ===> DEADLOCK
      
      Further analysis reveals that there are three different locations in the
      socket code where tipc_sk_respond() is called within the context of the
      socket lock, with ensuing risk of similar deadlocks.
      
      We now solve this by passing a buffer queue along with all upcalls where
      sk_lock.slock may potentially be held. Response or rejected message
      buffers are accumulated into this queue instead of being sent out
      directly, and only sent once we know we are safely outside the slock
      context.
      Reported-by: NGUNA <gbalasun@gmail.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1d048f2
  2. 17 5月, 2016 1 次提交
  3. 04 5月, 2016 3 次提交
    • J
      tipc: redesign connection-level flow control · 10724cc7
      Jon Paul Maloy 提交于
      There are two flow control mechanisms in TIPC; one at link level that
      handles network congestion, burst control, and retransmission, and one
      at connection level which' only remaining task is to prevent overflow
      in the receiving socket buffer. In TIPC, the latter task has to be
      solved end-to-end because messages can not be thrown away once they
      have been accepted and delivered upwards from the link layer, i.e, we
      can never permit the receive buffer to overflow.
      
      Currently, this algorithm is message based. A counter in the receiving
      socket keeps track of number of consumed messages, and sends a dedicated
      acknowledge message back to the sender for each 256 consumed message.
      A counter at the sending end keeps track of the sent, not yet
      acknowledged messages, and blocks the sender if this number ever reaches
      512 unacknowledged messages. When the missing acknowledge arrives, the
      socket is then woken up for renewed transmission. This works well for
      keeping the message flow running, as it almost never happens that a
      sender socket is blocked this way.
      
      A problem with the current mechanism is that it potentially is very
      memory consuming. Since we don't distinguish between small and large
      messages, we have to dimension the socket receive buffer according
      to a worst-case of both. I.e., the window size must be chosen large
      enough to sustain a reasonable throughput even for the smallest
      messages, while we must still consider a scenario where all messages
      are of maximum size. Hence, the current fix window size of 512 messages
      and a maximum message size of 66k results in a receive buffer of 66 MB
      when truesize(66k) = 131k is taken into account. It is possible to do
      much better.
      
      This commit introduces an algorithm where we instead use 1024-byte
      blocks as base unit. This unit, always rounded upwards from the
      actual message size, is used when we advertise windows as well as when
      we count and acknowledge transmitted data. The advertised window is
      based on the configured receive buffer size in such a way that even
      the worst-case truesize/msgsize ratio always is covered. Since the
      smallest possible message size (from a flow control viewpoint) now is
      1024 bytes, we can safely assume this ratio to be less than four, which
      is the value we are now using.
      
      This way, we have been able to reduce the default receive buffer size
      from 66 MB to 2 MB with maintained performance.
      
      In order to keep this solution backwards compatible, we introduce a
      new capability bit in the discovery protocol, and use this throughout
      the message sending/reception path to always select the right unit.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10724cc7
    • J
      tipc: propagate peer node capabilities to socket layer · 60020e18
      Jon Paul Maloy 提交于
      During neighbor discovery, nodes advertise their capabilities as a bit
      map in a dedicated 16-bit field in the discovery message header. This
      bit map has so far only be stored in the node structure on the peer
      nodes, but we now see the need to keep a copy even in the socket
      structure.
      
      This commit adds this functionality.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60020e18
    • J
      tipc: re-enable compensation for socket receive buffer double counting · 7c8bcfb1
      Jon Paul Maloy 提交于
      In the refactoring commit d570d864 ("tipc: enqueue arrived buffers
      in socket in separate function") we did by accident replace the test
      
      if (sk->sk_backlog.len == 0)
           atomic_set(&tsk->dupl_rcvcnt, 0);
      
      with
      
      if (sk->sk_backlog.len)
           atomic_set(&tsk->dupl_rcvcnt, 0);
      
      This effectively disables the compensation we have for the double
      receive buffer accounting that occurs temporarily when buffers are
      moved from the backlog to the socket receive queue. Until now, this
      has gone unnoticed because of the large receive buffer limits we are
      applying, but becomes indispensable when we reduce this buffer limit
      later in this series.
      
      We now fix this by inverting the mentioned condition.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c8bcfb1
  4. 18 4月, 2016 1 次提交
  5. 08 3月, 2016 1 次提交
  6. 04 3月, 2016 1 次提交
    • P
      tipc: Revert "tipc: use existing sk_write_queue for outgoing packet chain" · f214fc40
      Parthasarathy Bhuvaragan 提交于
      reverts commit 94153e36 ("tipc: use existing sk_write_queue for
      outgoing packet chain")
      
      In Commit 94153e36, we assume that we fill & empty the socket's
      sk_write_queue within the same lock_sock() session.
      
      This is not true if the link is congested. During congestion, the
      socket lock is released while we wait for the congestion to cease.
      This implementation causes a nullptr exception, if the user space
      program has several threads accessing the same socket descriptor.
      
      Consider two threads of the same program performing the following:
           Thread1                                  Thread2
      --------------------                    ----------------------
      Enter tipc_sendmsg()                    Enter tipc_sendmsg()
      lock_sock()                             lock_sock()
      Enter tipc_link_xmit(), ret=ELINKCONG   spin on socket lock..
      sk_wait_event()                             :
      release_sock()                          grab socket lock
          :                                   Enter tipc_link_xmit(), ret=0
          :                                   release_sock()
      Wakeup after congestion
      lock_sock()
      skb = skb_peek(pktchain);
      !! TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong;
      
      In this case, the second thread transmits the buffers belonging to
      both thread1 and thread2 successfully. When the first thread wakeup
      after the congestion it assumes that the pktchain is intact and
      operates on the skb's in it, which leads to the following exception:
      
      [2102.439969] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
      [2102.440074] IP: [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
      [2102.440074] PGD 3fa3f067 PUD 3fa6b067 PMD 0
      [2102.440074] Oops: 0000 [#1] SMP
      [2102.440074] CPU: 2 PID: 244 Comm: sender Not tainted 3.12.28 #1
      [2102.440074] RIP: 0010:[<ffffffffa005f330>]  [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
      [...]
      [2102.440074] Call Trace:
      [2102.440074]  [<ffffffff8163f0b9>] ? schedule+0x29/0x70
      [2102.440074]  [<ffffffffa006a756>] ? tipc_node_unlock+0x46/0x170 [tipc]
      [2102.440074]  [<ffffffffa005f761>] tipc_link_xmit+0x51/0xf0 [tipc]
      [2102.440074]  [<ffffffffa006d8ae>] tipc_send_stream+0x11e/0x4f0 [tipc]
      [2102.440074]  [<ffffffff8106b150>] ? __wake_up_sync+0x20/0x20
      [2102.440074]  [<ffffffffa006dc9c>] tipc_send_packet+0x1c/0x20 [tipc]
      [2102.440074]  [<ffffffff81502478>] sock_sendmsg+0xa8/0xd0
      [2102.440074]  [<ffffffff81507895>] ? release_sock+0x145/0x170
      [2102.440074]  [<ffffffff815030d8>] ___sys_sendmsg+0x3d8/0x3e0
      [2102.440074]  [<ffffffff816426ae>] ? _raw_spin_unlock+0xe/0x10
      [2102.440074]  [<ffffffff81115c2a>] ? handle_mm_fault+0x6ca/0x9d0
      [2102.440074]  [<ffffffff8107dd65>] ? set_next_entity+0x85/0xa0
      [2102.440074]  [<ffffffff816426de>] ? _raw_spin_unlock_irq+0xe/0x20
      [2102.440074]  [<ffffffff8107463c>] ? finish_task_switch+0x5c/0xc0
      [2102.440074]  [<ffffffff8163ea8c>] ? __schedule+0x34c/0x950
      [2102.440074]  [<ffffffff81504e12>] __sys_sendmsg+0x42/0x80
      [2102.440074]  [<ffffffff81504e62>] SyS_sendmsg+0x12/0x20
      [2102.440074]  [<ffffffff8164aed2>] system_call_fastpath+0x16/0x1b
      
      In this commit, we maintain the skb list always in the stack.
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f214fc40
  7. 01 12月, 2015 1 次提交
  8. 24 11月, 2015 1 次提交
    • Y
      tipc: avoid packets leaking on socket receive queue · f4195d1e
      Ying Xue 提交于
      Even if we drain receive queue thoroughly in tipc_release() after tipc
      socket is removed from rhashtable, it is possible that some packets
      are in flight because some CPU runs receiver and did rhashtable lookup
      before we removed socket. They will achieve receive queue, but nobody
      delete them at all. To avoid this leak, we register a private socket
      destructor to purge receive queue, meaning releasing packets pending
      on receive queue will be delayed until the last reference of tipc
      socket will be released.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4195d1e
  9. 24 10月, 2015 2 次提交
    • J
      tipc: introduce jumbo frame support for broadcast · 959e1781
      Jon Paul Maloy 提交于
      Until now, we have only been supporting a fix MTU size of 1500 bytes
      for all broadcast media, irrespective of their actual capability.
      
      We now make the broadcast MTU adaptable to the carrying media, i.e.,
      we use the smallest MTU supported by any of the interfaces attached
      to TIPC.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      959e1781
    • J
      tipc: move bcast definitions to bcast.c · 6beb19a6
      Jon Paul Maloy 提交于
      Currently, a number of structure and function definitions related
      to the broadcast functionality are unnecessarily exposed in the file
      bcast.h. This obscures the fact that the external interface towards
      the broadcast link in fact is very narrow, and causes unnecessary
      recompilations of other files when anything changes in those
      definitions.
      
      In this commit, we move as many of those definitions as is currently
      possible to the file bcast.c.
      
      We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
      since the name does not correctly describe the contents of this
      struct, and will do so even less in the future, and because we want
      to use the term 'link' more appropriately in the functionality
      introduced later in this series.
      
      Finally, we rename a couple of functions, such as tipc_bclink_xmit()
      and others that will be kept in the future, to include the term 'bcast'
      instead.
      
      There are no functional changes in this commit.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6beb19a6
  10. 27 7月, 2015 3 次提交
    • J
      tipc: clean up socket layer message reception · cda3696d
      Jon Paul Maloy 提交于
      When a message is received in a socket, one of the call chains
      tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
      or
      tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
      are followed. At each of these levels we may encounter situations
      where the message may need to be rejected, or a new message
      produced for transfer back to the sender. Despite recent
      improvements, the current code for doing this is perceived
      as awkward and hard to follow.
      
      Leveraging the two previous commits in this series, we now
      introduce a more uniform handling of such situations. We
      let each of the functions in the chain itself produce/reverse
      the message to be returned to the sender, but also perform the
      actual forwarding. This simplifies the necessary logics within
      each function.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cda3696d
    • J
      tipc: introduce new tipc_sk_respond() function · bcd3ffd4
      Jon Paul Maloy 提交于
      Currently, we use the code sequence
      
      if (msg_reverse())
         tipc_link_xmit_skb()
      
      at numerous locations in socket.c. The preparation of arguments
      for these calls, as well as the sequence itself, makes the code
      unecessarily complex.
      
      In this commit, we introduce a new function, tipc_sk_respond(),
      that performs this call combination. We also replace some, but not
      yet all, of these explicit call sequences with calls to the new
      function. Notably, we let the function tipc_sk_proto_rcv() use
      the new function to directly send out PROBE_REPLY messages,
      instead of deferring this to the calling tipc_sk_rcv() function,
      as we do now.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bcd3ffd4
    • J
      tipc: let function tipc_msg_reverse() expand header when needed · 29042e19
      Jon Paul Maloy 提交于
      The shortest TIPC message header, for cluster local CONNECTED messages,
      is 24 bytes long. With this format, the fields "dest_node" and
      "orig_node" are optimized away, since they in reality are redundant
      in this particular case.
      
      However, the absence of these fields leads to code inconsistencies
      that are difficult to handle in some cases, especially when we need
      to reverse or reject messages at the socket layer.
      
      In this commit, we concentrate the handling of the absent fields
      to one place, by letting the function tipc_msg_reverse() reallocate
      the buffer and expand the header to 32 bytes when necessary. This
      means that the socket code now can assume that the two previously
      absent fields are present in the header when a message needs to be
      rejected. This opens up for some further simplifications of the
      socket code.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29042e19
  11. 21 7月, 2015 2 次提交
    • J
      tipc: make media xmit call outside node spinlock context · af9b028e
      Jon Paul Maloy 提交于
      Currently, message sending is performed through a deep call chain,
      where the node spinlock is grabbed and held during a significant
      part of the transmission time. This is clearly detrimental to
      overall throughput performance; it would be better if we could send
      the message after the spinlock has been released.
      
      In this commit, we do instead let the call revert on the stack after
      the buffer chain has been added to the transmission queue, whereafter
      clones of the buffers are transmitted to the device layer outside the
      spinlock scope.
      
      As a further step in our effort to separate the roles of the node
      and link entities we also move the function tipc_link_xmit() to
      node.c, and rename it to tipc_node_xmit().
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af9b028e
    • J
      tipc: change sk_buffer handling in tipc_link_xmit() · 22d85c79
      Jon Paul Maloy 提交于
      When the function tipc_link_xmit() is given a buffer list for
      transmission, it currently consumes the list both when transmission
      is successful and when it fails, except for the special case when
      it encounters link congestion.
      
      This behavior is inconsistent, and needs to be corrected if we want
      to avoid problems in later commits in this series.
      
      In this commit, we change this to let the function consume the list
      only when transmission is successful, and leave the list with the
      sender in all other cases. We also modifiy the socket code so that
      it adapts to this change, i.e., purges the list when a non-congestion
      error code is returned.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22d85c79
  12. 09 7月, 2015 1 次提交
  13. 11 6月, 2015 1 次提交
  14. 31 5月, 2015 1 次提交
  15. 15 5月, 2015 1 次提交
    • J
      tipc: simplify include dependencies · a6bf70f7
      Jon Paul Maloy 提交于
      When we try to add new inline functions in the code, we sometimes
      run into circular include dependencies.
      
      The main problem is that the file core.h, which really should be at
      the root of the dependency chain, instead is a leaf. I.e., core.h
      includes a number of header files that themselves should be allowed
      to include core.h. In reality this is unnecessary, because core.h does
      not need to know the full signature of any of the structs it refers to,
      only their type declaration.
      
      In this commit, we remove all dependencies from core.h towards any
      other tipc header file.
      
      As a consequence of this change, we can now move the function
      tipc_own_addr(net) from addr.c to addr.h, and make it inline.
      
      There are no functional changes in this commit.
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6bf70f7
  16. 11 5月, 2015 1 次提交
  17. 23 4月, 2015 1 次提交
  18. 25 3月, 2015 2 次提交
  19. 24 3月, 2015 1 次提交
  20. 21 3月, 2015 1 次提交
  21. 20 3月, 2015 2 次提交
  22. 19 3月, 2015 1 次提交
  23. 18 3月, 2015 1 次提交
    • Y
      tipc: fix netns refcnt leak · 76100a8a
      Ying Xue 提交于
      When the TIPC module is loaded, we launch a topology server in kernel
      space, which in its turn is creating TIPC sockets for communication
      with topology server users. Because both the socket's creator and
      provider reside in the same module, it is necessary that the TIPC
      module's reference count remains zero after the server is started and
      the socket created; otherwise it becomes impossible to perform "rmmod"
      even on an idle module.
      
      Currently, we achieve this by defining a separate "tipc_proto_kern"
      protocol struct, that is used only for kernel space socket allocations.
      This structure has the "owner" field set to NULL, which restricts the
      module reference count from being be bumped when sk_alloc() for local
      sockets is called. Furthermore, we have defined three kernel-specific
      functions, tipc_sock_create_local(), tipc_sock_release_local() and
      tipc_sock_accept_local(), to avoid the module counter being modified
      when module local sockets are created or deleted. This has worked well
      until we introduced name space support.
      
      However, after name space support was introduced, we have observed that
      a reference count leak occurs, because the netns counter is not
      decremented in tipc_sock_delete_local().
      
      This commit remedies this problem. But instead of just modifying
      tipc_sock_delete_local(), we eliminate the whole parallel socket
      handling infrastructure, and start using the regular sk_create_kern(),
      kernel_accept() and sk_release_kernel() calls. Since those functions
      manipulate the module counter, we must now compensate for that by
      explicitly decrementing the counter after module local sockets are
      created, and increment it just before calling sk_release_kernel().
      
      Fixes: a62fbcce ("tipc: make subscriber server support net namespace")
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Reviewed-by: NJon Maloy <jon.maloy@ericson.com>
      Reviewed-by: NErik Hugne <erik.hugne@ericsson.com>
      Reported-by: NCong Wang <cwang@twopensource.com>
      Tested-by: NErik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76100a8a
  24. 10 3月, 2015 1 次提交
  25. 03 3月, 2015 2 次提交
  26. 28 2月, 2015 1 次提交
  27. 10 2月, 2015 3 次提交
  28. 09 2月, 2015 1 次提交
  29. 06 2月, 2015 1 次提交
    • J
      tipc: eliminate race condition at multicast reception · cb1b7280
      Jon Paul Maloy 提交于
      In a previous commit in this series we resolved a race problem during
      unicast message reception.
      
      Here, we resolve the same problem at multicast reception. We apply the
      same technique: an input queue serializing the delivery of arriving
      buffers. The main difference is that here we do it in two steps.
      First, the broadcast link feeds arriving buffers into the tail of an
      arrival queue, which head is consumed at the socket level, and where
      destination lookup is performed. Second, if the lookup is successful,
      the resulting buffer clones are fed into a second queue, the input
      queue. This queue is consumed at reception in the socket just like
      in the unicast case. Both queues are protected by the same lock, -the
      one of the input queue.
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb1b7280