1. 02 12月, 2017 3 次提交
    • S
      rds: tcp: atomically purge entries from rds_tcp_conn_list during netns delete · f10b4cff
      Sowmini Varadhan 提交于
      The rds_tcp_kill_sock() function parses the rds_tcp_conn_list
      to find the rds_connection entries marked for deletion as part
      of the netns deletion under the protection of the rds_tcp_conn_lock.
      Since the rds_tcp_conn_list tracks rds_tcp_connections (which
      have a 1:1 mapping with rds_conn_path), multiple tc entries in
      the rds_tcp_conn_list will map to a single rds_connection, and will
      be deleted as part of the rds_conn_destroy() operation that is
      done outside the rds_tcp_conn_lock.
      
      The rds_tcp_conn_list traversal done under the protection of
      rds_tcp_conn_lock should not leave any doomed tc entries in
      the list after the rds_tcp_conn_lock is released, else another
      concurrently executiong netns delete (for a differnt netns) thread
      may trip on these entries.
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f10b4cff
    • S
      rds: tcp: correctly sequence cleanup on netns deletion. · 681648e6
      Sowmini Varadhan 提交于
      Commit 8edc3aff ("rds: tcp: Take explicit refcounts on struct net")
      introduces a regression in rds-tcp netns cleanup. The cleanup_net(),
      (and thus rds_tcp_dev_event notification) is only called from put_net()
      when all netns refcounts go to 0, but this cannot happen if the
      rds_connection itself is holding a c_net ref that it expects to
      release in rds_tcp_kill_sock.
      
      Instead, the rds_tcp_kill_sock callback should make sure to
      tear down state carefully, ensuring that the socket teardown
      is only done after all data-structures and workqs that depend
      on it are quiesced.
      
      The original motivation for commit 8edc3aff ("rds: tcp: Take explicit
      refcounts on struct net") was to resolve a race condition reported by
      syzkaller where workqs for tx/rx/connect were triggered after the
      namespace was deleted. Those worker threads should have been
      cancelled/flushed before socket tear-down and indeed,
      rds_conn_path_destroy() does try to sequence this by doing
           /* cancel cp_send_w */
           /* cancel cp_recv_w */
           /* flush cp_down_w */
           /* free data structures */
      Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus
      invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that
      we ought to have satisfied the requirement that "socket-close is
      done after all other dependent state is quiesced". However,
      rds_conn_shutdown has a bug in that it *always* triggers the reconnect
      workq (and if connection is successful, we always restart tx/rx
      workqs so with the right timing, we risk the race conditions reported
      by syzkaller).
      
      Netns deletion is like module teardown- no need to restart a
      reconnect in this case. We can use the c_destroy_in_prog bit
      to avoid restarting the reconnect.
      
      Fixes: 8edc3aff ("rds: tcp: Take explicit refcounts on struct net")
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      681648e6
    • S
      rds: tcp: remove redundant function rds_tcp_conn_paths_destroy() · 2d746c93
      Sowmini Varadhan 提交于
      A side-effect of Commit c14b0366 ("rds: tcp: set linger to 1
      when unloading a rds-tcp") is that we always send a RST on the tcp
      connection for rds_conn_destroy(), so rds_tcp_conn_paths_destroy()
      is not needed any more and is removed in this patch.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d746c93
  2. 17 7月, 2017 1 次提交
    • S
      rds: cancel send/recv work before queuing connection shutdown · aed20a53
      Sowmini Varadhan 提交于
      We could end up executing rds_conn_shutdown before the rds_recv_worker
      thread, then rds_conn_shutdown -> rds_tcp_conn_shutdown can do a
      sock_release and set sock->sk to null, which may interleave in bad
      ways with rds_recv_worker, e.g., it could result in:
      
      "BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
          [ffff881769f6fd70] release_sock at ffffffff815f337b
          [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
          [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
          [ffff881769f6fde0] process_one_work at ffffffff810a14c1
          [ffff881769f6fe40] worker_thread at ffffffff810a1940
          [ffff881769f6fec0] kthread at ffffffff810a6b1e
      
      Also, do not enqueue any new shutdown workq items when the connection is
      shutting down (this may happen for rds-tcp in softirq mode, if a FIN
      or CLOSE is received while the modules is in the middle of an unload)
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aed20a53
  3. 06 4月, 2017 1 次提交
  4. 08 3月, 2017 3 次提交
    • S
      rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races · b21dd450
      Sowmini Varadhan 提交于
      Commit a93d01f5 ("RDS: TCP: avoid bad page reference in
      rds_tcp_listen_data_ready") added the function
      rds_tcp_listen_sock_def_readable()  to handle the case when a
      partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
      However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
      through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
      null and we would hit a panic  of the form
        BUG: unable to handle kernel NULL pointer dereference at   (null)
        IP:           (null)
         :
        ? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
        tcp_data_queue+0x39d/0x5b0
        tcp_rcv_established+0x2e5/0x660
        tcp_v4_do_rcv+0x122/0x220
        tcp_v4_rcv+0x8b7/0x980
          :
      In the above case, it is not fatal to encounter a NULL value for
      ready- we should just drop the packet and let the flush of the
      acceptor thread finish gracefully.
      
      In general, the tear-down sequence for listen() and accept() socket
      that is ensured by this commit is:
           rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
           In rds_tcp_listen_stop():
               serialize with, and prevent, further callbacks using lock_sock()
               flush rds_wq
               flush acceptor workq
               sock_release(listen socket)
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b21dd450
    • S
      rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid races · 16c09b1c
      Sowmini Varadhan 提交于
      Order of initialization in rds_tcp_init needs to be done so
      that resources are set up and destroyed in the correct synchronization
      sequence with both the data path, as well as netns create/destroy
      path. Specifically,
      
      - we must call register_pernet_subsys and get the rds_tcp_netid
        before calling register_netdevice_notifier, otherwise we risk
        the sequence
          1. register_netdevice_notifier sets up netdev notifier callback
          2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
             the wrong rtn, resulting in a panic with string that is of the form:
      
        BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
        IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
               :
      
      - the rds_tcp_incoming_slab kmem_cache must be initialized before the
        datapath starts up. The latter can happen any time after the
        pernet_subsys registration of rds_tcp_net_ops, whose -> init
        function sets up the listen socket. If the rds_tcp_incoming_slab has
        not been set up at that time, a panic of the form below may be
        encountered
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
        IP: kmem_cache_alloc+0x90/0x1c0
           :
        rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
        tcp_read_sock+0x96/0x1c0
        rds_tcp_recv_path+0x65/0x80 [rds_tcp]
           :
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16c09b1c
    • S
      rds: tcp: Take explicit refcounts on struct net · 8edc3aff
      Sowmini Varadhan 提交于
      It is incorrect for the rds_connection to piggyback on the
      sock_net() refcount for the netns because this gives rise to
      a chicken-and-egg problem during rds_conn_destroy. Instead explicitly
      take a ref on the net, and hold the netns down till the connection
      tear-down is complete.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8edc3aff
  5. 04 3月, 2017 1 次提交
  6. 25 2月, 2017 1 次提交
  7. 03 12月, 2016 1 次提交
  8. 18 11月, 2016 1 次提交
    • A
      netns: make struct pernet_operations::id unsigned int · c7d03a00
      Alexey Dobriyan 提交于
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into an zero based array and
      thus is unsigned entity. Using negative value is out-of-bound
      access by definition.
      
      2)
      On x86_64 unsigned 32-bit data which are mixed with pointers
      via array indexing or offsets added or subtracted to pointers
      are preffered to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is 3 byte instruction which isn't necessary if the variable is
      unsigned because x86_64 is zero extending by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a semmingly random artefact of code generation with register
      allocator being used differently. gcc decides that some variable
      needs to live in new r8+ registers and every access now requires REX
      prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
      used which is longer than [r8]
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d03a00
  9. 10 11月, 2016 1 次提交
  10. 16 7月, 2016 3 次提交
  11. 05 7月, 2016 1 次提交
  12. 02 7月, 2016 7 次提交
  13. 18 6月, 2016 1 次提交
  14. 15 6月, 2016 2 次提交
  15. 08 6月, 2016 3 次提交
  16. 04 5月, 2016 2 次提交
    • S
      RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock. · bd7c5f98
      Sowmini Varadhan 提交于
      An arbitration scheme for duelling SYNs is implemented as part of
      commit 241b2719 ("RDS-TCP: Reset tcp callbacks if re-using an
      outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
      involved will arrive at the same arbitration decision. However, this
      needs to be synchronized with an outgoing SYN to be generated by
      rds_tcp_conn_connect(). This commit achieves the synchronization
      through the t_conn_lock mutex in struct rds_tcp_connection.
      
      The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
      the t_conn_lock mutex.  A SYN is sent out only if the RDS connection is
      not already UP (an UP would indicate that rds_tcp_accept_one() has
      completed 3WH, so no SYN needs to be generated).
      
      Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
      acquiring the t_conn_lock mutex. The only acceptable states (to
      allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
      was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
      outgoing SYN before we saw incoming SYN).
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd7c5f98
    • S
      RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock · eb192840
      Sowmini Varadhan 提交于
      There is a race condition between rds_send_xmit -> rds_tcp_xmit
      and the code that deals with resolution of duelling syns added
      by commit 241b2719 ("RDS-TCP: Reset tcp callbacks if re-using an
      outgoing socket in rds_tcp_accept_one()").
      
      Specifically, we may end up derefencing a null pointer in rds_send_xmit
      if we have the interleaving sequence:
                 rds_tcp_accept_one                  rds_send_xmit
      
                                                   conn is RDS_CONN_UP, so
          					 invoke rds_tcp_xmit
      
                                                   tc = conn->c_transport_data
              rds_tcp_restore_callbacks
                  /* reset t_sock */
          					 null ptr deref from tc->t_sock
      
      The race condition can be avoided without adding the overhead of
      additional locking in the xmit path: have rds_tcp_accept_one wait
      for rds_tcp_xmit threads to complete before resetting callbacks.
      The synchronization can be done in the same manner as rds_conn_shutdown().
      First set the rds_conn_state to something other than RDS_CONN_UP
      (so that new threads cannot get into rds_tcp_xmit()), then wait for
      RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
      threads in rds_tcp_xmit are done.
      
      Fixes: 241b2719 ("RDS-TCP: Reset tcp callbacks if re-using an
      outgoing socket in rds_tcp_accept_one()")
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb192840
  17. 19 3月, 2016 2 次提交
  18. 11 2月, 2016 1 次提交
  19. 05 10月, 2015 1 次提交
    • S
      RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune · 1edd6a14
      Sowmini Varadhan 提交于
      Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
      clobbers efficient use of TSO because it inflates the size_goal
      that is computed in tcp_sendmsg/tcp_sendpage and skews packet
      latency, and the default values for these parameters actually
      results in significantly better performance.
      
      In request-response tests using rds-stress with a packet size of
      100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
      between a single pair of IP addresses achieves a throughput of
      6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under
      equivalent conditions on these platforms.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1edd6a14
  20. 08 8月, 2015 2 次提交
  21. 01 11月, 2011 1 次提交
  22. 04 11月, 2010 1 次提交