1. 17 Jul, 2017: 1 commit
    • rds: cancel send/recv work before queuing connection shutdown · aed20a53
      Committed by Sowmini Varadhan
      We could end up executing rds_conn_shutdown before the rds_recv_worker
      thread, then rds_conn_shutdown -> rds_tcp_conn_shutdown can do a
      sock_release and set sock->sk to null, which may interleave in bad
      ways with rds_recv_worker, e.g., it could result in:
      
      "BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
          [ffff881769f6fd70] release_sock at ffffffff815f337b
          [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
          [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
          [ffff881769f6fde0] process_one_work at ffffffff810a14c1
          [ffff881769f6fe40] worker_thread at ffffffff810a1940
          [ffff881769f6fec0] kthread at ffffffff810a6b1e
      
      Also, do not enqueue any new shutdown workq items when the connection is
      shutting down (this may happen for rds-tcp in softirq mode, if a FIN
      or CLOSE is received while the module is in the middle of an unload).
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 22 Jun, 2017: 1 commit
  3. 17 Jun, 2017: 1 commit
  4. 03 Apr, 2017: 1 commit
  5. 08 Mar, 2017: 1 commit
  6. 03 Jan, 2017: 1 commit
  7. 18 Nov, 2016: 2 commits
    • RDS: TCP: Force every connection to be initiated by numerically smaller IP address · 1a0e100f
      Committed by Sowmini Varadhan
      When 2 RDS peers initiate an RDS-TCP connection simultaneously,
      there is a potential for "duelling syns" on either/both sides.
      See commit 241b2719 ("RDS-TCP: Reset tcp callbacks if re-using an
      outgoing socket in rds_tcp_accept_one()") for a description of this
      condition, and the arbitration logic which ensures that the
      numerically larger IP address in the TCP connection is bound to the
      RDS_TCP_PORT ("canonical ordering").
      
      The rds_connection should not be marked as RDS_CONN_UP until the
      arbitration logic has converged, for the following reason. The sender
      may start transmitting RDS datagrams as soon as RDS_CONN_UP is set,
      and the sender removes datagrams from the rds_connection's
      cp_retrans queue based on TCP acks. If a TCP ack was sent from
      a TCP socket that got reset as part of duel arbitration (but
      before the data was delivered to the receiver's RDS socket layer),
      the sender may end up prematurely freeing the datagram, and
      the datagram is no longer reliably deliverable.
      
      This patch remedies that condition by making sure that, when
      rds_tcp_state_change receives the TCP_ESTABLISHED state-change
      notification signalling completion of the three-way handshake (3WH),
      we mark the rds_connection as RDS_CONN_UP if, and only if, the IP
      addresses and ports for the connection are canonically ordered. In
      all other cases, rds_tcp_state_change will force an
      rds_conn_path_drop(), and rds_queue_reconnect() on both peers will
      restart the connection to ensure canonical ordering.
      
      A side-effect of enforcing this condition in rds_tcp_state_change()
      is that rds_tcp_accept_one_path() can now be refactored for simplicity.
      It is also no longer possible to encounter an RDS_CONN_UP connection in
      the arbitration logic in rds_tcp_accept_one().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: TCP: Track peer's connection generation number · 905dd418
      Committed by Sowmini Varadhan
      The RDS transport has to be able to distinguish between
      two types of failure events:
      (a) when the transport fails (e.g., TCP connection reset)
          but the RDS socket/connection layer on both sides stays
          the same
      (b) when the peer's RDS layer itself resets (e.g., due to module
          reload or machine reboot at the peer)
      In case (a) both sides must reconnect and continue the RDS messaging
      without any message loss or disruption to the message sequence numbers,
      and this is achieved by rds_send_path_reset().
      
      In case (b) we should reset all rds_connection state to the
      new incarnation of the peer. Examples of state that needs to
      be reset are the next expected rx sequence number from, and the
      messages to be retransmitted to, the new incarnation of the peer.
      
      To achieve this, the RDS handshake probe added as part of
      commit 5916e2c1 ("RDS: TCP: Enable multipath RDS for TCP")
      is enhanced so that sender and receiver of the RDS ping-probe
      will add a generation number as part of the RDS_EXTHDR_GEN_NUM
      extension header. Each peer stores local and remote generation
      numbers as part of each rds_connection. Changes in generation
      number will be detected via incoming handshake probe ping
      request or response and will allow the receiver to reset rds_connection
      state.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 17 Oct, 2016: 1 commit
  9. 16 Jul, 2016: 1 commit
  10. 02 Jul, 2016: 3 commits
  11. 15 Jun, 2016: 7 commits
  12. 25 Nov, 2015: 1 commit
  13. 05 Oct, 2015: 1 commit
    • RDS: Use a single TCP socket for both send and receive. · 3b20fc38
      Committed by Sowmini Varadhan
      Commit f711a6ae ("net/rds: RDS-TCP: Always create a new rds_sock
      for an incoming connection.") modified rds-tcp so that an incoming SYN
      would ignore an existing "client" TCP connection which had the local
      port set to the transient port.  The motivation for ignoring the existing
      "client" connection in f711a6ae was to avoid race conditions and an
      endless duel of reconnect attempts triggered by a restart/abort of one
      of the nodes in the TCP connection.
      
      However, having separate sockets for active and passive sides
      is avoidable, and the simpler model of a single TCP socket for
      both sends and receives of all RDS connections associated with
      that TCP socket makes for easier observability. We avoid the race
      conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
      if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
      The c_outgoing bit is initialized in __rds_conn_create().
      
      A side-effect of re-using the client rds_connection for an incoming
      SYN is the potential of encountering duelling SYNs, i.e., we
      have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
      SYN. The logic to arbitrate this criss-crossing SYN exchange in
      rds_tcp_accept_one() has been modified to emulate the BGP state
      machine: the smaller IP address should back off from the connection attempt.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 10 Sep, 2015: 1 commit
    • RDS: verify the underlying transport exists before creating a connection · 74e98eb0
      Committed by Sasha Levin
      There was no verification that an underlying transport exists when creating
      a connection, so a missing transport would cause a NULL pointer dereference.
      
      It might happen on sockets that weren't properly bound before attempting to
      send a message, which will cause a NULL ptr deref:
      
      [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
      [135546.051270] Modules linked in:
      [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
      [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
      [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
      [135546.055666] RSP: 0018:ffff8800bc70fab0  EFLAGS: 00010202
      [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
      [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
      [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
      [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
      [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
      [135546.061668] FS:  00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
      [135546.062836] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
      [135546.064723] Stack:
      [135546.065048]  ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
      [135546.066247]  0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
      [135546.067438]  1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
      [135546.068629] Call Trace:
      [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
      [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
      [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
      [135546.071981] rds_sendmsg (net/rds/send.c:1058)
      [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
      [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
      [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
      [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
      [135546.076349] ? __might_fault (mm/memory.c:3795)
      [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
      [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
      [135546.078856] SYSC_sendto (net/socket.c:1657)
      [135546.079596] ? SYSC_connect (net/socket.c:1628)
      [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
      [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
      [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
      [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
      [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
      [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
      [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 06 Sep, 2015: 1 commit
  16. 26 Aug, 2015: 1 commit
  17. 08 Aug, 2015: 1 commit
  18. 10 May, 2015: 2 commits
    • net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket. · c82ac7e6
      Committed by Sowmini Varadhan
      When the peer of an RDS-TCP connection restarts, a reconnect
      attempt should only be made from the active side of the TCP
      connection, i.e. the side that has a transient TCP port
      number. Do not add the passive side of the TCP connection
      to the c_hash_node and thus avoid triggering rds_queue_reconnect()
      for passive rds connections.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection. · f711a6ae
      Committed by Sowmini Varadhan
      When running RDS over TCP, the active (client) side connects to the
      listening ("passive") side at the RDS_TCP_PORT.  After the connection
      is established, if the client side reboots (potentially without even
      sending a FIN) the server still has a TCP socket in the established
      state.  If the server side now gets a new SYN from the client
      with a different client port, TCP will create a new socket-pair, but
      the RDS layer will incorrectly pull up the old rds_connection (which
      is still associated with the stale t_sock and RDS socket state).
      
      This patch corrects this behavior by having rds_tcp_accept_one()
      always create a new connection for an incoming TCP SYN.
      The rds and tcp state associated with the old socket-pair is cleaned
      up via the rds_tcp_state_change() callback, which in most cases is
      invoked when the client TCP sends a FIN on restart, triggering a
      transition to the CLOSE_WAIT state. In the rarer event of client
      death without a FIN, TCP_KEEPALIVE probes on the socket will detect
      the stale socket, and the TCP transition to CLOSE state will trigger
      the RDS state cleanup.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 09 Apr, 2015: 2 commits
    • RDS: make sure not to loop forever inside rds_send_xmit · 443be0e5
      Committed by Sowmini Varadhan
      If a determined set of concurrent senders keep the send queue full,
      we can loop forever inside rds_send_xmit.  This fix has two parts.
      
      First we are dropping out of the while(1) loop after we've processed a
      large batch of messages.
      
      Second we add a generation number that gets bumped each time the
      xmit bit lock is acquired.  If someone else has jumped in and
      made progress in the queue, we skip our goto restart.
      
      Original patch by Chris Mason.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: only use passive connections when addresses match · 1789b2c0
      Committed by Sowmini Varadhan
      Passive connections were added for the case where one loopback IB
      connection between identical addresses needs another connection to store
      the second QP.  Unfortunately, they were also created in the case where
      the addresses differ and we already have both QPs.

      This led to a message reordering bug.
      
      - two different IB interfaces and addresses on a machine: A B
      - traffic is sent from A to B
      - connection from A-B is created, connect request sent
      - listening accepts connect request, B-A is created
      - traffic flows, next_rx is incremented
      - unacked messages exist on the retrans list
      - connection A-B is shut down, new connect request sent
      - listen sees existing loopback B-A, creates new passive B-A
      - retrans messages are sent and delivered because of 0 next_rx
      
      The problem is that the second connection request saw the previously
      existing parent connection.  Instead of using it, and using the existing
      next_rx_seq state for the traffic between those IPs, it mistakenly
      thought that it had to create a passive connection.
      
      We fix this by only using passive connections in the special case where
      laddr and faddr match.  In this case we'll only ever have one parent
      sending connection requests and one passive connection created as the
      listening path sees the existing parent connection which initiated the
      request.
      
      Original patch by Zach Brown
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 20 Oct, 2013: 2 commits
  21. 28 Feb, 2013: 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Committed by Sasha Levin
      I'm not sure why, but the hlist for-each-entry iterators were conceived
      differently from the list ones, which look like this:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
       were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 01 Nov, 2011: 1 commit
  23. 21 Oct, 2010: 1 commit
  24. 09 Sep, 2010: 5 commits