1. 24 7月, 2018 2 次提交
    • K
      rds: Enable RDS IPv6 support · 1e2b44e7
      Ka-Cheong Poon 提交于
      This patch enables RDS to use IPv6 addresses. For RDS/TCP, the
      listener is now an IPv6 endpoint which accepts both IPv4 and IPv6
      connection requests.  RDS/RDMA/IB uses a private data (struct
      rds_ib_connect_private) exchange between endpoints at RDS connection
      establishment time to support RDMA. This private data exchange uses a
      32 bit integer to represent an IP address. This needs to be changed in
      order to support IPv6. A new private data struct
      rds6_ib_connect_private is introduced to handle this. To ensure
      backward compatibility, an IPv6 capable RDS stack uses another RDMA
      listener port (RDS_CM_PORT) to accept IPv6 connection. And it
      continues to use the original RDS_PORT for IPv4 RDS connections. When
      it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to
      send the connection set up request.
      
      v5: Fixed syntax problem (David Miller).
      
      v4: Changed port history comments in rds.h (Sowmini Varadhan).
      
      v3: Added support to set up IPv4 connection using mapped address
          (David Miller).
          Added support to set up connection between link local and non-link
          addresses.
          Various review comments from Santosh Shilimkar and Sowmini Varadhan.
      
      v2: Fixed bound and peer address scope mismatched issue.
          Added back rds_connect() IPv6 changes.
      Signed-off-by: NKa-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e2b44e7
    • K
      rds: Changing IP address internal representation to struct in6_addr · eee2fa6a
      Ka-Cheong Poon 提交于
      This patch changes the internal representation of an IP address to use
      struct in6_addr.  IPv4 address is stored as an IPv4 mapped address.
      All the functions which take an IP address as argument are also
      changed to use struct in6_addr.  But RDS socket layer is not modified
      such that it still does not accept IPv6 address from an application.
      And RDS layer does not accept nor initiate IPv6 connections.
      
      v2: Fixed sparse warnings.
      Signed-off-by: NKa-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eee2fa6a
  2. 11 4月, 2018 1 次提交
  3. 24 2月, 2018 1 次提交
  4. 22 2月, 2018 1 次提交
  5. 17 2月, 2018 2 次提交
    • S
      rds: zerocopy Tx support. · 0cebacce
      Sowmini Varadhan 提交于
      If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
      if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
      application pages sent down with rds_sendmsg() are pinned.
      
      The pinning uses the accounting infrastructure added by
      Commit a91dbff5 ("sock: ulimit on MSG_ZEROCOPY pages")
      
      The payload bytes in the message may not be modified for the
      duration that the message has been pinned. A multi-threaded
      application using this infrastructure may thus need to be notified
      about send-completion so that it can free/reuse the buffers
      passed to rds_sendmsg(). Notification of send-completion will
      identify each message-buffer by a cookie that the application
      must specify as ancillary data to rds_sendmsg().
      The ancillary data in this case has cmsg_level == SOL_RDS
      and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cebacce
    • S
      rds: hold a sock ref from rds_message to the rds_sock · ea8994cb
      Sowmini Varadhan 提交于
      The existing model holds a reference from the rds_sock to the
      rds_message, but the rds_message does not itself hold a sock_put()
      on the rds_sock. Instead the m_rs field in the rds_message is
      assigned when the message is queued on the sock, and nulled when
      the message is dequeued from the sock.
      
      We want to be able to notify userspace when the rds_message
      is actually freed (from rds_message_purge(), after the refcounts
      to the rds_message go to 0). At the time that rds_message_purge()
      is called, the message is no longer on the rds_sock retransmit
      queue. Thus the explicit reference for the m_rs is needed to
      send a notification that will signal to userspace that
      it is now safe to free/reuse any pages that may have
      been pinned down for zerocopy.
      
      This patch manages the m_rs assignment in the rds_message with
      the necessary refcount book-keeping.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea8994cb
  6. 09 2月, 2018 1 次提交
    • S
      rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and... · ebeeb1ad
      Sowmini Varadhan 提交于
      rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management
      
      An rds_connection can get added during netns deletion between lines 528
      and 529 of
      
        506 static void rds_tcp_kill_sock(struct net *net)
        :
        /* code to pull out all the rds_connections that should be destroyed */
        :
        528         spin_unlock_irq(&rds_tcp_conn_lock);
        529         list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
        530                 rds_conn_destroy(tc->t_cpath->cp_conn);
      
      Such an rds_connection would miss out the rds_conn_destroy()
      loop (that cancels all pending work) and (if it was scheduled
      after netns deletion) could trigger the use-after-free.
      
      A similar race-window exists for the module unload path
      in rds_tcp_exit -> rds_tcp_destroy_conns
      
      Concurrency with netns deletion (rds_tcp_kill_sock()) must be handled
      by checking check_net() before enqueuing new work or adding new
      connections.
      
      Concurrency with module-unload is handled by maintaining a module
      specific flag that is set at the start of the module exit function,
      and must be checked before enqueuing new work or adding new connections.
      
      This commit refactors existing RDS_DESTROY_PENDING checks added by
      commit 3db6e0d1 ("rds: use RCU to synchronize work-enqueue with
      connection teardown") and consolidates all the concurrency checks
      listed above into the function rds_destroy_pending().
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebeeb1ad
  7. 06 1月, 2018 1 次提交
  8. 27 12月, 2017 1 次提交
    • A
      RDS: Check cmsg_len before dereferencing CMSG_DATA · 14e138a8
      Avinash Repaka 提交于
      RDS currently doesn't check if the length of the control message is
      large enough to hold the required data, before dereferencing the control
      message data. This results in following crash:
      
      BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013
      [inline]
      BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90
      net/rds/send.c:1066
      Read of size 8 at addr ffff8801c928fb70 by task syzkaller455006/3157
      
      CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:252
       kasan_report_error mm/kasan/report.c:351 [inline]
       kasan_report+0x25b/0x340 mm/kasan/report.c:409
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
       rds_rdma_bytes net/rds/send.c:1013 [inline]
       rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066
       sock_sendmsg_nosec net/socket.c:628 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:638
       ___sys_sendmsg+0x320/0x8b0 net/socket.c:2018
       __sys_sendmmsg+0x1ee/0x620 net/socket.c:2108
       SYSC_sendmmsg net/socket.c:2139 [inline]
       SyS_sendmmsg+0x35/0x60 net/socket.c:2134
       entry_SYSCALL_64_fastpath+0x1f/0x96
      RIP: 0033:0x43fe49
      RSP: 002b:00007fffbe244ad8 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe49
      RDX: 0000000000000001 RSI: 000000002020c000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004017b0
      R13: 0000000000401840 R14: 0000000000000000 R15: 0000000000000000
      
      To fix this, we verify that the cmsg_len is large enough to hold the
      data to be read, before proceeding further.
      Reported-by: Nsyzbot <syzkaller-bugs@googlegroups.com>
      Signed-off-by: NAvinash Repaka <avinash.repaka@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Reviewed-by: NYuval Shaia <yuval.shaia@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14e138a8
  9. 08 9月, 2017 1 次提交
  10. 06 9月, 2017 1 次提交
  11. 21 7月, 2017 1 次提交
  12. 22 6月, 2017 1 次提交
  13. 17 6月, 2017 1 次提交
  14. 03 1月, 2017 4 次提交
  15. 18 11月, 2016 1 次提交
    • S
      RDS: TCP: Track peer's connection generation number · 905dd418
      Sowmini Varadhan 提交于
      The RDS transport has to be able to distinguish between
      two types of failure events:
      (a) when the transport fails (e.g., TCP connection reset)
          but the RDS socket/connection layer on both sides stays
          the same
      (b) when the peer's RDS layer itself resets (e.g., due to module
          reload or machine reboot at the peer)
      In case (a) both sides must reconnect and continue the RDS messaging
      without any message loss or disruption to the message sequence numbers,
      and this is achieved by rds_send_path_reset().
      
      In case (b) we should reset all rds_connection state to the
      new incarnation of the peer. Examples of state that needs to
      be reset are next expected rx sequence number from, or messages to be
      retransmitted to, the new incarnation of the peer.
      
      To achieve this, the RDS handshake probe added as part of
      commit 5916e2c1 ("RDS: TCP: Enable multipath RDS for TCP")
      is enhanced so that sender and receiver of the RDS ping-probe
      will add a generation number as part of the RDS_EXTHDR_GEN_NUM
      extension header. Each peer stores local and remote generation
      numbers as part of each rds_connection. Changes in generation
      number will be detected via incoming handshake probe ping
      request or response and will allow the receiver to reset rds_connection
      state.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      905dd418
  16. 16 7月, 2016 1 次提交
  17. 02 7月, 2016 1 次提交
  18. 15 6月, 2016 10 次提交
  19. 08 6月, 2016 1 次提交
  20. 25 11月, 2015 1 次提交
  21. 19 10月, 2015 1 次提交
    • S
      RDS: fix rds-ping deadlock over TCP transport · 7b4b0009
      santosh.shilimkar@oracle.com 提交于
      Sowmini found hang with rds-ping while testing RDS over TCP. Its
      a corner case and doesn't happen always. The issue is not reproducible
      with IB transport. Its clear from below dump why we see it with RDS TCP.
      
       [<ffffffff8153b7e5>] do_tcp_setsockopt+0xb5/0x740
       [<ffffffff8153bec4>] tcp_setsockopt+0x24/0x30
       [<ffffffff814d57d4>] sock_common_setsockopt+0x14/0x20
       [<ffffffffa096071d>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
       [<ffffffffa093b5f7>] rds_send_xmit+0xd7/0x740 [rds]
       [<ffffffffa093bda2>] rds_send_pong+0x142/0x180 [rds]
       [<ffffffffa0939d34>] rds_recv_incoming+0x274/0x330 [rds]
       [<ffffffff810815ae>] ? ttwu_queue+0x11e/0x130
       [<ffffffff814dcacd>] ? skb_copy_bits+0x6d/0x2c0
       [<ffffffffa0960350>] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
       [<ffffffff8153d836>] tcp_read_sock+0x96/0x1c0
       [<ffffffffa0960060>] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
       [<ffffffff814d6a90>] ? sock_def_write_space+0xa0/0xa0
       [<ffffffffa09604d1>] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
       [<ffffffff81545249>] tcp_data_queue+0x379/0x5b0
       [<ffffffffa0960cdb>] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
       [<ffffffff81547fd2>] tcp_rcv_established+0x2e2/0x6e0
       [<ffffffff81552602>] tcp_v4_do_rcv+0x122/0x220
       [<ffffffff81553627>] tcp_v4_rcv+0x867/0x880
       [<ffffffff8152e0b3>] ip_local_deliver_finish+0xa3/0x220
      
      This happens because rds_send_xmit() chain wants to take
      sock_lock which is already taken by tcp_v4_rcv() on its
      way to rds_tcp_data_ready(). Commit db6526dc ("RDS: use
      rds_send_xmit() state instead of RDS_LL_SEND_FULL") which
      was trying to opportunistically finish the send request
      in same thread context.
      
      But because of above recursive lock hang with RDS TCP,
      the send work from rds_send_pong() needs to deferred to
      worker to avoid lock up. Given RDS ping is more of connectivity
      test than performance critical path, its should be ok even
      for transport like IB.
      Reported-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NSantosh Shilimkar <ssantosh@kernel.org>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Acked-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b4b0009
  22. 06 10月, 2015 3 次提交
  23. 26 8月, 2015 2 次提交
    • M
      RDS: return EMSGSIZE for oversize requests before processing/queueing · 06e8941e
      Mukesh Kacker 提交于
      rds_send_queue_rm() allows for the "current datagram" being queued
      to exceed SO_SNDBUF thresholds by checking bytes queued without
      counting in length of current datagram. (Since sk_sndbuf is set
      to twice requested SO_SNDBUF value as a kernel heuristic this
      is usually fine!)
      
      If this "current datagram" squeezing past the threshold is itself
      many times the size of the sk_sndbuf threshold itself then even
      twice the SO_SNDBUF does not save us and it gets queued but
      cannot be transmitted. Threads block and deadlock and device
      becomes unusable. The check for this datagram not exceeding
      SNDBUF thresholds (EMSGSIZE) is not done on this datagram as
      that check is only done if queueing attempt fails.
      (Datagrams that follow this datagram fail queueing attempts, go
      through the check and eventually trip EMSGSIZE error but zero
      length datagrams silently fail!)
      
      This fix moves the check for datagrams exceeding SNDBUF limits
      before any processing or queueing is attempted and returns EMSGSIZE
      early in the rds_sndmsg() code. This change also ensures that all
      datagrams get checked for exceeding SNDBUF/sk_sndbuf size limits
      and the large datagrams that exceed those limits do not get to
      rds_send_queue_rm() code for processing.
      Signed-off-by: NMukesh Kacker <mukesh.kacker@oracle.com>
      Signed-off-by: NSantosh Shilimkar <ssantosh@kernel.org>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06e8941e
    • S
      RDS: make sure rds_send_drop_to properly takes the m_rs_lock · dfcec251
      santosh.shilimkar@oracle.com 提交于
      rds_send_drop_to() is used during socket tear down to find all the
      messages on the socket and flush them .  It can race with the
      acking code unless it takes the m_rs_lock on each and every message.
      
      This plugs a hole where we didn't take m_rs_lock on any message that
      didn't have the RDS_MSG_ON_CONN set.  Taking m_rs_lock avoids
      double frees and other memory corruptions as the ack code trusts
      the message m_rs pointer on a socket that had actually been freed.
      
      We must take m_rs_lock to access m_rs.  Because of lock nesting and
      rs access, we also need to acquire rs_lock.
      Reviewed-by: NAjaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
      Signed-off-by: NSantosh Shilimkar <ssantosh@kernel.org>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dfcec251