1. 17 11月, 2020 2 次提交
    • F
      mptcp: rework poll+nospace handling · 8edf0864
      Florian Westphal 提交于
      MPTCP maintains a status bit, MPTCP_SEND_SPACE, that is set when at
      least one subflow and the mptcp socket itself are writeable.
      
      mptcp_poll returns EPOLLOUT if the bit is set.
      
      mptcp_sendmsg makes sure MPTCP_SEND_SPACE gets cleared when last write
      has used up all subflows or the mptcp socket wmem.
      
      This reworks nospace handling as follows:
      
      MPTCP_SEND_SPACE is replaced with MPTCP_NOSPACE, i.e. inverted meaning.
      This bit is set when the mptcp socket is not writeable.
      The mptcp-level ack path schedule will then schedule the mptcp worker
      to allow it to free already-acked data (and reduce wmem usage).
      
      This will then wake userspace processes that wait for a POLLOUT event.
      
      sendmsg will set MPTCP_NOSPACE only when it has to wait for more
      wmem (blocking I/O case).
      
      poll path will set MPTCP_NOSPACE in case the mptcp socket is
      not writeable.
      
      Normal tcp-level notification (SOCK_NOSPACE) is only enabled
      in case the subflow socket has no available wmem.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      8edf0864
    • P
      mptcp: refactor shutdown and close · e16163b6
      Paolo Abeni 提交于
      We must not close the subflows before all the MPTCP level
      data, comprising the DATA_FIN has been acked at the MPTCP
      level, otherwise we could be unable to retransmit as needed.
      
      __mptcp_wr_shutdown() shutdown is responsible to check for the
      correct status and close all subflows. Is called by the output
      path after spooling any data and at shutdown/close time.
      
      In a similar way, __mptcp_destroy_sock() is responsible to clean-up
      the MPTCP level status, and is called when the msk transition
      to TCP_CLOSE.
      
      The protocol level close() does not force anymore the TCP_CLOSE
      status, but orphan the msk socket and all the subflows.
      Orphaned msk sockets are forciby closed after a timeout or
      when all MPTCP-level data is acked.
      
      There is a caveat about keeping the orphaned subflows around:
      the TCP stack can asynchronusly call tcp_cleanup_ulp() on them via
      tcp_close(). To prevent accessing freed memory on later MPTCP
      level operations, the msk acquires a reference to each subflow
      socket and prevent subflow_ulp_release() from releasing the
      subflow context before __mptcp_destroy_sock().
      
      The additional subflow references are released by __mptcp_done()
      and the async ULP release is detected checking ULP ops. If such
      field has been already cleared by the ULP release path, the
      dangling context is freed directly by __mptcp_done().
      Co-developed-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      e16163b6
  2. 11 10月, 2020 2 次提交
  3. 09 10月, 2020 1 次提交
  4. 06 10月, 2020 1 次提交
  5. 30 9月, 2020 1 次提交
  6. 25 9月, 2020 2 次提交
  7. 24 9月, 2020 1 次提交
  8. 18 9月, 2020 1 次提交
  9. 15 9月, 2020 9 次提交
  10. 11 9月, 2020 1 次提交
  11. 08 8月, 2020 1 次提交
  12. 06 8月, 2020 1 次提交
    • P
      mptcp: be careful on subflow creation · adf73410
      Paolo Abeni 提交于
      Nicolas reported the following oops:
      
      [ 1521.392541] BUG: kernel NULL pointer dereference, address: 00000000000000c0
      [ 1521.394189] #PF: supervisor read access in kernel mode
      [ 1521.395376] #PF: error_code(0x0000) - not-present page
      [ 1521.396607] PGD 0 P4D 0
      [ 1521.397156] Oops: 0000 [#1] SMP PTI
      [ 1521.398020] CPU: 0 PID: 22986 Comm: kworker/0:2 Not tainted 5.8.0-rc4+ #109
      [ 1521.399618] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [ 1521.401728] Workqueue: events mptcp_worker
      [ 1521.402651] RIP: 0010:mptcp_subflow_create_socket+0xf1/0x1c0
      [ 1521.403954] Code: 24 08 89 44 24 04 48 8b 7a 18 e8 2a 48 d4 ff 8b 44 24 04 85 c0 75 7a 48 8b 8b 78 02 00 00 48 8b 54 24 08 48 8d bb 80 00 00 00 <48> 8b 89 c0 00 00 00 48 89 8a c0 00 00 00 48 8b 8b 78 02 00 00 8b
      [ 1521.408201] RSP: 0000:ffffabc4002d3c60 EFLAGS: 00010246
      [ 1521.409433] RAX: 0000000000000000 RBX: ffffa0b9ad8c9a00 RCX: 0000000000000000
      [ 1521.411096] RDX: ffffa0b9ae78a300 RSI: 00000000fffffe01 RDI: ffffa0b9ad8c9a80
      [ 1521.412734] RBP: ffffa0b9adff2e80 R08: ffffa0b9af02d640 R09: ffffa0b9ad923a00
      [ 1521.414333] R10: ffffabc4007139f8 R11: fefefefefefefeff R12: ffffabc4002d3cb0
      [ 1521.415918] R13: ffffa0b9ad91fa58 R14: ffffa0b9ad8c9f9c R15: 0000000000000000
      [ 1521.417592] FS:  0000000000000000(0000) GS:ffffa0b9af000000(0000) knlGS:0000000000000000
      [ 1521.419490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1521.420839] CR2: 00000000000000c0 CR3: 000000002951e006 CR4: 0000000000160ef0
      [ 1521.422511] Call Trace:
      [ 1521.423103]  __mptcp_subflow_connect+0x94/0x1f0
      [ 1521.425376]  mptcp_pm_create_subflow_or_signal_addr+0x200/0x2a0
      [ 1521.426736]  mptcp_worker+0x31b/0x390
      [ 1521.431324]  process_one_work+0x1fc/0x3f0
      [ 1521.432268]  worker_thread+0x2d/0x3b0
      [ 1521.434197]  kthread+0x117/0x130
      [ 1521.435783]  ret_from_fork+0x22/0x30
      
      on some unconventional configuration.
      
      The MPTCP protocol is trying to create a subflow for an
      unaccepted server socket. That is allowed by the RFC, even
      if subflow creation will likely fail.
      Unaccepted sockets have still a NULL sk_socket field,
      avoid the issue by failing earlier.
      Reported-and-tested-by: NNicolas Rybowski <nicolas.rybowski@tessares.net>
      Fixes: 7d14b0d2 ("mptcp: set correct vfs info for subflows")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      adf73410
  13. 01 8月, 2020 5 次提交
  14. 29 7月, 2020 2 次提交
    • M
      mptcp: Only use subflow EOF signaling on fallback connections · 067a0b3d
      Mat Martineau 提交于
      The MPTCP state machine handles disconnections on non-fallback connections,
      but the mptcp_sock still needs to get notified when fallback subflows
      disconnect.
      Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      067a0b3d
    • M
      mptcp: Use full MPTCP-level disconnect state machine · 43b54c6e
      Mat Martineau 提交于
      RFC 8684 appendix D describes the connection state machine for
      MPTCP. This patch implements the DATA_FIN / DATA_ACK exchanges and
      MPTCP-level socket state changes described in that appendix, rather than
      simply sending DATA_FIN along with TCP FIN when disconnecting subflows.
      
      DATA_FIN is now sent and acknowledged before shutting down the
      subflows. Received DATA_FIN information (if not part of a data packet)
      is written to the MPTCP socket when the incoming DSS option is parsed by
      the subflow, and the MPTCP worker is scheduled to process the
      flag. DATA_FIN received as part of a full DSS mapping will be handled
      when the mapping is processed.
      
      The DATA_FIN is acknowledged by the worker if the reader is caught
      up. If there is still data to be moved to the MPTCP-level queue, ack_seq
      will be incremented to account for the DATA_FIN when it reaches the end
      of the stream and a DATA_ACK will be sent to the peer.
      Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43b54c6e
  15. 24 7月, 2020 6 次提交
  16. 18 7月, 2020 1 次提交
  17. 07 7月, 2020 1 次提交
    • D
      mptcp: fix race in subflow_data_ready() · d47a7215
      Davide Caratti 提交于
      syzkaller was able to make the kernel reach subflow_data_ready() for a
      server subflow that was closed before subflow_finish_connect() completed.
      In these cases we can avoid using the path for regular/fallback MPTCP
      data, and just wake the main socket, to avoid the following warning:
      
       WARNING: CPU: 0 PID: 9370 at net/mptcp/subflow.c:885
       subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885
       Kernel panic - not syncing: panic_on_warn set ...
       CPU: 0 PID: 9370 Comm: syz-executor.0 Not tainted 5.7.0 #106
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
       rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <IRQ>
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0xb7/0xfe lib/dump_stack.c:118
        panic+0x29e/0x692 kernel/panic.c:221
        __warn.cold+0x2f/0x3d kernel/panic.c:582
        report_bug+0x28b/0x2f0 lib/bug.c:195
        fixup_bug arch/x86/kernel/traps.c:105 [inline]
        fixup_bug arch/x86/kernel/traps.c:100 [inline]
        do_error_trap+0x10f/0x180 arch/x86/kernel/traps.c:197
        do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:216
        invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:1027
       RIP: 0010:subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885
       Code: 04 02 84 c0 74 06 0f 8e 91 00 00 00 41 0f b6 5e 48 31 ff 83 e3 18
       89 de e8 37 ec 3d fe 84 db 0f 85 65 ff ff ff e8 fa ea 3d fe <0f> 0b e9
       59 ff ff ff e8 ee ea 3d fe 48 89 ee 4c 89 ef e8 f3 77 ff
       RSP: 0018:ffff88811b2099b0 EFLAGS: 00010206
       RAX: ffff888111197000 RBX: 0000000000000000 RCX: ffffffff82fbc609
       RDX: 0000000000000100 RSI: ffffffff82fbc616 RDI: 0000000000000001
       RBP: ffff8881111bc800 R08: ffff888111197000 R09: ffffed10222a82af
       R10: ffff888111541577 R11: ffffed10222a82ae R12: 1ffff11023641336
       R13: ffff888111541000 R14: ffff88810fd4ca00 R15: ffff888111541570
        tcp_child_process+0x754/0x920 net/ipv4/tcp_minisocks.c:841
        tcp_v4_do_rcv+0x749/0x8b0 net/ipv4/tcp_ipv4.c:1642
        tcp_v4_rcv+0x2666/0x2e60 net/ipv4/tcp_ipv4.c:1999
        ip_protocol_deliver_rcu+0x29/0x1f0 net/ipv4/ip_input.c:204
        ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
        NF_HOOK include/linux/netfilter.h:421 [inline]
        ip_local_deliver+0x2da/0x390 net/ipv4/ip_input.c:252
        dst_input include/net/dst.h:441 [inline]
        ip_rcv_finish net/ipv4/ip_input.c:428 [inline]
        ip_rcv_finish net/ipv4/ip_input.c:414 [inline]
        NF_HOOK include/linux/netfilter.h:421 [inline]
        ip_rcv+0xef/0x140 net/ipv4/ip_input.c:539
        __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5268
        __netif_receive_skb+0x27/0x1c0 net/core/dev.c:5382
        process_backlog+0x1e5/0x6d0 net/core/dev.c:6226
        napi_poll net/core/dev.c:6671 [inline]
        net_rx_action+0x3e3/0xd70 net/core/dev.c:6739
        __do_softirq+0x18c/0x634 kernel/softirq.c:292
        do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
        </IRQ>
        do_softirq.part.0+0x26/0x30 kernel/softirq.c:337
        do_softirq arch/x86/include/asm/preempt.h:26 [inline]
        __local_bh_enable_ip+0x46/0x50 kernel/softirq.c:189
        local_bh_enable include/linux/bottom_half.h:32 [inline]
        rcu_read_unlock_bh include/linux/rcupdate.h:723 [inline]
        ip_finish_output2+0x78a/0x19c0 net/ipv4/ip_output.c:229
        __ip_finish_output+0x471/0x720 net/ipv4/ip_output.c:306
        dst_output include/net/dst.h:435 [inline]
        ip_local_out+0x181/0x1e0 net/ipv4/ip_output.c:125
        __ip_queue_xmit+0x7a1/0x14e0 net/ipv4/ip_output.c:530
        __tcp_transmit_skb+0x19dc/0x35e0 net/ipv4/tcp_output.c:1238
        __tcp_send_ack.part.0+0x3c2/0x5b0 net/ipv4/tcp_output.c:3785
        __tcp_send_ack net/ipv4/tcp_output.c:3791 [inline]
        tcp_send_ack+0x7d/0xa0 net/ipv4/tcp_output.c:3791
        tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6040 [inline]
        tcp_rcv_state_process+0x36a4/0x49c2 net/ipv4/tcp_input.c:6209
        tcp_v4_do_rcv+0x343/0x8b0 net/ipv4/tcp_ipv4.c:1651
        sk_backlog_rcv include/net/sock.h:996 [inline]
        __release_sock+0x1ad/0x310 net/core/sock.c:2548
        release_sock+0x54/0x1a0 net/core/sock.c:3064
        inet_wait_for_connect net/ipv4/af_inet.c:594 [inline]
        __inet_stream_connect+0x57e/0xd50 net/ipv4/af_inet.c:686
        inet_stream_connect+0x53/0xa0 net/ipv4/af_inet.c:725
        mptcp_stream_connect+0x171/0x5f0 net/mptcp/protocol.c:1920
        __sys_connect_file net/socket.c:1854 [inline]
        __sys_connect+0x267/0x2f0 net/socket.c:1871
        __do_sys_connect net/socket.c:1882 [inline]
        __se_sys_connect net/socket.c:1879 [inline]
        __x64_sys_connect+0x6f/0xb0 net/socket.c:1879
        do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:295
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fb577d06469
       Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89
       f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
       f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
       RSP: 002b:00007fb5783d5dd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
       RAX: ffffffffffffffda RBX: 000000000068bfa0 RCX: 00007fb577d06469
       RDX: 000000000000004d RSI: 0000000020000040 RDI: 0000000000000003
       RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
       R13: 000000000041427c R14: 00007fb5783d65c0 R15: 0000000000000003
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/39Reported-by: NChristoph Paasch <cpaasch@apple.com>
      Fixes: e1ff9e82 ("net: mptcp: improve fallback to TCP")
      Suggested-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Reviewed-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d47a7215
  18. 02 7月, 2020 1 次提交
    • F
      mptcp: add receive buffer auto-tuning · a6b118fe
      Florian Westphal 提交于
      When mptcp is used, userspace doesn't read from the tcp (subflow)
      socket but from the parent (mptcp) socket receive queue.
      
      skbs are moved from the subflow socket to the mptcp rx queue either from
      'data_ready' callback (if mptcp socket can be locked), a work queue, or
      the socket receive function.
      
      This means tcp_rcv_space_adjust() is never called and thus no receive
      buffer size auto-tuning is done.
      
      An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
      function that moves skbs from subflow to mptcp socket.
      While this enabled autotuning, it also meant tuning was done even if
      userspace was reading the mptcp socket very slowly.
      
      This adds mptcp_rcv_space_adjust() and calls it after userspace has
      read data from the mptcp socket rx queue.
      
      Its very similar to tcp_rcv_space_adjust, with two differences:
      
      1. The rtt estimate is the largest one observed on a subflow
      2. The rcvbuf size and window clamp of all subflows is adjusted
         to the mptcp-level rcvbuf.
      
      Otherwise, we get spurious drops at tcp (subflow) socket level if
      the skbs are not moved to the mptcp socket fast enough.
      
      Before:
      time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
      [..]
      ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration 40823ms) [ OK ]
      ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration 23119ms) [ OK ]
      ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5421ms) [ OK ]
      ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration 41446ms) [ OK ]
      ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration 23427ms) [ OK ]
      ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5426ms) [ OK ]
      Time: 1396 seconds
      
      After:
      ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration  5417ms) [ OK ]
      ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration  5427ms) [ OK ]
      ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5422ms) [ OK ]
      ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration  5415ms) [ OK ]
      ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration  5422ms) [ OK ]
      ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5423ms) [ OK ]
      Time: 296 seconds
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6b118fe
  19. 01 7月, 2020 1 次提交