1. 30 Dec, 2021 1 commit
  2. 11 Dec, 2021 1 commit
  3. 10 Dec, 2021 1 commit
  4. 02 Dec, 2021 1 commit
    • E
      net: avoid uninit-value from tcp_conn_request · a37a0ee4
      Eric Dumazet committed
      A recent change triggers a KMSAN warning, because request
      sockets do not initialize @sk_rx_queue_mapping field.
      
      Add sk_rx_queue_update() helper to make our intent clear.
      
      BUG: KMSAN: uninit-value in sk_rx_queue_set include/net/sock.h:1922 [inline]
      BUG: KMSAN: uninit-value in tcp_conn_request+0x3bcc/0x4dc0 net/ipv4/tcp_input.c:6922
       sk_rx_queue_set include/net/sock.h:1922 [inline]
       tcp_conn_request+0x3bcc/0x4dc0 net/ipv4/tcp_input.c:6922
       tcp_v4_conn_request+0x218/0x2a0 net/ipv4/tcp_ipv4.c:1528
       tcp_rcv_state_process+0x2c5/0x3290 net/ipv4/tcp_input.c:6406
       tcp_v4_do_rcv+0xb4e/0x1330 net/ipv4/tcp_ipv4.c:1738
       tcp_v4_rcv+0x468d/0x4ed0 net/ipv4/tcp_ipv4.c:2100
       ip_protocol_deliver_rcu+0x760/0x10b0 net/ipv4/ip_input.c:204
       ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       ip_local_deliver+0x584/0x8c0 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:460 [inline]
       ip_sublist_rcv_finish net/ipv4/ip_input.c:551 [inline]
       ip_list_rcv_finish net/ipv4/ip_input.c:601 [inline]
       ip_sublist_rcv+0x11fd/0x1520 net/ipv4/ip_input.c:609
       ip_list_rcv+0x95f/0x9a0 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5505 [inline]
       __netif_receive_skb_list_core+0xe34/0x1240 net/core/dev.c:5553
       __netif_receive_skb_list+0x7fc/0x960 net/core/dev.c:5605
       netif_receive_skb_list_internal+0x868/0xde0 net/core/dev.c:5696
       gro_normal_list net/core/dev.c:5850 [inline]
       napi_complete_done+0x579/0xdd0 net/core/dev.c:6587
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0x17b6/0x2350 drivers/net/virtio_net.c:1557
       __napi_poll+0x14e/0xbc0 net/core/dev.c:7020
       napi_poll net/core/dev.c:7087 [inline]
       net_rx_action+0x824/0x1880 net/core/dev.c:7174
       __do_softirq+0x1fe/0x7eb kernel/softirq.c:558
       invoke_softirq+0xa4/0x130 kernel/softirq.c:432
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x76/0x130 kernel/softirq.c:648
       common_interrupt+0xb6/0xd0 arch/x86/kernel/irq.c:240
       asm_common_interrupt+0x1e/0x40
       smap_restore arch/x86/include/asm/smap.h:67 [inline]
       get_shadow_origin_ptr mm/kmsan/instrumentation.c:31 [inline]
       __msan_metadata_ptr_for_load_1+0x28/0x30 mm/kmsan/instrumentation.c:63
       tomoyo_check_acl+0x1b0/0x630 security/tomoyo/domain.c:173
       tomoyo_path_permission security/tomoyo/file.c:586 [inline]
       tomoyo_check_open_permission+0x61f/0xe10 security/tomoyo/file.c:777
       tomoyo_file_open+0x24f/0x2d0 security/tomoyo/tomoyo.c:311
       security_file_open+0xb1/0x1f0 security/security.c:1635
       do_dentry_open+0x4e4/0x1bf0 fs/open.c:809
       vfs_open+0xaf/0xe0 fs/open.c:957
       do_open fs/namei.c:3426 [inline]
       path_openat+0x52f1/0x5dd0 fs/namei.c:3559
       do_filp_open+0x306/0x760 fs/namei.c:3586
       do_sys_openat2+0x263/0x8f0 fs/open.c:1212
       do_sys_open fs/open.c:1228 [inline]
       __do_sys_open fs/open.c:1236 [inline]
       __se_sys_open fs/open.c:1232 [inline]
       __x64_sys_open+0x314/0x380 fs/open.c:1232
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Uninit was created at:
       __alloc_pages+0xbc7/0x10a0 mm/page_alloc.c:5409
       alloc_pages+0x8a5/0xb80
       alloc_slab_page mm/slub.c:1810 [inline]
       allocate_slab+0x287/0x1c20 mm/slub.c:1947
       new_slab mm/slub.c:2010 [inline]
       ___slab_alloc+0xbdf/0x1e90 mm/slub.c:3039
       __slab_alloc mm/slub.c:3126 [inline]
       slab_alloc_node mm/slub.c:3217 [inline]
       slab_alloc mm/slub.c:3259 [inline]
       kmem_cache_alloc+0xbb3/0x11c0 mm/slub.c:3264
       reqsk_alloc include/net/request_sock.h:91 [inline]
       inet_reqsk_alloc+0xaf/0x8b0 net/ipv4/tcp_input.c:6712
       tcp_conn_request+0x910/0x4dc0 net/ipv4/tcp_input.c:6852
       tcp_v4_conn_request+0x218/0x2a0 net/ipv4/tcp_ipv4.c:1528
       tcp_rcv_state_process+0x2c5/0x3290 net/ipv4/tcp_input.c:6406
       tcp_v4_do_rcv+0xb4e/0x1330 net/ipv4/tcp_ipv4.c:1738
       tcp_v4_rcv+0x468d/0x4ed0 net/ipv4/tcp_ipv4.c:2100
       ip_protocol_deliver_rcu+0x760/0x10b0 net/ipv4/ip_input.c:204
       ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       ip_local_deliver+0x584/0x8c0 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:460 [inline]
       ip_sublist_rcv_finish net/ipv4/ip_input.c:551 [inline]
       ip_list_rcv_finish net/ipv4/ip_input.c:601 [inline]
       ip_sublist_rcv+0x11fd/0x1520 net/ipv4/ip_input.c:609
       ip_list_rcv+0x95f/0x9a0 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5505 [inline]
       __netif_receive_skb_list_core+0xe34/0x1240 net/core/dev.c:5553
       __netif_receive_skb_list+0x7fc/0x960 net/core/dev.c:5605
       netif_receive_skb_list_internal+0x868/0xde0 net/core/dev.c:5696
       gro_normal_list net/core/dev.c:5850 [inline]
       napi_complete_done+0x579/0xdd0 net/core/dev.c:6587
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0x17b6/0x2350 drivers/net/virtio_net.c:1557
       __napi_poll+0x14e/0xbc0 net/core/dev.c:7020
       napi_poll net/core/dev.c:7087 [inline]
       net_rx_action+0x824/0x1880 net/core/dev.c:7174
       __do_softirq+0x1fe/0x7eb kernel/softirq.c:558
      
      Fixes: 342159ee ("net: avoid dirtying sk->sk_rx_queue_mapping")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20211130182939.2584764-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
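The difference between setting and updating the field can be illustrated with a minimal user-space sketch. It assumes, beyond what the message states, that a "set" stores unconditionally (so it never reads a freshly allocated, uninitialized field) while an "update" compares first and skips the store when the value is unchanged. All `mini_` names and the `writes` counter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the single socket field discussed above. */
struct mini_sock {
    uint16_t rx_queue_mapping;
    unsigned int writes;        /* counts stores, to make dirtying visible */
};

/* Shared body: store unconditionally, or only when the value changes. */
void mini_rx_queue_store(struct mini_sock *sk, uint16_t rx_queue, bool force_set)
{
    assert(sk != NULL);
    /* With force_set the old value is never read, so it may be garbage. */
    if (force_set || sk->rx_queue_mapping != rx_queue) {
        sk->rx_queue_mapping = rx_queue;
        sk->writes++;
    }
}

/* For objects whose field has not been initialized yet. */
void mini_rx_queue_set(struct mini_sock *sk, uint16_t rx_queue)
{
    mini_rx_queue_store(sk, rx_queue, true);
}

/* For established objects: avoid a store when nothing changes. */
void mini_rx_queue_update(struct mini_sock *sk, uint16_t rx_queue)
{
    mini_rx_queue_store(sk, rx_queue, false);
}
```

In this sketch a new object would go through `mini_rx_queue_set()` once, and every later packet would go through `mini_rx_queue_update()`.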
  5. 29 Nov, 2021 1 commit
    • P
      tcp: fix page frag corruption on page fault · dacb5d88
      Paolo Abeni committed
      Steffen reported a TCP stream corruption for HTTP requests
      served by the apache web-server using a cifs mount-point
      and memory mapping the relevant file.
      
      The root cause is quite similar to the one addressed by
      commit 20eb4f29 ("net: fix sk_page_frag() recursion from
      memory reclaim"). Here the nested access to the task page frag
      is caused by a page fault on the (mmapped) user-space memory
      buffer coming from the cifs file.
      
      The page fault handler performs an smb transaction on a different
      socket, inside the same process context. Since sk->sk_allocation
      for such socket does not prevent the usage of the task_frag,
      the nested allocation modifies "under the hood" the page frag
      in use by the outer sendmsg call, corrupting the stream.
      
      The overall relevant stack trace looks like the following:
      
      httpd 78268 [001] 3461630.850950:      probe:tcp_sendmsg_locked:
              ffffffff91461d91 tcp_sendmsg_locked+0x1
              ffffffff91462b57 tcp_sendmsg+0x27
              ffffffff9139814e sock_sendmsg+0x3e
              ffffffffc06dfe1d smb_send_kvec+0x28
              [...]
              ffffffffc06cfaf8 cifs_readpages+0x213
              ffffffff90e83c4b read_pages+0x6b
              ffffffff90e83f31 __do_page_cache_readahead+0x1c1
              ffffffff90e79e98 filemap_fault+0x788
              ffffffff90eb0458 __do_fault+0x38
              ffffffff90eb5280 do_fault+0x1a0
              ffffffff90eb7c84 __handle_mm_fault+0x4d4
              ffffffff90eb8093 handle_mm_fault+0xc3
              ffffffff90c74f6d __do_page_fault+0x1ed
              ffffffff90c75277 do_page_fault+0x37
              ffffffff9160111e page_fault+0x1e
              ffffffff9109e7b5 copyin+0x25
              ffffffff9109eb40 _copy_from_iter_full+0xe0
              ffffffff91462370 tcp_sendmsg_locked+0x5e0
              ffffffff91462370 tcp_sendmsg_locked+0x5e0
              ffffffff91462b57 tcp_sendmsg+0x27
              ffffffff9139815c sock_sendmsg+0x4c
              ffffffff913981f7 sock_write_iter+0x97
              ffffffff90f2cc56 do_iter_readv_writev+0x156
              ffffffff90f2dff0 do_iter_write+0x80
              ffffffff90f2e1c3 vfs_writev+0xa3
              ffffffff90f2e27c do_writev+0x5c
              ffffffff90c042bb do_syscall_64+0x5b
              ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65
      
      The cifs filesystem rightfully sets sk_allocation to GFP_NOFS, so
      we can avoid the nesting by using the sk page frag for allocations
      lacking the __GFP_FS flag. Do not define an additional mm-helper
      for that, as this is strictly tied to the sk page frag usage.
      
      v1 -> v2:
       - use a stricter sk_page_frag() check instead of reordering the
         code (Eric)
      Reported-by: Steffen Froemer <sfroemer@redhat.com>
      Fixes: 5640f768 ("net: use a per task frag allocator")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
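The stricter check described in this commit can be sketched in user space as follows. This is only an illustration under stated assumptions: the flag value, the struct layouts and the `pick_page_frag()` name are hypothetical, and the sketch models only the `__GFP_FS` condition named in the message, not any other conditions the real `sk_page_frag()` may test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit; the kernel's real gfp_t values differ. */
#define SKETCH_GFP_FS 0x2u

struct sketch_frag { size_t offset; };

struct sketch_task { struct sketch_frag task_frag; };   /* shared per task */

struct sketch_sock {
    unsigned int allocation;        /* models sk->sk_allocation */
    struct sketch_frag sk_frag;     /* private to this socket */
};

/*
 * Sockets whose allocation mask lacks the FS bit (e.g. GFP_NOFS users such
 * as a filesystem transport) get their private frag, so a nested send from
 * a page fault cannot touch the frag the outer sendmsg is using.
 */
struct sketch_frag *pick_page_frag(struct sketch_task *task, struct sketch_sock *sk)
{
    assert(task != NULL && sk != NULL);
    if (sk->allocation & SKETCH_GFP_FS)
        return &task->task_frag;
    return &sk->sk_frag;
}
```

With this rule, the outer HTTP socket and the nested cifs socket in the trace above would resolve to two different frags even though they run in the same task.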
  6. 23 Nov, 2021 1 commit
  7. 18 Nov, 2021 1 commit
  8. 16 Nov, 2021 12 commits
  9. 02 Nov, 2021 1 commit
  10. 28 Oct, 2021 3 commits
  11. 27 Oct, 2021 1 commit
  12. 26 Oct, 2021 5 commits
  13. 15 Oct, 2021 1 commit
    • E
      tcp: switch orphan_count to bare per-cpu counters · 19757ceb
      Eric Dumazet committed
      Use of percpu_counter structure to track count of orphaned
      sockets is causing problems on modern hosts with 256 cpus
      or more.
      
      Stefan Bach reported a serious spinlock contention in real workloads,
      that I was able to reproduce with a netfilter rule dropping
      incoming FIN packets.
      
          53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
                  |
                  ---queued_spin_lock_slowpath
                     |
                      --53.51%--_raw_spin_lock_irqsave
                                |
                                 --53.51%--__percpu_counter_sum
                                           tcp_check_oom
                                           |
                                           |--39.03%--__tcp_close
                                           |          tcp_close
                                           |          inet_release
                                           |          inet6_release
                                           |          sock_close
                                           |          __fput
                                           |          ____fput
                                           |          task_work_run
                                           |          exit_to_usermode_loop
                                           |          do_syscall_64
                                           |          entry_SYSCALL_64_after_hwframe
                                           |          __GI___libc_close
                                           |
                                            --14.48%--tcp_out_of_resources
                                                      tcp_write_timeout
                                                      tcp_retransmit_timer
                                                      tcp_write_timer_handler
                                                      tcp_write_timer
                                                      call_timer_fn
                                                      expire_timers
                                                      __run_timers
                                                      run_timer_softirq
                                                      __softirqentry_text_start
      
      As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter
      batch for dst entries accounting"), default batch size is too big
      for the default value of tcp_max_orphans (262144).
      
      But even if we reduce batch sizes, there would still be cases
      where the estimated count of orphans is beyond the limit,
      and where tcp_too_many_orphans() has to call the expensive
      percpu_counter_sum_positive().
      
      One solution is to use plain per-cpu counters, and have
      a timer to periodically refresh this cache.
      
      Updating this cache every 100ms seems about right, as the tcp pressure
      state does not change radically over shorter periods.
      
      percpu_counter was nice 15 years ago, when hosts had fewer
      than 16 cpus, but not anymore by current standards.
      
      v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
          reported by kernel test robot <lkp@intel.com>
          Remove unused socket argument from tcp_too_many_orphans()
      
      Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Stefan Bach <sfb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
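The "plain per-cpu counters plus a periodically refreshed cache" idea can be sketched as below. This is a single-threaded user-space model, not the kernel code: the CPU count, the variable names and the function names are hypothetical, and the timer is represented by an explicit call to `orphan_cache_refresh()`.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4                /* hypothetical CPU count */

long orphan_percpu[SKETCH_NR_CPUS];     /* one plain counter per CPU, no shared lock */
long orphan_cache;                      /* last computed total, read by the hot path */

void orphan_inc(unsigned int cpu)
{
    assert(cpu < SKETCH_NR_CPUS);
    orphan_percpu[cpu]++;
}

void orphan_dec(unsigned int cpu)
{
    assert(cpu < SKETCH_NR_CPUS);
    orphan_percpu[cpu]--;
}

/* Meant to run from a periodic timer (every 100ms in the commit text). */
void orphan_cache_refresh(void)
{
    long sum = 0;

    for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        sum += orphan_percpu[cpu];
    /* Individual counters may be negative; the total is clamped at zero. */
    orphan_cache = sum > 0 ? sum : 0;
}

/* Hot path: a single read of the cache, no summation and no lock. */
bool too_many_orphans(long limit)
{
    return orphan_cache > limit;
}
```

The trade-off in this design is that the hot path may see a value that is up to one refresh period old, in exchange for never walking all CPUs under a lock.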
  14. 08 Oct, 2021 1 commit
  15. 02 Oct, 2021 1 commit
  16. 30 Sep, 2021 4 commits
    • E
      af_unix: fix races in sk_peer_pid and sk_peer_cred accesses · 35306eb2
      Eric Dumazet committed
      Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
      are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
      
      In order to fix this issue, this patch adds a new spinlock that needs
      to be used whenever these fields are read or written.
      
      Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
      reading sk->sk_peer_pid which makes no sense, as this field
      is only possibly set by AF_UNIX sockets.
      We will have to clean this in a separate patch.
      This could be done by reverting b48596d1 "Bluetooth: L2CAP: Add get_peer_pid callback"
      or implementing what was truly expected.
      
      Fixes: 109f6e39 ("af_unix: Allow SO_PEERCRED to work across namespaces.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
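The locking rule from this commit (take one spinlock whenever the peer pid or credentials are read or written) can be modelled in user space with a C11 `atomic_flag` spinlock. Everything here is a hypothetical sketch: the struct layout, the field types and the function names are not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct peer_info {
    int pid;                /* models sk_peer_pid */
    unsigned int uid;       /* models one field of sk_peer_cred */
};

struct mini_usock {
    atomic_flag peer_lock;  /* protects 'peer' as a unit */
    struct peer_info peer;
};

static void peer_spin_lock(struct mini_usock *sk)
{
    while (atomic_flag_test_and_set_explicit(&sk->peer_lock, memory_order_acquire))
        ;                   /* busy-wait until the holder clears the flag */
}

static void peer_spin_unlock(struct mini_usock *sk)
{
    atomic_flag_clear_explicit(&sk->peer_lock, memory_order_release);
}

void peer_init(struct mini_usock *sk)
{
    assert(sk != NULL);
    atomic_flag_clear(&sk->peer_lock);
    sk->peer.pid = 0;
    sk->peer.uid = 0;
}

/* Writer: both fields change inside one critical section. */
void peer_set(struct mini_usock *sk, int pid, unsigned int uid)
{
    peer_spin_lock(sk);
    sk->peer.pid = pid;
    sk->peer.uid = uid;
    peer_spin_unlock(sk);
}

/* Reader: returns a snapshot, so pid and uid always belong together. */
struct peer_info peer_get(struct mini_usock *sk)
{
    struct peer_info snapshot;

    peer_spin_lock(sk);
    snapshot = sk->peer;
    peer_spin_unlock(sk);
    return snapshot;
}
```

Returning a copy taken under the lock is the point of the sketch: a reader never observes the pid of one peer paired with the credentials of another.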
    • W
      tcp: adjust sndbuf according to sk_reserved_mem · ca057051
      Wei Wang committed
      If user sets SO_RESERVE_MEM socket option, in order to fully utilize the
      reserved memory in memory pressure state on the tx path, we modify the
      logic in sk_stream_moderate_sndbuf() to set sk_sndbuf according to
      available reserved memory, instead of MIN_SOCK_SNDBUF, and adjust it
      when new data is acked.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
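A rough sketch of the adjusted moderation logic follows. It assumes, beyond what the message says, that moderation first shrinks the send buffer toward half of the queued bytes and then applies a floor; the commit's change is modelled as raising that floor to the unused reserved memory when it exceeds the minimum. The function name, parameter names and the `SKETCH_MIN_SNDBUF` value are hypothetical.

```c
#include <assert.h>

#define SKETCH_MIN_SNDBUF 4096u     /* hypothetical floor, not the kernel constant */

static unsigned int sketch_min(unsigned int a, unsigned int b) { return a < b ? a : b; }
static unsigned int sketch_max(unsigned int a, unsigned int b) { return a > b ? a : b; }

/*
 * sndbuf:          current send buffer size
 * wmem_queued:     bytes currently queued for transmission
 * unused_reserved: reserved memory not yet consumed by this socket
 */
unsigned int moderate_sndbuf(unsigned int sndbuf, unsigned int wmem_queued,
                             unsigned int unused_reserved)
{
    unsigned int val;

    assert(sndbuf >= SKETCH_MIN_SNDBUF);
    val = sketch_min(sndbuf, wmem_queued / 2);  /* shrink under pressure */
    val = sketch_max(val, unused_reserved);     /* but keep the reservation usable */
    return sketch_max(val, SKETCH_MIN_SNDBUF);  /* never below the global floor */
}
```

Without a reservation the result falls back to the usual floor; with one, the socket can keep sending out of memory it has already been charged for.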
    • W
      net: add new socket option SO_RESERVE_MEM · 2bb2f5fb
      Wei Wang committed
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends fewer cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, at the cost of an amount of pre-allocated and
      unreclaimable memory, even under memory pressure.
      
      Note:
      This socket option is only available when memory cgroup is enabled and we
      require this reserved memory to be charged to the user's memcg. We hope
      this prevents misbehaving users from abusing this feature to reserve a
      large amount on certain sockets and cause unfairness for others.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
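From user space, the option would presumably be set through `setsockopt()` at the `SOL_SOCKET` level with an integer byte count. The sketch below makes that assumption explicit; the fallback value 73 is the asm-generic number and is an assumption for older headers (some architectures use different values), and `reserve_socket_memory()` is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/socket.h>

#ifndef SO_RESERVE_MEM
#define SO_RESERVE_MEM 73   /* assumption: asm-generic value, for older headers */
#endif

/*
 * Ask the kernel to reserve 'bytes' of memory for this socket.
 * Returns 0 on success, -1 with errno set otherwise (for example when the
 * running kernel does not know the option or memory cgroups are unavailable).
 */
int reserve_socket_memory(int fd, int bytes)
{
    assert(bytes >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM, &bytes, sizeof(bytes));
}
```

Callers should treat failure as non-fatal, since per the note above the option is only available when memory cgroup support is enabled.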
    • P
      net: introduce and use lock_sock_fast_nested() · 49054556
      Paolo Abeni committed
      Syzkaller reported a false positive deadlock involving
      the nl socket lock and the subflow socket lock:
      
      MPTCP: kernel_bind error, err=-98
      ============================================
      WARNING: possible recursive locking detected
      5.15.0-rc1-syzkaller #0 Not tainted
      --------------------------------------------
      syz-executor998/6520 is trying to acquire lock:
      ffff8880795718a0 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
      
      but task is already holding lock:
      ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
      ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(k-sk_lock-AF_INET);
        lock(k-sk_lock-AF_INET);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by syz-executor998/6520:
       #0: ffffffff8d176c50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 net/netlink/genetlink.c:802
       #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_lock net/netlink/genetlink.c:33 [inline]
       #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x3e0/0x580 net/netlink/genetlink.c:790
       #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
       #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
      
      stack backtrace:
      CPU: 1 PID: 6520 Comm: syz-executor998 Not tainted 5.15.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_deadlock_bug kernel/locking/lockdep.c:2944 [inline]
       check_deadlock kernel/locking/lockdep.c:2987 [inline]
       validate_chain kernel/locking/lockdep.c:3776 [inline]
       __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5015
       lock_acquire kernel/locking/lockdep.c:5625 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
       lock_sock_fast+0x36/0x100 net/core/sock.c:3229
       mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
       inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
       __sock_release net/socket.c:649 [inline]
       sock_release+0x87/0x1b0 net/socket.c:677
       mptcp_pm_nl_create_listen_socket+0x238/0x2c0 net/mptcp/pm_netlink.c:900
       mptcp_nl_cmd_add_addr+0x359/0x930 net/mptcp/pm_netlink.c:1170
       genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
       genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
       genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
       genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
       netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
       netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_no_sendpage+0x101/0x150 net/core/sock.c:2980
       kernel_sendpage.part.0+0x1a0/0x340 net/socket.c:3504
       kernel_sendpage net/socket.c:3501 [inline]
       sock_sendpage+0xe5/0x140 net/socket.c:1003
       pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
       splice_from_pipe_feed fs/splice.c:418 [inline]
       __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
       splice_from_pipe fs/splice.c:597 [inline]
       generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
       do_splice_from fs/splice.c:767 [inline]
       direct_splice_actor+0x110/0x180 fs/splice.c:936
       splice_direct_to_actor+0x34b/0x8c0 fs/splice.c:891
       do_splice_direct+0x1b3/0x280 fs/splice.c:979
       do_sendfile+0xae9/0x1240 fs/read_write.c:1249
       __do_sys_sendfile64 fs/read_write.c:1314 [inline]
       __se_sys_sendfile64 fs/read_write.c:1300 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1300
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f215cb69969
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffc96bb3868 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 00007f215cbad072 RCX: 00007f215cb69969
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
      RBP: 0000000000000000 R08: 00007ffc96bb3a08 R09: 00007ffc96bb3a08
      R10: 0000000100000002 R11: 0000000000000246 R12: 00007ffc96bb387c
      R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
      
      The problem originates from an incorrect lock annotation in the mptcp
      code and is only visible since commit 2dcb96ba ("net: core: Correct
      the sock::sk_lock.owned lockdep annotations"), but has been present since
      the initial implementation of port-based endpoint support.
      
      This patch addresses the issue by introducing a nested variant of
      lock_sock_fast() and using it in the relevant code path.
      
      Fixes: 1729cf18 ("mptcp: create the listening socket for new port")
      Fixes: 2dcb96ba ("net: core: Correct the sock::sk_lock.owned lockdep annotations")
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-tested-by: syzbot+1dd53f7a89b299d59eaf@syzkaller.appspotmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
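Why a nested variant silences the false positive can be shown with a toy model of a lock validator. It assumes that the validator keys held locks by a (class, subclass) pair and reports a possible recursion when the same pair is acquired twice; this is a deliberately simplified picture, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_HELD 8

struct toy_held {
    int lock_class;     /* e.g. one class shared by all AF_INET socket locks */
    int subclass;       /* nesting level chosen by the caller */
};

struct toy_held toy_held_locks[TOY_MAX_HELD];
int toy_nr_held;

/*
 * Returns false where the toy validator would warn about possible
 * recursive locking, true when the acquisition is recorded.
 */
bool toy_acquire(int lock_class, int subclass)
{
    assert(toy_nr_held < TOY_MAX_HELD);
    for (int i = 0; i < toy_nr_held; i++) {
        if (toy_held_locks[i].lock_class == lock_class &&
            toy_held_locks[i].subclass == subclass)
            return false;
    }
    toy_held_locks[toy_nr_held].lock_class = lock_class;
    toy_held_locks[toy_nr_held].subclass = subclass;
    toy_nr_held++;
    return true;
}
```

In the report above, both socket locks share one class; taking the second one with a different subclass tells the validator that the nesting is intentional.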
  17. 23 Sep, 2021 1 commit
  18. 19 Sep, 2021 1 commit
    • T
      net: core: Correct the sock::sk_lock.owned lockdep annotations · 2dcb96ba
      Thomas Gleixner committed
      lock_sock_fast() and lock_sock_nested() contain lockdep annotations for the
      sock::sk_lock.owned 'mutex'. sock::sk_lock.owned is not a regular mutex. It
      is just lockdep wise equivalent. In fact it's an open coded trivial mutex
      implementation with some interesting features.
      
      sock::sk_lock.slock is a regular spinlock protecting the 'mutex'
      representation sock::sk_lock.owned which is a plain boolean. If 'owned' is
      true, then some other task holds the 'mutex', otherwise it is uncontended.
      As this locking construct is obviously endangered by lock ordering issues as
      any other locking primitive it got lockdep annotated via a dedicated
      dependency map sock::sk_lock.dep_map which has to be updated at the lock
      and unlock sites.
      
      lock_sock_nested() is a straightforward 'mutex' lock operation:
      
        might_sleep();
        spin_lock_bh(sock::sk_lock.slock)
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
      
      The lockdep annotation for sock::sk_lock.owned is for unknown reasons
      _after_ the lock has been acquired, i.e. after the code block above and
      after releasing sock::sk_lock.slock, but inside the bottom halves disabled
      region:
      
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
      
      The placement after the unlock is obvious because otherwise the
      mutex_acquire() would nest into the spin lock held region.
      
      But that's from the lockdep perspective still the wrong place:
      
       1) The mutex_acquire() is issued _after_ the successful acquisition which
          is pointless because in a deadlock scenario this point is never
          reached, which means that if the deadlock is the first instance of
          exposing the wrong lock order, lockdep does not have a chance to detect
          it.
      
       2) It only works because lockdep is rather lax on the context from which
          the mutex_acquire() is issued. Acquiring a mutex inside a bottom halves
          and therefore non-preemptible region is obviously invalid, except for a
          trylock which is clearly not the case here.
      
          This 'works' stops working on RT enabled kernels where the bottom halves
          serialization is done via a local lock, which exposes this misplacement
          because the 'mutex' and the local lock nest the wrong way around and
          lockdep complains rightfully about a lock inversion.
      
      The placement is wrong since the initial commit a5b5bb9a ("[PATCH]
      lockdep: annotate sk_locks") which introduced this.
      
      Fix it by moving the mutex_acquire() in front of the actual lock
      acquisition, which is what the regular mutex_lock() operation does as well.
      
      lock_sock_fast() is not that straightforward. At first glance it looks
      like a convoluted trylock operation:
      
        spin_lock_bh(sock::sk_lock.slock)
        if (!sock::sk_lock.owned)
            return false;
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
        return true;
      
      But that's not the case: lock_sock_fast() is an interesting optimization
      for short critical sections which can run with bottom halves disabled and
      sock::sk_lock.slock held. This allows shortcutting the 'mutex' operation in
      the non-contended case by preventing other lockers from acquiring
      sock::sk_lock.owned because they are blocked on sock::sk_lock.slock, which
      in turn avoids the overhead of doing the heavy processing in release_sock()
      including waking up wait queue waiters.
      
      In the contended case, i.e. when sock::sk_lock.owned == true the behavior
      is the same as lock_sock_nested().
      
      Semantically this shortcut means, that the task acquired the 'mutex' even
      if it does not touch the sock::sk_lock.owned field in the non-contended
      case. Not telling lockdep about this shortcut acquisition is hiding
      potential lock ordering violations in the fast path.
      
      As a consequence the same reasoning as for the above lock_sock_nested()
      case vs. the placement of the lockdep annotation applies.
      
      The current placement of the lockdep annotation was just copied from
      the original lock_sock(), now renamed to lock_sock_nested(),
      implementation.
      
      Fix this by moving the mutex_acquire() in front of the actual lock
      acquisition and adding the corresponding mutex_release() into
      unlock_sock_fast(). Also document the fast path return case with a comment.
      Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: netdev@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
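The fast-path semantics described in this message can be modelled in user space. The sketch below follows the pseudocode in the commit text but leaves out bottom-half handling, sleeping and all lockdep annotations; the spinlock is a C11 `atomic_flag`, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sk_lock {
    atomic_flag slock;  /* spinlock protecting 'owned' */
    bool owned;         /* the open-coded 'mutex' state */
};

static void toy_slock_lock(struct toy_sk_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->slock, memory_order_acquire))
        ;
}

static void toy_slock_unlock(struct toy_sk_lock *l)
{
    atomic_flag_clear_explicit(&l->slock, memory_order_release);
}

void toy_init(struct toy_sk_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->slock);
    l->owned = false;
}

/* Called with slock held; returns with slock held and owned == true. */
static void toy_lock_slow(struct toy_sk_lock *l)
{
    while (l->owned) {
        toy_slock_unlock(l);    /* a real implementation would sleep here */
        toy_slock_lock(l);
    }
    l->owned = true;
}

void toy_lock_sock(struct toy_sk_lock *l)
{
    toy_slock_lock(l);
    toy_lock_slow(l);
    toy_slock_unlock(l);
}

void toy_release_sock(struct toy_sk_lock *l)
{
    toy_slock_lock(l);
    l->owned = false;
    toy_slock_unlock(l);
}

/* Returns false on the fast path (slock stays held), true on the slow path. */
bool toy_lock_sock_fast(struct toy_sk_lock *l)
{
    toy_slock_lock(l);
    if (!l->owned)
        return false;           /* uncontended: 'owned' is never touched */
    toy_lock_slow(l);
    toy_slock_unlock(l);
    return true;
}

void toy_unlock_sock_fast(struct toy_sk_lock *l, bool slow)
{
    if (slow)
        toy_release_sock(l);
    else
        toy_slock_unlock(l);
}
```

The sketch makes the commit's point visible: on the fast path the caller effectively holds the 'mutex' without ever writing `owned`, which is why that path also needs to be reported to lockdep.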
  19. 26 Aug, 2021 1 commit
  20. 18 Aug, 2021 1 commit