1. 30 Apr, 2022 1 commit
  2. 29 Apr, 2022 1 commit
  3. 27 Apr, 2022 1 commit
    • E
      net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Eric Dumazet committed
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb
      frees outside of the critical section where the socket lock is held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick up the skb before the recvmsg() thread does. This issue
      is more visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if BH handler picks the skbs, they are still picked
      from the cpu on which user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.

      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead(),
      because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that in the future we can add a small per-cpu cache
      if we see any contention on sd->defer_lock.
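
      The deferral scheme described above can be pictured with a small
      single-threaded user-space model. Everything in it is a hypothetical
      stand-in (struct fake_skb, struct cpu_state, defer_free(),
      drain_defer_list(), the DEFER_MAX threshold); it is not the kernel
      code, only a sketch of "remember the allocating cpu, queue the buffer
      on that cpu's list, free it there later".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS   4
#define DEFER_MAX 2   /* hypothetical threshold before the remote cpu is "kicked" */

struct fake_skb {
    int alloc_cpu;            /* cpu that allocated the buffer */
    struct fake_skb *next;    /* single-linked list, as in v2 of the patch */
};

struct cpu_state {
    struct fake_skb *defer_list; /* buffers waiting to be freed on this cpu */
    int defer_count;
    int kicked;                  /* stands in for the IPI raising NET_RX_SOFTIRQ */
    int freed;                   /* number of buffers freed on this cpu */
};

struct cpu_state cpus[NR_CPUS];

struct fake_skb *alloc_skb_on(int cpu)
{
    struct fake_skb *skb = calloc(1, sizeof(*skb));

    assert(skb != NULL);
    skb->alloc_cpu = cpu;
    return skb;
}

/* Called by the consumer (the recvmsg() side) running on current_cpu. */
void defer_free(struct fake_skb *skb, int current_cpu)
{
    struct cpu_state *sd = &cpus[skb->alloc_cpu];

    if (skb->alloc_cpu == current_cpu) {
        /* Nothing to gain from deferring: free right here. */
        free(skb);
        cpus[current_cpu].freed++;
        return;
    }
    skb->next = sd->defer_list;
    sd->defer_list = skb;
    if (++sd->defer_count > DEFER_MAX)
        sd->kicked = 1;   /* remote cpu is not draining fast enough */
}

/* Called on cpu at the end of its receive softirq handler. */
int drain_defer_list(int cpu)
{
    struct cpu_state *sd = &cpus[cpu];
    struct fake_skb *skb = sd->defer_list;
    int n = 0;

    sd->defer_list = NULL;
    sd->defer_count = 0;
    sd->kicked = 0;
    while (skb) {
        struct fake_skb *next = skb->next;

        free(skb);
        skb = next;
        n++;
    }
    sd->freed += n;
    return n;
}
```

      In this model a consumer on cpu 0 that calls defer_free() on buffers
      allocated by cpu 1 never pays for the free itself; cpu 1 does so the
      next time it runs drain_defer_list(). The model has no locking, so
      it says nothing about the sd->defer_lock contention mentioned above.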
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      page recycling strategy used by NIC driver (its page pool capacity
      being too small compared to number of skbs/pages held in sockets
      receive queues)
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workload.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 12 Apr, 2022 1 commit
    • O
      net: remove noblock parameter from recvmsg() entities · ec095263
      Oliver Hartkopp committed
      The internal recvmsg() functions have two parameters 'flags' and 'noblock'
      that were merged inside skb_recv_datagram(). As a follow up patch to commit
      f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()")
      this patch removes the separate 'noblock' parameter for recvmsg().
      
      Analogous to the referenced patch for skb_recv_datagram(), the 'flags'
      and 'noblock' parameters are unnecessarily split up, e.g. in
      
      err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                                 flags & ~MSG_DONTWAIT, &addr_len);
      
      or in
      
      err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                            sk, msg, size, flags & MSG_DONTWAIT,
                            flags & ~MSG_DONTWAIT, &addr_len);
      
      instead of simply passing flags everywhere and checking for MSG_DONTWAIT
      where needed (to preserve the formerly separate no(n)block condition).
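
      The difference between the two calling conventions can be sketched in
      user space as follows. The functions old_recv(), new_recv() and
      wants_nonblock() are hypothetical names made up for this
      illustration; only MSG_DONTWAIT and MSG_PEEK are real constants.

```c
#include <assert.h>
#include <sys/socket.h>   /* MSG_DONTWAIT, MSG_PEEK */

/* Old style (hypothetical): the caller splits one flags word in two,
 * and the callee has to glue the halves back together. */
int old_recv(int noblock, int flags)
{
    return flags | (noblock ? MSG_DONTWAIT : 0);
}

/* New style (hypothetical): flags are passed through untouched. */
int new_recv(int flags)
{
    return flags;
}

/* The non-blocking condition is derived only where it is needed. */
int wants_nonblock(int flags)
{
    return (flags & MSG_DONTWAIT) != 0;
}
```

      Both styles carry the same information; the second one just avoids
      splitting and re-merging it at every call site.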
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  5. 11 Apr, 2022 1 commit
  6. 08 Apr, 2022 1 commit
    • J
      net: extract a few internals from netdevice.h · 6264f58c
      Jakub Kicinski committed
      There's a number of functions and static variables used
      under net/core/ but not from the outside. We currently
      dump most of them into netdevice.h. That's bad for many
      reasons:
       - netdevice.h is very cluttered, hard to figure out
         what the APIs are;
       - netdevice.h is very long;
       - we have to touch netdevice.h more which causes expensive
         incremental builds.
      
      Create a header under net/core/ and move some declarations.
      
      The new header is also a bit of a catch-all but that's
      fine; if we create more specific headers people will
      likely over-think where their declarations fit best
      and end up putting them in netdevice.h again.
      
      More work should be done on splitting netdevice.h into more
      targeted headers, but that'd be more time consuming so small
      steps.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 09 Mar, 2022 1 commit
  8. 18 Feb, 2022 2 commits
    • E
      net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57
      Eric Dumazet committed
      UDP sendmsg() can be lockless, which causes all kinds
      of data races.
      
      This patch converts sk->sk_tskey to remove one of these races.
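
      The idea of the conversion can be sketched with C11 atomics in user
      space. struct fake_sock, demo_sk and next_tskey() are hypothetical
      names for this illustration; the kernel uses its own atomic_t API
      rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the socket; the real field lives in struct sock. */
struct fake_sock {
    atomic_uint tskey;
};

/* Zero-initialized demo instance. */
struct fake_sock demo_sk;

/* Hand out the key for this datagram and advance the counter in one
 * atomic step, so two lockless senders can never observe the same key. */
unsigned int next_tskey(struct fake_sock *sk)
{
    return atomic_fetch_add(&sk->tskey, 1);
}
```

      With a plain integer, the read and the increment are separate memory
      accesses, which is the kind of race the KCSAN report below shows.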
      
      BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
      
      read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
       __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
       __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000054d -> 0x0000054e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: add sanity check in proto_register() · f20cfd66
      Eric Dumazet committed
      prot->memory_allocated should only be set if prot->sysctl_mem
      is also set.
      
      This is a followup of commit 25206111 ("crypto: af_alg - get
      rid of alg_memory_allocated").
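
      A check of this shape could look roughly like the sketch below.
      struct fake_proto and fake_proto_register() are hypothetical,
      heavily trimmed stand-ins, and the -EINVAL return value is an
      assumption of this sketch rather than something stated in the
      commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily trimmed stand-in for struct proto. */
struct fake_proto {
    long *memory_allocated;  /* accounting counter, may be NULL */
    long *sysctl_mem;        /* limits for that counter, may be NULL */
};

/* Refuse a protocol that accounts memory but has no limits to check
 * the counter against. */
int fake_proto_register(const struct fake_proto *prot)
{
    if (prot->memory_allocated && !prot->sysctl_mem)
        return -EINVAL;
    return 0;
}
```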
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220216171801.3604366-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 02 Feb, 2022 1 commit
  10. 31 Jan, 2022 2 commits
  11. 26 Jan, 2022 1 commit
  12. 17 Jan, 2022 1 commit
  13. 12 Jan, 2022 1 commit
  14. 11 Dec, 2021 1 commit
  15. 10 Dec, 2021 1 commit
  16. 25 Nov, 2021 2 commits
    • M
      net: allow SO_MARK with CAP_NET_RAW · 079925cc
      Maciej Żenczykowski committed
      A CAP_NET_RAW capable process can already spoof (on transmit) anything
      it desires via raw packet sockets...  There is no good reason to not
      allow it to also be able to play routing tricks on packets from its
      own normal sockets.
      
      There is a desire to be able to use SO_MARK for routing table selection
      (via ip rule fwmark) from within a user process without having to run
      it as root.  Granting it CAP_NET_RAW is much less dangerous than
      CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change,
      while CAP_NET_ADMIN does - by for example allowing the reconfiguration
      of the routing tables and/or bringing up/down devices).
      
      Let's keep CAP_NET_ADMIN for persistent state changes,
      while using CAP_NET_RAW for non-configuration related stuff.
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • M
      net: allow CAP_NET_RAW to setsockopt SO_PRIORITY · a1b519b7
      Maciej Żenczykowski committed
      CAP_NET_ADMIN is and should continue to be about configuring the
      system as a whole, not about configuring per-socket or per-packet
      parameters.
      Sending and receiving raw packets is what CAP_NET_RAW is all about.
      
      It can already send packets with any VLAN tag, and any IPv4 TOS
      mark, and any IPv6 TCLASS mark, simply by virtue of building
      such a raw packet.  Not to mention using any protocol and
      source/destination ip address/port tuple.
      
      These are the fields that networking gear uses to prioritize packets.
      
      Hence, a CAP_NET_RAW process is already capable of affecting traffic
      prioritization after it hits the wire.  This change makes it capable
      of affecting traffic prioritization even in the host at the nic and
      before that in the queueing disciplines (provided skb->priority is
      actually being used for prioritization, and not the TOS/TCLASS field)
      
      Hence it makes sense to allow a CAP_NET_RAW process to set the
      priority of sockets and thus packets it sends.
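
      The resulting permission rule can be modelled as a small predicate.
      This is a sketch under assumptions: the FAKE_CAP_* bits and
      may_set_priority() are hypothetical, and the unprivileged 0..6
      range is taken from the socket(7) description of SO_PRIORITY, not
      from this commit message.

```c
#include <assert.h>

/* Hypothetical capability bits for the model; not the kernel's values. */
enum {
    FAKE_CAP_NET_ADMIN = 1 << 0,
    FAKE_CAP_NET_RAW   = 1 << 1,
};

/* Returns 1 if a process holding the capabilities in caps may set the
 * socket priority to val, 0 if the request would be refused. */
int may_set_priority(int val, unsigned int caps)
{
    if (val >= 0 && val <= 6)
        return 1;   /* allowed without any capability */
    return (caps & (FAKE_CAP_NET_RAW | FAKE_CAP_NET_ADMIN)) != 0;
}
```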
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  17. 22 Nov, 2021 2 commits
  18. 16 Nov, 2021 7 commits
  19. 15 Nov, 2021 1 commit
  20. 04 Nov, 2021 1 commit
    • E
      net: fix possible NULL deref in sock_reserve_memory · d00c8ee3
      Eric Dumazet committed
      Sanity check in sock_reserve_memory() was not enough to prevent a
      malicious user from triggering a NULL deref.

      In this case, the issue is that sk_prot->memory_allocated is NULL.
      
      Use standard sk_has_account() helper to deal with this.
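
      The shape of the guard can be sketched in user space as follows. All
      types and functions here (struct fake_proto, struct fake_sock,
      fake_sk_has_account(), fake_reserve_memory()) are hypothetical
      stand-ins, and the -EOPNOTSUPP return value is an assumption of the
      sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed stand-ins for struct proto and struct sock. */
struct fake_proto {
    long *memory_allocated;   /* NULL for protocols without accounting */
};

struct fake_sock {
    const struct fake_proto *prot;
    long reserved;
};

/* A protocol "has account" only if it provides an accounting counter. */
int fake_sk_has_account(const struct fake_sock *sk)
{
    return sk->prot->memory_allocated != NULL;
}

int fake_reserve_memory(struct fake_sock *sk, long pages)
{
    if (!fake_sk_has_account(sk))
        return -EOPNOTSUPP;   /* bail out before touching the counter */
    *sk->prot->memory_allocated += pages;
    sk->reserved += pages;
    return 0;
}
```

      Without the guard, the addition would dereference the NULL counter,
      which matches the "Write of size 8 at addr 0000000000000000" in the
      report below.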
      
      BUG: KASAN: null-ptr-deref in instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
      BUG: KASAN: null-ptr-deref in atomic_long_add_return include/linux/atomic/atomic-instrumented.h:1218 [inline]
      BUG: KASAN: null-ptr-deref in sk_memory_allocated_add include/net/sock.h:1371 [inline]
      BUG: KASAN: null-ptr-deref in sock_reserve_memory net/core/sock.c:994 [inline]
      BUG: KASAN: null-ptr-deref in sock_setsockopt+0x22ab/0x2b30 net/core/sock.c:1443
      Write of size 8 at addr 0000000000000000 by task syz-executor.0/11270
      
      CPU: 1 PID: 11270 Comm: syz-executor.0 Not tainted 5.15.0-syzkaller #0
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       __kasan_report mm/kasan/report.c:446 [inline]
       kasan_report.cold+0x66/0xdf mm/kasan/report.c:459
       check_region_inline mm/kasan/generic.c:183 [inline]
       kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
       instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
       atomic_long_add_return include/linux/atomic/atomic-instrumented.h:1218 [inline]
       sk_memory_allocated_add include/net/sock.h:1371 [inline]
       sock_reserve_memory net/core/sock.c:994 [inline]
       sock_setsockopt+0x22ab/0x2b30 net/core/sock.c:1443
       __sys_setsockopt+0x4f8/0x610 net/socket.c:2172
       __do_sys_setsockopt net/socket.c:2187 [inline]
       __se_sys_setsockopt net/socket.c:2184 [inline]
       __x64_sys_setsockopt+0xba/0x150 net/socket.c:2184
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f56076d5ae9
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f5604c4b188 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 00007f56077e8f60 RCX: 00007f56076d5ae9
      RDX: 0000000000000049 RSI: 0000000000000001 RDI: 0000000000000003
      RBP: 00007f560772ff25 R08: 000000000000fec7 R09: 0000000000000000
      R10: 0000000020000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fffb61a100f R14: 00007f5604c4b300 R15: 0000000000022000
       </TASK>
      
      Fixes: 2bb2f5fb ("net: add new socket option SO_RESERVE_MEM")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 08 Oct, 2021 1 commit
  22. 30 Sep, 2021 3 commits
    • E
      af_unix: fix races in sk_peer_pid and sk_peer_cred accesses · 35306eb2
      Eric Dumazet committed
      Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
      are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
      
      In order to fix this issue, this patch adds a new spinlock that needs
      to be used whenever these fields are read or written.
      
      Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
      reading sk->sk_peer_pid which makes no sense, as this field
      is only possibly set by AF_UNIX sockets.
      We will have to clean this in a separate patch.
      This could be done by reverting b48596d1 "Bluetooth: L2CAP: Add get_peer_pid callback"
      or implementing what was truly expected.
      
      Fixes: 109f6e39 ("af_unix: Allow SO_PEERCRED to work across namespaces.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      net: add new socket option SO_RESERVE_MEM · 2bb2f5fb
      Wei Wang committed
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends fewer cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, at the cost of an amount of pre-allocated and
      unreclaimable memory, even under memory pressure.
      
      Note:
      This socket option is only available when memory cgroup is enabled and we
      require this reserved memory to be charged to the user's memcg. We hope
      this prevents misbehaving users from abusing this feature to reserve a
      large amount on certain sockets and cause unfairness for others.
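
      A user-space caller might request a reservation roughly as sketched
      below. reserve_socket_memory() is a hypothetical helper; the
      fallback value 73 for SO_RESERVE_MEM is the asm-generic number and
      is an assumption for systems whose libc headers predate the option
      (some architectures use different socket option numbers).

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_RESERVE_MEM
#define SO_RESERVE_MEM 73   /* asm-generic value; assumption for older headers */
#endif

/* Ask the kernel to reserve bytes of memory for fd.
 * Returns 0 on success or a negative errno value. */
int reserve_socket_memory(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM, &bytes, sizeof(bytes)) < 0)
        return -errno;
    return 0;
}
```

      Given the note above, the call should be expected to fail when the
      socket is not charged to a memory cgroup, so callers would normally
      treat a failure as "no reservation" rather than as a fatal error.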
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      net: introduce and use lock_sock_fast_nested() · 49054556
      Paolo Abeni committed
      Syzkaller reported a false positive deadlock involving
      the nl socket lock and the subflow socket lock:
      
      MPTCP: kernel_bind error, err=-98
      ============================================
      WARNING: possible recursive locking detected
      5.15.0-rc1-syzkaller #0 Not tainted
      --------------------------------------------
      syz-executor998/6520 is trying to acquire lock:
      ffff8880795718a0 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
      
      but task is already holding lock:
      ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
      ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(k-sk_lock-AF_INET);
        lock(k-sk_lock-AF_INET);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by syz-executor998/6520:
       #0: ffffffff8d176c50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 net/netlink/genetlink.c:802
       #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_lock net/netlink/genetlink.c:33 [inline]
       #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x3e0/0x580 net/netlink/genetlink.c:790
       #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
       #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
      
      stack backtrace:
      CPU: 1 PID: 6520 Comm: syz-executor998 Not tainted 5.15.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_deadlock_bug kernel/locking/lockdep.c:2944 [inline]
       check_deadlock kernel/locking/lockdep.c:2987 [inline]
       validate_chain kernel/locking/lockdep.c:3776 [inline]
       __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5015
       lock_acquire kernel/locking/lockdep.c:5625 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
       lock_sock_fast+0x36/0x100 net/core/sock.c:3229
       mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
       inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
       __sock_release net/socket.c:649 [inline]
       sock_release+0x87/0x1b0 net/socket.c:677
       mptcp_pm_nl_create_listen_socket+0x238/0x2c0 net/mptcp/pm_netlink.c:900
       mptcp_nl_cmd_add_addr+0x359/0x930 net/mptcp/pm_netlink.c:1170
       genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
       genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
       genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
       genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
       netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
       netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_no_sendpage+0x101/0x150 net/core/sock.c:2980
       kernel_sendpage.part.0+0x1a0/0x340 net/socket.c:3504
       kernel_sendpage net/socket.c:3501 [inline]
       sock_sendpage+0xe5/0x140 net/socket.c:1003
       pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
       splice_from_pipe_feed fs/splice.c:418 [inline]
       __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
       splice_from_pipe fs/splice.c:597 [inline]
       generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
       do_splice_from fs/splice.c:767 [inline]
       direct_splice_actor+0x110/0x180 fs/splice.c:936
       splice_direct_to_actor+0x34b/0x8c0 fs/splice.c:891
       do_splice_direct+0x1b3/0x280 fs/splice.c:979
       do_sendfile+0xae9/0x1240 fs/read_write.c:1249
       __do_sys_sendfile64 fs/read_write.c:1314 [inline]
       __se_sys_sendfile64 fs/read_write.c:1300 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1300
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f215cb69969
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffc96bb3868 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 00007f215cbad072 RCX: 00007f215cb69969
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
      RBP: 0000000000000000 R08: 00007ffc96bb3a08 R09: 00007ffc96bb3a08
      R10: 0000000100000002 R11: 0000000000000246 R12: 00007ffc96bb387c
      R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
      
      The problem originates from incorrect lock annotation in the mptcp
      code and is only visible since commit 2dcb96ba ("net: core: Correct
      the sock::sk_lock.owned lockdep annotations"), but is present since
      the port-based endpoint support initial implementation.

      This patch addresses the issue by introducing a nested variant of
      lock_sock_fast() and using it in the relevant code path.
      
      Fixes: 1729cf18 ("mptcp: create the listening socket for new port")
      Fixes: 2dcb96ba ("net: core: Correct the sock::sk_lock.owned lockdep annotations")
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-tested-by: syzbot+1dd53f7a89b299d59eaf@syzkaller.appspotmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 19 Sep, 2021 1 commit
    • T
      net: core: Correct the sock::sk_lock.owned lockdep annotations · 2dcb96ba
      Thomas Gleixner committed
      lock_sock_fast() and lock_sock_nested() contain lockdep annotations for the
      sock::sk_lock.owned 'mutex'. sock::sk_lock.owned is not a regular mutex. It
      is just lockdep wise equivalent. In fact it's an open coded trivial mutex
      implementation with some interesting features.
      
      sock::sk_lock.slock is a regular spinlock protecting the 'mutex'
      representation sock::sk_lock.owned which is a plain boolean. If 'owned' is
      true, then some other task holds the 'mutex', otherwise it is uncontended.
      As this locking construct is as exposed to lock ordering issues as
      any other locking primitive, it got lockdep annotated via a dedicated
      dependency map sock::sk_lock.dep_map which has to be updated at the lock
      and unlock sites.
      
      lock_sock_nested() is a straight forward 'mutex' lock operation:
      
        might_sleep();
        spin_lock_bh(sock::sk_lock.slock)
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
      
      The lockdep annotation for sock::sk_lock.owned is for unknown reasons
      _after_ the lock has been acquired, i.e. after the code block above and
      after releasing sock::sk_lock.slock, but inside the bottom halves disabled
      region:
      
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
      
      The placement after the unlock is obvious because otherwise the
      mutex_acquire() would nest into the spin lock held region.
      
      But that's from the lockdep perspective still the wrong place:
      
       1) The mutex_acquire() is issued _after_ the successful acquisition,
          which is pointless because in a deadlock scenario this point is
          never reached. That means that if the deadlock is the first
          instance of exposing the wrong lock order, lockdep does not have a
          chance to detect it.
      
       2) It only works because lockdep is rather lax on the context from which
          the mutex_acquire() is issued. Acquiring a mutex inside a bottom halves
          and therefore non-preemptible region is obviously invalid, except for a
          trylock which is clearly not the case here.
      
          This 'works' stops working on RT enabled kernels where the bottom halves
          serialization is done via a local lock, which exposes this misplacement
          because the 'mutex' and the local lock nest the wrong way around and
          lockdep complains rightfully about a lock inversion.
      
      The placement is wrong since the initial commit a5b5bb9a ("[PATCH]
      lockdep: annotate sk_locks") which introduced this.
      
      Fix it by moving the mutex_acquire() in front of the actual lock
      acquisition, which is what the regular mutex_lock() operation does as well.
      
      lock_sock_fast() is not that straight forward. It looks at the first glance
      like a convoluted trylock operation:
      
        spin_lock_bh(sock::sk_lock.slock)
        if (!sock::sk_lock.owned)
            return false;
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
        return true;
      
      But that's not the case: lock_sock_fast() is an interesting optimization
      for short critical sections which can run with bottom halves disabled and
      sock::sk_lock.slock held. This allows to shortcut the 'mutex' operation in
      the non contended case by preventing other lockers to acquire
      sock::sk_lock.owned because they are blocked on sock::sk_lock.slock, which
      in turn avoids the overhead of doing the heavy processing in release_sock()
      including waking up wait queue waiters.
      
      In the contended case, i.e. when sock::sk_lock.owned == true the behavior
      is the same as lock_sock_nested().
      
      Semantically this shortcut means, that the task acquired the 'mutex' even
      if it does not touch the sock::sk_lock.owned field in the non-contended
      case. Not telling lockdep about this shortcut acquisition is hiding
      potential lock ordering violations in the fast path.
      
      As a consequence the same reasoning as for the above lock_sock_nested()
      case vs. the placement of the lockdep annotation applies.
      
      The current placement of the lockdep annotation was just copied from
      the original lock_sock(), now renamed to lock_sock_nested(),
      implementation.
      
      Fix this by moving the mutex_acquire() in front of the actual lock
      acquisition and adding the corresponding mutex_release() into
      unlock_sock_fast(). Also document the fast path return case with a comment.
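
      The fast and slow paths described above can be followed in a
      single-threaded toy model. Everything here is hypothetical (struct
      fake_sk_lock and the fake_* functions); in particular
      fake_wait_for_owner() merely pretends that the other owner released
      the lock, and lockdep_depth is only a counter standing in for the
      mutex_acquire()/mutex_release() annotations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of sock::sk_lock; not the kernel structure. */
struct fake_sk_lock {
    bool slock_held;    /* models the spinlock, bottom halves disabled */
    bool owned;         /* models the 'mutex' flag */
    int  lockdep_depth; /* models the dep_map: acquire ++, release -- */
};

/* Stand-in for sleeping until the current owner releases the socket. */
void fake_wait_for_owner(struct fake_sk_lock *l)
{
    l->owned = false;
}

/* Returns true when the slow path was taken. */
bool fake_lock_sock_fast(struct fake_sk_lock *l)
{
    l->lockdep_depth++;        /* annotate before any acquisition */
    l->slock_held = true;
    if (!l->owned)
        return false;          /* fast path: keep slock, owned untouched */

    l->slock_held = false;     /* slow path: same as lock_sock_nested() */
    fake_wait_for_owner(l);
    l->slock_held = true;
    l->owned = true;
    l->slock_held = false;
    return true;
}

void fake_unlock_sock_fast(struct fake_sk_lock *l, bool slow)
{
    if (slow)
        l->owned = false;      /* full release of the 'mutex' */
    else
        l->slock_held = false; /* only the spinlock was taken */
    l->lockdep_depth--;        /* the release annotation this patch adds */
}
```

      In both paths the model balances lockdep_depth, which is the point
      of the fix: the fast path counts as an acquisition of the 'mutex'
      even though the owned flag is never touched.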
      Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: netdev@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 26 Aug, 2021 1 commit
  25. 18 Aug, 2021 1 commit
  26. 04 Aug, 2021 1 commit
    • P
      sock: allow reading and changing sk_userlocks with setsockopt · 04190bf8
      Pavel Tikhomirov committed
      SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket
      buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and
      tcp_sndbuf_expand()). If we've just created a new socket this adjustment
      is enabled on it, but if one changes the socket buffer size by
      setsockopt(SO_{SND,RCV}BUF*) it becomes disabled.
      
      CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
      restore as it first needs to increase buffer sizes for packet queues
      restore and second it needs to restore back original buffer sizes. So
      after CRIU restore all sockets become non-auto-adjustable, which can
      decrease network performance of restored applications significantly.
      
      CRIU needs to be able to restore sockets with enabled/disabled adjustment
      to the same state they were in before dump, so let's add a special
      setsockopt for it.
      
      Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so
      that using this interface one can re-enable automatic socket buffer
      adjustment on their sockets.
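
      How such a setsockopt could update the lock bits can be modelled on
      a plain integer. This is a sketch under assumptions: set_buf_lock()
      and get_buf_lock() are hypothetical helpers, and the numeric flag
      values and the SOCK_BUF_LOCK_MASK name are defined locally for the
      model instead of being taken from a kernel header.

```c
#include <assert.h>
#include <errno.h>

/* Local definitions for the model (assumed values). */
#define SOCK_SNDBUF_LOCK   1u
#define SOCK_RCVBUF_LOCK   2u
#define SOCK_BUF_LOCK_MASK (SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK)

/* Replace only the two buffer-lock bits in *userlocks with val,
 * leaving any other lock bits as they are. */
int set_buf_lock(unsigned int *userlocks, unsigned int val)
{
    if (val & ~SOCK_BUF_LOCK_MASK)
        return -EINVAL;   /* only the buffer-lock bits may be changed */
    *userlocks = val | (*userlocks & ~SOCK_BUF_LOCK_MASK);
    return 0;
}

/* Report only the two buffer-lock bits. */
unsigned int get_buf_lock(unsigned int userlocks)
{
    return userlocks & SOCK_BUF_LOCK_MASK;
}
```

      Passing 0 clears both bits, which in this model corresponds to
      re-enabling automatic buffer adjustment after a restore.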
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 29 Jul, 2021 1 commit
  28. 08 Jul, 2021 1 commit