1. 16 11月, 2021 6 次提交
  2. 02 11月, 2021 1 次提交
  3. 28 10月, 2021 3 次提交
  4. 27 10月, 2021 1 次提交
  5. 26 10月, 2021 5 次提交
  6. 15 10月, 2021 1 次提交
    • E
      tcp: switch orphan_count to bare per-cpu counters · 19757ceb
      Eric Dumazet 提交于
      Use of percpu_counter structure to track count of orphaned
      sockets is causing problems on modern hosts with 256 cpus
      or more.
      
      Stefan Bach reported a serious spinlock contention in real workloads,
      that I was able to reproduce with a netfilter rule dropping
      incoming FIN packets.
      
          53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
                  |
                  ---queued_spin_lock_slowpath
                     |
                      --53.51%--_raw_spin_lock_irqsave
                                |
                                 --53.51%--__percpu_counter_sum
                                           tcp_check_oom
                                           |
                                           |--39.03%--__tcp_close
                                           |          tcp_close
                                           |          inet_release
                                           |          inet6_release
                                           |          sock_close
                                           |          __fput
                                           |          ____fput
                                           |          task_work_run
                                           |          exit_to_usermode_loop
                                           |          do_syscall_64
                                           |          entry_SYSCALL_64_after_hwframe
                                           |          __GI___libc_close
                                           |
                                            --14.48%--tcp_out_of_resources
                                                      tcp_write_timeout
                                                      tcp_retransmit_timer
                                                      tcp_write_timer_handler
                                                      tcp_write_timer
                                                      call_timer_fn
                                                      expire_timers
                                                      __run_timers
                                                      run_timer_softirq
                                                      __softirqentry_text_start
      
      As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter
      batch for dst entries accounting"), default batch size is too big
      for the default value of tcp_max_orphans (262144).
      
      But even if we reduce batch sizes, there would still be cases
      where the estimated count of orphans is beyond the limit,
      and where tcp_too_many_orphans() has to call the expensive
      percpu_counter_sum_positive().
      
      One solution is to use plain per-cpu counters, and have
      a timer to periodically refresh this cache.
      
      Updating this cache every 100ms seems about right, tcp pressure
      state is not radically changing over shorter periods.
      
      percpu_counter was nice 15 years ago while hosts had less
      than 16 cpus, not anymore by current standards.
      
      v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
          reported by kernel test robot <lkp@intel.com>
          Remove unused socket argument from tcp_too_many_orphans()
      
      Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NStefan Bach <sfb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19757ceb
  7. 08 10月, 2021 1 次提交
  8. 02 10月, 2021 1 次提交
  9. 30 9月, 2021 4 次提交
    • E
      af_unix: fix races in sk_peer_pid and sk_peer_cred accesses · 35306eb2
      Eric Dumazet 提交于
      Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
      are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
      
      In order to fix this issue, this patch adds a new spinlock that needs
      to be used whenever these fields are read or written.
      
      Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
      reading sk->sk_peer_pid which makes no sense, as this field
      is only possibly set by AF_UNIX sockets.
      We will have to clean this in a separate patch.
      This could be done by reverting b48596d1 "Bluetooth: L2CAP: Add get_peer_pid callback"
      or implementing what was truly expected.
      
      Fixes: 109f6e39 ("af_unix: Allow SO_PEERCRED to work across namespaces.")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJann Horn <jannh@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35306eb2
    • W
      tcp: adjust sndbuf according to sk_reserved_mem · ca057051
      Wei Wang 提交于
      If user sets SO_RESERVE_MEM socket option, in order to fully utilize the
      reserved memory in memory pressure state on the tx path, we modify the
      logic in sk_stream_moderate_sndbuf() to set sk_sndbuf according to
      available reserved memory, instead of MIN_SOCK_SNDBUF, and adjust it
      when new data is acked.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca057051
    • W
      net: add new socket option SO_RESERVE_MEM · 2bb2f5fb
      Wei Wang 提交于
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends less cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, with the cost of an amount of pre-allocated and
      unreclaimable memory, even under memory pressure.
      
      Note:
      This socket option is only available when memory cgroup is enabled and we
      require this reserved memory to be charged to the user's memcg. We hope
      this could avoid mis-behaving users to abused this feature to reserve a
      large amount on certain sockets and cause unfairness for others.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bb2f5fb
    • P
      net: introduce and use lock_sock_fast_nested() · 49054556
      Paolo Abeni 提交于
      Syzkaller reported a false positive deadlock involving
      the nl socket lock and the subflow socket lock:
      
      MPTCP: kernel_bind error, err=-98
      ============================================
      WARNING: possible recursive locking detected
      5.15.0-rc1-syzkaller #0 Not tainted
      --------------------------------------------
      syz-executor998/6520 is trying to acquire lock:
      ffff8880795718a0 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
      
      but task is already holding lock:
      ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
      ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(k-sk_lock-AF_INET);
        lock(k-sk_lock-AF_INET);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by syz-executor998/6520:
       #0: ffffffff8d176c50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 net/netlink/genetlink.c:802
       #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_lock net/netlink/genetlink.c:33 [inline]
       #1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x3e0/0x580 net/netlink/genetlink.c:790
       #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
       #2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
      
      stack backtrace:
      CPU: 1 PID: 6520 Comm: syz-executor998 Not tainted 5.15.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_deadlock_bug kernel/locking/lockdep.c:2944 [inline]
       check_deadlock kernel/locking/lockdep.c:2987 [inline]
       validate_chain kernel/locking/lockdep.c:3776 [inline]
       __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5015
       lock_acquire kernel/locking/lockdep.c:5625 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
       lock_sock_fast+0x36/0x100 net/core/sock.c:3229
       mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
       inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
       __sock_release net/socket.c:649 [inline]
       sock_release+0x87/0x1b0 net/socket.c:677
       mptcp_pm_nl_create_listen_socket+0x238/0x2c0 net/mptcp/pm_netlink.c:900
       mptcp_nl_cmd_add_addr+0x359/0x930 net/mptcp/pm_netlink.c:1170
       genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
       genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
       genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
       genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
       netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
       netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_no_sendpage+0x101/0x150 net/core/sock.c:2980
       kernel_sendpage.part.0+0x1a0/0x340 net/socket.c:3504
       kernel_sendpage net/socket.c:3501 [inline]
       sock_sendpage+0xe5/0x140 net/socket.c:1003
       pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
       splice_from_pipe_feed fs/splice.c:418 [inline]
       __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
       splice_from_pipe fs/splice.c:597 [inline]
       generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
       do_splice_from fs/splice.c:767 [inline]
       direct_splice_actor+0x110/0x180 fs/splice.c:936
       splice_direct_to_actor+0x34b/0x8c0 fs/splice.c:891
       do_splice_direct+0x1b3/0x280 fs/splice.c:979
       do_sendfile+0xae9/0x1240 fs/read_write.c:1249
       __do_sys_sendfile64 fs/read_write.c:1314 [inline]
       __se_sys_sendfile64 fs/read_write.c:1300 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1300
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f215cb69969
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffc96bb3868 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 00007f215cbad072 RCX: 00007f215cb69969
      RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
      RBP: 0000000000000000 R08: 00007ffc96bb3a08 R09: 00007ffc96bb3a08
      R10: 0000000100000002 R11: 0000000000000246 R12: 00007ffc96bb387c
      R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
      
      the problem originates from uncorrect lock annotation in the mptcp
      code and is only visible since commit 2dcb96ba ("net: core: Correct
      the sock::sk_lock.owned lockdep annotations"), but is present since
      the port-based endpoint support initial implementation.
      
      This patch addresses the issue introducing a nested variant of
      lock_sock_fast() and using it in the relevant code path.
      
      Fixes: 1729cf18 ("mptcp: create the listening socket for new port")
      Fixes: 2dcb96ba ("net: core: Correct the sock::sk_lock.owned lockdep annotations")
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Reported-and-tested-by: syzbot+1dd53f7a89b299d59eaf@syzkaller.appspotmail.com
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49054556
  10. 23 9月, 2021 1 次提交
  11. 19 9月, 2021 1 次提交
    • T
      net: core: Correct the sock::sk_lock.owned lockdep annotations · 2dcb96ba
      Thomas Gleixner 提交于
      lock_sock_fast() and lock_sock_nested() contain lockdep annotations for the
      sock::sk_lock.owned 'mutex'. sock::sk_lock.owned is not a regular mutex. It
      is just lockdep wise equivalent. In fact it's an open coded trivial mutex
      implementation with some interesting features.
      
      sock::sk_lock.slock is a regular spinlock protecting the 'mutex'
      representation sock::sk_lock.owned which is a plain boolean. If 'owned' is
      true, then some other task holds the 'mutex', otherwise it is uncontended.
      As this locking construct is obviously endangered by lock ordering issues as
      any other locking primitive it got lockdep annotated via a dedicated
      dependency map sock::sk_lock.dep_map which has to be updated at the lock
      and unlock sites.
      
      lock_sock_nested() is a straight forward 'mutex' lock operation:
      
        might_sleep();
        spin_lock_bh(sock::sk_lock.slock)
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
      
      The lockdep annotation for sock::sk_lock.owned is for unknown reasons
      _after_ the lock has been acquired, i.e. after the code block above and
      after releasing sock::sk_lock.slock, but inside the bottom halves disabled
      region:
      
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
      
      The placement after the unlock is obvious because otherwise the
      mutex_acquire() would nest into the spin lock held region.
      
      But that's from the lockdep perspective still the wrong place:
      
       1) The mutex_acquire() is issued _after_ the successful acquisition which
          is pointless because in a dead lock scenario this point is never
          reached which means that if the deadlock is the first instance of
          exposing the wrong lock order lockdep does not have a chance to detect
          it.
      
       2) It only works because lockdep is rather lax on the context from which
          the mutex_acquire() is issued. Acquiring a mutex inside a bottom halves
          and therefore non-preemptible region is obviously invalid, except for a
          trylock which is clearly not the case here.
      
          This 'works' stops working on RT enabled kernels where the bottom halves
          serialization is done via a local lock, which exposes this misplacement
          because the 'mutex' and the local lock nest the wrong way around and
          lockdep complains rightfully about a lock inversion.
      
      The placement is wrong since the initial commit a5b5bb9a ("[PATCH]
      lockdep: annotate sk_locks") which introduced this.
      
      Fix it by moving the mutex_acquire() in front of the actual lock
      acquisition, which is what the regular mutex_lock() operation does as well.
      
      lock_sock_fast() is not that straight forward. It looks at the first glance
      like a convoluted trylock operation:
      
        spin_lock_bh(sock::sk_lock.slock)
        if (!sock::sk_lock.owned)
            return false;
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
        return true;
      
      But that's not the case: lock_sock_fast() is an interesting optimization
      for short critical sections which can run with bottom halves disabled and
      sock::sk_lock.slock held. This allows to shortcut the 'mutex' operation in
      the non contended case by preventing other lockers to acquire
      sock::sk_lock.owned because they are blocked on sock::sk_lock.slock, which
      in turn avoids the overhead of doing the heavy processing in release_sock()
      including waking up wait queue waiters.
      
      In the contended case, i.e. when sock::sk_lock.owned == true the behavior
      is the same as lock_sock_nested().
      
      Semantically this shortcut means, that the task acquired the 'mutex' even
      if it does not touch the sock::sk_lock.owned field in the non-contended
      case. Not telling lockdep about this shortcut acquisition is hiding
      potential lock ordering violations in the fast path.
      
      As a consequence the same reasoning as for the above lock_sock_nested()
      case vs. the placement of the lockdep annotation applies.
      
      The current placement of the lockdep annotation was just copied from
      the original lock_sock(), now renamed to lock_sock_nested(),
      implementation.
      
      Fix this by moving the mutex_acquire() in front of the actual lock
      acquisition and adding the corresponding mutex_release() into
      unlock_sock_fast(). Also document the fast path return case with a comment.
      Reported-by: NSebastian Siewior <bigeasy@linutronix.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: netdev@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2dcb96ba
  12. 26 8月, 2021 1 次提交
  13. 18 8月, 2021 1 次提交
  14. 04 8月, 2021 1 次提交
    • P
      sock: allow reading and changing sk_userlocks with setsockopt · 04190bf8
      Pavel Tikhomirov 提交于
      SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket
      buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and
      tcp_sndbuf_expand()). If we've just created a new socket this adjustment
      is enabled on it, but if one changes the socket buffer size by
      setsockopt(SO_{SND,RCV}BUF*) it becomes disabled.
      
      CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
      restore as it first needs to increase buffer sizes for packet queues
      restore and second it needs to restore back original buffer sizes. So
      after CRIU restore all sockets become non-auto-adjustable, which can
      decrease network performance of restored applications significantly.
      
      CRIU need to be able to restore sockets with enabled/disabled adjustment
      to the same state it was before dump, so let's add special setsockopt
      for it.
      
      Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so
      that using these interface one can reenable automatic socket buffer
      adjustment on their sockets.
      Signed-off-by: NPavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04190bf8
  15. 29 7月, 2021 1 次提交
    • P
      skbuff: allow 'slow_gro' for skb carring sock reference · 5e10da53
      Paolo Abeni 提交于
      This change leverages the infrastructure introduced by the previous
      patches to allow soft devices passing to the GRO engine owned skbs
      without impacting the fast-path.
      
      It's up to the GRO caller ensuring the slow_gro bit validity before
      invoking the GRO engine. The new helper skb_prepare_for_gro() is
      introduced for that goal.
      
      On slow_gro, skbs are aggregated only with equal sk.
      Additionally, skb truesize on GRO recycle and free is correctly
      updated so that sk wmem is not changed by the GRO processing.
      
      rfc-> v1:
       - fixed bad truesize on dev_gro_receive NAPI_FREE
       - use the existing state bit
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e10da53
  16. 02 7月, 2021 1 次提交
    • Y
      net: sock: extend SO_TIMESTAMPING for PHC binding · d463126e
      Yangbo Lu 提交于
      Since PTP virtual clock support is added, there can be
      several PTP virtual clocks based on one PTP physical
      clock for timestamping.
      
      This patch is to extend SO_TIMESTAMPING API to support
      PHC (PTP Hardware Clock) binding by adding a new flag
      SOF_TIMESTAMPING_BIND_PHC. When PTP virtual clocks are
      in use, user space can configure to bind one for
      timestamping, but PTP physical clock is not supported
      and not needed to bind.
      
      This patch is preparation for timestamp conversion from
      raw timestamp to a specific PTP virtual clock time in
      core net.
      Signed-off-by: NYangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d463126e
  17. 30 6月, 2021 1 次提交
  18. 11 6月, 2021 2 次提交
    • E
      inet: annotate date races around sk->sk_txhash · b71eaed8
      Eric Dumazet 提交于
      UDP sendmsg() path can be lockless, it is possible for another
      thread to re-connect an change sk->sk_txhash under us.
      
      There is no serious impact, but we can use READ_ONCE()/WRITE_ONCE()
      pair to document the race.
      
      BUG: KCSAN: data-race in __ip4_datagram_connect / skb_set_owner_w
      
      write to 0xffff88813397920c of 4 bytes by task 30997 on cpu 1:
       sk_set_txhash include/net/sock.h:1937 [inline]
       __ip4_datagram_connect+0x69e/0x710 net/ipv4/datagram.c:75
       __ip6_datagram_connect+0x551/0x840 net/ipv6/datagram.c:189
       ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
       inet_dgram_connect+0xfd/0x180 net/ipv4/af_inet.c:580
       __sys_connect_file net/socket.c:1837 [inline]
       __sys_connect+0x245/0x280 net/socket.c:1854
       __do_sys_connect net/socket.c:1864 [inline]
       __se_sys_connect net/socket.c:1861 [inline]
       __x64_sys_connect+0x3d/0x50 net/socket.c:1861
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88813397920c of 4 bytes by task 31039 on cpu 0:
       skb_set_hash_from_sk include/net/sock.h:2211 [inline]
       skb_set_owner_w+0x118/0x220 net/core/sock.c:2101
       sock_alloc_send_pskb+0x452/0x4e0 net/core/sock.c:2359
       sock_alloc_send_skb+0x2d/0x40 net/core/sock.c:2373
       __ip6_append_data+0x1743/0x21a0 net/ipv6/ip6_output.c:1621
       ip6_make_skb+0x258/0x420 net/ipv6/ip6_output.c:1983
       udpv6_sendmsg+0x160a/0x16b0 net/ipv6/udp.c:1527
       inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0xbca3c43d -> 0xfdb309e0
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 31039 Comm: syz-executor.2 Not tainted 5.13.0-rc3-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b71eaed8
    • E
      net: annotate data race in sock_error() · f13ef100
      Eric Dumazet 提交于
      sock_error() is known to be racy. The code avoids
      an atomic operation is sk_err is zero, and this field
      could be changed under us, this is fine.
      
      Sysbot reported:
      
      BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock
      
      write to 0xffff888131855630 of 4 bytes by task 9365 on cpu 1:
       unix_release_sock+0x2e9/0x6e0 net/unix/af_unix.c:550
       unix_release+0x2f/0x50 net/unix/af_unix.c:859
       __sock_release net/socket.c:599 [inline]
       sock_close+0x6c/0x150 net/socket.c:1258
       __fput+0x25b/0x4e0 fs/file_table.c:280
       ____fput+0x11/0x20 fs/file_table.c:313
       task_work_run+0xae/0x130 kernel/task_work.c:164
       tracehook_notify_resume include/linux/tracehook.h:189 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
       exit_to_user_mode_prepare+0x156/0x190 kernel/entry/common.c:208
       __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
       syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:301
       do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888131855630 of 4 bytes by task 9385 on cpu 0:
       sock_error include/net/sock.h:2269 [inline]
       sock_alloc_send_pskb+0xe4/0x4e0 net/core/sock.c:2336
       unix_dgram_sendmsg+0x478/0x1610 net/unix/af_unix.c:1671
       unix_seqpacket_sendmsg+0xc2/0x100 net/unix/af_unix.c:2055
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       __sys_sendmsg_sock+0x25/0x30 net/socket.c:2416
       io_sendmsg fs/io_uring.c:4367 [inline]
       io_issue_sqe+0x231a/0x6750 fs/io_uring.c:6135
       __io_queue_sqe+0xe9/0x360 fs/io_uring.c:6414
       __io_req_task_submit fs/io_uring.c:2039 [inline]
       io_async_task_func+0x312/0x590 fs/io_uring.c:5074
       __tctx_task_work fs/io_uring.c:1910 [inline]
       tctx_task_work+0x1d4/0x3d0 fs/io_uring.c:1924
       task_work_run+0xae/0x130 kernel/task_work.c:164
       tracehook_notify_signal include/linux/tracehook.h:212 [inline]
       handle_signal_work kernel/entry/common.c:145 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0xf8/0x190 kernel/entry/common.c:208
       __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
       syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:301
       do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00000000 -> 0x00000068
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 9385 Comm: syz-executor.3 Not tainted 5.13.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f13ef100
  19. 05 6月, 2021 2 次提交
  20. 13 5月, 2021 1 次提交
  21. 12 4月, 2021 1 次提交
  22. 02 4月, 2021 1 次提交
  23. 01 4月, 2021 1 次提交
  24. 31 3月, 2021 1 次提交
    • P
      net: let skb_orphan_partial wake-up waiters. · 9adc89af
      Paolo Abeni 提交于
      Currently the mentioned helper can end-up freeing the socket wmem
      without waking-up any processes waiting for more write memory.
      
      If the partially orphaned skb is attached to an UDP (or raw) socket,
      the lack of wake-up can hang the user-space.
      
      Even for TCP sockets not calling the sk destructor could have bad
      effects on TSQ.
      
      Address the issue using skb_orphan to release the sk wmem before
      setting the new sock_efree destructor. Additionally bundle the
      whole ownership update in a new helper, so that later other
      potential users could avoid duplicate code.
      
      v1 -> v2:
       - use skb_orphan() instead of sort of open coding it (Eric)
       - provide an helper for the ownership change (Eric)
      
      Fixes: f6ba8d33 ("netem: fix skb_orphan_partial()")
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9adc89af