1. 24 Aug 2022, 2 commits
  2. 06 Jul 2022, 1 commit
  3. 13 Jun 2022, 1 commit
  4. 11 Jun 2022, 3 commits
  5. 10 Jun 2022, 1 commit
  6. 16 May 2022, 2 commits
    • net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if · e5fccaa1
      Committed by Eric Dumazet
      sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
      because other cpus/threads might locklessly read this field.
      
      sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
      because they run without socket lock held.
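      A minimal sketch of the annotation pattern, assuming simplified function
      bodies (the real implementations in net/core/sock.c do more work):

          /* Writer path: runs with the socket lock held. */
          static int sock_bindtoindex_locked(struct sock *sk, int ifindex)
          {
                  /* Paired with READ_ONCE() in lockless readers. */
                  WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
                  return 0;
          }

          /* Reader path: may run without the socket lock. */
          static int sock_getbindtodevice(struct sock *sk)
          {
                  return READ_ONCE(sk->sk_bound_dev_if);
          }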
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: allow gso_max_size to exceed 65536 · 7c4e983c
      Committed by Alexander Duyck
      The code for gso_max_size was originally added to allow for debugging
      and working around buggy devices that couldn't support TSO with blocks
      64K in size. It was limited to 64K because that was the existing limit
      of the IPv4 and non-jumbogram IPv6 length fields.
      
      With the addition of Big TCP we can remove this limit and allow the value
      to potentially go up to UINT_MAX and instead be limited by the tso_max_size
      value.
      
      So in order to support this we need to go through and clean up the
      remaining users of the gso_max_size value so that the values will cap at
      64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
      so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
      limit for GSO_MAX_SIZE.
      
      v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                     in a new sk_trim_gso_size() helper.
                     netif_set_tso_max_size() caps the requested TSO size
                     with GSO_MAX_SIZE.
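      A hedged sketch of the resulting capping logic; the constant names are
      taken from this commit's description, while the helper body below is
      only an approximation of the real sk_trim_gso_size():

          #define GSO_LEGACY_MAX_SIZE     65536u  /* old 64K limit */
          /* GSO_MAX_SIZE itself is now allowed to go up to UINT_MAX. */

          static void sk_trim_gso_size(struct sock *sk)
          {
                  if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE)
                          return;
                  /* Only IPv6/TCP (BIG TCP) flows may exceed the legacy 64K cap. */
                  if (sk->sk_family == AF_INET6 && sk_is_tcp(sk))
                          return;
                  sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
          }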
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 06 May 2022, 1 commit
  8. 01 May 2022, 3 commits
  9. 30 Apr 2022, 1 commit
  10. 29 Apr 2022, 1 commit
  11. 27 Apr 2022, 1 commit
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Committed by Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb frees
      outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after the
      skb payload has been consumed, meaning that the BH handler has no
      chance to pick up the skb before the recvmsg() thread does. This issue
      is more visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks the skbs, they are still picked
      on the cpu on which the user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) case where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead(),
      because skbs in this list have no requirement on how quickly they
      should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
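      A simplified sketch of the deferral path; the field names (alloc_cpu,
      sd->defer_list, sd->defer_lock, sd->defer_csd) follow the description
      above, but the body is illustrative rather than the exact kernel code:

          static void skb_attempt_defer_free(struct sk_buff *skb)
          {
                  int cpu = skb->alloc_cpu;
                  struct softnet_data *sd;
                  bool kick;

                  /* v2 note: do not defer if we are already on the right cpu. */
                  if (cpu == raw_smp_processor_id()) {
                          __kfree_skb(skb);
                          return;
                  }

                  sd = &per_cpu(softnet_data, cpu);
                  spin_lock_bh(&sd->defer_lock);
                  /* Single-linked list; lockless readers use READ_ONCE(). */
                  skb->next = sd->defer_list;
                  WRITE_ONCE(sd->defer_list, skb);
                  kick = sd->defer_count++ == 0;
                  spin_unlock_bh(&sd->defer_lock);

                  /* Raise NET_RX_SOFTIRQ on the remote cpu via IPI if needed. */
                  if (unlikely(kick))
                          smp_call_function_single_async(cpu, &sd->defer_csd);
          }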
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      page recycling strategy used by NIC driver (its page pool capacity
      being too small compared to number of skbs/pages held in sockets
      receive queues)
      
      Note that this tuning was only done to demonstrate worse conditions
      for skb freeing for this particular test. These conditions can happen
      in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() show a
      high cost for skb-freeing-related functions (*):
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on the cpu running the user thread's recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  12. 12 Apr 2022, 1 commit
    • net: remove noblock parameter from recvmsg() entities · ec095263
      Committed by Oliver Hartkopp
      The internal recvmsg() functions have two parameters, 'flags' and
      'noblock', that were merged inside skb_recv_datagram(). As a follow-up
      to commit f4b41f06 ("net: remove noblock parameter from
      skb_recv_datagram()"), this patch removes the separate 'noblock'
      parameter for recvmsg().
      
      Analogous to the referenced patch for skb_recv_datagram(), the 'flags'
      and 'noblock' parameters are unnecessarily split up, with e.g.
      
      err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                                 flags & ~MSG_DONTWAIT, &addr_len);
      
      or in
      
      err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                            sk, msg, size, flags & MSG_DONTWAIT,
                            flags & ~MSG_DONTWAIT, &addr_len);
      
      instead of simply using only 'flags' all the time and checking for
      MSG_DONTWAIT where needed (to preserve the formerly separate
      no(n)block condition).
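      With the extra parameter gone, a call site collapses to something like
      the sketch below; sock_rcvtimeo() is used here only to illustrate how
      a callee can still derive the nonblocking behaviour from 'flags':

          err = sk->sk_prot->recvmsg(sk, msg, size, flags, &addr_len);

          /* inside the protocol's recvmsg() implementation: */
          timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);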
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  13. 11 Apr 2022, 1 commit
  14. 08 Apr 2022, 1 commit
    • net: extract a few internals from netdevice.h · 6264f58c
      Committed by Jakub Kicinski
      There are a number of functions and static variables used
      under net/core/ but not from the outside. We currently
      dump most of them into netdevice.h. That's bad for many
      reasons:
       - netdevice.h is very cluttered, hard to figure out
         what the APIs are;
       - netdevice.h is very long;
       - we have to touch netdevice.h more often, which causes
         expensive incremental builds.
      
      Create a header under net/core/ and move some declarations.
      
      The new header is also a bit of a catch-all, but that's fine; if we
      create more specific headers, people will likely over-think where
      their declarations fit best and end up putting them in netdevice.h
      again.
      
      More work should be done on splitting netdevice.h into more targeted
      headers, but that'd be more time-consuming, so: small steps.
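      A hedged sketch of what such an internal header can look like; the
      path and the declaration below are illustrative, not a copy of the
      file this commit actually creates:

          /* net/core/dev.h - declarations private to net/core/ */
          #ifndef _NET_CORE_DEV_H
          #define _NET_CORE_DEV_H

          struct net_device;
          struct net;

          int __dev_change_net_namespace(struct net_device *dev, struct net *net,
                                         const char *pat, int new_ifindex);

          #endif /* _NET_CORE_DEV_H */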
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. 09 Mar 2022, 1 commit
  16. 18 Feb 2022, 2 commits
    • net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57
      Committed by Eric Dumazet
      UDP sendmsg() can be lockless; this is causing all kinds
      of data races.
      
      This patch converts sk->sk_tskey to remove one of these races.
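      A hedged before/after sketch of the conversion (the field name is from
      the commit, the surrounding code is simplified from the
      __ip_append_data() usage):

          /* Before: plain increment, racy when sendmsg() runs locklessly. */
          u32 tskey = sk->sk_tskey++;

          /* After: sk_tskey is an atomic_t, so concurrent senders get unique keys. */
          u32 tskey = atomic_inc_return(&sk->sk_tskey) - 1;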
      
      BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
      
      read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
       __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
       __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000054d -> 0x0000054e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add sanity check in proto_register() · f20cfd66
      Committed by Eric Dumazet
      prot->memory_allocated should only be set if prot->sysctl_mem
      is also set.
      
      This is a follow-up to commit 25206111 ("crypto: af_alg - get
      rid of alg_memory_allocated").
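      A sketch of the check, assuming it sits at the top of proto_register()
      (the exact message and placement in the real patch may differ):

          int proto_register(struct proto *prot, int alloc_slab)
          {
                  if (prot->memory_allocated && !prot->sysctl_mem) {
                          pr_err("%s: missing sysctl_mem\n", prot->name);
                          return -EINVAL;
                  }
                  /* ... existing registration logic follows ... */
                  return 0;
          }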
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220216171801.3604366-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  17. 02 Feb 2022, 1 commit
  18. 31 Jan 2022, 2 commits
  19. 26 Jan 2022, 1 commit
  20. 17 Jan 2022, 1 commit
  21. 12 Jan 2022, 1 commit
  22. 11 Dec 2021, 1 commit
  23. 10 Dec 2021, 1 commit
  24. 25 Nov 2021, 2 commits
    • net: allow SO_MARK with CAP_NET_RAW · 079925cc
      Committed by Maciej Żenczykowski
      A CAP_NET_RAW capable process can already spoof (on transmit) anything
      it desires via raw packet sockets, so there is no good reason not to
      also allow it to play routing tricks on packets from its own normal
      sockets.
      
      There is a desire to be able to use SO_MARK for routing table selection
      (via ip rule fwmark) from within a user process without having to run
      it as root.  Granting it CAP_NET_RAW is much less dangerous than
      CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change,
      while CAP_NET_ADMIN does - by for example allowing the reconfiguration
      of the routing tables and/or bringing up/down devices).
      
      Let's keep CAP_NET_ADMIN for persistent state changes,
      while using CAP_NET_RAW for non-configuration related stuff.
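      A hedged userspace sketch of the use case described above: a process
      holding CAP_NET_RAW (but not CAP_NET_ADMIN) marks its traffic so that
      an "ip rule fwmark" rule can select a routing table; the mark value is
      arbitrary and error handling is trimmed:

          #include <sys/socket.h>
          #include <stdio.h>

          int main(void)
          {
                  int fd = socket(AF_INET, SOCK_DGRAM, 0);
                  unsigned int mark = 0x1;  /* matched by an "ip rule fwmark 0x1" rule */

                  /* With this patch, CAP_NET_RAW is sufficient for SO_MARK. */
                  if (setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark)) < 0)
                          perror("setsockopt(SO_MARK)");
                  return 0;
          }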
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: allow CAP_NET_RAW to setsockopt SO_PRIORITY · a1b519b7
      Committed by Maciej Żenczykowski
      CAP_NET_ADMIN is and should continue to be about configuring the
      system as a whole, not about configuring per-socket or per-packet
      parameters.
      Sending and receiving raw packets is what CAP_NET_RAW is all about.
      
      It can already send packets with any VLAN tag, any IPv4 TOS mark, and
      any IPv6 TCLASS mark, simply by virtue of building such a raw packet,
      not to mention using any protocol and any source/destination IP
      address/port tuple.
      
      These are the fields that networking gear uses to prioritize packets.
      
      Hence, a CAP_NET_RAW process is already capable of affecting traffic
      prioritization after the traffic hits the wire. This change makes it
      capable of affecting traffic prioritization even in the host, at the
      NIC and before that in the queueing disciplines (provided skb->priority
      is actually being used for prioritization, and not the TOS/TCLASS
      field).
      
      Hence it makes sense to allow a CAP_NET_RAW process to set the
      priority of sockets and thus packets it sends.
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  25. 22 Nov 2021, 2 commits
  26. 16 Nov 2021, 5 commits