1. 19 Jun, 2022 1 commit
  2. 11 Jun, 2022 1 commit
  3. 10 Jun, 2022 4 commits
    • A
      net: seg6: fix seg6_lookup_any_nexthop() to handle VRFs using flowi_l3mdev · a3bd2102
      Andrea Mayer committed
      Commit 40867d74 ("net: Add l3mdev index to flow struct and avoid oif
      reset for port devices") adds a new entry (flowi_l3mdev) in the common
      flow struct used for indicating the l3mdev index for later rule and
      table matching.
      l3mdev_update_flow() has been adapted to set flowi_l3mdev properly
      based on flowi_oif/flowi_iif. In particular, when a valid flowi_iif is
      supplied to l3mdev_update_flow(), the function updates the
      flowi_l3mdev entry only if it has not yet been set (i.e., the
      flowi_l3mdev entry is equal to 0).
      
      The SRv6 End.DT6 behavior in VRF mode leverages a VRF device in order to
      force the routing lookup into the associated routing table. This routing
      operation is performed by seg6_lookup_any_nexthop(), which prepares a
      flowi6 data structure used by ip6_route_input_lookup() which, in turn,
      (indirectly) invokes l3mdev_update_flow().
      
      However, seg6_lookup_any_nexthop() does not initialize the new
      flowi_l3mdev entry, which is therefore filled with random garbage data.
      This prevents l3mdev_update_flow() from properly updating the
      flowi_l3mdev with the VRF index, and thus SRv6 End.DT6 (VRF mode)/DT46
      behaviors are broken.
      
      This patch correctly initializes the flowi6 instance allocated and used
      by seg6_lookup_any_nexthop(). Specifically, the entire flowi6 instance
      is zeroed: if new entries are added to flowi/flowi6 (as happened
      with the flowi_l3mdev entry), they can no longer be left with
      uninitialized values. As a result of this operation, the value of
      flowi_l3mdev is also set to 0.
      
      The proposed fix can be tested easily. Starting from the commit
      referenced in the Fixes tag, selftests [1],[2] indicate that the SRv6
      End.DT6 (VRF mode)/DT46 behaviors no longer work correctly. With
      this patch applied, those behaviors work properly again.
      
      [1] - tools/testing/selftests/net/srv6_end_dt46_l3vpn_test.sh
      [2] - tools/testing/selftests/net/srv6_end_dt6_l3vpn_test.sh
      
      Fixes: 40867d74 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices")
      Reported-by: Anton Makarov <am@3a-alliance.com>
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220608091917.20345-1-andrea.mayer@uniroma2.it
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
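A minimal userspace sketch of the zero-initialization idea in the seg6 fix above. It is an illustration only: struct demo_flow and the demo_* functions are hypothetical stand-ins for the kernel's flowi6 handling, not the actual patch. The point it shows is that wiping the whole structure first leaves any later-added field (here l3mdev) at 0, so the "update only if still 0" rule can take effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct flowi6; only the fields
 * needed for the illustration are modelled. */
struct demo_flow {
    int oif;
    int iif;
    int l3mdev;   /* field added later; a caller may forget to set it */
};

/* Mirrors the rule described in the commit message: the l3mdev index is
 * written only when it has not been set yet (i.e. it is still 0). */
void demo_update_flow(struct demo_flow *fl, int vrf_index)
{
    assert(fl != NULL);
    if (fl->l3mdev == 0)
        fl->l3mdev = vrf_index;
}

/* Prepare a flow by wiping the whole structure first, then filling in the
 * fields the caller knows about. */
struct demo_flow demo_prepare_flow(int iif, int vrf_index)
{
    struct demo_flow fl;

    memset(&fl, 0, sizeof(fl));   /* every field, old or new, starts at 0 */
    fl.iif = iif;
    demo_update_flow(&fl, vrf_index);
    return fl;
}
```

Without the memset(), fl.l3mdev would hold whatever happened to be on the stack, and the update step would usually be skipped.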
    • E
      ip6_tunnel: use dev_sw_netstats_rx_add() · afd2051b
      Eric Dumazet committed
      We have a convenient helper, let's use it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • E
      sit: use dev_sw_netstats_rx_add() · 3a960ca7
      Eric Dumazet committed
      We have a convenient helper, let's use it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • J
      net: rename reference+tracking helpers · d62607c3
      Jakub Kicinski committed
      Netdev reference helpers have a dev_ prefix for historic
      reasons. Renaming the old helpers would be too much churn
      but we can rename the tracking ones which are relatively
      recent and should be the default for new code.
      
      Rename:
       dev_hold_track()    -> netdev_hold()
       dev_put_track()     -> netdev_put()
       dev_replace_track() -> netdev_ref_replace()
      
      Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 09 Jun, 2022 2 commits
    • W
      ipv6: Fix signed integer overflow in __ip6_append_data · f93431c8
      Wang Yufen committed
      With the UBSAN overflow checks resurrected, UBSAN reports the warning
      below. Fix it by changing the type of the variable [length] to size_t.
      
      UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19
      2147479552 + 8567 cannot be represented in type 'int'
      CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1
      Hardware name: linux,dummy-virt (DT)
      Call trace:
        dump_backtrace+0x214/0x230
        show_stack+0x30/0x78
        dump_stack_lvl+0xf8/0x118
        dump_stack+0x18/0x30
        ubsan_epilogue+0x18/0x60
        handle_overflow+0xd0/0xf0
        __ubsan_handle_add_overflow+0x34/0x44
        __ip6_append_data.isra.48+0x1598/0x1688
        ip6_append_data+0x128/0x260
        udpv6_sendmsg+0x680/0xdd0
        inet6_sendmsg+0x54/0x90
        sock_sendmsg+0x70/0x88
        ____sys_sendmsg+0xe8/0x368
        ___sys_sendmsg+0x98/0xe0
        __sys_sendmmsg+0xf4/0x3b8
        __arm64_sys_sendmmsg+0x34/0x48
        invoke_syscall+0x64/0x160
        el0_svc_common.constprop.4+0x124/0x300
        do_el0_svc+0x44/0xc8
        el0_svc+0x3c/0x1e8
        el0t_64_sync_handler+0x88/0xb0
        el0t_64_sync+0x16c/0x170
      
      Changes since v1:
      -Change the variable [length] type to unsigned, as Eric Dumazet suggested.
      Changes since v2:
      -Don't change exthdrlen type in ip6_make_skb, as Paolo Abeni suggested.
      Changes since v3:
      -Don't change ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as
      Jakub Kicinski suggested.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
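The arithmetic in the UBSAN report above can be checked with a small userspace sketch. The demo_* helpers are hypothetical and are not the kernel code; they only show why 2147479552 + 8567 cannot be held in an int, and that the same sum is well defined once the length is a size_t (assuming size_t is at least 32 bits wide).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when a + b would not fit in an int.
 * Both operands are assumed to be non-negative lengths. */
bool demo_int_add_overflows(int a, int b)
{
    assert(a >= 0 && b >= 0);
    return a > INT_MAX - b;   /* checked without performing the addition */
}

/* With an unsigned size_t length, the same addition has defined behavior
 * and the values from the report fit comfortably. */
size_t demo_total_length(size_t length, size_t extra)
{
    return length + extra;
}
```

With a 32-bit int, INT_MAX is 2147483647, so the sum from the report (2147488119) is out of range for int but fits in size_t.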
    • M
      net: ipv6: unexport __init-annotated seg6_hmac_init() · 5801f064
      Masahiro Yamada committed
      EXPORT_SYMBOL and __init are a bad combination because the .init.text
      section is freed after initialization. Hence, modules cannot
      use symbols annotated __init. Accessing a freed symbol may end in a
      kernel panic.
      
      modpost used to detect it, but it has been broken for a decade.
      
      I recently fixed modpost so that it warns about this again, and then
      this showed up in linux-next builds.
      
      There are two ways to fix it:
      
        - Remove __init
        - Remove EXPORT_SYMBOL
      
      I chose the latter for this case because the caller (net/ipv6/seg6.c)
      and the callee (net/ipv6/seg6_hmac.c) belong to the same module.
      It appears to be an internal function call within ipv6.ko.
      
      Fixes: bf355b8d ("ipv6: sr: add core files for SR HMAC support")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 01 Jun, 2022 1 commit
  6. 31 May, 2022 1 commit
  7. 21 May, 2022 1 commit
  8. 18 May, 2022 1 commit
    • J
      random32: use real rng for non-deterministic randomness · d4150779
      Jason A. Donenfeld committed
      random32.c has two random number generators in it: one that is meant to
      be used deterministically, with some predefined seed, and one that does
      the same exact thing as random.c, except does it poorly. The first one
      has some use cases. The second one no longer does and can be replaced
      with calls to random.c's proper random number generator.
      
      The relatively recent siphash-based bad random32.c code was added in
      response to concerns that the prior random32.c was too deterministic.
      Out of fears that random.c was (at the time) too slow, this code was
      anonymously contributed. Then out of that emerged a kind of shadow
      entropy gathering system, with its own tentacles throughout various net
      code, added willy nilly.
      
      Stop👏making👏bespoke👏random👏number👏generators👏.
      
      Fortunately, recent advances in random.c mean that we can stop playing
      with this sketchiness, and just use get_random_u32(), which is now fast
      enough. In micro benchmarks using RDPMC, I'm seeing the same median
      cycle count between the two functions, with the mean being _slightly_
      higher due to batches refilling (which we can optimize further if need be).
      However, when doing *real* benchmarks of the net functions that actually
      use these random numbers, the mean cycles actually *decreased* slightly
      (with the median still staying the same), likely because the additional
      prandom code means icache misses and complexity, whereas random.c is
      generally already being used by something else nearby.
      
      The biggest benefit of this is that there are many users of prandom who
      probably should be using cryptographically secure random numbers. This
      makes all of those accidental cases become secure by just flipping a
      switch. Later on, we can do a tree-wide cleanup to remove the static
      inline wrapper functions that this commit adds.
      
      There are also some low-ish hanging fruits for making this even faster
      in the future: a get_random_u16() function for use in the networking
      stack will give a 2x performance boost there, using SIMD for ChaCha20
      will let us compute 4 or 8 or 16 blocks of output in parallel, instead
      of just one, giving us large buffers for cheap, and introducing a
      get_random_*_bh() function that assumes irqs are already disabled will
      shave off a few cycles for ordinary calls. These are things we can chip
      away at down the road.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  9. 16 May, 2022 7 commits
  10. 14 May, 2022 1 commit
  11. 13 May, 2022 2 commits
    • E
      Revert "tcp/dccp: get rid of inet_twsk_purge()" · 04c494e6
      Eric Dumazet committed
      This reverts commits:
      
      0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      d507204d ("tcp/dccp: add tw->tw_bslot")
      
      As Leonard pointed out, a newly allocated netns can happen
      to reuse a freed 'struct net'.
      
      While TCP TW timers were covered by my patches, other things were not:
      
      1) Lookups in rx path (INET_MATCH() and INET6_MATCH()), as they look
        at 4-tuple plus the 'struct net' pointer.
      
      2) /proc/net/tcp[6] and inet_diag, same reason.
      
      3) hashinfo->bhash[], same reason.
      
      Fixing all this seems risky, let's revert instead.
      
      In the future, we might have a per netns tcp hash table, or
      a per netns list of timewait sockets...
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Leonard Crestez <cdleonard@gmail.com>
      Tested-by: Leonard Crestez <cdleonard@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      net: inet: Retire port only listening_hash · cae3873c
      Martin KaFai Lau committed
      The listen sk is currently stored in two hash tables,
      listening_hash (hashed by port) and lhash2 (hashed by port and address).
      
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the TCP-SYN lookup fast path does not use listening_hash.
      
      The commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2")
      also moved the seq_file (/proc/net/tcp) iteration usage from
      listening_hash to lhash2.
      
      There are still a few listening_hash usages left.
      One of them is inet_reuseport_add_sock() which uses the listening_hash
      to search a listen sk during the listen() system call.  This turns
      out to be very slow on use cases that listen on many different
      VIPs at a popular port (e.g. 443).  [ On top of the slowness in
      adding to the tail in the IPv6 case ].  The latter patch has a
      selftest to demonstrate this case.
      
      This patch takes this chance to move all remaining listening_hash
      usages to lhash2 and then retire listening_hash.
      
      Since most changes need to be done together, it is hard to cut
      the listening_hash to lhash2 switch into small patches.  The
      changes in this patch are highlighted here for review.
      
      1. Because of the listening_hash removal, lhash2 can use the
         sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
         This also keeps the sk_unhashed() check working as is
         once sk is no longer added to listening_hash.
      
         The union is removed from inet_listen_hashbucket because
         only nulls_head is needed.
      
      2. icsk->icsk_listen_portaddr_node and its helpers are removed.
      
      3. The current lhash2 users need to iterate with sk_nulls_node
         instead of icsk_listen_portaddr_node.
      
         One case is in the inet[6]_lhash2_lookup().
      
         Another case is the seq_file iterator in tcp_ipv4.c.
         One thing to note is sk_nulls_next() is needed
         because the old inet_lhash2_for_each_icsk_continue()
         does a "next" first before iterating.
      
      4. Move the remaining listening_hash usage to lhash2
      
         inet_reuseport_add_sock() which this series is
         trying to improve.
      
         inet_diag.c and mptcp_diag.c are the final two
         remaining use cases and are now moved to lhash2 as well.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  12. 06 May, 2022 1 commit
  13. 05 May, 2022 2 commits
    • W
      secure_seq: use the 64 bits of the siphash for port offset calculation · b2d05756
      Willy Tarreau committed
      SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
      7cd23e53 ("secure_seq: use SipHash in place of MD5"), but the output
      remained truncated to 32 bits. In order to exploit more bits from the
      hash, let's make the functions return the full 64 bits of siphash_3u32().
      We also make sure the port offset calculation in __inet_hash_connect()
      is still done on 32 bits to avoid the need for div_u64_rem() and an extra
      cost on 32-bit systems.
      
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
      Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
      Cc: Amit Klein <aksecurity@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
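A rough sketch of the "64-bit hash, 32-bit modulo" idea from the secure_seq commit above. This is an assumption-laden illustration, not the kernel code: demo_port_offset() is a hypothetical helper, and which bits of the hash the kernel actually feeds into the modulo is not stated in the commit message; using the low 32 bits here is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pick an offset in [0, remaining) from a 64-bit hash
 * while keeping the modulo a plain 32-bit operation, so that no 64-bit
 * division helper is needed on 32-bit systems. */
uint32_t demo_port_offset(uint64_t hash64, uint32_t remaining)
{
    uint32_t low;

    assert(remaining != 0);
    low = (uint32_t)hash64;   /* truncate first (assumption: low 32 bits) */
    return low % remaining;   /* 32-bit modulo only */
}
```

The wider hash stays available to other callers; only the value entering the modulo is narrowed.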
    • V
      memcg: accounting for objects allocated for new netdevice · 425b9c7f
      Vasily Averin committed
      Creating a new netdevice allocates at least ~50Kb of memory for various
      kernel objects, but only ~5Kb of it is accounted to memcg. As a result,
      creating an unlimited number of netdevices inside a memcg-limited container
      does not fall within memcg restrictions, consumes a significant part
      of the host's memory, and can cause a global OOM and lead to random kills
      of host processes.
      
      The main consumers of non-accounted memory are:
       ~10Kb   80+ kernfs nodes
       ~6Kb    ipv6_add_dev() allocations
        6Kb    __register_sysctl_table() allocations
        4Kb    neigh_sysctl_register() allocations
        4Kb    __devinet_sysctl_register() allocations
        4Kb    __addrconf_sysctl_register() allocations
      
      Accounting for these objects increases the share of memcg-accounted
      memory up to 60-70% (~38Kb accounted vs ~54Kb total for a dummy netdevice
      on a typical VM with the default Fedora 35 kernel), which should be enough
      to give the host some protection from misuse inside a container.
      
      Other related objects are quite small and can be left unaccounted
      to minimize the expected performance degradation.
      
      Worth mentioning separately is the ~300 bytes of percpu allocation
      of struct ipstats_mib in snmp6_alloc_dev(); on huge multi-CPU nodes
      it can become the main consumer of memory.
      
      This patch does not enable kernfs accounting as it affects
      other parts of the kernel and should be discussed separately.
      However, even without kernfs, this patch significantly improves the
      current situation and accounts for more than half
      of all netdevice allocations.
      Signed-off-by: Vasily Averin <vvs@openvz.org>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  14. 03 May, 2022 3 commits
    • T
      net: sysctl: introduce sysctl SYSCTL_THREE · 4c7f24f8
      Tonghao Zhang committed
      This patch introduces SYSCTL_THREE.
      
      KUnit:
      [00:10:14] ================ sysctl_test (10 subtests) =================
      [00:10:14] [PASSED] sysctl_test_api_dointvec_null_tbl_data
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_maxlen_unset
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_len_is_zero
      [00:10:14] [PASSED] sysctl_test_api_dointvec_table_read_but_position_set
      [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_positive
      [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_negative
      [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_positive
      [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_negative
      [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_less_int_min
      [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_greater_int_max
      [00:10:14] =================== [PASSED] sysctl_test ===================
      
      ./run_kselftest.sh -c sysctl
      ...
      ok 1 selftests: sysctl: sysctl.sh
      
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Lorenz Bauer <lmb@cloudflare.com>
      Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • T
      net: sysctl: use shared sysctl macro · bd8a5367
      Tonghao Zhang committed
      This patch replaces two, four and long_one with the shared SYSCTL_XXX macros.
      
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Lorenz Bauer <lmb@cloudflare.com>
      Cc: Akhmat Karakotov <hmukos@yandex-team.ru>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • J
      ipv6: Don't send rs packets to the interface of ARPHRD_TUNNEL · b52e1cce
      jianghaoran committed
      An ARPHRD_TUNNEL interface cannot process RS (router solicitation)
      packets and will generate TX errors.
      
      Example:
      ip tunnel add ethn mode ipip local 192.168.1.1 remote 192.168.1.2
      ifconfig ethn x.x.x.x
      
      ethn: flags=209<UP,POINTOPOINT,RUNNING,NOARP>  mtu 1480
      	inet x.x.x.x  netmask 255.255.255.255  destination x.x.x.x
      	inet6 fe80::5efe:ac1e:3cdb  prefixlen 64  scopeid 0x20<link>
      	tunnel   txqueuelen 1000  (IPIP Tunnel)
      	RX packets 0  bytes 0 (0.0 B)
      	RX errors 0  dropped 0  overruns 0  frame 0
      	TX packets 0  bytes 0 (0.0 B)
      	TX errors 3  dropped 0 overruns 0  carrier 0  collisions 0
      Signed-off-by: jianghaoran <jianghaoran@kylinos.cn>
      Link: https://lore.kernel.org/r/20220429053802.246681-1-jianghaoran@kylinos.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. 02 May, 2022 1 commit
  16. 30 Apr, 2022 3 commits
  17. 29 Apr, 2022 1 commit
  18. 27 Apr, 2022 1 commit
    • E
      net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Eric Dumazet committed
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb
      frees outside of the critical section where the socket lock is held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      skb payload has been consumed, meaning that the BH handler has no chance
      to pick the skb before the recvmsg() thread. This issue is more visible
      with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if BH handler picks the skbs, they are still picked
      from the cpu on which user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      page recycling strategy used by NIC driver (its page pool capacity
      being too small compared to number of skbs/pages held in sockets
      receive queues)
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workload.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
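A simplified, single-threaded model of the per-cpu deferral described in the commit above. Everything here is hypothetical (struct demo_skb, struct demo_softnet, DEMO_NR_CPUS and the demo_* functions), and the locking, IPI and softirq parts of the real change are deliberately left out. The sketch only shows the routing rule: free right away when already on the allocating CPU, otherwise push onto that CPU's single-linked list, which is later drained there.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4   /* arbitrary number of CPUs for the model */

struct demo_skb {
    int alloc_cpu;           /* CPU that originally allocated the buffer */
    struct demo_skb *next;   /* single-linked list link */
    int freed;               /* set to 1 once "freed" in this model */
};

struct demo_softnet {
    struct demo_skb *defer_list;
    int defer_count;
};

struct demo_softnet demo_sd[DEMO_NR_CPUS];   /* one slot per CPU, zeroed */

/* Called on this_cpu when the consumer is done with skb. */
void demo_defer_free(struct demo_skb *skb, int this_cpu)
{
    struct demo_softnet *sd;

    assert(skb->alloc_cpu >= 0 && skb->alloc_cpu < DEMO_NR_CPUS);
    if (skb->alloc_cpu == this_cpu) {
        skb->freed = 1;      /* already on the right CPU: do not defer */
        return;
    }
    sd = &demo_sd[skb->alloc_cpu];
    skb->next = sd->defer_list;   /* push onto the allocating CPU's list */
    sd->defer_list = skb;
    sd->defer_count++;
}

/* Called on cpu after its incoming packets have been processed;
 * returns how many buffers were freed. */
int demo_drain(int cpu)
{
    struct demo_skb *skb = demo_sd[cpu].defer_list;
    int n = 0;

    demo_sd[cpu].defer_list = NULL;
    demo_sd[cpu].defer_count = 0;
    while (skb != NULL) {
        struct demo_skb *next = skb->next;

        skb->freed = 1;
        skb = next;
        n++;
    }
    return n;
}
```

In the real change the list is shared between CPUs, so it is protected by a lock and a lockless READ_ONCE()/WRITE_ONCE() check, as the v2 notes mention; this model ignores concurrency entirely.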
  19. 25 Apr, 2022 4 commits
    • E
      tcp: make sure treq->af_specific is initialized · ba5a4fdd
      Eric Dumazet committed
      syzbot complained about a recent change in the TCP stack,
      hitting a NULL pointer dereference [1].
      
      TCP request sockets have an af_specific pointer, which
      was used before the blamed change only for SYNACK generation
      in non-SYNCOOKIE mode.
      
      TCP request sockets momentarily created when the third packet
      arrives from the client in SYNCOOKIE mode were not using
      treq->af_specific.
      
      Make sure this field is populated, in the same way normal
      TCP request sockets do in tcp_conn_request().
      
      [1]
      TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  Check SNMP counters.
      general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
      CPU: 1 PID: 3695 Comm: syz-executor864 Not tainted 5.18.0-rc3-syzkaller-00224-g5fd1fe48 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_create_openreq_child+0xe16/0x16b0 net/ipv4/tcp_minisocks.c:534
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 e5 07 00 00 4c 8b b3 28 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7e 08 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 c9 07 00 00 48 8b 3c 24 48 89 de 41 ff 56 08 48
      RSP: 0018:ffffc90000de0588 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff888076490330 RCX: 0000000000000100
      RDX: 0000000000000001 RSI: ffffffff87d67ff0 RDI: 0000000000000008
      RBP: ffff88806ee1c7f8 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff87d67f00 R11: 0000000000000000 R12: ffff88806ee1bfc0
      R13: ffff88801b0e0368 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f517fe58700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffcead76960 CR3: 000000006f97b000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       tcp_v6_syn_recv_sock+0x199/0x23b0 net/ipv6/tcp_ipv6.c:1267
       tcp_get_cookie_sock+0xc9/0x850 net/ipv4/syncookies.c:207
       cookie_v6_check+0x15c3/0x2340 net/ipv6/syncookies.c:258
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1131 [inline]
       tcp_v6_do_rcv+0x1148/0x13b0 net/ipv6/tcp_ipv6.c:1486
       tcp_v6_rcv+0x3305/0x3840 net/ipv6/tcp_ipv6.c:1725
       ip6_protocol_deliver_rcu+0x2e9/0x1900 net/ipv6/ip6_input.c:422
       ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:464
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:473
       dst_input include/net/dst.h:461 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ipv6_rcv+0x27f/0x3b0 net/ipv6/ip6_input.c:297
       __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5405
       __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5519
       process_backlog+0x3a0/0x7c0 net/core/dev.c:5847
       __napi_poll+0xb3/0x6e0 net/core/dev.c:6413
       napi_poll net/core/dev.c:6480 [inline]
       net_rx_action+0x8ec/0xc60 net/core/dev.c:6567
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
      
      Fixes: 5b0b9e4c ("tcp: md5: incorrect tcp_header_len for incoming connections")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Francesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode · 31c417c9
      Peilin Ye committed
      As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in
      collect_md mode is racy for [IP6]GRE[TAP] devices.  Consider the
      following sequence of events:
      
      1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link
         add ... external".  "ip" ignores "[o]seq" if "external" is specified,
         so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e.
         it uses lockless TX);
      2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g.
         bpf_skb_set_tunnel_key() in an eBPF program attached to this device;
      3. gre_fb_xmit() or __gre6_xmit() processes these skb's:
      
      	gre_build_header(skb, tun_hlen,
      			 flags, protocol,
      			 tunnel_id_to_key32(tun_info->key.tun_id),
      			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
      					      : 0);   ^^^^^^^^^^^^^^^^^
      
      Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may
      try to do this tunnel->o_seqno++ in parallel, which is racy.  Fix it by
      making o_seqno atomic_t.
      
      As mentioned by Eric Dumazet in commit b790e01a ("ip_gre: lockless
      xmit"), making o_seqno atomic_t increases "chance for packets being out
      of order at receiver" when NETIF_F_LLTX is on.
      
      Maybe a better fix would be:
      
      1. Do not ignore "oseq" in external mode.  Users MUST specify "oseq" if
         they want the kernel to allow sequencing of outgoing packets;
      2. Reject all outgoing TUNNEL_SEQ packets if the device was not created
         with "oseq".
      
      Unfortunately, that would break userspace.
      
      We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us
      do it in separate patches to keep this fix minimal.
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Fixes: 77a5196a ("gre: add sequence number for collect md mode.")
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
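A small userspace sketch of why an atomic counter removes the race described in the o_seqno commit above. It uses C11 <stdatomic.h> as a stand-in for the kernel's atomic_t, which is an assumption made for illustration; struct demo_tunnel and demo_next_seqno() are hypothetical names and this is not the kernel patch itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the tunnel state; only the counter is modelled. */
struct demo_tunnel {
    atomic_uint o_seqno;
};

/* Returns the sequence number for the next packet and advances the counter
 * in a single atomic read-modify-write. Two callers running in parallel can
 * therefore never observe the same value, unlike a plain o_seqno++. */
uint32_t demo_next_seqno(struct demo_tunnel *t)
{
    assert(t != NULL);
    return (uint32_t)atomic_fetch_add(&t->o_seqno, 1u);
}
```

Uniqueness is all the atomic gives: as the commit message notes, packets may still reach the receiver out of order when transmission is lockless.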
    • P
      ip6_gre: Make o_seqno start from 0 in native mode · fde98ae9
      Peilin Ye committed
      For IP6GRE and IP6GRETAP devices, currently o_seqno starts from 1 in
      native mode.  According to RFC 2890 2.2., "The first datagram is sent
      with a sequence number of 0."  Fix it.
      
      It is worth mentioning that o_seqno already starts from 0 in collect_md
      mode, see the "if (tunnel->parms.collect_md)" clause in __gre6_xmit(),
      where tunnel->o_seqno is passed to gre_build_header() before getting
      incremented.
      
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
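The off-by-one described in the commit above comes down to whether the counter is incremented before or after its value is used. The two hypothetical helpers below illustrate the difference in plain C; they are not the kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Increment first, then use: the first packet carries sequence number 1. */
uint32_t demo_seq_increment_then_use(uint32_t *seq)
{
    assert(seq != NULL);
    return ++*seq;
}

/* Use first, then increment: the first packet carries sequence number 0,
 * which matches the RFC 2890 wording quoted in the commit message. */
uint32_t demo_seq_use_then_increment(uint32_t *seq)
{
    assert(seq != NULL);
    return (*seq)++;
}
```

Both variants leave the counter at 1 after the first packet; only the value placed in that first packet differs.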
    • M
      netfilter: Update ip6_route_me_harder to consider L3 domain · 8ddffdb9
      Martin Willi committed
      The commit referenced below fixed packet re-routing if Netfilter mangles
      a routing key property of a packet and the packet is routed in a VRF L3
      domain. The fix, however, addressed IPv4 re-routing only.
      
      This commit applies the same behavior for IPv6. While at it, untangle
      the nested ternary operator to make the code more readable.
      
      Fixes: 6d8b49c3 ("netfilter: Update ip_route_me_harder to consider L3 domain")
      Cc: stable@vger.kernel.org
      Signed-off-by: Martin Willi <martin@strongswan.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  20. 22 Apr, 2022 2 commits