1. 16 November 2022, 4 commits
  2. 10 November 2022, 1 commit
  3. 09 November 2022, 1 commit
    • net/core: Allow live renaming when an interface is up · bd039b5e
      Andy Ren authored
      Allow a network interface to be renamed when the interface
      is up.
      
      As described in the netconsole documentation [1], when netconsole is
      used as a built-in, it will bring up the specified interface as soon as
      possible. As a result, user space will not be able to rename the
      interface since the kernel disallows renaming of interfaces that are
      administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
      by the kernel.
      
      The original solution [2] to this problem was to add a new parameter to
      the netconsole configuration parameters that allows renaming of
      the interface used by netconsole while it is administratively up.
      However, during the discussion that followed, it became apparent that we
      have no reason to keep the current restriction and instead we should
      allow user space to rename interfaces regardless of their administrative
      state:
      
      1. The restriction was put in place over 20 years ago when renaming was
      only possible via IOCTL and before rtnetlink started notifying user
      space about such changes like it does today.
      
      2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
      5.2 and no regressions were reported.
      
      3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
      the administrative state of the interface.
      
      Therefore, allow user space to rename running interfaces by removing the
      restriction and the associated 'IFF_LIVE_RENAME_OK' flag. To help with
      possible triage, emit a message to the kernel log when an interface is
      renamed while UP.
      
      [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
      [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/
      Signed-off-by: Andy Ren <andy.ren@getcruise.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd039b5e
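      A minimal userspace sketch of what the commit above now permits:
      renaming an interface that is administratively up. The interface names
      are placeholders, CAP_NET_ADMIN is required, and before this change the
      kernel would have refused the request with EBUSY.

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <sys/socket.h>
      #include <net/if.h>

      int main(void)
      {
          struct ifreq ifr;
          int fd = socket(AF_INET, SOCK_DGRAM, 0);

          if (fd < 0) {
              perror("socket");
              return 1;
          }
          memset(&ifr, 0, sizeof(ifr));
          strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);    /* current name (placeholder) */
          strncpy(ifr.ifr_newname, "lan0", IFNAMSIZ - 1); /* new name (placeholder) */

          /* With eth0 up, this used to fail; it now succeeds and the kernel
           * logs that the interface was renamed while UP. */
          if (ioctl(fd, SIOCSIFNAME, &ifr) < 0)
              perror("SIOCSIFNAME");
          close(fd);
          return 0;
      }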
  4. 04 November 2022, 1 commit
    • net: devlink: track netdev with devlink_port assigned · 02a68a47
      Jiri Pirko authored
      Currently, ethernet drivers are using devlink_port_type_eth_set() and
      devlink_port_type_clear() to set the devlink port type and link it to
      the related netdev.
      
      Instead of calling them directly, let the driver use the
      SET_NETDEV_DEVLINK_PORT macro to assign the devlink_port pointer and
      let devlink track it. Note that the devlink port pointer is static for
      as long as the netdevice is registered.
      
      In the devlink code, use a per-namespace netdev notifier to track
      the netdevices with a devlink_port assigned and change the internal
      devlink_port type and related type pointer accordingly.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      02a68a47
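      A hedged kernel-side sketch of the driver pattern described above; the
      "foo" driver and its private structure are placeholders, not real code:

      /* Assumes the driver already created a devlink_port next to its netdev. */
      struct foo_priv {
          struct net_device *netdev;
          struct devlink_port devlink_port;
      };

      static int foo_register_netdev(struct foo_priv *priv)
      {
          /* Pair the netdev with its devlink_port before registration;
           * devlink then tracks the pairing through its per-namespace netdev
           * notifier instead of the driver calling
           * devlink_port_type_eth_set()/devlink_port_type_clear() later. */
          SET_NETDEV_DEVLINK_PORT(priv->netdev, &priv->devlink_port);

          return register_netdev(priv->netdev);
      }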
  5. 01 November 2022, 2 commits
  6. 29 October 2022, 1 commit
  7. 26 October 2022, 1 commit
    • net: dev: Convert sa_data to flexible array in struct sockaddr · b5f0de6d
      Kees Cook authored
      One of the worst offenders of "fake flexible arrays" is struct sockaddr,
      as it is the classic example of why GCC and Clang have been traditionally
      forced to treat all trailing arrays as fake flexible arrays: in the
      distant misty past, sa_data became too small, and code started just
      treating it as a flexible array, even though it was fixed-size. The
      special case by the compiler is specifically that sizeof(sa->sa_data)
      and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1))
      do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat
      it as a flexible array.
      
      However, the coming -fstrict-flex-arrays compiler flag will remove
      these special cases so that FORTIFY_SOURCE can gain coverage over all
      the trailing arrays in the kernel that are _not_ supposed to be treated
      as a flexible array. To deal with this change, convert sa_data to a true
      flexible array. To keep the structure size the same, move sa_data into
      a union with a newly introduced sa_data_min with the original size. The
      result is that FORTIFY_SOURCE can continue to have no idea how large
      sa_data may actually be, but anything using sizeof(sa->sa_data) must
      switch to sizeof(sa->sa_data_min).
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Dylan Yudaken <dylany@fb.com>
      Cc: Yajun Deng <yajun.deng@linux.dev>
      Cc: Petr Machata <petrm@nvidia.com>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: syzbot <syzkaller@googlegroups.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b5f0de6d
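      A standalone illustration of the union trick described above, using the
      GNU C extensions that the kernel's DECLARE_FLEX_ARRAY() helper roughly
      expands to (the demo struct is not the kernel definition):

      #include <stdio.h>

      struct demo_sockaddr {
          unsigned short sa_family;
          union {
              char sa_data_min[14];          /* keeps the historical size */
              struct {
                  struct { } __empty_sa_data;
                  char sa_data[];            /* true flexible array */
              };
          };
      };

      int main(void)
      {
          /* The overall size is unchanged, but sizeof() of the flexible
           * member is gone, so callers switch to sizeof(sa_data_min). */
          printf("struct size: %zu\n", sizeof(struct demo_sockaddr));
          printf("sa_data_min: %zu\n",
                 sizeof(((struct demo_sockaddr *)0)->sa_data_min));
          return 0;
      }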
  8. 19 October 2022, 1 commit
    • net: Fix return value of qdisc ingress handling on success · 672e97ef
      Paul Blakey authored
      Currently, qdisc ingress handling (sch_handle_ingress()) doesn't
      set a return value, so the caller (__netif_receive_skb_core()) is
      left with its previous value, which is RX drop. If the packet is
      consumed, the caller will stop and return this value as if the
      packet had been dropped.
      
      This causes a problem in the kernel tcp stack when having an
      egress tc rule forwarding to an ingress tc rule.
      The tcp stack sending packets on the device with the egress rule
      will see the packets as not successfully transmitted (although they
      actually were), will not advance its internal state of sent data,
      and packets returning on such a tcp stream will be dropped by the tcp
      stack with reason ack-of-unsent-data. See reproduction in [0] below.
      
      Fix that by setting the return value to RX success if
      the packet was handled successfully.
      
      [0] Reproduction steps:
       $ ip link add veth1 type veth peer name peer1
       $ ip link add veth2 type veth peer name peer2
       $ ifconfig peer1 5.5.5.6/24 up
       $ ip netns add ns0
       $ ip link set dev peer2 netns ns0
       $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
       $ ifconfig veth2 0 up
       $ ifconfig veth1 0 up
      
       #ingress forwarding veth1 <-> veth2
       $ tc qdisc add dev veth2 ingress
       $ tc qdisc add dev veth1 ingress
       $ tc filter add dev veth2 ingress prio 1 proto all flower \
         action mirred egress redirect dev veth1
       $ tc filter add dev veth1 ingress prio 1 proto all flower \
         action mirred egress redirect dev veth2
      
       #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
       $ tc qdisc add dev peer1 clsact
       $ tc filter add dev peer1 egress prio 20 proto ip flower \
         action mirred ingress redirect dev veth1
      
       #run iperf and see connection not running
       $ iperf3 -s&
       $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
      
       #delete egress rule, and run again, now should work
       $ tc filter del dev peer1 egress
       $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
      
      Fixes: f697c3e8 ("[NET]: Avoid unnecessary cloning for ingress filtering")
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      672e97ef
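      A hedged sketch of the idea behind the fix (simplified, not the exact
      upstream diff to sch_handle_ingress()): when tc consumes or redirects
      the packet, report success so the caller does not fall back to its
      default drop value.

      /* Inside sch_handle_ingress(), sketched: */
      switch (tc_act) {
      case TC_ACT_SHOT:
          kfree_skb(skb);
          *ret = NET_RX_DROP;        /* a real drop stays a drop */
          return NULL;
      case TC_ACT_STOLEN:
      case TC_ACT_QUEUED:
      case TC_ACT_TRAP:
          consume_skb(skb);
          *ret = NET_RX_SUCCESS;     /* consumed on purpose: success */
          return NULL;
      case TC_ACT_REDIRECT:
          skb_do_redirect(skb);
          *ret = NET_RX_SUCCESS;     /* handed to another device: success */
          return NULL;
      }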
  9. 30 September 2022, 1 commit
  10. 24 August 2022, 6 commits
  11. 23 August 2022, 1 commit
  12. 22 August 2022, 1 commit
  13. 20 July 2022, 1 commit
  14. 06 July 2022, 1 commit
    • xdp: Fix spurious packet loss in generic XDP TX path · 1fd6e567
      Johan Almbladh authored
      The byte queue limits (BQL) mechanism is intended to move queuing from
      the driver to the network stack in order to reduce latency caused by
      excessive queuing in hardware. However, when transmitting or redirecting
      a packet using generic XDP, the qdisc layer is bypassed and there are no
      additional queues. Since netif_xmit_stopped() also takes BQL limits into
      account, but without having any alternative queuing, packets are
      silently dropped.
      
      This patch modifies the drop condition to only consider cases when the
      driver itself cannot accept any more packets. This is analogous to the
      condition in __dev_direct_xmit(). Dropped packets are also counted on
      the device.
      
      Bypassing the qdisc layer in the generic XDP TX path means that XDP
      packets are able to starve other packets going through a qdisc, and
      DDOS attacks will be more effective. In-driver XDP uses dedicated TX
      queues, so it does not have this starvation issue.
      Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220705082345.2494312-1-johan.almbladh@anyfinetworks.com
      1fd6e567
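      A hedged sketch of the modified drop condition (simplified; it mirrors
      the check used by __dev_direct_xmit() rather than the exact upstream
      hunk):

      /* Generic XDP TX, sketched: only drop when the driver queue itself is
       * stopped or frozen, not when BQL alone says stop, since there is no
       * qdisc here to absorb the backlog. */
      if (netif_xmit_frozen_or_drv_stopped(txq) ||
          netdev_start_xmit(skb, dev, txq, 0) != NETDEV_TX_OK) {
          dev_core_stats_tx_dropped_inc(dev);    /* count the drop on the device */
          kfree_skb(skb);
      }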
  15. 17 June 2022, 1 commit
    • net: fix data-race in dev_isalive() · cc26c266
      Eric Dumazet authored
      dev_isalive() is called under RTNL or dev_base_lock protection.
      
      This means that changes to dev->reg_state should be done with both locks held.
      
      syzbot reported:
      
      BUG: KCSAN: data-race in register_netdevice / type_show
      
      write to 0xffff888144ecf518 of 1 bytes by task 20886 on cpu 0:
      register_netdevice+0xb9f/0xdf0 net/core/dev.c:10050
      lapbeth_new_device drivers/net/wan/lapbether.c:414 [inline]
      lapbeth_device_event+0x4a0/0x6c0 drivers/net/wan/lapbether.c:456
      notifier_call_chain kernel/notifier.c:87 [inline]
      raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:455
      __dev_notify_flags+0x1d6/0x3a0
      dev_change_flags+0xa2/0xc0 net/core/dev.c:8607
      do_setlink+0x778/0x2230 net/core/rtnetlink.c:2780
      __rtnl_newlink net/core/rtnetlink.c:3546 [inline]
      rtnl_newlink+0x114c/0x16a0 net/core/rtnetlink.c:3593
      rtnetlink_rcv_msg+0x811/0x8c0 net/core/rtnetlink.c:6089
      netlink_rcv_skb+0x13e/0x240 net/netlink/af_netlink.c:2501
      rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6107
      netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
      netlink_unicast+0x58a/0x660 net/netlink/af_netlink.c:1345
      netlink_sendmsg+0x661/0x750 net/netlink/af_netlink.c:1921
      sock_sendmsg_nosec net/socket.c:714 [inline]
      sock_sendmsg net/socket.c:734 [inline]
      __sys_sendto+0x21e/0x2c0 net/socket.c:2119
      __do_sys_sendto net/socket.c:2131 [inline]
      __se_sys_sendto net/socket.c:2127 [inline]
      __x64_sys_sendto+0x74/0x90 net/socket.c:2127
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      read to 0xffff888144ecf518 of 1 bytes by task 20423 on cpu 1:
      dev_isalive net/core/net-sysfs.c:38 [inline]
      netdev_show net/core/net-sysfs.c:50 [inline]
      type_show+0x24/0x90 net/core/net-sysfs.c:112
      dev_attr_show+0x35/0x90 drivers/base/core.c:2095
      sysfs_kf_seq_show+0x175/0x240 fs/sysfs/file.c:59
      kernfs_seq_show+0x75/0x80 fs/kernfs/file.c:162
      seq_read_iter+0x2c3/0x8e0 fs/seq_file.c:230
      kernfs_fop_read_iter+0xd1/0x2f0 fs/kernfs/file.c:235
      call_read_iter include/linux/fs.h:2052 [inline]
      new_sync_read fs/read_write.c:401 [inline]
      vfs_read+0x5a5/0x6a0 fs/read_write.c:482
      ksys_read+0xe8/0x1a0 fs/read_write.c:620
      __do_sys_read fs/read_write.c:630 [inline]
      __se_sys_read fs/read_write.c:628 [inline]
      __x64_sys_read+0x3e/0x50 fs/read_write.c:628
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      value changed: 0x00 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 20423 Comm: udevd Tainted: G W 5.19.0-rc2-syzkaller-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc26c266
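      A hedged sketch of the rule stated above, i.e. that dev->reg_state may
      only change with both RTNL and dev_base_lock held (illustrative
      placement, not the exact upstream hunk):

      /* Writer side, sketched: publish the state change under dev_base_lock
       * so sysfs readers going through dev_isalive() cannot race with it.
       * RTNL is already held on this path. */
      ASSERT_RTNL();
      write_lock(&dev_base_lock);
      dev->reg_state = NETREG_REGISTERED;
      write_unlock(&dev_base_lock);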
  16. 10 June 2022, 4 commits
  17. 18 May 2022, 1 commit
    • random32: use real rng for non-deterministic randomness · d4150779
      Jason A. Donenfeld authored
      random32.c has two random number generators in it: one that is meant to
      be used deterministically, with some predefined seed, and one that does
      the same exact thing as random.c, except does it poorly. The first one
      has some use cases. The second one no longer does and can be replaced
      with calls to random.c's proper random number generator.
      
      The relatively recent siphash-based bad random32.c code was added in
      response to concerns that the prior random32.c was too deterministic.
      Out of fears that random.c was (at the time) too slow, this code was
      anonymously contributed. Then out of that emerged a kind of shadow
      entropy gathering system, with its own tentacles throughout various net
      code, added willy nilly.
      
      Stop👏making👏bespoke👏random👏number👏generators👏.
      
      Fortunately, recent advances in random.c mean that we can stop playing
      with this sketchiness, and just use get_random_u32(), which is now fast
      enough. In micro benchmarks using RDPMC, I'm seeing the same median
      cycle count between the two functions, with the mean being _slightly_
      higher due to batches refilling (which we can optimize further if need be).
      However, when doing *real* benchmarks of the net functions that actually
      use these random numbers, the mean cycles actually *decreased* slightly
      (with the median still staying the same), likely because the additional
      prandom code means icache misses and complexity, whereas random.c is
      generally already being used by something else nearby.
      
      The biggest benefit of this is that there are many users of prandom who
      probably should be using cryptographically secure random numbers. This
      makes all of those accidental cases become secure by just flipping a
      switch. Later on, we can do a tree-wide cleanup to remove the static
      inline wrapper functions that this commit adds.
      
      There are also some low-ish hanging fruits for making this even faster
      in the future: a get_random_u16() function for use in the networking
      stack will give a 2x performance boost there, using SIMD for ChaCha20
      will let us compute 4 or 8 or 16 blocks of output in parallel, instead
      of just one, giving us large buffers for cheap, and introducing a
      get_random_*_bh() function that assumes irqs are already disabled will
      shave off a few cycles for ordinary calls. These are things we can chip
      away at down the road.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      d4150779
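      The shape of the wrappers this commit adds, sketched (the exact header
      contents may differ slightly):

      /* include/linux/prandom.h, sketched: the "pseudo" API now simply
       * forwards to the real RNG, so existing callers become secure without
       * any change at the call sites. */
      static inline u32 prandom_u32(void)
      {
          return get_random_u32();
      }

      static inline void prandom_bytes(void *buf, size_t nbytes)
      {
          get_random_bytes(buf, nbytes);
      }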
  18. 16 May 2022, 7 commits
    • net: fix dev_fill_forward_path with pppoe + bridge · cf2df74e
      Felix Fietkau authored
      When calling dev_fill_forward_path on a pppoe device, the provided destination
      address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
      code needs to update ctx->daddr to the correct value.
      Fix this by storing the address inside struct net_device_path_ctx.
      
      Fixes: f6efc675 ("net: ppp: resolve forwarding path for bridge pppoe devices")
      Signed-off-by: Felix Fietkau <nbd@nbd.name>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      cf2df74e
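      A hedged sketch of the fix's shape: the new daddr field is what the
      message describes, while the pppoe-side copy and its source field are
      assumptions about the driver code, not a quote of the patch.

      /* struct net_device_path_ctx gains a destination field, sketched: */
      struct net_device_path_ctx {
          const struct net_device *dev;
          u8 daddr[ETH_ALEN];    /* set by pppoe so the bridge fdb lookup succeeds */
          /* ... existing vlan bookkeeping unchanged ... */
      };

      /* pppoe's ndo_fill_forward_path() then records the PPPoE peer MAC: */
      memcpy(ctx->daddr, po->pppoe_pa.remote, ETH_ALEN);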
    • net: call skb_defer_free_flush() before each napi_poll() · 90987650
      Eric Dumazet authored
      skb_defer_free_flush() can consume cpu cycles;
      it seems better to call it in the inner loop:
      
      - Potentially frees page/skb that will be reallocated while hot.
      
      - Account for the cpu cycles in the @time_limit determination.
      
      - Keep softnet_data.defer_count small to reduce chances for
        skb_attempt_defer_free() to send an IPI.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90987650
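      A hedged sketch of the resulting net_rx_action() inner loop
      (simplified):

      for (;;) {
          struct napi_struct *n;

          /* Flush deferred skbs before every poll: frees memory while it is
           * still cache-hot, keeps defer_count small, and charges the cycles
           * against this softirq's time_limit. */
          skb_defer_free_flush(sd);

          if (list_empty(&list))
              break;

          n = list_first_entry(&list, struct napi_struct, poll_list);
          budget -= napi_poll(n, &repoll);

          if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
              break;
      }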
    • net: add skb_defer_max sysctl · 39564c3f
      Eric Dumazet authored
      commit 68822bdf ("net: generalize skb freeing
      deferral to per-cpu lists") added another per-cpu
      cache of skbs. It was expected to be small,
      and an IPI was forced whenever the list reached 128
      skbs.
      
      We might need to be able to control queue capacity and added latency
      more precisely.
      
      An IPI is generated whenever the queue reaches half capacity.
      
      The default value of the new limit is 64.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39564c3f
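      A hedged sketch of how the new limit is consulted in
      skb_attempt_defer_free(); the sysctl variable name is an assumption and
      the queueing details are elided:

      defer_max = READ_ONCE(sysctl_skb_defer_max);    /* default 64 */
      if (READ_ONCE(sd->defer_count) >= defer_max) {
          __kfree_skb(skb);                /* over the limit: free locally */
          return;
      }

      /* ... queue skb on sd->defer_list under sd->defer_lock ... */

      /* Kick the remote cpu once the queue reaches half capacity. */
      if (sd->defer_count == (defer_max >> 1))
          smp_call_function_single_async(cpu, &sd->defer_csd);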
    • net: use napi_consume_skb() in skb_defer_free_flush() · 2db60eed
      Eric Dumazet authored
      skb_defer_free_flush() runs from softirq context, so we have
      the opportunity to refill the napi_alloc_cache,
      and/or use kmem_cache_free_bulk() when this cache is full.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2db60eed
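      A hedged sketch of the flush loop after this change:

      /* skb_defer_free_flush(), sketched: running from softirq context lets
       * napi_consume_skb() refill the napi_alloc_cache, or fall back to
       * kmem_cache_free_bulk() once that cache is full. */
      while (skb != NULL) {
          next = skb->next;
          napi_consume_skb(skb, 1);    /* non-zero budget: softirq path */
          skb = next;
      }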
    • net: fix possible race in skb_attempt_defer_free() · 97e719a8
      Eric Dumazet authored
      A cpu can observe sd->defer_count reaching 128
      and call smp_call_function_single_async().
      
      The problem is that the remote CPU can clear sd->defer_count
      before the IPI is run/acknowledged.
      
      Other cpus can queue more packets and also decide
      to call smp_call_function_single_async() while the pending
      IPI was not yet delivered.
      
      This is a common issue with smp_call_function_single_async().
      Callers must ensure correct synchronization and serialization.
      
      I triggered this issue while experimenting with a smaller threshold.
      Performing the call to smp_call_function_single_async()
      under sd->defer_lock protection did not solve the problem.
      
      Commit 5a18ceca ("smp: Allow smp_call_function_single_async()
      to insert locked csd") replaced an informative WARN_ON_ONCE()
      with a return of -EBUSY, which is often ignored.
      Testing for CSD_FLAG_LOCK presence is racy anyway.
      
      Fixes: 68822bdf ("net: generalize skb freeing deferral to per-cpu lists")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      97e719a8
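      A hedged sketch of one way to provide the required serialization, a
      per-cpu "IPI already scheduled" flag; this illustrates the pattern the
      message calls for and is not necessarily the exact upstream fix:

      /* Sender side: only one CPU wins the right to send the IPI. */
      if (kick && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
          smp_call_function_single_async(cpu, &sd->defer_csd);

      /* IPI handler side: re-arm only once the work has been picked up. */
      static void trigger_rx_softirq(void *data)
      {
          struct softnet_data *sd = data;

          __raise_softirq_irqoff(NET_RX_SOFTIRQ);
          smp_store_release(&sd->defer_ipi_scheduled, 0);
      }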
    • net: allow gro_max_size to exceed 65536 · 0fe79f28
      Alexander Duyck authored
      Allow gro_max_size to exceed 65536.
      
      There weren't really any external limitations that prevented this other
      than the fact that IPv4 only supports a 16 bit length field. Since we have
      the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
      exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
      via a constant rather than relying on gro_max_size.
      
      [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0fe79f28
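      A small sketch of the capping rule described above; the helper name is
      illustrative, not the upstream code:

      /* Only TCP over IPv6, which can carry a hop-by-hop jumbogram header,
       * may use the larger per-device gro_max_size; everything else stays
       * capped at the legacy 64K limit. gro_max_size itself is bounded by
       * 8 * 65535 to avoid overflows. */
      static unsigned int gro_size_limit(bool is_ipv6_tcp, unsigned int gro_max_size)
      {
          return is_ipv6_tcp ? gro_max_size : 65536u;
      }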
    • net: allow gso_max_size to exceed 65536 · 7c4e983c
      Alexander Duyck authored
      The code for gso_max_size was added originally to allow for debugging and
      workaround of buggy devices that couldn't support TSO with blocks 64K in
      size. The original reason for limiting it to 64K was that those were the
      existing limits of IPv4 and non-jumbogram IPv6 length fields.
      
      With the addition of Big TCP we can remove this limit and allow the value
      to potentially go up to UINT_MAX and instead be limited by the tso_max_size
      value.
      
      So in order to support this we need to go through and clean up the
      remaining users of the gso_max_size value so that the values will cap at
      64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
      so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
      limit for GSO_MAX_SIZE.
      
      v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                     in a new sk_trim_gso_size() helper.
                     netif_set_tso_max_size() caps the requested TSO size
                     with GSO_MAX_SIZE.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c4e983c
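      The resulting limits, sketched from the description above (values as
      stated in the message; the exact header layout may differ):

      /* 64K remains the cap for legacy/non-TCPv6 flows, while the per-device
       * gso_max_size may now grow up to GSO_MAX_SIZE and is ultimately
       * bounded by the driver's tso_max_size. */
      #define GSO_LEGACY_MAX_SIZE    65536u
      #define GSO_MAX_SIZE           UINT_MAX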
  19. 13 May 2022, 1 commit
  20. 11 May 2022, 3 commits