1. 20 Feb 2022, 5 commits
  2. 07 Feb 2022, 6 commits
  3. 28 Jan 2022, 1 commit
  4. 27 Jan 2022, 1 commit
  5. 22 Jan 2022, 1 commit
  6. 10 Jan 2022, 4 commits
  7. 18 Dec 2021, 2 commits
  8. 10 Dec 2021, 1 commit
    • skbuff: Extract list pointers to silence compiler warnings · 1a2fb220
      Kees Cook committed
      Under both -Warray-bounds and the object_size sanitizer, the compiler is
      upset about accessing prev/next of sk_buff when the object it thinks it
      is coming from is sk_buff_head. The warning is a false positive due to
      the compiler taking a conservative approach, opting to warn at casting
      time rather than access time.
      
      However, in support of enabling -Warray-bounds globally (which has
      found many real bugs), arrange things for sk_buff so that the compiler
      can unambiguously see that there is no intention to access anything
      except prev/next.  Introduce and cast to a separate struct sk_buff_list,
      which contains _only_ the first two fields, silencing the warnings:
      
      In file included from ./include/net/net_namespace.h:39,
                       from ./include/linux/netdevice.h:37,
                       from net/core/netpoll.c:17:
      net/core/netpoll.c: In function 'refill_skbs':
      ./include/linux/skbuff.h:2086:9: warning: array subscript 'struct sk_buff[0]' is partly outside array bounds of 'struct sk_buff_head[1]' [-Warray-bounds]
       2086 |         __skb_insert(newsk, next->prev, next, list);
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      net/core/netpoll.c:49:28: note: while referencing 'skb_pool'
         49 | static struct sk_buff_head skb_pool;
            |                            ^~~~~~~~
      
      This change results in no executable instruction differences.
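
      A minimal sketch of the resulting shape (abridged; only the lines
      relevant to the cast are shown):

      /* Layout-compatible head of struct sk_buff: just the list pointers. */
      struct sk_buff_list {
              struct sk_buff *next;
              struct sk_buff *prev;
      };

      static inline void __skb_insert(struct sk_buff *newsk,
                                      struct sk_buff *prev, struct sk_buff *next,
                                      struct sk_buff_head *list)
      {
              WRITE_ONCE(newsk->next, next);
              WRITE_ONCE(newsk->prev, prev);
              /* Casting through sk_buff_list tells the compiler that only
               * the two list pointers of a sk_buff_head are ever touched. */
              WRITE_ONCE(((struct sk_buff_list *)next)->prev, newsk);
              WRITE_ONCE(((struct sk_buff_list *)prev)->next, newsk);
              WRITE_ONCE(list->qlen, list->qlen + 1);
      }
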
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20211207062758.2324338-1-keescook@chromium.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 08 Dec 2021, 1 commit
  10. 07 Dec 2021, 1 commit
  11. 26 Nov 2021, 1 commit
  12. 22 Nov 2021, 2 commits
    • skbuff: Switch structure bounds to struct_group() · 03f61041
      Kees Cook committed
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      Replace the existing empty member position markers "headers_start" and
      "headers_end" with a struct_group(). This will allow memcpy() and sizeof()
      to more easily reason about sizes, and improve readability.
      
      "pahole" shows no size nor member offset changes to struct sk_buff.
      "objdump -d" shows no object code changes (outside of WARNs affected by
      source line number changes).
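
      A sketch of the transformation (field names illustrative, not the
      complete sk_buff layout):

      /* Before: zero-size markers delimit the byte range to copy. */
              __u32                   headers_start[0];
              __u8                    pkt_type:3;
              /* ... more header fields ... */
              __u32                   headers_end[0];

      /* After: struct_group() names the same range without changing the
       * layout, so the copy can use an ordinary sizeof(): */
              struct_group(headers,
              __u8                    pkt_type:3;
              /* ... more header fields ... */
              );

              memcpy(&new->headers, &old->headers, sizeof(new->headers));
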
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/*
      Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: Move conditional preprocessor directives out of struct sk_buff · fba84957
      Kees Cook committed
      In preparation for using the struct_group() macro in struct sk_buff,
      move the conditional preprocessor directives out of the region of struct
      sk_buff that will be enclosed by struct_group(). While GCC and Clang are
      happy with conditional preprocessor directives here, sparse is not, even
      under -Wno-directive-within-macro[1], as would be seen under a C=1 build:
      
      net/core/filter.c: note: in included file (through include/linux/netlink.h, include/linux/sock_diag.h):
      ./include/linux/skbuff.h:820:1: warning: directive in macro's argument list
      ./include/linux/skbuff.h:822:1: warning: directive in macro's argument list
      ./include/linux/skbuff.h:846:1: warning: directive in macro's argument list
      ./include/linux/skbuff.h:848:1: warning: directive in macro's argument list
      
      Additionally remove empty macro argument definitions and usage.
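
      To illustrate the sparse complaint: once struct_group() wraps the
      header fields, any conditional block inside it becomes a directive
      in a macro's argument list (fields shown for illustration):

      struct_group(headers,
              __u8    pkt_type:3;
      #ifdef CONFIG_NET_SWITCHDEV     /* directive inside macro argument */
              __u8    offload_fwd_mark:1;
      #endif
      );

      Hoisting the #ifdef blocks out of the region to be grouped avoids
      this while keeping GCC and Clang behavior unchanged.
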
      
      "objdump -d" shows no object code differences.
      
      [1] https://www.spinics.net/lists/linux-sparse/msg10857.html
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 18 Nov 2021, 1 commit
  14. 16 Nov 2021, 1 commit
    • tcp: defer skb freeing after socket lock is released · f35f8219
      Eric Dumazet committed
      tcp recvmsg() (or rx zerocopy) spends a fair amount of time
      freeing skbs after their payload has been consumed.
      
      A typical ~64KB GRO packet has to release ~45 page
      references, eventually going to page allocator
      for each of them.
      
      Currently, this freeing is performed while socket lock
      is held, meaning that there is a high chance that
      BH handler has to queue incoming packets to tcp socket backlog.
      
      This can cause additional latencies, because the user
      thread has to process the backlog at release_sock() time,
      and while doing so, additional frames can be added
      by BH handler.
      
      This patch adds logic to defer these frees after socket
      lock is released, or directly from BH handler if possible.
      
      Being able to free these skbs from BH handler helps a lot,
      because this avoids the usual alloc/free asymmetry,
      when BH handler and user thread do not run on same cpu or
      NUMA node.
      
      One cpu can now be fully utilized for the kernel->user copy,
      and another cpu is handling BH processing and skb/page
      allocs/frees (assuming RFS is not forcing use of a single CPU).
      
      Tested:
       100Gbit NIC
       Max throughput for one TCP_STREAM flow, over 10 runs
      
      MTU : 1500
      Before: 55 Gbit
      After:  66 Gbit
      
      MTU : 4096+(headers)
      Before: 82 Gbit
      After:  95 Gbit
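
      A rough sketch of the deferral pattern described above (names follow
      the patch; simplified, with error handling omitted):

      /* Instead of __kfree_skb() while the socket lock is held, push the
       * skb onto a per-socket lockless list... */
      static void tcp_defer_free(struct sock *sk, struct sk_buff *skb)
      {
              llist_add(&skb->ll_node, &sk->defer_list);
      }

      /* ...and flush that list once the lock has been released: */
      static void sk_defer_free_flush(struct sock *sk)
      {
              struct llist_node *node = llist_del_all(&sk->defer_list);
              struct sk_buff *skb, *next;

              llist_for_each_entry_safe(skb, next, node, ll_node)
                      __kfree_skb(skb);
      }
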
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 15 Nov 2021, 1 commit
  16. 03 Nov 2021, 2 commits
    • net: avoid double accounting for pure zerocopy skbs · 9b65b17d
      Talal Ahmad committed
      Track skbs containing only zerocopy data and avoid charging them to
      kernel memory to correctly account the memory utilization for
      msg_zerocopy. All of the data in such skbs is held in user pages which
      are already accounted to user. Before this change, they are charged
      again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
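
      A sketch of the resulting accounting logic (helper name per the
      patch; surrounding code abridged):

      static inline bool skb_zcopy_pure(const struct sk_buff *skb)
      {
              return skb_shinfo(skb)->flags & SKBFL_PURE_ZEROCOPY;
      }

      /* In sk_wmem_free_skb(): uncharge only what was actually charged. */
      if (!skb_zcopy_pure(skb))
              sk_mem_uncharge(sk, skb->truesize);
      else
              sk_mem_uncharge(sk, SKB_TRUESIZE(skb_end_offset(skb)));
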
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      
      With this commit we no longer see the warning seen with the previous
      attempt, which resulted in the revert in commit 84882cf7.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add and use skb_unclone_keeptruesize() helper · c4777efa
      Eric Dumazet committed
      While commit 097b9146 ("net: fix up truesize of cloned
      skb in skb_prepare_for_shift()") fixed immediate issues found
      when KFENCE was enabled/tested, there are still similar issues,
      when tcp_trim_head() hits KFENCE while the master skb
      is cloned.
      
      This happens under heavy networking TX workloads,
      when the TX completion might be delayed after incoming ACK.
      
      This patch fixes the WARNING in sk_stream_kill_queues
      when sk->sk_wmem_queued/sk->sk_forward_alloc are not zero.
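
      A sketch of the new helper as described (close to the upstream
      definition; checks simplified):

      static inline int skb_unclone_keeptruesize(struct sk_buff *skb, gfp_t pri)
      {
              if (skb_cloned(skb)) {
                      /* Uncloning reallocates the head, which would normally
                       * change truesize; preserve the caller-visible value so
                       * memory accounting stays balanced. */
                      unsigned int save = skb->truesize;
                      int res;

                      res = pskb_expand_head(skb, 0, 0, pri);
                      skb->truesize = save;
                      return res;
              }
              return 0;
      }
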
      
      Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  17. 02 Nov 2021, 2 commits
    • Revert "net: avoid double accounting for pure zerocopy skbs" · 84882cf7
      Jakub Kicinski committed
      This reverts commit f1a456f8.
      
        WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
        CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S                5.15.0-04194-gd852503f7711 #16
        RIP: 0010:skb_try_coalesce+0x78b/0x7e0
        Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
        RSP: 0018:ffff88881f449688 EFLAGS: 00010282
        RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e
        RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0
        RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d
        R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0
        R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0
        FS:  00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         tcp_try_coalesce+0xeb/0x290
         ? tcp_parse_options+0x610/0x610
         ? mark_held_locks+0x79/0xa0
         tcp_queue_rcv+0x69/0x2f0
         tcp_rcv_established+0xa49/0xd40
         ? tcp_data_queue+0x18a0/0x18a0
         tcp_v6_do_rcv+0x1c9/0x880
         ? rt6_mtu_change_route+0x100/0x100
         tcp_v6_rcv+0x1624/0x1830
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Talal Ahmad committed
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages which are already
      accounted to user. Before this change, they are charged again in
      kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  18. 29 Oct 2021, 1 commit
  19. 15 Oct 2021, 1 commit
    • netfilter: Introduce egress hook · 42df6e1d
      Lukas Wunner committed
      Support classifying packets with netfilter on egress to satisfy user
      requirements such as:
      * outbound security policies for containers (Laura)
      * filtering and mangling intra-node Direct Server Return (DSR) traffic
        on a load balancer (Laura)
      * filtering locally generated traffic coming in through AF_PACKET,
        such as local ARP traffic generated for clustering purposes or DHCP
        (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
      * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
        and gPTP with nftables (Pablo)
      * in the future: in-kernel NAT64/NAT46 (Pablo)
      
      The egress hook introduced herein complements the ingress hook added by
      commit e687ad60 ("netfilter: add netfilter ingress hook after
      handle_ing() under unique static key").  A patch for nftables to hook up
      egress rules from user space has been submitted separately, so users may
      immediately take advantage of the feature.
      
      Alternatively or in addition to netfilter, packets can be classified
      with traffic control (tc).  On ingress, packets are classified first by
      tc, then by netfilter.  On egress, the order is reversed for symmetry.
      Conceptually, tc and netfilter can be thought of as layers, with
      netfilter layered above tc.
      
      Traffic control is capable of redirecting packets to another interface
      (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
      host namespace to a container via a veth connection:
      tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
      
      In this case, netfilter egress classifying is not performed when leaving
      the host namespace!  That's because the packet is still on the tc layer.
      If tc redirects the packet to a physical interface in the host namespace
      such that it leaves the system, the packet is never subjected to
      netfilter egress classifying.  That is only logical since it hasn't
      passed through netfilter ingress classifying either.
      
      Packets can alternatively be redirected at the netfilter layer using
      nft fwd.  Such a packet *is* subjected to netfilter egress classifying
      since it has reached the netfilter layer.
      
      Internally, the skb->nf_skip_egress flag controls whether netfilter is
      invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
      be called recursively by tunnel drivers such as vxlan, the flag is
      reverted to false after sch_handle_egress().  This ensures that
      netfilter is applied both on the overlay and underlying network.
      
      Interaction between tc and netfilter is possible by setting and querying
      skb->mark.
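
      For example (illustrative commands; assumes an nftables release with
      egress support): set a mark from an nft egress chain, then act on it
      in tc, which runs after netfilter on egress:

      nft add table netdev t
      nft add chain netdev t out '{ type filter hook egress device "foo" priority 0; }'
      nft add rule netdev t out ip daddr 1.1.1.1 meta mark set 42
      tc qdisc add dev foo clsact
      tc filter add dev foo egress handle 42 fw action drop
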
      
      If netfilter egress classifying is not enabled on any interface, it is
      patched out of the data path by way of a static_key and doesn't make a
      performance difference that is discernible from noise:
      
      Before:             1537 1538 1538 1537 1538 1537 Mb/sec
      After:              1536 1534 1539 1539 1539 1540 Mb/sec
      Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
      After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
      Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
      After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec
      
      When netfilter egress classifying is enabled on at least one interface,
      a minimal performance penalty is incurred for every egress packet, even
      if the interface it's transmitted over doesn't have any netfilter egress
      rules configured.  That is caused by checking dev->nf_hooks_egress
      against NULL.
      
      Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
      ip link add dev foo type dummy
      ip link set dev foo up
      modprobe pktgen
      echo "add_device foo" > /proc/net/pktgen/kpktgend_3
      samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
      
      Accept all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
      
      Drop all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
      
      Apply this patch when measuring packet drops to avoid errors in dmesg:
      https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Laura García Liébana <nevola@gmail.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  20. 09 Sep 2021, 1 commit
    • net/af_unix: fix a data-race in unix_dgram_poll · 04f08eb4
      Eric Dumazet committed
      syzbot reported another data-race in af_unix [1]
      
      Let's change __skb_insert() to use WRITE_ONCE() when changing
      skb head qlen.
      
      Also, change unix_dgram_poll() to use the lockless version
      of unix_recvq_full().
      
      It is very possible we can switch all/most unix_recvq_full()
      callers to the lockless version; this will be done in a future
      kernel version.
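
      A sketch of the change (the lockless read side, via
      skb_queue_len_lockless(), already existed):

      /* In __skb_insert(), the queue-length update becomes a marked write
       * so the concurrent lockless read is a documented, intentional race: */
      WRITE_ONCE(list->qlen, list->qlen + 1);     /* was: list->qlen++; */

      /* Paired lockless reader (used via the lockless unix_recvq_full() variant): */
      static inline __u32 skb_queue_len_lockless(const struct sk_buff_head *list)
      {
              return READ_ONCE(list->qlen);
      }
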
      
      [1] HEAD commit: 8596e589
      
      BUG: KCSAN: data-race in skb_queue_tail / unix_dgram_poll
      
      write to 0xffff88814eeb24e0 of 4 bytes by task 25815 on cpu 0:
       __skb_insert include/linux/skbuff.h:1938 [inline]
       __skb_queue_before include/linux/skbuff.h:2043 [inline]
       __skb_queue_tail include/linux/skbuff.h:2076 [inline]
       skb_queue_tail+0x80/0xa0 net/core/skbuff.c:3264
       unix_dgram_sendmsg+0xff2/0x1600 net/unix/af_unix.c:1850
       sock_sendmsg_nosec net/socket.c:703 [inline]
       sock_sendmsg net/socket.c:723 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
       ___sys_sendmsg net/socket.c:2446 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2532
       __do_sys_sendmmsg net/socket.c:2561 [inline]
       __se_sys_sendmmsg net/socket.c:2558 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2558
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1:
       skb_queue_len include/linux/skbuff.h:1869 [inline]
       unix_recvq_full net/unix/af_unix.c:194 [inline]
       unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777
       sock_poll+0x23e/0x260 net/socket.c:1288
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll fs/eventpoll.c:846 [inline]
       ep_send_events fs/eventpoll.c:1683 [inline]
       ep_poll fs/eventpoll.c:1798 [inline]
       do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226
       __do_sys_epoll_wait fs/eventpoll.c:2238 [inline]
       __se_sys_epoll_wait fs/eventpoll.c:2233 [inline]
       __x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000001b -> 0x00000001
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G        W         5.14.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 86b18aaa ("skbuff: fix a data race in skb_queue_len()")
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 10 Aug 2021, 1 commit
  22. 03 Aug 2021, 1 commit
  23. 31 Jul 2021, 1 commit
  24. 29 Jul 2021, 1 commit