1. 22 November 2021 (2 commits)
    • skbuff: Switch structure bounds to struct_group() · 03f61041
      Committed by Kees Cook
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      Replace the existing empty member position markers "headers_start" and
      "headers_end" with a struct_group(). This will allow memcpy() and sizeof()
      to more easily reason about sizes, and improve readability.
      
      "pahole" shows no size nor member offset changes to struct sk_buff.
      "objdump -d" shows no object code changes (outside of WARNs affected by
      source line number changes).
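
      As a rough illustration of the struct_group() pattern (a hedged sketch using a
      simplified stand-in struct and helper, not the real sk_buff layout):

          #include <linux/stddef.h>   /* struct_group() */
          #include <linux/string.h>   /* memcpy() */
          #include <linux/types.h>

          /* struct_group() wraps the members so the region can be addressed as
           * one object, without changing size or member offsets. */
          struct demo_buff {
                  void *head;
                  struct_group(headers,            /* replaces the start/end markers */
                          __u8    pkt_type;
                          __u16   protocol;
                          __u32   mark;
                  );
                  unsigned int len;
          };

          /* The whole header region is now copied with a bounded size: */
          static void demo_copy_headers(struct demo_buff *to, const struct demo_buff *from)
          {
                  memcpy(&to->headers, &from->headers, sizeof(to->headers));
          }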
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/*
      Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: Move conditional preprocessor directives out of struct sk_buff · fba84957
      Committed by Kees Cook
      In preparation for using the struct_group() macro in struct sk_buff,
      move the conditional preprocessor directives out of the region of struct
      sk_buff that will be enclosed by struct_group(). While GCC and Clang are
      happy with conditional preprocessor directives here, sparse is not, even
      under -Wno-directive-within-macro[1], as would be seen under a C=1 build:
      
      net/core/filter.c: note: in included file (through include/linux/netlink.h, include/linux/sock_diag.h):
      ./include/linux/skbuff.h:820:1: warning: directive in macro's argument list
      ./include/linux/skbuff.h:822:1: warning: directive in macro's argument list
      ./include/linux/skbuff.h:846:1: warning: directive in macro's argument list
      ./include/linux/skbuff.h:848:1: warning: directive in macro's argument list
      
      Additionally remove empty macro argument definitions and usage.
      
      "objdump -d" shows no object code differences.
      
      [1] https://www.spinics.net/lists/linux-sparse/msg10857.html
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 18 November 2021 (1 commit)
  3. 16 November 2021 (1 commit)
    • tcp: defer skb freeing after socket lock is released · f35f8219
      Committed by Eric Dumazet
      tcp recvmsg() (or rx zerocopy) spends a fair amount of time
      freeing skbs after their payload has been consumed.
      
      A typical ~64KB GRO packet has to release ~45 page
      references, eventually going to page allocator
      for each of them.
      
      Currently, this freeing is performed while socket lock
      is held, meaning that there is a high chance that
      BH handler has to queue incoming packets to tcp socket backlog.
      
      This can cause additional latencies, because the user
      thread has to process the backlog at release_sock() time,
      and while doing so, additional frames can be added
      by BH handler.
      
      This patch adds logic to defer these frees after socket
      lock is released, or directly from BH handler if possible.
      
      Being able to free these skbs from the BH handler helps a lot,
      because it avoids the usual alloc/free asymmetry seen
      when the BH handler and the user thread do not run on the same
      CPU or NUMA node.

      One CPU can now be fully utilized for the kernel->user copy,
      while another CPU handles BH processing and skb/page
      allocs/frees (assuming RFS is not forcing use of a single CPU).
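
      A minimal sketch of the deferral idea (the demo_* names are hypothetical;
      this is not the actual TCP code):

          #include <linux/skbuff.h>

          struct demo_sock {
                  struct sk_buff *defer_list;      /* filled while the socket lock is held */
          };

          /* Under the socket lock: collect the consumed skb instead of freeing it. */
          static void demo_defer_free(struct demo_sock *dsk, struct sk_buff *skb)
          {
                  skb->next = dsk->defer_list;
                  dsk->defer_list = skb;
          }

          /* After release_sock(), or from BH context: free the whole batch at once. */
          static void demo_flush_deferred(struct demo_sock *dsk)
          {
                  struct sk_buff *skb = dsk->defer_list;

                  dsk->defer_list = NULL;
                  while (skb) {
                          struct sk_buff *next = skb->next;

                          __kfree_skb(skb);
                          skb = next;
                  }
          }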
      
      Tested:
       100Gbit NIC
       Max throughput for one TCP_STREAM flow, over 10 runs
      
      MTU : 1500
      Before: 55 Gbit
      After:  66 Gbit
      
      MTU : 4096+(headers)
      Before: 82 Gbit
      After:  95 Gbit
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 15 November 2021 (1 commit)
  5. 03 November 2021 (2 commits)
    • net: avoid double accounting for pure zerocopy skbs · 9b65b17d
      Committed by Talal Ahmad
      Track skbs containing only zerocopy data and avoid charging them to
      kernel memory to correctly account the memory utilization for
      msg_zerocopy. All of the data in such skbs is held in user pages which
      are already accounted to user. Before this change, they are charged
      again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
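
      A hedged sketch of the resulting accounting rule (simplified; not the
      literal diff):

          #include <net/sock.h>
          #include <linux/skbuff.h>

          /* Only charge send-queue memory for skbs that carry copied data; a
           * pure-zerocopy payload already lives in user pages. */
          static void demo_charge_copied(struct sock *sk, struct sk_buff *skb, int copied)
          {
                  if (!skb_zcopy_pure(skb))        /* tests SKBFL_PURE_ZEROCOPY */
                          sk_mem_charge(sk, copied);
          }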
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      
      With this commit we no longer see the warning triggered by the previous
      version of this change, which resulted in the revert commit 84882cf7.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add and use skb_unclone_keeptruesize() helper · c4777efa
      Committed by Eric Dumazet
      While commit 097b9146 ("net: fix up truesize of cloned
      skb in skb_prepare_for_shift()") fixed immediate issues found
      when KFENCE was enabled/tested, there are still similar issues,
      when tcp_trim_head() hits KFENCE while the master skb
      is cloned.
      
      This happens under heavy networking TX workloads,
      when the TX completion might be delayed after incoming ACK.
      
      This patch fixes the WARNING in sk_stream_kill_queues
      when sk->sk_mem_queued/sk->sk_forward_alloc are not zero.
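
      Roughly, the new helper looks like this hedged sketch (simplified, not the
      exact implementation):

          #include <linux/skbuff.h>

          /* Unclone the skb but keep the original truesize, so write-queue
           * accounting stays balanced even if the new head comes from a
           * differently sized (e.g. KFENCE) allocation. */
          static int demo_unclone_keeptruesize(struct sk_buff *skb, gfp_t pri)
          {
                  if (skb_cloned(skb)) {
                          unsigned int save = skb->truesize;
                          int res = pskb_expand_head(skb, 0, 0, pri);

                          skb->truesize = save;
                          return res;
                  }
                  return 0;
          }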
      
      Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. 02 November 2021 (2 commits)
    • Revert "net: avoid double accounting for pure zerocopy skbs" · 84882cf7
      Committed by Jakub Kicinski
      This reverts commit f1a456f8.
      
        WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
        CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S                5.15.0-04194-gd852503f7711 #16
        RIP: 0010:skb_try_coalesce+0x78b/0x7e0
        Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
        RSP: 0018:ffff88881f449688 EFLAGS: 00010282
        RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e
        RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0
        RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d
        R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0
        R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0
        FS:  00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         tcp_try_coalesce+0xeb/0x290
         ? tcp_parse_options+0x610/0x610
         ? mark_held_locks+0x79/0xa0
         tcp_queue_rcv+0x69/0x2f0
         tcp_rcv_established+0xa49/0xd40
         ? tcp_data_queue+0x18a0/0x18a0
         tcp_v6_do_rcv+0x1c9/0x880
         ? rt6_mtu_change_route+0x100/0x100
         tcp_v6_rcv+0x1624/0x1830
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Committed by Talal Ahmad
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages which are already
      accounted to user. Before this change, they are charged again in
      kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 29 October 2021 (1 commit)
  8. 15 October 2021 (1 commit)
    • netfilter: Introduce egress hook · 42df6e1d
      Committed by Lukas Wunner
      Support classifying packets with netfilter on egress to satisfy user
      requirements such as:
      * outbound security policies for containers (Laura)
      * filtering and mangling intra-node Direct Server Return (DSR) traffic
        on a load balancer (Laura)
      * filtering locally generated traffic coming in through AF_PACKET,
        such as local ARP traffic generated for clustering purposes or DHCP
        (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
      * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
        and gPTP with nftables (Pablo)
      * in the future: in-kernel NAT64/NAT46 (Pablo)
      
      The egress hook introduced herein complements the ingress hook added by
      commit e687ad60 ("netfilter: add netfilter ingress hook after
      handle_ing() under unique static key").  A patch for nftables to hook up
      egress rules from user space has been submitted separately, so users may
      immediately take advantage of the feature.
      
      Alternatively or in addition to netfilter, packets can be classified
      with traffic control (tc).  On ingress, packets are classified first by
      tc, then by netfilter.  On egress, the order is reversed for symmetry.
      Conceptually, tc and netfilter can be thought of as layers, with
      netfilter layered above tc.
      
      Traffic control is capable of redirecting packets to another interface
      (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
      host namespace to a container via a veth connection:
      tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
      
      In this case, netfilter egress classifying is not performed when leaving
      the host namespace!  That's because the packet is still on the tc layer.
      If tc redirects the packet to a physical interface in the host namespace
      such that it leaves the system, the packet is never subjected to
      netfilter egress classifying.  That is only logical since it hasn't
      passed through netfilter ingress classifying either.
      
      Packets can alternatively be redirected at the netfilter layer using
      nft fwd.  Such a packet *is* subjected to netfilter egress classifying
      since it has reached the netfilter layer.
      
      Internally, the skb->nf_skip_egress flag controls whether netfilter is
      invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
      be called recursively by tunnel drivers such as vxlan, the flag is
      reverted to false after sch_handle_egress().  This ensures that
      netfilter is applied both on the overlay and underlying network.
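
      A hypothetical sketch of that flag handling (the demo_* names are invented
      and this is not the actual __dev_queue_xmit() code; it only mirrors the
      behaviour described above):

          #include <linux/skbuff.h>

          bool demo_run_egress_hooks(struct sk_buff *skb);  /* stand-in for the nf egress hooks */

          /* Returns false if netfilter dropped the packet. */
          static bool demo_egress_classify(struct sk_buff *skb)
          {
          #ifdef CONFIG_NETFILTER_EGRESS
                  if (skb->nf_skip_egress) {
                          /* Cleared so a recursive xmit (e.g. a vxlan underlay)
                           * is classified again on the lower device. */
                          skb->nf_skip_egress = false;
                          return true;
                  }
                  return demo_run_egress_hooks(skb);
          #else
                  return true;
          #endif
          }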
      
      Interaction between tc and netfilter is possible by setting and querying
      skb->mark.
      
      If netfilter egress classifying is not enabled on any interface, it is
      patched out of the data path by way of a static_key and doesn't make a
      performance difference that is discernible from noise:
      
      Before:             1537 1538 1538 1537 1538 1537 Mb/sec
      After:              1536 1534 1539 1539 1539 1540 Mb/sec
      Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
      After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
      Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
      After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec
      
      When netfilter egress classifying is enabled on at least one interface,
      a minimal performance penalty is incurred for every egress packet, even
      if the interface it's transmitted over doesn't have any netfilter egress
      rules configured.  That is caused by checking dev->nf_hooks_egress
      against NULL.
      
      Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
      ip link add dev foo type dummy
      ip link set dev foo up
      modprobe pktgen
      echo "add_device foo" > /proc/net/pktgen/kpktgend_3
      samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
      
      Accept all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
      
      Drop all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
      
      Apply this patch when measuring packet drops to avoid errors in dmesg:
      https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Laura García Liébana <nevola@gmail.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  9. 09 September 2021 (1 commit)
    • net/af_unix: fix a data-race in unix_dgram_poll · 04f08eb4
      Committed by Eric Dumazet
      syzbot reported another data-race in af_unix [1]
      
      Let's change __skb_insert() to use WRITE_ONCE() when changing
      the skb head qlen.

      Also, change unix_dgram_poll() to use the lockless version
      of unix_recvq_full().

      It is very possible we can switch all/most unix_recvq_full() callers
      to the lockless version; this will be done in a future kernel version.
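
      The essence of the fix, sketched (demo_* wrappers added for illustration;
      the real change annotates __skb_insert() itself):

          #include <linux/skbuff.h>

          /* Writer side: annotate the qlen update so lockless readers don't race. */
          static void demo_queue_len_inc(struct sk_buff_head *list)
          {
                  WRITE_ONCE(list->qlen, list->qlen + 1);
          }

          /* Reader side, without taking the queue lock: */
          static __u32 demo_queue_len_lockless(const struct sk_buff_head *list)
          {
                  return READ_ONCE(list->qlen);
          }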
      
      [1] HEAD commit: 8596e589
      
      BUG: KCSAN: data-race in skb_queue_tail / unix_dgram_poll
      
      write to 0xffff88814eeb24e0 of 4 bytes by task 25815 on cpu 0:
       __skb_insert include/linux/skbuff.h:1938 [inline]
       __skb_queue_before include/linux/skbuff.h:2043 [inline]
       __skb_queue_tail include/linux/skbuff.h:2076 [inline]
       skb_queue_tail+0x80/0xa0 net/core/skbuff.c:3264
       unix_dgram_sendmsg+0xff2/0x1600 net/unix/af_unix.c:1850
       sock_sendmsg_nosec net/socket.c:703 [inline]
       sock_sendmsg net/socket.c:723 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
       ___sys_sendmsg net/socket.c:2446 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2532
       __do_sys_sendmmsg net/socket.c:2561 [inline]
       __se_sys_sendmmsg net/socket.c:2558 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2558
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1:
       skb_queue_len include/linux/skbuff.h:1869 [inline]
       unix_recvq_full net/unix/af_unix.c:194 [inline]
       unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777
       sock_poll+0x23e/0x260 net/socket.c:1288
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll fs/eventpoll.c:846 [inline]
       ep_send_events fs/eventpoll.c:1683 [inline]
       ep_poll fs/eventpoll.c:1798 [inline]
       do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226
       __do_sys_epoll_wait fs/eventpoll.c:2238 [inline]
       __se_sys_epoll_wait fs/eventpoll.c:2233 [inline]
       __x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000001b -> 0x00000001
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G        W         5.14.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 86b18aaa ("skbuff: fix a data race in skb_queue_len()")
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 10 August 2021 (1 commit)
  11. 03 August 2021 (1 commit)
  12. 31 July 2021 (1 commit)
  13. 29 July 2021 (2 commits)
  14. 08 July 2021 (1 commit)
    • bpf: cpumap: Implement generic cpumap · 11941f8a
      Committed by Kumar Kartikeya Dwivedi
      This change implements CPUMAP redirect support for generic XDP programs.
      The idea is to reuse the cpu map entry's queue that is used to push
      native xdp frames for redirecting skb to a different CPU. This will
      match native XDP behavior (in that RPS is invoked again for packet
      reinjected into networking stack).
      
      To be able to determine whether the incoming skb is from the driver or
      cpumap, we reuse skb->redirected bit that skips generic XDP processing
      when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
      it has been lifted and it is always available.
      
      From the redirect side, we add the skb to ptr_ring with its lowest bit
      set to 1.  This should be safe as skb is not 1-byte aligned. This allows
      kthread to discern between xdp_frames and sk_buff. On consumption of the
      ptr_ring item, the lowest bit is unset.
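
      The bit-tagging trick, sketched with illustrative helper names:

          #include <linux/skbuff.h>

          /* skbs are never 1-byte aligned, so bit 0 is free to act as a type tag
           * when skbs and xdp_frames share the same ptr_ring. */
          static void *demo_tag_skb(struct sk_buff *skb)
          {
                  return (void *)((unsigned long)skb | 0x1UL);
          }

          static bool demo_ptr_is_skb(void *ptr)
          {
                  return ((unsigned long)ptr & 0x1UL) != 0;
          }

          static struct sk_buff *demo_untag_skb(void *ptr)
          {
                  return (struct sk_buff *)((unsigned long)ptr & ~0x1UL);
          }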
      
      In the end, the skb is simply added to the list that kthread is anyway
      going to maintain for xdp_frames converted to skb, and then received
      again by using netif_receive_skb_list.
      
      Bulking optimization for generic cpumap is left as an exercise for a
      future patch for now.
      
      Since cpumap entry progs are now supported, also remove check in
      generic_xdp_install for the cpumap.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
  15. 08 June 2021 (2 commits)
    • page_pool: Allow drivers to hint on SKB recycling · 6a5bcd84
      Committed by Ilias Apalodimas
      Up to now several high speed NICs have custom mechanisms of recycling
      the allocated memory they use for their payloads.
      Our page_pool API already has recycling capabilities that are always
      used when we are running in 'XDP mode'. So let's tweak the API and the
      kernel network stack slightly and allow the recycling to happen even
      during the standard operation.
      The API doesn't take into account 'split page' policies used by those
      drivers currently, but can be extended once we have users for that.
      
      The idea is to be able to intercept the packet on skb_release_data().
      If it's a buffer coming from our page_pool API recycle it back to the
      pool for further usage or just release the packet entirely.
      
      To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
      a field in struct page (page->pp) to store the page_pool pointer.
      Storing the information in page->pp allows us to recycle both SKBs and
      their fragments.
      We could have skipped the skb bit entirely, since identical information
      can be derived from struct page. However, in an effort to affect the free
      path as little as possible, reading a single bit in the skb, which is
      already in cache, is better than trying to derive the same information
      from the page-stored data.

      The driver or page_pool has to take care of the sync operations on its own
      during buffer recycling, since the buffer is, after opting in to
      recycling, never unmapped.

      Since the gain for drivers depends on the architecture, we are not
      enabling recycling by default when the page_pool API is used by a driver.
      In order to enable recycling, the driver must call skb_mark_for_recycle()
      to store the information we need for recycling in page->pp and set the
      recycling bit, or page_pool_store_mem_info() for a fragment.
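
      A hedged usage sketch for a driver Rx path (variable and function names are
      illustrative, and the exact skb_mark_for_recycle() signature and headers
      have varied across kernel versions):

          #include <linux/mm.h>
          #include <linux/skbuff.h>
          #include <net/page_pool.h>

          /* Build an skb around a page_pool-backed buffer and opt in to recycling. */
          static struct sk_buff *demo_build_rx_skb(struct page_pool *pool,
                                                   struct page *page, unsigned int len)
          {
                  struct sk_buff *skb = build_skb(page_address(page), PAGE_SIZE);

                  if (!skb)
                          return NULL;
                  skb_put(skb, len);
                  /* Records the pool in page->pp and sets skb->pp_recycle, so
                   * skb_release_data() can hand the buffer back to the pool. */
                  skb_mark_for_recycle(skb, page, pool);
                  return skb;
          }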
      Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Co-developed-by: Matteo Croce <mcroce@microsoft.com>
      Signed-off-by: Matteo Croce <mcroce@microsoft.com>
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: add a parameter to __skb_frag_unref · c420c989
      Committed by Matteo Croce
      This is a prerequisite patch; the next one enables recycling of
      skbs and fragments. Add an extra argument to __skb_frag_unref() to
      handle recycling, and update the current users of the function accordingly.
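
      For example, existing call sites simply gain the new argument (a sketch;
      recycling itself is only wired up by the follow-up patch):

          /* before */  __skb_frag_unref(&shinfo->frags[i]);
          /* after  */  __skb_frag_unref(&shinfo->frags[i], false);  /* false: no recycling yet */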
      Signed-off-by: Matteo Croce <mcroce@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 02 April 2021 (1 commit)
  17. 17 March 2021 (1 commit)
  18. 15 March 2021 (3 commits)
  19. 12 March 2021 (1 commit)
    • tcp: plug skb_still_in_host_queue() to TSQ · f4dae54e
      Committed by Eric Dumazet
      Jakub and Neil reported an increase of RTO timers whenever
      TX completions are delayed a bit more (by increasing
      NIC TX coalescing parameters).

      The main issue is that the TCP stack has logic preventing a packet
      from being retransmitted if the prior clone has not yet been
      orphaned or freed.

      This logic came with commit 1f3279ae ("tcp: avoid
      retransmits of TCP packets hanging in host queues").

      Thankfully, in the case where skb_still_in_host_queue() detects that
      the initial clone is still in flight, it can use the TSQ logic, which
      will eventually retry later, at the moment the clone
      is freed or orphaned.
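
      Roughly, the check becomes something like this hedged sketch (simplified;
      the demo_ prefix is added and the real code is more careful about memory
      ordering):

          #include <net/tcp.h>

          /* If the earlier clone is still in a host queue, flag the socket so the
           * TSQ machinery retries the retransmit once the clone is released. */
          static bool demo_still_in_host_queue(struct sock *sk, const struct sk_buff *skb)
          {
                  if (skb_fclone_busy(sk, skb)) {
                          set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
                          return true;     /* caller skips this retransmit for now */
                  }
                  return false;
          }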
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Neil Spring <ntspring@fb.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 04 March 2021 (1 commit)
  21. 27 February 2021 (1 commit)
  22. 14 February 2021 (3 commits)
    • skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing · 9243adfc
      Committed by Alexander Lobakin
      napi_frags_finish() and napi_skb_finish() can only be called inside
      NAPI Rx context, so we can feed NAPI cache with skbuff_heads that
      got NAPI_MERGED_FREE verdict instead of immediate freeing.
      Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
      and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
      to NAPI cache.
      As many drivers call napi_alloc_skb()/napi_get_frags() on their
      receive path, this becomes especially useful.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads · f450d539
      Committed by Alexander Lobakin
      Instead of just bulk-flushing skbuff_heads queued up through
      napi_consume_skb() or __kfree_skb_defer(), try to reuse them
      on allocation path.
      If the cache is empty on allocation, bulk-allocate the first
      16 elements, which is more efficient than per-skb allocation.
      If the cache is full on freeing, bulk-wipe the second half of
      the cache (32 elements).
      This also includes custom KASAN poisoning/unpoisoning to be
      double sure there are no use-after-free cases.
      
      To not change current behaviour, introduce a new function,
      napi_build_skb(), to optionally use a new approach later
      in drivers.
      
      Note on selected bulk size, 16:
       - this equals to XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE
         and especially VETH_XDP_BATCH, which is also used to
         bulk-allocate skbuff_heads and was tested on powerful
         setups;
       - this also showed the best performance in the actual
         test series (from the array of {8, 16, 32}).
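
      A hedged usage sketch: a driver running in NAPI context can opt in via the
      new helper (the demo_* names are illustrative).

          #include <linux/skbuff.h>

          /* Build the skb around an existing buffer, reusing skbuff_heads from
           * the per-CPU NAPI cache instead of hitting the slab each time. */
          static struct sk_buff *demo_rx_build(void *data, unsigned int truesize,
                                               unsigned int len)
          {
                  struct sk_buff *skb = napi_build_skb(data, truesize);

                  if (skb)
                          skb_put(skb, len);
                  return skb;
          }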
      
      Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves
      Suggested-by: Eric Dumazet <edumazet@google.com>   # KASAN poisoning
      Cc: Dmitry Vyukov <dvyukov@google.com>             # Help with KASAN
      Cc: Paolo Abeni <pabeni@redhat.com>                # Reduced batch size
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: remove __kfree_skb_flush() · fec6e49b
      Committed by Alexander Lobakin
      This function isn't really needed, as the NAPI skb queue gets bulk-freed
      anyway when there's no more room, and it may even reduce the efficiency
      of bulk operations.
      It will be even less needed after reusing the skb cache on the allocation
      path, so remove it and thereby lighten network softirqs a bit.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 07 February 2021 (1 commit)
    • net: Introduce {netdev,napi}_alloc_frag_align() · 3f6e687d
      Committed by Kevin Hao
      The current implementation of {netdev,napi}_alloc_frag() does not give
      any alignment guarantee for the returned buffer address. But some
      hardware does require the DMA buffer to be aligned correctly, so we
      would have to use workarounds like the following if the buffers
      allocated by {netdev,napi}_alloc_frag() are used by such hardware for DMA.
          buf = napi_alloc_frag(really_needed_size + align);
          buf = PTR_ALIGN(buf, align);

      This code looks ugly and would waste a lot of memory if the buffers are
      used by a network driver for TX/RX. We have added alignment support to
      the page_frag functions, so add the corresponding
      {netdev,napi}_alloc_frag_align() functions.
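
      With the new helpers the alignment is requested directly (align is
      expected to be a power of two), e.g.:
          buf = napi_alloc_frag_align(really_needed_size, align);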
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  24. 05 February 2021 (2 commits)
  25. 23 January 2021 (1 commit)
  26. 21 January 2021 (1 commit)
  27. 20 January 2021 (1 commit)
  28. 12 January 2021 (2 commits)
  29. 08 January 2021 (1 commit)