1. 16 Nov 2021 · 2 commits
  2. 03 Nov 2021 · 2 commits
    • net: avoid double accounting for pure zerocopy skbs · 9b65b17d
      Committed by Talal Ahmad
      Track skbs containing only zerocopy data and avoid charging them to
      kernel memory, so that memory utilization for msg_zerocopy is
      accounted correctly. All of the data in such skbs is held in user
      pages, which are already accounted to the user. Before this change,
      they were charged again in the kernel in __zerocopy_sg_from_iter.
      That charging is excessive because the data is not copied into skb
      frags, and it can push the kernel into a memory-pressure state that
      adversely impacts all sockets in the system. Mark pure zerocopy
      skbs with a SKBFL_PURE_ZEROCOPY flag and remove the charge/uncharge
      for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and on
      the zerocopy path. An skb can then change from a pure zerocopy skb
      to a mixed data skb (zerocopy and copy data) if it is at the tail
      of the write queue, there is room available in it, and non-zerocopy
      data is sent in the next sendmsg call. At that point sk_mem_charge
      is done for the pure zerocopied data and the pure zerocopy flag is
      cleared. We found that this happens very rarely on workloads that
      pass MSG_ZEROCOPY.
      
      A pure zerocopy skb could later be coalesced into a normal skb if
      the two are next to each other in the queue, but this patch
      prevents that coalescing from happening. This avoids the complexity
      of charging when an skb downgrades from pure zerocopy to mixed;
      this case is also rare.
      
      In sk_wmem_free_skb, if the skb is pure zerocopy, sk_mem_uncharge
      is called for SKB_TRUESIZE(skb_end_offset(skb)) to undo the
      sk_mem_charge done in tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G
      NICs) with zerocopy showed that before this patch the 'sock'
      variable in memory.stat for cgroup2, which tracks the sum of
      sk_forward_alloc, sk_rmem_alloc and sk_wmem_queued, is around
      1822720, and with this change it is 0. This is because
      sk_forward_alloc is never charged for zerocopy data, and it shows
      that kernel memory utilization is lowered.
      
      With this commit we no longer see the warning seen with the
      previous version of this change, which resulted in the revert in
      commit 84882cf7.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
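
      To illustrate the accounting above, here is a minimal sketch of
      the free-path logic. The helper name skb_zcopy_pure() for the
      SKBFL_PURE_ZEROCOPY test and the sk_wmem_queued_add() accessor are
      assumptions for this sketch; the applied patch may differ in
      detail:

      /* Assumed helper: true when all data in this skb is user pages. */
      static inline bool skb_zcopy_pure(const struct sk_buff *skb)
      {
              return skb_shinfo(skb)->flags & SKBFL_PURE_ZEROCOPY;
      }

      static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
      {
              sk_wmem_queued_add(sk, -skb->truesize);
              if (!skb_zcopy_pure(skb)) {
                      /* Normal skb: uncharge the full truesize. */
                      sk_mem_uncharge(sk, skb->truesize);
              } else {
                      /* Pure zerocopy: only undo the charge made in
                       * tcp_skb_entail() for an skb without data; the
                       * user pages were never charged to the kernel.
                       */
                      sk_mem_uncharge(sk, SKB_TRUESIZE(skb_end_offset(skb)));
              }
              __kfree_skb(skb);
      }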
    • net: add and use skb_unclone_keeptruesize() helper · c4777efa
      Committed by Eric Dumazet
      While commit 097b9146 ("net: fix up truesize of cloned
      skb in skb_prepare_for_shift()") fixed the immediate issues found
      when KFENCE was enabled/tested, there are still similar issues
      when tcp_trim_head() hits KFENCE while the master skb is cloned.
      
      This happens under heavy networking TX workloads,
      when the TX completion might be delayed after incoming ACK.
      
      This patch fixes the WARNING in sk_stream_kill_queues when
      sk->sk_wmem_queued / sk->sk_forward_alloc are not zero.
      
      Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
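
      A sketch of the helper this commit adds, based on the standard
      skb_unclone() pattern; the key point is restoring skb->truesize
      after pskb_expand_head() so the write-queue accounting done against
      the original truesize stays balanced. This is an illustrative
      sketch, not a verbatim copy of the patch:

      /* Unclone the skb but keep skb->truesize unchanged, so that
       * sk_wmem_queued/sk_forward_alloc bookkeeping is not invalidated
       * by the reallocation (KFENCE can hand out heads with unusual
       * allocation sizes).
       */
      static inline int skb_unclone_keeptruesize(struct sk_buff *skb, gfp_t pri)
      {
              might_sleep_if(gfpflags_allow_blocking(pri));

              if (skb_cloned(skb)) {
                      unsigned int save = skb->truesize;
                      int res;

                      res = pskb_expand_head(skb, 0, 0, pri);
                      skb->truesize = save;
                      return res;
              }
              return 0;
      }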
  3. 02 Nov 2021 · 2 commits
    • Revert "net: avoid double accounting for pure zerocopy skbs" · 84882cf7
      Committed by Jakub Kicinski
      This reverts commit f1a456f8.
      
        WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
        CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S                5.15.0-04194-gd852503f7711 #16
        RIP: 0010:skb_try_coalesce+0x78b/0x7e0
        Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
        RSP: 0018:ffff88881f449688 EFLAGS: 00010282
        RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e
        RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0
        RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d
        R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0
        R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0
        FS:  00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         tcp_try_coalesce+0xeb/0x290
         ? tcp_parse_options+0x610/0x610
         ? mark_held_locks+0x79/0xa0
         tcp_queue_rcv+0x69/0x2f0
         tcp_rcv_established+0xa49/0xd40
         ? tcp_data_queue+0x18a0/0x18a0
         tcp_v6_do_rcv+0x1c9/0x880
         ? rt6_mtu_change_route+0x100/0x100
         tcp_v6_rcv+0x1624/0x1830
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Committed by Talal Ahmad
      Track skbs with only zerocopy data and avoid charging them to
      kernel memory, so that memory utilization for msg_zerocopy is
      accounted correctly. All of the data in such skbs is held in user
      pages, which are already accounted to the user. Before this change,
      they were charged again in the kernel in __zerocopy_sg_from_iter.
      That charging is excessive because the data is not copied into skb
      frags, and it can push the kernel into a memory-pressure state that
      adversely impacts all sockets in the system. Mark pure zerocopy
      skbs with a SKBFL_PURE_ZEROCOPY flag and remove the charge/uncharge
      for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and on
      the zerocopy path. An skb can then change from a pure zerocopy skb
      to a mixed data skb (zerocopy and copy data) if it is at the tail
      of the write queue, there is room available in it, and non-zerocopy
      data is sent in the next sendmsg call. At that point sk_mem_charge
      is done for the pure zerocopied data and the pure zerocopy flag is
      cleared. We found that this happens very rarely on workloads that
      pass MSG_ZEROCOPY.
      
      A pure zerocopy skb could later be coalesced into a normal skb if
      the two are next to each other in the queue, but this patch
      prevents that coalescing from happening. This avoids the complexity
      of charging when an skb downgrades from pure zerocopy to mixed;
      this case is also rare.
      
      In sk_wmem_free_skb, if the skb is pure zerocopy, sk_mem_uncharge
      is called for SKB_TRUESIZE(MAX_TCP_HEADER) to undo the
      sk_mem_charge done in tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G
      NICs) with zerocopy showed that before this patch the 'sock'
      variable in memory.stat for cgroup2, which tracks the sum of
      sk_forward_alloc, sk_rmem_alloc and sk_wmem_queued, is around
      1822720, and with this change it is 0. This is because
      sk_forward_alloc is never charged for zerocopy data, and it shows
      that kernel memory utilization is lowered.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 29 Oct 2021 · 1 commit
  5. 23 Oct 2021 · 1 commit
  6. 22 Sep 2021 · 1 commit
  7. 14 Sep 2021 · 1 commit
  8. 03 Sep 2021 · 1 commit
  9. 14 Aug 2021 · 1 commit
  10. 05 Aug 2021 · 1 commit
  11. 03 Aug 2021 · 1 commit
  12. 29 Jul 2021 · 3 commits
  13. 19 Jul 2021 · 1 commit
    • net: Fix zero-copy head len calculation. · a17ad096
      Committed by Pravin B Shelar
      In some cases the skb head can be locked while the entire header
      data has been pulled from the skb. When skb_zerocopy() is called in
      such cases, the following BUG is triggered. This patch fixes it by
      copying the entire skb in such cases.
      This could be optimized later if it turns out to be a performance
      bottleneck.
      
      ---8<---
      kernel BUG at net/core/skbuff.c:2961!
      invalid opcode: 0000 [#1] SMP PTI
      CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           OE     5.4.0-77-generic #86-Ubuntu
      Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:skb_zerocopy+0x37a/0x3a0
      RSP: 0018:ffffbcc70013ca38 EFLAGS: 00010246
      Call Trace:
       <IRQ>
       queue_userspace_packet+0x2af/0x5e0 [openvswitch]
       ovs_dp_upcall+0x3d/0x60 [openvswitch]
       ovs_dp_process_packet+0x125/0x150 [openvswitch]
       ovs_vport_receive+0x77/0xd0 [openvswitch]
       netdev_port_receive+0x87/0x130 [openvswitch]
       netdev_frame_hook+0x4b/0x60 [openvswitch]
       __netif_receive_skb_core+0x2b4/0xc90
       __netif_receive_skb_one_core+0x3f/0xa0
       __netif_receive_skb+0x18/0x60
       process_backlog+0xa9/0x160
       net_rx_action+0x142/0x390
       __do_softirq+0xe1/0x2d6
       irq_exit+0xae/0xb0
       do_IRQ+0x5a/0xf0
       common_interrupt+0xf/0xf
      
      Code that triggered BUG:
      int
      skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
      {
              int i, j = 0;
              int plen = 0; /* length of skb->head fragment */
              int ret;
              struct page *page;
              unsigned int offset;
      
              BUG_ON(!from->head_frag && !hlen);
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
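
      The BUG_ON above fires when the head is locked (!from->head_frag)
      and the computed header length is 0. A sketch of the described
      fix, assuming it lives in the skb_zerocopy_headlen() helper that
      computes hlen for skb_zerocopy() (an assumption based on the
      description, not a verbatim copy of the patch):

      static unsigned int skb_zerocopy_headlen(const struct sk_buff *from)
      {
              unsigned int hlen = 0;

              if (!from->head_frag ||
                  skb_headlen(from) < L1_CACHE_BYTES ||
                  skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS) {
                      hlen = skb_headlen(from);
                      /* Entire header already pulled from a locked head:
                       * fall back to copying the whole skb instead of
                       * tripping BUG_ON(!from->head_frag && !hlen).
                       */
                      if (!hlen)
                              hlen = from->len;
              }

              if (skb_has_frag_list(from))
                      hlen = from->len;

              return hlen;
      }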
  14. 17 Jul 2021 · 1 commit
  15. 07 Jul 2021 · 1 commit
  16. 30 Jun 2021 · 1 commit
  17. 11 Jun 2021 · 1 commit
  18. 08 Jun 2021 · 2 commits
    • page_pool: Allow drivers to hint on SKB recycling · 6a5bcd84
      Committed by Ilias Apalodimas
      Up to now, several high-speed NICs have had custom mechanisms for
      recycling the memory they allocate for their payloads.
      Our page_pool API already has recycling capabilities that are
      always used when running in 'XDP mode'. So let's tweak the API and
      the kernel network stack slightly and allow the recycling to happen
      during standard operation as well.
      The API doesn't currently take into account the 'split page'
      policies used by those drivers, but it can be extended once we have
      users for that.
      
      The idea is to be able to intercept the packet at
      skb_release_data(). If it's a buffer coming from our page_pool API,
      recycle it back to the pool for further use; otherwise just release
      the packet entirely.
      
      To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
      a field in struct page (page->pp) to store the page_pool pointer.
      Storing the information in page->pp allows us to recycle both SKBs and
      their fragments.
      We could have skipped the skb bit entirely, since identical
      information can be derived from struct page. However, in an effort
      to affect the free path as little as possible, reading a single bit
      in the skb, which is already in cache, is better than trying to
      derive the same information from the page-stored data.
      
      The driver or page_pool has to take care of the sync operations on
      its own during buffer recycling, since the buffer is never unmapped
      after opting in to recycling.
      
      Since the gain in the drivers depends on the architecture, we are
      not enabling recycling by default when the page_pool API is used by
      a driver. To enable recycling, the driver must call
      skb_mark_for_recycle() to store the information we need for
      recycling in page->pp and set the recycling bit, or
      page_pool_store_mem_info() for a fragment.
      Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Co-developed-by: Matteo Croce <mcroce@microsoft.com>
      Signed-off-by: Matteo Croce <mcroce@microsoft.com>
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
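
      A minimal sketch of the opt-in helpers named above, assuming the
      signatures implied by the description (the applied patch may
      differ):

      /* Record which pool owns this page so skb_release_data() can
       * return it to the pool instead of freeing it.
       */
      static inline void page_pool_store_mem_info(struct page *page,
                                                  struct page_pool *pp)
      {
              page->pp = pp;
      }

      /* Driver opt-in: flag the skb so the free path checks page->pp. */
      static inline void skb_mark_for_recycle(struct sk_buff *skb,
                                              struct page *page,
                                              struct page_pool *pp)
      {
              skb->pp_recycle = 1;
              page_pool_store_mem_info(page, pp);
      }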
    • skbuff: add a parameter to __skb_frag_unref · c420c989
      Committed by Matteo Croce
      This is a prerequisite patch; the next one enables recycling of
      skbs and fragments. Add an extra argument to __skb_frag_unref() to
      handle recycling, and update the current users of the function
      accordingly.
      Signed-off-by: Matteo Croce <mcroce@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
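
      A sketch of what the extra argument could look like; the
      recycle-path helper name used here (page_pool_return_skb_page) is
      a hypothetical placeholder, since this patch only threads the flag
      through for the follow-up commit to use:

      static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
      {
              struct page *page = skb_frag_page(frag);

              /* Hypothetical hook used by the follow-up patch: try to
               * hand the page back to its page_pool before falling back
               * to a plain put_page().
               */
              if (recycle && page_pool_return_skb_page(page))
                      return;
              put_page(page);
      }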
  19. 15 Apr 2021 · 1 commit
    • skbuff: revert "skbuff: remove some unnecessary operation in skb_segment_list()" · 17c3df70
      Committed by Paolo Abeni
      Commit 1ddc3229 ("skbuff: remove some unnecessary operation
      in skb_segment_list()") introduces an issue very similar to the
      one already fixed by commit 53475c5d ("net: fix use-after-free when
      UDP GRO with shared fraglist").
      
      If the GSO skb goes through skb_clone() and pskb_expand_head()
      before entering skb_segment_list(), the latter will unshare the
      frag_list skbs and will release the old list. With the reverted
      commit in place, when skb_segment_list() completes, skb->next
      points to the just-released list, and later on the kernel will hit
      a use-after-free.
      
      Note that since commit e0e3070a ("udp: properly complete L4 GRO
      over UDP tunnel packet"), the critical scenario can also be
      reproduced when receiving UDP-over-vxlan traffic with:
      
      NIC (NETIF_F_GRO_FRAGLIST enabled) -> vxlan -> UDP sink
      
      Attaching a packet socket to the NIC will cause skb_clone(), and
      the tunnel decapsulation will call pskb_expand_head().
      
      Fixes: 1ddc3229 ("skbuff: remove some unnecessary operation in skb_segment_list()")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 02 Apr 2021 · 1 commit
  21. 11 Mar 2021 · 1 commit
    • skbuff: remove some unnecessary operation in skb_segment_list() · 1ddc3229
      Committed by Yunsheng Lin
      The GRO list uses skb_shinfo(skb)->frag_list to link two skbs
      together, and NAPI_GRO_CB(p)->last->next is used when there are
      more skbs; see skb_gro_receive_list(). GSO expects each segmented
      skb to be linked together using skb->next, so only the first
      skb->next needs to be set to skb_shinfo(skb)->frag_list when doing
      GSO list segmentation.
      
      For the same reason, nskb->next does not need to be set to list_skb
      before jumping to the error handling, because nskb->next already
      points to list_skb.
      
      And since nskb is also the last skb at the end of the loop, remove
      the tail variable and use nskb instead.
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 06 Mar 2021 · 1 commit
  23. 02 Mar 2021 · 1 commit
    • net: expand textsearch ts_state to fit skb_seq_state · b228c9b0
      Committed by Willem de Bruijn
      The referenced commit expands the skb_seq_state used by
      skb_find_text with a 4B frag_off field, growing it to 48B.
      
      This exceeds the container ts_state->cb, causing stack corruption:
      
      [   73.238353] Kernel panic - not syncing: stack-protector: Kernel stack
      is corrupted in: skb_find_text+0xc5/0xd0
      [   73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4
      [   73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.14.0-2 04/01/2014
      [   73.260078] Call Trace:
      [   73.264677]  dump_stack+0x57/0x6a
      [   73.267866]  panic+0xf6/0x2b7
      [   73.270578]  ? skb_find_text+0xc5/0xd0
      [   73.273964]  __stack_chk_fail+0x10/0x10
      [   73.277491]  skb_find_text+0xc5/0xd0
      [   73.280727]  string_mt+0x1f/0x30
      [   73.283639]  ipt_do_table+0x214/0x410
      
      The struct is passed between skb_find_text and its callbacks
      skb_prepare_seq_read, skb_seq_read and skb_abort_seq_read through
      the textsearch interface using TS_SKB_CB.
      
      I assumed that this mapped to skb->cb like other .._SKB_CB wrappers.
      skb->cb is 48B. But it maps to ts_state->cb, which is only 40B.
      
      skb->cb was increased from 40B to 48B after ts_state was introduced,
      in commit 3e3850e9 ("[NETFILTER]: Fix xfrm lookup in
      ip_route_me_harder/ip6_route_me_harder").
      
      Increase ts_state.cb[] to 48 bytes to fit the struct.
      
      Also add a BUILD_BUG_ON to avoid a repeat.
      
      The alternative is to directly add a dependency from textsearch
      onto linux/skbuff.h, but I think the intent is for textsearch to
      have no such dependencies on its callers.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911
      Fixes: 97550f6f ("net: compound page support in skb_seq_read")
      Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
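
      A sketch of the fix described above; the exact placement of the
      BUILD_BUG_ON inside skb_find_text() is an assumption:

      /* include/linux/textsearch.h */
      struct ts_state {
              unsigned int offset;
              char cb[48];    /* was 40; must hold struct skb_seq_state */
      };

      /* net/core/skbuff.c */
      unsigned int skb_find_text(struct sk_buff *skb, unsigned int from,
                                 unsigned int to, struct ts_config *config)
      {
              struct ts_state state;

              /* Catch any future growth of skb_seq_state at build time. */
              BUILD_BUG_ON(sizeof(struct skb_seq_state) > sizeof(state.cb));

              /* ... prepare the seq read and run the textsearch ... */
      }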
  24. 14 Feb 2021 · 11 commits