1. 02 Jun, 2021 1 commit
    • virtio-net: fix for unable to handle page fault for address · 5c37711d
      Xuan Zhuo committed
      In mergeable mode, when XDP is enabled, if the headroom of buf is smaller
      than virtnet_get_headroom(), xdp_linearize_page() will be called but the
      "headroom" variable is still 0, which leads to wrong logic after
      entering page_to_skb().
      
      [   16.600944] BUG: unable to handle page fault for address: ffffecbfff7b43c8
      [   16.602175] #PF: supervisor read access in kernel mode
      [   16.603350] #PF: error_code(0x0000) - not-present page
      [   16.604200] PGD 0 P4D 0
      [   16.604686] Oops: 0000 [#1] SMP PTI
      [   16.605306] CPU: 4 PID: 715 Comm: sh Tainted: G    B             5.12.0+ #312
      [   16.606429] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/04
      [   16.608217] RIP: 0010:unmap_page_range+0x947/0xde0
      [   16.609014] Code: 00 00 08 00 48 83 f8 01 45 19 e4 41 f7 d4 41 83 e4 03 e9 a4 fd ff ff e8 b7 63 ed ff 4c 89 e0 48 c1 e0 065
      [   16.611863] RSP: 0018:ffffc90002503c58 EFLAGS: 00010286
      [   16.612720] RAX: ffffecbfff7b43c0 RBX: 00007f19f7203000 RCX: ffffffff812ff359
      [   16.613853] RDX: ffff888107778000 RSI: 0000000000000000 RDI: 0000000000000005
      [   16.614976] RBP: ffffea000425e000 R08: 0000000000000000 R09: 3030303030303030
      [   16.616124] R10: ffffffff82ed7d94 R11: 6637303030302052 R12: 7c00000afffded0f
      [   16.617276] R13: 0000000000000001 R14: ffff888119ee7010 R15: 00007f19f7202000
      [   16.618423] FS:  0000000000000000(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
      [   16.619738] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   16.620670] CR2: ffffecbfff7b43c8 CR3: 0000000103220005 CR4: 0000000000370ee0
      [   16.621792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   16.622920] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   16.624047] Call Trace:
      [   16.624525]  ? release_pages+0x24d/0x730
      [   16.625209]  unmap_single_vma+0xa9/0x130
      [   16.625885]  unmap_vmas+0x76/0xf0
      [   16.626480]  exit_mmap+0xa0/0x210
      [   16.627129]  mmput+0x67/0x180
      [   16.627673]  do_exit+0x3d1/0xf10
      [   16.628259]  ? do_user_addr_fault+0x231/0x840
      [   16.629000]  do_group_exit+0x53/0xd0
      [   16.629631]  __x64_sys_exit_group+0x1d/0x20
      [   16.630354]  do_syscall_64+0x3c/0x80
      [   16.630988]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   16.631828] RIP: 0033:0x7f1a043d0191
      [   16.632464] Code: Unable to access opcode bytes at RIP 0x7f1a043d0167.
      [   16.633502] RSP: 002b:00007ffe3d993308 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      [   16.634737] RAX: ffffffffffffffda RBX: 00007f1a044c9490 RCX: 00007f1a043d0191
      [   16.635857] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
      [   16.636986] RBP: 0000000000000000 R08: ffffffffffffff88 R09: 0000000000000001
      [   16.638120] R10: 0000000000000008 R11: 0000000000000246 R12: 00007f1a044c9490
      [   16.639245] R13: 0000000000000001 R14: 00007f1a044c9968 R15: 0000000000000000
      [   16.640408] Modules linked in:
      [   16.640958] CR2: ffffecbfff7b43c8
      [   16.641557] ---[ end trace bc4891c6ce46354c ]---
      [   16.642335] RIP: 0010:unmap_page_range+0x947/0xde0
      [   16.643135] Code: 00 00 08 00 48 83 f8 01 45 19 e4 41 f7 d4 41 83 e4 03 e9 a4 fd ff ff e8 b7 63 ed ff 4c 89 e0 48 c1 e0 065
      [   16.645983] RSP: 0018:ffffc90002503c58 EFLAGS: 00010286
      [   16.646845] RAX: ffffecbfff7b43c0 RBX: 00007f19f7203000 RCX: ffffffff812ff359
      [   16.647970] RDX: ffff888107778000 RSI: 0000000000000000 RDI: 0000000000000005
      [   16.649091] RBP: ffffea000425e000 R08: 0000000000000000 R09: 3030303030303030
      [   16.650250] R10: ffffffff82ed7d94 R11: 6637303030302052 R12: 7c00000afffded0f
      [   16.651394] R13: 0000000000000001 R14: ffff888119ee7010 R15: 00007f19f7202000
      [   16.652529] FS:  0000000000000000(0000) GS:ffff88842fd00000(0000) knlGS:0000000000000000
      [   16.653887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   16.654841] CR2: ffffecbfff7b43c8 CR3: 0000000103220005 CR4: 0000000000370ee0
      [   16.655992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   16.657150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   16.658290] Kernel panic - not syncing: Fatal exception
      [   16.659613] Kernel Offset: disabled
      [   16.660234] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Fixes: fb32856b ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 03 May, 2021 1 commit
  3. 24 Apr, 2021 1 commit
    • virtio-net: fix use-after-free in skb_gro_receive · f80bd740
      Xuan Zhuo committed
      When "headroom" > 0, the actual allocated memory space is the entire
      page, so the address of the page should be used when passing it to
      build_skb().
      
      BUG: KASAN: use-after-free in skb_gro_receive (net/core/skbuff.c:4260)
      Write of size 16 at addr ffff88811619fffc by task kworker/u9:0/534
      CPU: 2 PID: 534 Comm: kworker/u9:0 Not tainted 5.12.0-rc7-custom-16372-gb150be05b806 #3382
      Hardware name: QEMU MSN2700, BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Workqueue: xprtiod xs_stream_data_receive_workfn [sunrpc]
      Call Trace:
       <IRQ>
      dump_stack (lib/dump_stack.c:122)
      print_address_description.constprop.0 (mm/kasan/report.c:233)
      kasan_report.cold (mm/kasan/report.c:400 mm/kasan/report.c:416)
      skb_gro_receive (net/core/skbuff.c:4260)
      tcp_gro_receive (net/ipv4/tcp_offload.c:266 (discriminator 1))
      tcp4_gro_receive (net/ipv4/tcp_offload.c:316)
      inet_gro_receive (net/ipv4/af_inet.c:1545 (discriminator 2))
      dev_gro_receive (net/core/dev.c:6075)
      napi_gro_receive (net/core/dev.c:6168 net/core/dev.c:6198)
      receive_buf (drivers/net/virtio_net.c:1151) virtio_net
      virtnet_poll (drivers/net/virtio_net.c:1415 drivers/net/virtio_net.c:1519) virtio_net
      __napi_poll (net/core/dev.c:6964)
      net_rx_action (net/core/dev.c:7033 net/core/dev.c:7118)
      __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:346)
      irq_exit_rcu (kernel/softirq.c:221 kernel/softirq.c:422 kernel/softirq.c:434)
      common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 14))
      </IRQ>
      
      Fixes: fb32856b ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Reported-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 21 Apr, 2021 2 commits
    • virtio-net: fix use-after-free in page_to_skb() · af39c8f7
      Eric Dumazet committed
      KASAN/syzbot had 4 reports, one of them being:
      
      BUG: KASAN: slab-out-of-bounds in memcpy include/linux/fortify-string.h:191 [inline]
      BUG: KASAN: slab-out-of-bounds in page_to_skb+0x5cf/0xb70 drivers/net/virtio_net.c:480
      Read of size 12 at addr ffff888014a5f800 by task systemd-udevd/8445
      
      CPU: 0 PID: 8445 Comm: systemd-udevd Not tainted 5.12.0-rc8-next-20210419-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x141/0x1d7 lib/dump_stack.c:120
       print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:233
       __kasan_report mm/kasan/report.c:419 [inline]
       kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:436
       check_region_inline mm/kasan/generic.c:180 [inline]
       kasan_check_range+0x13d/0x180 mm/kasan/generic.c:186
       memcpy+0x20/0x60 mm/kasan/shadow.c:65
       memcpy include/linux/fortify-string.h:191 [inline]
       page_to_skb+0x5cf/0xb70 drivers/net/virtio_net.c:480
       receive_mergeable drivers/net/virtio_net.c:1009 [inline]
       receive_buf+0x2bc0/0x6250 drivers/net/virtio_net.c:1119
       virtnet_receive drivers/net/virtio_net.c:1411 [inline]
       virtnet_poll+0x568/0x10b0 drivers/net/virtio_net.c:1516
       __napi_poll+0xaf/0x440 net/core/dev.c:6962
       napi_poll net/core/dev.c:7029 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7116
       __do_softirq+0x29b/0x9fe kernel/softirq.c:559
       invoke_softirq kernel/softirq.c:433 [inline]
       __irq_exit_rcu+0x136/0x200 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       common_interrupt+0xa4/0xd0 arch/x86/kernel/irq.c:240
      
      Fixes: fb32856b ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: restrict build_skb() use to some arches · f5d7872a
      Eric Dumazet committed
      build_skb() is supposed to be followed by
      skb_reserve(skb, NET_IP_ALIGN), so that IP headers are word-aligned.
      (Best practice is to reserve NET_IP_ALIGN+NET_SKB_PAD, but the NET_SKB_PAD
      part is only a performance optimization if tunnel encaps are added.)
      
      Unfortunately, virtio_net has not provisioned this reserve.
      We can only use build_skb() for arches where NET_IP_ALIGN == 0.
      
      We might refine this later, with enough testing.
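
The alignment argument above can be checked with a little arithmetic. The following is a minimal user-space sketch, not driver code: it assumes the buffer start is 4-byte aligned and uses the 14-byte Ethernet header length; the function names are hypothetical. With no reserve the IP header starts at offset 14, which is not word-aligned; a 2-byte reserve (the usual non-zero NET_IP_ALIGN value) moves it to offset 16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ETH_HLEN 14 /* length of an Ethernet header in bytes */

/* Offset of the IP header from a 4-byte-aligned buffer start when
 * `reserve` bytes are skipped before the Ethernet header.
 * (Hypothetical helper, for illustration only.) */
size_t ip_header_offset(size_t reserve)
{
    return reserve + ETH_HLEN;
}

/* True when the IP header lands on a 4-byte boundary. */
bool ip_header_is_word_aligned(size_t reserve)
{
    return ip_header_offset(reserve) % 4 == 0;
}

/* Small usage example covering the two cases discussed above. */
void alignment_demo(void)
{
    assert(!ip_header_is_word_aligned(0)); /* no reserve: offset 14 */
    assert(ip_header_is_word_aligned(2));  /* 2-byte reserve: offset 16 */
}
```

On architectures that define NET_IP_ALIGN as 0, unaligned access is considered cheap enough that the missing reserve does not matter, which is why the commit limits build_skb() to those arches.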
      
      Fixes: fb32856b ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 17 Apr, 2021 1 commit
    • virtio-net: page_to_skb() use build_skb when there's sufficient tailroom · fb32856b
      Xuan Zhuo committed
      In page_to_skb(), if we have enough tailroom to hold skb_shared_info, we
      can use build_skb to create the skb directly, with no need to allocate
      additional space. It also saves a 'frags slot', which is very friendly
      to GRO.
      
      Here, if the payload of the received packet is too small (less than
      GOOD_COPY_LEN), we still choose to copy it directly to the space obtained
      from napi_alloc_skb, so that we can reuse these pages.
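
The decision described above can be sketched as a small stand-alone function. This is an illustration under stated assumptions, not the driver's code: the enum, the function name and SHINFO_SIZE_ASSUMED are hypothetical, the 128-byte GOOD_COPY_LEN is the value quoted in commit 0f6925b3 on this page, and the third branch (attaching the page as a fragment when tailroom is short) is an assumption about the pre-existing path.

```c
#include <assert.h>
#include <stddef.h>

#define GOOD_COPY_LEN 128       /* value quoted in commit 0f6925b3 */
#define SHINFO_SIZE_ASSUMED 320 /* hypothetical aligned size of skb_shared_info */

enum rx_strategy {
    RX_COPY_SMALL,     /* copy the payload into a freshly allocated skb */
    RX_BUILD_IN_PLACE, /* wrap the receive buffer itself (build_skb style) */
    RX_ATTACH_AS_FRAG  /* assumed fallback: allocate an skb, attach the page */
};

/* Pick a strategy from the packet length and the free space after the data.
 * Treating len == GOOD_COPY_LEN as "small" is an assumption. */
enum rx_strategy choose_rx_strategy(size_t len, size_t tailroom)
{
    if (len <= GOOD_COPY_LEN)
        return RX_COPY_SMALL;      /* small packet: copy so the page is reusable */
    if (tailroom >= SHINFO_SIZE_ASSUMED)
        return RX_BUILD_IN_PLACE;  /* room for the shared info behind the data */
    return RX_ATTACH_AS_FRAG;
}

/* Usage example: a 1000-byte packet with ample tailroom is built in place. */
void rx_strategy_demo(void)
{
    assert(choose_rx_strategy(1000, 2048) == RX_BUILD_IN_PLACE);
}
```

Building in place avoids both the extra allocation and the frag slot, which matches the motivation given in the commit message.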
      
      Testing Machine:
          The four queues of the network card are bound to the cpu1.
      
      Test command:
          for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& done
      
      The size of the UDP packets is 1000, so with this patch there
      will always be enough tailroom to use build_skb. The sent UDP packets
      are discarded because no port is receiving them. With softirq load
      on the machine at 100%, we observe the receive rate reported by
      sar -n DEV 1:
      
      no build_skb:  956864.00 rxpck/s
      build_skb:    1158465.00 rxpck/s
      
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Suggested-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 07 Apr, 2021 1 commit
    • virtio_net: Do not pull payload in skb->head · 0f6925b3
      Eric Dumazet committed
      Xuan Zhuo reported that commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") brought a ~10% performance drop.
      
      The reason for the performance drop was that GRO was forced
      to chain sk_buff (using skb_shinfo(skb)->frag_list), which
      uses more memory but also causes packet consumers to incur
      a lot of overhead handling all the tiny skbs.
      
      It turns out that virtio_net page_to_skb() has a wrong strategy:
      It allocates skbs with GOOD_COPY_LEN (128) bytes in skb->head, then
      copies 128 bytes from the page, before feeding the packet to GRO stack.
      
      This was suboptimal before commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") because GRO was using 2 frags per MSS,
      meaning we were not packing MSS with 100% efficiency.
      
      Fix is to pull only the ethernet header in page_to_skb().
      
      Then, we change virtio_net_hdr_to_skb() to pull the missing
      headers, instead of assuming they were already pulled by callers.
      
      This fixes the performance regression, but could also allow virtio_net
      to accept packets with more than 128 bytes of headers.
      
      Many thanks to Xuan Zhuo for his report, and his tests/help.
      
      Fixes: 3226b158 ("net: avoid 32 x truesize under-estimation for tiny skbs")
      Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Link: https://www.spinics.net/lists/netdev/msg731397.html
      Co-Developed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 19 Mar, 2021 1 commit
  8. 18 Mar, 2021 2 commits
  9. 11 Mar, 2021 1 commit
  10. 25 Feb, 2021 1 commit
  11. 23 Feb, 2021 1 commit
  12. 09 Jan, 2021 2 commits
  13. 24 Dec, 2020 1 commit
  14. 19 Dec, 2020 1 commit
  15. 01 Dec, 2020 1 commit
  16. 22 Oct, 2020 1 commit
    • Revert "virtio-net: ethtool configurable RXCSUM" · cf8691cb
      Michael S. Tsirkin committed
      This reverts commit 3618ad2a.
      
      When control vq is not negotiated, that commit causes a crash:
      
      [   72.229171] kernel BUG at drivers/net/virtio_net.c:1667!
      [   72.230266] invalid opcode: 0000 [#1] PREEMPT SMP
      [   72.231172] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc8-02934-g3618ad2a #1
      [   72.231172] EIP: virtnet_send_command+0x120/0x140
      [   72.231172] Code: 00 0f 94 c0 8b 7d f0 65 33 3d 14 00 00 00 75 1c 8d 65 f4 5b 5e 5f 5d c3 66 90 be 01 00 00 00 e9 6e ff ff ff 8d b6 00
      +00 00 00 <0f> 0b e8 d9 bb 82 00 eb 17 8d b4 26 00 00 00 00 8d b4 26 00 00 00
      [   72.231172] EAX: 0000000d EBX: f72895c0 ECX: 00000017 EDX: 00000011
      [   72.231172] ESI: f7197800 EDI: ed69bd00 EBP: ed69bcf4 ESP: ed69bc98
      [   72.231172] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
      [   72.231172] CR0: 80050033 CR2: 00000000 CR3: 02c84000 CR4: 000406f0
      [   72.231172] Call Trace:
      [   72.231172]  ? __virt_addr_valid+0x45/0x60
      [   72.231172]  ? ___cache_free+0x51f/0x760
      [   72.231172]  ? kobject_uevent_env+0xf4/0x560
      [   72.231172]  virtnet_set_guest_offloads+0x4d/0x80
      [   72.231172]  virtnet_set_features+0x85/0x120
      [   72.231172]  ? virtnet_set_guest_offloads+0x80/0x80
      [   72.231172]  __netdev_update_features+0x27a/0x8e0
      [   72.231172]  ? kobject_uevent+0xa/0x20
      [   72.231172]  ? netdev_register_kobject+0x12c/0x160
      [   72.231172]  register_netdevice+0x4fe/0x740
      [   72.231172]  register_netdev+0x1c/0x40
      [   72.231172]  virtnet_probe+0x728/0xb60
      [   72.231172]  ? _raw_spin_unlock+0x1d/0x40
      [   72.231172]  ? virtio_vdpa_get_status+0x1c/0x20
      [   72.231172]  virtio_dev_probe+0x1c6/0x271
      [   72.231172]  really_probe+0x195/0x2e0
      [   72.231172]  driver_probe_device+0x26/0x60
      [   72.231172]  device_driver_attach+0x49/0x60
      [   72.231172]  __driver_attach+0x46/0xc0
      [   72.231172]  ? device_driver_attach+0x60/0x60
      [   72.231172]  bus_add_driver+0x197/0x1c0
      [   72.231172]  driver_register+0x66/0xc0
      [   72.231172]  register_virtio_driver+0x1b/0x40
      [   72.231172]  virtio_net_driver_init+0x61/0x86
      [   72.231172]  ? veth_init+0x14/0x14
      [   72.231172]  do_one_initcall+0x76/0x2e4
      [   72.231172]  ? rdinit_setup+0x2a/0x2a
      [   72.231172]  do_initcalls+0xb2/0xd5
      [   72.231172]  kernel_init_freeable+0x14f/0x179
      [   72.231172]  ? rest_init+0x100/0x100
      [   72.231172]  kernel_init+0xd/0xe0
      [   72.231172]  ret_from_fork+0x1c/0x30
      [   72.231172] Modules linked in:
      [   72.269563] ---[ end trace a6ebc4afea0e6cb1 ]---
      
      The reason is that virtnet_set_features now calls virtnet_set_guest_offloads
      unconditionally; it used to call it only when there was something
      to configure.
      
      If the device does not have a control vq, everything breaks.
      
      Revert the original commit for now.
      
      Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Fixes: 3618ad2a ("virtio-net: ethtool configurable RXCSUM")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20201021142944.13615-1-mst@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  17. 14 Oct, 2020 1 commit
  18. 30 Sep, 2020 1 commit
  19. 11 Sep, 2020 1 commit
    • net: remove napi_hash_del() from driver-facing API · 5198d545
      Jakub Kicinski committed
      We allow drivers to call napi_hash_del() before calling
      netif_napi_del() to batch RCU grace periods. This makes
      the API asymmetric and leaks internal implementation details.
      Soon we will want the grace period to protect more than just
      the NAPI hash table.
      
      Restructure the API and have drivers call a new function -
      __netif_napi_del() if they want to take care of RCU waits.
      
      Note that only core was checking the return status from
      napi_hash_del() so the new helper does not report if the
      NAPI was actually deleted.
      
      Some notes on driver oddness:
       - veth observed the grace period before calling netif_napi_del()
         but that should not matter
       - myri10ge observed normal RCU flavor
       - bnx2x and enic did not actually observe the grace period
         (unless they did so implicitly)
       - virtio_net and enic only unhashed Rx NAPIs
      
      The last two points seem to indicate that the calls to
      napi_hash_del() were a leftover rather than an optimization.
      Regardless, it's easy enough to correct them.
      
      This patch may introduce extra synchronize_net() calls for
      interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
      free_netdev() to call netif_napi_del(). This seems inevitable
      since we want to use RCU for netpoll dev->napi_list traversal,
      and almost no drivers set IFF_DISABLE_NETPOLL.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 24 Aug, 2020 1 commit
  21. 05 Aug, 2020 1 commit
  22. 26 Jul, 2020 1 commit
  23. 02 Jun, 2020 1 commit
  24. 15 May, 2020 1 commit
    • virtio_net: Add XDP frame size in two code paths · 9ce6146e
      Jesper Dangaard Brouer committed
      The virtio_net driver runs inside the guest OS. There are two
      XDP receive code paths in virtio_net, namely receive_small() and
      receive_mergeable(). The receive_big() function does not support XDP.
      
      In receive_small() the frame size is available in buflen. The buffer
      backing these frames is allocated in add_recvbuf_small() with the same
      size, except for the headroom, but the tailroom has room reserved for
      skb_shared_info. The headroom is encoded in the ctx pointer as a value.
      
      In receive_mergeable() the frame size is more dynamic. There are two
      basic cases: (1) the buffer size is based on an exponentially weighted
      moving average (see DECLARE_EWMA) of the packet length, or (2) if
      virtnet_get_headroom() returns any headroom, the buffer size is
      PAGE_SIZE. This time the ctx pointer encodes two values:
      the buffer length "truesize" and the headroom. In case (1), if the rx
      buffer size is underestimated, the packet will have been split over
      several buffers (the num_buf info is in virtio_net_hdr_mrg_rxbuf, placed
      at the top of the buffer area). If that happens, the XDP path performs an
      xdp_linearize_page operation.
      
      V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.
      
      The code is really hard to follow, so here are some hints for reviewers.
      The receive_mergeable() case gets frames that were allocated in
      add_recvbuf_mergeable(), which uses headroom=virtnet_get_headroom(),
      and the 'buf' pointer is advanced by this headroom.  The headroom can
      only be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
      simple:
      
        static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
        {
      	return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
        }
      
      As frame_sz is an offset size from xdp.data_hard_start, reviewers
      should notice how this is calculated in receive_mergeable():
      
        int offset = buf - page_address(page);
        [...]
        data = page_address(xdp_page) + offset;
        xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
      
      The calculated offset will always be VIRTIO_XDP_HEADROOM when
      reaching this code.  Thus, xdp.data_hard_start will be the page-start
      address plus vi->hdr_len.  Given this, xdp.frame_sz needs to be
      reduced by vi->hdr_len.
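
The arithmetic in this paragraph can be replayed with plain offsets. The sketch below is a user-space illustration, not driver code: the helper names are hypothetical, and the 4096-byte page size, 256-byte XDP headroom and 12-byte header length used in the example are assumptions chosen only to make the numbers concrete.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096u          /* assumption: 4 KiB pages */
#define VIRTIO_XDP_HEADROOM_ASSUMED 256u /* assumption for this example */

/* Offset of xdp.data_hard_start from the start of the page, mirroring
 * "data - VIRTIO_XDP_HEADROOM + vi->hdr_len" with data = page + buf_offset. */
size_t hard_start_offset(size_t buf_offset, size_t hdr_len)
{
    return buf_offset - VIRTIO_XDP_HEADROOM_ASSUMED + hdr_len;
}

/* Frame size measured from data_hard_start to the end of the page. */
size_t xdp_frame_size(size_t hdr_len)
{
    return PAGE_SIZE_ASSUMED - hdr_len;
}

/* Usage example: with buf_offset equal to the headroom, the hard start
 * sits hdr_len bytes into the page and the frame ends at the page end. */
void frame_size_demo(void)
{
    size_t hdr_len = 12; /* assumed header length */
    size_t start = hard_start_offset(VIRTIO_XDP_HEADROOM_ASSUMED, hdr_len);

    assert(start == hdr_len);
    assert(start + xdp_frame_size(hdr_len) == PAGE_SIZE_ASSUMED);
}
```

The point of the reduction is that frame_sz is counted from data_hard_start, so leaving it at the full page size would describe memory past the end of the page.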
      
      IMHO a followup patch should cleanup this code to make it easier
      to maintain and understand, but it is outside the scope of this
      patchset.
      
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/bpf/158945344436.97035.9445115070189151680.stgit@firesoul
  25. 07 May, 2020 1 commit
    • virtio_net: fix lockdep warning on 32 bit · 01c32598
      Michael S. Tsirkin committed
      When we fill up a receive VQ, try_fill_recv currently tries to count
      kicks using a 64 bit stats counter. It turns out that on a 32 bit kernel
      this uses a seqcount. Sequence counts are "lock" constructs where you
      need to make sure that writers are serialized.
      
      In turn, this means that we mustn't run two try_fill_recv concurrently.
      Which of course we don't. We do run try_fill_recv sometimes from a
      softirq napi context, and sometimes from a fully preemptible context,
      but the latter always runs with napi disabled.
      
      However, when it comes to the seqcount, lockdep is trying to enforce the
      rule that the same lock isn't accessed from preemptible and softirq
      context - it doesn't know about napi being enabled/disabled. This causes
      a false-positive warning:
      
      WARNING: inconsistent lock state
      ...
      inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      
      As a workaround, silence the warning by switching
      to u64_stats_update_begin_irqsave, which disables
      interrupts on 32 bit only and is a NOP on 64 bit.
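
To see why a 64-bit counter on a 32-bit machine needs serialized writers, here is a simplified user-space sketch of the sequence-count idea. It is not the kernel's u64_stats implementation: the struct and function names are hypothetical, and the plain accesses to the two halves are a simplification that is only safe with a single writer and no concurrent reader, as exercised here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit counter stored as two 32-bit halves, guarded by a sequence
 * number that is odd while an update is in progress. */
struct stat64 {
    atomic_uint seq;
    uint32_t lo, hi;
};

/* Writer side: must never run concurrently with another writer,
 * otherwise the sequence number no longer brackets a single update. */
void stat64_add(struct stat64 *s, uint64_t delta)
{
    uint64_t v = (((uint64_t)s->hi << 32) | s->lo) + delta;

    atomic_fetch_add(&s->seq, 1); /* begin: seq becomes odd */
    s->lo = (uint32_t)v;
    s->hi = (uint32_t)(v >> 32);
    atomic_fetch_add(&s->seq, 1); /* end: seq becomes even again */
}

/* Reader side: retry if an update was in progress or completed meanwhile. */
uint64_t stat64_read(struct stat64 *s)
{
    unsigned int start;
    uint64_t v;

    do {
        start = atomic_load(&s->seq);
        v = ((uint64_t)s->hi << 32) | s->lo;
    } while ((start & 1u) || atomic_load(&s->seq) != start);
    return v;
}

/* Usage example: a carry from the low half into the high half. */
void stat64_demo(void)
{
    struct stat64 s = { 0 };

    stat64_add(&s, 0xFFFFFFFFu);
    stat64_add(&s, 1);
    assert(stat64_read(&s) == 0x100000000ULL);
}
```

If a second writer could interrupt the first between the two increments, the sequence number would look even while the halves are inconsistent, which is the situation lockdep is guarding against.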
      
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 06 Mar, 2020 1 commit
  27. 01 Mar, 2020 1 commit
    • net/ethtool: Introduce link_ksettings API for virtual network devices · 9aedc6e2
      Cris Forno committed
      With the ethtool_virtdev_set_link_ksettings function in core/ethtool.c,
      ibmveth, netvsc, and virtio now use the core's helper function.
      
      Functionality changes that pertain to the ibmveth driver include:
      
        1. Changed the initial hardcoded link speed to 1 Gb/s.
      
        2. Added support for allowing a user to change the reported link
        speed via ethtool.
      
      Functionality changes to the netvsc driver include:
      
        1. When netvsc_get_link_ksettings is called, it will defer to the VF
        device if it exists to pull accelerated networking values, otherwise
        pull default or user-defined values.
      
        2. Similarly, if netvsc_set_link_ksettings is called and a VF device
        exists, the real values of speed and duplex are changed.
      
      Signed-off-by: Cris Forno <cforno12@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 26 Feb, 2020 2 commits
  29. 27 Jan, 2020 1 commit
  30. 17 Jan, 2020 1 commit
    • xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths · 1d233886
      Toke Høiland-Jørgensen committed
      Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
      we can re-use the bulking for the non-map version of the bpf_redirect()
      helper. This is a simple matter of having xdp_do_redirect_slow() queue the
      frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
      
      Unfortunately we can't make the bpf_redirect() helper return an error if
      the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
      have a reference to the network namespace of the ingress device at the time
      the helper is called. So we have to leave it as-is and keep the device
      lookup in xdp_do_redirect_slow().
      
      Since this leaves less reason to have the non-map redirect code in a
      separate function, we get rid of the xdp_do_redirect_slow() function
      entirely. This does lose us the tracepoint disambiguation, but fortunately
      the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
      entry structures. This means both can contain a map index, so we can just
      amend the tracepoint definitions so we always emit the xdp_redirect(_err)
      tracepoints, but with the map ID only populated if a map is present. This
      means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
      the definitions around in case someone is still listening for them.
      
      With this change, the performance of the xdp_redirect sample program goes
      from 5Mpps to 8.4Mpps (a 68% increase).
      
      Since the flush functions are no longer map-specific, rename the flush()
      functions to drop _map from their names. One of the renamed functions is
      the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
      keep from having to update all drivers, use a #define to keep the old name
      working, and only update the virtual drivers in this patch.
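
The compatibility trick mentioned above, keeping an old function name alive with a #define after a rename, looks roughly like this. It is a generic sketch with hypothetical demo_ names, not the actual kernel header.

```c
#include <assert.h>

/* Counts how often the flush routine ran (illustration only). */
static int demo_flush_calls;

/* The function under its new, shorter name. */
void demo_do_flush(void)
{
    demo_flush_calls++;
}

/* Compatibility alias: call sites that still use the old name keep
 * compiling and end up in the renamed function. */
#define demo_do_flush_map demo_do_flush

/* Usage example: an "old" caller and a "new" caller hit the same code. */
void alias_demo(void)
{
    int before = demo_flush_calls;

    demo_do_flush_map(); /* old name, resolved by the preprocessor */
    demo_do_flush();     /* new name */
    assert(demo_flush_calls == before + 2);
}
```

The alias costs nothing at run time and lets drivers be converted to the new name one at a time.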
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
  31. 18 Nov, 2019 1 commit
  32. 02 Oct, 2019 1 commit
    • netfilter: drop bridge nf reset from nf_reset · 895b5c9f
      Florian Westphal committed
      commit 174e2381
      ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
      recycle always drop skb extensions.  The additional skb_ext_del() that is
      performed via nf_reset on napi skb recycle is not needed anymore.
      
      Most nf_reset() calls in the stack are there so queued skb won't block
      'rmmod nf_conntrack' indefinitely.
      
      This removes the skb_ext_del from nf_reset, and renames it to a more
      fitting nf_reset_ct().
      
      In a few selected places, add a call to skb_ext_reset to make sure that
      no active extensions remain.
      
      I am submitting this for "net", because we're still early in the release
      cycle.  The patch applies to net-next too, but I think the rename causes
      needless divergence between those trees.
      
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  33. 04 Sep, 2019 1 commit
    • virtio-net: lower min ring num_free for efficiency · 718be6ba
      ? jiang committed
      This change lowers the ring buffer reclaim threshold from 1/2*queue to
      budget for better performance. According to our test with qemu + dpdk,
      packets are dropped when the guest cannot provide free buffers in the
      avail ring in time with the default 1/2*queue. The value in the patch has
      been tested and does show better performance.
      
      Test setup: iperf3 generating packets to the guest (30 min total, 400k pps, UDP)
      avg packets dropped before: 2842
      avg packets dropped after: 360 (-87.3%)
      
      Further, the current code suffers from a starvation problem: the amount of
      work done by try_fill_recv is not bounded by the budget parameter, thus
      (with large queues) once in a while userspace gets blocked for a long
      time while the queue is being refilled. Trigger refills earlier to make
      sure the amount of work to do is limited.
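
The effect of lowering the threshold can be sketched with two predicates. This is an illustration with hypothetical function names; the exact new formula, min(budget, ring size) / 2, is an assumption about the patch and is not stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Old rule: refill only once more than half of the ring is free. */
bool should_refill_old(unsigned int num_free, unsigned int ring_size)
{
    return num_free > ring_size / 2;
}

/* Assumed new rule: refill once the free count exceeds a threshold
 * derived from the NAPI budget, which is far smaller for large rings. */
bool should_refill_new(unsigned int num_free, unsigned int ring_size,
                       unsigned int budget)
{
    return num_free > min_uint(budget, ring_size) / 2;
}

/* Usage example: 100 free slots in a 1024-entry ring with a budget of 64
 * trigger a refill only under the new rule. */
void refill_demo(void)
{
    assert(!should_refill_old(100, 1024));
    assert(should_refill_new(100, 1024, 64));
}
```

Refilling earlier means each refill adds fewer buffers, which is how the change bounds the work done per call.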
      
      Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  34. 15 Jun, 2019 1 commit
  35. 21 May, 2019 1 commit
    • treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 · 1ccea77e
      Thomas Gleixner committed
      Based on 2 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version this program is distributed in the
        hope that it will be useful but without any warranty without even
        the implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details you
        should have received a copy of the gnu general public license along
        with this program if not see http www gnu org licenses
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version this program is distributed in the
        hope that it will be useful but without any warranty without even
        the implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details [based]
        [from] [clk] [highbank] [c] you should have received a copy of the
        gnu general public license along with this program if not see http
        www gnu org licenses
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-or-later
      
      has been chosen to replace the boilerplate/reference in 355 file(s).
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
      Reviewed-by: Steve Winslow <swinslow@gmail.com>
      Reviewed-by: Allison Randal <allison@lohutok.net>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  36. 07 Apr, 2019 1 commit