  1. 16 Jan 2023 (3 commits)
  2. 12 Dec 2022 (1 commit)
  3. 24 Nov 2022 (1 commit)
    • virtio_net: Fix probe failed when modprobe virtio_net · b0686565
      Li Zetao authored
      When doing the following test steps, an error was found:
        step 1: modprobe virtio_net succeeded
          # modprobe virtio_net        <-- OK
      
        step 2: fault injection in register_netdevice()
          # modprobe -r virtio_net     <-- OK
          # ...
            FAULT_INJECTION: forcing a failure.
            name failslab, interval 1, probability 0, space 0, times 0
            CPU: 0 PID: 3521 Comm: modprobe
            Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
            Call Trace:
             <TASK>
             ...
             should_failslab+0xa/0x20
             ...
             dev_set_name+0xc0/0x100
             netdev_register_kobject+0xc2/0x340
             register_netdevice+0xbb9/0x1320
             virtnet_probe+0x1d72/0x2658 [virtio_net]
             ...
             </TASK>
            virtio_net: probe of virtio0 failed with error -22
      
        step 3: modprobe virtio_net failed
          # modprobe virtio_net        <-- failed
            virtio_net: probe of virtio0 failed with error -2
      
      The root cause of the problem is that the queues are not
      disabled on the error handling path when register_netdevice()
      fails in virtnet_probe(), so setup_vq() returns "-ENOENT" on
      the next modprobe call.
      
      virtio_pci_modern_device uses virtqueues to send and
      receive messages, and "queue_enable" records whether the
      queues are available. In vp_modern_find_vqs(), all queues
      will be selected and activated, but once queues are enabled
      there is no way to go back except reset.
      
      Fix it by resetting the virtio device on the error handling path. This
      makes error handling follow the same order as normal device
      cleanup in virtnet_remove() which does: unregister, destroy
      failover, then reset. And that flow is better tested than
      error handling so we can be reasonably sure it works well.
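
      A minimal sketch of the error-path ordering described above (illustrative
      only; it reuses helpers the driver already has, such as
      net_failover_destroy() and virtio_reset_device(), and is not the exact
      hunk of the patch):

       /* in virtnet_probe(), when register_netdevice() fails */
       err = register_netdevice(dev);
       if (err) {
               /* Mirror virtnet_remove(): destroy failover, then reset the
                * device so the already-enabled virtqueues are cleared and a
                * later probe does not hit -ENOENT in setup_vq(). */
               net_failover_destroy(vi->failover);
               virtio_reset_device(vdev);
               /* ... free virtqueues and the netdev ... */
               return err;
       }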
      
      Fixes: 02465555 ("virtio_net: fix use after free on allocation failure")
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Link: https://lore.kernel.org/r/20221122150046.3910638-1-lizetao1@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  4. 29 Oct 2022 (1 commit)
  5. 07 Oct 2022 (2 commits)
  6. 01 Sep 2022 (1 commit)
  7. 16 Aug 2022 (1 commit)
  8. 12 Aug 2022 (1 commit)
  9. 11 Aug 2022 (7 commits)
  10. 08 Aug 2022 (1 commit)
  11. 27 Jul 2022 (1 commit)
    • virtio-net: fix the race between refill work and close · 5a159128
      Jason Wang authored
      We try using cancel_delayed_work_sync() to prevent the work from
      enabling NAPI. This is insufficient since we don't disable the source
      of the refill work scheduling. This means a NAPI poll callback that
      runs after cancel_delayed_work_sync() can schedule the refill work,
      which can then re-enable NAPI and lead to a use-after-free [1].
      
      Since the work can enable NAPI, we can't simply disable NAPI before
      calling cancel_delayed_work_sync(). So fix this by introducing a
      dedicated boolean to control whether or not the work could be
      scheduled from NAPI.
      
      [1]
      ==================================================================
      BUG: KASAN: use-after-free in refill_work+0x43/0xd4
      Read of size 2 at addr ffff88810562c92e by task kworker/2:1/42
      
      CPU: 2 PID: 42 Comm: kworker/2:1 Not tainted 5.19.0-rc1+ #480
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      Workqueue: events refill_work
      Call Trace:
       <TASK>
       dump_stack_lvl+0x34/0x44
       print_report.cold+0xbb/0x6ac
       ? _printk+0xad/0xde
       ? refill_work+0x43/0xd4
       kasan_report+0xa8/0x130
       ? refill_work+0x43/0xd4
       refill_work+0x43/0xd4
       process_one_work+0x43d/0x780
       worker_thread+0x2a0/0x6f0
       ? process_one_work+0x780/0x780
       kthread+0x167/0x1a0
       ? kthread_exit+0x50/0x50
       ret_from_fork+0x22/0x30
       </TASK>
      ...
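
      A minimal sketch of the dedicated-boolean approach described above
      (illustrative; it assumes fields named refill_lock and refill_enabled
      in struct virtnet_info, which need not match the actual patch):

       /* close/remove paths: forbid scheduling, then cancel pending work */
       static void virtnet_disable_refill(struct virtnet_info *vi)
       {
               spin_lock_bh(&vi->refill_lock);
               vi->refill_enabled = false;
               spin_unlock_bh(&vi->refill_lock);
       }

       /* NAPI path: only schedule the refill work while it is still allowed */
       static void virtnet_try_schedule_refill(struct virtnet_info *vi)
       {
               spin_lock_bh(&vi->refill_lock);
               if (vi->refill_enabled)
                       schedule_delayed_work(&vi->refill, 0);
               spin_unlock_bh(&vi->refill_lock);
       }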
      
      Fixes: b2baed69 ("virtio_net: set/cancel work on ndo_open/ndo_stop")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 27 Jun 2022 (1 commit)
    • virtio-net: fix race between ndo_open() and virtio_device_ready() · 50c0ada6
      Jason Wang authored
      We currently call virtio_device_ready() after netdev
      registration. Since ndo_open() can be called immediately
      after register_netdev(), there is a race between
      ndo_open() and virtio_device_ready(): the driver may start to use the
      device before DRIVER_OK, which violates the spec.
      
      Fix this by switching to register_netdevice() and protecting
      virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
      only be called after virtio_device_ready().
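
      A minimal sketch of this ordering (illustrative, not the exact patch):

       rtnl_lock();
       err = register_netdevice(dev);
       if (err) {
               rtnl_unlock();
               goto err_cleanup;        /* hypothetical error label */
       }
       /* ndo_open() also needs rtnl, so it cannot run before this point */
       virtio_device_ready(vdev);
       rtnl_unlock();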
      
      Fixes: 4baf1e33 ("virtio_net: enable VQs early")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Message-Id: <20220617072949.30734-1-jasowang@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  13. 23 Jun 2022 (1 commit)
    • virtio_net: fix xdp_rxq_info bug after suspend/resume · 8af52fe9
      Stephan Gerhold authored
      The following sequence currently causes a driver bug warning
      when using virtio_net:
      
        # ip link set eth0 up
        # echo mem > /sys/power/state (or e.g. # rtcwake -s 10 -m mem)
        <resume>
        # ip link set eth0 down
      
        Missing register, driver bug
        WARNING: CPU: 0 PID: 375 at net/core/xdp.c:138 xdp_rxq_info_unreg+0x58/0x60
        Call trace:
         xdp_rxq_info_unreg+0x58/0x60
         virtnet_close+0x58/0xac
         __dev_close_many+0xac/0x140
         __dev_change_flags+0xd8/0x210
         dev_change_flags+0x24/0x64
         do_setlink+0x230/0xdd0
         ...
      
      This happens because virtnet_freeze() frees the receive_queue
      completely (including struct xdp_rxq_info) but does not call
      xdp_rxq_info_unreg(). Similarly, virtnet_restore() sets up the
      receive_queue again but does not call xdp_rxq_info_reg().
      
      Actually, parts of virtnet_freeze_down() and virtnet_restore_up()
      are almost identical to virtnet_close() and virtnet_open(): only
      the calls to xdp_rxq_info_(un)reg() are missing. This means that
      we can fix this easily and avoid such problems in the future by
      just calling virtnet_close()/open() from the freeze/restore handlers.
      
      Aside from adding the missing xdp_rxq_info calls, the only difference
      is that the refill work is only cancelled if netif_running(). However,
      this should not make any functional difference since the refill work
      should only be active if the network interface is actually up.
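
      A rough sketch of what the reworked handlers look like with this
      approach (illustrative and simplified, not the exact patch):

       static void virtnet_freeze_down(struct virtio_device *vdev)
       {
               struct virtnet_info *vi = vdev->priv;

               netif_device_detach(vi->dev);
               if (netif_running(vi->dev))
                       virtnet_close(vi->dev);   /* also calls xdp_rxq_info_unreg() */
       }

       static int virtnet_restore_up(struct virtio_device *vdev)
       {
               struct virtnet_info *vi = vdev->priv;
               int err = 0;

               virtio_device_ready(vdev);
               if (netif_running(vi->dev))
                       err = virtnet_open(vi->dev);   /* re-registers xdp_rxq_info */
               netif_device_attach(vi->dev);
               return err;
       }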
      
      Fixes: 754b8a21 ("virtio_net: setup xdp_rxq_info")
      Signed-off-by: Stephan Gerhold <stephan.gerhold@kernkonzept.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20220621114845.3650258-1-stephan.gerhold@kernkonzept.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  14. 08 May 2022 (1 commit)
  15. 06 May 2022 (1 commit)
  16. 26 Apr 2022 (1 commit)
    • virtio_net: fix wrong buf address calculation when using xdp · acb16b39
      Nikolay Aleksandrov authored
      We received a report[1] of kernel crashes when Cilium is used in XDP
      mode with virtio_net after updating to newer kernels. After
      investigating, it turned out that when using mergeable bufs with an
      XDP program which adjusts xdp.data or xdp.data_meta, page_to_skb()
      calculates the build_skb address wrong because the offset can become
      less than the headroom, so it gets an address in the previous page
      (-X bytes depending on how much lower the offset is):
       page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
      
      This is a pr_err() I added at the beginning of page_to_skb() which
      clearly shows an offset that is less than the headroom after adding
      4 bytes of metadata via an xdp prog. The calculations done are:
       receive_mergeable():
       headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
       offset = xdp.data - page_address(xdp_page) -
                vi->hdr_len - metasize;
      
       page_to_skb():
       p = page_address(page) + offset;
       ...
       buf = p - headroom;
      
      As seen above, buf now ends up 4 bytes before the page's starting
      address, and build_skb() later sets it as skb->head and skb->data.
      Depending on what's done with the skb (most often when it's freed)
      we get all kinds of corruptions and BUG_ON() triggers in mm[2]. We
      have to recalculate the new headroom after the xdp program has run,
      similar to how offset and len are recalculated. Headroom is directly
      related to data_hard_start, data and data_meta, so we use them to
      get the new size. The result is correct (similar pr_err() in
      page_to_skb(), one case of xdp_page and one case of a virtnet buf):
       a) Case with 4 bytes of metadata
       [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
       [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
       b) Case of pushing data +32 bytes
       [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
       [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
       c) Case of pushing data -33 bytes
       [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
       [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
      
      Offset and headroom are equal because offset points to the start of
      the reserved bytes for the virtio_net header, which are at buf start +
      headroom, while data points at buf start + vnet hdr size + headroom,
      so when data or data_meta are adjusted by the xdp prog both the
      headroom size and the offset change equally. We can use data_hard_start
      to compute the new headroom after the xdp prog (linearized / page start
      case; the virtnet buf case is similar, just with a bigger base offset):
       xdp.data_hard_start = page_address + vnet_hdr
       xdp.data = page_address + vnet_hdr + headroom
       new headroom after xdp prog = xdp.data - xdp.data_hard_start - metasize
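
      As a sketch, the recalculation in C looks roughly like this (illustrative
      only, reusing the names from the description above):

       /* after bpf_prog_run_xdp() has possibly moved data/data_meta */
       metasize = xdp.data - xdp.data_meta;
       headroom = xdp.data - xdp.data_hard_start - metasize;
       offset   = xdp.data - page_address(xdp_page) - vi->hdr_len - metasize;
       /* page_to_skb() then derives buf = p - headroom, which now stays
        * inside the page even after the XDP program moved xdp.data */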
      
      An example reproducer xdp prog[3] is below.
      
      [1] https://github.com/cilium/cilium/issues/19453
      
      [2] Two of the many traces:
       [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
       [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
       [   41.300891] kernel BUG at include/linux/mm.h:720!
       [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
       [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
       [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
       [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
       [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
       [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
       [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
       [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
       [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
       [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
       [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
       [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
       [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
       [   41.321387] Call Trace:
       [   41.321819]  <TASK>
       [   41.322193]  skb_release_data+0x13f/0x1c0
       [   41.322902]  __kfree_skb+0x20/0x30
       [   41.343870]  tcp_recvmsg_locked+0x671/0x880
       [   41.363764]  tcp_recvmsg+0x5e/0x1c0
       [   41.384102]  inet_recvmsg+0x42/0x100
       [   41.406783]  ? sock_recvmsg+0x1d/0x70
       [   41.428201]  sock_read_iter+0x84/0xd0
       [   41.445592]  ? 0xffffffffa3000000
       [   41.462442]  new_sync_read+0x148/0x160
       [   41.479314]  ? 0xffffffffa3000000
       [   41.496937]  vfs_read+0x138/0x190
       [   41.517198]  ksys_read+0x87/0xc0
       [   41.535336]  do_syscall_64+0x3b/0x90
       [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
       [   41.568050] RIP: 0033:0x48765b
       [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
       [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
       [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
       [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
       [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
       [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
       [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
       [   41.744254]  </TASK>
       [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
      
       and
      
       [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
       [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
       [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
       [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
       [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
       [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
       [   33.532471] page dumped because: nonzero mapcount
       [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
       [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
       [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
       [   33.532484] Call Trace:
       [   33.532496]  <TASK>
       [   33.532500]  dump_stack_lvl+0x45/0x5a
       [   33.532506]  bad_page.cold+0x63/0x94
       [   33.532510]  free_pcp_prepare+0x290/0x420
       [   33.532515]  free_unref_page+0x1b/0x100
       [   33.532518]  skb_release_data+0x13f/0x1c0
       [   33.532524]  kfree_skb_reason+0x3e/0xc0
       [   33.532527]  ip6_mc_input+0x23c/0x2b0
       [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
       [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
      
      [3] XDP program to reproduce (xdp_pass.c):
       #include <linux/bpf.h>
       #include <bpf/bpf_helpers.h>
      
       SEC("xdp_pass")
       int xdp_pkt_pass(struct xdp_md *ctx)
       {
                bpf_xdp_adjust_head(ctx, -(int)32);
                return XDP_PASS;
       }
      
       char _license[] SEC("license") = "GPL";
      
       compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
       load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
      
      CC: stable@vger.kernel.org
      CC: Jason Wang <jasowang@redhat.com>
      CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: "Michael S. Tsirkin" <mst@redhat.com>
      CC: virtualization@lists.linux-foundation.org
      Fixes: 8fb7da9e ("virtio_net: get build_skb() buf by data ptr")
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20220425103703.3067292-1-razor@blackwall.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  17. 29 Mar 2022 (4 commits)
  18. 15 Feb 2022 (1 commit)
  19. 16 Jan 2022 (1 commit)
  20. 15 Jan 2022 (1 commit)
  21. 16 Dec 2021 (1 commit)
  22. 14 Dec 2021 (1 commit)
  23. 25 Nov 2021 (1 commit)
  24. 22 Nov 2021 (1 commit)
  25. 17 Nov 2021 (1 commit)
  26. 01 Nov 2021 (2 commits)
  27. 28 Oct 2021 (1 commit)