1. 02 6月, 2020 1 次提交
  2. 15 5月, 2020 1 次提交
    • J
      virtio_net: Add XDP frame size in two code paths · 9ce6146e
      Jesper Dangaard Brouer 提交于
      The virtio_net driver is running inside the guest-OS. There are two
      XDP receive code-paths in virtio_net, namely receive_small() and
      receive_mergeable(). The receive_big() function does not support XDP.
      
      In receive_small() the frame size is available in buflen. The buffer
      backing these frames are allocated in add_recvbuf_small() with same
      size, except for the headroom, but tailroom have reserved room for
      skb_shared_info. The headroom is encoded in ctx pointer as a value.
      
      In receive_mergeable() the frame size is more dynamic. There are two
      basic cases: (1) buffer size is based on a exponentially weighted
      moving average (see DECLARE_EWMA) of packet length. Or (2) in case
      virtnet_get_headroom() have any headroom then buffer size is
      PAGE_SIZE. The ctx pointer is this time used for encoding two values;
      the buffer len "truesize" and headroom. In case (1) if the rx buffer
      size is underestimated, the packet will have been split over more
      buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed in top of
      buffer area). If that happens the XDP path does a xdp_linearize_page
      operation.
      
      V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.
      
      The code is really hard to follow, so some hints to reviewers.
      The receive_mergeable() case gets frames that were allocated in
      add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(),
      and 'buf' ptr is advanced this headroom.  The headroom can only
      be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
      simple:
      
        static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
        {
      	return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
        }
      
      As frame_sz is an offset size from xdp.data_hard_start, reviewers
      should notice how this is calculated in receive_mergeable():
      
        int offset = buf - page_address(page);
        [...]
        data = page_address(xdp_page) + offset;
        xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
      
      The calculated offset will always be VIRTIO_XDP_HEADROOM when
      reaching this code.  Thus, xdp.data_hard_start will be page-start
      address plus vi->hdr_len.  Given this xdp.frame_sz need to be
      reduced with vi->hdr_len size.
      
      IMHO a followup patch should cleanup this code to make it easier
      to maintain and understand, but it is outside the scope of this
      patchset.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/bpf/158945344436.97035.9445115070189151680.stgit@firesoul
      9ce6146e
  3. 07 5月, 2020 1 次提交
    • M
      virtio_net: fix lockdep warning on 32 bit · 01c32598
      Michael S. Tsirkin 提交于
      When we fill up a receive VQ, try_fill_recv currently tries to count
      kicks using a 64 bit stats counter. Turns out, on a 32 bit kernel that
      uses a seqcount. sequence counts are "lock" constructs where you need to
      make sure that writers are serialized.
      
      In turn, this means that we mustn't run two try_fill_recv concurrently.
      Which of course we don't. We do run try_fill_recv sometimes from a
      softirq napi context, and sometimes from a fully preemptible context,
      but the later always runs with napi disabled.
      
      However, when it comes to the seqcount, lockdep is trying to enforce the
      rule that the same lock isn't accessed from preemptible and softirq
      context - it doesn't know about napi being enabled/disabled. This causes
      a false-positive warning:
      
      WARNING: inconsistent lock state
      ...
      inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      
      As a work around, shut down the warning by switching
      to u64_stats_update_begin_irqsave - that works by disabling
      interrupts on 32 bit only, is a NOP on 64 bit.
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01c32598
  4. 06 3月, 2020 1 次提交
  5. 01 3月, 2020 1 次提交
    • C
      net/ethtool: Introduce link_ksettings API for virtual network devices · 9aedc6e2
      Cris Forno 提交于
      With the ethtool_virtdev_set_link_ksettings function in core/ethtool.c,
      ibmveth, netvsc, and virtio now use the core's helper function.
      
      Funtionality changes that pertain to ibmveth driver include:
      
        1. Changed the initial hardcoded link speed to 1GB.
      
        2. Added support for allowing a user to change the reported link
        speed via ethtool.
      
      Functionality changes to the netvsc driver include:
      
        1. When netvsc_get_link_ksettings is called, it will defer to the VF
        device if it exists to pull accelerated networking values, otherwise
        pull default or user-defined values.
      
        2. Similarly, if netvsc_set_link_ksettings called and a VF device
        exists, the real values of speed and duplex are changed.
      Signed-off-by: NCris Forno <cforno12@linux.vnet.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9aedc6e2
  6. 26 2月, 2020 2 次提交
  7. 27 1月, 2020 1 次提交
  8. 17 1月, 2020 1 次提交
    • T
      xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths · 1d233886
      Toke Høiland-Jørgensen 提交于
      Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
      we can re-use the bulking for the non-map version of the bpf_redirect()
      helper. This is a simple matter of having xdp_do_redirect_slow() queue the
      frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
      
      Unfortunately we can't make the bpf_redirect() helper return an error if
      the ifindex doesn't exit (as bpf_redirect_map() does), because we don't
      have a reference to the network namespace of the ingress device at the time
      the helper is called. So we have to leave it as-is and keep the device
      lookup in xdp_do_redirect_slow().
      
      Since this leaves less reason to have the non-map redirect code in a
      separate function, so we get rid of the xdp_do_redirect_slow() function
      entirely. This does lose us the tracepoint disambiguation, but fortunately
      the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
      entry structures. This means both can contain a map index, so we can just
      amend the tracepoint definitions so we always emit the xdp_redirect(_err)
      tracepoints, but with the map ID only populated if a map is present. This
      means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
      the definitions around in case someone is still listening for them.
      
      With this change, the performance of the xdp_redirect sample program goes
      from 5Mpps to 8.4Mpps (a 68% increase).
      
      Since the flush functions are no longer map-specific, rename the flush()
      functions to drop _map from their names. One of the renamed functions is
      the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
      keep from having to update all drivers, use a #define to keep the old name
      working, and only update the virtual drivers in this patch.
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
      1d233886
  9. 18 11月, 2019 1 次提交
  10. 02 10月, 2019 1 次提交
    • F
      netfilter: drop bridge nf reset from nf_reset · 895b5c9f
      Florian Westphal 提交于
      commit 174e2381
      ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
      recycle always drop skb extensions.  The additional skb_ext_del() that is
      performed via nf_reset on napi skb recycle is not needed anymore.
      
      Most nf_reset() calls in the stack are there so queued skb won't block
      'rmmod nf_conntrack' indefinitely.
      
      This removes the skb_ext_del from nf_reset, and renames it to a more
      fitting nf_reset_ct().
      
      In a few selected places, add a call to skb_ext_reset to make sure that
      no active extensions remain.
      
      I am submitting this for "net", because we're still early in the release
      cycle.  The patch applies to net-next too, but I think the rename causes
      needless divergence between those trees.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      895b5c9f
  11. 04 9月, 2019 1 次提交
    • ?
      virtio-net: lower min ring num_free for efficiency · 718be6ba
      ? jiang 提交于
      This change lowers ring buffer reclaim threshold from 1/2*queue to budget
      for better performance. According to our test with qemu + dpdk, packet
      dropping happens when the guest is not able to provide free buffer in
      avail ring timely with default 1/2*queue. The value in the patch has been
      tested and does show better performance.
      
      Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k, UDP)
      avg packets drop before: 2842
      avg packets drop after: 360(-87.3%)
      
      Further, current code suffers from a starvation problem: the amount of
      work done by try_fill_recv is not bounded by the budget parameter, thus
      (with large queues) once in a while userspace gets blocked for a long
      time while queue is being refilled. Trigger refills earlier to make sure
      the amount of work to do is limited.
      Signed-off-by: Njiangkidd <jiangkidd@hotmail.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      718be6ba
  12. 15 6月, 2019 1 次提交
  13. 21 5月, 2019 1 次提交
    • T
      treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 · 1ccea77e
      Thomas Gleixner 提交于
      Based on 2 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version this program is distributed in the
        hope that it will be useful but without any warranty without even
        the implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details you
        should have received a copy of the gnu general public license along
        with this program if not see http www gnu org licenses
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license as published by
        the free software foundation either version 2 of the license or at
        your option any later version this program is distributed in the
        hope that it will be useful but without any warranty without even
        the implied warranty of merchantability or fitness for a particular
        purpose see the gnu general public license for more details [based]
        [from] [clk] [highbank] [c] you should have received a copy of the
        gnu general public license along with this program if not see http
        www gnu org licenses
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-or-later
      
      has been chosen to replace the boilerplate/reference in 355 file(s).
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NJilayne Lovejoy <opensource@jilayne.com>
      Reviewed-by: NSteve Winslow <swinslow@gmail.com>
      Reviewed-by: NAllison Randal <allison@lohutok.net>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ccea77e
  14. 07 4月, 2019 2 次提交
  15. 02 4月, 2019 1 次提交
    • F
      net: move skb->xmit_more hint to softnet data · 6b16f9ee
      Florian Westphal 提交于
      There are two reasons for this.
      
      First, the xmit_more flag conceptually doesn't fit into the skb, as
      xmit_more is not a property related to the skb.
      Its only a hint to the driver that the stack is about to transmit another
      packet immediately.
      
      Second, it was only done this way to not have to pass another argument
      to ndo_start_xmit().
      
      We can place xmit_more in the softnet data, next to the device recursion.
      The recursion counter is already written to on each transmit. The "more"
      indicator is placed right next to it.
      
      Drivers can use the netdev_xmit_more() helper instead of skb->xmit_more
      to check the "more packets coming" hint.
      
      skb->xmit_more is retained (but always 0) to not cause build breakage.
      
      This change takes care of the simple s/skb->xmit_more/netdev_xmit_more()/
      conversions.  Remaining drivers are converted in the next patches.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b16f9ee
  16. 19 3月, 2019 1 次提交
  17. 04 2月, 2019 1 次提交
  18. 31 1月, 2019 7 次提交
  19. 20 1月, 2019 2 次提交
  20. 21 12月, 2018 1 次提交
    • W
      virtio-net: ethtool configurable LRO · a02e8964
      Willem de Bruijn 提交于
      Virtio-net devices negotiate LRO support with the host.
      Display the initially negotiated state with ethtool -k.
      
      Also allow configuring it with ethtool -K, reusing the existing
      virtnet_set_guest_offloads helper that configures LRO for XDP.
      This is conditional on VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
      
      Virtio-net negotiates TSO4 and TSO6 separately, but ethtool does not
      distinguish between the two. Display LRO as on only if any offload
      is active.
      
      RTNL is held while calling virtnet_set_features, same as on the path
      from virtnet_xdp_set.
      
      Changes v1 -> v2
        - allow ethtool config (-K) only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
        - show LRO as enabled if any LRO variant is enabled
        - do not allow configuration while XDP is active
        - differentiate current features from the capable set, to restore
          on XDP down only those features that were active on XDP up
        - move test out of VIRTIO_NET_F_CSUM/TSO branch, which is tx only
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a02e8964
  21. 01 12月, 2018 1 次提交
  22. 24 11月, 2018 2 次提交
  23. 18 10月, 2018 1 次提交
  24. 11 10月, 2018 1 次提交
  25. 29 9月, 2018 1 次提交
    • E
      virtio_net: remove ndo_poll_controller · 260dd2c3
      Eric Dumazet 提交于
      As diagnosed by Song Liu, ndo_poll_controller() can
      be very dangerous on loaded hosts, since the cpu
      calling ndo_poll_controller() might steal all NAPI
      contexts (for all RX/TX queues of the NIC). This capture
      can last for unlimited amount of time, since one
      cpu is generally not able to drain all the queues under load.
      
      virto_net uses NAPI for TX completions, so we better let core
      networking stack call the napi->poll() to avoid the capture.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      260dd2c3
  26. 14 8月, 2018 1 次提交
  27. 12 8月, 2018 2 次提交
  28. 10 8月, 2018 1 次提交
    • A
      net: allow to call netif_reset_xps_queues() under cpus_read_lock · 4d99f660
      Andrei Vagin 提交于
      The definition of static_key_slow_inc() has cpus_read_lock in place. In the
      virtio_net driver, XPS queues are initialized after setting the queue:cpu
      affinity in virtnet_set_affinity() which is already protected within
      cpus_read_lock. Lockdep prints a warning when we are trying to acquire
      cpus_read_lock when it is already held.
      
      This patch adds an ability to call __netif_set_xps_queue under
      cpus_read_lock().
      Acked-by: NJason Wang <jasowang@redhat.com>
      
      ============================================
      WARNING: possible recursive locking detected
      4.18.0-rc3-next-20180703+ #1 Not tainted
      --------------------------------------------
      swapper/0/1 is trying to acquire lock:
      00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20
      
      but task is already holding lock:
      00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(cpu_hotplug_lock.rw_sem);
        lock(cpu_hotplug_lock.rw_sem);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by swapper/0/1:
       #0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
       #1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
       #2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60
      
      v2: move cpus_read_lock() out of __netif_set_xps_queue()
      
      Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Fixes: 8af2c06f ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
      Signed-off-by: NAndrei Vagin <avagin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d99f660
  29. 06 8月, 2018 1 次提交