1. 03 Dec, 2020 4 commits
  2. 01 Dec, 2020 1 commit
    • net: Introduce preferred busy-polling · 7fd3253a
      Björn Töpel committed
      The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
      option or system-wide using the /proc/sys/net/core/busy_read knob, is
      opportunistic. That means that if the NAPI context is not
      scheduled, busy-polling will poll it. If, after busy-polling, the
      budget is exceeded, the busy-polling logic will schedule the NAPI
      onto the regular softirq handling.
      
      One implication of the behavior above is that a busy or heavily loaded
      NAPI context will never enter or allow busy-polling. Some applications
      prefer that most NAPI processing be done by busy-polling.
      
      This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
      in concert with the napi_defer_hard_irqs and gro_flush_timeout
      knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
      introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral
      feature"), and allow a user to defer enabling interrupts and
      instead schedule the NAPI context from a watchdog timer. When a user
      enables SO_PREFER_BUSY_POLL, again with the other knobs enabled,
      and the NAPI context is being processed by a softirq, the softirq NAPI
      processing will exit early to allow the busy-polling to be performed.
      
      If the application stops performing busy-polling via a system call,
      the watchdog timer defined by gro_flush_timeout will time out, and
      regular softirq handling will resume.
      
      In summary: heavy-traffic applications that prefer busy-polling over
      softirq processing should use this option.
      
      Example usage:
      
        $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
        $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
      
      Note that the timeout should be larger than the userspace processing
      window; otherwise the watchdog will time out and fall back to regular
      softirq processing.
      
      Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
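As an illustration of the last step in the commit message above (enabling SO_BUSY_POLL and SO_PREFER_BUSY_POLL on a socket), here is a minimal Python sketch. It is not taken from the commit. The helper name `enable_preferred_busy_poll` and the 20 µs default are hypothetical. The numeric option values are assumed to be those of the asm-generic socket header (46 and 69); some architectures use different numbers. On a real system the calls may fail, for example with EPERM when the process lacks the required privileges, so the sketch reports errors instead of raising.

```python
import socket

# Assumed option numbers (asm-generic layout, e.g. x86 and arm).
# Architectures with their own socket.h may use different values.
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)
SO_PREFER_BUSY_POLL = 69  # assumed value; not exported by Python's socket module


def enable_preferred_busy_poll(sock, busy_poll_usecs=20):
    """Try to set both options on `sock`.

    Returns a dict mapping each option name to None on success or to the
    OSError raised by setsockopt() on failure.
    """
    results = {}
    for name, opt, value in (
        ("SO_BUSY_POLL", SO_BUSY_POLL, busy_poll_usecs),
        ("SO_PREFER_BUSY_POLL", SO_PREFER_BUSY_POLL, 1),
    ):
        try:
            sock.setsockopt(socket.SOL_SOCKET, opt, value)
            results[name] = None
        except OSError as exc:
            results[name] = exc
    return results
```

A caller would create its socket as usual, pass it to the helper, and inspect the returned dict. The napi_defer_hard_irqs and gro_flush_timeout knobs shown in the commit message still have to be configured separately on the interface.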
  3. 26 Nov, 2020 1 commit
  4. 19 Nov, 2020 1 commit
  5. 13 Nov, 2020 4 commits
  6. 12 Nov, 2020 2 commits
    • net: evaluate net.ipv4.conf.all.proxy_arp_pvlan · 1af5318c
      Vincent Bernat committed
      Introduced in 65324144, the "proxy_arp_pvlan" sysctl is a
      per-interface sysctl to tune proxy ARP support for private VLANs.
      While the "all" variant is exposed, it was a no-op and never evaluated.
      We use the usual "or" logic for this kind of sysctl.
      
      Fixes: 65324144 ("net: RFC3069, private VLAN proxy arp support")
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
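The "or" logic mentioned in the commit message above can be sketched from user space as follows. This is only an illustration of the rule "the setting is active if either the `all` value or the per-interface value is non-zero", not the kernel implementation. The helper names are hypothetical, and treating an unreadable sysctl file as 0 is a simplification made for the sketch.

```python
from pathlib import Path


def effective_setting(all_value, iface_value):
    """Combine the "all" and per-interface values with "or" logic."""
    return bool(all_value) or bool(iface_value)


def read_flag(name, iface, family="ipv4", root="/proc/sys/net"):
    """Read one conf sysctl as an int; return 0 if it cannot be read."""
    path = Path(root) / family / "conf" / iface / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def is_enabled(name, iface, family="ipv4"):
    """True if the sysctl is set on "all" or on the given interface."""
    return effective_setting(
        read_flag(name, "all", family),
        read_flag(name, iface, family),
    )
```

For example, `is_enabled("proxy_arp_pvlan", "eth0")` (with `eth0` as a placeholder interface name) would report the setting as active when either `net.ipv4.conf.all.proxy_arp_pvlan` or `net.ipv4.conf.eth0.proxy_arp_pvlan` is non-zero.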
    • net: evaluate net.ipvX.conf.all.ignore_routes_with_linkdown · c0c5a60f
      Vincent Bernat committed
      Introduced in 0eeb075f, the "ignore_routes_with_linkdown" sysctl
      ignores a route whose interface is down. It is provided as a
      per-interface sysctl. However, while an "all" variant is exposed, it
      was a no-op since it was never evaluated. We use the usual "or" logic
      for this kind of sysctl.
      
      Tested with:
      
          ip link add type veth # veth0 + veth1
          ip link add type veth # veth2 + veth3
          ip link set up dev veth0
          ip link set up dev veth1 # link-status paired with veth0
          ip link set up dev veth2
          ip link set up dev veth3 # link-status paired with veth2
      
          # First available path
          ip -4 addr add 203.0.113.${uts#H}/24 dev veth0
          ip -6 addr add 2001:db8:1::${uts#H}/64 dev veth0
      
          # Second available path
          ip -4 addr add 192.0.2.${uts#H}/24 dev veth2
          ip -6 addr add 2001:db8:2::${uts#H}/64 dev veth2
      
          # More specific route through first path
          ip -4 route add 198.51.100.0/25 via 203.0.113.254 # via veth0
          ip -6 route add 2001:db8:3::/56 via 2001:db8:1::ff # via veth0
      
          # Less specific route through second path
          ip -4 route add 198.51.100.0/24 via 192.0.2.254 # via veth2
          ip -6 route add 2001:db8:3::/48 via 2001:db8:2::ff # via veth2
      
          # H1: enable on "all"
          # H2: enable on "veth0"
          for v in ipv4 ipv6; do
            case $uts in
              H1)
                sysctl -qw net.${v}.conf.all.ignore_routes_with_linkdown=1
                ;;
              H2)
                sysctl -qw net.${v}.conf.veth0.ignore_routes_with_linkdown=1
                ;;
            esac
          done
      
          set -xe
          # When veth0 is up, best route is through veth0
          ip -o route get 198.51.100.1 | grep -Fw veth0
          ip -o route get 2001:db8:3::1 | grep -Fw veth0
      
          # When veth0 is down, best route should be through veth2 on H1/H2,
          # but through veth0 on H3
          ip link set down dev veth1 # down veth0
          ip route show
          [ $uts != H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth0
          [ $uts != H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth0
          [ $uts = H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth2
          [ $uts = H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth2
      
      Without this patch, the last two lines would fail on H1 (the one using
      the "all" sysctl). With the patch, everything succeeds as expected.
      
      Also document the sysctl in `ip-sysctl.rst`.
      
      Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  7. 11 Nov, 2020 4 commits
  8. 10 Nov, 2020 1 commit
  9. 07 Nov, 2020 7 commits
  10. 06 Nov, 2020 6 commits
  11. 05 Nov, 2020 2 commits
    • io_uring: properly handle SQPOLL request cancelations · fdaf083c
      Jens Axboe committed
      Track if a given task io_uring context contains SQPOLL instances, so we
      can iterate those for cancelation (and request counts). This ensures that
      we properly wait on SQPOLL contexts, and find everything that needs
      canceling.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iomap: support partial page discard on writeback block mapping failure · 763e4cdc
      Brian Foster committed
      iomap writeback mapping failure only calls into ->discard_page() if
      the current page has not been added to the ioend. Accordingly, the
      XFS callback assumes a full page discard and invalidation. This is
      problematic for sub-page block size filesystems where some portion
      of a page might have been mapped successfully before a failure to
      map a delalloc block occurs. ->discard_page() is not called in that
      error scenario and the bio is explicitly failed by iomap via the
      error return from ->prepare_ioend(). As a result, the filesystem
      leaks delalloc blocks and corrupts the filesystem block counters.
      
      Since XFS is the only user of ->discard_page(), tweak the semantics
      to invoke the callback unconditionally on mapping errors and provide
      the file offset that failed to map. Update xfs_discard_page() to
      discard the corresponding portion of the file and pass the range
      along to iomap_invalidatepage(). The latter already properly handles
      both full and sub-page scenarios by not changing any iomap or page
      state on sub-page invalidations.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  12. 04 Nov, 2020 2 commits
    • net: add GSO UDP L4 and GSO fraglists to the list of software-backed types · 2e4ef10f
      Alexander Lobakin committed
      Commit e20cf8d3 ("udp: implement GRO for plain UDP sockets.") and
      commit 9fd1ff5d ("udp: Support UDP fraglist GRO/GSO.") made UDP L4
      and fraglisted GRO/GSO fully supported by the software fallback mode.
      We can safely add them to NETIF_F_GSO_SOFTWARE to allow logical/virtual
      netdevs to forward these types of skbs up to the real drivers.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • can: can_create_echo_skb(): fix echo skb generation: always use skb_clone() · 286228d3
      Oleksij Rempel committed
      All user space generated SKBs are owned by a socket (unless injected into the
      kernel via AF_PACKET). If a socket is closed, all associated skbs will be cleaned
      up.
      
      This leads to a problem when a CAN driver calls can_put_echo_skb() on an
      unshared SKB. If the socket is closed before the TX complete handler calls
      can_get_echo_skb() and the echo SKB is subsequently delivered to all
      registered callbacks, an SKB with a refcount of 0 is delivered.
      
      To avoid the problem, in can_get_echo_skb() the original SKB is now always
      cloned, regardless of shared SKB or not. If the process exits it can now
      safely discard its SKBs, without disturbing the delivery of the echo SKB.
      
      The problem shows up in the j1939 stack, when it clones the incoming skb, which
      detects the already 0 refcount.
      
      We can easily reproduce this with the following example:
      
      testj1939 -B -r can0: &
      cansend can0 1823ff40#0123
      
      WARNING: CPU: 0 PID: 293 at lib/refcount.c:25 refcount_warn_saturate+0x108/0x174
      refcount_t: addition on 0; use-after-free.
      Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc dw_hdmi_ahb_audio vcan
      CPU: 0 PID: 293 Comm: cansend Not tainted 5.5.0-rc6-00376-g9e20dcb7040d #1
      Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
      Backtrace:
      [<c010f570>] (dump_backtrace) from [<c010f90c>] (show_stack+0x20/0x24)
      [<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0)
      [<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108)
      [<c0127f0c>] (__warn) from [<c01283c8>] (warn_slowpath_fmt+0xa8/0xcc)
      [<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>] (refcount_warn_saturate+0x108/0x174)
      [<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>] (j1939_can_recv+0x20c/0x210)
      [<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>] (can_rcv_filter+0xb4/0x268)
      [<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4)
      [<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98)
      [<c0aca300>] (can_rcv) from [<c09b1fdc>] (__netif_receive_skb_one_core+0x64/0x88)
      [<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>] (__netif_receive_skb+0x38/0x94)
      [<c09b2038>] (__netif_receive_skb) from [<c09b2130>] (netif_receive_skb_internal+0x64/0xf8)
      [<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>] (netif_receive_skb+0x34/0x19c)
      [<c09b21c4>] (netif_receive_skb) from [<c0791278>] (can_rx_offload_napi_poll+0x58/0xb4)
      
      Fixes: 0ae89beb ("can: add destructor for self generated skbs")
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Link: http://lore.kernel.org/r/20200124132656.22156-1-o.rempel@pengutronix.de
      Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
  13. 03 Nov, 2020 4 commits
    • net: add kcov handle to skb extensions · 6370cc3b
      Aleksandr Nogikh committed
      Remote KCOV coverage collection enables coverage-guided fuzzing of the
      code that is not reachable during normal system call execution. It is
      especially helpful for fuzzing networking subsystems, where it is
      common to perform packet handling in separate work queues even for the
      packets that originated directly from the user space.
      
      Enable coverage-guided frame injection by adding kcov remote handle to
      skb extensions. Default initialization in __alloc_skb and
      __build_skb_around ensures that no socket buffer that was generated
      during a system call will be missed.
      
      Code that is of interest and that performs packet processing should be
      annotated with kcov_remote_start()/kcov_remote_stop().
      
      An alternative approach is to determine kcov_handle solely on the
      basis of the device/interface that received the specific socket
      buffer. However, in this case it would be impossible to distinguish
      between packets that originated during normal background network
      processes or were intentionally injected from the user space.
      Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mm: always have io_remap_pfn_range() set pgprot_decrypted() · f8f6ae5d
      Jason Gunthorpe committed
      The purpose of io_remap_pfn_range() is to map IO memory, such as a
      memory mapped IO exposed through a PCI BAR.  IO devices do not
      understand encryption, so this memory must always be decrypted.
      Automatically call pgprot_decrypted() as part of the generic
      implementation.
      
      This fixes a bug where enabling AMD SME causes subsystems, such as RDMA,
      using io_remap_pfn_range() to expose BAR pages to user space to fail.
      The CPU will encrypt access to those BAR pages instead of passing
      unencrypted IO directly to the device.
      
      Places not mapping IO should use remap_pfn_range().
      
      Fixes: aca20d54 ("x86/mm: Add support to make use of Secure Memory Encryption")
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Dave Young" <dyoung@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • PM: runtime: Drop pm_runtime_clean_up_links() · d6e36668
      Rafael J. Wysocki committed
      After commit d12544fb ("PM: runtime: Remove link state checks in
      rpm_get/put_supplier()") nothing prevents the consumer device's
      runtime PM from acquiring additional references to the supplier
      device after pm_runtime_clean_up_links() has run (or even while it
      is running), so calling this function from __device_release_driver()
      may be pointless (or even harmful).
      
      Moreover, it ignores stateless device links, so the runtime PM
      handling of managed and stateless device links is inconsistent
      because of it, so better get rid of it entirely.
      
      Fixes: d12544fb ("PM: runtime: Remove link state checks in rpm_get/put_supplier()")
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
      Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • PM: runtime: Drop runtime PM references to supplier on link removal · e0e398e2
      Rafael J. Wysocki committed
      While removing a device link, drop the supplier device's runtime PM
      usage counter as many times as needed to drop all of the runtime PM
      references to it from the consumer in addition to dropping the
      consumer's link count.
      
      Fixes: baa8809f ("PM / runtime: Optimize the use of device links")
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
      Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
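The idea in the commit message above, dropping the supplier's usage counter once for every reference the consumer still holds when the link goes away, can be modelled with a small toy in Python. This is a simplified illustration and not the kernel's data structures or code. The `Supplier` and `Link` classes and their field names are hypothetical.

```python
class Supplier:
    """Toy stand-in for a supplier device; it only tracks a usage counter."""

    def __init__(self):
        self.usage_count = 0


class Link:
    """Toy device link that counts references the consumer took on the supplier."""

    def __init__(self, supplier):
        self.supplier = supplier
        self.consumer_refs = 0

    def get_supplier(self):
        """The consumer takes one runtime PM reference on the supplier."""
        self.consumer_refs += 1
        self.supplier.usage_count += 1

    def drop(self):
        """On link removal, release every reference the consumer still holds."""
        while self.consumer_refs > 0:
            self.consumer_refs -= 1
            self.supplier.usage_count -= 1
```

In this model, references taken on the supplier by anyone other than the consumer are left untouched by `drop()`. Only the count accumulated through the link is released.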
  14. 02 Nov, 2020 1 commit