1. 20 Nov 2021 (1 commit)
  2. 16 Nov 2021 (1 commit)
  3. 11 Nov 2021 (1 commit)
    • net: fix premature exit from NAPI state polling in napi_disable() · 0315a075
      Committed by Alexander Lobakin
      Commit 719c5719 ("net: make napi_disable() symmetric with
      enable") accidentally introduced a bug sometimes leading to a kernel
      BUG when bringing an iface up/down under heavy traffic load.
      
      Prior to this commit, napi_disable() was polling n->state until
      none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) was set and then
      always flipped them. Now there is a possibility to exit with
      NAPIF_STATE_SCHED unset, as 'continue' drops us to the cmpxchg()
      call with an uninitialized variable rather than straight to
      another round of the state check.
      
      Error path looks like:
      
      napi_disable():
      unsigned long val, new; /* new is uninitialized */
      
      do {
      	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
      				      NAPIF_STATE_SCHED is set */
      	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
      		usleep_range(20, 200);
      		continue; /* go straight to the condition check */
      	}
      	new = val | <...>
      } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
      						  writes garbage */
      
      napi_enable():
      do {
      	val = READ_ONCE(n->state);
      	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
      <...>
      
      while the typical BUG splat looks like:
      
      [  172.652461] ------------[ cut here ]------------
      [  172.652462] kernel BUG at net/core/dev.c:6937!
      [  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
      [  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
      [  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
      [  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
      [  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
      [  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
      < snip >
      [  172.782403] Call Trace:
      [  172.784857]  <TASK>
      [  172.786963]  ice_up_complete+0x6f/0x210 [ice]
      [  172.791349]  ice_xdp+0x136/0x320 [ice]
      [  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
      [  172.799648]  dev_xdp_install+0x61/0xe0
      [  172.803401]  dev_xdp_attach+0x1e0/0x550
      [  172.807240]  dev_change_xdp_fd+0x1e6/0x220
      [  172.811338]  do_setlink+0xee8/0x1010
      [  172.814917]  rtnl_setlink+0xe5/0x170
      [  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
      [  172.823732]  ? security_capable+0x36/0x50
      < snip >
      
      Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
      for-loop with an explicit break.
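
      The fixed loop then looks roughly like this (a sketch following the
      upstream patch; the exact set of state bits handled may differ):

      void napi_disable(struct napi_struct *n)
      {
      	unsigned long val, new;

      	might_sleep();
      	set_bit(NAPI_STATE_DISABLE, &n->state);

      	for ( ; ; ) {
      		val = READ_ONCE(n->state);
      		if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
      			usleep_range(20, 200);
      			continue; /* 'new' is never read uninitialized */
      		}

      		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
      		new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);

      		if (cmpxchg(&n->state, val, new) == val)
      			break;
      	}

      	hrtimer_cancel(&n->timer);
      	clear_bit(NAPI_STATE_DISABLE, &n->state);
      }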
      
      From v1 [0]:
       - just use a for-loop to simplify both the fix and the existing
         code (Eric).
      
      [0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com
      
      Fixes: 719c5719 ("net: make napi_disable() symmetric with enable")
      Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 27 Oct 2021 (1 commit)
  5. 26 Oct 2021 (1 commit)
    • net: multicast: calculate csum of looped-back and forwarded packets · 9122a70a
      Committed by Cyril Strejc
      While testing a user-space application which transmits UDP
      multicast datagrams and uses multicast routing to send them out of
      defined network interfaces, I found that a multicast router does
      not fill in the UDP checksum of locally produced, looped-back and
      forwarded UDP datagrams if the original output NIC the datagrams
      are sent to has UDP TX checksum offload enabled.

      The datagrams then leave the NIC they have been forwarded to
      malformed.
      
      It is because:
      
      1. If TX checksum offload is enabled on the output NIC, the UDP
         checksum is not calculated by the kernel and is not filled into
         the skb data.
      
      2. dev_loopback_xmit(), which is called solely by
         ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
         unconditionally.
      
      3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
         CHECKSUM_COMPLETE"), the ip_summed value is preserved during
         forwarding.
      
      4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
         a packet egress.
      
      The minimum fix in dev_loopback_xmit():
      
      1. Preserves skb->ip_summed == CHECKSUM_PARTIAL. This is the
         case when the original output NIC has TX checksum offload enabled.
         The effects are:
      
           a) If the forwarding destination interface supports TX checksum
              offloading, the NIC driver is responsible to fill-in the
              checksum.
      
           b) If the forwarding destination interface does NOT support TX
              checksum offloading, checksums are filled-in by kernel before
              skb is submitted to the NIC driver.
      
           c) For local delivery, checksum validation is skipped as in the
              case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
      
      2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY.
         This means the behavior for CHECKSUM_NONE is unmodified; it is
         there to skip checksum validation on local delivery of
         looped-back packets.
      Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 25 Oct 2021 (1 commit)
    • net: Prevent infinite while loop in skb_tx_hash() · 0c57eeec
      Committed by Michael Chan
      Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
      to set the queue count and offset for each TC.  So the queue count
      and offset for the TCs may be zero for a short period after dev->num_tc
      has been set.  If a TX packet is being transmitted at this time in the
      code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
      nonzero dev->num_tc but zero qcount for the TC.  The while loop that
      keeps looping while hash >= qcount will not end.
      
      Fix it by checking that the TC's qcount is nonzero before using it.
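
      A minimal sketch of the guard in skb_tx_hash() (names follow the
      surrounding code; the warning text is illustrative):

      if (dev->num_tc) {
      	u8 tc = netdev_get_prio_tc_map(dev, skb->priority);

      	qoffset = sb_dev->tc_to_txq[tc].offset;
      	qcount = sb_dev->tc_to_txq[tc].count;
      	if (unlikely(!qcount)) {
      		net_warn_ratelimited("%s: invalid qcount, qoffset %u for tc %u\n",
      				     sb_dev->name, qoffset, tc);
      		qoffset = 0;
      		qcount = dev->real_num_tx_queues;
      	}
      }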
      
      Fixes: eadec877 ("net: Add support for subordinate traffic classes to netdev_pick_tx")
      Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 20 Oct 2021 (1 commit)
    • net-core: use netdev_* calls for kernel messages · 5b92be64
      Committed by Jesse Brandeburg
      While loading a driver and changing the number of queues, I noticed this
      message in the kernel log:
      
      "[253489.070080] Number of in use tx queues changed invalidating tc
      mappings. Priority traffic classification disabled!"
      
      But I had no idea what interface was being talked about because this
      message used pr_warn().
      
      After investigating, it appears we can use the already-defined
      netdev_* helpers, which create predictably formatted messages and
      already handle <unknown netdev> cases, for more of the messages in
      dev.c.
      
      After this change, this message (and others) will look like this:
      "[  170.181093] ice 0000:3b:00.0 ens785f0: Number of in use tx queues
      changed invalidating tc mappings. Priority traffic classification
      disabled!"
      
      One goal here was not to change the messages significantly from
      their original format so as not to break users' expectations, so I
      just changed messages that used pr_* and generally started with
      %s == dev->name.
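
      An illustrative conversion in the style of this patch (a
      hypothetical hunk, not an exact quote from dev.c):

      - pr_warn("%s: mixed HW and IP checksum settings.\n", dev->name);
      + netdev_warn(dev, "mixed HW and IP checksum settings.\n");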
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 15 Oct 2021 (3 commits)
    • netfilter: Introduce egress hook · 42df6e1d
      Committed by Lukas Wunner
      Support classifying packets with netfilter on egress to satisfy user
      requirements such as:
      * outbound security policies for containers (Laura)
      * filtering and mangling intra-node Direct Server Return (DSR) traffic
        on a load balancer (Laura)
      * filtering locally generated traffic coming in through AF_PACKET,
        such as local ARP traffic generated for clustering purposes or DHCP
        (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
      * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
        and gPTP with nftables (Pablo)
      * in the future: in-kernel NAT64/NAT46 (Pablo)
      
      The egress hook introduced herein complements the ingress hook added by
      commit e687ad60 ("netfilter: add netfilter ingress hook after
      handle_ing() under unique static key").  A patch for nftables to hook up
      egress rules from user space has been submitted separately, so users may
      immediately take advantage of the feature.
      
      Alternatively or in addition to netfilter, packets can be classified
      with traffic control (tc).  On ingress, packets are classified first by
      tc, then by netfilter.  On egress, the order is reversed for symmetry.
      Conceptually, tc and netfilter can be thought of as layers, with
      netfilter layered above tc.
      
      Traffic control is capable of redirecting packets to another interface
      (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
      host namespace to a container via a veth connection:
      tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
      
      In this case, netfilter egress classifying is not performed when leaving
      the host namespace!  That's because the packet is still on the tc layer.
      If tc redirects the packet to a physical interface in the host namespace
      such that it leaves the system, the packet is never subjected to
      netfilter egress classifying.  That is only logical since it hasn't
      passed through netfilter ingress classifying either.
      
      Packets can alternatively be redirected at the netfilter layer using
      nft fwd.  Such a packet *is* subjected to netfilter egress classifying
      since it has reached the netfilter layer.
      
      Internally, the skb->nf_skip_egress flag controls whether netfilter is
      invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
      be called recursively by tunnel drivers such as vxlan, the flag is
      reverted to false after sch_handle_egress().  This ensures that
      netfilter is applied both on the overlay and underlying network.
      
      Interaction between tc and netfilter is possible by setting and querying
      skb->mark.
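
      Conceptually, the hook placement in __dev_queue_xmit() looks like
      this (a sketch based on the description above; helper names follow
      the patch, the exact code differs):

      #ifdef CONFIG_NET_EGRESS
      	if (static_branch_unlikely(&egress_needed_key)) {
      		/* netfilter egress runs first: it is layered above tc */
      		skb = nf_hook_egress(skb, &rc, dev);
      		if (!skb)
      			goto out;

      		nf_skip_egress(skb, true);
      		skb = sch_handle_egress(skb, &rc, dev);
      		if (!skb)
      			goto out;
      		/* revert the skip flag for recursive tunnel transmits */
      		nf_skip_egress(skb, false);
      	}
      #endif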
      
      If netfilter egress classifying is not enabled on any interface, it is
      patched out of the data path by way of a static_key and doesn't make a
      performance difference that is discernible from noise:
      
      Before:             1537 1538 1538 1537 1538 1537 Mb/sec
      After:              1536 1534 1539 1539 1539 1540 Mb/sec
      Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
      After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
      Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
      After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec
      
      When netfilter egress classifying is enabled on at least one interface,
      a minimal performance penalty is incurred for every egress packet, even
      if the interface it's transmitted over doesn't have any netfilter egress
      rules configured.  That is caused by checking dev->nf_hooks_egress
      against NULL.
      
      Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
      ip link add dev foo type dummy
      ip link set dev foo up
      modprobe pktgen
      echo "add_device foo" > /proc/net/pktgen/kpktgend_3
      samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
      
      Accept all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
      
      Drop all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
      
      Apply this patch when measuring packet drops to avoid errors in dmesg:
      https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Laura García Liébana <nevola@gmail.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Generalize ingress hook include file · 17d20784
      Committed by Lukas Wunner
      Prepare for addition of a netfilter egress hook by generalizing the
      ingress hook include file.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Rename ingress hook include file · 7463acfb
      Committed by Lukas Wunner
      Prepare for addition of a netfilter egress hook by renaming
      <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.
      
      The egress hook also necessitates a refactoring of the include file,
      but that is done in a separate commit to ease reviewing.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  9. 10 Oct 2021 (1 commit)
  10. 09 Oct 2021 (1 commit)
    • net: introduce a function to check if a netdev name is in use · 75ea27d0
      Committed by Antoine Tenart
      __dev_get_by_name is currently used either to retrieve a net
      device reference using its name or to check if a name is already
      used by a registered net device (per ns). In the latter case there
      is no need to return a reference to a net device.
      
      Introduce a new helper, netdev_name_in_use, to check if a name is
      currently used by a registered net device without taking a
      reference on the corresponding net device. This helper uses
      netdev_name_node_lookup instead of __dev_get_by_name as we don't
      need the extra logic for retrieving a reference to the
      corresponding net device.
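
      The helper is then essentially a thin wrapper (a sketch consistent
      with the description; the upstream version may differ in detail):

      bool netdev_name_in_use(struct net *net, const char *name)
      {
      	return netdev_name_node_lookup(net, name);
      }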
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 02 Oct 2021 (1 commit)
  12. 27 Sep 2021 (1 commit)
  13. 20 Sep 2021 (1 commit)
    • napi: fix race inside napi_enable · 3765996e
      Committed by Xuan Zhuo
      The race below can leave napi.state with NAPI_STATE_SCHED set
      while the napi is not on the poll_list, which causes
      napi_disable() to get stuck.
      
      The prefix "NAPI_STATE_" is removed in the figure below, and
      NAPI_STATE_HASHED is ignored in napi.state.
      
                            CPU0       |                   CPU1       | napi.state
      ===============================================================================
      napi_disable()                   |                              | SCHED | NPSVC
      napi_enable()                    |                              |
      {                                |                              |
          smp_mb__before_atomic();     |                              |
          clear_bit(SCHED, &n->state); |                              | NPSVC
                                       | napi_schedule_prep()         | SCHED | NPSVC
                                       | napi_poll()                  |
                                       |   napi_complete_done()       |
                                       |   {                          |
                                       |      if (n->state & (NPSVC | | (1)
                                       |               _BUSY_POLL)))  |
                                       |           return false;      |
                                       |     ................         |
                                       |   }                          | SCHED | NPSVC
                                       |                              |
          clear_bit(NPSVC, &n->state); |                              | SCHED
      }                                |                              |
                                       |                              |
      napi_schedule_prep()             |                              | SCHED | MISSED (2)
      
      (1) Returns directly here because NAPI_STATE_NPSVC is set.
      (2) NAPI_STATE_SCHED is already set, so napi.poll_list is not
          added to sd->poll_list.
      
      Since NAPI_STATE_SCHED is set but the napi is not on the
      sd->poll_list queue, NAPI_STATE_SCHED can never be cleared and
      will remain set.

      1. This queue no longer receives packets.
      2. If napi_disable() is then called under the protection of
         rtnl_lock, it blocks while holding rtnl_lock, affecting the
         whole system.
      
      This patch uses cmpxchg to implement napi_enable(), which ensures
      there is no race window due to clearing the two bits separately.
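
      A sketch of the reworked napi_enable() (per the description above;
      the THREADED handling is an assumption based on the surrounding
      code):

      void napi_enable(struct napi_struct *n)
      {
      	unsigned long val, new;

      	do {
      		val = READ_ONCE(n->state);
      		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));

      		/* clear both bits in one atomic update */
      		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
      		if (n->dev->threaded && n->thread)
      			new |= NAPIF_STATE_THREADED;
      	} while (cmpxchg(&n->state, val, new) != val);
      }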
      
      Fixes: 2d8bff12 ("netpoll: Close race condition between poll_one_napi and napi_disable")
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 15 Sep 2021 (1 commit)
    • net: sched: update default qdisc visibility after Tx queue cnt changes · 1e080f17
      Committed by Jakub Kicinski
      mq / mqprio make the default child qdiscs visible. They only do
      so for the qdiscs which are within real_num_tx_queues when the
      device is registered. Depending on order of calls in the driver,
      or if user space changes config via ethtool -L the number of
      qdiscs visible under tc qdisc show will differ from the number
      of queues. This is confusing to users and potentially to system
      configuration scripts which try to make sure qdiscs have the
      right parameters.
      
      Add a new Qdisc_ops callback and make relevant qdiscs TTRT.
      
      Note that this uncovers the "shortcut" created by
      commit 1f27cde3 ("net: sched: use pfifo_fast for non real queues")
      The default child qdiscs beyond initial real_num_tx are always
      pfifo_fast, no matter what the sysfs setting is. Fixing this
      gets a little tricky because we'd need to keep a reference
      on whatever the default qdisc was at the time of creation.
      In practice this is likely a non-issue: the qdiscs likely have
      to be configured to non-default settings, so whatever user space
      is doing such configuration can replace the pfifos... now that
      it will see them.
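
      A sketch of the new callback (the upstream hook is named
      change_real_num_tx; the exact signature here is an assumption):

      struct Qdisc_ops {
      	/* ... existing ops ... */
      	/* notify the qdisc that dev->real_num_tx_queues changed */
      	void (*change_real_num_tx)(struct Qdisc *sch,
      				   unsigned int new_real_tx);
      };

      mq and mqprio then implement it so the set of visible default
      child qdiscs tracks the real queue count.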
      Reported-by: Matthew Massey <matthewmassey@fb.com>
      Reviewed-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 14 Aug 2021 (1 commit)
  16. 10 Aug 2021 (2 commits)
  17. 05 Aug 2021 (2 commits)
  18. 04 Aug 2021 (1 commit)
    • net: add netif_set_real_num_queues() for device reconfig · 271e5b7d
      Committed by Jakub Kicinski
      netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
      can fail, which breaks drivers trying to implement reconfiguration
      in a way that can't leave the device half-broken. In other words,
      those functions are incompatible with a prepare/commit approach.
      
      Luckily setting real number of queues can fail only if the number
      is increased, meaning that if we order operations correctly we
      can guarantee ending up with either new config (success), or
      the old one (on error).
      
      Provide a helper implementing such logic so that drivers don't
      have to duplicate it.
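
      A sketch of the helper's ordering logic (close to, but not
      necessarily identical to, the upstream implementation):

      int netif_set_real_num_queues(struct net_device *dev,
      			      unsigned int txq, unsigned int rxq)
      {
      	unsigned int old_rxq = dev->real_num_rx_queues;
      	int err;

      	if (txq < 1 || txq > dev->num_tx_queues ||
      	    rxq < 1 || rxq > dev->num_rx_queues)
      		return -EINVAL;

      	/* Do the increases first: only they can fail, and they can
      	 * always be rolled back by a decrease. */
      	if (rxq > dev->real_num_rx_queues) {
      		err = netif_set_real_num_rx_queues(dev, rxq);
      		if (err)
      			return err;
      	}
      	if (txq > dev->real_num_tx_queues) {
      		err = netif_set_real_num_tx_queues(dev, txq);
      		if (err)
      			goto undo_rx;
      	}
      	/* Decreases cannot fail */
      	if (rxq < dev->real_num_rx_queues)
      		WARN_ON(netif_set_real_num_rx_queues(dev, rxq));
      	if (txq < dev->real_num_tx_queues)
      		WARN_ON(netif_set_real_num_tx_queues(dev, txq));

      	return 0;

      undo_rx:
      	WARN_ON(netif_set_real_num_rx_queues(dev, old_rxq));
      	return err;
      }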
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 03 Aug 2021 (1 commit)
  20. 31 Jul 2021 (1 commit)
  21. 30 Jul 2021 (1 commit)
  22. 29 Jul 2021 (2 commits)
    • skbuff: allow 'slow_gro' for skb carring sock reference · 5e10da53
      Committed by Paolo Abeni
      This change leverages the infrastructure introduced by the
      previous patches to allow soft devices to pass owned skbs to the
      GRO engine without impacting the fast path.
      
      It's up to the GRO caller to ensure the slow_gro bit is valid
      before invoking the GRO engine. The new helper
      skb_prepare_for_gro() is introduced for that goal.
      
      On slow_gro, skbs are aggregated only with equal sk.
      Additionally, skb truesize on GRO recycle and free is correctly
      updated so that sk wmem is not changed by the GRO processing.
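
      During flow matching this boils down to treating a differing
      socket as a flow mismatch (a conceptual sketch, assuming the
      gro_list_prepare() loop shape):

      if (unlikely(skb->slow_gro | p->slow_gro))
      	diffs |= p->sk != skb->sk; /* only aggregate with equal sk */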
      
      rfc-> v1:
       - fixed bad truesize on dev_gro_receive NAPI_FREE
       - use the existing state bit
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: optimize GRO for the common case. · 9efb4b5b
      Committed by Paolo Abeni
      After the previous patches, at GRO time, skb->slow_gro is
      usually 0, unless the packet comes from some H/W offload
      slow path or tunnel.
      
      We can optimize the GRO code assuming !skb->slow_gro is likely.
      This removes multiple conditionals on the most common path, at
      the price of an additional one when we hit the above slow paths.
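
      The pattern is to gate all the rarely-needed comparisons behind
      one likely-false test (a sketch of the idea, not an exact hunk):

      /* fast path: no sk, no metadata dst, no conntrack attached */
      if (!diffs && unlikely(skb->slow_gro | p->slow_gro)) {
      	diffs |= p->sk != skb->sk;
      	diffs |= skb_metadata_dst_cmp(p, skb);
      	diffs |= skb_get_nfct(p) ^ skb_get_nfct(skb);
      }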
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 20 Jul 2021 (1 commit)
  24. 16 Jul 2021 (1 commit)
    • net_sched: introduce tracepoint trace_qdisc_enqueue() · 70713ddd
      Committed by Qitao Xu
      Tracepoint trace_qdisc_enqueue() is introduced to trace skbs at
      the entrance of the TC layer on the TX side. This is similar to
      trace_qdisc_dequeue():
      
      1. For both we only trace successful cases. The failure cases
         can be traced via trace_kfree_skb().
      
      2. They are called at the entrance or exit of the TC layer, not
         for each ->enqueue() or ->dequeue(). This is intentional,
         because we want to make trace_qdisc_enqueue() symmetric to
         trace_qdisc_dequeue(), which is easier to use.
      
      The return value of qdisc_enqueue() is not interesting here:
      Qdiscs can also drop packets in ->dequeue(), so it is impossible
      to trace all drops even with the return value; the only way to
      trace them is by tracing kfree_skb().
      
      We only add the information we need to the trace ring buffer. If
      any other information is needed, it is easy to extend without
      breaking the ABI, see commit 3dd344ea ("net: tracepoint: exposing
      sk_family in all tcp:tracepoints").
      Reviewed-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 13 Jul 2021 (1 commit)
    • xdp, net: Fix use-after-free in bpf_xdp_link_release · 5acc7d3e
      Committed by Xuan Zhuo
      The problem occurs when dev_xdp_uninstall() runs between
      dev_get_by_index() and dev_xdp_attach_link(): the xdp link is then
      not detached automatically when the dev is released. But link->dev
      already points to the dev, so when the xdp link is released the
      dev is still accessed even though it has already been freed.
      
      dev_get_by_index()        |
      link->dev = dev           |
                                |      rtnl_lock()
                                |      unregister_netdevice_many()
                                |          dev_xdp_uninstall()
                                |      rtnl_unlock()
      rtnl_lock();              |
      dev_xdp_attach_link()     |
      rtnl_unlock();            |
                                |      netdev_run_todo() // dev released
      bpf_xdp_link_release()    |
          /* access dev.        |
             use-after-free */  |
      
      [   45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0
      [   45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732
      [   45.968297]
      [   45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22
      [   45.969222] Hardware name: linux,dummy-virt (DT)
      [   45.969795] Call trace:
      [   45.970106]  dump_backtrace+0x0/0x4c8
      [   45.970564]  show_stack+0x30/0x40
      [   45.970981]  dump_stack_lvl+0x120/0x18c
      [   45.971470]  print_address_description.constprop.0+0x74/0x30c
      [   45.972182]  kasan_report+0x1e8/0x200
      [   45.972659]  __asan_report_load8_noabort+0x2c/0x50
      [   45.973273]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.973834]  bpf_link_free+0xd0/0x188
      [   45.974315]  bpf_link_put+0x1d0/0x218
      [   45.974790]  bpf_link_release+0x3c/0x58
      [   45.975291]  __fput+0x20c/0x7e8
      [   45.975706]  ____fput+0x24/0x30
      [   45.976117]  task_work_run+0x104/0x258
      [   45.976609]  do_notify_resume+0x894/0xaf8
      [   45.977121]  work_pending+0xc/0x328
      [   45.977575]
      [   45.977775] The buggy address belongs to the page:
      [   45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998
      [   45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff)
      [   45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000
      [   45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      [   45.982259] page dumped because: kasan: bad access detected
      [   45.982948]
      [   45.983153] Memory state around the buggy address:
      [   45.983753]  ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   45.984645]  ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.986419]                                               ^
      [   45.987112]  ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988006]  ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988895] ==================================================================
      [   45.989773] Disabling lock debugging due to kernel taint
      [   45.990552] Kernel panic - not syncing: panic_on_warn set ...
      [   45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G    B             5.13.0+ #22
      [   45.991929] Hardware name: linux,dummy-virt (DT)
      [   45.992448] Call trace:
      [   45.992753]  dump_backtrace+0x0/0x4c8
      [   45.993208]  show_stack+0x30/0x40
      [   45.993627]  dump_stack_lvl+0x120/0x18c
      [   45.994113]  dump_stack+0x1c/0x34
      [   45.994530]  panic+0x3a4/0x7d8
      [   45.994930]  end_report+0x194/0x198
      [   45.995380]  kasan_report+0x134/0x200
      [   45.995850]  __asan_report_load8_noabort+0x2c/0x50
      [   45.996453]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.997007]  bpf_link_free+0xd0/0x188
      [   45.997474]  bpf_link_put+0x1d0/0x218
      [   45.997942]  bpf_link_release+0x3c/0x58
      [   45.998429]  __fput+0x20c/0x7e8
      [   45.998833]  ____fput+0x24/0x30
      [   45.999247]  task_work_run+0x104/0x258
      [   45.999731]  do_notify_resume+0x894/0xaf8
      [   46.000236]  work_pending+0xc/0x328
      [   46.000697] SMP: stopping secondary CPUs
      [   46.001226] Dumping ftrace buffer:
      [   46.001663]    (ftrace buffer empty)
      [   46.002110] Kernel Offset: disabled
      [   46.002545] CPU features: 0x00000001,23202c00
      [   46.003080] Memory Limit: none
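
      The fix direction, sketched (per the upstream patch the device
      lookup moves under the same rtnl_lock section as the attach, so
      unregistration cannot slip in between; exact code differs):

      rtnl_lock();
      dev = dev_get_by_index(net, attr->link_create.target_ifindex);
      if (!dev) {
      	rtnl_unlock();
      	return -EINVAL;
      }
      /* ... dev_xdp_attach_link(dev, ...) under the same lock ... */
      rtnl_unlock();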
      
      Fixes: aa8d3a71 ("bpf, xdp: Add bpf_link-based XDP attachment API")
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
  26. 10 Jul 2021 (1 commit)
    • net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cache · 28b34f01
      Committed by Antoine Tenart
      Some socket buffers allocated in the fclone cache (in
      __alloc_skb) can end up in the following path[1]:
      
      napi_skb_finish
        __kfree_skb_defer
          napi_skb_cache_put
      
      The issue is that napi_skb_cache_put is not fclone friendly and
      will put those skbuffs in the skb cache to be reused later,
      although this cache only expects skbuffs allocated from
      skbuff_head_cache. When this happens the skbuff is eventually
      freed using the wrong origin cache, and we can see traces similar
      to:
      
      [ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache
      [ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0
      [ 1223.950211] Modules linked in:
      [ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474
      [ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
      [ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0
      
      Leading sometimes to other memory related issues.
      
      Fix this by using __kfree_skb for fclone skbuffs, similar to what
      is done at the other place __kfree_skb_defer is called.
      
      [1] At least in setups using veth pairs and tunnels. Building a kernel
          with KASAN we can for example see packets allocated in
          sk_stream_alloc_skb hit the above path and later the issue arises
          when the skbuff is reused.
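
      A sketch of the resulting dispatch in napi_skb_finish()
      (consistent with the description; the exact code may differ):

      case GRO_MERGED_FREE:
      	if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
      		napi_skb_free_stolen_head(skb);
      	else if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
      		__kfree_skb(skb);	/* fclone: free via its own cache */
      	else
      		__kfree_skb_defer(skb);	/* safe to recycle in the skb cache */
      	break;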
      
      Fixes: 9243adfc ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing")
      Cc: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 08 Jul 2021 (4 commits)
  28. 07 Jul 2021 (1 commit)
  29. 29 Jun 2021 (1 commit)
  30. 24 Jun 2021 (1 commit)
    • net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc · c4fef01b
      Committed by Yunsheng Lin
      Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
      flag set, but queue discipline by-pass does not work for lockless
      qdisc because skb is always enqueued to qdisc even when the qdisc
      is empty, see __dev_xmit_skb().
      
      This patch calls sch_direct_xmit() to transmit the skb directly
      to the driver for an empty lockless qdisc, which avoids the
      enqueue and dequeue operations (see the sketch after the numbered
      list below).
      
      qdisc->empty is not reliable for indicating an empty qdisc,
      because there is a time window between enqueuing and setting
      qdisc->empty. So we use the MISSED state added in commit
      a90c57f2 ("net: sched: fix packet stuck problem for lockless
      qdisc"), which indicates lock contention, suggesting that it is
      better not to do the qdisc bypass in order to avoid packet
      reordering.
      
      In order to make the MISSED state reliable for indicating an
      empty qdisc, we need to ensure that testing and clearing of the
      MISSED state happen under the protection of qdisc->seqlock; only
      setting the MISSED state can be done without that protection. A
      MISSED state test is added outside the protection of
      qdisc->seqlock to avoid doing an unnecessary spin_trylock() in
      the contention case.
      
      As the enqueuing is not within the protection of qdisc->seqlock,
      there is still a potential data race as mentioned by Jakub [1]:
      
            thread1               thread2             thread3
      qdisc_run_begin() # true
                              qdisc_run_begin(q)
                                   set(MISSED)
      pfifo_fast_dequeue
        clear(MISSED)
        # recheck the queue
      qdisc_run_end()
                                  enqueue skb1
                                                   qdisc empty # true
                                                qdisc_run_begin() # true
                                                sch_direct_xmit() # skb2
                               qdisc_run_begin()
                                  set(MISSED)
      
      When above happens, skb1 enqueued by thread2 is transmited after
      skb2 is transmited by thread3 because MISSED state setting and
      enqueuing is not under the qdisc->seqlock. If qdisc bypass is
      disabled, skb1 has better chance to be transmited quicker than
      skb2.
      
      This patch does not take care of the above data race, because we
      view it as similar to the following: even if CPU1 and CPU2 write
      skbs to two sockets heading to the same qdisc at the same time,
      there is no guarantee which skb will hit the qdisc first, because
      many factors like interrupts/softirqs/cache misses/scheduling
      affect that.
      
      There are below cases that need special handling:
      1. The MISSED state is cleared before another round of dequeuing
         in pfifo_fast_dequeue(), but __qdisc_run() might not be able to
         dequeue all skbs in one round and call __netif_schedule(),
         which might result in a non-empty qdisc without MISSED set. In
         order to avoid this, the MISSED state is set for the lockless
         qdisc and __netif_schedule() is called at the end of
         qdisc_run_end().
      
      2. The MISSED state also needs to be set for the lockless qdisc,
         instead of calling __netif_schedule() directly, when requeuing
         an skb, for a similar reason.
      
      3. For the netdev-queue-stopped case, the MISSED state needs
         clearing while the netdev queue is stopped, otherwise there may
         be unnecessary __netif_schedule() calls. So a new DRAINING
         state is added to indicate this case, which also indicates a
         non-empty qdisc.
      
      4. There is already a netif_xmit_frozen_or_stopped() check in
         dequeue_skb() and sch_direct_xmit(), both within the protection
         of qdisc->seqlock, but the same check in __dev_xmit_skb() is
         without that protection, which might make the empty indication
         of a lockless qdisc unreliable. So remove the check in
         __dev_xmit_skb(); the checks under the protection of
         qdisc->seqlock seem enough to avoid the CPU consumption problem
         for the netdev-queue-stopped case.
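
      The resulting bypass in __dev_xmit_skb() looks roughly like this
      (a sketch based on the description; nolock_qdisc_is_empty()
      follows the patch, details assumed):

      if (q->flags & TCQ_F_NOLOCK) {
      	if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
      	    qdisc_run_begin(q)) {
      		/* Retest under q->seqlock to avoid racing with requeuing */
      		if (unlikely(!nolock_qdisc_is_empty(q))) {
      			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
      			__qdisc_run(q);
      		} else {
      			qdisc_bstats_cpu_update(q, skb);
      			rc = NET_XMIT_SUCCESS;
      			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
      			    !nolock_qdisc_is_empty(q))
      				__qdisc_run(q);
      		}
      		qdisc_run_end(q);
      		return rc;
      	}
      	/* otherwise fall through to the normal enqueue path */
      }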
      
      1. https://lkml.org/lkml/2021/5/29/215
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 22 Jun 2021 (1 commit)
  32. 18 Jun 2021 (1 commit)