1. 16 Sep, 2021 1 commit
  2. 24 Aug, 2021 1 commit
  3. 28 Jul, 2021 1 commit
    • dev_ioctl: split out ndo_eth_ioctl · a7605370
      Arnd Bergmann committed
      Most users of ndo_do_ioctl are ethernet drivers that implement
      the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
      timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.
      
      Separate these from the few drivers that use ndo_do_ioctl to
      implement SIOCBOND, SIOCBR and SIOCWANDEV commands.
      
      This is a purely cosmetic change intended to help readers find
      their way through the implementation.
      
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Vivien Didelot <vivien.didelot@gmail.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: Vladimir Oltean <olteanv@gmail.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: linux-rdma@vger.kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 08 Jun, 2021 1 commit
  5. 05 Jun, 2021 1 commit
  6. 02 Jun, 2021 1 commit
  7. 24 Apr, 2021 1 commit
    • enetc: fix locking for one-step timestamping packet transfer · 7ce9c3d3
      Yangbo Lu committed
      The previous patch to support PTP Sync packet one-step timestamping
      described one-step timestamping packet handling logic as below in
      commit message:
      
      - Transmit the packet immediately if no other one is in transfer, or
        queue it to the skb queue if there is already one in transfer.
        The test_and_set_bit_lock() is used here to lock and check state.
      - Start a work item when the transfer completes on hardware, to release
        the bit lock and to send one skb from the skb queue, if there is one.

      There was no problem with the description, but there was a mistake in the
      implementation. The locking/test_and_set_bit_lock() should be put in
      enetc_start_xmit() which may be called by worker, rather than in
      enetc_xmit(). Otherwise, the worker calling enetc_start_xmit() after
      bit lock released is not able to lock again for transfer.
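The corrected placement can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the names model_start_xmit, model_tx_complete_work and onestep_in_progress are hypothetical, and a C11 atomic_flag stands in for the kernel bit lock taken with test_and_set_bit_lock(). It is not the driver code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "one-step TX in progress" bit lock. */
static atomic_flag onestep_in_progress = ATOMIC_FLAG_INIT;
static int queued; /* frames parked on the modelled skb queue */
static int sent;   /* frames handed to the modelled hardware */

/*
 * Models enetc_start_xmit(): the lock is taken here, so every caller
 * (the normal xmit entry point and the worker) passes the same check.
 */
static bool model_start_xmit(void)
{
	if (atomic_flag_test_and_set_explicit(&onestep_in_progress,
					      memory_order_acquire)) {
		queued++; /* another frame is in transfer: queue this one */
		return false;
	}
	sent++;
	return true;
}

/*
 * Models the work item started on hardware completion: release the
 * lock, then resubmit one queued frame through the same locked path.
 */
static void model_tx_complete_work(void)
{
	atomic_flag_clear_explicit(&onestep_in_progress, memory_order_release);
	if (queued > 0) {
		queued--;
		model_start_xmit();
	}
}
```

Because the resubmitted frame re-acquires the lock inside model_start_xmit(), a frame arriving right after the worker runs is queued again instead of racing with the frame the worker just sent.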
      
      Fixes: 7294380c ("enetc: support PTP Sync packet one-step timestamping")
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 21 Apr, 2021 1 commit
  9. 20 Apr, 2021 3 commits
    • net: enetc: add support for flow control · a8648887
      Vladimir Oltean committed
      In the ENETC receive path, a frame received by the MAC is first stored
      in a 256KB 'FIFO' memory, then transferred to DRAM when enqueuing it to
      the RX ring. The FIFO is a shared resource for all ENETC ports, but
      every port keeps track of its own memory utilization, on RX and on TX.
      
      There is a setting for RX rings through which they can either operate in
      'lossy' mode (where the lack of a free buffer causes an immediate
      discard of the frame) or in 'lossless' mode (where the lack of a free
      buffer in the ring makes the frame stay longer in the FIFO).
      
      In turn, when the memory utilization of the FIFO exceeds a certain
      margin, the MAC can be configured to emit PAUSE frames.
      
      There is enough FIFO memory to buffer up to 3 MTU-sized frames per RX
      port while not jeopardizing the other use cases (jumbo frames), and
      also not consuming bytes from the port TX allocations. Also, 3 MTU-sized
      frames worth of memory is enough to ensure zero loss for 64 byte packets
      at 1G line rate.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: add a mini driver for the Integrated Endpoint Register Block · e7d48e5f
      Vladimir Oltean committed
      The NXP ENETC is a 4-port Ethernet controller which 'smells' to
      operating systems like 4 distinct PCIe PFs with SR-IOV, each PF having
      its own driver instance, but in fact there are some hardware resources
      which are shared between all ports, like for example the 256 KB SRAM
      FIFO between the MACs and the Host Transfer Agent which DMAs frames to
      DRAM.
      
      To hide the stuff that cannot be neatly exposed per port, the hardware
      designers came up with this idea of having a dedicated register block
      which is supposed to be populated by the bootloader, and contains
      everything configuration-related: MAC addresses, FIFO partitioning, etc.
      
      When a port is reset using PCIe Function Level Reset, its defaults are
      transferred from the IERB configuration. Most of the time, the settings
      made through the IERB are read-only in the port's memory space (if they
      are even visible), so they cannot be modified at runtime.
      
      Linux doesn't have any advanced FIFO partitioning requirements at all,
      but when reading through the hardware manual, it became clear that, even
      though there are many good 'recommendations' for default values, many of
      them were not actually put in practice on LS1028A. So we end up with a
      default configuration that:
      
      (a) does not have enough TX and RX byte credits to support the max MTU
          of 9600 (which the Linux driver claims already) properly (at full speed)
      (b) allows the FIFO to be overrun with RX traffic, potentially
          overwriting internal data structures.
      
      The last part sounds a bit catastrophic, but it isn't. Frames are
      supposed to transit the FIFO for a very short time, but they can
      actually accumulate there under 2 conditions:
      
      (a) there is very severe congestion on DRAM memory, or
      (b) the RX rings visible to the operating system were configured for
          lossless operation, and they just ran out of free buffers to copy
          the frame to. This is what is used to put backpressure onto the MAC
          with flow control.
      
      So since ENETC has not supported flow control thus far, RX FIFO overruns
      were never seen with Linux. But with the addition of flow control, we
      should configure some registers to prevent this from happening. What we
      are trying to protect against are bad actors which continue to send us
      traffic despite the fact that we have signaled a PAUSE condition. Of
      course we can't be lossless in that case, but it is best to configure
      the FIFO to do tail dropping rather than letting it overrun.
      
      So in a nutshell, this driver is a fixup for all the IERB default values
      that should have been but aren't.
      
      The IERB configuration needs to be done _before_ the PFs are enabled.
      So every PF searches for the presence of the "fsl,ls1028a-enetc-ierb"
      node in the device tree, and if it finds it, it "registers" with the
      IERB, which means that it requests the IERB to fix up its default
      values. This is done through -EPROBE_DEFER. The IERB driver is part of
      the fsl_enetc module, but is technically a platform driver, since the
      IERB is a good old fashioned MMIO region, as opposed to ENETC ports
      which pretend to be PCIe devices.
      
      The driver was already configuring ENETC_PTXMBAR (FIFO allocation for
      TX) because due to an omission, TXMBAR is a read/write register in the
      PF memory space. But the manual is quite clear that the formula for this
      should depend upon the TX byte credits (TXBCR). In turn, the TX byte
      credits are only readable/writable through the IERB. So if we want to
      ensure that the TXBCR register also has a value that is correct and in
      line with TXMBAR, there is simply no way this can be done from the PF
      driver; access to the IERB is needed.
      
      I could have modified U-Boot to fix up the IERB values, but that is
      quite undesirable, as old U-Boot versions are likely to be floating
      around for quite some time from now.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: create a common enetc_pf_to_port helper · 87614b93
      Vladimir Oltean committed
      Even though ENETC interfaces are exposed as individual PCIe PFs with
      their own driver instances, the ENETC is still fundamentally a
      multi-port Ethernet controller, and some parts of the IP take a port
      number (as can be seen in the PSFP implementation).
      
      Create a common helper that can be used outside of the TSN code for
      retrieving the ENETC port number based on the PF number. This is only
      correct for LS1028A, the only Linux-capable instantiation of ENETC thus
      far.
      
      Note that ENETC port 3 is PF 6. The TSN code did not care about this
      because ENETC port 3 does not support TSN, so the wrong mapping done by
      enetc_get_port for PF 6 could have never been hit.
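A minimal sketch of such a mapping is shown below. Only the "PF 6 is port 3" entry comes from the text above; the identity mapping for PFs 0 to 2 and the name pf_to_port_sketch are assumptions made for illustration, and the real helper may take a different argument type.

```c
#include <assert.h>

/*
 * Sketch only: map a PCIe physical function number to an ENETC port.
 * PF 6 -> port 3 is stated in the commit message; PFs 0..2 mapping to
 * ports 0..2 is an assumption. Unknown PFs yield -1.
 */
static int pf_to_port_sketch(unsigned int pf)
{
	switch (pf) {
	case 0:
		return 0;
	case 1:
		return 1;
	case 2:
		return 2;
	case 6:
		return 3;
	default:
		return -1;
	}
}
```

An explicit switch makes the non-contiguous PF 6 case visible, which a simple "port = PF number" shortcut would get wrong.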
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 17 Apr, 2021 10 commits
    • net: enetc: apply the MDIO workaround for XDP_REDIRECT too · 24e39309
      Vladimir Oltean committed
      Described in fd5736bf ("enetc: Workaround for MDIO register access
      issue") is a workaround for a hardware bug that requires a register
      access of the MDIO controller to never happen concurrently with a
      register access of a port PF. To avoid that, a mutual exclusion scheme
      with rwlocks was implemented - the port PF accessors are the 'read'
      side, and the MDIO accessors are the 'write' side.
      
      When we do XDP_REDIRECT between two ENETC interfaces, all is fine
      because the MDIO lock is already taken from the NAPI poll loop.
      
      But when the ingress interface is not ENETC, just the egress is, the
      MDIO lock is not taken, so we might access the port PF registers
      concurrently with MDIO, which will make the link flap due to wrong
      values returned from the PHY.
      
      To avoid this, let's just slap an enetc_lock_mdio/enetc_unlock_mdio at
      the beginning and ending of enetc_xdp_xmit. The fact that the MDIO lock
      is designed as a rwlock is important here, because the read side is
      reentrant (that is one of the main reasons why we chose it). Usually,
      the way we benefit from its reentrancy is by running the data path
      concurrently on both CPUs, but in this case, we benefit from the
      reentrancy by taking the lock even when the lock is already taken
      (and that's the situation where ENETC is both the ingress and the egress
      interface for XDP_REDIRECT, which was fine before and still is fine now).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: fix buffer leaks with XDP_TX enqueue rejections · 92ff9a6e
      Vladimir Oltean committed
      If the TX ring is congested, enetc_xdp_tx() returns false for the
      current XDP frame (represented as an array of software BDs).
      
      This array of software TX BDs is constructed in enetc_rx_swbd_to_xdp_tx_swbd
      from software BDs freshly cleaned from the RX ring. The issue is that we
      scrub the RX software BDs too soon, more precisely before we know that
      we can enqueue the TX BDs successfully into the TX ring.
      
      If we can't enqueue them (and enetc_xdp_tx returns false), we call
      enetc_xdp_drop which attempts to recycle the buffers held by the RX
      software BDs. But because we scrubbed those RX BDs already, two things
      happen:
      
      (a) we leak their memory
      (b) we populate the RX software BD ring with an all-zero rx_swbd
          structure, which makes the buffer refill path allocate more memory.
      
      enetc_refill_rx_ring
      -> if (unlikely(!rx_swbd->page))
         -> enetc_new_page
      
      That is a recipe for fast OOM.
      
      Fixes: 7ed2bc80 ("net: enetc: add support for XDP_TX")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: handle the invalid XDP action the same way as XDP_DROP · 975acc83
      Vladimir Oltean committed
      When the XDP program returns an invalid action, we should free the RX
      buffer.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: use dedicated TX rings for XDP · 7eab503b
      Vladimir Oltean committed
      It is possible for one CPU to perform TX hashing (see netdev_pick_tx)
      between the 8 ENETC TX rings, and the TX hashing to select TX queue 1.
      
      At the same time, it is possible for the other CPU to already use TX
      ring 1 for XDP (either XDP_TX or XDP_REDIRECT). Since there is no mutual
      exclusion between XDP and the network stack, we run into an issue
      because the ENETC TX procedure is not reentrant.
      
      The obvious approach would be to just make XDP take the lock of the
      network stack's TX queue corresponding to the ring it's about to enqueue
      in.
      
      For XDP_REDIRECT, this is quite straightforward, a lock at the beginning
      and end of enetc_xdp_xmit() should do the trick.
      
      But for XDP_TX, it's a bit more complicated. For one, we do TX batching
      all by ourselves for frames with the XDP_TX verdict. This is something
      we would like to keep the way it is, for performance reasons. But
      batching means that the network stack's lock should be kept from the
      first enqueued XDP_TX frame and until we ring the doorbell. That is
      mostly fine, except for cases when in the same NAPI loop we have mixed
      XDP_TX and XDP_REDIRECT frames. So if enetc_xdp_xmit() gets called while
      we are holding the lock from the RX NAPI, then bam, deadlock. The naive
      answer could be 'just flush the XDP_TX frames first, then release the
      network stack's TX queue lock, then call xdp_do_flush_map()'. But even
      xdp_do_redirect() is capable of flushing the batched XDP_REDIRECT
      frames, so unless we unlock/relock the TX queue around xdp_do_redirect(),
      there simply isn't any clean way to protect XDP_TX from concurrent
      network stack .ndo_start_xmit() on another CPU.
      
      So we need to take a different approach, and that is to reserve two
      rings for the sole use of XDP. We leave TX rings
      0..ndev->real_num_tx_queues-1 to be handled by the network stack, and we
      pick the XDP rings from the end of the priv->tx_ring array.
      
      We make an effort to keep the mapping done by enetc_alloc_msix() which
      decides which CPU handles the TX completions of which TX ring in its
      NAPI poll. So the XDP TX ring of CPU 0 is handled by TX ring 6, and the
      XDP TX ring of CPU 1 is handled by TX ring 7.
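The ring split described above can be sketched as plain index arithmetic. The function names and the assumption that exactly one XDP ring is reserved per CPU are illustrative; the numbers in the usage (8 TX rings, 2 CPUs) are the ones quoted in the commit message.

```c
#include <assert.h>

/*
 * Sketch only: XDP rings are taken from the end of the TX ring array,
 * one per CPU (assumed), so that CPU n uses ring (total - cpus + n).
 */
static int xdp_tx_ring_index(int num_tx_rings, int num_cpus, int cpu)
{
	assert(num_cpus > 0 && num_cpus <= num_tx_rings);
	assert(cpu >= 0 && cpu < num_cpus);
	return num_tx_rings - num_cpus + cpu;
}

/* Number of rings left for the network stack: indices 0..result-1. */
static int stack_tx_queues(int num_tx_rings, int num_cpus)
{
	assert(num_cpus > 0 && num_cpus <= num_tx_rings);
	return num_tx_rings - num_cpus;
}
```

With 8 rings and 2 CPUs this gives rings 0 to 5 to the network stack, ring 6 to CPU 0 and ring 7 to CPU 1, so XDP and the stack never enqueue into the same ring.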
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: increase TX ring size · ee3e875f
      Vladimir Oltean committed
      Now that commit d6a2829e ("net: enetc: increase RX ring default
      size") has increased the RX ring size, it is quite easy to congest the
      TX rings when the traffic is predominantly XDP_TX, as the RX ring is
      quite a bit larger than the TX one.
      
      Since we bit the bullet and did the expensive thing already (larger RX
      rings consume more memory pages), it seems quite foolish to keep the TX
      rings small. So make them equally sized with RX.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: remove unneeded xdp_do_flush_map() · a6369fe6
      Vladimir Oltean committed
      xdp_do_redirect already contains:
      -> dev_map_enqueue
         -> __xdp_enqueue
            -> bq_enqueue
               -> bq_xmit_all // if we have more than 16 frames
      
      So the logic from enetc will never be hit, because ENETC_DEFAULT_TX_WORK
      is 128.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: stop XDP NAPI processing when build_skb() fails · 8f50d8bb
      Vladimir Oltean committed
      When the code path below fails:
      
      enetc_clean_rx_ring_xdp // XDP_PASS
      -> enetc_build_skb
         -> enetc_map_rx_buff_to_skb
            -> build_skb
      
      enetc_clean_rx_ring_xdp will 'break', but that 'break' instruction isn't
      strong enough to actually break the NAPI poll loop, just the switch/case
      statement for XDP actions. So we increment rx_frm_cnt and go to the next
      frames minding our own business.
      
      Instead let's do what the skb NAPI poll function does, and break the
      loop now, waiting for the memory pressure to go away. Otherwise the next
      calls to build_skb() are likely to fail too.
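The C pitfall behind this fix is that a 'break' inside a switch leaves only the switch, not the enclosing loop. The following stand-alone sketch models it; the enum, the function name and the build_ok array (which stands in for build_skb() succeeding or failing) are hypothetical and unrelated to the driver's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>

enum verdict { V_DROP, V_PASS };

/*
 * Sketch only: process up to n frames and return how many were counted.
 * build_ok[i] models whether the skb allocation for frame i succeeds.
 */
static int poll_sketch(const enum verdict *v, const bool *build_ok, int n)
{
	int rx_frm_cnt = 0;

	for (int i = 0; i < n; i++) {
		switch (v[i]) {
		case V_PASS:
			if (!build_ok[i])
				goto out; /* leaves the for loop; a plain
					   * 'break' would leave only the
					   * switch and keep polling */
			break;
		case V_DROP:
			break;
		}
		rx_frm_cnt++;
	}
out:
	return rx_frm_cnt;
}
```

A goto is one way to leave the loop from inside the switch; a flag checked after the switch would work equally well.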
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: recycle buffers for frames with RX errors · 672f9a21
      Vladimir Oltean committed
      When receiving a frame with errors, currently we do nothing with it (we
      don't construct an skb or an xdp_buff), we just exit the NAPI poll loop.
      
      Let's put the buffer back into the RX ring (similar to XDP_DROP).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: rename the buffer reuse helpers · 6b04830d
      Vladimir Oltean committed
      enetc_put_xdp_buff has nothing to do with XDP, frankly, it is just a
      helper to populate the recycle end of the shadow RX BD ring
      (next_to_alloc) with a given buffer.
      
      On the other hand, enetc_put_rx_buff plays more tricks than its name
      would suggest.
      
      So let's rename enetc_put_rx_buff into enetc_flip_rx_buff to reflect the
      half-page buffer reuse tricks that it employs, and enetc_put_xdp_buff
      into enetc_put_rx_buff which suggests a more garden-variety operation.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: remove redundant clearing of skb/xdp_frame pointer in TX conf path · e9e49ae8
      Vladimir Oltean committed
      Later in enetc_clean_tx_ring we have:
      
      		/* Scrub the swbd here so we don't have to do that
      		 * when we reuse it during xmit
      		 */
      		memset(tx_swbd, 0, sizeof(*tx_swbd));
      
      So these assignments are unnecessary.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 Apr, 2021 1 commit
  12. 15 Apr, 2021 1 commit
    • net: enetc: fetch MAC address from device tree · 652d3be2
      Michael Walle committed
      Normally, the bootloader will already initialize the MAC address
      registers of the ENETC and the driver will just use them or generate a
      random one, if it is not initialized.
      
      Add a new way to provide the MAC address: via device tree. Besides the
      usual 'mac-address' property, there is also the possibility to fetch it
      via a NVMEM provider. The sl28 board stores the MAC address in the SPI
      NOR flash OTP region. Having this will allow Linux to fetch the MAC
      address from there without being dependent on the bootloader.
      
      No in-tree boards have the device tree properties set, thus for these,
      this is a no-op.
      Signed-off-by: Michael Walle <michael@walle.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 13 Apr, 2021 2 commits
    • enetc: support PTP Sync packet one-step timestamping · 7294380c
      Yangbo Lu committed
      This patch adds support for PTP Sync packet one-step timestamping.
      Since the ENETC single-step register has to be configured dynamically
      per packet for the correctionField offset and UDP checksum update, a
      one-step timestamping packet can be sent only after the previous one
      has completed transmission on hardware. So, on TX, this patch handles
      one-step timestamping packets as below:

      - Transmit the packet immediately if no other one is in transfer, or
        queue it to the skb queue if there is already one in transfer.
        The test_and_set_bit_lock() is used here to lock and check state.
      - Start a work item when the transfer completes on hardware, to release
        the bit lock and to send one skb from the skb queue, if there is one.
      
      And the configuration for one-step timestamping on ENETC before
      transmitting is,
      
      - Set one-step timestamping flag in extension BD.
      - Write 30 bits current timestamp in tstamp field of extension BD.
      - Update PTP Sync packet originTimestamp field with current timestamp.
      - Configure the single-step register for the correctionField offset and
        UDP checksum update.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • enetc: mark TX timestamp type per skb · f768e751
      Yangbo Lu committed
      Mark the TX timestamp type per skb in skb->cb[0], instead of
      using a global variable for all skbs. This is a preparation for
      one-step timestamp support.

      Once one-step timestamping is enabled, there will be both
      one-step and two-step PTP messages to transfer. An skb
      queue is also needed for one-step PTP messages, to make sure
      that the current message is sent only after the previous one
      has completed on hardware. (The ENETC single-step register has
      to be dynamically configured per message.) So, marking the TX
      timestamp type per skb is required.
      Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 10 Apr, 2021 3 commits
  15. 01 Apr, 2021 9 commits
    • net: enetc: add support for XDP_REDIRECT · 9d2b68cc
      Vladimir Oltean committed
      The driver implementation of the XDP_REDIRECT action reuses parts from
      XDP_TX, most notably the enetc_xdp_tx function which transmits an array
      of TX software BDs. Only this time, the buffers don't have DMA mappings,
      we need to create them.
      
      When a BPF program reaches the XDP_REDIRECT verdict for a frame, we can
      employ the same buffer reuse strategy as for the normal processing path
      and for XDP_PASS: we can flip to the other page half and seed that to
      the RX ring.
      
      Note that scatter/gather support is there, but disabled due to lack of
      multi-buffer support in XDP (which is added by this series):
      https://patchwork.kernel.org/project/netdevbpf/cover/cover.1616179034.git.lorenzo@kernel.org/

      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: increase RX ring default size · d6a2829e
      Vladimir Oltean committed
      As explained in the XDP_TX patch, when receiving a burst of frames with
      the XDP_TX verdict, there is a momentary dip in the number of available
      RX buffers. The system will eventually recover as TX completions will
      start kicking in and refilling our RX BD ring again. But until that
      happens, we need to survive with as few out-of-buffer discards as
      possible.
      
      This increases the memory footprint of the driver in order to avoid
      discards at 2.5Gbps line rate 64B packet sizes, the maximum speed
      available for testing on 1 port on NXP LS1028A.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: add support for XDP_TX · 7ed2bc80
      Vladimir Oltean committed
      For reflecting packets back into the interface they came from, we create
      an array of TX software BDs derived from the RX software BDs. Therefore,
      we need to extend the TX software BD structure to contain most of the
      stuff that's already present in the RX software BD structure, for
      reasons that will become evident in a moment.
      
      For a frame with the XDP_TX verdict, we don't reuse any buffer right
      away as we do for XDP_DROP (the same page half) or XDP_PASS (the other
      page half, same as the skb code path).
      
      Because the buffer transfers ownership from the RX ring to the TX ring,
      reusing any page half right away is very dangerous. So what we can do is
      we can recycle the same page half as soon as TX is complete.
      
      The code path is:
      enetc_poll
      -> enetc_clean_rx_ring_xdp
         -> enetc_xdp_tx
         -> enetc_refill_rx_ring
      (time passes, another MSI interrupt is raised)
      enetc_poll
      -> enetc_clean_tx_ring
         -> enetc_recycle_xdp_tx_buff
      
      But that creates a problem, because there is a potentially large time
      window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, a period in
      which we'll have fewer and fewer RX buffers.
      
      Basically, when the ship starts sinking, the knee-jerk reaction is to
      let enetc_refill_rx_ring do what it does for the standard skb code path
      (refill every 16 consumed buffers), but that turns out to be very
      inefficient. The problem is that we have no rx_swbd->page at our
      disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would
      have to call enetc_new_page for every buffer that we refill (if we
      choose to refill at this early stage). Very inefficient, it only makes
      the problem worse, because page allocation is an expensive process, and
      CPU time is exactly what we're lacking.
      
      Additionally, there is an even bigger problem: if we let
      enetc_refill_rx_ring top up the ring's buffers again from the RX path,
      remember that the buffers sent to transmission haven't disappeared
      anywhere. They will be eventually sent, and processed in
      enetc_clean_tx_ring, and an attempt will be made to recycle them.
      But surprise, the RX ring is already full of new buffers, because we
      were premature in deciding that we should refill. So not only did we
      make the expensive decision of allocating new pages, but now we must
      throw away perfectly good and reusable buffers.
      
      So what we do is we implement an elastic refill mechanism, which keeps
      track of the number of in-flight XDP_TX buffer descriptors. We top up
      the RX ring only up to the total ring capacity minus the number of BDs
      that are in flight (because we know that those BDs will return to us
      eventually).
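The elastic refill rule can be written down as a small calculation. This is a sketch of the idea only: the function and parameter names are hypothetical, and the driver's actual bookkeeping (ring indices, unused-BD counting) is not reproduced here.

```c
#include <assert.h>

/*
 * Sketch only: the RX ring is topped up to its capacity minus the number
 * of buffers currently in flight as XDP_TX, since those are expected to
 * come back on TX completion.
 */
static int refill_target(int ring_capacity, int xdp_tx_in_flight)
{
	assert(xdp_tx_in_flight >= 0 && xdp_tx_in_flight <= ring_capacity);
	return ring_capacity - xdp_tx_in_flight;
}

/* How many new buffers to add, given how many the ring holds right now. */
static int buffers_to_refill(int ring_capacity, int xdp_tx_in_flight,
			     int buffers_in_ring)
{
	int target = refill_target(ring_capacity, xdp_tx_in_flight);

	return buffers_in_ring < target ? target - buffers_in_ring : 0;
}
```

For a hypothetical 512-entry ring with 100 buffers in flight and 400 still in the ring, only 12 buffers are added, which leaves room for the 100 that will be recycled later.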
      
      The enetc driver manages 1 RX ring per CPU, and the default TX ring
      management is the same. So we do XDP_TX towards the TX ring of the same
      index, because it is affined to the same CPU. This will probably not
      produce great results when we have a tc-taprio/tc-mqprio qdisc on the
      interface, because in that case, the number of TX rings might be
      greater, but I didn't add any checks for that yet (mostly because I
      didn't know what checks to add).
      
      It should also be noted that we need to change the DMA mapping direction
      for RX buffers, since they may now be reflected into the TX ring of the
      same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and
      remapping as DMA_TO_DEVICE, because performance is better this way.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: add support for XDP_DROP and XDP_PASS · d1b15102
      Vladimir Oltean committed
      For the RX ring, enetc uses an allocation scheme based on pages split
      into two buffers, which is already very efficient in terms of preventing
      reallocations / maximizing reuse, so I see no reason why I would change
      that.
      
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half B | half B | half B | half B | half B | half B | half B |
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half A | half A | half A | half A | half A | half A | half A | RX ring
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
           ^                                                     ^
           |                                                     |
       next_to_clean                                       next_to_alloc
                                                            next_to_use
      
                         +--------+--------+--------+--------+--------+
                         |        |        |        |        |        |
                         | half B | half B | half B | half B | half B |
                         |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half B | half B | half A | half A | half A | half A | half A | RX ring
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |   ^                                   ^
       | half A | half A |   |                                   |
       |        |        | next_to_clean                   next_to_use
       +--------+--------+
                    ^
                    |
               next_to_alloc
      
      Then, when enetc_refill_rx_ring is called, whose purpose is to advance
      next_to_use, it sees that it can take buffers up to next_to_alloc, and
      it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate
      one!".
      
      The only problem is that for default PAGE_SIZE values of 4096, buffer
      sizes are 2048 bytes. While this is enough for normal skb allocations at
      an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256
      bytes, and including skb_shared_info and alignment, we end up being able
      to make use of only 1472 bytes, which is insufficient for the default
      MTU.
      
      To solve that problem, we implement scatter/gather processing in the
      driver, because we would really like to keep the existing allocation
      scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and
      another one of 28 bytes.
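The buffer arithmetic above can be checked with a short calculation. The 2048-byte half page, the 256-byte XDP headroom and the 1472 usable bytes come from the text; the 320-byte tail overhead for skb_shared_info plus alignment is an assumption inferred from those numbers (2048 - 256 - 1472), and all names are hypothetical.

```c
#include <assert.h>

#define SKETCH_HALF_PAGE    2048 /* PAGE_SIZE of 4096 split in two */
#define SKETCH_XDP_HEADROOM 256
#define SKETCH_TAIL_RESERVE 320  /* assumed: skb_shared_info + alignment */

/* Bytes of frame data that fit in one half-page RX buffer under XDP. */
static int usable_bytes(void)
{
	return SKETCH_HALF_PAGE - SKETCH_XDP_HEADROOM - SKETCH_TAIL_RESERVE;
}

/* Number of buffers a frame of frame_len bytes is scattered across. */
static int buffers_needed(int frame_len, int usable)
{
	assert(frame_len > 0 && usable > 0);
	return (frame_len + usable - 1) / usable;
}

/* Length of the data in the last buffer of the scattered frame. */
static int last_buffer_len(int frame_len, int usable)
{
	int rem;

	assert(frame_len > 0 && usable > 0);
	rem = frame_len % usable;
	return rem ? rem : usable;
}
```

For a 1500-byte packet this yields two buffers, the first carrying 1472 bytes and the second 28 bytes, matching the split described above.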
      
      Because the headroom required by XDP is different (and much larger) than
      the one required by the network stack, whenever a BPF program is added
      or deleted on the port, we drain the existing RX buffers and seed new
      ones with the required headroom. We also keep the required headroom in
      rx_ring->buffer_offset.
      
      The simplest way to implement XDP_PASS, where an skb must be created, is
      to create an xdp_buff based on the next_to_clean RX BDs, but not clear
      those BDs from the RX ring yet, just keep the original index at which
      the BDs for this frame started. Then, if the verdict is XDP_PASS,
      instead of converting the xdp_buff to an skb, we replay a call to
      enetc_build_skb (just as in the normal enetc_clean_rx_ring case),
      starting from the original BD index.
      
      We would also like to be minimally invasive to the regular RX data path,
      and not check whether there is a BPF program attached to the ring on
      every packet. So we create a separate RX ring processing function for
      XDP.
      
      Because we only install/remove the BPF program while the interface is
      down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there
      shouldn't be any circumstance in which we are processing packets and
      there is a potentially freed BPF program attached to the RX ring.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1b15102
    • V
      net: enetc: move up enetc_reuse_page and enetc_page_reusable · 65d0cbb4
      Vladimir Oltean authored
      For XDP_TX, we need to call enetc_reuse_page from enetc_clean_tx_ring,
      so we need to avoid a forward declaration.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65d0cbb4
    • V
      net: enetc: clean the TX software BD on the TX confirmation path · 1ee8d6f3
      Vladimir Oltean authored
      With the future introduction of some new fields into enetc_tx_swbd such
      as is_xdp_tx, is_xdp_redirect etc, we need not only to set these bits
      to true from the XDP_TX/XDP_REDIRECT code path, but also to false from
      the old code paths.
      
      This is because TX software buffer descriptors are kept in a ring that
      is a shadow of the hardware TX ring, so these structures keep getting
      reused, and there is always the possibility that when a software BD is
      reused (after we ran a full circle through the TX ring), the old user of
      the tx_swbd had set is_xdp_tx = true, and now we are sending a regular
      skb, which would need to set is_xdp_tx = false.
      
      To be minimally invasive to the old code paths, let's just scrub the
      software TX BD in the TX confirmation path (enetc_clean_tx_ring), once
      we know that nobody uses this software TX BD (tx_ring->next_to_clean
      hasn't yet been updated, and the TX paths check enetc_bd_unused which
      tells them if there's any more space in the TX ring for a new enqueue).
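      
      A minimal sketch of the scrubbing idea, assuming a heavily trimmed
      software BD and ring. The struct layouts and function names below are
      hypothetical, and the ring-full check (enetc_bd_unused in the driver)
      is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4

/* Hypothetical, trimmed-down software BD; the real one has more fields. */
struct tx_swbd {
	const void *buf;   /* skb or XDP frame, opaque here */
	bool is_xdp_tx;
};

struct tx_ring {
	struct tx_swbd swbd[RING_SIZE];
	unsigned int next_to_use;
	unsigned int next_to_clean;
};

/* Old skb path: deliberately left unaware of is_xdp_tx. */
static void xmit_skb(struct tx_ring *r, const void *skb)
{
	r->swbd[r->next_to_use].buf = skb;
	r->next_to_use = (r->next_to_use + 1) % RING_SIZE;
}

/* XDP_TX path: marks the software BD it uses. */
static void xmit_xdp(struct tx_ring *r, const void *frame)
{
	struct tx_swbd *s = &r->swbd[r->next_to_use];

	s->buf = frame;
	s->is_xdp_tx = true;
	r->next_to_use = (r->next_to_use + 1) % RING_SIZE;
}

/* TX confirmation: scrub every completed entry before it can be reused. */
static void clean_tx_ring(struct tx_ring *r)
{
	while (r->next_to_clean != r->next_to_use) {
		memset(&r->swbd[r->next_to_clean], 0, sizeof(struct tx_swbd));
		r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
	}
}
```

      Because the entry is zeroed on confirmation, an skb that later lands in
      a slot previously used for XDP_TX sees is_xdp_tx == false without
      xmit_skb having to clear it.
      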
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ee8d6f3
    • V
      net: enetc: add a dedicated is_eof bit in the TX software BD · d504498d
      Vladimir Oltean authored
      In the transmit path, if we have a scatter/gather frame, it is put into
      multiple software buffer descriptors, the last of which has the skb
      pointer populated (which is necessary for rearming the TX MSI vector and
      for collecting the two-step TX timestamp from the TX confirmation path).
      
      At the moment, this is sufficient, but with XDP_TX, we'll need to
      service TX software buffer descriptors that don't have an skb pointer
      but might be final nonetheless. So add a dedicated bit for
      final software BDs that we populate and check explicitly. Also, we keep
      looking just for an skb when doing TX timestamping, because we don't
      want/need that for XDP.
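      
      As a rough illustration of the distinction, assuming a simplified
      software BD (the struct and both helper functions below are
      hypothetical): end-of-frame detection keys on is_eof, while
      timestamping still keys on the skb pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down software BD for illustration. */
struct tx_swbd {
	const void *skb;  /* set only on the last BD of an skb frame */
	bool is_eof;      /* set on the last BD of any frame, skb or XDP */
};

/* Count whole frames among n completed BDs, keyed on is_eof. */
static unsigned int count_frames(const struct tx_swbd *bd, size_t n)
{
	unsigned int frames = 0;

	for (size_t i = 0; i < n; i++)
		if (bd[i].is_eof)
			frames++;
	return frames;
}

/* Timestamping still requires an skb, so XDP frames are skipped. */
static unsigned int count_tstamp_candidates(const struct tx_swbd *bd, size_t n)
{
	unsigned int cnt = 0;

	for (size_t i = 0; i < n; i++)
		if (bd[i].is_eof && bd[i].skb)
			cnt++;
	return cnt;
}
```

      With one two-BD skb frame followed by one two-BD XDP_TX frame, the
      first helper sees two frames and the second sees a single timestamp
      candidate.
      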
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d504498d
    • V
      net: enetc: move skb creation into enetc_build_skb · a800abd3
      Vladimir Oltean authored
      We need to build an skb from two code paths now: from the plain RX data
      path and from the XDP data path when the verdict is XDP_PASS.
      
      Create a new enetc_build_skb function which contains the essential steps
      for building an skb based on the first and last positions of buffer
      descriptors within the RX ring.
      
      We also squash the enetc_process_skb function into enetc_build_skb,
      because what that function did wasn't very meaningful on its own.
      
      The "rx_frm_cnt++" instruction has been moved around napi_gro_receive
      for cosmetic reasons, to be in the same spot as rx_byte_cnt++, which
      itself must be before napi_gro_receive, because that's when we lose
      ownership of the skb.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a800abd3
    • V
      net: enetc: consume the error RX buffer descriptors in a dedicated function · 2fa423f5
      Vladimir Oltean authored
      We can and should check the RX BD errors before starting to build the
      skb. The only apparent reason why things are done in this backwards
      order is to spare one call to enetc_rxbd_next.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2fa423f5
  16. 25 Mar, 2021 2 commits
    • V
      net: enetc: don't depend on system endianness in enetc_set_mac_ht_flt · e366a392
      Vladimir Oltean authored
      When enetc runs out of exact match entries for unicast address
      filtering, it switches to an approach based on hash tables, where
      multiple MAC addresses might end up in the same bucket.
      
      However, the enetc_set_mac_ht_flt function currently depends on the
      system endianness, because it interprets the 64-bit hash value as an
      array of two u32 elements. Modify this to use lower_32_bits and
      upper_32_bits.
      
      Tested by forcing enetc to go into hash table mode by creating two
      macvlan upper interfaces:
      
      ip link add link eno0 address 00:01:02:03:00:00 eno0.0 type macvlan && ip link set eno0.0 up
      ip link add link eno0 address 00:01:02:03:00:01 eno0.1 type macvlan && ip link set eno0.1 up
      
      and verified that the same bit values are written to the registers
      before and after:
      
      enetc_sync_mac_filters: addr 00:00:80:00:40:10 exact match 0
      enetc_sync_mac_filters: addr 00:00:00:00:80:00 exact match 0
      enetc_set_mac_ht_flt: hash 0x80008000000000 UMHFR0 0x0 UMHFR1 0x800080
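      
      The endianness-independent split can be sketched in userspace C. Here
      lo32 and hi32 are stand-ins for the kernel's lower_32_bits and
      upper_32_bits, and struct ht_regs is a hypothetical holder for the
      UMHFR0/UMHFR1 values rather than real register access.

```c
#include <stdint.h>

/* Userspace stand-ins for lower_32_bits() / upper_32_bits(). */
static uint32_t lo32(uint64_t v)
{
	return (uint32_t)(v & 0xffffffffu);
}

static uint32_t hi32(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

/* Hypothetical holder for the two 32-bit hash filter register values. */
struct ht_regs {
	uint32_t r0; /* low half, as written to UMHFR0 */
	uint32_t r1; /* high half, as written to UMHFR1 */
};

/* Split by arithmetic, so the result does not depend on memory layout. */
static struct ht_regs split_hash(uint64_t hash)
{
	struct ht_regs regs = { lo32(hash), hi32(hash) };

	return regs;
}
```

      Feeding the hash from the log, 0x80008000000000, yields r0 = 0x0 and
      r1 = 0x800080, the same values printed for UMHFR0 and UMHFR1.
      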
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e366a392
    • V
      net: enetc: don't depend on system endianness in enetc_set_vlan_ht_filter · 110eccdb
      Vladimir Oltean authored
      ENETC has a 64-entry hash table for VLAN RX filtering per Station
      Interface, which is accessed through two 32-bit registers: VHFR0 holding
      the low portion, and VHFR1 holding the high portion.
      
      The enetc_set_vlan_ht_filter function looks at the pf->vlan_ht_filter
      bitmap, which is fundamentally an unsigned long variable, and casts it
      to a u32 array of two elements. It puts the first u32 element into VHFR0
      and the second u32 element into VHFR1.
      
      It is easy to imagine that this will not work on big endian systems
      (although, yes, we have bigger problems, because currently enetc assumes
      that the CPU endianness is equal to the controller endianness, aka
      little endian - but let's assume that we could add a cpu_to_le32 in
      enetc_wr_reg and a le32_to_cpu in enetc_rd_reg).
      
      Let's use lower_32_bits and upper_32_bits which are designed to work
      regardless of endianness.
      
      Tested that both the old and the new method produce the same results:
      
      $ ethtool -K eth1 rx-vlan-filter on
      $ ip link add link eth1 name eth1.100 type vlan id 100
      enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x20
      enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x20
      $ ip link add link eth1 name eth1.101 type vlan id 101
      enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x30
      enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x30
      $ ip link add link eth1 name eth1.34 type vlan id 34
      enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x0 VHFR1 0x34
      enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x0 VHFR1 0x34
      $ ip link add link eth1 name eth1.1024 type vlan id 1024
      enetc_set_vlan_ht_filter: method 1: si_idx 0 VHFR0 0x1 VHFR1 0x34
      enetc_set_vlan_ht_filter: method 2: si_idx 0 VHFR0 0x1 VHFR1 0x34
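      
      To see why the cast only works on little endian, the sketch below
      simulates how a 64-bit bitmap is laid out in memory under each byte
      order and then reads it back as two 32-bit words. All function names
      are hypothetical; the bitmap value is the final state from the log
      above (VHFR0 0x1, VHFR1 0x34).

```c
#include <stdbool.h>
#include <stdint.h>

/* Lay out v in memory the way a little- or big-endian CPU would. */
static void store_u64(uint8_t mem[8], uint64_t v, bool big_endian)
{
	for (int i = 0; i < 8; i++) {
		int shift = big_endian ? 8 * (7 - i) : 8 * i;

		mem[i] = (uint8_t)(v >> shift);
	}
}

/* Read one 32-bit word from mem using the same byte order. */
static uint32_t load_u32(const uint8_t *mem, bool big_endian)
{
	uint32_t v = 0;

	for (int i = 0; i < 4; i++) {
		int shift = big_endian ? 8 * (3 - i) : 8 * i;

		v |= (uint32_t)mem[i] << shift;
	}
	return v;
}

/* Old method: view the bitmap as u32[2]; [0] -> VHFR0, [1] -> VHFR1. */
static void split_by_cast(uint64_t bitmap, bool big_endian,
			  uint32_t *vhfr0, uint32_t *vhfr1)
{
	uint8_t mem[8];

	store_u64(mem, bitmap, big_endian);
	*vhfr0 = load_u32(mem, big_endian);
	*vhfr1 = load_u32(mem + 4, big_endian);
}

/* New method: arithmetic split, independent of memory layout. */
static void split_by_shift(uint64_t bitmap, uint32_t *vhfr0, uint32_t *vhfr1)
{
	*vhfr0 = (uint32_t)bitmap;
	*vhfr1 = (uint32_t)(bitmap >> 32);
}
```

      On the simulated little-endian layout both methods agree. On the
      simulated big-endian layout the cast swaps the two halves, while the
      shift-based split is unaffected.
      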
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      110eccdb
  17. 20 Mar, 2021 1 commit
    • V
      net: enetc: teardown CBDR during PF/VF unbind · c54f042d
      Vladimir Oltean authored
      Michael reports that after the blamed patch, unbinding a VF would cause
      these transactions to remain pending, and trigger some warnings with the
      DMA API debug:
      
      $ echo 1 > /sys/bus/pci/devices/0000\:00\:00.0/sriov_numvfs
      pci 0000:00:01.0: [1957:ef00] type 00 class 0x020001
      fsl_enetc_vf 0000:00:01.0: Adding to iommu group 19
      fsl_enetc_vf 0000:00:01.0: enabling device (0000 -> 0002)
      fsl_enetc_vf 0000:00:01.0 eno0vf0: renamed from eth0
      
      $ echo 0 > /sys/bus/pci/devices/0000\:00\:00.0/sriov_numvfs
      DMA-API: pci 0000:00:01.0: device driver has pending DMA allocations while released from device [count=1]
      One of leaked entries details: [size=2048 bytes] [mapped with DMA_BIDIRECTIONAL] [mapped as coherent]
      WARNING: CPU: 0 PID: 2547 at kernel/dma/debug.c:853 dma_debug_device_change+0x174/0x1c8
      (...)
      Call trace:
       dma_debug_device_change+0x174/0x1c8
       blocking_notifier_call_chain+0x74/0xa8
       device_release_driver_internal+0x18c/0x1f0
       device_release_driver+0x20/0x30
       pci_stop_bus_device+0x8c/0xe8
       pci_stop_and_remove_bus_device+0x20/0x38
       pci_iov_remove_virtfn+0xb8/0x128
       sriov_disable+0x3c/0x110
       pci_disable_sriov+0x24/0x30
       enetc_sriov_configure+0x4c/0x108
       sriov_numvfs_store+0x11c/0x198
      (...)
      DMA-API: Mapped at:
       dma_entry_alloc+0xa4/0x130
       debug_dma_alloc_coherent+0xbc/0x138
       dma_alloc_attrs+0xa4/0x108
       enetc_setup_cbdr+0x4c/0x1d0
       enetc_vf_probe+0x11c/0x250
      pci 0000:00:01.0: Removing from iommu group 19
      
      This happens because stupid me moved enetc_teardown_cbdr outside of
      enetc_free_si_resources, but did not bother to keep calling
      enetc_teardown_cbdr from all the places where enetc_free_si_resources
      was called. In particular, now it is no longer called from the main
      unbind function, just from the probe error path.
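      
      The shape of the bug can be modelled in a few lines of userspace C.
      Everything here is a hypothetical stand-in: malloc replaces the
      coherent DMA allocation, live_allocs plays the role of the DMA API
      debug counter, and the function names only mirror the roles of the
      driver functions mentioned above.

```c
#include <stdlib.h>

static int live_allocs; /* outstanding allocations, like DMA-API debug */

struct si {
	void *cbdr;
	void *rings;
};

static int setup_cbdr(struct si *si)
{
	si->cbdr = malloc(2048);
	if (!si->cbdr)
		return -1;
	live_allocs++;
	return 0;
}

static void teardown_cbdr(struct si *si)
{
	free(si->cbdr);
	si->cbdr = NULL;
	live_allocs--;
}

static int alloc_si_resources(struct si *si)
{
	si->rings = malloc(64);
	if (!si->rings)
		return -1;
	live_allocs++;
	return 0;
}

/* After the refactoring, this no longer tears down the CBDR. */
static void free_si_resources(struct si *si)
{
	free(si->rings);
	si->rings = NULL;
	live_allocs--;
}

static int probe(struct si *si)
{
	if (setup_cbdr(si))
		return -1;
	if (alloc_si_resources(si)) {
		teardown_cbdr(si); /* the probe error path did the right thing */
		return -1;
	}
	return 0;
}

/* Unbind as it was after the blamed patch: the CBDR is leaked. */
static void remove_buggy(struct si *si)
{
	free_si_resources(si);
}

/* Unbind with the fix: every setup step has a matching teardown. */
static void remove_fixed(struct si *si)
{
	free_si_resources(si);
	teardown_cbdr(si);
}
```

      After probe followed by remove_buggy, one allocation is still
      outstanding, which is the analogue of the "pending DMA allocations"
      warning; remove_fixed brings the counter back to zero.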
      
      Fixes: 4b47c0b8 ("net: enetc: don't initialize unused ports from a separate code path")
      Reported-by: Michael Walle <michael@walle.cc>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Michael Walle <michael@walle.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c54f042d