1. 17 4月, 2021 6 次提交
    • V
      net: enetc: use dedicated TX rings for XDP · 7eab503b
      Vladimir Oltean 提交于
      It is possible for one CPU to perform TX hashing (see netdev_pick_tx)
      between the 8 ENETC TX rings, and the TX hashing to select TX queue 1.
      
      At the same time, it is possible for the other CPU to already use TX
      ring 1 for XDP (either XDP_TX or XDP_REDIRECT). Since there is no mutual
      exclusion between XDP and the network stack, we run into an issue
      because the ENETC TX procedure is not reentrant.
      
      The obvious approach would be to just make XDP take the lock of the
      network stack's TX queue corresponding to the ring it's about to enqueue
      in.
      
      For XDP_REDIRECT, this is quite straightforward, a lock at the beginning
      and end of enetc_xdp_xmit() should do the trick.
      
      But for XDP_TX, it's a bit more complicated. For one, we do TX batching
      all by ourselves for frames with the XDP_TX verdict. This is something
      we would like to keep the way it is, for performance reasons. But
      batching means that the network stack's lock should be kept from the
      first enqueued XDP_TX frame and until we ring the doorbell. That is
      mostly fine, except for cases when in the same NAPI loop we have mixed
      XDP_TX and XDP_REDIRECT frames. So if enetc_xdp_xmit() gets called while
      we are holding the lock from the RX NAPI, then bam, deadlock. The naive
      answer could be 'just flush the XDP_TX frames first, then release the
      network stack's TX queue lock, then call xdp_do_flush_map()'. But even
      xdp_do_redirect() is capable of flushing the batched XDP_REDIRECT
      frames, so unless we unlock/relock the TX queue around xdp_do_redirect(),
      there simply isn't any clean way to protect XDP_TX from concurrent
      network stack .ndo_start_xmit() on another CPU.
      
      So we need to take a different approach, and that is to reserve two
      rings for the sole use of XDP. We leave TX rings
      0..ndev->real_num_tx_queues-1 to be handled by the network stack, and we
      pick them from the end of the priv->tx_ring array.
      
      We make an effort to keep the mapping done by enetc_alloc_msix() which
      decides which CPU handles the TX completions of which TX ring in its
      NAPI poll. So the XDP TX ring of CPU 0 is handled by TX ring 6, and the
      XDP TX ring of CPU 1 is handled by TX ring 7.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7eab503b
    • V
      net: enetc: remove unneeded xdp_do_flush_map() · a6369fe6
      Vladimir Oltean 提交于
      xdp_do_redirect already contains:
      -> dev_map_enqueue
         -> __xdp_enqueue
            -> bq_enqueue
               -> bq_xmit_all // if we have more than 16 frames
      
      So the logic from enetc will never be hit, because ENETC_DEFAULT_TX_WORK
      is 128.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6369fe6
    • V
      net: enetc: stop XDP NAPI processing when build_skb() fails · 8f50d8bb
      Vladimir Oltean 提交于
      When the code path below fails:
      
      enetc_clean_rx_ring_xdp // XDP_PASS
      -> enetc_build_skb
         -> enetc_map_rx_buff_to_skb
            -> build_skb
      
      enetc_clean_rx_ring_xdp will 'break', but that 'break' instruction isn't
      strong enough to actually break the NAPI poll loop, just the switch/case
      statement for XDP actions. So we increment rx_frm_cnt and go to the next
      frames minding our own business.
      
      Instead let's do what the skb NAPI poll function does, and break the
      loop now, waiting for the memory pressure to go away. Otherwise the next
      calls to build_skb() are likely to fail too.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f50d8bb
    • V
      net: enetc: recycle buffers for frames with RX errors · 672f9a21
      Vladimir Oltean 提交于
      When receiving a frame with errors, currently we do nothing with it (we
      don't construct an skb or an xdp_buff), we just exit the NAPI poll loop.
      
      Let's put the buffer back into the RX ring (similar to XDP_DROP).
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      672f9a21
    • V
      net: enetc: rename the buffer reuse helpers · 6b04830d
      Vladimir Oltean 提交于
      enetc_put_xdp_buff has nothing to do with XDP, frankly, it is just a
      helper to populate the recycle end of the shadow RX BD ring
      (next_to_alloc) with a given buffer.
      
      On the other hand, enetc_put_rx_buff plays more tricks than its name
      would suggest.
      
      So let's rename enetc_put_rx_buff into enetc_flip_rx_buff to reflect the
      half-page buffer reuse tricks that it employs, and enetc_put_xdp_buff
      into enetc_put_rx_buff which suggests a more garden-variety operation.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b04830d
    • V
      net: enetc: remove redundant clearing of skb/xdp_frame pointer in TX conf path · e9e49ae8
      Vladimir Oltean 提交于
      Later in enetc_clean_tx_ring we have:
      
      		/* Scrub the swbd here so we don't have to do that
      		 * when we reuse it during xmit
      		 */
      		memset(tx_swbd, 0, sizeof(*tx_swbd));
      
      So these assignments are unnecessary.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9e49ae8
  2. 16 4月, 2021 1 次提交
  3. 13 4月, 2021 2 次提交
    • Y
      enetc: support PTP Sync packet one-step timestamping · 7294380c
      Yangbo Lu 提交于
      This patch is to add support for PTP Sync packet one-step timestamping.
      Since ENETC single-step register has to be configured dynamically per
      packet for correctionField offeset and UDP checksum update, current
      one-step timestamping packet has to be sent only when the last one
      completes transmitting on hardware. So, on the TX, this patch handles
      one-step timestamping packet as below:
      
      - Trasmit packet immediately if no other one in transfer, or queue to
        skb queue if there is already one in transfer.
        The test_and_set_bit_lock() is used here to lock and check state.
      - Start a work when complete transfer on hardware, to release the bit
        lock and to send one skb in skb queue if has.
      
      And the configuration for one-step timestamping on ENETC before
      transmitting is,
      
      - Set one-step timestamping flag in extension BD.
      - Write 30 bits current timestamp in tstamp field of extension BD.
      - Update PTP Sync packet originTimestamp field with current timestamp.
      - Configure single-step register for correctionField offeset and UDP
        checksum update.
      Signed-off-by: NYangbo Lu <yangbo.lu@nxp.com>
      Reviewed-by: NClaudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7294380c
    • Y
      enetc: mark TX timestamp type per skb · f768e751
      Yangbo Lu 提交于
      Mark TX timestamp type per skb on skb->cb[0], instead of
      global variable for all skbs. This is a preparation for
      one step timestamp support.
      
      For one-step timestamping enablement, there will be both
      one-step and two-step PTP messages to transfer. And a skb
      queue is needed for one-step PTP messages making sure
      start to send current message only after the last one
      completed on hardware. (ENETC single-step register has to
      be dynamically configured per message.) So, marking TX
      timestamp type per skb is required.
      Signed-off-by: NYangbo Lu <yangbo.lu@nxp.com>
      Reviewed-by: NClaudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f768e751
  4. 10 4月, 2021 3 次提交
  5. 01 4月, 2021 8 次提交
    • V
      net: enetc: add support for XDP_REDIRECT · 9d2b68cc
      Vladimir Oltean 提交于
      The driver implementation of the XDP_REDIRECT action reuses parts from
      XDP_TX, most notably the enetc_xdp_tx function which transmits an array
      of TX software BDs. Only this time, the buffers don't have DMA mappings,
      we need to create them.
      
      When a BPF program reaches the XDP_REDIRECT verdict for a frame, we can
      employ the same buffer reuse strategy as for the normal processing path
      and for XDP_PASS: we can flip to the other page half and seed that to
      the RX ring.
      
      Note that scatter/gather support is there, but disabled due to lack of
      multi-buffer support in XDP (which is added by this series):
      https://patchwork.kernel.org/project/netdevbpf/cover/cover.1616179034.git.lorenzo@kernel.org/Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d2b68cc
    • V
      net: enetc: add support for XDP_TX · 7ed2bc80
      Vladimir Oltean 提交于
      For reflecting packets back into the interface they came from, we create
      an array of TX software BDs derived from the RX software BDs. Therefore,
      we need to extend the TX software BD structure to contain most of the
      stuff that's already present in the RX software BD structure, for
      reasons that will become evident in a moment.
      
      For a frame with the XDP_TX verdict, we don't reuse any buffer right
      away as we do for XDP_DROP (the same page half) or XDP_PASS (the other
      page half, same as the skb code path).
      
      Because the buffer transfers ownership from the RX ring to the TX ring,
      reusing any page half right away is very dangerous. So what we can do is
      we can recycle the same page half as soon as TX is complete.
      
      The code path is:
      enetc_poll
      -> enetc_clean_rx_ring_xdp
         -> enetc_xdp_tx
         -> enetc_refill_rx_ring
      (time passes, another MSI interrupt is raised)
      enetc_poll
      -> enetc_clean_tx_ring
         -> enetc_recycle_xdp_tx_buff
      
      But that creates a problem, because there is a potentially large time
      window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, period in
      which we'll have less and less RX buffers.
      
      Basically, when the ship starts sinking, the knee-jerk reaction is to
      let enetc_refill_rx_ring do what it does for the standard skb code path
      (refill every 16 consumed buffers), but that turns out to be very
      inefficient. The problem is that we have no rx_swbd->page at our
      disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would
      have to call enetc_new_page for every buffer that we refill (if we
      choose to refill at this early stage). Very inefficient, it only makes
      the problem worse, because page allocation is an expensive process, and
      CPU time is exactly what we're lacking.
      
      Additionally, there is an even bigger problem: if we let
      enetc_refill_rx_ring top up the ring's buffers again from the RX path,
      remember that the buffers sent to transmission haven't disappeared
      anywhere. They will be eventually sent, and processed in
      enetc_clean_tx_ring, and an attempt will be made to recycle them.
      But surprise, the RX ring is already full of new buffers, because we
      were premature in deciding that we should refill. So not only we took
      the expensive decision of allocating new pages, but now we must throw
      away perfectly good and reusable buffers.
      
      So what we do is we implement an elastic refill mechanism, which keeps
      track of the number of in-flight XDP_TX buffer descriptors. We top up
      the RX ring only up to the total ring capacity minus the number of BDs
      that are in flight (because we know that those BDs will return to us
      eventually).
      
      The enetc driver manages 1 RX ring per CPU, and the default TX ring
      management is the same. So we do XDP_TX towards the TX ring of the same
      index, because it is affined to the same CPU. This will probably not
      produce great results when we have a tc-taprio/tc-mqprio qdisc on the
      interface, because in that case, the number of TX rings might be
      greater, but I didn't add any checks for that yet (mostly because I
      didn't know what checks to add).
      
      It should also be noted that we need to change the DMA mapping direction
      for RX buffers, since they may now be reflected into the TX ring of the
      same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and
      remapping as DMA_TO_DEVICE, because performance is better this way.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ed2bc80
    • V
      net: enetc: add support for XDP_DROP and XDP_PASS · d1b15102
      Vladimir Oltean 提交于
      For the RX ring, enetc uses an allocation scheme based on pages split
      into two buffers, which is already very efficient in terms of preventing
      reallocations / maximizing reuse, so I see no reason why I would change
      that.
      
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half B | half B | half B | half B | half B | half B | half B |
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half A | half A | half A | half A | half A | half A | half A | RX ring
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
           ^                                                     ^
           |                                                     |
       next_to_clean                                       next_to_alloc
                                                            next_to_use
      
                         +--------+--------+--------+--------+--------+
                         |        |        |        |        |        |
                         | half B | half B | half B | half B | half B |
                         |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half B | half B | half A | half A | half A | half A | half A | RX ring
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |   ^                                   ^
       | half A | half A |   |                                   |
       |        |        | next_to_clean                   next_to_use
       +--------+--------+
                    ^
                    |
               next_to_alloc
      
      then when enetc_refill_rx_ring is called, whose purpose is to advance
      next_to_use, it sees that it can take buffers up to next_to_alloc, and
      it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate
      one!".
      
      The only problem is that for default PAGE_SIZE values of 4096, buffer
      sizes are 2048 bytes. While this is enough for normal skb allocations at
      an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256
      bytes, and including skb_shared_info and alignment, we end up being able
      to make use of only 1472 bytes, which is insufficient for the default
      MTU.
      
      To solve that problem, we implement scatter/gather processing in the
      driver, because we would really like to keep the existing allocation
      scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and
      another one of 28 bytes.
      
      Because the headroom required by XDP is different (and much larger) than
      the one required by the network stack, whenever a BPF program is added
      or deleted on the port, we drain the existing RX buffers and seed new
      ones with the required headroom. We also keep the required headroom in
      rx_ring->buffer_offset.
      
      The simplest way to implement XDP_PASS, where an skb must be created, is
      to create an xdp_buff based on the next_to_clean RX BDs, but not clear
      those BDs from the RX ring yet, just keep the original index at which
      the BDs for this frame started. Then, if the verdict is XDP_PASS,
      instead of converting the xdb_buff to an skb, we replay a call to
      enetc_build_skb (just as in the normal enetc_clean_rx_ring case),
      starting from the original BD index.
      
      We would also like to be minimally invasive to the regular RX data path,
      and not check whether there is a BPF program attached to the ring on
      every packet. So we create a separate RX ring processing function for
      XDP.
      
      Because we only install/remove the BPF program while the interface is
      down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there
      shouldn't be any circumstance in which we are processing packets and
      there is a potentially freed BPF program attached to the RX ring.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1b15102
    • V
      net: enetc: move up enetc_reuse_page and enetc_page_reusable · 65d0cbb4
      Vladimir Oltean 提交于
      For XDP_TX, we need to call enetc_reuse_page from enetc_clean_tx_ring,
      so we need to avoid a forward declaration.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65d0cbb4
    • V
      net: enetc: clean the TX software BD on the TX confirmation path · 1ee8d6f3
      Vladimir Oltean 提交于
      With the future introduction of some new fields into enetc_tx_swbd such
      as is_xdp_tx, is_xdp_redirect etc, we need not only to set these bits
      to true from the XDP_TX/XDP_REDIRECT code path, but also to false from
      the old code paths.
      
      This is because TX software buffer descriptors are kept in a ring that
      is shadow of the hardware TX ring, so these structures keep getting
      reused, and there is always the possibility that when a software BD is
      reused (after we ran a full circle through the TX ring), the old user of
      the tx_swbd had set is_xdp_tx = true, and now we are sending a regular
      skb, which would need to set is_xdp_tx = false.
      
      To be minimally invasive to the old code paths, let's just scrub the
      software TX BD in the TX confirmation path (enetc_clean_tx_ring), once
      we know that nobody uses this software TX BD (tx_ring->next_to_clean
      hasn't yet been updated, and the TX paths check enetc_bd_unused which
      tells them if there's any more space in the TX ring for a new enqueue).
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ee8d6f3
    • V
      net: enetc: add a dedicated is_eof bit in the TX software BD · d504498d
      Vladimir Oltean 提交于
      In the transmit path, if we have a scatter/gather frame, it is put into
      multiple software buffer descriptors, the last of which has the skb
      pointer populated (which is necessary for rearming the TX MSI vector and
      for collecting the two-step TX timestamp from the TX confirmation path).
      
      At the moment, this is sufficient, but with XDP_TX, we'll need to
      service TX software buffer descriptors that don't have an skb pointer,
      however they might be final nonetheless. So add a dedicated bit for
      final software BDs that we populate and check explicitly. Also, we keep
      looking just for an skb when doing TX timestamping, because we don't
      want/need that for XDP.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d504498d
    • V
      net: enetc: move skb creation into enetc_build_skb · a800abd3
      Vladimir Oltean 提交于
      We need to build an skb from two code paths now: from the plain RX data
      path and from the XDP data path when the verdict is XDP_PASS.
      
      Create a new enetc_build_skb function which contains the essential steps
      for building an skb based on the first and last positions of buffer
      descriptors within the RX ring.
      
      We also squash the enetc_process_skb function into enetc_build_skb,
      because what that function did wasn't very meaningful on its own.
      
      The "rx_frm_cnt++" instruction has been moved around napi_gro_receive
      for cosmetic reasons, to be in the same spot as rx_byte_cnt++, which
      itself must be before napi_gro_receive, because that's when we lose
      ownership of the skb.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a800abd3
    • V
      net: enetc: consume the error RX buffer descriptors in a dedicated function · 2fa423f5
      Vladimir Oltean 提交于
      We can and should check the RX BD errors before starting to build the
      skb. The only apparent reason why things are done in this backwards
      order is to spare one call to enetc_rxbd_next.
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2fa423f5
  6. 11 3月, 2021 12 次提交
  7. 09 3月, 2021 1 次提交
    • V
      net: enetc: allow hardware timestamping on TX queues with tc-etf enabled · 29d98f54
      Vladimir Oltean 提交于
      The txtime is passed to the driver in skb->skb_mstamp_ns, which is
      actually in a union with skb->tstamp (the place where software
      timestamps are kept).
      
      Since commit b50a5c70 ("net: allow simultaneous SW and HW transmit
      timestamping"), __sock_recv_timestamp has some logic for making sure
      that the two calls to skb_tstamp_tx:
      
      skb_tx_timestamp(skb) # Software timestamp in the driver
      -> skb_tstamp_tx(skb, NULL)
      
      and
      
      skb_tstamp_tx(skb, &shhwtstamps) # Hardware timestamp in the driver
      
      will both do the right thing and in a race-free manner, meaning that
      skb_tx_timestamp will deliver a cmsg with the software timestamp only,
      and skb_tstamp_tx with a non-NULL hwtstamps argument will deliver a cmsg
      with the hardware timestamp only.
      
      Why are races even possible? Well, because although the software timestamp
      skb->tstamp is private per skb, the hardware timestamp skb_hwtstamps(skb)
      lives in skb_shinfo(skb), an area which is shared between skbs and their
      clones. And skb_tstamp_tx works by cloning the packets when timestamping
      them, therefore attempting to perform hardware timestamping on an skb's
      clone will also change the hardware timestamp of the original skb. And
      the original skb might have been yet again cloned for software
      timestamping, at an earlier stage.
      
      So the logic in __sock_recv_timestamp can't be as simple as saying
      "does this skb have a hardware timestamp? if yes I'll send the hardware
      timestamp to the socket, otherwise I'll send the software timestamp",
      precisely because the hardware timestamp is shared.
      Instead, it's quite the other way around: __sock_recv_timestamp says
      "does this skb have a software timestamp? if yes, I'll send the software
      timestamp, otherwise the hardware one". This works because the software
      timestamp is not shared with clones.
      
      But that means we have a problem when we attempt hardware timestamping
      with skbs that don't have the skb->tstamp == 0. __sock_recv_timestamp
      will say "oh, yeah, this must be some sort of odd clone" and will not
      deliver the hardware timestamp to the socket. And this is exactly what
      is happening when we have txtime enabled on the socket: as mentioned,
      that is put in a union with skb->tstamp, so it is quite easy to mistake
      it.
      
      Do what other drivers do (intel igb/igc) and write zero to skb->tstamp
      before taking the hardware timestamp. It's of no use to us now (we're
      already on the TX confirmation path).
      
      Fixes: 0d08c9ec ("enetc: add support time specific departure base on the qos etf")
      Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29d98f54
  8. 02 3月, 2021 6 次提交
    • V
      net: enetc: keep RX ring consumer index in sync with hardware · 3a5d12c9
      Vladimir Oltean 提交于
      The RX rings have a producer index owned by hardware, where newly
      received frame buffers are placed, and a consumer index owned by
      software, where newly allocated buffers are placed, in expectation of
      hardware being able to place frame data in them.
      
      Hardware increments the producer index when a frame is received, however
      it is not allowed to increment the producer index to match the consumer
      index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received
      BDs. Whenever the producer index matches the value of the consumer
      index, the ring has no unprocessed received frames and all BDs in the
      ring have been initialized/prepared by software, i.e. hardware owns all
      BDs in the ring.
      
      The code uses the next_to_clean variable to keep track of the producer
      index, and the next_to_use variable to keep track of the consumer index.
      
      The RX rings are seeded from enetc_refill_rx_ring, which is called from
      two places:
      
      1. initially the ring is seeded until full with enetc_bd_unused(rx_ring),
         i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511:
      
      .ndo_open
      -> enetc_open
         -> enetc_setup_bdrs
            -> enetc_setup_rxbdr
               -> enetc_refill_rx_ring
      
      2. then during the data path processing, it is refilled with 16 buffers
         at a time:
      
      enetc_msix
      -> napi_schedule
         -> enetc_poll
            -> enetc_clean_rx_ring
               -> enetc_refill_rx_ring
      
      There is just one problem: the initial seeding done during .ndo_open
      updates just the producer index (ENETC_RBPIR) with 0, and the software
      next_to_clean and next_to_use variables. Notably, it will not update the
      consumer index to make the hardware aware of the newly added buffers.
      
      Wait, what? So how does it work?
      
      Well, the reset values of the producer index and of the consumer index
      of a ring are both zero. As per the description in the second paragraph,
      it means that the ring is full of buffers waiting for hardware to put
      frames in them, which by coincidence is almost true, because we have in
      fact seeded 511 buffers into the ring.
      
      But will the hardware attempt to access the 512th entry of the ring,
      which has an invalid BD in it? Well, no, because in order to do that, it
      would have to first populate the first 511 entries, and the NAPI
      enetc_poll will kick in by then. Eventually, after 16 processed slots
      have become available in the RX ring, enetc_clean_rx_ring will call
      enetc_refill_rx_ring and then will [ finally ] update the consumer index
      with the new software next_to_use variable. From now on, the
      next_to_clean and next_to_use variables are in sync with the producer
      and consumer ring indices.
      
      So the day is saved, right? Well, not quite. Freeing the memory
      allocated for the rings is done in:
      
      enetc_close
      -> enetc_clear_bdrs
         -> enetc_clear_rxbdr
            -> this just disables the ring
      -> enetc_free_rxtx_rings
         -> enetc_free_rx_ring
            -> sets next_to_clean and next_to_use to 0
      
      but again, nothing is committed to the hardware producer and consumer
      indices (yay!). The assumption is that the ring is disabled, so the
      indices don't matter anyway, and it's the responsibility of the "open"
      code path to set those up.
      
      .. Except that the "open" code path does not set those up properly.
      
      While initially, things almost work, during subsequent enetc_close ->
      enetc_open sequences, we have problems. To be precise, the enetc_open
      that is subsequent to enetc_close will again refill the ring with 511
      entries, but it will leave the consumer index untouched. Untouched
      means, of course, equal to the value it had before disabling the ring
      and draining the old buffers in enetc_close.
      
      But as mentioned, enetc_setup_rxbdr will at least update the producer
      index though, through this line of code:
      
      	enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0);
      
      so at this stage we'll have:
      
      next_to_clean=0 (in hardware 0)
      next_to_use=511 (in hardware we'll have the refill index prior to enetc_close)
      
      Again, the next_to_clean and producer index are in sync and set to
      correct values, so the driver manages to limp on. Eventually, 16 ring
      entries will be consumed by enetc_poll, and the savior
      enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then
      update the hardware consumer ring based upon the new next_to_use.
      
      So.. it works?
      Well, by coincidence, it almost does, but there's a circumstance where
      enetc_clean_rx_ring won't be there to save us. If the previous value of
      the consumer index was 15, there's a problem, because the NAPI poll
      sequence will only issue a refill when 16 or more buffers have been
      consumed.
      
      It's easiest to illustrate this with an example:
      
      ip link set eno0 up
      ip addr add 192.168.100.1/24 dev eno0
      ping 192.168.100.1 -c 20 # ping this port from another board
      ip link set eno0 down
      ip link set eno0 up
      ping 192.168.100.1 -c 20 # ping it again from the same other board
      
      One by one:
      
      1. ip link set eno0 up
      -> calls enetc_setup_rxbdr:
         -> calls enetc_refill_rx_ring(511 buffers)
         -> next_to_clean=0 (in hw 0)
         -> next_to_use=511 (in hw 0)
      
      2. ping 192.168.100.1 -c 20 # ping this port from another board
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0)
      enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15)
      
      20 packets transmitted, 20 packets received, 0% packet loss
      
      3. ip link set eno0 down
      enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15)
      
      4. ip link set eno0 up
      -> calls enetc_setup_rxbdr:
         -> calls enetc_refill_rx_ring(511 buffers)
         -> next_to_clean=0 (in hw 0)
         -> next_to_use=511 (in hw 15)
      
      5. ping 192.168.100.1 -c 20 # ping it again from the same other board
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15)
      enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15)
      
      20 packets transmitted, 12 packets received, 40% packet loss
      
      And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal
      to 15 for that to happen), no nothing. The hardware enters the condition where
      the producer (14) + 1 is equal to the consumer (15) index, which makes it
      believe it has no more free buffers to put packets in, so it starts discarding
      them:
      
      ip netns exec ns0 ethtool -S eno0 | grep -v ': 0'
      NIC statistics:
           Rx ring  0 discarded frames: 8
      
      Summarized, if the interface receives between 16 and 32 (mod 512) frames
      and then there is a link flap, then the port will eventually die with no
      way to recover. If it receives less than 16 (mod 512) frames, then the
      initial NAPI poll [ before the link flap ] will not update the consumer
      index in hardware (it will remain zero) which will be ok when the buffers
      are later reinitialized. If more than 32 (mod 512) frames are received,
      the initial NAPI poll has the chance to refill the ring twice, updating
      the consumer index to at least 32. So after the link flap, the consumer
      index is still wrong, but the post-flap NAPI poll gets a chance to
      refill the ring once (because it passes through cleaned_cnt=15) and
      makes the consumer index be again back in sync with next_to_use.
      
      The solution to this problem is actually simple, we just need to write
      next_to_use into the hardware consumer index at enetc_open time, which
      always brings it back in sync after an initial buffer seeding process.
      
      The simpler thing would be to put the write to the consumer index into
      enetc_refill_rx_ring directly, but there are issues with the MDIO
      locking: in the NAPI poll code we have the enetc_lock_mdio() taken from
      top-level and we use the unlocked enetc_wr_reg_hot, whereas in
      enetc_open, the enetc_lock_mdio() is not taken at the top level, but
      instead by each individual enetc_wr_reg, so we are forced to put an
      additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of
      the code is left as a refactoring exercise.
      
      Fixes: d4fd0404 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a5d12c9
    • V
      net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdr · 96a5223b
      Vladimir Oltean 提交于
      The Station Interface Receive Interrupt Detect Register (SIRXIDR)
      contains a 16-bit wide mask of 'interrupt detected' events for each ring
      associated with a port. Bit i is write-1-to-clean for RX ring i.
      
      I have no explanation whatsoever how this line of code came to be
      inserted in the blamed commit. I checked the downstream versions of that
      patch and none of them have it.
      
      The somewhat comical aspect of it is that we're writing a binary number
      to the SIRXIDR register, which is derived from enetc_bd_unused(rx_ring).
      Since the RX rings have 512 buffer descriptors, we end up writing 511 to
      this register, which is 0x1ff, so we are effectively clearing the
      'interrupt detected' event for rings 0-8.
      
      This register is not what is used for interrupt handling though - it
      only provides a summary for the entire SI. The hardware provides one
      separate Interrupt Detect Register per RX ring, which auto-clears upon
      read. So there doesn't seem to be any adverse effect caused by this
      bogus write.
      
      There is, however, one reason why this should be handled as a bugfix:
      next_to_clean _should_ be committed to hardware, just not to that
      register, and this was obscuring the fact that it wasn't. This is fixed
      in the next patch, and removing the bogus line now allows the fix patch
      to be backported beyond that point.
      
      Fixes: fd5736bf ("enetc: Workaround for MDIO register access issue")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96a5223b
    • V
      net: enetc: fix incorrect TPID when receiving 802.1ad tagged packets · 827b6fd0
      Vladimir Oltean 提交于
      When the enetc ports have rx-vlan-offload enabled, they report a TPID of
      ETH_P_8021Q regardless of what was actually in the packet. When
      rx-vlan-offload is disabled, packets have the proper TPID. Fix this
      inconsistency by finishing the TODO left in the code.
      
      Fixes: d4fd0404 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      827b6fd0
    • V
      net: enetc: take the MDIO lock only once per NAPI poll cycle · 6d36ecdb
      Vladimir Oltean 提交于
      The workaround for the ENETC MDIO erratum caused a performance
      degradation of 82 Kpps (seen with IP forwarding of two 1Gbps streams of
      64B packets). This is due to excessive locking and unlocking in the fast
      path, which can be avoided.
      
      By taking the MDIO read-side lock only once per NAPI poll cycle, we are
      able to regain 54 Kpps (65%) of the performance hit. The rest of the
      performance degradation comes from the TX data path, but unfortunately
      it doesn't look like we can optimize that away easily, even with
      netdev_xmit_more(), there just isn't any skb batching done, to help with
      taking the MDIO lock less often than once per packet.
      
      We need to change the register accessor type for enetc_get_tx_tstamp,
      because it now runs under the enetc_lock_mdio as per the new call path
      detailed below:
      
      enetc_msix
      -> napi_schedule
         -> enetc_poll
            -> enetc_lock_mdio
            -> enetc_clean_tx_ring
               -> enetc_get_tx_tstamp
            -> enetc_clean_rx_ring
            -> enetc_unlock_mdio
      
      Fixes: fd5736bf ("enetc: Workaround for MDIO register access issue")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6d36ecdb
    • V
      net: enetc: initialize RFS/RSS memories for unused ports too · 3222b5b6
      Vladimir Oltean 提交于
      Michael reports that since linux-next-20210211, the AER messages for ECC
      errors have started reappearing, and this time they can be reliably
      reproduced with the first ping on one of his LS1028A boards.
      
      $ ping 1[   33.258069] pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0
      72.16.0.1
      PING [   33.267050] pcieport 0000:00:1f.0: AER: can't find device of ID0000
      172.16.0.1 (172.16.0.1): 56 data bytes
      64 bytes from 172.16.0.1: seq=0 ttl=64 time=17.124 ms
      64 bytes from 172.16.0.1: seq=1 ttl=64 time=0.273 ms
      
      $ devmem 0x1f8010e10 32
      0xC0000006
      
      It isn't clear why this is necessary, but it seems that for the errors
      to go away, we must clear the entire RFS and RSS memory, not just for
      the ports in use.
      
      Sadly the code is structured in such a way that we can't have unified
      logic for the used and unused ports. For the minimal initialization of
      an unused port, we need just to enable and ioremap the PF memory space,
      and a control buffer descriptor ring. Unused ports must then free the
      CBDR because the driver will exit, but used ports can not pick up from
      where that code path left, since the CBDR API does not reinitialize a
      ring when setting it up, so its producer and consumer indices are out of
      sync between the software and hardware state. So a separate
      enetc_init_unused_port function was created, and it gets called right
      after the PF memory space is enabled.
      
      Fixes: 07bf34a5 ("net: enetc: initialize the RFS and RSS memories")
      Reported-by: NMichael Walle <michael@walle.cc>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: NMichael Walle <michael@walle.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3222b5b6
    • V
      net: enetc: don't overwrite the RSS indirection table when initializing · c646d10d
      Vladimir Oltean 提交于
      After the blamed patch, all RX traffic gets hashed to CPU 0 because the
      hashing indirection table set up in:
      
      enetc_pf_probe
      -> enetc_alloc_si_resources
         -> enetc_configure_si
            -> enetc_setup_default_rss_table
      
      is overwritten later in:
      
      enetc_pf_probe
      -> enetc_init_port_rss_memory
      
      which zero-initializes the entire port RSS table in order to avoid ECC errors.
      
      The trouble really is that enetc_init_port_rss_memory really neads
      enetc_alloc_si_resources to be called, because it depends upon
      enetc_alloc_cbdr and enetc_setup_cbdr. But that whole enetc_configure_si
      thing could have been better thought out, it has nothing to do in a
      function called "alloc_si_resources", especially since its counterpart,
      "free_si_resources", does nothing to unwind the configuration of the SI.
      
      The point is, we need to pull out enetc_configure_si out of
      enetc_alloc_resources, and move it after enetc_init_port_rss_memory.
      This allows us to set up the default RSS indirection table after
      initializing the memory.
      
      Fixes: 07bf34a5 ("net: enetc: initialize the RFS and RSS memories")
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c646d10d
  9. 18 11月, 2020 1 次提交