1. 09 Dec, 2022 14 commits
    • ice: reschedule ice_ptp_wait_for_offset_valid during reset · 95af1f1c
      Committed by Jacob Keller
      If the ice_ptp_wait_for_offset_valid function is scheduled to run while the
      driver is resetting, it will exit without completing calibration. The work
      function gets scheduled by ice_ptp_port_phy_restart which will be called as
      part of the reset recovery process.
      
      It is possible for the first execution to occur before the driver has
      completely cleared its resetting flags. Ensure calibration completes by
      rescheduling the task until reset is fully completed.
      Reported-by: Siddaraju DH <siddaraju.dh@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
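The rescheduling idea in the commit above can be sketched in plain C. This is a minimal illustration, not the driver's code: `struct port_state`, `enum work_result`, and this stand-in `wait_for_offset_valid` are hypothetical names, and the real driver requeues a kernel work item rather than returning a value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for driver state; not the real ice structures. */
struct port_state {
    bool resetting;   /* driver has not yet cleared its reset flags */
    bool calibrated;  /* offset calibration has completed */
};

enum work_result { WORK_DONE, WORK_RESCHEDULE };

/* Work function body: rather than exiting silently while a reset is in
 * progress, tell the caller to queue the work again later. */
static enum work_result wait_for_offset_valid(struct port_state *port)
{
    assert(port != NULL);

    if (port->resetting)
        return WORK_RESCHEDULE;

    port->calibrated = true;
    return WORK_DONE;
}
```

A caller would keep requeueing the work while `WORK_RESCHEDULE` is returned, so calibration still completes once the reset has finished.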
    • ice: make Tx and Rx vernier offset calibration independent · f029a343
      Committed by Siddaraju DH
      The Tx and Rx calibration and timestamp generation blocks are independent.
      However, the ice driver waits until both blocks are ready before
      configuring either block.
      
      This can delay the configuration of one block because we have not yet
      received a packet in the other block.
      
      There is no reason to wait to finish programming Tx just because we haven't
      received a packet. Similarly there is no reason to wait to program Rx just
      because we haven't transmitted a packet.
      
      Instead of checking both offset status before programming either block,
      refactor the ice_phy_cfg_tx_offset_e822 and ice_phy_cfg_rx_offset_e822
      functions so that they perform their own offset status checks.
      Additionally, make them also check the offset ready bit to determine if
      the offset values have already been programmed.
      
      Call the individual configure functions directly in
      ice_ptp_wait_for_offset_valid. The functions will now correctly check
      status, and program the offsets if ready. Once the offset is programmed,
      the functions will exit quickly after just checking the offset ready
      register.
      
      Remove the ice_phy_calc_vernier_e822 in ice_ptp_hw.c, as well as the offset
      valid check functions in ice_ptp.c entirely as they are no longer
      necessary.
      
      With this change, the Tx and Rx blocks will each be enabled as soon as
      possible without waiting for the other block to complete calibration. This
      can enable timestamps faster in setups which have a low rate of transmitted
      or received packets. In particular, it can stop a situation where one port
      never receives traffic, and thus never finishes calibration of the Tx
      block, resulting in continuous faults reported by the ptp4l daemon
      application.
      Signed-off-by: Siddaraju DH <siddaraju.dh@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
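The independent Tx/Rx flow described above can be sketched as follows. This is a simplified model under stated assumptions: `struct phy_block`, `cfg_offset`, and `cfg_both_offsets` are hypothetical names, and the real functions read PHY registers instead of struct fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one calibration block (Tx or Rx). */
struct phy_block {
    bool offset_valid;        /* hardware finished measuring the offset */
    bool offset_ready;        /* software already programmed the offset */
    unsigned int programmed;  /* how many times the offset was written */
};

struct phy_port {
    struct phy_block tx;
    struct phy_block rx;
};

/* Each block checks only its own status. Returns true once configured. */
static bool cfg_offset(struct phy_block *blk)
{
    assert(blk != NULL);

    if (blk->offset_ready)
        return true;           /* already programmed: exit quickly */
    if (!blk->offset_valid)
        return false;          /* calibration not finished: retry later */

    blk->programmed++;         /* stands in for writing the offset registers */
    blk->offset_ready = true;
    return true;
}

/* Configure both blocks without letting one wait for the other. */
static bool cfg_both_offsets(struct phy_port *port)
{
    bool tx_done = cfg_offset(&port->tx);
    bool rx_done = cfg_offset(&port->rx);

    return tx_done && rx_done;
}
```

In this model a port that never receives traffic still gets its Tx block programmed on the first pass, which is the behaviour the commit message aims for.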
    • ice: only check set bits in ice_ptp_flush_tx_tracker · e3ba5248
      Committed by Jacob Keller
      The ice_ptp_flush_tx_tracker function is called to clear all outstanding Tx
      timestamp requests when the port is being brought down. This function
      iterates over the entire list, but this is unnecessary. We only need to
      check the bits which are actually set in the ready bitmap.
      
      Replace this logic with for_each_set_bit, and follow a similar flow as in
      ice_ptp_tx_tstamp_cleanup. Note that it is safe to call dev_kfree_skb_any
      on a NULL pointer, as it performs a no-op, so we do not need to verify
      that the skb is non-NULL first.
      
      The new implementation also avoids clearing (and thus reading!) the PHY
      timestamp unless the index is marked as having a valid timestamp in the
      timestamp status bitmap. This ensures that we properly clear the status
      registers as appropriate.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
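A userspace sketch of the "visit only the set bits" flush described above. The tracker layout, the field names, and the use of `free()` in place of dev_kfree_skb_any are assumptions made for illustration; `__builtin_ctzll` is a GCC/Clang builtin standing in for the kernel's for_each_set_bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TRACKER_LEN 64

/* Hypothetical, simplified tracker; not the real struct ice_ptp_tx. */
struct tx_tracker {
    uint64_t in_use;          /* bit i set: slot i has an outstanding request */
    uint64_t tstamp_ready;    /* bit i set: PHY holds a valid timestamp for slot i */
    void *skb[TRACKER_LEN];   /* stand-in for struct sk_buff pointers */
    unsigned int phy_clears;  /* counts simulated PHY timestamp reads */
};

/* Release every outstanding slot, touching only the bits that are set.
 * Returns the number of slots released. */
static unsigned int flush_tracker(struct tx_tracker *tx)
{
    unsigned int released = 0;
    uint64_t pending;

    assert(tx != NULL);
    pending = tx->in_use;

    while (pending != 0) {
        unsigned int idx = (unsigned int)__builtin_ctzll(pending);
        uint64_t bit = UINT64_C(1) << idx;

        pending &= ~bit;

        /* Only read (and thereby clear) the PHY slot if it is marked valid. */
        if (tx->tstamp_ready & bit) {
            tx->phy_clears++;
            tx->tstamp_ready &= ~bit;
        }

        free(tx->skb[idx]);   /* free(NULL) is a no-op, so no NULL check */
        tx->skb[idx] = NULL;
        tx->in_use &= ~bit;
        released++;
    }

    return released;
}
```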
    • ice: handle flushing stale Tx timestamps in ice_ptp_tx_tstamp · d40fd600
      Committed by Jacob Keller
      In the event of a PTP clock time change due to .adjtime or .settime, the
      ice driver needs to update the cached copy of the PHC time and also discard
      any outstanding Tx timestamps.
      
      This is required because otherwise the wrong copy of the PHC time will be
      used when extending the Tx timestamp. This could result in reporting
      incorrect timestamps to the stack.
      
      The current approach taken to handle this is to call
      ice_ptp_flush_tx_tracker, which will discard any timestamps which are not
      yet complete.
      
      This is problematic for two reasons:
      
      1) it could lead to a potential race condition where the wrong timestamp is
         associated with a future packet.
      
         This can occur with the following flow:
      
         1. Thread A gets request to transmit a timestamped packet, and picks an
            index and transmits the packet
      
         2. Thread B calls ice_ptp_flush_tx_tracker and sees the index in use,
            marking it as discarded. No timestamp read occurs because the status
            bit is not set, but the index is released for re-use
      
         3. Thread A gets a new request to transmit another timestamped packet,
            picks the same (now unused) index and transmits that packet.
      
         4. The PHY transmits the first packet and updates the timestamp slot and
            generates an interrupt.
      
         5. The ice_ptp_tx_tstamp thread executes and sees the interrupt and a
            valid timestamp, but associates it with the new Tx SKB and not with
            the packet that the timestamp actually belongs to.
      
         This could result in the previous timestamp being assigned to a new
         packet producing incorrect timestamps and leading to incorrect behavior
         in PTP applications.
      
         This is most likely to occur when the packet rate for Tx timestamp
         requests is very high.
      
      2) on E822 hardware, we must avoid reading a timestamp index more than once
         each time its status bit is set and an interrupt is generated by
         hardware.
      
         We do have some extensive checks for the unread flag to ensure that only
         one of either the ice_ptp_flush_tx_tracker or ice_ptp_tx_tstamp threads
         read the timestamp. However, even with this we can still have cases
         where we "flush" a timestamp that was actually completed in hardware.
         This can lead to cases where we don't read the timestamp index as
         appropriate.
      
      To fix both of these issues, we must avoid calling ice_ptp_flush_tx_tracker
      outside of the teardown path.
      
      Rather than using ice_ptp_flush_tx_tracker, introduce a new state bitmap,
      the stale bitmap. Start this as cleared when we begin a new timestamp
      request. When we're about to extend a timestamp and send it up to the
      stack, first check to see if that stale bit was set. If so, drop the
      timestamp without sending it to the stack.
      
      When we need to update the cached PHC timestamp out of band, just mark all
      currently outstanding timestamps as stale. This will ensure that once
      hardware completes the timestamp we'll ignore it correctly and avoid
      reporting bogus timestamps to userspace.
      
      With this change, we fix potential issues caused by calling
      ice_ptp_flush_tx_tracker during normal operation.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
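The stale-bitmap scheme described above can be modelled with two 64-bit masks. This is a sketch under assumptions: the struct and the function names `request_ts`, `mark_all_stale`, and `complete_ts` are hypothetical, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-bitmap tracker; locking is intentionally left out. */
struct tx_tracker {
    uint64_t in_use;  /* bit i set: slot i has an outstanding request */
    uint64_t stale;   /* bit i set: the cached PHC time changed after the request */
};

/* Begin a new timestamp request: the stale bit starts out cleared. */
static void request_ts(struct tx_tracker *tx, unsigned int idx)
{
    uint64_t bit;

    assert(idx < 64);
    bit = UINT64_C(1) << idx;
    tx->in_use |= bit;
    tx->stale &= ~bit;
}

/* Called when the clock is adjusted out of band (e.g. adjtime/settime):
 * nothing is flushed, outstanding requests are only marked stale. */
static void mark_all_stale(struct tx_tracker *tx)
{
    tx->stale |= tx->in_use;
}

/* Called once hardware has completed slot idx. Returns true if the
 * timestamp should be reported to the stack, false if it must be dropped. */
static bool complete_ts(struct tx_tracker *tx, unsigned int idx)
{
    uint64_t bit;
    bool report;

    assert(idx < 64);
    bit = UINT64_C(1) << idx;
    report = (tx->in_use & bit) && !(tx->stale & bit);
    tx->in_use &= ~bit;
    tx->stale &= ~bit;
    return report;
}
```

Because a slot is released only on completion, an index cannot be handed to a new packet while hardware may still write the old timestamp into it.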
    • ice: cleanup allocations in ice_ptp_alloc_tx_tracker · c1f3414d
      Committed by Jacob Keller
      The ice_ptp_alloc_tx_tracker function must allocate the timestamp array and
      the bitmap for tracking the currently in use indexes. A future change is
      going to add yet another allocation to this function.
      
      If these allocations fail we need to ensure that we properly cleanup and
      ensure that the pointers in the ice_ptp_tx structure are NULL.
      
      Simplify this logic by allocating to local variables first. If any
      allocation fails, then free everything and exit. Only update the ice_ptp_tx
      structure if all allocations succeed.
      
      This ensures that we have no side effects on the Tx structure unless all
      allocations have succeeded. Thus, no code will see an invalid pointer and
      we don't need to re-assign NULL on cleanup.
      
      This is safe because kernel "free" functions are designed to be NULL safe
      and perform no action if passed a NULL pointer. Thus it's safe to simply
      always call kfree or bitmap_free even if one of those pointers was NULL.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
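The "allocate to locals, commit only on success" pattern described above looks roughly like this in userspace C. The struct layout, the `alloc_fn` hook, and `flaky_calloc` are assumptions added so the failure path can be exercised; the kernel code uses kcalloc, bitmap_zalloc, kfree, and bitmap_free instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocator hook so the failure path can be tested; hypothetical. */
typedef void *(*alloc_fn)(size_t n, size_t size);

/* Hypothetical, simplified tracker; not the real struct ice_ptp_tx. */
struct tx_tracker {
    void **tstamps;         /* per-slot request data */
    unsigned char *in_use;  /* bitmap of slots in use */
    unsigned char *extra;   /* stands in for the "future" third allocation */
    size_t len;
};

static bool tracker_alloc(struct tx_tracker *tx, size_t len, alloc_fn alloc)
{
    size_t bitmap_bytes = (len + 7) / 8;
    void **tstamps;
    unsigned char *in_use, *extra;

    assert(tx != NULL && alloc != NULL);

    /* Allocate into locals first so tx is untouched on failure. */
    tstamps = alloc(len, sizeof(*tstamps));
    in_use = alloc(bitmap_bytes, 1);
    extra = alloc(bitmap_bytes, 1);

    if (!tstamps || !in_use || !extra) {
        /* free(NULL) is a no-op, so free everything unconditionally. */
        free(tstamps);
        free(in_use);
        free(extra);
        return false;
    }

    tx->tstamps = tstamps;
    tx->in_use = in_use;
    tx->extra = extra;
    tx->len = len;
    return true;
}

/* Test double: fails on the second call, succeeds otherwise. */
static int alloc_calls;
static void *flaky_calloc(size_t n, size_t size)
{
    return (++alloc_calls == 2) ? NULL : calloc(n, size);
}
```

Since the structure is written only after every allocation has succeeded, callers never observe a half-initialized tracker and no pointer needs to be reset to NULL on the error path.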
    • ice: protect init and calibrating check in ice_ptp_request_ts · 3ad5c10b
      Committed by Jacob Keller
      When requesting a new timestamp, the ice_ptp_request_ts function does not
      hold the Tx tracker lock while checking init and calibrating. This means
      that we might issue a new timestamp request just after the Tx timestamp
      tracker starts being deinitialized. This could lead to incorrect access of
      the timestamp structures. Correct this by moving the init and calibrating
      checks under the lock, and updating the flows which modify these fields to
      use the lock.
      
      Note that we do not need to hold the lock while checking for tx->init in
      ice_ptp_tx_tstamp. This is because the teardown function will use
      synchronize_irq after clearing the flag to ensure that the threaded
      interrupt completes. Either a) the tx->init flag will be cleared before the
      ice_ptp_tx_tstamp function starts, thus it will exit immediately, or b) the
      threaded interrupt will be executing and the synchronize_irq will wait
      until the threaded interrupt has completed, at which point we know the init
      field has definitely been cleared and new interrupts will not execute the Tx
      timestamp thread function.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: synchronize the misc IRQ when tearing down Tx tracker · f0ae1240
      Committed by Jacob Keller
      Since commit 1229b339 ("ice: Add low latency Tx timestamp read") the
      ice driver has used a threaded IRQ for handling Tx timestamps. This change
      did not add a call to synchronize_irq during ice_ptp_release_tx_tracker.
      Thus it is possible that an interrupt could occur just as the tracker is
      being removed. This could lead to a use-after-free of the Tx tracker
      structure data.
      
      Fix this by calling synchronize_irq in ice_ptp_release_tx_tracker after
      we've cleared the init flag. In addition, make sure that we re-check the
      init flag at the end of ice_ptp_tx_tstamp before we exit ensuring that we
      will stop polling for new timestamps once the tracker de-initialization has
      begun.
      
      Refactor the ts_handled variable into "more_timestamps" so that we can
      simply directly assign this boolean instead of relying on an initialized
      value of true. This makes the new combined check easier to read.
      
      With this change, the ice_ptp_release_tx_tracker function will now wait for
      the threaded interrupt to complete if it was executing while the init flag
      was cleared.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: check Tx timestamp memory register for ready timestamps · 10e4b4a3
      Committed by Jacob Keller
      The PHY for E822 based hardware has a register which indicates which
      timestamps are valid in the PHY timestamp memory block. Each bit in the
      register indicates whether the associated index in the timestamp memory is
      valid.
      
      Hardware sets this bit when the timestamp is captured, and clears the bit
      when the timestamp is read. Use of this register is important as reading
      timestamp registers can impact the way that hardware generates timestamp
      interrupts.
      
      This occurs because the PHY has an internal value which is incremented
      when hardware captures a timestamp and decremented when software reads a
      timestamp. Reading a timestamp which is not marked as valid still decrements
      the internal value and can result in the Tx timestamp interrupt not
      triggering in the future.
      
      To prevent this, use the timestamp memory value to determine which
      timestamps are ready to be read. The ice_get_phy_tx_tstamp_ready function
      reads this value. For E810 devices, this just always returns with all bits
      set.
      
      Skip any timestamp which is not set in this bitmap, avoiding reading extra
      timestamps on E822 devices.
      
      The stale check against a cached timestamp value is no longer necessary for
      PHYs which support the timestamp ready bitmap properly. E810 devices still
      need this. Introduce a new verify_cached flag to the ice_ptp_tx structure.
      Use this to determine if we need to perform the verification against the
      cached timestamp value. Set this to 1 for the E810 Tx tracker init
      function. Notice that many of the fields in ice_ptp_tx are simple 1 bit
      flags. Save some structure space by using bitfields of length 1 for these
      values.
      
      Modify the ICE_PTP_TS_VALID check to simply drop the timestamp immediately
      so that in an event of getting such an invalid timestamp the driver does
      not attempt to re-read the timestamp again in a future poll of the
      register.
      
      With these changes, the driver now reads each timestamp register exactly
      once, and does not attempt any re-reads. This ensures the interrupt
      tracking logic in the PHY will not get stuck.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: handle discarding old Tx requests in ice_ptp_tx_tstamp · 0dd92862
      Committed by Jacob Keller
      Currently the driver uses the PTP kthread to process handling and
      discarding of stale Tx timestamp requests. The function
      ice_ptp_tx_tstamp_cleanup is used for this.
      
      A separate thread creates complications for the driver as we now have both
      the main Tx timestamp processing IRQ checking timestamps as well as the
      kthread.
      
      Rather than using the kthread to handle this, simply check for stale
      timestamps within the ice_ptp_tx_tstamp function. This function must
      already process the timestamps anyways.
      
      If a Tx timestamp has been waiting for 2 seconds we simply clear the bit
      and discard the SKB. This avoids the complication of having separate
      threads polling, reducing overall CPU work.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
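The two-second discard rule from the commit above can be expressed as a small check made while processing each outstanding request. The struct, the millisecond clock, and the function name are assumptions for illustration; the driver works with jiffies and SKBs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TS_TIMEOUT_MS 2000  /* "waiting for 2 seconds" from the commit text */

/* Hypothetical per-slot request record. */
struct ts_request {
    bool in_use;        /* request is still outstanding */
    uint64_t start_ms;  /* time the request was issued */
};

/* Called from the normal timestamp-processing path for each slot.
 * Returns true if the request timed out and was discarded. */
static bool discard_if_timed_out(struct ts_request *req, uint64_t now_ms)
{
    assert(req != NULL);

    if (!req->in_use)
        return false;
    if (now_ms - req->start_ms < TS_TIMEOUT_MS)
        return false;

    req->in_use = false;  /* stands in for clearing the bit and freeing the SKB */
    return true;
}
```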
    • ice: always call ice_ptp_link_change and make it void · 6b1ff5d3
      Committed by Jacob Keller
      The ice_ptp_link_change function is currently only called for E822 based
      hardware. Future changes are going to extend this function to perform
      additional tasks on link change.
      
      Always call this function, moving the E810 check from the callers down to
      just before we call the E822-specific function required to restart the PHY.
      
      This function also returns an error value, but none of the callers actually
      check it. In general, the errors it produces are more likely systemic
      problems such as invalid or corrupt port numbers. No caller checks these,
      and so no warning is logged.
      
      Re-order the flag checks so that ICE_FLAG_PTP is checked first. Drop the
      unnecessary check for ICE_FLAG_PTP_SUPPORTED, as ICE_FLAG_PTP will not be
      set except when ICE_FLAG_PTP_SUPPORTED is set.
      
      Convert the port checks to WARN_ON_ONCE, in order to generate a kernel
      stack trace when they are hit.
      
      Convert the function to void since no caller actually checks these return
      values.
      Co-developed-by: Dave Ertman <david.m.ertman@intel.com>
      Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: fix misuse of "link err" with "link status" · 11722c39
      Committed by Jacob Keller
      The ice_ptp_link_change function has a comment which mentions "link
      err" when referring to the current link status. We are storing the status
      of whether link is up or down, which is not an error.
      
      It appears that this use of err accidentally got included due to an
      overzealous search and replace when removing the ice_status enum and local
      status variable.
      
      Fix the wording to use the correct term.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Reset TS memory for all quads · 407b66c0
      Committed by Karol Kolacinski
      In E822 products, the owner PF should reset the timestamp memory for all
      quads, not only for the quad that contains its assigned lport.
      Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Remove the E822 vernier "bypass" logic · 0357d5ca
      Committed by Milena Olech
      The E822 devices support an extended "vernier" calibration which enables
      higher precision timestamps by accounting for delays in the PHY, and
      compensating for them. These delays are measured by hardware as part of its
      vernier calibration logic.
      
      The driver currently starts the PHY in "bypass" mode which skips
      the compensation. Then it later attempts to switch from bypass to vernier.
      This unfortunately does not work as expected. Instead of properly
      compensating for the delays, the hardware continues operating in bypass
      without the improved precision expected.
      
      Because we cannot dynamically switch between bypass and vernier mode,
      refactor the driver to always operate in vernier mode. This has a slight
      downside: Tx timestamp and Rx timestamp requests that occur as the very
      first packet set after link up will not complete properly and may be
      reported to applications as missing timestamps.
      
      This occurs frequently in test environments where traffic is light or
      targeted specifically at testing PTP. However, in practice most
      environments will have transmitted or received some data over the network
      before such initial requests are made.
      Signed-off-by: Milena Olech <milena.olech@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Use more generic names for ice_ptp_tx fields · 6b5cbc8c
      Committed by Sergey Temerkhanov
      Some supported devices have per-port timestamp memory blocks while
      others have shared ones within quads. Rename the struct ice_ptp_tx
      fields to reflect the block entities it works with.
      Signed-off-by: Sergey Temerkhanov <sergey.temerkhanov@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  2. 06 Dec, 2022 3 commits
  3. 01 Dec, 2022 5 commits
    • net: devlink: let the core report the driver name instead of the drivers · 226bf980
      Committed by Vincent Mailhol
      The driver name is available in device_driver::name. Right now,
      drivers still have to report this piece of information themselves in
      their devlink_ops::info_get callback function.
      
      To factor out this common code, make devlink_nl_info_fill() add the driver
      name attribute.
      
      Now that the core sets the driver name attribute, drivers are not
      supposed to call devlink_info_driver_name_put() anymore. Remove
      devlink_info_driver_name_put() and clean-up all the drivers using this
      function in their callback.
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Tested-by: Ido Schimmel <idosch@nvidia.com> # mlxsw
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: implement direct read for NVM and Shadow RAM regions · 3af4b40b
      Committed by Jacob Keller
      Implement the .read handler for the NVM and Shadow RAM regions. This
      enables user space to read a small chunk of the flash without needing the
      overhead of creating a full snapshot.
      
      Update the documentation for ice to detail which regions have direct read
      support.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: use same function to snapshot both NVM and Shadow RAM · ed23debe
      Committed by Jacob Keller
      The ice driver supports a region for both the flat NVM contents as well as
      the Shadow RAM which is a layer built on top of the flash during device
      initialization.
      
      These regions use an almost identical read function, except that the NVM
      needs to set the direct flag when reading, while Shadow RAM needs to read
      without the direct flag set. They each call ice_read_flat_nvm with the only
      difference being whether to set the direct flash flag.
      
      The NVM region read function also was fixed to read the NVM in blocks to
      avoid a situation where the firmware reclaims the lock due to taking too
      long.
      
      Note that the region snapshot function takes the ops pointer so the
      function can easily determine which region to read. Make use of this and
      re-use the NVM snapshot function for both the NVM and Shadow RAM regions.
      This makes Shadow RAM benefit from the same block approach as the NVM
      region. It also reduces code in the ice driver.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • igb: Allocate MSI-X vector when testing · 28e96556
      Committed by Akihiko Odaki
      Without this change, the interrupt test fails in an MSI-X environment:
      
      $ sudo ethtool -t enp0s2 offline
      [   43.921783] igb 0000:00:02.0: offline testing starting
      [   44.855824] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Down
      [   44.961249] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
      [   51.272202] igb 0000:00:02.0: testing shared interrupt
      [   56.996975] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
      The test result is FAIL
      The test extra info:
      Register test  (offline)	 0
      Eeprom test    (offline)	 0
      Interrupt test (offline)	 4
      Loopback test  (offline)	 0
      Link test   (on/offline)	 0
      
      Here, "4" means an expected interrupt was not delivered.
      
      To fix this, route IRQs correctly to the first MSI-X vector by setting
      IVAR_MISC. Also, set bit 0 of EIMS so that the vector will not be
      masked. The interrupt test now runs properly with this change:
      
      $ sudo ethtool -t enp0s2 offline
      [   42.762985] igb 0000:00:02.0: offline testing starting
      [   50.141967] igb 0000:00:02.0: testing shared interrupt
      [   56.163957] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
      The test result is PASS
      The test extra info:
      Register test  (offline)	 0
      Eeprom test    (offline)	 0
      Interrupt test (offline)	 0
      Loopback test  (offline)	 0
      Link test   (on/offline)	 0
      
      Fixes: 4eefa8f0 ("igb: add single vector msi-x testing to interrupt test")
      Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • e1000e: Fix TX dispatch condition · eed913f6
      Committed by Akihiko Odaki
      e1000_xmit_frame is expected to stop the queue and dispatch frames to
      hardware if there is not sufficient space for the next frame in the
      buffer, but sometimes it failed to do so because the estimated maximum
      size of a frame was wrong. As a consequence, the later invocation of
      e1000_xmit_frame failed with NETDEV_TX_BUSY, and the frame in the buffer
      remained forever, resulting in a watchdog failure.
      
      This change fixes the estimated size by making it match the
      condition for NETDEV_TX_BUSY. Apparently, the old estimation failed to
      account for the following lines, which determine the space requirement
      for not causing NETDEV_TX_BUSY:
          ```
          	/* reserve a descriptor for the offload context */
          	if ((mss) || (skb->ip_summed == CHECKSUM_PARTIAL))
          		count++;
          	count++;
      
          	count += DIV_ROUND_UP(len, adapter->tx_fifo_limit);
          ```
      
      This issue was found when running the http-stress02 test included in Linux
      Test Project 20220930 on QEMU with the following command line:
      ```
      qemu-system-x86_64 -M q35,accel=kvm -m 8G -smp 8
      	-drive if=virtio,format=raw,file=root.img,file.locking=on
      	-device e1000e,netdev=netdev
      	-netdev tap,script=ifup,downscript=no,id=netdev
      ```
      
      Fixes: bc7f75fa ("[E1000E]: New pci-express e1000 driver (currently for ICH9 devices only)")
      Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Tested-by: Naama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  4. 24 Nov, 2022 11 commits
  5. 22 Nov, 2022 1 commit
    • ice: fix handling of burst Tx timestamps · 30f15874
      Committed by Jacob Keller
      Commit 1229b339 ("ice: Add low latency Tx timestamp read") refactored
      PTP timestamping logic to use a threaded IRQ instead of a separate kthread.
      
      This implementation introduced ice_misc_intr_thread_fn and redefined the
      ice_ptp_process_ts function interface to return a value of whether or not
      the timestamp processing was complete.
      
      ice_misc_intr_thread_fn would take the return value from ice_ptp_process_ts
      and convert it into either IRQ_HANDLED if there were no more timestamps to
      be processed, or IRQ_WAKE_THREAD if the thread should continue processing.
      
      This is not correct, as the kernel does not re-schedule threaded IRQ
      functions automatically. IRQ_WAKE_THREAD can only be used by the main IRQ
      function.
      
      As a result, the ice_ptp_process_ts function (and in turn the
      ice_ptp_tx_tstamp function) is called exactly once per
      interrupt.
      
      If an application sends a burst of Tx timestamps without waiting for a
      response, the interrupt will trigger for the first timestamp. However,
      later timestamps may not have arrived yet. This can result in dropped or
      discarded timestamps. Worse, on E822 hardware this results in the interrupt
      logic getting stuck such that no future interrupts will be triggered. The
      result is complete loss of Tx timestamp functionality.
      
      Fix this by modifying the ice_misc_intr_thread_fn to perform its own
      polling of the ice_ptp_process_ts function. We sleep for a few microseconds
      between attempts to avoid wasting significant CPU time. The value was
      chosen to allow time for the Tx timestamps to complete without wasting so
      much time that we overrun application wait budgets in the worst case.
      
      The ice_ptp_process_ts function also currently returns false in the event
      that the Tx tracker is not initialized. This would result in the threaded
      IRQ handler never exiting if it gets started while the tracker is not
      initialized.
      
      Fix the function to appropriately return true when the tracker is not
      initialized.
      
      Note that this will not reproduce with default ptp4l behavior, as the
      program always synchronously waits for a timestamp response before sending
      another timestamp request.
      Reported-by: Siddaraju DH <siddaraju.dh@intel.com>
      Fixes: 1229b339 ("ice: Add low latency Tx timestamp read")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20221118222729.1565317-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
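The fix above replaces "return IRQ_WAKE_THREAD from the thread" with a loop inside the thread function. A userspace sketch of that loop follows; `process_fn`, `poll_until_done`, the `max_polls` safety bound, and `fake_process` are all assumptions made for illustration, and the short sleep between attempts is only indicated by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when no timestamps remain to be processed (hypothetical). */
typedef bool (*process_fn)(void *ctx);

/* Thread-function body: keep polling until processing reports completion.
 * max_polls is an assumed bound so the sketch cannot spin forever.
 * Returns the number of polls performed. */
static unsigned int poll_until_done(process_fn process, void *ctx,
                                    unsigned int max_polls)
{
    unsigned int polls = 0;

    assert(process != NULL);

    while (polls < max_polls) {
        polls++;
        if (process(ctx))
            break;
        /* A real handler would sleep for a few microseconds here. */
    }

    return polls;
}

/* Demo stand-in: each call handles one pending timestamp; an empty or
 * uninitialized tracker (zero pending) reports "done" immediately. */
static bool fake_process(void *ctx)
{
    unsigned int *pending = ctx;

    if (*pending > 0)
        (*pending)--;
    return *pending == 0;
}
```

Note how the zero-pending case returns true straight away, mirroring the second fix in the commit: an uninitialized tracker must report completion or the loop would never exit.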
  6. 19 Nov, 2022 4 commits
    • iavf: Fix race condition between iavf_shutdown and iavf_remove · a8417330
      Committed by Slawomir Laba
      Fix a deadlock introduced by commit
      97457801 ("iavf: Add waiting so the port is initialized in remove")
      due to a race condition between iavf_shutdown and iavf_remove, where
      iavf_remove gets stuck forever in a while loop because iavf_shutdown
      has already set the __IAVF_REMOVE adapter state.
      
      Fix this by checking whether __IAVF_IN_REMOVE_TASK has already been
      set and returning if so.
      
      Fixes: 97457801 ("iavf: Add waiting so the port is initialized in remove")
      Signed-off-by: Slawomir Laba <slawomirx.laba@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Marek Szlosek <marek.szlosek@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • S
      iavf: remove INITIAL_MAC_SET to allow gARP to work properly · bb861c14
      Stefan Assmann committed
      IAVF_FLAG_INITIAL_MAC_SET prevents waiting on iavf_is_mac_set_handled()
      the first time the MAC is set. This breaks gratuitous ARP because the
      MAC address has not been updated yet when the gARP packet is sent out.
      
      Current behaviour:
      $ echo 1 > /sys/class/net/ens4f0/device/sriov_numvfs
      iavf 0000:88:02.0: MAC address: ee:04:19:14:ec:ea
      $ ip addr add 192.168.1.1/24 dev ens4f0v0
      $ ip link set dev ens4f0v0 up
      $ echo 1 > /proc/sys/net/ipv4/conf/ens4f0v0/arp_notify
      $ ip link set ens4f0v0 addr 00:11:22:33:44:55
      07:23:41.676611 ee:04:19:14:ec:ea > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.1 tell 192.168.1.1, length 28
      
      With IAVF_FLAG_INITIAL_MAC_SET removed:
      $ echo 1 > /sys/class/net/ens4f0/device/sriov_numvfs
      iavf 0000:88:02.0: MAC address: 3e:8a:16:a2:37:6d
      $ ip addr add 192.168.1.1/24 dev ens4f0v0
      $ ip link set dev ens4f0v0 up
      $ echo 1 > /proc/sys/net/ipv4/conf/ens4f0v0/arp_notify
      $ ip link set ens4f0v0 addr 00:11:22:33:44:55
      07:28:01.836608 00:11:22:33:44:55 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.1 tell 192.168.1.1, length 28
      
      Fixes: 35a2443d ("iavf: Add waiting for response from PF in set mac")
      Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      bb861c14
    • I
      iavf: Do not restart Tx queues after reset task failure · 08f1c147
      Ivan Vecera committed
      After commit aa626da9 ("iavf: Detach device during reset task")
      the device is detached during reset task and re-attached at its end.
      The problem occurs when reset task fails because Tx queues are
      restarted during device re-attach and this leads later to a crash.
      
      To resolve this issue, properly close the net device in case of
      failure in the reset task to avoid restarting Tx queues at the end.
      Also replace the hacky manipulation of the IFF_UP flag with a device
      close that properly clears both the IFF_UP and __LINK_STATE_START flags.
      In this case iavf_close() does not do anything because the adapter
      state is already __IAVF_DOWN.
      
      Reproducer:
      1) Run some Tx traffic (e.g. iperf3) over iavf interface
      2) Set VF trusted / untrusted in loop
      
      [root@host ~]# cat repro.sh
      
      PF=enp65s0f0
      IF=${PF}v0
      
      ip link set up $IF
      ip addr add 192.168.0.2/24 dev $IF
      sleep 1
      
      iperf3 -c 192.168.0.1 -t 600 --logfile /dev/null &
      sleep 2
      
      while :; do
              ip link set $PF vf 0 trust on
              ip link set $PF vf 0 trust off
      done
      [root@host ~]# ./repro.sh
      
      Result:
      [ 2006.650969] iavf 0000:41:01.0: Failed to init adminq: -53
      [ 2006.675662] ice 0000:41:00.0: VF 0 is now trusted
      [ 2006.689997] iavf 0000:41:01.0: Reset task did not complete, VF disabled
      [ 2006.696611] iavf 0000:41:01.0: failed to allocate resources during reinit
      [ 2006.703209] ice 0000:41:00.0: VF 0 is now untrusted
      [ 2006.737011] ice 0000:41:00.0: VF 0 is now trusted
      [ 2006.764536] ice 0000:41:00.0: VF 0 is now untrusted
      [ 2006.768919] BUG: kernel NULL pointer dereference, address: 0000000000000b4a
      [ 2006.776358] #PF: supervisor read access in kernel mode
      [ 2006.781488] #PF: error_code(0x0000) - not-present page
      [ 2006.786620] PGD 0 P4D 0
      [ 2006.789152] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [ 2006.792903] ice 0000:41:00.0: VF 0 is now trusted
      [ 2006.793501] CPU: 4 PID: 0 Comm: swapper/4 Kdump: loaded Not tainted 6.1.0-rc3+ #2
      [ 2006.805668] Hardware name: Abacus electric, s.r.o. - servis@abacus.cz Super Server/H12SSW-iN, BIOS 2.4 04/13/2022
      [ 2006.815915] RIP: 0010:iavf_xmit_frame_ring+0x96/0xf70 [iavf]
      [ 2006.821028] ice 0000:41:00.0: VF 0 is now untrusted
      [ 2006.821572] Code: 48 83 c1 04 48 c1 e1 04 48 01 f9 48 83 c0 10 6b 50 f8 55 c1 ea 14 45 8d 64 14 01 48 39 c8 75 eb 41 83 fc 07 0f 8f e9 08 00 00 <0f> b7 45 4a 0f b7 55 48 41 8d 74 24 05 31 c9 66 39 d0 0f 86 da 00
      [ 2006.845181] RSP: 0018:ffffb253004bc9e8 EFLAGS: 00010293
      [ 2006.850397] RAX: ffff9d154de45b00 RBX: ffff9d15497d52e8 RCX: ffff9d154de45b00
      [ 2006.856327] ice 0000:41:00.0: VF 0 is now trusted
      [ 2006.857523] RDX: 0000000000000000 RSI: 00000000000005a8 RDI: ffff9d154de45ac0
      [ 2006.857525] RBP: 0000000000000b00 R08: ffff9d159cb010ac R09: 0000000000000001
      [ 2006.857526] R10: ffff9d154de45940 R11: 0000000000000000 R12: 0000000000000002
      [ 2006.883600] R13: ffff9d1770838dc0 R14: 0000000000000000 R15: ffffffffc07b8380
      [ 2006.885840] ice 0000:41:00.0: VF 0 is now untrusted
      [ 2006.890725] FS:  0000000000000000(0000) GS:ffff9d248e900000(0000) knlGS:0000000000000000
      [ 2006.890727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 2006.909419] CR2: 0000000000000b4a CR3: 0000000c39c10002 CR4: 0000000000770ee0
      [ 2006.916543] PKRU: 55555554
      [ 2006.918254] ice 0000:41:00.0: VF 0 is now trusted
      [ 2006.919248] Call Trace:
      [ 2006.919250]  <IRQ>
      [ 2006.919252]  dev_hard_start_xmit+0x9e/0x1f0
      [ 2006.932587]  sch_direct_xmit+0xa0/0x370
      [ 2006.936424]  __dev_queue_xmit+0x7af/0xd00
      [ 2006.940429]  ip_finish_output2+0x26c/0x540
      [ 2006.944519]  ip_output+0x71/0x110
      [ 2006.947831]  ? __ip_finish_output+0x2b0/0x2b0
      [ 2006.952180]  __ip_queue_xmit+0x16d/0x400
      [ 2006.952721] ice 0000:41:00.0: VF 0 is now untrusted
      [ 2006.956098]  __tcp_transmit_skb+0xa96/0xbf0
      [ 2006.965148]  __tcp_retransmit_skb+0x174/0x860
      [ 2006.969499]  ? cubictcp_cwnd_event+0x40/0x40
      [ 2006.973769]  tcp_retransmit_skb+0x14/0xb0
      ...
      
      Fixes: aa626da9 ("iavf: Detach device during reset task")
      Cc: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Patryk Piotrowski <patryk.piotrowski@intel.com>
      Cc: SlawomirX Laba <slawomirx.laba@intel.com>
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      08f1c147
    • I
      iavf: Fix a crash during reset task · c678669d
      Ivan Vecera committed
      Recent commit aa626da9 ("iavf: Detach device during reset task")
      removed netif_tx_stop_all_queues() with an assumption that Tx queues
      are already stopped by netif_device_detach() in the beginning of
      reset task. This assumption is incorrect because during reset
      task a potential link event can start Tx queues again.
      Revert this change to fix this issue.
      
      Reproducer:
      1. Run some Tx traffic (e.g. iperf3) over iavf interface
      2. Switch MTU of this interface in a loop
      
      [root@host ~]# cat repro.sh
      
      IF=enp2s0f0v0
      
      iperf3 -c 192.168.0.1 -t 600 --logfile /dev/null &
      sleep 2
      
      while :; do
              for i in 1280 1500 2000 900 ; do
                      ip link set $IF mtu $i
                      sleep 2
              done
      done
      [root@host ~]# ./repro.sh
      
      Result:
      [  306.199917] iavf 0000:02:02.0 enp2s0f0v0: NIC Link is Up Speed is 40 Gbps Full Duplex
      [  308.205944] iavf 0000:02:02.0 enp2s0f0v0: NIC Link is Up Speed is 40 Gbps Full Duplex
      [  310.103223] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [  310.110179] #PF: supervisor write access in kernel mode
      [  310.115396] #PF: error_code(0x0002) - not-present page
      [  310.120526] PGD 0 P4D 0
      [  310.123057] Oops: 0002 [#1] PREEMPT SMP NOPTI
      [  310.127408] CPU: 24 PID: 183 Comm: kworker/u64:9 Kdump: loaded Not tainted 6.1.0-rc3+ #2
      [  310.135485] Hardware name: Abacus electric, s.r.o. - servis@abacus.cz Super Server/H12SSW-iN, BIOS 2.4 04/13/2022
      [  310.145728] Workqueue: iavf iavf_reset_task [iavf]
      [  310.150520] RIP: 0010:iavf_xmit_frame_ring+0xd1/0xf70 [iavf]
      [  310.156180] Code: d0 0f 86 da 00 00 00 83 e8 01 0f b7 fa 29 f8 01 c8 39 c6 0f 8f a0 08 00 00 48 8b 45 20 48 8d 14 92 bf 01 00 00 00 4c 8d 3c d0 <49> 89 5f 08 8b 43 70 66 41 89 7f 14 41 89 47 10 f6 83 82 00 00 00
      [  310.174918] RSP: 0018:ffffbb5f0082caa0 EFLAGS: 00010293
      [  310.180137] RAX: 0000000000000000 RBX: ffff92345471a6e8 RCX: 0000000000000200
      [  310.187259] RDX: 0000000000000000 RSI: 000000000000000d RDI: 0000000000000001
      [  310.194385] RBP: ffff92341d249000 R08: ffff92434987fcac R09: 0000000000000001
      [  310.201509] R10: 0000000011f683b9 R11: 0000000011f50641 R12: 0000000000000008
      [  310.208631] R13: ffff923447500000 R14: 0000000000000000 R15: 0000000000000000
      [  310.215756] FS:  0000000000000000(0000) GS:ffff92434ee00000(0000) knlGS:0000000000000000
      [  310.223835] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  310.229572] CR2: 0000000000000008 CR3: 0000000fbc210004 CR4: 0000000000770ee0
      [  310.236696] PKRU: 55555554
      [  310.239399] Call Trace:
      [  310.241844]  <IRQ>
      [  310.243855]  ? dst_alloc+0x5b/0xb0
      [  310.247260]  dev_hard_start_xmit+0x9e/0x1f0
      [  310.251439]  sch_direct_xmit+0xa0/0x370
      [  310.255276]  __qdisc_run+0x13e/0x580
      [  310.258848]  __dev_queue_xmit+0x431/0xd00
      [  310.262851]  ? selinux_ip_postroute+0x147/0x3f0
      [  310.267377]  ip_finish_output2+0x26c/0x540
      
      Fixes: aa626da9 ("iavf: Detach device during reset task")
      Cc: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Patryk Piotrowski <patryk.piotrowski@intel.com>
      Cc: SlawomirX Laba <slawomirx.laba@intel.com>
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      c678669d
  7. 18 Nov, 2022 2 commits
    • M
      ice: Prevent ADQ, DCB coexistence with Custom Tx scheduler · 80fe30a8
      Michal Wilczynski committed
      ADQ and DCB might interfere with custom Tx scheduler changes that the
      user introduces through the devlink-rate API.
      
      Check whether ADQ or DCB is active when the user tries to change any
      setting in the exported Tx scheduler tree. If either is active, block
      the user from doing so and log an appropriate message.
      
      Remove the exported hierarchy if the user enables ADQ or DCB.
      Prevent ADQ or DCB from being configured if the user has already made
      changes using the devlink-rate API.
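
The two-way exclusion described above can be sketched as a pair of checks. This is an illustrative model only: `struct port_state`, `rate_change_allowed` and `adq_dcb_enable_allowed` are hypothetical names, and the choice of `-EBUSY` as the error code is an assumption rather than something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical feature state for one port. */
struct port_state {
    bool adq_active;
    bool dcb_active;
    bool custom_tx_tree;   /* user changed the tree via devlink-rate */
};

/* Reject devlink-rate changes while ADQ or DCB is active. */
static int rate_change_allowed(const struct port_state *ps)
{
    if (ps->adq_active || ps->dcb_active)
        return -EBUSY;     /* assumed error code */
    return 0;
}

/* Reject enabling ADQ or DCB once the user has customized the tree. */
static int adq_dcb_enable_allowed(const struct port_state *ps)
{
    if (ps->custom_tx_tree)
        return -EBUSY;     /* assumed error code */
    return 0;
}
```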
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      80fe30a8
    • M
      ice: Implement devlink-rate API · 42c2eb6b
      Michal Wilczynski committed
      There is a need to support modification of the Tx scheduler tree in
      the ice driver. This allows the user to control the Tx settings of each
      node in the internal hierarchy of nodes. As a result, the user can use
      hierarchical QoS implemented entirely in hardware.
      
      This patch implements the devlink-rate API. It also exports the initial
      default hierarchy, mostly because the tree can't be removed entirely;
      all we can do is let the user modify it. For example, the root node
      should never be removed, and nodes that have children are also
      off-limits.
      
      Example initial tree with 2 VF's:
      
      [root@fedora ~]# devlink port function rate show
      
      pci/0000:4b:00.0/node_27: type node parent node_26
      pci/0000:4b:00.0/node_26: type node parent node_0
      pci/0000:4b:00.0/node_34: type node parent node_33
      pci/0000:4b:00.0/node_33: type node parent node_32
      pci/0000:4b:00.0/node_32: type node parent node_16
      pci/0000:4b:00.0/node_19: type node parent node_18
      pci/0000:4b:00.0/node_18: type node parent node_17
      pci/0000:4b:00.0/node_17: type node parent node_16
      pci/0000:4b:00.0/node_21: type node parent node_20
      pci/0000:4b:00.0/node_20: type node parent node_3
      pci/0000:4b:00.0/node_14: type node parent node_5
      pci/0000:4b:00.0/node_5: type node parent node_3
      pci/0000:4b:00.0/node_13: type node parent node_4
      pci/0000:4b:00.0/node_12: type node parent node_4
      pci/0000:4b:00.0/node_11: type node parent node_4
      pci/0000:4b:00.0/node_10: type node parent node_4
      pci/0000:4b:00.0/node_9: type node parent node_4
      pci/0000:4b:00.0/node_8: type node parent node_4
      pci/0000:4b:00.0/node_7: type node parent node_4
      pci/0000:4b:00.0/node_6: type node parent node_4
      pci/0000:4b:00.0/node_4: type node parent node_3
      pci/0000:4b:00.0/node_3: type node parent node_16
      pci/0000:4b:00.0/node_16: type node parent node_15
      pci/0000:4b:00.0/node_15: type node parent node_0
      pci/0000:4b:00.0/node_2: type node parent node_1
      pci/0000:4b:00.0/node_1: type node parent node_0
      pci/0000:4b:00.0/node_0: type node
      pci/0000:4b:00.0/1: type leaf parent node_27
      pci/0000:4b:00.0/2: type leaf parent node_27
      
      Let me visualize part of the tree:
      
                          +---------+
                          |  node_0 |
                          +---------+
                               |
                          +----v----+
                          | node_26 |
                          +----+----+
                               |
                          +----v----+
                          | node_27 |
                          +----+----+
                               |
                      |-----------------|
                 +----v----+       +----v----+
                 |   VF 1  |       |   VF 2  |
                 +----+----+       +----+----+
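
The same branch can be recovered from the listing by following the parent links. The sketch below is a small user-space illustration; `struct rate_node`, `find_node` and `path_to_root` are hypothetical names, and the table holds only the rows of the listing that belong to the drawn branch.

```c
#include <stdio.h>
#include <string.h>

/* One row of the "rate show" listing: a node and its parent. */
struct rate_node {
    const char *name;
    const char *parent;   /* NULL for the root */
};

/* Subset of the listing that covers the drawn branch. */
static const struct rate_node nodes[] = {
    { "node_0",  NULL      },
    { "node_26", "node_0"  },
    { "node_27", "node_26" },
    { "1",       "node_27" },
    { "2",       "node_27" },
};

static const struct rate_node *find_node(const char *name)
{
    for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++)
        if (strcmp(nodes[i].name, name) == 0)
            return &nodes[i];
    return NULL;
}

/*
 * Write the path from a node up to the root into buf, separated by
 * " -> ". Returns 0 on success, -1 if the node is unknown or the
 * buffer is too small.
 */
static int path_to_root(const char *name, char *buf, size_t len)
{
    const struct rate_node *n = find_node(name);
    size_t used = 0;

    if (!n || len == 0)
        return -1;
    buf[0] = '\0';
    while (n) {
        int w = snprintf(buf + used, len - used, "%s%s",
                         used ? " -> " : "", n->name);
        if (w < 0 || (size_t)w >= len - used)
            return -1;    /* buffer too small */
        used += (size_t)w;
        n = n->parent ? find_node(n->parent) : NULL;
    }
    return 0;
}
```

Calling `path_to_root("1", buf, sizeof(buf))` with a 64-byte buffer fills it with the chain from VF 1 through node_27 and node_26 to node_0, matching the diagram.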
      
      At this point there are a couple of things that can be done.
      For example, we could assign parameters only to the VFs.
      
      [root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
                       tx_max 5Gbps
      
      This would cap the VF 1 BW to 5Gbps.
      
      But let's say you would like to create a completely new branch.
      This can be done like this:
      
      [root@fedora ~]# devlink port function rate add \
                       pci/0000:4b:00.0/node_custom parent node_0
      [root@fedora ~]# devlink port function rate add \
                       pci/0000:4b:00.0/node_custom_1 parent node_custom
      [root@fedora ~]# devlink port function rate set \
                       pci/0000:4b:00.0/1 parent node_custom_1
      
      This creates a completely new branch and reassigns VF 1 to it.
      
      Each node supports a number of parameters: tx_max, tx_share,
      tx_priority and tx_weight.
      Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      42c2eb6b