1. 07 Jan 2022, 1 commit
    • ice: improve switchdev's slow-path · c1e5da5d
      Submitted by Wojciech Drewek
      In the current switchdev implementation, every VF PR is assigned to
      an individual ring on the switchdev ctrl VSI. For slow-path traffic,
      the VF->ring mapping is done in software based on the src_vsi value
      (by calling the ice_eswitch_get_target_netdev function).
      
      With this change, a more efficient HW solution is introduced: for
      each VF, a src MAC (VF's MAC) filter is created which forwards
      packets to the corresponding switchdev ctrl VSI queue based on the
      src MAC address.
      
      This filter has to be removed and then replayed when a single VF is
      reset. Keep information about this rule in repr->mac_rule; thanks to
      that, we know which rule has to be removed and replayed for a given
      VF (see the sketch after this entry).
      
      In case of a CORE/GLOBAL reset, all rules are removed
      automatically. We have to take care of re-adding them. This is done
      by ice_replay_vsi_adv_rule.
      
      When the driver leaves switchdev mode, remove all advanced rules
      from the switchdev ctrl VSI. This is done by ice_rem_adv_rule_for_vsi.
      
      The repr->rule_added flag is needed because in some cases a reset
      might be triggered before the VF sends a request to add a MAC.
      Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
      Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
      Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
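      A minimal sketch of the per-VF bookkeeping described above; the struct
      and field names follow the commit text but are illustrative, not the
      exact driver layout:

          /* state kept per VF port representor (sketch) */
          struct ice_repr_sketch {
                  /* identifies the src-MAC rule so it can be removed and
                   * replayed across a VF reset */
                  struct ice_rule_query_data mac_rule;
                  /* set only after the VF has requested its MAC; a reset
                   * may fire before that request arrives */
                  u8 rule_added:1;
          };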
  2. 30 Dec 2021, 1 commit
  3. 23 Dec 2021, 1 commit
  4. 22 Dec 2021, 2 commits
  5. 16 Dec 2021, 2 commits
    • ice: tighter control over VSI_DOWN state · 21c6e36b
      Submitted by Jesse Brandeburg
      The driver had comments to the effect of: this flag should be set
      before calling this function. While reviewing the code it was found
      that there were several violations of this policy, which could
      introduce hard-to-find bugs or races.
      
      Fix the violations of the "VSI DOWN state must be set before calling
      ice_down" policy and turn the state check into code with a WARN_ON
      (see the sketch after this entry).
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
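      A minimal sketch of the enforced invariant, assuming the bitmap-style
      vsi->state used by the driver; the exact placement of the check in
      ice_down() may differ:

          static int ice_down_sketch(struct ice_vsi *vsi)
          {
                  /* callers must set ICE_VSI_DOWN first; WARN_ON turns a
                   * silent policy violation into a visible stack trace */
                  WARN_ON(!test_bit(ICE_VSI_DOWN, vsi->state));

                  /* ... proceed with queue and IRQ teardown ... */
                  return 0;
          }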
    • ice: support immediate firmware activation via devlink reload · 399e27db
      Submitted by Jacob Keller
      The ice hardware contains an embedded chip with firmware which can be
      updated using devlink flash. The firmware which runs on this chip is
      referred to as the Embedded Management Processor firmware (EMP
      firmware).
      
      Activating the new firmware image currently requires that the system be
      rebooted. This is not ideal as rebooting the system can cause unwanted
      downtime.
      
      In practical terms, activating the firmware does not always require a
      full system reboot. In many cases it is possible to activate the EMP
      firmware immediately. There are a couple of different scenarios to
      cover.
      
       * The EMP firmware itself can be reloaded by issuing a special update
         to the device called an Embedded Management Processor reset (EMP
         reset). This reset causes the device to reset and reload the EMP
         firmware.
      
       * PCI configuration changes are only reloaded after a cold PCIe reset.
         Unfortunately there is no generic way to trigger this for a PCIe
         device without a system reboot.
      
      When performing a flash update, firmware is capable of responding with
      some information about the specific update requirements.
      
      The driver updates the flash by programming a secondary inactive bank
      with the contents of the new image, and then issuing a command that
      requests switching the active bank starting from the next load.
      
      The response to the final command for updating the inactive NVM flash
      bank includes an indication of the minimum reset required to fully
      update the device. This can be one of the following:
      
       * A full power on is required
       * A cold PCIe reset is required
       * An EMP reset is required
      
      The response to the command to switch flash banks includes an indication
      of whether or not the firmware will allow an EMP reset request.
      
      For most updates, an EMP reset is sufficient to load the new EMP
      firmware without issues. In some cases, this reset is not sufficient
      because the PCI configuration space has changed. When this could cause
      incompatibility with the new EMP image, the firmware is capable of
      rejecting the EMP reset request.
      
      Add logic to ice_fw_update.c to handle the response data of the flash
      update AdminQ commands.
      
      For the reset level, issue a devlink status notification informing the
      user of how to complete the update with a simple suggestion like
      "Activate new firmware by rebooting the system".
      
      Cache the status of whether or not firmware will restrict the EMP reset
      for use in implementing devlink reload.
      
      Implement support for devlink reload with the "fw_activate" flag. This
      allows user space to request the firmware be activated immediately.
      
      For the .reload_down handler, we will issue a request for the EMP reset
      using the appropriate firmware AdminQ command (a sketch follows this
      entry). If we know that the firmware will not allow an EMP reset, simply
      exit with a suitable netlink extended ACK message indicating that the
      EMP reset is not available.
      
      For the .reload_up handler, simply wait until the driver has finished
      resetting. Logic to handle processing of an EMP reset already exists in
      the driver as part of its reset and rebuild flows.
      
      Note that support for indicating the required reset and the EMP reset
      restriction is not available on old versions of firmware. The driver can
      determine if the two features are supported by checking the device
      capabilities report. I confirmed support has existed since at least
      version 5.5.2, as reported by the 'fw.mgmt' version. Support for issuing
      the EMP reset request has existed in all versions of the EMP firmware
      for the ice hardware.
      
      Check the device capabilities report to determine whether or not the
      indications are reported by the running firmware. If the reset
      requirement indication is not supported, always assume a full power on
      is necessary. If the reset restriction capability is not supported,
      always assume the EMP reset is available.
      
      Users can verify if the EMP reset has activated the firmware by using
      the devlink info report to check that the 'running' firmware version has
      updated. For example a user might do the following:
      
       # Check current version
       $ devlink dev info
      
       # Update the device
       $ devlink dev flash pci/0000:af:00.0 file firmware.bin
      
       # Confirm stored version updated
       $ devlink dev info
      
       # Reload to activate new firmware
       $ devlink dev reload pci/0000:af:00.0 action fw_activate
      
       # Confirm running version updated
       $ devlink dev info
      
      Finally, this change does *not* implement basic driver-only reload
      support. I did look into trying to do this. However, it requires a
      significant refactor of how the ice driver probes and loads everything.
      The ice driver probe and allocation flows were not designed with such
      a reload in mind. Refactoring the flow to support this is beyond the
      scope of this change.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
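      A minimal sketch of the .reload_down flow described above, using the
      kernel's devlink callback signature; the PF field and the helper that
      issues the AdminQ command are hypothetical stand-ins:

          static int
          ice_reload_down_sketch(struct devlink *devlink, bool netns_change,
                                 enum devlink_reload_action action,
                                 enum devlink_reload_limit limit,
                                 struct netlink_ext_ack *extack)
          {
                  struct ice_pf *pf = devlink_priv(devlink);

                  /* firmware indicated at bank-switch time that it would
                   * reject an EMP reset for this image */
                  if (pf->fw_emp_reset_disabled) {        /* assumed field */
                          NL_SET_ERR_MSG_MOD(extack,
                                             "EMP reset not available, reboot to activate firmware");
                          return -ECANCELED;
                  }

                  /* hypothetical helper issuing the EMP reset AdminQ command */
                  return ice_request_emp_reset(pf);
          }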
  6. 15 Dec 2021, 8 commits
  7. 09 Dec 2021, 1 commit
  8. 08 Dec 2021, 1 commit
  9. 23 Nov 2021, 2 commits
    • net/ice: Add support for enable_iwarp and enable_roce devlink param · e523af4e
      Submitted by Shiraz Saleem
      Allow support for the 'enable_iwarp' and 'enable_roce' devlink params to
      turn on/off iWARP or RoCE protocol support for E800 devices.
      
      For example, a user can turn on iWARP functionality with:
      
      devlink dev param set pci/0000:07:00.0 name enable_iwarp value true cmode runtime
      
      This adds an iWARP auxiliary rdma device, ice.iwarp.<>, under this PF.
      
      A user request to enable both iWARP and RoCE under the same PF is
      rejected, since this device does not support both protocols
      simultaneously on the same port (a sketch of the check follows this
      entry).
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Tested-by: Leszek Kaliszczuk <leszek.kaliszczuk@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
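      A minimal sketch of the mutual-exclusion check, using the devlink param
      validate callback signature; the helper probing the currently enabled
      protocol is a hypothetical stand-in:

          static int
          ice_enable_roce_validate_sketch(struct devlink *devlink, u32 id,
                                          union devlink_param_value val,
                                          struct netlink_ext_ack *extack)
          {
                  struct ice_pf *pf = devlink_priv(devlink);

                  /* E800 cannot run iWARP and RoCE on the same port at once */
                  if (val.vbool && ice_is_iwarp_enabled(pf)) {  /* hypothetical */
                          NL_SET_ERR_MSG_MOD(extack,
                                             "iWARP is enabled; disable it before enabling RoCE");
                          return -EOPNOTSUPP;
                  }

                  return 0;
          }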
    • ice: avoid bpf_prog refcount underflow · f65ee535
      Submitted by Marta Plantykow
      The ice driver has routines for managing XDP resources that are shared
      between the ndo_bpf op and the VSI rebuild flow. The latter takes place,
      for example, when the user changes the queue count on an interface via
      ethtool's set_channels().
      
      There is an issue around bpf_prog refcounting when the VSI is being
      rebuilt: since ice_prepare_xdp_rings() is called with vsi->xdp_prog as
      an argument that is later used by ice_vsi_assign_bpf_prog(), the same
      bpf_prog pointer is swapped with itself. It is then also interpreted as
      an 'old_prog', which in turn causes us to call bpf_prog_put() on it,
      decrementing its refcount.
      
      The splat below shows that, because the bpf_prog's refcount dropped to
      zero, it was wiped from the system while the kernel still tried to
      refer to it:
      
      [  481.069429] BUG: unable to handle page fault for address: ffffc9000640f038
      [  481.077390] #PF: supervisor read access in kernel mode
      [  481.083335] #PF: error_code(0x0000) - not-present page
      [  481.089276] PGD 100000067 P4D 100000067 PUD 1001cb067 PMD 106d2b067 PTE 0
      [  481.097141] Oops: 0000 [#1] PREEMPT SMP PTI
      [  481.101980] CPU: 12 PID: 3339 Comm: sudo Tainted: G           OE     5.15.0-rc5+ #1
      [  481.110840] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [  481.122021] RIP: 0010:dev_xdp_prog_id+0x25/0x40
      [  481.127265] Code: 80 00 00 00 00 0f 1f 44 00 00 89 f6 48 c1 e6 04 48 01 fe 48 8b 86 98 08 00 00 48 85 c0 74 13 48 8b 50 18 31 c0 48 85 d2 74 07 <48> 8b 42 38 8b 40 20 c3 48 8b 96 90 08 00 00 eb e8 66 2e 0f 1f 84
      [  481.148991] RSP: 0018:ffffc90007b63868 EFLAGS: 00010286
      [  481.155034] RAX: 0000000000000000 RBX: ffff889080824000 RCX: 0000000000000000
      [  481.163278] RDX: ffffc9000640f000 RSI: ffff889080824010 RDI: ffff889080824000
      [  481.171527] RBP: ffff888107af7d00 R08: 0000000000000000 R09: ffff88810db5f6e0
      [  481.179776] R10: 0000000000000000 R11: ffff8890885b9988 R12: ffff88810db5f4bc
      [  481.188026] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [  481.196276] FS:  00007f5466d5bec0(0000) GS:ffff88903fb00000(0000) knlGS:0000000000000000
      [  481.205633] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  481.212279] CR2: ffffc9000640f038 CR3: 000000014429c006 CR4: 00000000003706e0
      [  481.220530] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  481.228771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  481.237029] Call Trace:
      [  481.239856]  rtnl_fill_ifinfo+0x768/0x12e0
      [  481.244602]  rtnl_dump_ifinfo+0x525/0x650
      [  481.249246]  ? __alloc_skb+0xa5/0x280
      [  481.253484]  netlink_dump+0x168/0x3c0
      [  481.257725]  netlink_recvmsg+0x21e/0x3e0
      [  481.262263]  ____sys_recvmsg+0x87/0x170
      [  481.266707]  ? __might_fault+0x20/0x30
      [  481.271046]  ? _copy_from_user+0x66/0xa0
      [  481.275591]  ? iovec_from_user+0xf6/0x1c0
      [  481.280226]  ___sys_recvmsg+0x82/0x100
      [  481.284566]  ? sock_sendmsg+0x5e/0x60
      [  481.288791]  ? __sys_sendto+0xee/0x150
      [  481.293129]  __sys_recvmsg+0x56/0xa0
      [  481.297267]  do_syscall_64+0x3b/0xc0
      [  481.301395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  481.307238] RIP: 0033:0x7f5466f39617
      [  481.311373] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bd 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  481.342944] RSP: 002b:00007ffedc7f4308 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
      [  481.361783] RAX: ffffffffffffffda RBX: 00007ffedc7f5460 RCX: 00007f5466f39617
      [  481.380278] RDX: 0000000000000000 RSI: 00007ffedc7f5360 RDI: 0000000000000003
      [  481.398500] RBP: 00007ffedc7f53f0 R08: 0000000000000000 R09: 000055d556f04d50
      [  481.416463] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffedc7f5360
      [  481.434131] R13: 00007ffedc7f5350 R14: 00007ffedc7f5344 R15: 0000000000000e98
      [  481.451520] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mxm_wmi mei_me coretemp mei ipmi_si ipmi_msghandler wmi acpi_pad acpi_power_meter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ahci crypto_simd cryptd libahci lpc_ich [last unloaded: ice]
      [  481.528558] CR2: ffffc9000640f038
      [  481.542041] ---[ end trace d1f24c9ecf5b61c1 ]---
      
      Fix this by only calling ice_vsi_assign_bpf_prog() inside
      ice_prepare_xdp_rings() when the current vsi->xdp_prog pointer is NULL
      (see the sketch after this entry). This way the set_channels() flow will
      not attempt to swap the vsi->xdp_prog pointer with itself.
      
      Also, sprinkle in some comments that explain how the driver and the
      kernel interact with respect to the bpf_prog refcount.
      
      Fixes: efc2214b ("ice: Add support for XDP")
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Marta Plantykow <marta.a.plantykow@intel.com>
      Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
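      A minimal sketch of the fix; the guard sits inside
      ice_prepare_xdp_rings(), which performs far more setup than shown:

          /* vsi->xdp_prog is non-NULL only on the rebuild path, where the
           * same prog is passed back in; assigning it again would treat it
           * as an "old" prog and bpf_prog_put() a reference we still hold */
          if (!vsi->xdp_prog)
                  ice_vsi_assign_bpf_prog(vsi, prog);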
  10. 30 Oct 2021, 1 commit
  11. 29 Oct 2021, 3 commits
    • ice: Add support to print error on PHY FW load failure · 99d40752
      Submitted by Brett Creeley
      Some devices support loading the PHY FW, and in some cases this can
      fail. When it fails, the FW sets the corresponding bit in the link info
      structure. Also, the FW sends a link event if the correct link event
      mask bit is set. Add support for printing an error message when the PHY
      FW load fails during any link configuration flow and during the link
      event flow.
      
      Since ice_check_module_power() is already doing something very similar,
      add a new function ice_check_link_cfg_err() so any failures reported in
      the link info's link_cfg_err member can be printed in this one function
      (a sketch follows this entry).
      
      Also, add the new ICE_FLAG_PHY_FW_LOAD_FAILED bit to the PF's flags so
      we don't constantly print this error message during link polling if the
      value never changes.
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
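      A minimal sketch of the consolidated check; the AQ bit name and the
      exact message are assumptions based on the commit text:

          static void
          ice_check_link_cfg_err_sketch(struct ice_pf *pf, u8 link_cfg_err)
          {
                  bool failed = link_cfg_err & ICE_AQ_LINK_EXTERNAL_PHY_LOAD_FAILURE;

                  /* log only on a state change so link polling does not
                   * repeat the same message every interval */
                  if (failed == test_bit(ICE_FLAG_PHY_FW_LOAD_FAILED, pf->flags))
                          return;

                  if (failed) {
                          set_bit(ICE_FLAG_PHY_FW_LOAD_FAILED, pf->flags);
                          dev_err(ice_pf_to_dev(pf),
                                  "Device failed to load the PHY FW image\n");
                  } else {
                          clear_bit(ICE_FLAG_PHY_FW_LOAD_FAILED, pf->flags);
                  }
          }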
    • ice: VXLAN and Geneve TC support · 9e300987
      Submitted by Michal Swiatkowski
      Add definitions for the VXLAN and Geneve dummy packets. Define VXLAN
      and Geneve field types to match on the correct UDP tunnel header.
      
      Parse tunnel-specific fields from the TC tool, such as outer MACs,
      outer IPs, outer destination port and VNI. Save the values and masks in
      the outer header struct and move the header pointer to the inner
      headers to simplify parsing of inner values.
      
      There are two cases for the redirect action:
      - from uplink to VF - the TC filter is added on the tunnel device
      - from VF to uplink - the TC filter is added on the PR; for this case,
        check if the redirect device is a tunnel device
      
      VXLAN example:
      - create tunnel device
      ip l add $VXLAN_DEV type vxlan id $VXLAN_VNI dstport $VXLAN_PORT \
      dev $PF
      - add TC filter (in switchdev mode)
      tc filter add dev $VXLAN_DEV protocol ip parent ffff: flower \
      enc_dst_ip $VF1_IP enc_key_id $VXLAN_VNI action mirred egress \
      redirect dev $VF1_PR
      
      Geneve example:
      - create tunnel device
      ip l add $GENEVE_DEV type geneve id $GENEVE_VNI dstport $GENEVE_PORT \
      remote $GENEVE_IP
      - add TC filter (in switchdev mode)
      tc filter add dev $GENEVE_DEV protocol ip parent ffff: flower \
      enc_key_id $GENEVE_VNI dst_ip $GENEVE1_IP action mirred egress \
      redirect dev $VF1_PR
      Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: support for indirect notification · 195bb48f
      Submitted by Michal Swiatkowski
      Implement an indirect notification mechanism to support offloading TC
      rules on tunnel devices.
      
      Keep the indirect block list in the netdev priv. A notification will
      call the tc cls_flower setup function (a sketch of the registration
      follows this entry). For now only the ingress type can be offloaded;
      return not-supported for any other flow block binder.
      Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
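      A minimal sketch of the registration, using the kernel's indirect flow
      block API; the callback name is an assumption in line with the commit's
      intent:

          /* register for TC blocks bound on netdevs (e.g. VXLAN devices)
           * that the driver does not own but can offload rules from */
          err = flow_indr_dev_register(ice_indr_setup_tc_cb, np);

          /* the callback is expected to return -EOPNOTSUPP for anything
           * that is not an ingress cls_flower block, since only ingress
           * is offloaded for now */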
  12. 21 Oct 2021, 4 commits
  13. 20 Oct 2021, 2 commits
    • ice: update dim usage and moderation · d8eb7ad5
      Submitted by Jesse Brandeburg
      The driver was having trouble with unreliable latency when doing
      single-threaded ping-pong tests. This was root-caused to the DIM
      algorithm landing on a too-slow interrupt value, which caused high
      latency; it was especially present when queues were being switched
      frequently by the scheduler, as happens on default setups today.
      
      To improve this, we allow the upper bound of the interrupt rate limit
      to come down to a maximum of 4 microseconds, which means that no vector
      can generate more than 250,000 interrupts per second. The old config
      allowed up to 100,000. The driver previously tried to program the rate
      limit too frequently, and if the receive and transmit sides were both
      active on the same vector, the INTRL would be set incorrectly; this
      change fixes that issue as a side effect of the redesign.
      
      The driver will from now on operate with a slightly changed DIM table
      with more emphasis on latency sensitivity, by having more table entries
      with lower latency than with high latency (high being >= 64
      microseconds).
      
      The driver also resets the DIM algorithm state with a new stats set
      when there is no work done and the data becomes stale (older than
      1 second), for the respective receive or transmit portion of the
      interrupt.
      
      Add a new helper for setting the rate limit, which will be used more in
      a follow-up patch (a sketch follows this entry).
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
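      A minimal sketch of a rate-limit helper in the spirit of the commit; the
      register macro and the conversion helper are assumed from the driver's
      existing register-access style:

          static void
          ice_write_intrl_sketch(struct ice_q_vector *q_vector, u8 intrl)
          {
                  struct ice_hw *hw = &q_vector->vsi->back->hw;

                  /* INTRL caps the interrupt rate independently of ITR;
                   * 4 us allows at most 250,000 interrupts per second */
                  wr32(hw, GLINT_RATE(q_vector->reg_idx),
                       ice_intrl_usec_to_reg(intrl, ICE_INTRL_GRAN_ABOVE_25));
          }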
    • ice: Add support for VF rate limiting · 4ecc8633
      Submitted by Brett Creeley
      Implement ndo_set_vf_rate to support setting of min_tx_rate and
      max_tx_rate; set the appropriate bandwidth in the scheduler for the
      node representing the specified VF VSI (a sketch follows this entry).
      Co-developed-by: Tarun Singh <tarun.k.singh@intel.com>
      Signed-off-by: Tarun Singh <tarun.k.singh@intel.com>
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
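      A minimal sketch of the hook, using the kernel's ndo_set_vf_rate
      signature (rates are in Mbps); the scheduler plumbing is only hinted at:

          static int ice_set_vf_bw_sketch(struct net_device *netdev, int vf_id,
                                          int min_tx_rate, int max_tx_rate)
          {
                  /* reject an inverted range before touching the scheduler */
                  if (max_tx_rate && min_tx_rate > max_tx_rate)
                          return -EINVAL;

                  /* here the driver would look up the VF VSI and program
                   * min (CIR) and max (EIR) bandwidth on its scheduler
                   * node (assumed flow) */
                  return 0;
          }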
  14. 15 Oct 2021, 8 commits
    • ice: make use of ice_for_each_* macros · 2faf63b6
      Submitted by Maciej Fijalkowski
      Go through the code base and use the ice_for_each_* macros. While at
      it, introduce an ice_for_each_xdp_txq() macro that can be used for
      looping over the xdp_rings array (a sketch follows this entry).
      
      This commit does not introduce any new functionality.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
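      A minimal sketch of the new macro, following the pattern of the existing
      ice_for_each_* helpers:

          /* iterate over the VSI's XDP Tx rings (sketch) */
          #define ice_for_each_xdp_txq(vsi, i) \
                  for ((i) = 0; (i) < (vsi)->num_xdp_txq; (i)++)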
    • ice: introduce XDP_TX fallback path · 22bf877e
      Submitted by Maciej Fijalkowski
      Under rare circumstances there might be a situation where the
      requirement of having one XDP Tx queue per CPU cannot be fulfilled and
      some of the Tx resources have to be shared between CPUs. This yields a
      need to place accesses to xdp_ring inside a critical section protected
      by a spinlock. These accesses happen to be in the hot path, so
      introduce a static branch that will be triggered from the control plane
      when the driver cannot provide a Tx queue dedicated to XDP on each CPU
      (a sketch follows this entry).
      
      Currently, the design that has been picked is to allow any number of
      XDP Tx queues that is at least half the count of CPUs that the platform
      has. For a lower number, the driver will bail out with a response to
      the user that there were not enough Tx resources to allow configuring
      XDP. The sharing of rings is signalled via static branch enablement,
      which in turn indicates that the lock for xdp_ring accesses needs to be
      taken in the hot path.
      
      The approach based on a static branch has no impact on the performance
      of the non-fallback path. One thing that needs to be mentioned is that
      the static branch acts as a global driver switch, meaning that if one
      PF runs out of Tx resources, then the other PFs that the ice driver is
      servicing will suffer. However, given that the HW the ice driver
      handles has 1024 Tx queues per PF, this is currently an unlikely
      scenario.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
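      A minimal sketch of the gating, using the kernel's static-branch API;
      the key and lock names are illustrative:

          DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);

          /* control plane: fewer XDP Tx queues than CPUs, rings are shared */
          if (vsi->num_xdp_txq < num_possible_cpus())
                  static_branch_inc(&ice_xdp_locking_key);

          /* hot path: the branch is patched out when no PF shares rings */
          if (static_branch_unlikely(&ice_xdp_locking_key))
                  spin_lock(&xdp_ring->tx_lock);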
    • ice: optimize XDP_TX workloads · 9610bd98
      Submitted by Maciej Fijalkowski
      Optimize Tx descriptor cleaning for XDP. The current approach doesn't
      really scale and chokes when multiple flows are handled.
      
      Introduce two ring fields, @next_dd and @next_rs, that will keep track
      of the descriptor that should be looked at when the need for cleaning
      arises and the descriptor that should have the RS bit set,
      respectively.
      
      Note that at this point the threshold is a constant (32), but it is
      something that we could make configurable.
      
      The first thing is to get away from setting the RS bit on each
      descriptor. Do this only once NTU is higher than the current @next_rs
      value. In that case, grab tx_desc[next_rs], set the RS bit in the
      descriptor and advance @next_rs by 32 (a sketch follows this entry).
      
      The second thing is to clean the Tx ring only when there are fewer than
      32 free entries. For that case, look up tx_desc[next_dd] for a DD bit.
      This bit is written back by HW to let the driver know that the xmit was
      successful. It will happen only for those descriptors that had the RS
      bit set. Clean only 32 descriptors and advance @next_dd by 32.
      
      The actual cleaning routine is moved from ice_napi_poll() down to
      ice_xmit_xdp_ring(). It is safe to do so, as the XDP ring will not get
      any SKBs that would rely on interrupts for cleaning. A nice side effect
      is that for the rare case of the Tx fallback path (which the next patch
      is going to introduce) we don't have to trigger the SW irq to clean the
      ring.
      
      With those two concepts, the ring is kept almost full, but it is
      guaranteed that the driver will be able to produce Tx descriptors.
      
      This approach seems to work out well even though the Tx descriptors are
      produced one by one. The test was conducted with the ice HW bombarded
      with packets from a HW generator configured to generate 30 flows.
      
      Xdp2 sample yields the following results:
      <snip>
      proto 17:   79973066 pkt/s
      proto 17:   80018911 pkt/s
      proto 17:   80004654 pkt/s
      proto 17:   79992395 pkt/s
      proto 17:   79975162 pkt/s
      proto 17:   79955054 pkt/s
      proto 17:   79869168 pkt/s
      proto 17:   79823947 pkt/s
      proto 17:   79636971 pkt/s
      </snip>
      
      As that sample reports the Rx'ed frames, let's look at the sar output.
      It says that what we Rx'ed we actually do Tx, with no noticeable drops.
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00      0.00      0.00     38.32
      
      with tx_busy staying calm.
      
      When compared to a state before:
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00      0.00      0.00     43.64
      
      it can be observed that the amount of txpck/s is almost doubled,
      meaning that the performance is improved by around 90%. All of this is
      due to the drops in the driver; previously the tx_busy stat was bumped
      at a 7 Mpps rate.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
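      A minimal sketch of the RS-bit batching described above, with the
      threshold hard-coded to 32 as in the commit; wrap-around handling is
      omitted:

          #define ICE_TX_THRESH_SKETCH 32

          /* set RS once per 32 descriptors instead of on every one */
          if (xdp_ring->next_to_use > xdp_ring->next_rs) {
                  tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
                  tx_desc->cmd_type_offset_bsz |=
                          cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
                  xdp_ring->next_rs += ICE_TX_THRESH_SKETCH;
          }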
    • ice: propagate xdp_ring onto rx_ring · eb087cd8
      Submitted by Maciej Fijalkowski
      With the rings being split, it is now convenient to introduce a pointer
      to the XDP ring within the Rx ring. For XDP_TX workloads this means
      that the xdp_rings array access, previously executed per processed
      frame, will be skipped.
      
      Also, read the XDP prog once per NAPI poll and, if a prog is present,
      set up the local xdp_ring pointer (a sketch follows this entry).
      Reading the prog a single time was discussed in [1], with some concern
      raised by Toke around dispatcher handling and the need to go through
      the RCU grace period in the ndo_bpf driver callback, but ice currently
      tears down NAPI instances regardless of the prog's presence on the VSI.
      
      Although the pointer to the XDP ring introduced to the Rx ring makes
      things a lot slimmer/simpler, I still feel that a single prog read per
      NAPI lifetime is beneficial.
      
      A further patch that introduces the fallback path will also profit from
      this, as the xdp_ring pointer will be set during the XDP rings setup.
      
      [1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
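      A minimal sketch of the per-NAPI reads; both fields live on the Rx ring
      after this series:

          /* read once per NAPI poll instead of once per frame */
          struct bpf_prog *xdp_prog = READ_ONCE(rx_ring->xdp_prog);
          struct ice_tx_ring *xdp_ring = xdp_prog ? rx_ring->xdp_ring : NULL;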
    • ice: unify xdp_rings accesses · 0bb4f9ec
      Submitted by Maciej Fijalkowski
      There has been a long-lasting issue of improper xdp_rings indexing for
      the XDP_TX and XDP_REDIRECT actions. Given that currently
      rx_ring->q_index is mixed with smp_processor_id(), there could be a
      situation where Tx descriptors are produced onto the XDP Tx ring but
      the tail is never bumped - for example, when a particular queue id is
      pinned to a non-matching IRQ line.
      
      Address this problem by ignoring the user ring count setting and always
      initializing the xdp_rings array to num_possible_cpus() size. Then,
      always use smp_processor_id() as the index into the xdp_rings array
      (a sketch follows this entry). This provides serialization, as at any
      given time only a single softirq can run on a particular CPU.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
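      A minimal sketch of the unified indexing on the XDP_TX/XDP_REDIRECT
      path:

          /* running in softirq context: the CPU cannot change under us and
           * at most one softirq runs per CPU at a time, so this index is a
           * safe serialization point */
          struct ice_tx_ring *xdp_ring = vsi->xdp_rings[smp_processor_id()];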
    • ice: split ice_ring onto Tx/Rx separate structs · e72bba21
      Submitted by Maciej Fijalkowski
      While it was convenient to have a generic ring structure that served
      both the Tx and Rx sides, the next commits are going to introduce
      several Tx-specific fields, so in order to avoid hurting the Rx side,
      let's pull the Tx ring out into new ice_tx_ring and ice_rx_ring
      structs.
      
      The Rx ring could be handled by the old ice_ring, which would reduce
      the code churn within this patch, but that would make things
      asymmetric.
      
      Make a union out of the ring container within ice_q_vector so that it
      is possible to iterate over the newly introduced ice_tx_ring (a sketch
      follows this entry).
      
      Remove @size, as it is only accessed from the control path and can be
      calculated pretty easily.
      
      Change the definitions of ice_update_ring_stats and
      ice_fetch_u64_stats_per_ring so that they are ring-agnostic and can be
      used for both Rx and Tx rings.
      
      The sizes of the Rx and Tx ring structs are 256 and 192 bytes,
      respectively. In the Rx ring, xdp_rxq_info occupies its own cacheline,
      so that is the major difference now.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
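      A minimal sketch of the container union mentioned above; the remaining
      ITR bookkeeping fields are elided:

          struct ice_ring_container_sketch {
                  union {
                          struct ice_tx_ring *tx_ring;  /* Tx side of the vector */
                          struct ice_rx_ring *rx_ring;  /* Rx side of the vector */
                  };
                  /* ITR/dim state elided */
          };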
    • ice: remove ring_active from ice_ring · e93d1c37
      Submitted by Maciej Fijalkowski
      This field is dead and the driver does not make any use of it. Simply
      remove it.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Avoid crash from unnecessary IDA free · 73e30a62
      Submitted by Dave Ertman
      In the remove path, there is an attempt to free the aux_idx IDA whether
      it was allocated or not. This can potentially cause a crash when
      unloading the driver on systems that do not initialize support for
      RDMA. The free cannot simply be gated by the RDMA status bit, though:
      the IDA is allocated if the driver detects support for RDMA at probe
      time, but the driver can later enter a state where RDMA is not
      supported even though the IDA has been allocated, and gating on the
      bit would then leak memory.
      
      Initialize aux_idx to an invalid value and check for a valid value when
      unloading to determine if an IDA free is necessary (a sketch follows
      this entry).
      
      Fixes: d25a0fc4 ("ice: Initialize RDMA support")
      Reported-by: Jun Miao <jun.miao@windriver.com>
      Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
      Tested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
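      A minimal sketch of the guarded free, using the kernel IDA API; the
      sentinel follows the commit's description:

          /* probe: mark "not allocated" before any RDMA detection runs */
          pf->aux_idx = -1;

          /* remove: free only if an ID was actually allocated */
          if (pf->aux_idx >= 0)
                  ida_free(&ice_aux_ida, pf->aux_idx);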
  15. 12 Oct 2021, 1 commit
  16. 08 Oct 2021, 2 commits