1. 06 Apr 2022, 1 commit
  2. 01 Apr 2022, 1 commit
    • ice: Fix broken IFF_ALLMULTI handling · 1273f895
      Ivan Vecera committed
      Handling of the all-multicast flag and the associated multicast
      promiscuous mode is broken in the ice driver. When a user switches
      the allmulticast flag on or off, the driver checks whether any VLANs
      are configured over the interface (besides the default VLAN 0).
      
      If any extra VLANs are registered, the driver enables multicast
      promiscuous mode for all of these VLANs (including the default VLAN 0)
      using the ICE_SW_LKUP_PROMISC_VLAN look-up type. In this situation all
      multicast packets tagged with a known VLAN ID or untagged are received,
      while multicast packets tagged with an unknown VLAN ID are ignored.
      
      If no extra VLANs are registered (so only VLAN 0 exists), the driver
      enables multicast promiscuous mode for VLAN 0 using the
      ICE_SW_LKUP_PROMISC look-up type. In this situation all multicast
      packets, including tagged ones, are received.
      
      The driver handles IFF_ALLMULTI in ice_vsi_sync_fltr() this way:
      
      ice_vsi_sync_fltr() {
        ...
        if (changed_flags & IFF_ALLMULTI) {
          if (netdev->flags & IFF_ALLMULTI) {
            if (vsi->num_vlans > 1)
              ice_set_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS);
            else
              ice_set_promisc(..., ICE_MCAST_PROMISC_BITS);
          } else {
            if (vsi->num_vlans > 1)
              ice_clear_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS);
            else
              ice_clear_promisc(..., ICE_MCAST_PROMISC_BITS);
          }
        }
        ...
      }
      
      The code above depends on vsi->num_vlan, which holds the number of
      VLANs configured over the interface (including VLAN 0). This is a
      problem because that value is modified in the NDO callbacks
      ice_vlan_rx_add_vid() and ice_vlan_rx_kill_vid().
      
      Scenario 1:
      1. ip link set ens7f0 allmulticast on
      2. ip link add vlan10 link ens7f0 type vlan id 10
      3. ip link set ens7f0 allmulticast off
      4. ip link set ens7f0 allmulticast on
      
      [1] In this scenario IFF_ALLMULTI is enabled and the driver calls
          ice_set_promisc(..., ICE_MCAST_PROMISC_BITS), which installs a
          multicast promisc rule with the non-VLAN look-up type.
      [2] Then VLAN with ID 10 is added and vsi->num_vlan is incremented
          to 2.
      [3] Command switches IFF_ALLMULTI off and the driver calls
          ice_clear_promisc(..., ICE_MCAST_VLAN_PROMISC_BITS), but this
          call is effectively a NOP because it looks for multicast promisc
          rules for VLAN 0 and VLAN 10 with the VLAN look-up type and no
          such rules exist. So all-multicast silently remains enabled in
          hardware.
      [4] Command switches IFF_ALLMULTI back on and the driver calls
          ice_set_promisc(..., ICE_MCAST_PROMISC_BITS), but this call
          fails (-EEXIST) because the non-VLAN multicast promisc rule
          installed in step 1 still exists.
      
      Scenario 2:
      1. ip link add vlan10 link ens7f0 type vlan id 10
      2. ip link set ens7f0 allmulticast on
      3. ip link add vlan20 link ens7f0 type vlan id 20
      4. ip link del vlan10 ; ip link del vlan20
      5. ip link set ens7f0 allmulticast off
      
      [1] VLAN with ID 10 is added and vsi->num_vlan==2
      [2] Command switches IFF_ALLMULTI on and the driver installs
          multicast promisc rules with the VLAN look-up type for VLANs 0
          and 10
      [3] VLAN with ID 20 is added and vsi->num_vlan==3, but no multicast
          promisc rule is added for this new VLAN, so the interface does
          not receive multicast packets from VLAN 20
      [4] Both VLANs are removed, but the multicast rule for VLAN 10
          remains installed, so the interface still receives multicast
          packets from VLAN 10
      [5] Command switches IFF_ALLMULTI off and, because vsi->num_vlan is
          1, the driver tries to remove a multicast promisc rule for
          VLAN 0 with the non-VLAN look-up type, which does not exist.
          All-multicast looks disabled from the user's point of view, but
          it remains partially enabled in hardware (the interface receives
          all multicast packets that are either untagged or tagged with
          VLAN ID 10)
      
      To resolve these issues, the patch introduces the following changes
      (change 1 is sketched in code after the list):
      1. Adds handling for IFF_ALLMULTI to ice_vlan_rx_add_vid() and
         ice_vlan_rx_kill_vid() callbacks. So when VLAN is added/removed
         and IFF_ALLMULTI is enabled an appropriate multicast promisc
         rule for that VLAN ID is added/removed.
      2. In ice_vlan_rx_add_vid() when first VLAN besides VLAN 0 is added
         so (vsi->num_vlan == 2) and IFF_ALLMULTI is enabled then look-up
         type for existing multicast promisc rule for VLAN 0 is updated
         to ICE_MCAST_VLAN_PROMISC_BITS.
      3. In ice_vlan_rx_kill_vid() when last VLAN besides VLAN 0 is removed
         so (vsi->num_vlan == 1) and IFF_ALLMULTI is enabled then look-up
         type for existing multicast promisc rule for VLAN 0 is updated
         to ICE_MCAST_PROMISC_BITS.
      4. Both ice_vlan_rx_{add,kill}_vid() have to run under ICE_CFG_BUSY
         bit protection to avoid races with ice_vsi_sync_fltr() that runs
         in ice_service_task() context.
      5. Bit ICE_VSI_VLAN_FLTR_CHANGED is now useless and can be removed.
      6. Error messages are added to the ice_fltr_*_vsi_promisc() helper
         functions so they do not have to be duplicated in their callers.
      7. Small improvements to increase readability.
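
      A minimal sketch of changes (1) and (4) for the VLAN-add path,
      assuming the helper names ice_vsi_add_vlan() and
      ice_fltr_set_vsi_promisc() used elsewhere in this driver; the actual
      upstream diff may differ in detail:

      static int ice_vlan_rx_add_vid(struct net_device *netdev,
                                     __be16 proto, u16 vid)
      {
        struct ice_netdev_priv *np = netdev_priv(netdev);
        struct ice_vsi *vsi = np->vsi;
        int ret;

        /* (4) serialize with ice_vsi_sync_fltr() running from
         * ice_service_task()
         */
        while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
          usleep_range(1000, 2000);

        ret = ice_vsi_add_vlan(vsi, vid, ICE_FWD_TO_VSI);

        /* (1)/(2) with all-multicast on, add a promisc rule for the new
         * VLAN; the first extra VLAN also switches the VLAN 0 rule to
         * the VLAN-aware look-up type
         */
        if (!ret && (netdev->flags & IFF_ALLMULTI))
          ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
                                         ICE_MCAST_VLAN_PROMISC_BITS, vid);

        clear_bit(ICE_CFG_BUSY, vsi->state);
        return ret;
      }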
      
      Fixes: 5eda8afd ("ice: Add support for PF/VF promiscuous mode")
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Alice Michael <alice.michael@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 29 Mar 2022, 1 commit
    • ice: xsk: Fix indexing in ice_tx_xsk_pool() · 1ac2524d
      Maciej Fijalkowski committed
      The ice driver always tries to size the XDP rings array to
      num_possible_cpus(), regardless of the user's queue count setting,
      which can be changed via e.g. ethtool -L.
      
      Currently, ice_tx_xsk_pool() calculates the qid by decrementing the
      ring->q_index by the count of XDP queues, but ring->q_index is set to 'i
      + vsi->alloc_txq'.
      
      When the user has run 'ethtool -L $IFACE combined 1', alloc_txq is 1,
      but vsi->num_xdp_txq is still num_possible_cpus(). ice_tx_xsk_pool()
      then performs an out-of-bounds access and, as a result, the ring does
      not get its xsk_pool pointer assigned. Every subsequent
      ice_xsk_wakeup() call then fails with an error and it is not possible
      to get into NAPI and do the processing from the driver side.

      Fix this by subtracting vsi->alloc_txq instead of vsi->num_xdp_txq
      from ring->q_index in ice_tx_xsk_pool(), so the calculation mirrors
      how ring->q_index was set.
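
      A minimal sketch of the corrected helper, assuming the field names
      mentioned above (q_index, alloc_txq, num_xdp_txq); the exact
      upstream implementation may differ:

      static struct xsk_buff_pool *ice_tx_xsk_pool(struct ice_tx_ring *ring)
      {
        struct ice_vsi *vsi = ring->vsi;
        u16 qid;

        /* q_index was set to 'i + vsi->alloc_txq', so subtract alloc_txq
         * (not num_xdp_txq) to recover the XDP queue id
         */
        qid = ring->q_index - vsi->alloc_txq;

        if (!ice_is_xdp_ena_vsi(vsi) || qid >= vsi->num_xdp_txq)
          return NULL;

        return xsk_get_pool_from_qid(vsi->netdev, qid);
      }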
      
      Fixes: 22bf877e ("ice: introduce XDP_TX fallback path")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220328142123.170157-5-maciej.fijalkowski@intel.com
  4. 24 Mar 2022, 1 commit
    • ice: fix 'scheduling while atomic' on aux critical err interrupt · 32d53c0a
      Alexander Lobakin committed
      There's a kernel BUG splat on processing aux critical error
      interrupts in ice_misc_intr():
      
      [ 2100.917085] BUG: scheduling while atomic: swapper/15/0/0x00010000
      ...
      [ 2101.060770] Call Trace:
      [ 2101.063229]  <IRQ>
      [ 2101.065252]  dump_stack+0x41/0x60
      [ 2101.068587]  __schedule_bug.cold.100+0x4c/0x58
      [ 2101.073060]  __schedule+0x6a4/0x830
      [ 2101.076570]  schedule+0x35/0xa0
      [ 2101.079727]  schedule_preempt_disabled+0xa/0x10
      [ 2101.084284]  __mutex_lock.isra.7+0x310/0x420
      [ 2101.088580]  ? ice_misc_intr+0x201/0x2e0 [ice]
      [ 2101.093078]  ice_send_event_to_aux+0x25/0x70 [ice]
      [ 2101.097921]  ice_misc_intr+0x220/0x2e0 [ice]
      [ 2101.102232]  __handle_irq_event_percpu+0x40/0x180
      [ 2101.106965]  handle_irq_event_percpu+0x30/0x80
      [ 2101.111434]  handle_irq_event+0x36/0x53
      [ 2101.115292]  handle_edge_irq+0x82/0x190
      [ 2101.119148]  handle_irq+0x1c/0x30
      [ 2101.122480]  do_IRQ+0x49/0xd0
      [ 2101.125465]  common_interrupt+0xf/0xf
      [ 2101.129146]  </IRQ>
      ...
      
      As Andrew correctly mentioned previously[0], the following call
      ladder happens:
      
      ice_misc_intr() <- hardirq
        ice_send_event_to_aux()
          device_lock()
            mutex_lock()
              might_sleep()
                might_resched() <- oops
      
      Add a new PF state bit which indicates that an aux critical error
      occurred, and handle it in ice_service_task() in process context.
      The new ice_pf::oicr_err_reg is read and written in both hardirq and
      process contexts, but these are only 3 bits of non-critical data,
      which is probably not worth explicit synchronization (and they are
      even in the same byte [31:24]).
      
      [0] https://lore.kernel.org/all/YeSRUVmrdmlUXHDn@lunn.ch
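
      A minimal sketch of the split described above; the pending-flag name
      and the critical-error mask below are illustrative assumptions:

      /* hardirq context (ice_misc_intr()): only record the cause */
      if (oicr & ICE_AUX_CRIT_ERR_MASK) {           /* hypothetical mask */
        pf->oicr_err_reg |= oicr & ICE_AUX_CRIT_ERR_MASK;
        set_bit(ICE_AUX_ERR_PENDING, pf->state);    /* hypothetical bit */
      }

      /* process context (ice_service_task()): safe to sleep here */
      if (test_and_clear_bit(ICE_AUX_ERR_PENDING, pf->state)) {
        struct iidc_event *event = kzalloc(sizeof(*event), GFP_KERNEL);

        if (event) {
          set_bit(IIDC_EVENT_CRIT_ERR, event->type);
          event->reg = pf->oicr_err_reg;
          ice_send_event_to_aux(pf, event); /* may take device_lock() */
          kfree(event);
        }
      }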
      
      Fixes: 348048e7 ("ice: Implement iidc operations")
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Tested-by: Michal Kubiak <michal.kubiak@intel.com>
      Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. 15 Mar 2022, 3 commits
    • ice: remove circular header dependencies on ice.h · 649c87c6
      Jacob Keller committed
      Several headers in the ice driver include ice.h even though they are
      themselves included by that header. The most notable of these is
      ice_common.h, but several other headers also do this.
      
      Such a recursive inclusion is problematic as it forces headers to be
      included in a strict order, otherwise compilation errors can result. The
      circular inclusions do not trigger an endless loop due to standard
      header inclusion guards, however other errors can occur.
      
      For example, ice_flow.h defines ice_rss_hash_cfg, which is used by
      ice_sriov.h as part of the definition of ice_vf_hash_ip_ctx.
      
      ice_flow.h includes ice_acl.h, which includes ice_common.h, and which
      finally includes ice.h. Since ice.h itself includes ice_sriov.h, this
      creates a circular dependency.
      
      The definition in ice_sriov.h requires things from ice_flow.h, but
      ice_flow.h itself will lead to trying to load ice_sriov.h as part of its
      process for expanding ice.h. The current code avoids this issue by
      having an implicit dependency without the include of ice_flow.h.
      
      If we were to fix that so that ice_sriov.h explicitly depends on
      ice_flow.h the following pattern would occur:
      
        ice_flow.h -> ice_acl.h -> ice_common.h -> ice.h -> ice_sriov.h
      
      At this point, during the expansion of ice.h, the header guard for
      ice_flow.h is already set, so when ice_sriov.h attempts to include
      the ice_flow.h header it is skipped. We then go on to include the
      rest of ice_sriov.h, including structure definitions which depend on
      ice_rss_hash_cfg. This fails to compile because ice_rss_hash_cfg
      hasn't yet been defined. Remember, we're just at the start of
      ice_flow.h!
      
      If the order of headers is incorrect (ice_flow.h is not implicitly
      loaded first in all files which include ice_sriov.h) then we get the
      same failure.
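
      A simplified, hypothetical illustration of this guard-skip failure
      (not the actual ice headers):

      /* ice_flow.h */
      #ifndef _ICE_FLOW_H_
      #define _ICE_FLOW_H_
      #include "ice_acl.h"   /* -> ice_common.h -> ice.h -> ice_sriov.h */

      struct ice_rss_hash_cfg {
        u32 hash_flds;
      };
      #endif

      /* ice_sriov.h */
      #ifndef _ICE_SRIOV_H_
      #define _ICE_SRIOV_H_
      #include "ice_flow.h"  /* guard already set above: skipped */

      struct ice_vf_hash_ip_ctx {
        struct ice_rss_hash_cfg cfg;  /* error: type not yet defined */
      };
      #endif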
      
      Removing this recursive inclusion requires fixing a few cases where some
      headers depended on the header inclusions from ice.h. In addition, a few
      other changes are also required.
      
      Most notably, ice_hw_to_dev is implemented as a macro in ice_osdep.h,
      which is the likely reason that ice_common.h includes ice.h at all. This
      macro implementation requires the full definition of ice_pf in order to
      properly compile.
      
      Fix this by moving it to a function declared in ice_main.c, so that we
      do not require all files to depend on the layout of the ice_pf
      structure.
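
      A minimal sketch of that move; the 'before' macro shape is an
      assumption, while the 'after' relies on ice_pf embedding its ice_hw
      member:

      /* before (ice_osdep.h): the macro needs the full ice_pf layout */
      #define ice_hw_to_dev(hw) \
        (&((struct ice_pf *)(hw)->back)->pdev->dev)

      /* after (ice_main.c): callers only need a declaration */
      struct device *ice_hw_to_dev(struct ice_hw *hw)
      {
        struct ice_pf *pf = container_of(hw, struct ice_pf, hw);

        return &pf->pdev->dev;
      }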
      
      Note that this change only fixes circular dependencies, but it does not
      fully resolve all implicit dependencies where one header may depend on
      the inclusion of another. I tried to fix as many of the implicit
      dependencies as I noticed, but fixing them all requires a somewhat
      tedious analysis of each header and attempting to compile it separately.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: rename ice_virtchnl_pf.c to ice_sriov.c · 0deb0bf7
      Jacob Keller committed
      The ice_virtchnl_pf.c and ice_virtchnl_pf.h files are where most of the
      code for implementing Single Root IOV virtualization resides. This code
      includes support for bringing up and tearing down VFs, hooks into the
      kernel SR-IOV netdev operations, and for handling virtchnl messages from
      VFs.
      
      In the future, we plan to support Scalable IOV in addition to Single
      Root IOV as an alternative virtualization scheme. This implementation
      will re-use some, but not all, of the code in ice_virtchnl_pf.c.
      
      To prepare for this future, we want to refactor and split up the code in
      ice_virtchnl_pf.c into the following scheme:
      
       * ice_vf_lib.[ch]
      
         Basic VF structures and accessors. This is where scheme-independent
         code will reside.
      
       * ice_virtchnl.[ch]
      
         Virtchnl message handling. This is where the bulk of the logic for
         processing messages from VFs using the virtchnl messaging scheme will
         reside. This is separated from ice_vf_lib.c because it is distinct
         and has a bulk of the processing code.
      
       * ice_sriov.[ch]
      
         Single Root IOV implementation, including initialization and the
         routines for interacting with SR-IOV based netdev operations.
      
       * (future) ice_siov.[ch]
      
         Scalable IOV implementation.
      
      As a first step, let's assume that all of the code in
      ice_virtchnl_pf.[ch] is for Single Root IOV. Rename this file to
      ice_sriov.c and its header to ice_sriov.h.
      
      Future changes will further split out the code in these files following
      the plan outlined here.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: rename ice_sriov.c to ice_vf_mbx.c · d775155a
      Jacob Keller committed
      The ice_sriov.c file primarily contains code which handles the logic for
      mailbox overflow detection and some other utility functions related to
      the virtualization mailbox.
      
      The bulk of the SR-IOV implementation is actually found in
      ice_virtchnl_pf.c, and this file isn't strictly SR-IOV specific.
      
      In the future, the ice driver will support an additional virtualization
      scheme known as Scalable IOV, and the code in this file will be used
      for this alternative implementation.
      
      Rename this file (and its associated header) to ice_vf_mbx.c, so that
      the ice_sriov.c file name can later be re-used for the SR-IOV
      specific code.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  6. 12 Mar 2022, 1 commit
  7. 11 Mar 2022, 1 commit
    • ice: Fix race condition during interface enslave · 5cb1ebdb
      Ivan Vecera committed
      Commit 5dbbbd01 ("ice: Avoid RTNL lock when re-creating
      auxiliary device") changed the aux device re-creation process so that
      ice_plug_aux_dev() is called from ice_service_task() context.
      This unfortunately opens a race window that can result in a deadlock
      when an interface leaves a LAG and immediately enters it again.
      
      Reproducer:
      ```
      #!/bin/sh
      
      ip link add lag0 type bond mode 1 miimon 100
      ip link set lag0
      
      for n in {1..10}; do
              echo Cycle: $n
              ip link set ens7f0 master lag0
              sleep 1
              ip link set ens7f0 nomaster
      done
      ```
      
      This results in:
      [20976.208697] Workqueue: ice ice_service_task [ice]
      [20976.213422] Call Trace:
      [20976.215871]  __schedule+0x2d1/0x830
      [20976.219364]  schedule+0x35/0xa0
      [20976.222510]  schedule_preempt_disabled+0xa/0x10
      [20976.227043]  __mutex_lock.isra.7+0x310/0x420
      [20976.235071]  enum_all_gids_of_dev_cb+0x1c/0x100 [ib_core]
      [20976.251215]  ib_enum_roce_netdev+0xa4/0xe0 [ib_core]
      [20976.256192]  ib_cache_setup_one+0x33/0xa0 [ib_core]
      [20976.261079]  ib_register_device+0x40d/0x580 [ib_core]
      [20976.266139]  irdma_ib_register_device+0x129/0x250 [irdma]
      [20976.281409]  irdma_probe+0x2c1/0x360 [irdma]
      [20976.285691]  auxiliary_bus_probe+0x45/0x70
      [20976.289790]  really_probe+0x1f2/0x480
      [20976.298509]  driver_probe_device+0x49/0xc0
      [20976.302609]  bus_for_each_drv+0x79/0xc0
      [20976.306448]  __device_attach+0xdc/0x160
      [20976.310286]  bus_probe_device+0x9d/0xb0
      [20976.314128]  device_add+0x43c/0x890
      [20976.321287]  __auxiliary_device_add+0x43/0x60
      [20976.325644]  ice_plug_aux_dev+0xb2/0x100 [ice]
      [20976.330109]  ice_service_task+0xd0c/0xed0 [ice]
      [20976.342591]  process_one_work+0x1a7/0x360
      [20976.350536]  worker_thread+0x30/0x390
      [20976.358128]  kthread+0x10a/0x120
      [20976.365547]  ret_from_fork+0x1f/0x40
      ...
      [20976.438030] task:ip              state:D stack:    0 pid:213658 ppid:213627 flags:0x00004084
      [20976.446469] Call Trace:
      [20976.448921]  __schedule+0x2d1/0x830
      [20976.452414]  schedule+0x35/0xa0
      [20976.455559]  schedule_preempt_disabled+0xa/0x10
      [20976.460090]  __mutex_lock.isra.7+0x310/0x420
      [20976.464364]  device_del+0x36/0x3c0
      [20976.467772]  ice_unplug_aux_dev+0x1a/0x40 [ice]
      [20976.472313]  ice_lag_event_handler+0x2a2/0x520 [ice]
      [20976.477288]  notifier_call_chain+0x47/0x70
      [20976.481386]  __netdev_upper_dev_link+0x18b/0x280
      [20976.489845]  bond_enslave+0xe05/0x1790 [bonding]
      [20976.494475]  do_setlink+0x336/0xf50
      [20976.502517]  __rtnl_newlink+0x529/0x8b0
      [20976.543441]  rtnl_newlink+0x43/0x60
      [20976.546934]  rtnetlink_rcv_msg+0x2b1/0x360
      [20976.559238]  netlink_rcv_skb+0x4c/0x120
      [20976.563079]  netlink_unicast+0x196/0x230
      [20976.567005]  netlink_sendmsg+0x204/0x3d0
      [20976.570930]  sock_sendmsg+0x4c/0x50
      [20976.574423]  ____sys_sendmsg+0x1eb/0x250
      [20976.586807]  ___sys_sendmsg+0x7c/0xc0
      [20976.606353]  __sys_sendmsg+0x57/0xa0
      [20976.609930]  do_syscall_64+0x5b/0x1a0
      [20976.613598]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      
      1. Command 'ip link ... set nomaster' causes ice_plug_aux_dev() to
         be called from ice_service_task() context; the aux device is
         created and its associated device->lock is taken.
      2. Command 'ip link ... set master...' calls the ice notifier under
         the RTNL lock and that notifier calls ice_unplug_aux_dev(). That
         function tries to take the aux device->lock, but it is already
         held by ice_plug_aux_dev() from step 1.
      3. Later, ice_plug_aux_dev() tries to take the RTNL lock, which is
         already held from step 2.
      4. Deadlock.
      
      The patch fixes this issue with the following changes (sketched in
      code below):
      - Bit ICE_FLAG_PLUG_AUX_DEV is kept set during the ice_plug_aux_dev()
        call in ice_service_task()
      - The bit is checked in ice_clear_rdma_cap() and ice_unplug_aux_dev()
        is called only if it is not set. If it is set (in other words,
        plugging of the aux device was requested and ice_plug_aux_dev() is
        potentially running), then the function only clears the bit
      - Once the ice_plug_aux_dev() call (in ice_service_task) is finished,
        the bit ICE_FLAG_PLUG_AUX_DEV is cleared, but it is also checked
        whether it was already cleared by ice_clear_rdma_cap(). If so, the
        aux device is unplugged.
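
      A minimal sketch of the two sides of that flag dance (details may
      differ from the upstream diff):

      /* ice_clear_rdma_cap(): don't unplug while plugging may be running */
      if (!test_and_clear_bit(ICE_FLAG_PLUG_AUX_DEV, pf->flags))
        ice_unplug_aux_dev(pf);

      /* ice_service_task(): plug on request, then honor an unplug request
       * that arrived while we were plugging
       */
      if (test_bit(ICE_FLAG_PLUG_AUX_DEV, pf->flags)) {
        ice_plug_aux_dev(pf);
        if (!test_and_clear_bit(ICE_FLAG_PLUG_AUX_DEV, pf->flags))
          ice_unplug_aux_dev(pf);
      }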
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Co-developed-by: Petr Oros <poros@redhat.com>
      Signed-off-by: Petr Oros <poros@redhat.com>
      Reviewed-by: Dave Ertman <david.m.ertman@intel.com>
      Link: https://lore.kernel.org/r/20220310171641.3863659-1-ivecera@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  8. 10 Mar 2022, 1 commit
  9. 09 Mar 2022, 1 commit
    • ice: Fix error with handling of bonding MTU · 97b01291
      Dave Ertman committed
      When a bonded interface is destroyed, .ndo_change_mtu can be called
      during the tear-down process while the RTNL lock is held. This is a
      problem since the auxiliary driver linked to the LAN driver needs to
      be notified of the MTU change, and this requires grabbing a
      device_lock on the auxiliary_device's dev. Currently this is
      attempted in the same execution context as the call to
      .ndo_change_mtu, which causes a deadlock.
      
      Move the notification of the changed MTU to a separate execution context
      (watchdog service task) and eliminate the "before" notification.
      
      Fixes: 348048e7 ("ice: Implement iidc operations")
      Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
      Tested-by: Jonathan Toppins <jtoppins@redhat.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  10. 04 Mar 2022, 2 commits
    • ice: factor VF variables to separate structure · 000773c0
      Jacob Keller committed
      We maintain a number of values for VFs within the ice_pf structure. This
      includes the VF table, the number of allocated VFs, the maximum number
      of supported SR-IOV VFs, the number of queue pairs per VF, the number of
      MSI-X vectors per VF, and a bitmap of the VFs with detected MDD events.
      
      We're about to add a few more variables to this list. Clean this up
      first by extracting these members out into a new ice_vfs structure
      defined in ice_virtchnl_pf.h.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: store VF pointer instead of VF ID · b03d519d
      Jacob Keller committed
      The VSI structure contains a vf_id field used to associate a VSI with a
      VF. This is used mainly for ICE_VSI_VF as well as partially for
      ICE_VSI_CTRL associated with the VFs.
      
      This API was designed with the idea that VFs are stored in a simple
      array that was expected to be static throughout most of the driver's
      life.
      
      We plan on refactoring VF storage in a few key ways:
      
        1) converting from a simple static array to a hash table
        2) using krefs to track VF references obtained from the hash table
        3) use RCU to delay release of VF memory until after all references
           are dropped
      
      This is motivated by the goal to ensure that the lifetime of VF
      structures is accounted for, and prevent various use-after-free bugs.
      
      With the existing vsi->vf_id, the reference tracking for VFs would
      become somewhat convoluted, because each VSI maintains a vf_id field
      which will then require performing a look up. This means all these flows
      will require reference tracking and proper usage of rcu_read_lock, etc.
      
      We know that the VF VSI will always be backed by a valid VF structure,
      because the VSI is created during VF initialization and removed before
      the VF is destroyed. Rely on this and store a reference to the VF in the
      VSI structure instead of storing a VF ID. This will simplify the usage
      and avoid the need to perform lookups on the hash table in the future.
      
      For ICE_VSI_VF, it is expected that vsi->vf is always non-NULL after
      ice_vsi_alloc succeeds. Because of this, use WARN_ON when checking if a
      vsi->vf pointer is valid when dealing with VF VSIs. This will aid in
      debugging code which violates this assumption and avoid more disastrous
      panics.
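
      A sketch of that check at a hypothetical VF-VSI call site:

      /* VF VSIs are created during VF init and removed before the VF is
       * destroyed, so vsi->vf must be valid here
       */
      if (vsi->type == ICE_VSI_VF && WARN_ON(!vsi->vf))
        return -EINVAL;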
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  11. 03 Mar 2022, 1 commit
  12. 19 Feb 2022, 1 commit
    • ice: fix concurrent reset and removal of VFs · fadead80
      Jacob Keller committed
      Commit c503e632 ("ice: Stop processing VF messages during teardown")
      introduced a driver state flag, ICE_VF_DEINIT_IN_PROGRESS, which is
      intended to prevent some issues with concurrently handling messages from
      VFs while tearing down the VFs.
      
      This change was motivated by crashes caused while tearing down and
      bringing up VFs in rapid succession.
      
      It turns out that the fix actually introduces issues with the VF
      driver, because the PF no longer responds to any messages sent by the
      VF during its .remove routine. This results in the VF potentially
      removing its DMA memory before the PF has shut down the device queues.
      
      Additionally, the fix doesn't actually resolve concurrency issues within
      the ice driver. It is possible for a VF to initiate a reset just prior
      to the ice driver removing VFs. This can result in the remove task
      concurrently operating while the VF is being reset. This results in
      similar memory corruption and panics purportedly fixed by that commit.
      
      Fix this concurrency at its root by protecting both the reset and
      removal flows using the existing VF cfg_lock. This ensures that we
      cannot remove the VF while any outstanding critical tasks such as a
      virtchnl message or a reset are occurring.
      
      This locking change also fixes the root cause originally fixed by commit
      c503e632 ("ice: Stop processing VF messages during teardown"), so we
      can simply revert it.
      
      Note that I kept these two changes together because simply reverting the
      original commit alone would leave the driver vulnerable to worse race
      conditions.
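
      A minimal sketch of the cfg_lock pattern described above (helper
      names are illustrative):

      /* removal path: hold cfg_lock while tearing each VF down */
      ice_for_each_vf(pf, i) {
        struct ice_vf *vf = &pf->vf[i];

        mutex_lock(&vf->cfg_lock);
        ice_teardown_vf_res(vf);        /* hypothetical teardown helper */
        mutex_unlock(&vf->cfg_lock);
      }

      /* reset path: the same per-VF lock serializes against removal */
      mutex_lock(&vf->cfg_lock);
      ice_reset_vf(vf, false);
      mutex_unlock(&vf->cfg_lock);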
      
      Fixes: c503e632 ("ice: Stop processing VF messages during teardown")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  13. 14 Feb 2022, 1 commit
  14. 11 Feb 2022, 1 commit
  15. 10 Feb 2022, 3 commits
    • ice: Add ability for PF admin to enable VF VLAN pruning · f1da5a08
      Brett Creeley committed
      VFs by default are able to see all tagged traffic regardless of trust
      and VLAN filters. Based on legacy devices (i.e. ixgbe, i40e), customers
      expect VFs to receive all VLAN tagged traffic with a matching
      destination MAC.
      
      Add an ethtool private flag 'vf-vlan-pruning' that defaults to off,
      so VFs will receive all VLAN traffic directed towards them. When the
      flag is turned on, a VF will only be able to receive untagged traffic
      or traffic with VLAN tags for which it has created interfaces.

      Also, the flag cannot be changed while any VFs are allocated; this
      was done to simplify the implementation. So, if this flag is needed,
      the PF admin must enable it before creating VFs. If the user tries to
      enable the flag while VFs are active, an unsupported message is
      printed that includes the vf-vlan-pruning flag. In case multiple
      flags were specified, this makes it clear to the user which flag
      failed.
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Add outer_vlan_ops and VSI specific VLAN ops implementations · c31af68a
      Brett Creeley committed
      Add a new outer_vlan_ops member to the ice_vsi structure as outer VLAN
      ops are only available when the device is in Double VLAN Mode (DVM).
      Depending on the VSI type, the requirements for what operations to
      use/allow differ.
      
      By default all VSIs have unsupported inner and outer VSI VLAN ops.
      This implementation was chosen to prevent unexpected crashes due to
      NULL pointer dereferences. Instead, if a VSI calls an unsupported op,
      it will just return -EOPNOTSUPP.
      
      Add implementations to support modifying outer VLAN fields for VSI
      context. This includes the ability to modify VLAN stripping, insertion,
      and the port VLAN based on the outer VLAN handling fields of the VSI
      context.
      
      These functions should only ever be used if DVM is enabled because that
      means the firmware supports the outer VLAN fields in the VSI context. If
      the device is in DVM, then always use the outer_vlan_ops, else use the
      vlan_ops since the device is in Single VLAN Mode (SVM).
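
      A minimal sketch of that selection and of the default-unsupported
      pattern (the op layout is illustrative, not the exact ice structure):

      static int vlan_op_unsupported(struct ice_vsi *vsi)
      {
        return -EOPNOTSUPP;     /* graceful failure instead of NULL deref */
      }

      struct ice_vsi_vlan_ops *ice_get_vsi_vlan_ops(struct ice_vsi *vsi)
      {
        if (ice_is_dvm_ena(&vsi->back->hw))
          return &vsi->outer_vlan_ops;  /* Double VLAN Mode */

        return &vsi->vlan_ops;          /* Single VLAN Mode */
      }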
      
      Also, move adding the untagged VLAN 0 filter from ice_vsi_setup() to
      ice_vsi_vlan_setup(), as the latter function is specific to the PF
      and all other VSI types that need an untagged VLAN 0 filter already
      do this in their specific flows. Without this change, Flow Director
      would fail to initialize because it does not implement any VSI VLAN
      ops.
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Add new VSI VLAN ops · bc42afa9
      Brett Creeley committed
      Incoming changes to support 802.1Q and/or 802.1ad VLAN filtering and
      offloads require more flexibility when configuring VLANs. The VSI VLAN
      interface will allow flexibility for configuring VLANs for all VSI
      types. Add new files to separate the VSI VLAN ops and move functions to
      make the code more organized.
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  16. 30 Dec 2021, 1 commit
  17. 16 Dec 2021, 2 commits
    • ice: support immediate firmware activation via devlink reload · 399e27db
      Jacob Keller committed
      The ice hardware contains an embedded chip with firmware which can be
      updated using devlink flash. The firmware which runs on this chip is
      referred to as the Embedded Management Processor firmware (EMP
      firmware).
      
      Activating the new firmware image currently requires that the system be
      rebooted. This is not ideal as rebooting the system can cause unwanted
      downtime.
      
      In practical terms, activating the firmware does not always require a
      full system reboot. In many cases it is possible to activate the EMP
      firmware immediately. There are a couple of different scenarios to
      cover.
      
       * The EMP firmware itself can be reloaded by issuing a special update
         to the device called an Embedded Management Processor reset (EMP
         reset). This reset causes the device to reset and reload the EMP
         firmware.
      
       * PCI configuration changes are only reloaded after a cold PCIe reset.
         Unfortunately there is no generic way to trigger this for a PCIe
         device without a system reboot.
      
      When performing a flash update, firmware is capable of responding with
      some information about the specific update requirements.
      
      The driver updates the flash by programming a secondary inactive bank
      with the contents of the new image, and then issuing a command to
      request to switch the active bank starting from the next load.
      
      The response to the final command for updating the inactive NVM flash
      bank includes an indication of the minimum reset required to fully
      update the device. This can be one of the following:
      
       * A full power on is required
       * A cold PCIe reset is required
       * An EMP reset is required
      
      The response to the command to switch flash banks includes an indication
      of whether or not the firmware will allow an EMP reset request.
      
      For most updates, an EMP reset is sufficient to load the new EMP
      firmware without issues. In some cases, this reset is not sufficient
      because the PCI configuration space has changed. When this could cause
      incompatibility with the new EMP image, the firmware is capable of
      rejecting the EMP reset request.
      
      Add logic to ice_fw_update.c to handle the response data of the flash
      update AdminQ commands.
      
      For the reset level, issue a devlink status notification informing the
      user of how to complete the update with a simple suggestion like
      "Activate new firmware by rebooting the system".
      
      Cache the status of whether or not firmware will restrict the EMP reset
      for use in implementing devlink reload.
      
      Implement support for devlink reload with the "fw_activate" flag. This
      allows user space to request the firmware be activated immediately.
      
      For the .reload_down handler, we will issue a request for the EMP reset
      using the appropriate firmware AdminQ command. If we know that the
      firmware will not allow an EMP reset, simply exit with a suitable
      netlink extended ACK message indicating that the EMP reset is not
      available.
      
      For the .reload_up handler, simply wait until the driver has finished
      resetting. Logic to handle processing of an EMP reset already exists in
      the driver as part of its reset and rebuild flows.
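
      A minimal sketch of the .reload_down side, assuming a cached
      pf->fw_emp_reset_disabled flag and a hypothetical
      ice_trigger_emp_reset() helper:

      static int ice_devlink_reload_down(struct devlink *devlink,
                                         bool netns_change,
                                         enum devlink_reload_action action,
                                         enum devlink_reload_limit limit,
                                         struct netlink_ext_ack *extack)
      {
        struct ice_pf *pf = devlink_priv(devlink);

        /* firmware indicated at flash time that it rejects EMP reset */
        if (pf->fw_emp_reset_disabled) {
          NL_SET_ERR_MSG_MOD(extack,
                             "EMP reset is not available; reboot to activate firmware");
          return -ECANCELED;
        }

        return ice_trigger_emp_reset(pf);   /* hypothetical helper */
      }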
      
      Note that indicating the required reset and the EMP reset restriction
      is not supported on old versions of firmware. The driver can
      determine whether the two features are supported by checking the
      device capabilities report. I confirmed support has existed since at
      least version 5.5.2 as reported by the 'fw.mgmt' version. Support to
      issue the EMP reset request has existed in all versions of the EMP
      firmware for the ice hardware.
      
      Check the device capabilities report to determine whether or not the
      indications are reported by the running firmware. If the reset
      requirement indication is not supported, always assume a full power on
      is necessary. If the reset restriction capability is not supported,
      always assume the EMP reset is available.
      
      Users can verify if the EMP reset has activated the firmware by using
      the devlink info report to check that the 'running' firmware version has
      updated. For example a user might do the following:
      
       # Check current version
       $ devlink dev info
      
       # Update the device
       $ devlink dev flash pci/0000:af:00.0 file firmware.bin
      
       # Confirm stored version updated
       $ devlink dev info
      
       # Reload to activate new firmware
       $ devlink dev reload pci/0000:af:00.0 action fw_activate
      
       # Confirm running version updated
       $ devlink dev info
      
      Finally, this change does *not* implement basic driver-only reload
      support. I did look into trying to do this. However, it requires a
      significant refactor of how the ice driver probes and loads everything.
      The ice driver probe and allocation flows were not designed with such
      a reload in mind. Refactoring the flow to support this is beyond the
      scope of this change.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: devlink: add shadow-ram region to snapshot Shadow RAM · 78ad87da
      Jacob Keller committed
      We have a region for reading the contents of the NVM flash as
      a snapshot. This region does not allow reading the Shadow RAM, as it
      always passes the FLASH_ONLY bit to the low level firmware interface.
      
      Add a separate shadow-ram region which will allow snapshot of the
      current contents of the Shadow RAM. This data is built from the NVM
      contents but is distinct as the device builds up the Shadow RAM during
      initialization, so being able to snapshot its contents can be useful
      when attempting to debug flash related issues.
      
      Fix the comment description of the nvm-flash region which incorrectly
      stated that it filled the shadow-ram region, and add a comment
      explaining that the nvm-flash region does not actually read the Shadow
      RAM.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  18. 15 Dec 2021, 1 commit
  19. 23 Nov 2021, 1 commit
  20. 03 Nov 2021, 1 commit
    • ice: Fix VF true promiscuous mode · 1a8c7778
      Brett Creeley committed
      When a VF requests promiscuous mode, and it is trusted, and true
      promiscuous mode is enabled, the PF driver attempts to enable unicast
      and/or multicast promiscuous mode filters based on the request. This
      is fine, but there are a couple of issues with the current code.
      [1] The define to configure the unicast promiscuous mode mask also
          includes bits to configure the multicast promiscuous mode mask, which
          causes multicast to be set/cleared unintentionally.
      [2] All 4 cases for enable/disable unicast/multicast mode are not
          handled in the promiscuous mode message handler, which causes
          unexpected results regarding the current promiscuous mode settings.
      
      To fix [1] make sure any promiscuous mask defines include the correct
      bits for each of the promiscuous modes.
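
      A sketch of fix [1]; the 'before'/'after' mask shapes are assumptions
      for illustration:

      /* before: the "unicast" mask also flipped multicast promiscuous */
      #define ICE_UCAST_PROMISC_BITS \
        (ICE_PROMISC_UCAST_TX | ICE_PROMISC_MCAST_TX | \
         ICE_PROMISC_UCAST_RX | ICE_PROMISC_MCAST_RX)

      /* after: unicast and multicast masks are disjoint */
      #define ICE_UCAST_PROMISC_BITS (ICE_PROMISC_UCAST_TX | ICE_PROMISC_UCAST_RX)
      #define ICE_MCAST_PROMISC_BITS (ICE_PROMISC_MCAST_TX | ICE_PROMISC_MCAST_RX)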
      
      To fix [2] make sure that all 4 cases are handled since there are 2 bits
      (FLAG_VF_UNICAST_PROMISC and FLAG_VF_MULTICAST_PROMISC) that can be
      either set or cleared. Also, since either unicast and/or multicast
      promiscuous configuration can fail, introduce two separate error values
      to handle each of these cases.
      
      Fixes: 01b5e89a ("ice: Add VF promiscuous support")
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Tony Brelinski <tony.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  21. 29 Oct 2021, 2 commits
  22. 21 Oct 2021, 3 commits
  23. 15 Oct 2021, 3 commits
    • ice: make use of ice_for_each_* macros · 2faf63b6
      Maciej Fijalkowski committed
      Go through the code base and use the ice_for_each_* macros. While at
      it, introduce an ice_for_each_xdp_txq() macro that can be used for
      looping over the xdp_rings array (see the sketch below).

      This commit does not introduce any new functionality.
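
      A sketch of the new macro, following the existing ice_for_each_*
      pattern (assumed shape):

      #define ice_for_each_xdp_txq(vsi, i) \
        for ((i) = 0; (i) < (vsi)->num_xdp_txq; (i)++)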
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: introduce XDP_TX fallback path · 22bf877e
      Maciej Fijalkowski committed
      Under rare circumstances the requirement of having an XDP Tx queue
      per CPU cannot be fulfilled and some of the Tx resources have to be
      shared between CPUs. This yields a need for placing accesses to
      xdp_ring inside a critical section protected by a spinlock. These
      accesses happen in the hot path, so let's introduce a static branch
      that is enabled from the control plane when the driver cannot provide
      a Tx queue dedicated to XDP on each CPU.
      
      Currently, the chosen design allows any number of XDP Tx queues that
      is at least half the platform's CPU count. For a lower number, the
      driver bails out with a response to the user that there were not
      enough Tx resources to allow configuring XDP. The sharing of rings is
      signalled via static branch enablement, which in turn indicates that
      the lock for xdp_ring accesses needs to be taken in the hot path.
      
      The static-branch approach has no impact on the performance of the
      non-fallback path. One thing that needs to be mentioned is that the
      static branch acts as a global driver switch, meaning that if one PF
      runs out of Tx resources, the other PFs serviced by the ice driver
      will suffer as well. However, given that the hardware handled by the
      ice driver has 1024 Tx queues per PF, this is currently an unlikely
      scenario.
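
      A minimal sketch of the pattern; the key name and ring field are
      assumptions:

      DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);

      /* control plane: fewer XDP rings than CPUs -> enable shared-ring
       * locking
       */
      if (vsi->num_xdp_txq < num_possible_cpus())
        static_branch_inc(&ice_xdp_locking_key);

      /* hot path: take the per-ring lock only when rings are shared */
      if (static_branch_unlikely(&ice_xdp_locking_key))
        spin_lock(&xdp_ring->tx_lock);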
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: split ice_ring onto Tx/Rx separate structs · e72bba21
      Maciej Fijalkowski committed
      While it was convenient to have a generic ring structure that served
      both Tx and Rx sides, upcoming commits are going to introduce several
      Tx-specific fields, so in order to avoid hurting the Rx side, let's
      split the ring out into new ice_tx_ring and ice_rx_ring structs.

      The Rx ring could have kept the old ice_ring name, which would reduce
      the code churn within this patch, but that would make things
      asymmetric.

      Make a union out of the ring member within the ring container in
      ice_q_vector so that it is possible to iterate over the newly
      introduced ice_tx_ring (see the sketch below).
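
      A sketch of the resulting container layout (assumed, simplified):

      struct ice_ring_container {
        union {
          struct ice_tx_ring *tx_ring;
          struct ice_rx_ring *rx_ring;
        };
        /* ... ITR and counter fields ... */
      };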
      
      Remove the @size as it's only accessed from control path and it can be
      calculated pretty easily.
      
      Change definitions of ice_update_ring_stats and
      ice_fetch_u64_stats_per_ring so that they are ring agnostic and can be
      used for both Rx and Tx rings.
      
      The sizes of the Rx and Tx ring structs are 256 and 192 bytes,
      respectively. In the Rx ring, xdp_rxq_info occupies its own
      cacheline, which is now the major difference.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  24. 14 Oct 2021, 1 commit
  25. 12 Oct 2021, 1 commit
  26. 08 Oct 2021, 4 commits
    • ice: add port representor ethtool ops and stats · 7aae80ce
      Wojciech Drewek committed
      Introduce the following ethtool operations for VF's representor:
      	-get_drvinfo
      	-get_strings
      	-get_ethtool_stats
      	-get_sset_count
      	-get_link
      
      In all cases, existing operations were used with minor changes that
      allow us to detect whether the ethtool op was called for a
      representor. Only VF VSI stats are available for a representor.

      Implement ndo_get_stats64 for the port representor. This updates the
      VF VSI stats and reads them.
      Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: introduce new type of VSI for switchdev · f66756e0
      Grzegorz Nitka committed
      A new type of VSI has to be defined for the switchdev control plane
      VSI. The number of allocated Tx and Rx queues has to be equal to the
      number of VFs, because each port representor should have one Tx and
      one Rx queue.

      Also, to avoid increasing the number of used IRQs too much, the
      control plane VSI uses only one q_vector and handles all queues in
      one IRQ. To allow handling all queues in one IRQ, a new function to
      clean the MSI-X for the eswitch was introduced. This function
      schedules NAPI for each representor instead of scheduling it only for
      one, as the normal clean-IRQ function does.

      Only one additional MSI-X vector has to be requested; always try to
      request it in the ice_ena_msix_range function.
      Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: set and release switchdev environment · 1a1c40df
      Grzegorz Nitka committed
      The switchdev environment has to be set up when the user creates VFs
      and the eswitch mode is switchdev. It is released when the user
      deletes all VFs.

      The data path in this implementation is based on the control plane
      VSI. This VSI is used to pass traffic from port representors to the
      corresponding VFs and vice versa. A default Tx rule has to be added
      to forward packets to the control plane VSI; this redirects packets
      from VFs which don't match other rules to the control plane VSI.

      On the Rx side, a default rule is added on the uplink VSI to receive
      all traffic that doesn't match other rules. When setting up the
      switchdev environment, all other rules from VFs should be removed;
      packets to the VFs will be forwarded by the control plane VSI.

      As a VF without any MAC rules can't send any packets because of the
      antispoof mechanism, VSI antispoofing should be turned off on each VF.

      To send a packet from a representor to the correct VSI, the
      destination VSI field in the Tx descriptor has to be filled in. Allow
      that by setting the destination override bit in the control plane VSI
      security config.

      Packets from VFs are received on the control plane VSI. The driver
      has to decide which netdev to forward each packet to; the decision is
      made based on the src_vsi field from the descriptor. There is a
      target netdev list in the control plane VSI struct from which the
      netdev is chosen based on the src_vsi number.
      Co-developed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: introduce VF port representor · 37165e3f
      Michal Swiatkowski committed
      A port representor is used to manage a VF from the host side. To
      allow this, each created representor registers a netdevice with a
      random hardware address. A devlink port is also created for every
      representor.

      The port representor name is created based on the switch id, or is
      managed by the devlink core if the devlink port was registered
      successfully.

      The open and stop ndo ops are implemented to allow managing the VF
      link state. The link state is tracked in the VF struct.

      Struct ice_netdev_priv is extended by a pointer-to-representor field.
      This is needed to get the correct representor from the netdev struct,
      mostly in ndo calls.

      Implement helper functions to check if a given netdev belongs to a
      port representor (ice_is_port_repr_netdev) and to get the representor
      from a netdev (ice_netdev_to_repr); see the sketch below.
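
      A sketch of those helpers, assuming the new field is np->repr (shape
      illustrative):

      static inline bool ice_is_port_repr_netdev(struct net_device *netdev)
      {
        struct ice_netdev_priv *np = netdev_priv(netdev);

        return np && np->repr;
      }

      static inline struct ice_repr *ice_netdev_to_repr(struct net_device *netdev)
      {
        struct ice_netdev_priv *np = netdev_priv(netdev);

        return np->repr;
      }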
      
      As the driver will mostly create or destroy port representors on all
      VFs rather than on a single one, write functions to add and remove a
      representor for each VF.

      The representor struct contains a pointer to the source VSI (the VSI
      configured on the VF), a backpointer to the VF, a backpointer to the
      netdev, a q_vector pointer, and a metadata_dst which will be used in
      the data path.
      Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
      Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
      Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>