1. 04 Mar, 2022 · 11 commits
    • ice: convert VF storage to hash table with krefs and RCU · 3d5985a1
      Jacob Keller committed
      The ice driver stores VF structures in a simple array which is allocated
      once at the time of VF creation. The VF structures are then accessed
      from the array by their VF ID. The ID must be between 0 and the number
      of allocated VFs.
      
      Multiple threads can access this table:
      
       * .ndo operations such as .ndo_get_vf_cfg or .ndo_set_vf_trust
       * interrupts, such as due to messages from the VF using the virtchnl
         communication
       * processing such as device reset
       * commands to add or remove VFs
      
      The current implementation does not keep track of when all threads are
      done operating on a VF and can potentially result in use-after-free
      issues caused by one thread accessing a VF structure after it has been
      released when removing VFs. Some of these are prevented with various
      state flags and checks.
      
      In addition, this structure is quite static and does not support a
      planned future where virtualization can be more dynamic. As we begin to
      look at supporting Scalable IOV with the ice driver (as opposed to just
      supporting Single Root IOV), this structure is not sufficient.
      
      In the future, VFs will be able to be added and removed individually and
      dynamically.
      
      To allow for this, and to better protect against a whole class of
      use-after-free bugs, replace the VF storage with a combination of a hash
      table and krefs to reference track all of the accesses to VFs through
      the hash table.
      
      A hash table still allows efficient look up of the VF given its ID, but
      also allows adding and removing VFs. It does not require contiguous VF
      IDs.
      
      The use of krefs allows the cleanup of the VF memory to be delayed until
      after all threads have released their reference (by calling ice_put_vf).
      
      To prevent corruption of the hash table, a combination of RCU and the
      mutex table_lock is used. Addition and removal from the hash table use
      the RCU-aware hash macros. This allows simple read-only lookups that
      iterate to locate a single VF to be fast using RCU. Accesses which
      modify the hash table, or which can't take RCU because they sleep, will
      hold the mutex lock.
      
      By using this design, we have a stronger guarantee that the VF structure
      can't be released until after all threads are finished operating on it.
      We also pave the way for the more dynamic Scalable IOV implementation in
      the future.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
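The lifetime rule described in this commit (a lookup hands out a counted reference, and the memory is released only after the last holder drops it) can be illustrated with a small userspace sketch. This is not the driver's code: the names `vf_entry`, `vf_table`, `vf_get_by_id`, `vf_put`, `vf_table_add`, `vf_table_remove`, and `vf_released_count` are hypothetical, a C11 atomic counter stands in for the kref, and the sketch is single-threaded, so it omits the RCU and table_lock protection the commit relies on.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define VF_HASH_BUCKETS 8 /* assumption: arbitrary small bucket count */

/* Hypothetical stand-in for a VF entry; not the driver's struct ice_vf. */
struct vf_entry {
	unsigned int vf_id;
	atomic_int refcnt;     /* plays the role of the kref */
	struct vf_entry *next; /* chain within one hash bucket */
};

struct vf_table {
	struct vf_entry *buckets[VF_HASH_BUCKETS];
};

/* Counts released entries so the delayed free is observable. */
int vf_released_count;

/* Drop one reference; free the entry when the last one goes away. */
void vf_put(struct vf_entry *vf)
{
	if (atomic_fetch_sub(&vf->refcnt, 1) == 1) {
		free(vf);
		vf_released_count++;
	}
}

/* Insert a new VF; the table holds the initial reference. 0 on success. */
int vf_table_add(struct vf_table *t, unsigned int vf_id)
{
	struct vf_entry *vf = malloc(sizeof(*vf));

	if (!vf)
		return -1;
	vf->vf_id = vf_id;
	atomic_init(&vf->refcnt, 1);
	vf->next = t->buckets[vf_id % VF_HASH_BUCKETS];
	t->buckets[vf_id % VF_HASH_BUCKETS] = vf;
	return 0;
}

/* Look up a VF by ID and return it with an extra reference, or NULL. */
struct vf_entry *vf_get_by_id(struct vf_table *t, unsigned int vf_id)
{
	struct vf_entry *vf;

	for (vf = t->buckets[vf_id % VF_HASH_BUCKETS]; vf; vf = vf->next) {
		if (vf->vf_id == vf_id) {
			atomic_fetch_add(&vf->refcnt, 1);
			return vf;
		}
	}
	return NULL;
}

/* Unlink a VF and drop the table's reference; -1 if the ID is absent. */
int vf_table_remove(struct vf_table *t, unsigned int vf_id)
{
	struct vf_entry **pp = &t->buckets[vf_id % VF_HASH_BUCKETS];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->vf_id == vf_id) {
			struct vf_entry *vf = *pp;

			*pp = vf->next;
			vf_put(vf);
			return 0;
		}
	}
	return -1;
}
```

In this sketch, a caller that obtained an entry through `vf_get_by_id` can keep using it after `vf_table_remove` runs; the entry is freed only when that caller invokes `vf_put`.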
    • ice: introduce VF accessor functions · fb916db1
      Jacob Keller committed
      Before we switch the VF data structure storage mechanism to a hash,
      introduce new accessor functions to define the new interface.
      
      * ice_get_vf_by_id is a function used to obtain a reference to a VF from
        the table based on its VF ID
      * ice_has_vfs is used to quickly check if any VFs are configured
      * ice_get_num_vfs is used to get an exact count of how many VFs are
        configured
      
      We can drop the old ice_validate_vf_id function, since every caller was
      just going to immediately access the VF table to get a reference
      anyways. This way we simply use the single ice_get_vf_by_id to both
      validate the VF ID is within range and that there exists a VF with that
      ID.
      
      This change enables us to more easily convert the codebase to the hash
      table since most callers now properly use the interface.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: factor VF variables to separate structure · 000773c0
      Jacob Keller committed
      We maintain a number of values for VFs within the ice_pf structure. This
      includes the VF table, the number of allocated VFs, the maximum number
      of supported SR-IOV VFs, the number of queue pairs per VF, the number of
      MSI-X vectors per VF, and a bitmap of the VFs with detected MDD events.
      
      We're about to add a few more variables to this list. Clean this up
      first by extracting these members out into a new ice_vfs structure
      defined in ice_virtchnl_pf.h.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: convert ice_for_each_vf to include VF entry iterator · c4c2c7db
      Jacob Keller committed
      The ice_for_each_vf macro is intended to be used to loop over all VFs.
      The current implementation relies on an iterator that is the index into
      the VF array in the PF structure. This forces all users to perform a
      look up themselves.
      
      This abstraction forces a lot of duplicate work on callers and leaks the
      interface implementation to the caller. Replace this with an
      implementation that includes the VF pointer as the primary iterator. This
      version simplifies callers which just want to iterate over every VF, as
      they no longer need to perform their own lookup.
      
      The "i" iterator value is replaced with a new unsigned int "bkt"
      parameter, as this will match the necessary interface for replacing
      the VF array with a hash table. For now, the bkt is the VF ID, but in
      the future it will simply be the hash bucket index. Document that it
      should not be treated as a VF ID.
      
      This change aims to simplify switching from the array to a hash table. I
      considered alternative implementations such as an xarray but decided
      that the hash table was the simplest and most suitable implementation. I
      also looked at methods to hide the bkt iterator entirely, but I couldn't
      come up with a feasible solution that worked for hash table iterators.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
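The iterator shape this commit describes, a loop macro that yields the VF pointer directly and exposes only an opaque `bkt` cursor, might look roughly like the following. It is a simplified sketch over a plain array, not the driver's ice_for_each_vf macro; `struct vf`, `struct vf_array`, `for_each_vf`, and `count_trusted_vfs` are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified VF and container types for illustration only. */
struct vf {
	unsigned int vf_id;
	int trusted;
};

struct vf_array {
	struct vf *vfs;
	unsigned int num_vfs;
};

/*
 * Iterate over every VF. "bkt" is an opaque cursor: here it happens to be
 * the array index, but callers should not treat it as a VF ID.
 */
#define for_each_vf(tbl, bkt, vf)                                        \
	for ((bkt) = 0;                                                  \
	     (bkt) < (tbl)->num_vfs && ((vf) = &(tbl)->vfs[(bkt)], 1);   \
	     (bkt)++)

/* Example caller: no manual lookup by index is needed inside the loop. */
unsigned int count_trusted_vfs(const struct vf_array *tbl)
{
	const struct vf *vf;
	unsigned int bkt, count = 0;

	for_each_vf(tbl, bkt, vf)
		if (vf->trusted)
			count++;
	return count;
}
```

Because callers only ever touch `vf`, the backing store could later change from an array to hash buckets without rewriting the loop bodies.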
    • ice: use ice_for_each_vf for iteration during removal · 19281e86
      Jacob Keller committed
      When removing VFs, the driver takes a weird approach of assigning
      pf->num_alloc_vfs to 0 before iterating over the VFs using a temporary
      variable.
      
      This logic has been in the driver for a long time, and seems to have
      been carried forward from i40e.
      
      We want to refactor the way VFs are stored, and iterating over the data
      structure without the ice_for_each_vf interface impedes this work.
      
      The logic relies on implicitly using the num_alloc_vfs as a sort of
      "safeguard" for accessing VF data.
      
      While this sort of guard makes sense for Single Root IOV where all VFs
      are added at once, the data structures don't work for VFs which can be
      added and removed dynamically. We also have a separate state flag,
      ICE_VF_DEINIT_IN_PROGRESS, which is a stronger protection against
      concurrent removal and access.
      
      Avoid the custom tmp iteration and replace it with the standard
      ice_for_each_vf iterator. Delay the assignment of num_alloc_vfs until
      after this loop finishes.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: remove checks in ice_vc_send_msg_to_vf · 59e1f857
      Jacob Keller committed
      The ice_vc_send_msg_to_vf function is used by the PF to send a response
      to a VF. This function has overzealous checks to ensure it's not passed a
      NULL VF pointer and to ensure that the passed in struct ice_vf has a
      valid vf_id sub-member.
      
      These checks have existed since commit 1071a835 ("ice: Implement
      virtchnl commands for AVF support") and function as simple sanity
      checks.
      
      We are planning to refactor the ice driver to use a hash table along
      with appropriate locks in a future refactor. This change will modify how
      the ice_validate_vf_id function works. Instead of a simple >= check to
      ensure the VF ID is between some range, it will check the hash table to
      see if the specified VF ID is actually in the table. This requires that
      the function properly lock the table to prevent race conditions.
      
      The checks may seem ok at first glance, but they don't really provide
      much benefit.
      
      In order for ice_vc_send_msg_to_vf to have these checks fail, the
      callers must either (1) pass NULL as the VF, (2) construct an invalid VF
      pointer manually, or (3) be using a VF pointer which becomes invalid
      after they obtain it properly using ice_get_vf_by_id.
      
      For (1), a cursory glance over callers of ice_vc_send_msg_to_vf can show
      that in most cases the functions already operate assuming their VF
      pointer is valid, such as by dereferencing vf->pf or other members.
      
      They obtain the VF pointer by accessing the VF array using the VF ID,
      which can never produce a NULL value (since it's a simple address
      operation on the array, it will not be NULL).
      
      The sole exception for (1) is that ice_vc_process_vf_msg will forward a
      NULL VF pointer to this function as part of its goto error handler
      logic. This requires some minor cleanup to simply exit immediately when
      an invalid VF ID is detected (rather than use the same error flow as
      the rest of the function).
      
      For (2), it is unexpected for a flow to construct a VF pointer manually
      instead of accessing the VF array. Defending against this is likely to
      just hide bad programming.
      
      For (3), it is definitely true that VF pointers could become invalid,
      for example if a thread is processing a VF message while the VF gets
      removed. However, the correct solution is not to add additional checks
      like this, which are not guaranteed to prevent the race. Instead we plan to
      solve the root of the problem by preventing the possibility entirely.
      
      This solution will require the change to a hash table with proper
      locking and reference counts of the VF structures. When this is done,
      ice_validate_vf_id will require locking of the hash table. This will be
      problematic because all of the callers of ice_vc_send_msg_to_vf will
      already have to take the lock to obtain the VF pointer anyways. With a
      mutex, this leads to a double lock that could hang the kernel thread.
      
      Avoid this by removing the checks which don't provide much value, so
      that we can safely add the necessary protections properly.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: move VFLR acknowledge during ice_free_vfs · 44efe75f
      Jacob Keller committed
      After removing all VFs, the driver clears the VFLR indication for VFs.
      This has been in ice since the beginning of SR-IOV support in the ice
      driver.
      
      The implementation was copied from i40e, and the motivation for the VFLR
      indication clearing is described in commit f7414531 ("i40e:
      acknowledge VFLR when disabling SR-IOV").
      
      The commit explains that we need to clear the VFLR indication because
      the virtual function undergoes a VFLR event. If we don't indicate that
      it is complete it can cause an issue when VFs are re-enabled due to
      a "phantom" VFLR.
      
      The register block read was added under a pci_vfs_assigned check
      originally. This was done because we added the check after calling
      pci_disable_sriov. This was later moved to disable SRIOV earlier in the
      flow so that the VF drivers could be torn down before we removed
      functionality.
      
      Move the VFLR acknowledge into the main loop that tears down VF
      resources. This avoids using the tmp value for iterating over VFs
      multiple times. The result will make it easier to refactor the VF array
      in a future change.
      
      It's possible we might want to modify this flow to also stop checking
      pci_vfs_assigned. However, it seems reasonable to keep this change: we
      should only clear the VFLR if we actually disabled SR-IOV.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: move clear_malvf call in ice_free_vfs · 294627a6
      Jacob Keller committed
      The ice_mbx_clear_malvf function is used to clear the indication and
      count of how many times a VF was detected as malicious. During
      ice_free_vfs, we use this function to ensure that all removed VFs are
      reset to a clean state.
      
      The call currently is done at the end of ice_free_vfs() using a tmp
      value to iterate over all of the entries in the bitmap.
      
      This separate iteration using tmp is problematic for a planned refactor
      of the VF array data structure. To avoid this, let's move the call
      slightly higher into the function inside the loop where we tear down all
      of the VFs. This avoids one use of the tmp value used for iteration.
      We'll fix the other user in a future change.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: pass num_vfs to ice_set_per_vf_res() · cd0f4f3b
      Jacob Keller committed
      We are planning to replace the simple array structure tracking VFs with
      a hash table. This change will also remove the "num_alloc_vfs" variable.
      
      Instead, new access functions to use the hash table as the source of
      truth will be introduced. These will generally be equivalent to existing
      checks, except during VF initialization.
      
      Specifically, ice_set_per_vf_res() cannot use the hash table as it will
      be operating prior to VF structures being inserted into the hash table.
      
      Instead of using pf->num_alloc_vfs, simply pass the num_vfs value in
      from the caller.
      
      Note that a sub-function of ice_set_per_vf_res, ice_determine_res, also
      implicitly depends on pf->num_alloc_vfs. Replace ice_determine_res with
      a simpler inline implementation based on rounddown_pow_of_two. Note that
      we must explicitly check that the argument is non-zero since it does not
      play well with zero as a value.
      
      Instead of using the function and while loop, simply calculate the
      number of queues we have available by dividing by num_vfs. Check if the
      desired queues are available. If not, round down to the nearest power of
      2 that fits within our available queues.
      
      This matches the behavior of ice_determine_res but is easier to follow
      as simple in-line logic. Remove ice_determine_res entirely.
      
      With this change, we no longer depend on the pf->num_alloc_vfs during
      the initialization phase of VFs. This will allow us to safely remove it
      in a future planned refactor of the VF data structures.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
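The per-VF queue calculation described in this commit (divide the available queues by num_vfs, use the desired count if it fits, otherwise round down to a power of two, guarding against zero) can be sketched in userspace as follows. The helper names `rounddown_pow2` and `queues_per_vf` are hypothetical; `rounddown_pow2` is a stand-in for the kernel's rounddown_pow_of_two, and the real driver logic may apply further limits that are not shown here.

```c
/*
 * Userspace stand-in for the kernel's rounddown_pow_of_two().
 * The argument must be non-zero, which is why callers check first.
 */
unsigned int rounddown_pow2(unsigned int n)
{
	unsigned int p = 1;

	/* Double p while the next power of two still fits within n. */
	while (p <= n / 2)
		p *= 2;
	return p;
}

/*
 * Return how many queues each VF gets, or 0 if none can be assigned.
 * avail:   queues available to share between VFs
 * num_vfs: number of VFs being created (passed in by the caller)
 * desired: preferred number of queues per VF
 */
unsigned int queues_per_vf(unsigned int avail, unsigned int num_vfs,
			   unsigned int desired)
{
	unsigned int per_vf;

	if (num_vfs == 0)
		return 0;

	per_vf = avail / num_vfs;
	if (per_vf >= desired)
		return desired;

	/* Explicit zero check: rounding down zero is not meaningful. */
	if (per_vf == 0)
		return 0;

	return rounddown_pow2(per_vf);
}
```

For example, 100 available queues shared by 16 VFs leaves 6 per VF, which this sketch rounds down to 4 when 16 were desired.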
    • ice: store VF pointer instead of VF ID · b03d519d
      Jacob Keller committed
      The VSI structure contains a vf_id field used to associate a VSI with a
      VF. This is used mainly for ICE_VSI_VF as well as partially for
      ICE_VSI_CTRL associated with the VFs.
      
      This API was designed with the idea that VFs are stored in a simple
      array that was expected to be static throughout most of the driver's
      life.
      
      We plan on refactoring VF storage in a few key ways:
      
        1) converting from a simple static array to a hash table
        2) using krefs to track VF references obtained from the hash table
        3) using RCU to delay release of VF memory until after all references
           are dropped
      
      This is motivated by the goal to ensure that the lifetime of VF
      structures is accounted for, and prevent various use-after-free bugs.
      
      With the existing vsi->vf_id, the reference tracking for VFs would
      become somewhat convoluted, because each VSI maintains a vf_id field
      which will then require performing a look up. This means all these flows
      will require reference tracking and proper usage of rcu_read_lock, etc.
      
      We know that the VF VSI will always be backed by a valid VF structure,
      because the VSI is created during VF initialization and removed before
      the VF is destroyed. Rely on this and store a reference to the VF in the
      VSI structure instead of storing a VF ID. This will simplify the usage
      and avoid the need to perform lookups on the hash table in the future.
      
      For ICE_VSI_VF, it is expected that vsi->vf is always non-NULL after
      ice_vsi_alloc succeeds. Because of this, use WARN_ON when checking if a
      vsi->vf pointer is valid when dealing with VF VSIs. This will aid in
      debugging code which violates this assumption and avoid more disastrous
      panics.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: refactor unwind cleanup in eswitch mode · df830543
      Jacob Keller committed
      The code for supporting eswitch mode and port representors on VFs uses
      an unwind based cleanup flow when handling errors.
      
      These flows are used to clean up and get everything back to the state
      prior to attempting to switch from legacy to representor mode or back.
      
      The unwind iterations make sense, but complicate a plan to refactor the
      VF array structure. In the future we won't have a clean method of
      reversing an iteration of the VFs.
      
      Instead, we can change the cleanup flow to just iterate over all VF
      structures and clean up appropriately.
      
      First notice that ice_repr_add_for_all_vfs and ice_repr_rem_from_all_vfs
      have an additional step of re-assigning the VC ops. There is no good
      reason to do this outside of ice_repr_add and ice_repr_rem. It can
      simply be done as the last step of these functions.
      
      Second, make sure ice_repr_rem is safe to call on a VF which does not
      have a representor. Check if vf->repr is NULL first and exit early if
      so.
      
      Move ice_repr_rem_from_all_vfs above ice_repr_add_for_all_vfs so that we
      can call it from the cleanup function.
      
      In ice_eswitch.c, replace the unwind iteration with a call to
      ice_eswitch_release_reprs. This will go through all of the VFs and
      revert the VF back to the standard model without the eswitch mode.
      
      To make this safe, ensure this function checks whether or not the
      representor has been moved. Rely on the metadata destination in
      vf->repr->dst. This must be NULL if the representor has not been moved
      to eswitch mode.
      
      Ensure that we always re-assign this value back to NULL after freeing
      it, and move the ice_eswitch_release_reprs so that it can be called from
      the setup function.
      
      With these changes, eswitch cleanup no longer uses an unwind flow that
      is problematic for the planned VF data structure change.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  2. 03 Mar, 2022 · 29 commits
    • Merge branch 'ptp-ocp-next' · 25bf4df4
      David S. Miller committed
      Jonathan Lemon says:
      
      ====================
      ptp: ocp: TOD and monitoring updates
      
      Add a series of patches for monitoring the status of the
      driver and adjusting TOD handling, especially around leap seconds.
      
      Add documentation for the new sysfs nodes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • docs: ABI: Document new timecard sysfs nodes. · 4db07317
      Jonathan Lemon committed
      Add documentation for the tod_correction, clock_status_drift,
      and clock_status_offset nodes.
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: ocp: adjust utc_tai_offset to TOD info · e68462a0
      Vadim Fedorenko committed
      utc_tai_offset is used to correct IRIG, DCF and NMEA outputs and is
      set during initialisation but is not corrected during a leap second
      announce event.  Add watchdog code to control this correction.
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: ocp: add tod_correction attribute · 44a412d1
      Vadim Fedorenko committed
      TOD correction register is used to compensate for leap seconds in
      different domains.  Export it as an attribute with write access.
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: ocp: Expose clock status drift and offset · 2f23f486
      Vadim Fedorenko committed
      Monitoring of clock variance could be done through checking
      the offset and the drift updates that are applied to atomic
      clocks.  Expose these values as attributes for the timecard.
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: ocp: add TOD debug information · 9f492c4c
      Vadim Fedorenko committed
      TOD information is currently displayed only on module load,
      which doesn't provide updated information as the system runs.
      
      Create a debug file which provides the current TOD status information,
      and move the information display there.
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'skb-mono-delivery-time' · 01e2d157
      David S. Miller committed
      Martin KaFai Lau says:
      
      ====================
      Preserve mono delivery time (EDT) in skb->tstamp
      
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to decide when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell whether skb->tstamp holds the (rcv) timestamp
      or the delivery_time, so it is always reset to 0 whenever forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egresses to a fq@phy-dev.
      
      This set is to keep the mono delivery time and make it available to
      the final egress interface.  Please see individual patch for
      the details.
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
      
      v6:
      - Add kdoc and use non-UAPI type in patch 6 (Jakub)
      
      v5:
      netdev:
      - Patch 3 in v4 is broken down into smaller patches 3, 4, and 5 in v5
      - The mono_delivery_time bit clearing in __skb_tstamp_tx() is
        done in __net_timestamp() instead.  This is patch 4 in v5.
      - Missed a skb_clear_delivery_time() for the 'skip_classify' case
        in dev.c in v4.  That is fixed in patch 5 in v5 for correctness.
        The skb_clear_delivery_time() will be moved to a later
        stage in Patch 10, so it was an intermediate error in v4.
      - Added delivery time handling for nfnetlink_{log, queue}.c in patch 9 (Daniel)
      - Added delivery time handling in the IPv6 IOAM hop-by-hop option which has
        an experimental IANA assigned value 49 in patch 8
      - Added delivery time handling in nf_conntrack for the ipv6 defrag case
        in patch 7
      - Removed unlikely() from testing skb->mono_delivery_time (Daniel)
      
      bpf:
      - Remove the skb->tstamp dance in ingress.  Depends on bpf insn
        rewrite to return 0 if skb->tstamp has delivery time in patch 11.
        This keeps backward compatibility with the existing tc-bpf@ingress in
        patch 11.
      - bpf_set_delivery_time() will also allow dtime == 0 and
        dtime_type == BPF_SKB_DELIVERY_TIME_NONE as argument
        in patch 12.
      
      v4:
      netdev:
      - Push the skb_clear_delivery_time() from
        ip_local_deliver() and ip6_input() to
        ip_local_deliver_finish() and ip6_input_finish()
        to accommodate the ipvs forward path.
        This is the notable change in v4 at the netdev side.
      
          - Patch 3/8 first does the skb_clear_delivery_time() after
            sch_handle_ingress() in dev.c and this will make the
            tc-bpf forward path work via the bpf_redirect_*() helper.
      
          - The next patch 4/8 (new in v4) will then postpone the
            skb_clear_delivery_time() from dev.c to
            the ip_local_deliver_finish() and ip6_input_finish() after
            taking care of the tstamp usage in the ip defrag case.
            This will make the kernel forward path also work, e.g.
            the ip[6]_forward().
      
      - Fixed a case in v3 which missed setting the skb->mono_delivery_time bit
        when sending TCP rst/ack in some cases (e.g. from a ctl_sk).
        That case happens at ip_send_unicast_reply() and
        tcp_v6_send_response().  It is fixed in patch 1/8 (and
        then patch 3/8) in v4.
      
      bpf:
      - Adding __sk_buff->delivery_time_type instead of adding
        __sk_buff->mono_delivery_time as in v3.  The tc-bpf can stay with
        one __sk_buff->tstamp instead of having two 'time' fields
        while one is 0 and another is not.
        tc-bpf can use the new __sk_buff->delivery_time_type to tell
        what is stored in __sk_buff->tstamp.
      - bpf_skb_set_delivery_time() helper is added to set
        __sk_buff->tstamp from non mono delivery_time to
        mono delivery_time
      - Most of the convert_ctx_access() bpf insn rewrite in v3
        is gone, so no new rewrite added for __sk_buff->tstamp.
        The only rewrite added is for reading the new
        __sk_buff->delivery_time_type.
      - Added selftests, test_tc_dtime.c
      
      v3:
      - Feedback from v2 is that using shinfo(skb)->tx_flags could be racy.
      - Considered reusing a few bits in skb->tstamp to represent
        different semantics. Besides more code churn, it would break
        the bpf usecase which currently can write and then read back
        the skb->tstamp.
      - Went back to the v1 idea of adding a bit to skb and addressed the
        feedback on v1:
      - Added one bit skb->mono_delivery_time to flag that
        the skb->tstamp has the mono delivery_time (EDT), instead
        of adding a bit to flag if the skb->tstamp has been forwarded or not.
      - Instead of resetting the delivery_time back to the (rcv) timestamp
        during recvmsg syscall which may be too late and not useful,
        the delivery_time reset in v3 happens earlier once the stack
        knows that the skb will be delivered locally.
      - Handled the tapping@ingress case by af_packet
      - No need to change the (rcv) timestamp to mono clock base as in v1.
        The added one bit to flag skb->mono_delivery_time is enough
        to keep the EDT delivery_time during forward.
      - Added logic to the bpf side so that existing bpf programs
        running at ingress can still get the (rcv) timestamp
        when reading the __sk_buff->tstamp.  New __sk_buff->mono_delivery_time
        is also added.  A test is still needed for this piece.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: selftests: test skb->tstamp in redirect_neigh · c803475f
      Martin KaFai Lau committed
      This patch adds tests on forwarding the delivery_time for
      the following cases
      - tcp/udp + ip4/ip6 + bpf_redirect_neigh
      - tcp/udp + ip4/ip6 + ip[6]_forward
      - bpf_skb_set_delivery_time
      - The old rcv timestamp expectation on tc-bpf@ingress
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_skb_delivery_time() · 8d21ec0e
      Martin KaFai Lau committed
      * __sk_buff->delivery_time_type:
      This patch adds __sk_buff->delivery_time_type.  It tells if the
      delivery_time is stored in __sk_buff->tstamp or not.
      
      It will be most useful for ingress to tell if the __sk_buff->tstamp
      has the (rcv) timestamp or delivery_time.  If delivery_time_type
      is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp.
      
      Two non-zero types are defined for the delivery_time_type,
      BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC.  UNSPEC
      can only happen in egress because only mono delivery_time can be
      forwarded to ingress now.  The clock of UNSPEC delivery_time
      can be deduced from the skb->sk->sk_clockid, which is also how
      sch_etf does it.
      
      * Provide forwarded delivery_time to tc-bpf@ingress:
      With the help of the new delivery_time_type, the tc-bpf has a way
      to tell if the __sk_buff->tstamp has the (rcv) timestamp or
      the delivery_time.  During bpf load time, the verifier will learn if
      the bpf prog has accessed the new __sk_buff->delivery_time_type.
      If it does, it means the tc-bpf@ingress is expecting the
      skb->tstamp could have the delivery_time.  The kernel will then
      read the skb->tstamp as-is during bpf insn rewrite without
      checking the skb->mono_delivery_time.  This is done by adding a
      new prog->delivery_time_access bit.  The same goes for
      writing skb->tstamp.
      
      * bpf_skb_set_delivery_time():
      The bpf_skb_set_delivery_time() helper is added to allow setting both
      delivery_time and the delivery_time_type at the same time.  If the
      tc-bpf does not need to change the delivery_time_type, it can directly
      write to the __sk_buff->tstamp as the existing tc-bpf has already been
      doing.  It will be most useful at ingress to change the
      __sk_buff->tstamp from the (rcv) timestamp to
      a mono delivery_time and then bpf_redirect_*().
      
      bpf only has a mono clock helper (bpf_ktime_get_ns), and
      the current known use case is the mono EDT for fq, and
      only mono delivery time can be kept during forward now,
      so bpf_skb_set_delivery_time() only supports setting
      BPF_SKB_DELIVERY_TIME_MONO.  It can be extended later when use cases
      come up and the forwarding path also supports other clock bases.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d21ec0e
    • M
      bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress · 7449197d
      Martin KaFai Lau committed
      The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
      as a (rcv) timestamp, which currently can be either 0 (not available)
      or ktime_get_real().  This patch stays backward compatible with the
      (rcv) timestamp expectation at ingress.  If the skb->tstamp has
      the delivery_time, the bpf insn rewrite will read 0 for tc-bpf
      running at ingress, as the (rcv) timestamp is not available.  When
      writing at ingress, it will also clear the skb->mono_delivery_time bit.
      
      /* BPF_READ: a = __sk_buff->tstamp */
      if (!skb->tc_at_ingress || !skb->mono_delivery_time)
      	a = skb->tstamp;
      else
      	a = 0;
      
      /* BPF_WRITE: __sk_buff->tstamp = a */
      if (skb->tc_at_ingress)
      	skb->mono_delivery_time = 0;
      skb->tstamp = a;
      
      [ A note on the BPF_CGROUP_INET_INGRESS which can also access
        skb->tstamp.  At that point, the skb is delivered locally
        and skb_clear_delivery_time() has already been done,
        so the skb->tstamp will only have the (rcv) timestamp. ]
      
      If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
      has to be cleared also.  It could be done together during
      convert_ctx_access().  However, a later patch will also expose
      the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
      Changing the delivery_time_type in the background may surprise
      the user, e.g. the 2nd read on __sk_buff->delivery_time_type
      may need a READ_ONCE() to avoid compiler optimization.  Thus,
      anticipating the needs of that later patch, this patch does a
      check on !skb->tstamp after running the tc-bpf and clears the
      skb->mono_delivery_time bit if needed.  See the earlier discussion
      on v4 [0].
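      
      The read rule, the write rule and the check after the program run can
      be put together in a user-space C model.  It is a sketch only: the
      real logic is emitted as rewritten bpf instructions, and struct
      model_skb and the model_* function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields used by the rewrite. */
struct model_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
	bool tc_at_ingress;
};

/* BPF_READ: a = __sk_buff->tstamp */
uint64_t model_read_tstamp(const struct model_skb *skb)
{
	assert(skb);
	if (!skb->tc_at_ingress || !skb->mono_delivery_time)
		return skb->tstamp;
	/* At ingress a delivery_time is hidden: no (rcv) timestamp. */
	return 0;
}

/* BPF_WRITE: __sk_buff->tstamp = a */
void model_write_tstamp(struct model_skb *skb, uint64_t a)
{
	assert(skb);
	if (skb->tc_at_ingress)
		skb->mono_delivery_time = false;
	skb->tstamp = a;
}

/* Check done once, after the tc-bpf program has returned. */
void model_after_tc_bpf(struct model_skb *skb)
{
	assert(skb);
	if (!skb->tstamp)
		skb->mono_delivery_time = false;
}
```

      In this model a write of 0 at egress leaves the bit untouched while
      the program runs, and only the final check clears it, so a second
      read of the type inside the program sees the same value as the first.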
      
      The bpf insn rewrite requires the skb's mono_delivery_time bit and
      tc_at_ingress bit.  They are moved up in sk_buff so that bpf rewrite
      can be done at a fixed offset.  tc_skip_classify is moved together with
      tc_at_ingress.  To get one bit for mono_delivery_time, csum_not_inet is
      moved down and this bit is currently used by sctp.
      
      [0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7449197d
    • M
      net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally · cd14e9b7
      Martin KaFai Lau committed
      The previous patches handled the delivery_time in the ingress path
      before the routing decision is made.  This patch can now postpone
      clearing the delivery_time in a skb until it is known to be delivered
      locally, and set the (rcv) timestamp at that point if needed.  This
      patch moves the skb_clear_delivery_time() from dev.c to
      ip_local_deliver_finish() and ip6_input_finish().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd14e9b7
    • M
      net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c · 80fcec67
      Martin KaFai Lau committed
      If skb has the (rcv) timestamp available, nfnetlink_{log, queue}.c
      logs/outputs it to the userspace.  When the locally generated skb is
      looping from egress to ingress over a virtual interface (e.g. veth,
      loopback...), skb->tstamp may have the delivery time before it is
      known that it will be delivered locally and received by another sk.
      As with the delivery time handling in network tapping, use
      ktime_get_real() to get the (rcv) timestamp.  The earlier added helper
      skb_tstamp_cond() is used to do this.  false is passed to the second
      'cond' arg such that doing ktime_get_real() or not only depends on the
      netstamp_needed_key static key.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80fcec67
    • M
      net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option · b6561f84
      Martin KaFai Lau committed
      IOAM is a hop-by-hop option with a temporary IANA allocation (49).
      Since it is hop-by-hop, it is done before the input routing decision.
      One of the traced data fields is the (rcv) timestamp.
      
      When the locally generated skb is looping from egress to ingress over
      a virtual interface (e.g. veth, loopback...), skb->tstamp may have the
      delivery time before it is known that it will be delivered locally
      and received by another sk.
      
      Like handling the network tapping (tcpdump) in the earlier patch,
      this patch gets the timestamp if needed without overwriting the
      delivery_time in the skb->tstamp.  skb_tstamp_cond() is added to do the
      ktime_get_real() with an extra cond arg to check on top of the
      netstamp_needed_key static key.  skb_tstamp_cond() will also be used in
      a later patch and it needs the netstamp_needed_key check.
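      
      A user-space C model of the behaviour described for skb_tstamp_cond()
      might look like the following.  It is a sketch under assumptions: the
      netstamp_needed_key static key and ktime_get_real() are replaced by
      the plain parameters netstamp_needed and now_real, and struct
      model_skb is a hypothetical stand-in for sk_buff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct model_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

/*
 * Return a (rcv) timestamp for the skb without modifying it, so a
 * delivery_time stored in tstamp survives for later forwarding.
 */
uint64_t model_skb_tstamp_cond(const struct model_skb *skb, bool cond,
			       bool netstamp_needed, uint64_t now_real)
{
	assert(skb);
	/* A real (rcv) timestamp is already there: use it. */
	if (!skb->mono_delivery_time && skb->tstamp)
		return skb->tstamp;
	/* Otherwise take the clock only if someone asked for it. */
	if (netstamp_needed || cond)
		return now_real;
	return 0;
}
```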
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6561f84
    • M
      net: ipv6: Handle delivery_time in ipv6 defrag · 335c8cf3
      Martin KaFai Lau committed
      A later patch will postpone the delivery_time clearing until the stack
      knows the skb is being delivered locally (i.e. calling
      skb_clear_delivery_time() at ip_local_deliver_finish() for IPv4
      and at ip6_input_finish() for IPv6).  That will allow other kernel
      forwarding paths (e.g. ip[6]_forward) to keep the delivery_time also.
      
      Very similar IPv6 defrag code has been duplicated in
      multiple places: regular IPv6, nf_conntrack, and 6lowpan.
      
      Unlike the IPv4 defrag which is done before ip_local_deliver_finish(),
      the regular IPv6 defrag is done after ip6_input_finish().
      Thus, no change should be needed in the regular IPv6 defrag
      logic because skb_clear_delivery_time() should have been called.
      
      6lowpan also does not need special handling on delivery_time
      because it is a non-inet packet_type.
      
      However, nf_conntrack has a case in NF_INET_PRE_ROUTING that needs
      to do the IPv6 defrag earlier.  Thus, it needs to save the
      mono_delivery_time bit in the inet_frag_queue which is similar
      to how it is handled in the previous patch for the IPv4 defrag.
      
      This patch chooses to do it consistently and stores the mono_delivery_time
      in the inet_frag_queue for all cases such that it will be easier
      for the future refactoring effort on the IPv6 reasm code.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      335c8cf3
    • M
      net: ip: Handle delivery_time in ip defrag · 8672406e
      Martin KaFai Lau committed
      A later patch will postpone the delivery_time clearing until the stack
      knows the skb is being delivered locally.  That will allow other kernel
      forwarding paths (e.g. ip[6]_forward) to keep the delivery_time also.
      
      An earlier attempt was to do skb_clear_delivery_time() in
      ip_local_deliver() and ip6_input().  The discussion [0] requested
      to move it one step later into ip_local_deliver_finish()
      and ip6_input_finish() so that the delivery_time can be kept
      for the ip_vs forwarding path also.
      
      To do that, this patch also needs to take care of the (rcv) timestamp
      use case in ip_is_fragment().  It needs to expect delivery_time in
      the skb->tstamp, so it needs to save the mono_delivery_time bit in
      inet_frag_queue such that the delivery_time (if any) can be restored
      in the final defragmented skb.
      
      [Note that it will only happen when the locally generated skb is looping
       from egress to ingress over a virtual interface (e.g. veth, loopback...).
       In that case skb->tstamp may have the delivery time before it is known
       that it will be delivered locally and received by another sk.]
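      
      The save-and-restore idea can be illustrated with a user-space C
      model.  It is a sketch only: struct model_skb, struct
      model_frag_queue and both function names are hypothetical and stand
      in for sk_buff, inet_frag_queue and the reassembly code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for sk_buff and inet_frag_queue. */
struct model_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

struct model_frag_queue {
	uint64_t stamp;
	bool mono_delivery_time;
};

/* When a fragment is queued, remember its tstamp and what it means. */
void model_frag_queue_insert(struct model_frag_queue *q,
			     const struct model_skb *frag)
{
	assert(q && frag);
	q->stamp = frag->tstamp;
	q->mono_delivery_time = frag->mono_delivery_time;
}

/* On reassembly, restore both into the final defragmented skb. */
void model_frag_reasm_finish(const struct model_frag_queue *q,
			     struct model_skb *head)
{
	assert(q && head);
	head->tstamp = q->stamp;
	head->mono_delivery_time = q->mono_delivery_time;
}
```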
      
      [0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8672406e
    • M
      net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() · d98d58a0
      Martin KaFai Lau committed
      The previous patches handled the delivery_time before sch_handle_ingress().
      
      This patch can now set the skb->mono_delivery_time to flag that the
      skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv)
      timestamp, and also clear it with skb_clear_delivery_time() after
      sch_handle_ingress().  This makes bpf_redirect_*() keep the mono
      delivery_time so that it can be used by a qdisc (fq) of
      the egress-ing interface.
      
      A later patch will postpone the skb_clear_delivery_time() until the
      stack learns that the skb is being delivered locally and that will
      make other kernel forwarding paths (ip[6]_forward) able to keep
      the delivery_time also.  Thus, like the previous patches on using
      the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
      is not limited within the CONFIG_NET_INGRESS to avoid too much code
      churn among this set.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d98d58a0
    • M
      net: Clear mono_delivery_time bit in __skb_tstamp_tx() · d93376f5
      Martin KaFai Lau committed
      __skb_tstamp_tx() may clone the egress skb and queue the clone to
      the sk_error_queue.  The outgoing skb may have the mono delivery_time
      while the (rcv) timestamp is expected for the clone, so the
      skb->mono_delivery_time bit needs to be cleared from the clone.
      
      This patch adds the skb->mono_delivery_time clearing to the existing
      __net_timestamp() and uses it in __skb_tstamp_tx().
      The __net_timestamp() fast path usage in dev.c is changed to directly
      call ktime_get_real() since the mono_delivery_time bit is not set at
      that point.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d93376f5
    • M
      net: Handle delivery_time in skb->tstamp during network tapping with af_packet · 27942a15
      Martin KaFai Lau committed
      A later patch will set the skb->mono_delivery_time to flag that the
      skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv)
      timestamp.  skb_clear_tstamp() will then keep this delivery_time during
      forwarding.
      
      This patch makes the network tapping (with af_packet) handle
      the delivery_time stored in skb->tstamp.
      
      Regardless of tapping at the ingress or egress, the tapped skb is
      received by the af_packet socket, so it is ingress to the af_packet
      socket and it expects the (rcv) timestamp.
      
      When tapping at egress, dev_queue_xmit_nit() is used.  It has already
      expected skb->tstamp may have delivery_time, so it does
      skb_clone()+net_timestamp_set() to ensure the cloned skb has
      the (rcv) timestamp before passing to the af_packet sk.
      This patch only adds clearing of the skb->mono_delivery_time
      bit in net_timestamp_set().
      
      When tapping at ingress, it currently expects the skb->tstamp is either 0
      or the (rcv) timestamp.  Meaning, the tapping at ingress path
      has already expected the skb->tstamp could be 0 and it will get
      the (rcv) timestamp by ktime_get_real() when needed.
      
      There are two cases for tapping at ingress:
      
      One case is that af_packet queues the skb to its sk_receive_queue.
      The skb is either not shared or a new clone is created.  The newly
      added skb_clear_delivery_time() is called to clear the
      delivery_time (if any) and set the (rcv) timestamp if
      needed before the skb is queued to the sk_receive_queue.
      
      In the other case, the ingress skb is directly copied to the rx_ring
      and tpacket_get_timestamp() is used to get the (rcv) timestamp.
      The newly added skb_tstamp() is used in tpacket_get_timestamp()
      to check the skb->mono_delivery_time bit before returning skb->tstamp.
      As mentioned earlier, the tapping@ingress has already expected
      the skb may not have the (rcv) timestamp (because no sk has asked
      for it) and has handled this case by directly calling ktime_get_real().
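      
      The two helpers used for tapping at ingress can be modelled in
      user-space C as follows.  This is a sketch under assumptions: the
      static key and ktime_get_real() are replaced by the parameters
      netstamp_needed and now_real, and struct model_skb is a hypothetical
      stand-in for sk_buff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct model_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

/*
 * Queue-to-socket case: drop the delivery_time (if any) and replace it
 * with a (rcv) timestamp when one is needed, else with 0.
 */
void model_skb_clear_delivery_time(struct model_skb *skb,
				   bool netstamp_needed, uint64_t now_real)
{
	assert(skb);
	if (!skb->mono_delivery_time)
		return;
	skb->mono_delivery_time = false;
	skb->tstamp = netstamp_needed ? now_real : 0;
}

/*
 * rx_ring case: report "no (rcv) timestamp" when tstamp holds a
 * delivery_time; the caller then falls back to reading the clock.
 */
uint64_t model_skb_tstamp(const struct model_skb *skb)
{
	assert(skb);
	return skb->mono_delivery_time ? 0 : skb->tstamp;
}
```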
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27942a15
    • M
      net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101
      Martin KaFai Lau committed
      Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
      
      If skb->tstamp has the mono delivery_time, clearing it can hurt
      the performance when it finally transmits out to fq@phy-dev.
      
      The earlier patch added a skb->mono_delivery_time bit to
      flag the skb->tstamp carrying the mono delivery_time.
      
      This patch adds skb_clear_tstamp() helper which keeps
      the mono delivery_time and clears everything else.
      
      The delivery_time clearing will be postponed until the stack knows the
      skb will be delivered locally.  It will be done in a later patch.
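      
      The behaviour of the new helper can be shown with a user-space C
      model.  It is a minimal sketch: struct model_skb and
      model_skb_clear_tstamp() are hypothetical names standing in for
      sk_buff and skb_clear_tstamp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct model_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

/* Called on forwarding: keep a mono delivery_time, clear anything else. */
void model_skb_clear_tstamp(struct model_skb *skb)
{
	assert(skb);
	if (skb->mono_delivery_time)
		return;
	skb->tstamp = 0;
}
```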
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de799101
    • M
      net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau committed
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to decide when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell whether skb->tstamp has the (rcv)
      timestamp or the delivery_time, so it is always reset to 0 whenever forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egresses to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                           ^              ^
                                           reset          reset
      
      This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) and
      to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is a prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      skb_set_delivery_time() function is added.  It is used by the tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
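      
      A user-space C model of the setter might look like the following.
      It is a sketch under assumptions: it shows the intended end state
      once the bit is actually enabled by a later patch, treating a zero
      time as "no delivery_time" is an assumption of this model, and
      struct model_skb and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields involved. */
struct model_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

/* Store the delivery_time and record whether it is on the mono clock. */
void model_skb_set_delivery_time(struct model_skb *skb, uint64_t kt,
				 bool mono)
{
	assert(skb);
	skb->tstamp = kt;
	/* Assumed: a zero time never counts as a delivery_time. */
	skb->mono_delivery_time = kt && mono;
}
```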
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a
      placeholder.  It will be enabled in a later patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1ac9c8a
    • D
      Merge branch 'dsa-unicast-filtering' · 6fb8661c
      David S. Miller committed
      Vladimir Oltean says:
      
      ====================
      DSA unicast filtering
      
      This series doesn't attempt anything extremely brave, it just changes
      the way in which standalone ports which support FDB isolation work.
      
      Up until now, DSA has recommended that switch drivers configure
      standalone ports in a separate VID/FID with learning disabled, and with
      the CPU port as the only destination, reached trivially via flooding.
      That works, except that standalone ports will deliver all packets to the
      CPU. We can leverage the hardware FDB as a MAC DA filter, and disable
      flooding towards the CPU port, to force the dropping of packets with
      unknown MAC DA.
      
      We handle port promiscuity by re-enabling flooding towards the CPU port.
      This is relevant because the bridge puts its automatic (learning +
      flooding) ports in promiscuous mode, and this makes some things work
      automagically, like for example bridging with a foreign interface.
      We don't delve yet into the territory of managing CPU flooding more
      aggressively while under a bridge.
      
      The only switch driver that benefits from this work right now is the
      NXP LS1028A switch (felix). The others need to implement FDB isolation
      first, before DSA is going to install entries to the port's standalone
      database. Otherwise, these entries might collide with bridge FDB/MDB
      entries.
      
      This work was done mainly to have all the required features in place
      before somebody starts seriously architecting DSA support for multiple
      CPU ports. Otherwise it is much more difficult to bolt these features on
      top of multiple CPU ports.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fb8661c
    • V
      net: mscc: ocelot: accept configuring bridge port flags on the NPI port · ac455209
      Vladimir Oltean committed
      In order for the Felix DSA driver to be able to turn on/off flooding
      towards its CPU port, we need to redirect calls on the NPI port to
      actually act upon the index in the analyzer block that corresponds to
      the CPU port module. This was never necessary until now because DSA
      (or the bridge) never called ocelot_port_bridge_flags() for the NPI
      port.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac455209
    • V
      net: dsa: felix: stop clearing CPU flooding in felix_setup_tag_8021q · 0cc36980
      Vladimir Oltean committed
      felix_migrate_flood_to_tag_8021q_port() takes care of clearing the
      flooding bits on the old CPU port (which was the CPU port module), so
      manually clearing this bit from PGID_UC, PGID_MC, PGID_BC is redundant.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cc36980
    • V
      net: dsa: felix: start off with flooding disabled on the CPU port · 90897569
      Vladimir Oltean committed
      The driver probes with all ports as standalone, and it supports unicast
      filtering. So DSA will call port_fdb_add() for all necessary addresses
      on the current CPU port. We also handle migrations when the CPU port
      hardware resource changes (on tagging protocol change), so there should
      not be any unknown address that we have to receive while not promiscuous.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90897569
    • V
      net: dsa: felix: migrate flood settings from NPI to tag_8021q CPU port · b903a6bd
      Vladimir Oltean committed
      When the tagging protocol changes from "ocelot" to "ocelot-8021q" or in
      reverse, the DSA promiscuity setting that was applied for the old CPU
      port must be transferred to the new one.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b903a6bd
    • V
      net: dsa: felix: migrate host FDB and MDB entries when changing tag proto · f9cef64f
      Vladimir Oltean committed
      The "ocelot" and "ocelot-8021q" tagging protocols make use of different
      hardware resources, and host FDB entries have different destination
      ports in the switch analyzer module, practically speaking.
      
      So when the user requests a tagging protocol change, the driver must
      migrate all host FDB and MDB entries from the NPI port (in fact CPU port
      module) towards the same physical port, but this time used as a regular
      port.
      
      It is pointless for the felix driver to keep a copy of the host
      addresses, when we can create and export DSA helpers for walking through
      the addresses that it already needs to keep on the CPU port, for
      refcounting purposes.
      
      felix_classify_db() is moved up to avoid a forward declaration.
      
      We pass "bool change" because dp->fdbs and dp->mdbs are uninitialized
      lists when felix_setup() first calls felix_set_tag_protocol(), so we
      need to avoid calling dsa_port_walk_fdbs() during probe time.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9cef64f
    • V
      net: dsa: manage flooding on the CPU ports · 7569459a
      Vladimir Oltean committed
      DSA can treat IFF_PROMISC and IFF_ALLMULTI on standalone user ports as
      signifying whether packets with an unknown MAC DA will be received or
      not. Since known MAC DAs are handled by FDB/MDB entries, this means that
      promiscuity is analogous to including/excluding the CPU port from the
      flood domain of those packets.
      
      There are two ways to signal CPU flooding to drivers.
      
      The first (chosen here) is to synthesize a call to
      ds->ops->port_bridge_flags() for the CPU port, with a mask of
      BR_FLOOD | BR_MCAST_FLOOD. This has the effect of turning on egress
      flooding on the CPU port regardless of source.
      
      The alternative would be to create a new ds->ops->port_host_flood()
      which is called per user port. Some switches (sja1105) have a flood
      domain that is managed per {ingress port, egress port} pair, so it would
      make more sense for this kind of switch to not flood the CPU from port A
      if just port B requires it. Nonetheless, the sja1105 has other quirks
      that prevent it from making use of unicast filtering, and without a
      concrete user making use of this feature, I chose not to implement it.
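      
      The first approach can be illustrated with a user-space C model that
      derives the CPU port flood flags from the user ports' interface
      flags.  It is a sketch under assumptions: the MODEL_* constants are
      made-up bit values rather than the kernel's IFF_* and BR_* values,
      and mapping IFF_ALLMULTI to multicast flooding only while
      IFF_PROMISC enables both is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up bit values for illustration only. */
#define MODEL_IFF_PROMISC	(1u << 0)
#define MODEL_IFF_ALLMULTI	(1u << 1)
#define MODEL_BR_FLOOD		(1u << 0)
#define MODEL_BR_MCAST_FLOOD	(1u << 1)

/*
 * Compute the flood flags to apply on the CPU port.  Flooding is
 * managed per CPU port, not per {ingress, egress} pair, so one user
 * port asking for it turns it on regardless of source.
 */
unsigned int model_cpu_flood_flags(const unsigned int *user_port_flags,
				   size_t n_ports)
{
	unsigned int val = 0;
	size_t i;

	assert(user_port_flags || n_ports == 0);
	for (i = 0; i < n_ports; i++) {
		if (user_port_flags[i] & MODEL_IFF_ALLMULTI)
			val |= MODEL_BR_MCAST_FLOOD;
		if (user_port_flags[i] & MODEL_IFF_PROMISC)
			val |= MODEL_BR_FLOOD | MODEL_BR_MCAST_FLOOD;
	}
	return val;
}
```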
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7569459a
    • V
      net: dsa: install the primary unicast MAC address as standalone port host FDB · 499aa9e1
      Vladimir Oltean committed
      To be able to safely turn off CPU flooding for standalone ports, we need
      to ensure that the dev_addr of each DSA slave interface is installed as
      a standalone host FDB entry for compatible switches.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      499aa9e1
    • V
      net: dsa: install secondary unicast and multicast addresses as host FDB/MDB · 5e8a1e03
      Vladimir Oltean committed
      In preparation of disabling flooding towards the CPU in standalone ports
      mode, identify the addresses requested by upper interfaces and use the
      new API for DSA FDB isolation to request the hardware driver to offload
      these as FDB or MDB objects. The objects belong to the user port's
      database, and are installed pointing towards the CPU port.
      
      Because dev_uc_add()/dev_mc_add() is VLAN-unaware, we offload to the
      port standalone database addresses with VID 0 (also VLAN-unaware).
      So this excludes switches with global VLAN filtering from supporting
      unicast filtering, because there, it is possible for a port of a switch
      to join a VLAN-aware bridge, and this changes the VLAN awareness of
      standalone ports, requiring VLAN-aware standalone host FDB entries.
      For the same reason, hellcreek, which requires VLAN awareness in
      standalone mode, is also exempted from unicast filtering.
      
      We create "standalone" variants of dsa_port_host_fdb_add() and
      dsa_port_host_mdb_add() (and the corresponding _del functions).
      
      We also create a separate work item type for handling deferred
      standalone host FDB/MDB entries compared to the switchdev one.
      This is done for the purpose of clarity - the procedure for offloading a
      bridge FDB entry is different than offloading a standalone one, and
      the switchdev event work handles only FDBs anyway, not MDBs.
      Deferral is needed for standalone entries because ndo_set_rx_mode runs
      in atomic context. We could probably optimize things a little by first
      queuing up all entries that need to be offloaded, and scheduling the
      work item just once, but the data structures that we can pass through
      __dev_uc_sync() and __dev_mc_sync() are limiting (there is nothing like
      a void *priv), so we'd have to keep the list of queued events somewhere
      in struct dsa_switch, and possibly a lock for it. Too complicated for
      now.
      
      Adding the address to the master is handled by dev_uc_sync(), adding it
      to the hardware is handled by __dev_uc_sync(). So this is the reason why
      dsa_port_standalone_host_fdb_add() does not call dev_uc_add(). Not that
      it had the rtnl_mutex anyway - ndo_set_rx_mode has it, but is atomic.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e8a1e03