1. 10 December 2022, 1 commit
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 043cd1e2
      Committed by Jakub Kicinski
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-12-08 (ice)
      
      Jacob Keller says:
      
      This series of patches primarily consists of changes to fix some corner
      cases that can cause Tx timestamp failures. The issues were discovered and
      reported by Siddaraju DH and primarily affect E822 hardware, though this
      series also includes some improvements that affect E810 hardware as well.
      
      The primary issue is regarding the way that E822 determines when to generate
      timestamp interrupts. If the driver reads timestamp indexes which do not
      have a valid timestamp, the E822 interrupt tracking logic can get stuck.
      This is due to the way that E822 hardware tracks timestamp index reads
      internally. I was previously unaware of this behavior as it is significantly
      different in E810 hardware.
      
      Most of the fixes target refactors to ensure that the ice driver does not
      read timestamp indexes which are not valid on E822 hardware. This is done by
      using the Tx timestamp ready bitmap register from the PHY. This register
      indicates what timestamp indexes have outstanding timestamps waiting to be
      captured.
      
      Care must be taken in all cases where we read the timestamp registers, and
      thus all flows which might have read these registers are refactored. The
      ice_ptp_tx_tstamp function is modified to consolidate as much of the logic
      relating to these registers as possible. It now handles discarding stale
      timestamps which are old or which occurred after a PHC time update. This
      replaces previously standalone thread functions like the periodic work
      function and the ice_ptp_flush_tx_tracker function.
      
      In addition, some minor cleanups noticed while writing these refactors are
      included.
      
      The remaining patches refactor the E822 implementation to remove the
      "bypass" mode for timestamps. The E822 hardware has the ability to provide a
      more precise timestamp by making use of measurements of the precise way that
      packets flow through the hardware pipeline. These measurements are known as
      "Vernier" calibration. The "bypass" mode disables many of these measurements
      in favor of a faster start up time for Tx and Rx timestamping. Instead, once
      these measurements were captured, the driver tries to reconfigure the PHY to
      enable the vernier calibrations.
      
      Unfortunately this recalibration does not work. Testing indicates that the
      PHY simply remains in bypass mode without the increased timestamp precision.
      Remove the attempt at recalibration and always use vernier mode. This has
      one disadvantage that Tx and Rx timestamps cannot begin until after at least
      one packet of that type goes through the hardware pipeline. Because of this,
      further refactor the driver to separate Tx and Rx vernier calibration.
      Complete the Tx and Rx independently, enabling the appropriate type of
      timestamp as soon as the relevant packet has traversed the hardware
      pipeline. This was reported by Milena Olech.
      
      Note that although these might be considered "bug fixes", the changes
      required to appropriately resolve these issues are large. Thus it does
      not feel suitable to send this series to net.
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
        ice: reschedule ice_ptp_wait_for_offset_valid during reset
        ice: make Tx and Rx vernier offset calibration independent
        ice: only check set bits in ice_ptp_flush_tx_tracker
        ice: handle flushing stale Tx timestamps in ice_ptp_tx_tstamp
        ice: cleanup allocations in ice_ptp_alloc_tx_tracker
        ice: protect init and calibrating check in ice_ptp_request_ts
        ice: synchronize the misc IRQ when tearing down Tx tracker
        ice: check Tx timestamp memory register for ready timestamps
        ice: handle discarding old Tx requests in ice_ptp_tx_tstamp
        ice: always call ice_ptp_link_change and make it void
        ice: fix misuse of "link err" with "link status"
        ice: Reset TS memory for all quads
        ice: Remove the E822 vernier "bypass" logic
        ice: Use more generic names for ice_ptp_tx fields
      ====================
      
      Link: https://lore.kernel.org/r/20221208213932.1274143-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      043cd1e2
  2. 09 December 2022, 39 commits
    • net: openvswitch: Add support to count upcall packets · 1933ea36
      Committed by wangchuanlei
      Add support to count upcall packets: when the openvswitch kernel module
      performs an upcall, count the packets for which the upcall succeeded or
      failed. This gives a better view of how many packets are upcalled on
      each interface.
      Signed-off-by: wangchuanlei <wangchuanlei@inspur.com>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1933ea36
    • rhashtable: Allow rhashtable to be used from irq-safe contexts · e47877c7
      Committed by Tejun Heo
      rhashtable currently only does bh-safe synchronization making it impossible
      to use from irq-safe contexts. Switch it to use irq-safe synchronization to
      remove the restriction.
      
      v2: Update the lock functions to return the ulong flags value and unlock
          functions to take the value directly instead of passing around the
          pointer. Suggested by Linus.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e47877c7
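      A userspace analogy of the v2 API shape mentioned above (this is not the
      kernel rhashtable code; the names are made up): the lock helper saves the
      previous "interrupt" state, here approximated by the signal mask, and
      returns it by value, while the unlock helper takes that value directly
      instead of a pointer to it.

        #include <pthread.h>
        #include <signal.h>
        #include <stdio.h>

        static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;

        /* "irqsave": mask signals, take the lock, hand the old mask back. */
        static sigset_t lock_bucket_irqsave(void)
        {
            sigset_t all, old;

            sigfillset(&all);
            pthread_sigmask(SIG_BLOCK, &all, &old);
            pthread_mutex_lock(&bucket_lock);
            return old;
        }

        /* "irqrestore": drop the lock, then restore the saved mask. */
        static void unlock_bucket_irqrestore(sigset_t old)
        {
            pthread_mutex_unlock(&bucket_lock);
            pthread_sigmask(SIG_SETMASK, &old, NULL);
        }

        int main(void)
        {
            sigset_t flags = lock_bucket_irqsave();

            puts("critical section with signals masked");
            unlock_bucket_irqrestore(flags);
            return 0;
        }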
    • Merge branch 'net-sched-retpoline' · b602d003
      Committed by David S. Miller
      Pedro Tammela says:
      
      ====================
      net/sched: retpoline wrappers for tc
      
      In tc, all qdiscs, classifiers and actions can be compiled as modules.
      This results today in indirect calls in all transitions in the tc hierarchy.
      Due to CONFIG_RETPOLINE, CPUs with mitigations=on might pay an extra cost on
      indirect calls. For newer Intel cpus with IBRS the extra cost is
      nonexistent, but AMD Zen cpus and older x86 cpus still go through the
      retpoline thunk.
      
      Known built-in symbols can be optimized into direct calls, thus
      avoiding the retpoline thunk. So far, tc has not been leveraging this
      build information and leaving out a performance optimization for some
      CPUs. In this series we wire up 'tcf_classify()' and 'tcf_action_exec()'
      with direct calls when known modules are compiled as built-in as an
      opt-in optimization.
      
      We measured these changes in one AMD Zen 4 cpu (Retpoline), one AMD Zen 3 cpu (Retpoline),
      one Intel 10th Gen CPU (IBRS), one Intel 3rd Gen cpu (Retpoline) and one
      Intel Xeon CPU (IBRS) using pktgen with 64b udp packets. Our test setup is a
      dummy device with clsact and matchall in a kernel compiled with every
      tc module as built-in.  We observed a 3-8% speed up on the retpoline CPUs,
      when going through 1 tc filter, and a 60-100% speed up when going through 100 filters.
      For the IBRS CPUs we observed a 1-2% degradation in both scenarios; we believe
      the extra branch checks introduced a small overhead. We therefore added a
      static key that bypasses the wrapper on kernels that are compiled with
      CONFIG_RETPOLINE but are not running with the retpoline mitigation enabled.
      
      1 filter:
      CPU        | before (pps) | after (pps) | diff
      R9 7950X   | 5914980      | 6380227     | +7.8%
      R9 5950X   | 4237838      | 4412241     | +4.1%
      R9 5950X   | 4265287      | 4413757     | +3.4%   [*]
      i5-3337U   | 1580565      | 1682406     | +6.4%
      i5-10210U  | 3006074      | 3006857     | +0.0%
      i5-10210U  | 3160245      | 3179945     | +0.6%   [*]
      Xeon 6230R | 3196906      | 3197059      | +0.0%
      Xeon 6230R | 3190392      | 3196153     | +0.01%  [*]
      
      100 filters:
      CPU        | before (pps) | after (pps) | diff
      R9 7950X   | 373598       | 820396      | +119.59%
      R9 5950X   | 313469       | 633303      | +102.03%
      R9 5950X   | 313797       | 633150      | +101.77% [*]
      i5-3337U   | 127454       | 211210      | +65.71%
      i5-10210U  | 389259       | 381765      | -1.9%
      i5-10210U  | 408812       | 412730      | +0.9%    [*]
      Xeon 6230R | 415420       | 406612      | -2.1%
      Xeon 6230R | 416705       | 405869      | -2.6%    [*]
      
      [*] In these tests we ran pktgen with clone set to 1000.
      
      On the 7950X system we also tested the impact of a filter's placement in the
      iteration order: first by compiling a kernel with the filter under test being
      the first one in the static iteration, then repeating the test with it being
      the last (of the 15 classifiers existing today). We saw a difference of
      +0.5-1% in pps between being first in the iteration and being last.
      Therefore we order the classifiers and actions according to relevance, per
      our current thinking.
      
      v5->v6:
      - Address Eric Dumazet suggestions
      
      v4->v5:
      - Rebase
      
      v3->v4:
      - Address Eric Dumazet suggestions
      
      v2->v3:
      - Address suggestions by Jakub, Paolo and Eric
      - Dropped RFC tag (I forgot to add it on v2)
      
      v1->v2:
      - Fix build errors found by the bots
      - Address Kuniyuki Iwashima suggestions
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b602d003
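      As a toy userspace illustration of the optimization described above (this
      is not the actual tc wrapper; all names below are invented for the
      sketch): when the function pointer matches a known built-in handler, the
      wrapper calls it directly so the compiler emits a direct call and no
      retpoline thunk is involved; otherwise it falls back to the indirect call.

        #include <stdio.h>

        /* Stand-in for a classifier that is known to be built into the kernel. */
        static int builtin_classify(int skb_len)
        {
            return skb_len > 100;
        }

        /* Stand-in for a classifier loaded as a module. */
        static int module_classify(int skb_len)
        {
            return skb_len > 1000;
        }

        /* Wrapper: direct call for the known built-in, indirect call otherwise. */
        static int classify(int (*fn)(int), int skb_len)
        {
            if (fn == builtin_classify)
                return builtin_classify(skb_len);  /* direct call */
            return fn(skb_len);                    /* indirect call */
        }

        int main(void)
        {
            printf("%d %d\n", classify(builtin_classify, 200),
                   classify(module_classify, 200));
            return 0;
        }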
    • net/sched: avoid indirect classify functions on retpoline kernels · 9f3101dc
      Committed by Pedro Tammela
      Expose the necessary tc classifier functions and wire up cls_api to use
      direct calls in retpoline kernels.
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f3101dc
    • net/sched: avoid indirect act functions on retpoline kernels · 871cf386
      Committed by Pedro Tammela
      Expose the necessary tc act functions and wire up act_api to use
      direct calls in retpoline kernels.
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      871cf386
    • net/sched: add retpoline wrapper for tc · 7f0e8102
      Committed by Pedro Tammela
      On kernels using retpoline as a spectrev2 mitigation,
      optimize actions and filters that are compiled as built-ins into a direct call.
      
      In subsequent patches we expose the classifier and action functions
      and wire up the wrapper into tc.
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7f0e8102
    • net/sched: move struct action_ops definition out of ifdef · 2a7d228f
      Committed by Pedro Tammela
      The type definition should be visible even in configurations not using
      CONFIG_NET_CLS_ACT.
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a7d228f
    • net: phy: remove redundant "depends on" lines · 0bdff115
      Committed by Randy Dunlap
      Delete a few lines of "depends on PHYLIB" since they are inside
      an "if PHYLIB / endif # PHYLIB" block, i.e., they are redundant
      and the other 50+ drivers there don't use "depends on PHYLIB"
      since it is not needed.
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Heiner Kallweit <hkallweit1@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20221207044257.30036-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0bdff115
    • net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP · b534dc46
      Committed by Willem de Bruijn
      Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
      write_seq sockets instead of snd_una.
      
      This should have been the behavior from the start. Because processes
      may now exist that rely on the established behavior, do not change
      behavior of the existing option, but add the right behavior with a new
      flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on
      stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID.
      
      Intuitively the contract is that the counter is zero after the
      setsockopt, so that the next write N results in a notification for
      the last byte N - 1.
      
      On idle sockets snd_una == write_seq and this holds for both. But on
      sockets with data in transmission, snd_una records the unacked offset
      in the stream. This depends on the ACK response from the peer. A
      process cannot learn this in a race free manner (ioctl SIOCOUTQ is one
      racy approach).
      
      write_seq records the offset at the last byte written by the process.
      This is a better starting point. It matches the intuitive contract in
      all circumstances, unaffected by external behavior.
      
      The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
      Move the field in struct sock to avoid growing the socket (for some
      common CONFIG variants). The UAPI interface so_timestamping.flags is
      already int, so 32 bits wide.
      Reported-by: Sotirios Delimanolis <sotodel@meta.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b534dc46
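      A minimal usage sketch (not taken from the patch) of opting in to the new
      flag on a TCP socket; the fallback defines below are assumptions for the
      case where the installed UAPI headers predate this change:

        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/net_tstamp.h>

        #ifndef SO_TIMESTAMPING
        #define SO_TIMESTAMPING 37                     /* value on most architectures */
        #endif
        #ifndef SOF_TIMESTAMPING_OPT_ID_TCP
        #define SOF_TIMESTAMPING_OPT_ID_TCP (1 << 16)  /* added by this patch */
        #endif

        int main(void)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            unsigned int flags = SOF_TIMESTAMPING_TX_ACK |
                                 SOF_TIMESTAMPING_SOFTWARE |
                                 SOF_TIMESTAMPING_OPT_TSONLY |
                                 SOF_TIMESTAMPING_OPT_ID |
                                 SOF_TIMESTAMPING_OPT_ID_TCP;

            if (fd < 0) {
                perror("socket");
                return 1;
            }
            /* Older kernels reject the unknown bit, so check the result. */
            if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)))
                perror("setsockopt(SO_TIMESTAMPING)");
            return 0;
        }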
    • Merge branch 'fix-possible-deadlock-during-wed-attach' · ecd6df3c
      Committed by Jakub Kicinski
      Lorenzo Bianconi says:
      
      ====================
      fix possible deadlock during WED attach
      
      Fix a possible deadlock in mtk_wed_attach if the mtk_wed_wo_init routine
      fails. Check that the wo pointer is properly allocated before running
      mtk_wed_wo_reset() and mtk_wed_wo_deinit().
      ====================
      
      Link: https://lore.kernel.org/r/cover.1670421354.git.lorenzo@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ecd6df3c
    • net: ethernet: mtk_wed: fix possible deadlock if mtk_wed_wo_init fails · 587585e1
      Committed by Lorenzo Bianconi
      Introduce __mtk_wed_detach() in order to avoid a deadlock in the
      mtk_wed_attach routine if mtk_wed_wo_init fails, since both
      mtk_wed_attach and mtk_wed_detach run holding the hw_lock mutex.
      
      Fixes: 4c5de09e ("net: ethernet: mtk_wed: add configure wed wo support")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      587585e1
    • net: ethernet: mtk_wed: fix some possible NULL pointer dereferences · c79e0af5
      Committed by Lorenzo Bianconi
      Fix a possible NULL pointer dereference in the mtk_wed_detach routine by
      checking that the wo pointer is properly allocated before running
      mtk_wed_wo_reset() and mtk_wed_wo_deinit().
      Even if it is just a theoretical issue at the moment, also check that the
      wo pointer is not NULL in mtk_wed_mcu_msg_update.
      Moreover, honor the mtk_wed_mcu_send_msg return value in mtk_wed_wo_reset().
      
      Fixes: 79968444 ("net: ethernet: mtk_wed: introduce wed wo support")
      Fixes: 4c5de09e ("net: ethernet: mtk_wed: add configure wed wo support")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c79e0af5
    • nfp: Fix spelling mistake "tha" -> "the" · 3df96774
      Committed by Colin Ian King
      There is a spelling mistake in a nn_dp_warn message. Fix it.
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20221207094312.2281493-1-colin.i.king@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3df96774
    • selftests: net: Fix O=dir builds · 17961a37
      Committed by Björn Töpel
      The BPF Makefile in net/bpf did incorrect path substitution for O=dir
      builds, e.g.
      
        make O=/tmp/kselftest headers
        make O=/tmp/kselftest -C tools/testing/selftests
      
      would fail to build the net/ selftests [1] with
      
        clang-16: error: no such file or directory: 'kselftest/net/bpf/nat6to4.c'
        clang-16: error: no input files
      
      Add a pattern prerequisite and an order-only-prerequisite (for
      creating the directory), to resolve the issue.
      
      [1] https://lore.kernel.org/all/202212060009.34CkQmCN-lkp@intel.com/
      Reported-by: kernel test robot <lkp@intel.com>
      Fixes: 837a3d66 ("selftests: net: Add cross-compilation support for BPF programs")
      Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
      Link: https://lore.kernel.org/r/20221206102838.272584-1-bjorn@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      17961a37
    • Merge branch 'mlxsw-add-spectrum-1-ip6gre-support' · ce87a957
      Committed by Jakub Kicinski
      Petr Machata says:
      
      ====================
      mlxsw: Add Spectrum-1 ip6gre support
      
      Ido Schimmel writes:
      
      Currently, mlxsw only supports ip6gre offload on Spectrum-2 and newer
      ASICs. Spectrum-1 can also offload ip6gre tunnels, but it needs double
      entry router interfaces (RIFs) for the RIFs representing these tunnels.
      In addition, the RIF index needs to be even. This is handled in
      patches #1-#3.
      
      The implementation can otherwise be shared between all Spectrum
      generations. This is handled in patches #4-#5.
      
      Patch #6 moves a mlxsw ip6gre selftest to a shared directory, as ip6gre
      is no longer only supported on Spectrum-2 and newer ASICs.
      
      This work is motivated by users that require multiple GRE tunnels that
      all share the same underlay VRF. Currently, mlxsw only supports
      decapsulation based on the underlay destination IP (i.e., not taking the
      GRE key into account), so users need to configure these tunnels with
      different source IPs, and IPv6 addresses are easier to spare than IPv4 ones.
      
      Tested using existing ip6gre forwarding selftests.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1670414573.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ce87a957
    • selftests: mlxsw: Move IPv6 decap_error test to shared directory · db401875
      Committed by Ido Schimmel
      Now that Spectrum-1 gained ip6gre support we can move the test out of
      the Spectrum-2 directory.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      db401875
    • mlxsw: spectrum_ipip: Add Spectrum-1 ip6gre support · 7ec53643
      Committed by Ido Schimmel
      As explained in the previous patch, the existing Spectrum-2 ip6gre
      implementation can be reused for Spectrum-1. Change the Spectrum-1
      ip6gre operations structure to use the common operations.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7ec53643
    • mlxsw: spectrum_ipip: Rename Spectrum-2 ip6gre operations · ab30e4d4
      Committed by Ido Schimmel
      There are two main differences between Spectrum-1 and newer ASICs in
      terms of IP-in-IP support:
      
      1. In Spectrum-1, RIFs representing ip6gre tunnels require two entries
         in the RIF table.
      
      2. In Spectrum-2 and newer ASICs, packets ingress the underlay (during
         encapsulation) and egress the underlay (during decapsulation) via a
         special generic loopback RIF.
      
      The first difference was handled in previous patches by adding the
      'double_rif_entry' field to the Spectrum-1 operations structure of
      ip6gre RIFs. The second difference is handled during RIF creation, by
      only creating a generic loopback RIF in Spectrum-2 and newer ASICs.
      
      Therefore, the ip6gre operations can be shared between Spectrum-1 and
      newer ASICs in a similar fashion to how the ipgre operations are shared.
      
      Rename the operations to not be Spectrum-2 specific and move them
      earlier in the file so that they could later be used for Spectrum-1.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ab30e4d4
    • mlxsw: spectrum_router: Add support for double entry RIFs · 5ca1b208
      Committed by Ido Schimmel
      In Spectrum-1, loopback router interfaces (RIFs) used for IP-in-IP
      encapsulation with an IPv6 underlay require two RIF entries and the RIF
      index must be even.
      
      Prepare for this change by extending the RIF parameters structure with a
      'double_entry' field that indicates if the RIF being created requires
      two RIF entries or not. Only set it for RIFs representing ip6gre tunnels
      in Spectrum-1.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5ca1b208
    • mlxsw: spectrum_router: Parametrize RIF allocation size · 1a2f65b4
      Committed by Ido Schimmel
      Currently, each router interface (RIF) consumes one entry in the RIFs
      table. This is going to change in subsequent patches where some RIFs
      will consume two table entries.
      
      Prepare for this change by parametrizing the RIF allocation size. For
      now, always pass '1'.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1a2f65b4
    • mlxsw: spectrum_router: Use gen_pool for RIF index allocation · 40ef76de
      Committed by Ido Schimmel
      Currently, each router interface (RIF) consumes one entry in the RIFs
      table and there are no alignment constraints. This is going to change in
      subsequent patches where some RIFs will consume two table entries and
      their indexes will need to be aligned to the allocation size (even).
      
      Prepare for this change by converting the RIF index allocation to use
      gen_pool with the 'gen_pool_first_fit_order_align' algorithm.
      
      No Kconfig changes necessary as mlxsw already selects
      'GENERIC_ALLOCATOR'.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      40ef76de
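      A toy allocator (not gen_pool itself; written only to mirror the
      behaviour described above): a double-entry RIF gets two consecutive slots
      starting at an even index, while single-entry RIFs keep taking one slot.

        #include <stdio.h>

        #define RIF_COUNT 16
        static unsigned char used[RIF_COUNT];

        /* Find 'size' consecutive free slots whose start is aligned to 'size'
         * (size is 1 or 2 here), as gen_pool_first_fit_order_align would. */
        static int rif_index_alloc(int size)
        {
            for (int i = 0; i + size <= RIF_COUNT; i += size) {
                int free_run = 1;

                for (int j = 0; j < size; j++)
                    free_run &= !used[i + j];
                if (!free_run)
                    continue;
                for (int j = 0; j < size; j++)
                    used[i + j] = 1;
                return i;
            }
            return -1;  /* RIF table exhausted */
        }

        int main(void)
        {
            printf("single-entry RIF at %d\n", rif_index_alloc(1)); /* 0 */
            printf("double-entry RIF at %d\n", rif_index_alloc(2)); /* 2, even */
            printf("single-entry RIF at %d\n", rif_index_alloc(1)); /* 1, backfills */
            return 0;
        }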
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 837e8ac8
      Committed by Jakub Kicinski
      No conflicts.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      837e8ac8
    • Merge tag 'net-6.1-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 010b6761
      Committed by Linus Torvalds
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bluetooth, can and netfilter.
      
        Current release - new code bugs:
      
         - bonding: ipv6: correct address used in Neighbour Advertisement
           parsing (src vs dst typo)
      
         - fec: properly scope IRQ coalesce setup during link up to supported
           chips only
      
        Previous releases - regressions:
      
         - Bluetooth fixes for fake CSR clones (knockoffs):
             - re-add ERR_DATA_REPORTING quirk
             - fix crash when device is replugged
      
         - Bluetooth:
             - silence a user-triggerable dmesg error message
             - L2CAP: fix u8 overflow, oob access
             - correct vendor codec definition
             - fix support for Read Local Supported Codecs V2
      
         - ti: am65-cpsw: fix RGMII configuration at SPEED_10
      
         - mana: fix race on per-CQ variable NAPI work_done
      
        Previous releases - always broken:
      
         - af_unix: diag: fetch user_ns from in_skb in unix_diag_get_exact(),
           avoid null-deref
      
         - af_can: fix NULL pointer dereference in can_rcv_filter
      
         - can: slcan: fix UAF with a freed work
      
         - can: can327: flush TX_work on ldisc .close()
      
         - macsec: add missing attribute validation for offload
      
         - ipv6: avoid use-after-free in ip6_fragment()
      
         - nft_set_pipapo: actually validate intervals in fields after the
           first one
      
         - mvneta: prevent oob access in mvneta_config_rss()
      
         - ipv4: fix incorrect route flushing when table ID 0 is used, or when
           source address is deleted
      
         - phy: mxl-gpy: add workaround for IRQ bug on GPY215B and GPY215C"
      
      * tag 'net-6.1-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
        net: dsa: sja1105: avoid out of bounds access in sja1105_init_l2_policing()
        s390/qeth: fix use-after-free in hsci
        macsec: add missing attribute validation for offload
        net: mvneta: Fix an out of bounds check
        net: thunderbolt: fix memory leak in tbnet_open()
        ipv6: avoid use-after-free in ip6_fragment()
        net: plip: don't call kfree_skb/dev_kfree_skb() under spin_lock_irq()
        net: phy: mxl-gpy: add MDINT workaround
        net: dsa: mv88e6xxx: accept phy-mode = "internal" for internal PHY ports
        xen/netback: don't call kfree_skb() under spin_lock_irqsave()
        dpaa2-switch: Fix memory leak in dpaa2_switch_acl_entry_add() and dpaa2_switch_acl_entry_remove()
        ethernet: aeroflex: fix potential skb leak in greth_init_rings()
        tipc: call tipc_lxc_xmit without holding node_read_lock
        can: esd_usb: Allow REC and TEC to return to zero
        can: can327: flush TX_work on ldisc .close()
        can: slcan: fix freed work crash
        can: af_can: fix NULL pointer dereference in can_rcv_filter
        net: dsa: sja1105: fix memory leak in sja1105_setup_devlink_regions()
        ipv4: Fix incorrect route flushing when table ID 0 is used
        ipv4: Fix incorrect route flushing when source address is deleted
        ...
      010b6761
    • Merge branch 'mlx4-better-big-tcp-support' · ff36c447
      Committed by Jakub Kicinski
      Eric Dumazet says:
      
      ====================
      mlx4: better BIG-TCP support
      
      mlx4 uses a bounce buffer in TX whenever the tx descriptors
      wrap around the right edge of the ring.
      
      Size of this bounce buffer was hard coded and can be
      increased if/when needed.
      ====================
      
      Link: https://lore.kernel.org/r/20221207141237.2575012-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ff36c447
    • net/mlx4: small optimization in mlx4_en_xmit() · 0e706f79
      Committed by Eric Dumazet
      The test against MLX4_MAX_DESC_TXBBS only matters if the TX
      bounce buffer is going to be used.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0e706f79
    • net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS · 26782aad
      Committed by Eric Dumazet
      Google production kernel has increased MAX_SKB_FRAGS to 45
      for BIG-TCP rollout.
      
      Unfortunately mlx4 TX bounce buffer is not big enough whenever
      an skb has up to 45 page fragments.
      
      This can happen often with TCP TX zero copy, as one frag usually
      holds 4096 bytes of payload (order-0 page).
      
      Tested:
       Kernel built with MAX_SKB_FRAGS=45
       ip link set dev eth0 gso_max_size 185000
       netperf -t TCP_SENDFILE
      
      I made sure that "ethtool -G eth0 tx 64" was properly working,
      ring->full_size being set to 15.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Wei Wang <weiwan@google.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      26782aad
    • net/mlx4: rename two constants · 35f31ff0
      Committed by Eric Dumazet
      MAX_DESC_SIZE is really the size of the bounce buffer used
      when reaching the right side of the TX ring buffer.
      
      MAX_DESC_TXBBS gets an MLX4_ prefix.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      35f31ff0
    • ice: reschedule ice_ptp_wait_for_offset_valid during reset · 95af1f1c
      Committed by Jacob Keller
      If the ice_ptp_wait_for_offset_valid function is scheduled to run while the
      driver is resetting, it will exit without completing calibration. The work
      function gets scheduled by ice_ptp_port_phy_restart which will be called as
      part of the reset recovery process.
      
      It is possible for the first execution to occur before the driver has
      completely cleared its resetting flags. Ensure calibration completes by
      rescheduling the task until reset is fully completed.
      Reported-by: Siddaraju DH <siddaraju.dh@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      95af1f1c
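      A rough userspace sketch of the rescheduling pattern described above (the
      names are hypothetical, not the ice driver's): if the work runs while the
      reset flag is still set, it asks to be run again instead of giving up on
      calibration.

        #include <stdbool.h>
        #include <stdio.h>

        static bool resetting = true;
        static bool calibrated;

        /* Returns true once calibration could actually be performed. */
        static bool wait_for_offset_valid(void)
        {
            if (resetting)
                return false;   /* caller should reschedule and retry */
            calibrated = true;  /* reset finished: complete calibration */
            return true;
        }

        int main(void)
        {
            for (int attempt = 0; attempt < 5 && !wait_for_offset_valid(); attempt++) {
                printf("reset in progress, rescheduling (attempt %d)\n", attempt);
                if (attempt == 2)
                    resetting = false;  /* reset eventually completes */
            }
            printf("calibrated = %d\n", calibrated);
            return 0;
        }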
    • ice: make Tx and Rx vernier offset calibration independent · f029a343
      Committed by Siddaraju DH
      The Tx and Rx calibration and timestamp generation blocks are independent.
      However, the ice driver waits until both blocks are ready before
      configuring either block.
      
      This can result in delay of configuring one block because we have not yet
      received a packet in the other block.
      
      There is no reason to wait to finish programming Tx just because we haven't
      received a packet. Similarly there is no reason to wait to program Rx just
      because we haven't transmitted a packet.
      
      Instead of checking both offset status before programming either block,
      refactor the ice_phy_cfg_tx_offset_e822 and ice_phy_cfg_rx_offset_e822
      functions so that they perform their own offset status checks.
      Additionally, make them also check the offset ready bit to determine if
      the offset values have already been programmed.
      
      Call the individual configure functions directly in
      ice_ptp_wait_for_offset_valid. The functions will now correctly check
      status, and program the offsets if ready. Once the offset is programmed,
      the functions will exit quickly after just checking the offset ready
      register.
      
      Remove the ice_phy_calc_vernier_e822 in ice_ptp_hw.c, as well as the offset
      valid check functions in ice_ptp.c entirely as they are no longer
      necessary.
      
      With this change, the Tx and Rx blocks will each be enabled as soon as
      possible without waiting for the other block to complete calibration. This
      can enable timestamps faster in setups which have a low rate of transmitted
      or received packets. In particular, it can stop a situation where one port
      never receives traffic, and thus never finishes calibration of the Tx
      block, resulting in continuous faults reported by the ptp4l daemon
      application.
      Signed-off-by: Siddaraju DH <siddaraju.dh@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      f029a343
    • ice: only check set bits in ice_ptp_flush_tx_tracker · e3ba5248
      Committed by Jacob Keller
      The ice_ptp_flush_tx_tracker function is called to clear all outstanding Tx
      timestamp requests when the port is being brought down. This function
      iterates over the entire list, but this is unnecessary. We only need to
      check the bits which are actually set in the ready bitmap.
      
      Replace this logic with for_each_set_bit, and follow a similar flow as in
      ice_ptp_tx_tstamp_cleanup. Note that it is safe to call dev_kfree_skb_any
      on a NULL pointer as it will perform a no-op so we do not need to verify
      that the skb is actually NULL.
      
      The new implementation also avoids clearing (and thus reading!) the PHY
      timestamp unless the index is marked as having a valid timestamp in the
      timestamp status bitmap. This ensures that we properly clear the status
      registers as appropriate.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      e3ba5248
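      A userspace analog of the for_each_set_bit flow described above (not the
      driver code): walk only the indexes whose bits are set in the tracker
      bitmap instead of scanning the whole table.

        #include <stdint.h>
        #include <stdio.h>

        static void flush_tx_tracker(uint64_t in_use)
        {
            while (in_use) {
                int idx = __builtin_ctzll(in_use);  /* lowest set bit */

                printf("flushing Tx timestamp index %d\n", idx);
                in_use &= in_use - 1;               /* clear that bit */
            }
        }

        int main(void)
        {
            flush_tx_tracker((1ULL << 3) | (1ULL << 17) | (1ULL << 42));
            return 0;
        }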
    • ice: handle flushing stale Tx timestamps in ice_ptp_tx_tstamp · d40fd600
      Committed by Jacob Keller
      In the event of a PTP clock time change due to .adjtime or .settime, the
      ice driver needs to update the cached copy of the PHC time and also discard
      any outstanding Tx timestamps.
      
      This is required because otherwise the wrong copy of the PHC time will be
      used when extending the Tx timestamp. This could result in reporting
      incorrect timestamps to the stack.
      
      The current approach taken to handle this is to call
      ice_ptp_flush_tx_tracker, which will discard any timestamps which are not
      yet complete.
      
      This is problematic for two reasons:
      
      1) it could lead to a potential race condition where the wrong timestamp is
         associated with a future packet.
      
         This can occur with the following flow:
      
         1. Thread A gets request to transmit a timestamped packet, and picks an
            index and transmits the packet
      
         2. Thread B calls ice_ptp_flush_tx_tracker and sees the index in use,
            marking it as discarded. No timestamp read occurs because the status
            bit is not set, but the index is released for re-use
      
         3. Thread A gets a new request to transmit another timestamped packet,
            picks the same (now unused) index and transmits that packet.
      
         4. The PHY transmits the first packet and updates the timestamp slot and
            generates an interrupt.
      
         5. The ice_ptp_tx_tstamp thread executes and sees the interrupt and a
            valid timestamp, but associates it with the new Tx SKB rather than
            the packet the timestamp actually belongs to.
      
         This could result in the previous timestamp being assigned to a new
         packet producing incorrect timestamps and leading to incorrect behavior
         in PTP applications.
      
         This is most likely to occur when the packet rate for Tx timestamp
         requests is very high.
      
      2) on E822 hardware, we must avoid reading a timestamp index more than once
         each time its status bit is set and an interrupt is generated by
         hardware.
      
         We do have some extensive checks for the unread flag to ensure that only
         one of either the ice_ptp_flush_tx_tracker or ice_ptp_tx_tstamp threads
         read the timestamp. However, even with this we can still have cases
         where we "flush" a timestamp that was actually completed in hardware.
         This can lead to cases where we don't read the timestamp index as
         appropriate.
      
      To fix both of these issues, we must avoid calling ice_ptp_flush_tx_tracker
      outside of the teardown path.
      
      Rather than using ice_ptp_flush_tx_tracker, introduce a new state bitmap,
      the stale bitmap. Start this as cleared when we begin a new timestamp
      request. When we're about to extend a timestamp and send it up to the
      stack, first check to see if that stale bit was set. If so, drop the
      timestamp without sending it to the stack.
      
      When we need to update the cached PHC timestamp out of band, just mark all
      currently outstanding timestamps as stale. This will ensure that once
      hardware completes the timestamp we'll ignore it correctly and avoid
      reporting bogus timestamps to userspace.
      
      With this change, we fix potential issues caused by calling
      ice_ptp_flush_tx_tracker during normal operation.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      d40fd600
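      A minimal sketch of the stale-bitmap idea (illustrative only, not the ice
      implementation): a request sets an in-use bit, a PHC time change marks
      everything outstanding as stale, and completions for stale slots are
      dropped instead of being reported to the stack.

        #include <stdint.h>
        #include <stdio.h>

        static uint64_t in_use, stale;

        static void request_tstamp(int idx)
        {
            in_use |= 1ULL << idx;
            stale &= ~(1ULL << idx);   /* a new request starts out valid */
        }

        static void phc_time_changed(void)
        {
            stale |= in_use;           /* all outstanding requests are now suspect */
        }

        static void tstamp_complete(int idx)
        {
            if (stale & (1ULL << idx))
                printf("idx %d: stale, dropped\n", idx);
            else
                printf("idx %d: reported to the stack\n", idx);
            in_use &= ~(1ULL << idx);
            stale &= ~(1ULL << idx);
        }

        int main(void)
        {
            request_tstamp(5);
            phc_time_changed();   /* e.g. a .settime call on the PHC */
            request_tstamp(2);    /* issued after the change: still valid */
            tstamp_complete(5);   /* dropped */
            tstamp_complete(2);   /* reported */
            return 0;
        }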
    • ice: cleanup allocations in ice_ptp_alloc_tx_tracker · c1f3414d
      Committed by Jacob Keller
      The ice_ptp_alloc_tx_tracker function must allocate the timestamp array and
      the bitmap for tracking the currently in use indexes. A future change is
      going to add yet another allocation to this function.
      
      If these allocations fail we need to ensure that we properly cleanup and
      ensure that the pointers in the ice_ptp_tx structure are NULL.
      
      Simplify this logic by allocating to local variables first. If any
      allocation fails, then free everything and exit. Only update the ice_ptp_tx
      structure if all allocations succeed.
      
      This ensures that we have no side effects on the Tx structure unless all
      allocations have succeeded. Thus, no code will see an invalid pointer and
      we don't need to re-assign NULL on cleanup.
      
      This is safe because kernel "free" functions are designed to be NULL safe
      and perform no action if passed a NULL pointer. Thus it is safe to simply
      always call kfree or bitmap_free even if one of those pointers was NULL.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      c1f3414d
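      A small sketch of the allocation pattern described above, with
      hypothetical names: allocate into locals, free everything and bail out on
      any failure (free(NULL) is a no-op), and only assign into the structure
      once every allocation has succeeded.

        #include <stdlib.h>

        struct tx_tracker {
            long *tstamps;
            unsigned long *in_use;
        };

        static int tracker_alloc(struct tx_tracker *tx, size_t len)
        {
            long *tstamps = calloc(len, sizeof(*tstamps));
            unsigned long *in_use = calloc((len + 63) / 64, sizeof(*in_use));

            if (!tstamps || !in_use) {
                free(tstamps);      /* safe even if NULL */
                free(in_use);
                return -1;          /* tx is left completely untouched */
            }

            tx->tstamps = tstamps;
            tx->in_use = in_use;
            return 0;
        }

        int main(void)
        {
            struct tx_tracker tx = { 0 };
            int err = tracker_alloc(&tx, 64);

            free(tx.tstamps);
            free(tx.in_use);
            return err ? 1 : 0;
        }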
    • ice: protect init and calibrating check in ice_ptp_request_ts · 3ad5c10b
      Committed by Jacob Keller
      When requesting a new timestamp, the ice_ptp_request_ts function does not
      hold the Tx tracker lock while checking init and calibrating. This means
      that we might issue a new timestamp request just after the Tx timestamp
      tracker starts being deinitialized. This could lead to incorrect access of
      the timestamp structures. Correct this by moving the init and calibrating
      checks under the lock, and updating the flows which modify these fields to
      use the lock.
      
      Note that we do not need to hold the lock while checking for tx->init in
      ice_ptp_tx_tstamp. This is because the teardown function will use
      synchronize_irq after clearing the flag to ensure that the threaded
      interrupt completes. Either a) the tx->init flag will be cleared before the
      ice_ptp_tx_tstamp function starts, thus it will exit immediately, or b) the
      threaded interrupt will be executing and the synchronize_irq will wait
      until the threaded interrupt has completed, at which point we know the init
      field has definitely been cleared and new interrupts will not execute the Tx
      timestamp thread function.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      3ad5c10b
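      An illustrative userspace sketch of the locking rule described above (not
      the driver code; names are invented): the init/calibrating flags are
      checked and an index reserved only while the tracker lock is held, so a
      teardown path that clears those flags under the same lock cannot race
      with a new timestamp request.

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct tracker {
            pthread_mutex_t lock;
            bool init;
            bool calibrating;
            int next_idx;
        };

        /* Returns a reserved index, or -1 if the tracker is not ready. */
        static int request_ts(struct tracker *t)
        {
            int idx = -1;

            pthread_mutex_lock(&t->lock);
            if (t->init && !t->calibrating)
                idx = t->next_idx++;
            pthread_mutex_unlock(&t->lock);
            return idx;
        }

        int main(void)
        {
            struct tracker t = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .init = true,
            };

            printf("reserved index %d\n", request_ts(&t));
            return 0;
        }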
    • Merge branch 'mlx5-Support-tc-police-jump-conform-exceed-attribute' · ddda6326
      Committed by Jakub Kicinski
      Saeed Mahameed says:
      
      ====================
      Support tc police jump conform-exceed attribute
      
      The tc police action conform-exceed option defines how to handle
      packets which exceed or conform to the configured bandwidth limit.
      One of the possible conform-exceed values is jump, which skips over
      a specified number of actions.
      This series adds support for conform-exceed jump action.
      
      The series adds platform support for branching actions by providing
      true/false flow attributes to the branching action.
      This is necessary for supporting police jump, as each branch may
      execute a different action list.
      
      The first five patches are preparation patches:
      - Patches 1 and 2 add support for actions with no destinations (e.g. drop)
      - Patch 3 refactors the code for subsequent function reuse
      - Patch 4 defines an abstract way for identifying terminating actions
      - Patch 5 updates the action list validation logic to account for branching actions
      
      The following three patches introduce an interface for abstracting branching
      actions:
      - Patch 6 introduces an abstract api for defining branching actions
      - Patch 7 generically instantiates the branching flow attributes using
        the abstract API
      
      Patch 8 adds the platform support for jump actions, by executing the following
      sequence:
        a. Store the jumping flow attr
        b. Identify the jump target action while iterating the actions list.
        c. Instantiate a new flow attribute after the jump target action.
           This is the flow attribute that the branching action should jump to.
        d. Set the target post action id on:
          d.1. The jumping attribute, thus realizing the jump functionality.
          d.2. The attribute preceding the target jump attr, if not terminating.
      
      The next patches apply the platform's branching attributes to the police
      action:
      - Patch 9 is a refactor patch
      - Patch 10 initializes the post meter table with the red/green flow attributes,
                 as were initialized by the platform
      - Patch 11 enables the offload of meter actions using jump conform-exceed
                 value.
      ====================
      
      Link: https://lore.kernel.org/all/20221203221337.29267-1-saeed@kernel.org/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ddda6326
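      A toy model of the conform-exceed "jump N" semantics described above (not
      mlx5 offload code; the action names are invented): an exceeding packet
      skips the next N actions in the list, while a conforming packet continues
      with the very next action, and both branches converge afterwards.

        #include <stdio.h>

        enum verdict { CONFORM, EXCEED };

        static void run_actions(const char *const *actions, int n_actions,
                                int police_pos, int jump, enum verdict v)
        {
            for (int i = 0; i < n_actions; i++) {
                printf("  exec: %s\n", actions[i]);
                if (i == police_pos && v == EXCEED)
                    i += jump;  /* skip the next 'jump' actions */
            }
        }

        int main(void)
        {
            const char *actions[] = {
                "police rate 1mbit conform-exceed jump 2",
                "set dscp (conform branch only)",
                "mirror to monitor port (conform branch only)",
                "forward to port 1 (both branches)",
            };

            puts("conforming packet:");
            run_actions(actions, 4, 0, 2, CONFORM);
            puts("exceeding packet:");
            run_actions(actions, 4, 0, 2, EXCEED);
            return 0;
        }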
    • net/mlx5e: TC, allow meter jump control action · 3603f266
      Committed by Oz Shlomo
      Separate the matchall police action validation from flower validation.
      Isolate the action validation logic in the police action parser.
      Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20221203221337.29267-12-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3603f266
    • net/mlx5e: TC, init post meter rules with branching attributes · 0d8c38d4
      Committed by Oz Shlomo
      Instantiate the post meter actions with the platform initialized branching
      action attributes.
      Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20221203221337.29267-11-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0d8c38d4
    • net/mlx5e: TC, rename post_meter actions · 3fcb94e3
      Committed by Oz Shlomo
      Currently post meter supports only the pipe/drop conform-exceed policy.
      This assumption is reflected in several variable names.
      Rename the following variables as a pre-step for using the generalized
      branching action platform.
      
      Rename fwd_green_rule/drop_red_rule to green_rule/red_rule respectively.
      Repurpose red_counter/green_counter to act_counter/drop_counter to allow
      police conform-exceed configurations that do not drop.
      Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20221203221337.29267-10-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3fcb94e3
    • net/mlx5e: TC, initialize branching action with target attr · c84fa1ab
      Committed by Oz Shlomo
      Identify the jump target action when iterating the action list.
      Initialize the jump target attr with the jumping attribute during the
      parsing phase. Initialize the jumping attr post action with the target
      during the offload phase.
      Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20221203221337.29267-9-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c84fa1ab
    • net/mlx5e: TC, initialize branch flow attributes · f86488cb
      Committed by Oz Shlomo
      Initialize flow attribute for drop, accept, pipe and jump branching actions.
      
      Instantiate a flow attribute instance according to the specified branch
      control action. Store the branching attributes on the branching action
      flow attribute during the parsing phase. Then, during the offload phase,
      allocate the relevant mod header objects to the branching actions.
      Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20221203221337.29267-8-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f86488cb