1. 25 October 2021: 3 commits
    • net: dsa: introduce locking for the address lists on CPU and DSA ports · 338a3a47
      Authored by Vladimir Oltean
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
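The refcounting described above (shared ports keep one entry per address, reference counted across users, with dp->addr_lists_lock serializing list traversal) can be sketched in plain C with a pthread mutex standing in for the per-port mutex. All structures and names below are simplified stand-ins, not the kernel's:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Mock of DSA's per-port address list: entries are refcounted and the
 * per-port lock protects every traversal, as the commit describes. */
struct fdb_entry {
    unsigned char addr[6];
    int refcount;
    struct fdb_entry *next;
};

struct port {
    struct fdb_entry *fdbs;          /* like dp->fdbs */
    pthread_mutex_t addr_lists_lock; /* like dp->addr_lists_lock */
};

static int port_fdb_add(struct port *dp, const unsigned char *addr)
{
    pthread_mutex_lock(&dp->addr_lists_lock);
    for (struct fdb_entry *e = dp->fdbs; e; e = e->next) {
        if (!memcmp(e->addr, addr, 6)) {
            e->refcount++;           /* already present: just take a ref */
            pthread_mutex_unlock(&dp->addr_lists_lock);
            return 0;
        }
    }
    struct fdb_entry *e = malloc(sizeof(*e));
    memcpy(e->addr, addr, 6);
    e->refcount = 1;                 /* first ref: hardware would be programmed here */
    e->next = dp->fdbs;
    dp->fdbs = e;
    pthread_mutex_unlock(&dp->addr_lists_lock);
    return 0;
}

static int port_fdb_del(struct port *dp, const unsigned char *addr)
{
    pthread_mutex_lock(&dp->addr_lists_lock);
    for (struct fdb_entry **pe = &dp->fdbs; *pe; pe = &(*pe)->next) {
        if (!memcmp((*pe)->addr, addr, 6)) {
            if (--(*pe)->refcount == 0) {
                struct fdb_entry *dead = *pe;
                *pe = dead->next;    /* last ref: the list_del equivalent */
                free(dead);
            }
            pthread_mutex_unlock(&dp->addr_lists_lock);
            return 0;
        }
    }
    pthread_mutex_unlock(&dp->addr_lists_lock);
    return -2; /* -ENOENT */
}

static int port_fdb_count(struct port *dp)
{
    int n = 0;
    pthread_mutex_lock(&dp->addr_lists_lock);
    for (struct fdb_entry *e = dp->fdbs; e; e = e->next)
        n++;
    pthread_mutex_unlock(&dp->addr_lists_lock);
    return n;
}

/* Exercise add/add/del/del on one port; returns 0 on expected behaviour. */
static int fdb_refcount_demo(void)
{
    struct port dp = { .fdbs = NULL };
    const unsigned char mac[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };

    pthread_mutex_init(&dp.addr_lists_lock, NULL);
    port_fdb_add(&dp, mac);
    port_fdb_add(&dp, mac);                  /* second user: refcount only */
    if (port_fdb_count(&dp) != 1) return 1;
    port_fdb_del(&dp, mac);
    if (port_fdb_count(&dp) != 1) return 2;  /* still referenced */
    port_fdb_del(&dp, mac);
    if (port_fdb_count(&dp) != 0) return 3;  /* last ref: removed */
    return 0;
}
```

The point of the sketch: with add and del both taking the same per-port lock, a concurrent del can never unlink a node out from under a traversal, which is exactly the race the commit closes.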
    • net: mscc: ocelot: serialize access to the MAC table · 2468346c
      Authored by Vladimir Oltean
      DSA would like to remove the rtnl_lock from its
      SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handlers, and the felix driver uses
      the same MAC table functions as ocelot.
      
      This means that the MAC table functions will no longer be implicitly
      serialized with respect to each other by the rtnl_mutex, so we need to
      add a dedicated lock in ocelot for the non-atomic operations of
      selecting a MAC table row, reading/writing what we want and polling
      for completion.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "Merge branch 'dsa-rtnl'" · 2d7e73f0
      Authored by David S. Miller
      This reverts commit 965e6b26, reversing
      changes made to 4d98bb0d.
  2. 24 October 2021: 10 commits
    • can: dev: add can_tdc_get_relative_tdco() helper function · fa759a93
      Authored by Vincent Mailhol
      struct can_tdc::tdco represents the absolute offset from TDCV. Some
      controllers use instead an offset relative to the Sample Point (SP)
      such that:
      | SSP = TDCV + absolute TDCO
      |     = TDCV + SP + relative TDCO
      
      Consequently:
      | relative TDCO = absolute TDCO - SP
      
      The function can_tdc_get_relative_tdco() allows retrieving this
      relative TDCO value.
      
      Link: https://lore.kernel.org/all/20210918095637.20108-7-mailhol.vincent@wanadoo.fr
      CC: Stefan Mätje <Stefan.Maetje@esd.eu>
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
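The two formulas in this message reduce to one signed subtraction; a minimal sketch (the function name is illustrative, units as in the message):

```c
/* From SSP = TDCV + absolute TDCO = TDCV + SP + relative TDCO,
 * it follows that: relative TDCO = absolute TDCO - SP.
 * Signed result, since the relative offset can be negative. */
static int relative_tdco(unsigned int tdco_abs, unsigned int sample_point)
{
    return (int)tdco_abs - (int)sample_point;
}
```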
    • can: netlink: add can_priv::do_get_auto_tdcv() to retrieve tdcv from device · e8060f08
      Authored by Vincent Mailhol
      Some CAN devices can measure the TDCV (Transmission Delay Compensation
      Value) automatically for each transmitted CAN frame.
      
      A callback function do_get_auto_tdcv() is added to retrieve that
      value. This function is used only if CAN_CTRLMODE_TDC_AUTO is enabled
      (if CAN_CTRLMODE_TDC_MANUAL is selected, the TDCV value is provided by
      the user).
      
      If the device does not support reporting of TDCV, do_get_auto_tdcv()
      should be set to NULL and TDCV will not be reported by the netlink
      interface.
      
      On success, do_get_auto_tdcv() shall return 0. If the value cannot be
      measured by the device, for example because the network is down or
      because no frames have been transmitted yet,
      can_priv::do_get_auto_tdcv() shall return a negative error code
      (e.g. -EINVAL) to signify that the value is not yet available. In such
      cases, TDCV is not reported by the netlink interface.
      
      Link: https://lore.kernel.org/all/20210918095637.20108-6-mailhol.vincent@wanadoo.fr
      CC: Stefan Mätje <stefan.maetje@esd.eu>
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
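The callback contract described above (a NULL pointer means "cannot report", a negative return means "not measured yet") can be modeled in userspace C. The struct and function names below are hypothetical stand-ins for can_priv and the netlink dump path:

```c
#include <stddef.h>

/* Mock of the optional device callback and the netlink side that
 * consumes it; names are illustrative, not the kernel's. */
struct fake_priv {
    int link_up;
    unsigned int measured_tdcv;
    int (*do_get_auto_tdcv)(struct fake_priv *priv, unsigned int *tdcv);
};

static int fake_get_auto_tdcv(struct fake_priv *priv, unsigned int *tdcv)
{
    if (!priv->link_up)
        return -22;              /* -EINVAL: no value available yet */
    *tdcv = priv->measured_tdcv;
    return 0;
}

/* Mimics the netlink dump: report TDCV only when the callback exists
 * and succeeds. Returns 1 if reported, 0 if skipped. */
static int report_tdcv(struct fake_priv *priv, unsigned int *out)
{
    if (!priv->do_get_auto_tdcv)
        return 0;                /* device cannot report TDCV at all */
    if (priv->do_get_auto_tdcv(priv, out) < 0)
        return 0;                /* not measured yet: skip attribute */
    return 1;
}

static int tdcv_demo(void)
{
    unsigned int v = 0;
    struct fake_priv p = { .link_up = 0, .measured_tdcv = 13,
                           .do_get_auto_tdcv = fake_get_auto_tdcv };

    if (report_tdcv(&p, &v) != 0) return 1;  /* link down: skipped */
    p.link_up = 1;
    if (report_tdcv(&p, &v) != 1 || v != 13) return 2;
    p.do_get_auto_tdcv = NULL;
    if (report_tdcv(&p, &v) != 0) return 3;  /* no callback: skipped */
    return 0;
}
```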
    • can: netlink: add interface for CAN-FD Transmitter Delay Compensation (TDC) · d99755f7
      Authored by Vincent Mailhol
      Add the netlink interface for TDC parameters of struct can_tdc_const
      and can_tdc.
      
      Contrary to the can_bittiming(_const) structures, for which there is
      just a single IFLA_CAN(_DATA)_BITTIMING(_CONST) entry per structure,
      here we create a nested entry, IFLA_CAN_TDC. Within this nested entry,
      additional IFLA_CAN_TDC_TDC* entries are added for each of the TDC
      parameters of the newly introduced struct can_tdc_const and struct
      can_tdc.
      
      For struct can_tdc_const, these are:
              IFLA_CAN_TDC_TDCV_MIN
              IFLA_CAN_TDC_TDCV_MAX
              IFLA_CAN_TDC_TDCO_MIN
              IFLA_CAN_TDC_TDCO_MAX
              IFLA_CAN_TDC_TDCF_MIN
              IFLA_CAN_TDC_TDCF_MAX
      
      For struct can_tdc, these are:
              IFLA_CAN_TDC_TDCV
              IFLA_CAN_TDC_TDCO
              IFLA_CAN_TDC_TDCF
      
      This is done so that changes can be applied in the future to the
      structures without breaking the netlink interface.
      
      The TDC netlink logic works as follows:
      
       * CAN_CTRLMODE_FD is not provided:
          - if any TDC parameters are provided: error.
      
          - TDC parameters not provided: TDC parameters unchanged.
      
       * CAN_CTRLMODE_FD is provided and is false:
           - TDC is deactivated: both the structure and the
             CAN_CTRLMODE_TDC_{AUTO,MANUAL} flags are flushed.
      
       * CAN_CTRLMODE_FD provided and is true:
          - CAN_CTRLMODE_TDC_{AUTO,MANUAL} and tdc{v,o,f} not provided: call
            can_calc_tdco() to automatically decide whether TDC should be
            activated and, if so, set CAN_CTRLMODE_TDC_AUTO and use the
            calculated tdco value.
      
          - CAN_CTRLMODE_TDC_AUTO and tdco provided: set
            CAN_CTRLMODE_TDC_AUTO and use the provided tdco value. Here,
            tdcv is illegal and tdcf is optional.
      
          - CAN_CTRLMODE_TDC_MANUAL and both of tdcv and tdco provided: set
            CAN_CTRLMODE_TDC_MANUAL and use the provided tdcv and tdco
            value. Here, tdcf is optional.
      
          - CAN_CTRLMODE_TDC_{AUTO,MANUAL} are mutually exclusive. Whenever
            one flag is turned on, the other will automatically be turned
            off. Providing both returns an error.
      
          - Combinations other than the ones listed above are illegal and
            will return an error.
      
      N.B. above rules mean that whenever CAN_CTRLMODE_FD is provided, the
      previous TDC values will be overwritten. The only option to reuse
      previous TDC value is to not provide CAN_CTRLMODE_FD.
      
      All the new parameters are defined as u32. This arbitrary choice is
      made to mimic the other bittiming values, which are also all of type
      u32. A u16 would have been sufficient to hold the TDC values.
      
      This patch completes below series (c.f. [1]):
        - commit 289ea9e4 ("can: add new CAN FD bittiming parameters:
          Transmitter Delay Compensation (TDC)")
        - commit c25cc799 ("can: bittiming: add calculation for CAN FD
          Transmitter Delay Compensation (TDC)")
      
      [1] https://lore.kernel.org/linux-can/20210224002008.4158-1-mailhol.vincent@wanadoo.fr/T/#t
      
      Link: https://lore.kernel.org/all/20210918095637.20108-5-mailhol.vincent@wanadoo.fr
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
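The rule table above can be captured as a pure decision function. This is a model of the described netlink logic under the stated rules, not the kernel implementation; inputs say which attributes/flags were provided in the request:

```c
/* Outcomes of a netlink TDC configuration request, per the rules above. */
enum tdc_outcome { TDC_ERR, TDC_UNCHANGED, TDC_FLUSHED,
                   TDC_CALC, TDC_AUTO, TDC_MANUAL };

static enum tdc_outcome tdc_decide(int fd_provided, int fd_on,
                                   int auto_flag, int manual_flag,
                                   int tdcv, int tdco)
{
    int any_tdc = auto_flag || manual_flag || tdcv || tdco;

    if (!fd_provided)
        return any_tdc ? TDC_ERR : TDC_UNCHANGED;
    if (!fd_on)
        return TDC_FLUSHED;            /* FD off: flags and values flushed */
    if (auto_flag && manual_flag)
        return TDC_ERR;                /* mutually exclusive */
    if (!auto_flag && !manual_flag && !tdcv && !tdco)
        return TDC_CALC;               /* fall back to can_calc_tdco() */
    if (auto_flag && !tdcv && tdco)
        return TDC_AUTO;               /* tdcv illegal, tdcf optional */
    if (manual_flag && tdcv && tdco)
        return TDC_MANUAL;             /* tdcf optional */
    return TDC_ERR;                    /* anything else is illegal */
}
```

Note how the model reproduces the N.B.: any request that provides CAN_CTRLMODE_FD with it on either recalculates or overwrites the TDC values; only omitting CAN_CTRLMODE_FD leaves them untouched.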
    • can: bittiming: change can_calc_tdco()'s prototype to not directly modify priv · da45a1e4
      Authored by Vincent Mailhol
      The function can_calc_tdco() directly retrieves can_priv from the
      net_device and directly modifies it.
      
      This is annoying for the upcoming patch. In
      drivers/net/can/dev/netlink.c:can_changelink(), the data bittiming
      parameters are written to a temporary structure and memcpyed to
      can_priv only after everything has succeeded. In the next patch, where
      we will introduce the netlink interface for TDC parameters, we will
      add a new TDC block which can potentially fail. For this reason, the
      data bittiming temporary structure has to be copied after that
      to-be-introduced TDC block. However, TDC also needs to access the data
      bittiming information.
      
      We change the prototype so that the data bittiming structure is passed
      to can_calc_tdco() as an argument instead of retrieving it from
      priv. This way can_calc_tdco() can access the data bittiming before it
      gets memcpyed to priv.
      
      Link: https://lore.kernel.org/all/20210918095637.20108-4-mailhol.vincent@wanadoo.fr
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: bittiming: change unit of TDC parameters to clock periods · 39f66c9e
      Authored by Vincent Mailhol
      In the current implementation, all Transmission Delay Compensation
      (TDC) parameters are expressed in time quantum. However, ISO 11898-1
      actually specifies that these should be expressed in *minimum* time
      quantum.
      
      Furthermore, the minimum time quantum is specified to be "one node
      clock period long" (c.f. paragraph 11.3.1.1 "Bit time"). For the sake
      of simplicity, we prefer to use the term "clock period" instead of
      "minimum time quantum" because we believe it is more broadly
      understood.
      
      This patch fixes that discrepancy by updating the documentation and
      the formula for TDCO calculation.
      
      N.B. In can_calc_tdco(), the sample point (in time quantum) was
      calculated using a division, thus introducing a risk of rounding and
      truncation errors. On top of changing the unit to clock period, we
      also modified the formula to use only additions.
      
      Link: https://lore.kernel.org/all/20210918095637.20108-3-mailhol.vincent@wanadoo.fr
      Suggested-by: Stefan Mätje <Stefan.Maetje@esd.eu>
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
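Assuming the usual CAN bit layout (a SYNC_SEG of one time quantum, then prop_seg and phase_seg1, with brp clock periods per time quantum), the additions-only sample point in clock periods mentioned in the N.B. would look like the sketch below. The exact kernel expression may differ; this only shows why no division (and hence no truncation) is needed:

```c
/* Sample point in clock periods: (SYNC_SEG + prop_seg + phase_seg1)
 * time quanta, each brp clock periods long. Multiplication by brp is
 * exact, unlike the old division-based time-quantum formula. */
static unsigned int sample_point_clk(unsigned int brp, unsigned int prop_seg,
                                     unsigned int phase_seg1)
{
    return (1 /* SYNC_SEG */ + prop_seg + phase_seg1) * brp;
}
```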
    • can: bittiming: allow TDC{V,O} to be zero and add can_tdc_const::tdc{v,o,f}_min · 63dfe070
      Authored by Vincent Mailhol
      ISO 11898-1 specifies in section 11.3.3 "Transmitter delay
      compensation" that "the configuration range for [the] SSP position
      shall be at least 0 to 63 minimum time quanta."
      
      Because SSP = TDCV + TDCO, it means that we should allow both TDCV and
      TDCO to hold zero value in order to honor SSP's minimum possible
      value.
      
      However, current implementation assigned special meaning to TDCV and
      TDCO's zero values:
        * TDCV = 0 -> TDCV is automatically measured by the transceiver.
        * TDCO = 0 -> TDC is off.
      
      In order to allow those values to really be zero and to maintain the
      current features, we introduce two new flags:
        * CAN_CTRLMODE_TDC_AUTO indicates that the controller supports
          automatic measurement of TDCV.
        * CAN_CTRLMODE_TDC_MANUAL indicates that the controller supports
          manual configuration of TDCV. N.B.: the current implementation
          failed to provide an option for the driver to indicate that only
          manual mode was supported.
      
      TDC is disabled if both CAN_CTRLMODE_TDC_AUTO and
      CAN_CTRLMODE_TDC_MANUAL flags are off, c.f. the helper function
      can_tdc_is_enabled() which is also introduced in this patch.
      
      Also, this patch adds three fields: tdcv_min, tdco_min and tdcf_min to
      struct can_tdc_const. While we are not convinced that those three
      fields could be anything else than zero, we can imagine that some
      controllers might specify a lower bound on these. Thus, those minimums
      are really added "just in case".
      
      Comments of struct can_tdc and can_tdc_const are updated accordingly.
      
      Finally, the changes are applied to the etas_es58x driver.
      
      Link: https://lore.kernel.org/all/20210918095637.20108-2-mailhol.vincent@wanadoo.fr
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
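A sketch of the helper's semantics: TDC is enabled whenever either flag is set, and disabled when both are off (the netlink layer keeps the two flags mutually exclusive). The flag values below are illustrative, not the uapi ones:

```c
/* Illustrative stand-ins for the two new ctrlmode flags. */
#define CTRLMODE_TDC_AUTO   (1U << 0)
#define CTRLMODE_TDC_MANUAL (1U << 1)

/* Models can_tdc_is_enabled(): TDC is off iff both flags are off,
 * which is what frees up TDCV = 0 and TDCO = 0 as real values. */
static int tdc_is_enabled(unsigned int ctrlmode)
{
    return !!(ctrlmode & (CTRLMODE_TDC_AUTO | CTRLMODE_TDC_MANUAL));
}
```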
    • net: dsa: introduce locking for the address lists on CPU and DSA ports · d3bd8924
      Authored by Vladimir Oltean
      Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
      no one is serializing access to the address lists that DSA keeps for the
      purpose of reference counting on shared ports (CPU and cascade ports).
      
      It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
      element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
      We need to avoid that.
      
      Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
      still runs under the rtnl_mutex. But it would be nice if it would not
      depend on that being the case. So let's introduce a mutex per port (the
      address lists are per port too) and share it between dp->mdbs and
      dp->fdbs.
      
      The place where we put the locking is interesting. It could be tempting
      to put a DSA-level lock which still serializes calls to
      .port_fdb_{add,del}, but it would still not avoid concurrency with other
      driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
      .port_fast_age). So it would add a very false sense of security (and
      adding a global switch-wide lock in DSA to resynchronize with the
      rtnl_lock is also counterproductive and hard).
      
      So the locking is intentionally done only where the dp->fdbs and dp->mdbs
      lists are traversed. That means, from a driver perspective, that
      .port_fdb_add will be called with the dp->addr_lists_lock mutex held on
      the CPU port, but not held on user ports. This is done so that driver
      writers are not encouraged to rely on any guarantee offered by
      dp->addr_lists_lock.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: serialize access to the MAC table · f2c4bdf6
      Authored by Vladimir Oltean
      DSA would like to remove the rtnl_lock from its
      SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handlers, and the felix driver uses
      the same MAC table functions as ocelot.
      
      This means that the MAC table functions will no longer be implicitly
      serialized with respect to each other by the rtnl_mutex, so we need to
      add a dedicated lock in ocelot for the non-atomic operations of
      selecting a MAC table row, reading/writing what we want and polling
      for completion.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: bcm7xxx: Add EPHY entry for 7712 · 218f23e8
      Authored by Florian Fainelli
      7712 is a 16nm process SoC with a 10/100 integrated Ethernet PHY;
      utilize the recently defined 16nm EPHY macro to configure that PHY.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdio: Add helper functions for accessing MDIO devices · 0ebecb26
      Authored by Sean Anderson
      This adds some helpers for accessing non-phy MDIO devices. They are
      analogous to phy_(read|write|modify), except that they take an mdio_device
      and not a phy_device.
      Signed-off-by: Sean Anderson <sean.anderson@seco.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
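The modify helper in such families is a classic read-modify-write. Below is a userspace mock of that pattern; the real helpers wrap mdiobus accessors on an mdio_device, and the mock register file here is invented for illustration:

```c
/* Tiny mock MDIO device: 32 16-bit registers. */
struct mock_mdiodev { unsigned short regs[32]; };

static int mdiodev_read(struct mock_mdiodev *mdiodev, unsigned int reg)
{
    return mdiodev->regs[reg];
}

static int mdiodev_write(struct mock_mdiodev *mdiodev, unsigned int reg,
                         unsigned short val)
{
    mdiodev->regs[reg] = val;
    return 0;
}

/* Read-modify-write: clear the bits in mask, set the bits in set,
 * and only write back if the value actually changed. */
static int mdiodev_modify(struct mock_mdiodev *mdiodev, unsigned int reg,
                          unsigned short mask, unsigned short set)
{
    int ret = mdiodev_read(mdiodev, reg);
    unsigned short val;

    if (ret < 0)
        return ret;
    val = ((unsigned short)ret & ~mask) | set;
    if (val != (unsigned short)ret)
        return mdiodev_write(mdiodev, reg, val);
    return 0;
}

static int mdio_demo(void)
{
    struct mock_mdiodev dev = { .regs = { 0 } };

    mdiodev_write(&dev, 0, 0x00f0);
    mdiodev_modify(&dev, 0, 0x00ff, 0x000a); /* clear low byte, set 0x0a */
    return mdiodev_read(&dev, 0);
}
```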
  3. 23 October 2021: 1 commit
  4. 21 October 2021: 14 commits
  5. 20 October 2021: 2 commits
  6. 19 October 2021: 10 commits
    • ethernet: add a helper for assigning port addresses · e80094a4
      Authored by Jakub Kicinski
      We have 5 drivers which offset base MAC addr by port id.
      Create a helper for them.
      
      This helper takes care of overflows, which some drivers
      did not do, please complain if that's going to break
      anything!
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Shannon Nelson <snelson@pensando.io>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
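The overflow handling mentioned above amounts to doing the addition in a wide integer so carries propagate across all six bytes, instead of bumping only the last byte. A hedged sketch with an illustrative function name (not the kernel helper's):

```c
/* Offset a base MAC address by a port id, with carry across all six
 * bytes: convert to a u64, add, convert back. */
static void mac_addr_gen(unsigned char dst[6], const unsigned char base[6],
                         unsigned int id)
{
    unsigned long long u = 0;
    int i;

    for (i = 0; i < 6; i++)
        u = (u << 8) | base[i];   /* big-endian bytes -> integer */
    u += id;                      /* carry propagates past the last byte */
    for (i = 5; i >= 0; i--) {
        dst[i] = u & 0xff;
        u >>= 8;
    }
}

static int mac_gen_demo(void)
{
    const unsigned char base[6] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0xff };
    unsigned char out[6];

    mac_addr_gen(out, base, 1);   /* 0xff + 1 must carry into byte 4 */
    return out[4] == 0x01 && out[5] == 0x00;
}
```

A naive `base[5] + id` would have produced 0x00 in the last byte and left byte 4 untouched, silently aliasing another port's address; that is the class of bug the helper removes.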
    • net: sch_tbf: Add a graft command · 6b3efbfa
      Authored by Petr Machata
      As another qdisc is linked to the TBF, the latter should issue an event to
      give drivers a chance to react to the grafting. In other qdiscs, this event
      is called GRAFT, so follow suit with TBF as well.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() · 79f9bc58
      Authored by Sean Christopherson
      Check for a NULL page->mapping before dereferencing the mapping in
      page_is_secretmem(), as the page's mapping can be nullified while gup()
      is running, e.g.  by reclaim or truncation.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000068
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G        W
        RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0
        Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be
        RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046
        RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900
        ...
        CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0
        Call Trace:
         get_user_pages_fast_only+0x13/0x20
         hva_to_pfn+0xa9/0x3e0
         try_async_pf+0xa1/0x270
         direct_page_fault+0x113/0xad0
         kvm_mmu_page_fault+0x69/0x680
         vmx_handle_exit+0xe1/0x5d0
         kvm_arch_vcpu_ioctl_run+0xd81/0x1c70
         kvm_vcpu_ioctl+0x267/0x670
         __x64_sys_ioctl+0x83/0xa0
         do_syscall_64+0x56/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Link: https://lkml.kernel.org/r/20211007231502.3552715-1-seanjc@google.com
      Fixes: 1507f512 ("mm: introduce memfd_secret system call to create "secret" memory areas")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Reported-by: Stephen <stephenackerman16@gmail.com>
      Tested-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
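The shape of the fix, modeled in userspace: read page->mapping once into a local, and treat NULL as "not secretmem" instead of dereferencing it. The structs below are mock stand-ins, and the local read stands in for the READ_ONCE()-style single load the kernel would use:

```c
#include <stddef.h>

/* Mock stand-ins for struct address_space and struct page. */
struct mock_mapping { int is_secretmem; };
struct mock_page    { struct mock_mapping *mapping; };

/* Load the mapping pointer once; reclaim/truncation can clear
 * page->mapping concurrently while gup() runs, so NULL is a valid
 * state and must not be dereferenced. */
static int page_is_secretmem(struct mock_page *page)
{
    struct mock_mapping *mapping = page->mapping;

    if (!mapping)
        return 0;                /* mapping vanished under us */
    return mapping->is_secretmem;
}

static int secretmem_demo(void)
{
    struct mock_mapping m = { .is_secretmem = 1 };
    struct mock_page p = { .mapping = &m };

    if (page_is_secretmem(&p) != 1) return 1;
    p.mapping = NULL;            /* simulate truncation racing with gup() */
    if (page_is_secretmem(&p) != 0) return 2;
    return 0;
}
```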
    • elfcore: correct reference to CONFIG_UML · b0e90128
      Authored by Lukas Bulwahn
      Commit 6e7b64b9 ("elfcore: fix building with clang") introduces
      special handling for two architectures, ia64 and User Mode Linux.
      However, the wrong name, CONFIG_UM, was used for the intended Kconfig
      symbol for User Mode Linux.
      
      Although the directory for User Mode Linux is ./arch/um, the Kconfig
      symbol for this architecture is called CONFIG_UML.
      
      Luckily, ./scripts/checkkconfigsymbols.py warns on non-existing configs:
      
        UM
        Referencing files: include/linux/elfcore.h
        Similar symbols: UML, NUMA
      
      Correct the name of the config to the intended one.
      
      [akpm@linux-foundation.org: fix um/x86_64, per Catalin]
        Link: https://lkml.kernel.org/r/20211006181119.2851441-1-catalin.marinas@arm.com
        Link: https://lkml.kernel.org/r/YV6pejGzLy5ppEpt@arm.com
      
      Link: https://lkml.kernel.org/r/20211006082209.417-1-lukas.bulwahn@gmail.com
      Fixes: 6e7b64b9 ("elfcore: fix building with clang")
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Barret Rhoden <brho@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: fix CPUHP state to update node demotion order · a6a0251c
      Authored by Huang Ying
      The node demotion order needs to be updated during CPU hotplug,
      because whether a NUMA node has CPUs may influence the demotion
      order.  The
      update function should be called during CPU online/offline after the
      node_states[N_CPU] has been updated.  That is done in
      CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
      CPU offline.  But in commit 884a6e5d ("mm/migrate: update node
      demotion order on hotplug events"), the function to update node demotion
      order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline.  This
      doesn't satisfy the order requirement.
      
      For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
      and P2, P3 in S1), the demotion order is
      
       - S0 -> NUMA_NO_NODE
       - S1 -> NUMA_NO_NODE
      
      After P2 and P3 are offlined, because S1 has no CPUs now, the demotion
      order should have been changed to
      
       - S0 -> S1
       - S1 -> NUMA_NO_NODE
      
      but it isn't changed, because the order updating callback for CPU
      hotplug doesn't see the new nodemask.  After that, if P1 is offlined,
      the demotion order is changed to the expected order as above.
      
      So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
      CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
      CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
      update function on them.
      
      Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: add CPU hotplug to demotion #ifdef · 76af6a05
      Authored by Dave Hansen
      Once upon a time, the node demotion updates were driven solely by memory
      hotplug events.  But now, there are handlers for both CPU and memory
      hotplug.
      
      However, the #ifdef around the code checks only memory hotplug.  A
      system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
      hotplug events.
      
      Update the #ifdef around the common code.  Add memory and CPU-specific
      #ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
      function warnings when their Kconfig option is off.
      
      [arnd@arndb.de: rework hotplug_memory_notifier() stub]
        Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net/mlx5: Introduce new uplink destination type · 58a606db
      Authored by Maor Gottlieb
      The uplink destination type should be used in rules to steer the
      packet to the uplink when the device is in steering based LAG mode.
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Add support to create match definer · e7e2519e
      Authored by Maor Gottlieb
      Introduce new APIs to create and destroy a flow matcher for a given
      format id.
      
      The flow match definer object is used for defining the fields and mask
      used for the hash calculation. The user should mask the desired
      fields, as done in the match criteria.
      
      This object is assigned to flow group of type hash. In this flow
      group type, packets lookup is done based on the hash result.
      
      This patch also adds the required bits to create such flow group.
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Introduce port selection namespace · 425a563a
      Authored by Maor Gottlieb
      Add a new port selection flow steering namespace. Flow steering rules
      in this namespace are used to determine the physical port for egress
      packets.
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • tracing: Have all levels of checks prevent recursion · ed65df63
      Authored by Steven Rostedt (VMware)
      While writing an email explaining the "bit = 0" logic for a discussion
      on making ftrace_test_recursion_trylock() disable preemption, I
      discovered a path that makes the "not do the logic if bit is zero"
      trick unsafe.
      
      The recursion logic is done in hot paths like the function tracer. Thus,
      any code executed causes noticeable overhead. Thus, tricks are done to try
      to limit the amount of code executed. This included the recursion testing
      logic.
      
      Having recursion testing is important, as there are many paths that can
      end up in an infinite recursion cycle when tracing every function in the
      kernel. Thus protection is needed to prevent that from happening.
      
      Because it is OK to recurse due to different running context levels (e.g.
      an interrupt preempts a trace, and then a trace occurs in the interrupt
      handler), a set of bits are used to know which context one is in (normal,
      softirq, irq and NMI). If a recursion occurs in the same level, it is
      prevented*.
      
      Then there are infrastructure levels of recursion as well. When more than
      one callback is attached to the same function to trace, it calls a loop
      function to iterate over all the callbacks. Both the callbacks and the
      loop function have recursion protection. The callbacks use the
      "ftrace_test_recursion_trylock()" which has a "function" set of context
      bits to test, and the loop function calls the internal
      trace_test_and_set_recursion() directly, with an "internal" set of bits.
      
      If an architecture does not implement all the features supported by ftrace
      then the callbacks are never called directly, and the loop function is
      called instead, which will implement the features of ftrace.
      
      Since both the loop function and the callbacks do recursion
      protection, it seemed unnecessary to do it in both locations. Thus, a
      trick was made
      to have the internal set of recursion bits at a more significant bit
      location than the function bits. Then, if any of the higher bits were set,
      the logic of the function bits could be skipped, as any new recursion
      would first have to go through the loop function.
      
      This is true for architectures that do not support all the ftrace
      features, because all functions being traced must first go through the
      loop function before going to the callbacks. But this is not true for
      architectures that support all the ftrace features. That's because the
      loop function could be called due to two callbacks attached to the same
      function, but then a recursion function inside the callback could be
      called that does not share any other callback, and it will be called
      directly.
      
      i.e.
      
       traced_function_1: [ more than one callback tracing it ]
         call loop_func
      
       loop_func:
         trace_recursion set internal bit
         call callback
      
       callback:
         trace_recursion [ skipped because internal bit is set, return 0 ]
         call traced_function_2
      
       traced_function_2: [ only traced by above callback ]
         call callback
      
       callback:
         trace_recursion [ skipped because internal bit is set, return 0 ]
         call traced_function_2
      
       [ wash, rinse, repeat, BOOM! out of shampoo! ]
      
      Thus, the "bit == 0 skip" trick is not safe, unless the loop function
      is called for all functions.
      
      Since we want to encourage architectures to implement all ftrace features,
      having them slow down due to this extra logic may encourage the
      maintainers to update to the latest ftrace features. And because this
      logic is only safe for them, remove it completely.
      
       [*] There is one layer of recursion that is allowed, and that is to allow
           for the transition between interrupt context (normal -> softirq ->
           irq -> NMI), because a trace may occur before the context update is
           visible to the trace recursion logic.
      
      Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
      Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Jisheng Zhang <jszhang@kernel.org>
      Cc: 王贇 <yun.wang@linux.alibaba.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: edc15caf ("tracing: Avoid unnecessary multiple recursion checks")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
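The per-context recursion bits can be illustrated with a toy model: one bit per context level (normal, softirq, irq, NMI); trylock refuses reentry at the same level but allows an interrupting context to proceed. The real code's bit layout, the internal-vs-function bit split discussed above, and the transition handling are all more involved than this sketch:

```c
/* One recursion bit per context level. */
enum ctx { CTX_NORMAL, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI };

/* Refuse if this context's bit is already set (same-level recursion);
 * otherwise claim it and return the bit for the matching unlock. */
static int recursion_trylock(unsigned long *bits, enum ctx c)
{
    unsigned long bit = 1UL << c;

    if (*bits & bit)
        return -1;           /* recursion at the same level: refuse */
    *bits |= bit;
    return (int)c;           /* caller passes this back to unlock */
}

static void recursion_unlock(unsigned long *bits, int bit)
{
    *bits &= ~(1UL << bit);
}

static int recursion_demo(void)
{
    unsigned long bits = 0;
    int b = recursion_trylock(&bits, CTX_NORMAL);

    if (b < 0) return 1;
    if (recursion_trylock(&bits, CTX_NORMAL) >= 0)
        return 2;            /* same level must be rejected */
    if (recursion_trylock(&bits, CTX_IRQ) < 0)
        return 3;            /* an interrupting context is allowed */
    recursion_unlock(&bits, b);
    if (recursion_trylock(&bits, CTX_NORMAL) < 0)
        return 4;            /* unlocked: allowed again */
    return 0;
}
```

The commit's point maps onto this model directly: skipping the trylock whenever some higher "internal" bit is set assumes every path re-enters through the loop function, and the traced_function_2 example shows a path where that assumption fails.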