1. 23 Jul, 2021 4 commits
    • A
      net: socket: simplify dev_ifconf handling · 876f0bf9
      Arnd Bergmann committed
      The dev_ifconf() calling conventions make compat handling
      more complicated than necessary, simplify this by moving
      the in_compat_syscall() check into the function.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      net: socket: remove register_gifconf · b0e99d03
      Arnd Bergmann committed
      Since dynamic registration of the gifconf() helper is only used for
      IPv4, and this cannot be in a loadable module, this can be simplified
      noticeably by turning it into a direct function call as a preparation
      for cleaning up the compat handling.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      ethtool: improve compat ioctl handling · dd98d289
      Arnd Bergmann committed
      The ethtool compat ioctl handling is hidden away in net/socket.c,
      which introduces a couple of minor oddities:
      
      - The implementation may end up diverging, as seen in the RXNFC
        extension in commit 84a1d9c4 ("net: ethtool: extend RXNFC
        API to support RSS spreading of filter matches") that does not work
        in compat mode.
      
      - Most architectures do not need the compat handling at all
        because u64 and compat_u64 have the same alignment.
      
      - On x86, the conversion is done for both x32 and i386 user space,
        but it's actually wrong to do it for x32 and cannot work there.
      
      - On 32-bit Arm, it never worked for compat oabi user space, since
        that needs to do the same conversion but does not.
      
      - It would be nice to get rid of both compat_alloc_user_space()
        and copy_in_user() throughout the kernel.
      
      None of these actually seems to be a serious problem that real
      users are likely to encounter, but fixing all of them actually
      leads to code that is both shorter and more readable.
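The alignment point in the list above can be illustrated with a minimal sketch in Python (this is not kernel code). It computes C-style struct layouts under two assumed alignments for a 64-bit field: 8 bytes, as on most architectures, and 4 bytes, as for compat_u64 in i386 user space. The struct and its field names (`cmd`, `data`) are hypothetical and are not taken from the ethtool headers.

```python
# Sizes of the two field types used in this sketch, in bytes.
SIZES = {"u32": 4, "u64": 8}

# Hypothetical struct: a 32-bit command followed by a 64-bit payload.
FIELDS = [("cmd", "u32"), ("data", "u64")]


def layout(fields, u64_align):
    """Return ({field: offset}, total_size) using C-like padding rules."""
    aligns = {"u32": 4, "u64": u64_align}
    offset = 0
    max_align = 1
    offsets = {}
    for name, ctype in fields:
        align = aligns[ctype]
        # Round the running offset up to the field's alignment.
        offset = (offset + align - 1) // align * align
        offsets[name] = offset
        offset += SIZES[ctype]
        max_align = max(max_align, align)
    # The struct size is padded to a multiple of its largest alignment.
    size = (offset + max_align - 1) // max_align * max_align
    return offsets, size


if __name__ == "__main__":
    print("u64 aligned to 8:", layout(FIELDS, 8))
    print("u64 aligned to 4:", layout(FIELDS, 4))
```

When both alignments are equal, the two layouts coincide and no conversion is needed, which is the case the commit message describes for most architectures.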
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      compat: make linux/compat.h available everywhere · 1a33b18b
      Arnd Bergmann committed
      Parts of linux/compat.h are under an #ifdef, but we end up
      using more of those over time, moving things around bit by
      bit.
      
      To get it over with once and for all, make all of this file
      unconditional now so it can be accessed everywhere. There
      are only a few types left that are in asm/compat.h but not
      yet in the asm-generic version, so add those in the process.
      
      This requires providing a few more types in asm-generic/compat.h
      that were not already there. The only tricky one is
      compat_sigset_t, which needs a little help on 32-bit architectures
      and for x86.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 22 Jul, 2021 2 commits
    • V
      net: bridge: move the switchdev object replay helpers to "push" mode · 4e51bf44
      Vladimir Oltean committed
      Starting with commit 4f2673b3 ("net: bridge: add helper to replay
      port and host-joined mdb entries"), DSA has introduced some bridge
      helpers that replay switchdev events (FDB/MDB/VLAN additions and
      deletions) that can be lost by the switchdev drivers in a variety of
      circumstances:
      
      - an IP multicast group was host-joined on the bridge itself before any
        switchdev port joined the bridge, leading to the host MDB entries
        missing in the hardware database.
      - during the bridge creation process, the MAC address of the bridge was
        added to the FDB as an entry pointing towards the bridge device
        itself, but with no switchdev ports being part of the bridge yet, this
        local FDB entry would remain unknown to the switchdev hardware
        database.
      - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,
        before any switchdev port joined that LAG, leading to the hardware
        database missing those entries.
      - a switchdev port left a LAG that is a bridge port, while the LAG
        remained part of the bridge, and all FDB/MDB/VLAN entries remained
        installed in the hardware database of the switchdev port.
      
      Also, since commit 0d2cfbd4 ("net: bridge: ignore switchdev events
      for LAG ports which didn't request replay"), DSA introduced a method,
      based on a const void *ctx, to ensure that two switchdev ports under the
      same LAG that is a bridge port do not see the same MDB/VLAN entry being
      replayed twice by the bridge, once for every bridge port that joins the
      LAG.
      
      With so many ordering corner cases being possible, it seems unreasonable
      to expect a switchdev driver writer to get it right on the first try.
      Therefore, now that DSA has experimented with the bridge replay helpers
      for a little bit, we can move the code to the bridge driver where it is
      more readily available to all switchdev drivers.
      
      To convert the switchdev object replay helpers from "pull mode" (where
      the driver asks for them) to a "push mode" (where the bridge offers them
      automatically), the biggest problem is that the bridge needs to be aware
      when a switchdev port joins and leaves, even when the switchdev is only
      indirectly a bridge port (for example when the bridge port is a LAG
      upper of the switchdev).
      
      Luckily, we already have a hook for that, in the form of the newly
      introduced switchdev_bridge_port_offload() and
      switchdev_bridge_port_unoffload() calls. These offer a natural place for
      hooking the object addition and deletion replays.
      
      Extend the above 2 functions with:
      - pointers to the switchdev atomic notifier (for FDB replays) and the
        blocking notifier (for MDB and VLAN replays).
      - the "const void *ctx" argument required for drivers to be able to
        disambiguate between which port is targeted, when multiple ports are
        lowers of the same LAG that is a bridge port. Most of the drivers pass
        NULL to this argument, except the ones that support LAG offload and have
        the proper context check already in place in the switchdev blocking
        notifier handler.
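The role of the "const void *ctx" argument can be sketched with a small Python model (not kernel code). It assumes that a replay reaches every port under the LAG and that each port's handler drops events whose ctx names a different port, while a ctx of None (standing in for NULL) means "not targeted at anyone in particular". The `Port` class, the `replay()` helper and the object strings are hypothetical.

```python
class Port:
    """Stand-in for a switchdev port with its own notifier handler."""

    def __init__(self, name):
        self.name = name
        self.installed = []

    def handle(self, obj, ctx):
        # Context check: ignore replays that target another port.
        if ctx is not None and ctx is not self:
            return False
        self.installed.append(obj)
        return True


def replay(objects, ports_under_lag, ctx):
    """Replay every object towards all ports that are lowers of the LAG."""
    for obj in objects:
        for port in ports_under_lag:
            port.handle(obj, ctx)


if __name__ == "__main__":
    swp3, swp4 = Port("swp3"), Port("swp4")
    objects = ["vlan 10", "mdb 239.1.1.1"]
    replay(objects, [swp3], ctx=swp3)        # swp3 joins the LAG first
    replay(objects, [swp3, swp4], ctx=swp4)  # swp4 joins later
    print(swp3.installed, swp4.installed)
```

In this model the second replay is seen by swp3 as well, but the context check keeps swp3 from installing the same entries twice.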
      
      Also unexport the replay helpers, since nobody except the bridge calls
      them directly now.
      
      Note that:
      (a) we abuse the terminology slightly, because FDB entries are not
          "switchdev objects", but we count them as objects nonetheless.
          With no direct way to prove it, I think they are not modeled as
          switchdev objects because those can only be installed by the bridge
          to the hardware (as opposed to FDB entries which can be propagated
          in the other direction too). This is merely an abuse of terms, FDB
          entries are replayed too, despite not being objects.
      (b) the bridge does not attempt to sync port attributes to newly joined
          ports, just the countable stuff (the objects). The reason for this
          is simple: no universal and symmetric way to sync and unsync them is
          known. For example, VLAN filtering: what to do on unsync, disable or
          leave it enabled? Similarly, STP state, ageing timer, etc etc. What
          a switchdev port does when it becomes standalone again is not really
          up to the bridge's competence, and the driver should deal with it.
          On the other hand, replaying deletions of switchdev objects can be
          seen as a matter of cleanup and therefore be treated by the bridge,
          hence this patch.
      
      We make the replay helpers opt-in for drivers, because they might not
      bring immediate benefits for them:
      
      - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),
        so br_vlan_replay() should not do anything for the new drivers on
        which we call it. The existing drivers where there was even a slight
        possibility for there to exist a VLAN on a bridge port before they
        join it are already guarded against this: mlxsw and prestera deny
        joining LAG interfaces that are members of a bridge.
      
      - br_fdb_replay() should now notify of local FDB entries, but I patched
        all drivers except DSA to ignore these new entries in commit
        2c4eca3e ("net: bridge: switchdev: include local flag in FDB
        notifications"). Driver authors can lift this restriction as they
        wish, and when they do, they can also opt into the FDB replay
        functionality.
      
      - br_mdb_replay() should fix a real issue which is described in commit
        4f2673b3 ("net: bridge: add helper to replay port and host-joined
        mdb entries"). However most drivers do not offload the
        SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw
        offload this switchdev object, and I don't completely understand the
        way in which they offload this switchdev object anyway. So I'll leave
        it up to these drivers' respective maintainers to opt into
        br_mdb_replay().
      
      So most of the drivers pass NULL notifier blocks for the replay helpers,
      except:
      - dpaa2-switch which was already acked/regression-tested with the
        helpers enabled (and there isn't much of a downside in having them)
      - ocelot which already had replay logic in "pull" mode
      - DSA which already had replay logic in "pull" mode
      
      An important observation is that the drivers which don't currently
      request bridge event replays don't even have the
      switchdev_bridge_port_{offload,unoffload} calls placed in proper places
      right now. This was done to avoid unnecessary rework for drivers which
      might never even add support for this. For driver writers who wish to
      add replay support, this can be used as a tentative placement guide:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: bridge: switchdev: let drivers inform which bridge ports are offloaded · 2f5dc00f
      Vladimir Oltean committed
      On reception of an skb, the bridge checks if it was marked as 'already
      forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
      is, it assigns the source hardware domain of that skb based on the
      hardware domain of the ingress port. Then during forwarding, it enforces
      that the egress port must have a different hardware domain than the
      ingress one (this is done in nbp_switchdev_allowed_egress).
      
      Non-switchdev drivers don't report any physical switch id (neither
      through devlink nor .ndo_get_port_parent_id), therefore the bridge
      assigns them a hardware domain of 0, and packets coming from them will
      always have skb->offload_fwd_mark = 0. So there aren't any restrictions.
      
      Problems appear due to the fact that DSA would like to perform software
      fallback for bonding and team interfaces that the physical switch cannot
      offload.
      
             +-- br0 ---+
            / /   |      \
           / /    |       \
          /  |    |      bond0
         /   |    |     /    \
       swp0 swp1 swp2 swp3 swp4
      
      There, it is desirable that the presence of swp3 and swp4 under a
      non-offloaded LAG does not preclude us from doing hardware bridging
      between swp0, swp1 and swp2. The bandwidth of the CPU is often high
      enough that software bridging between {swp0,swp1,swp2} and bond0 is not
      impractical.
      
      But this creates an impossible paradox given the current way in which
      port hardware domains are assigned. When the driver receives a packet
      from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
      something.
      
      - If we set it to 0, then the bridge will forward it towards swp1, swp2
        and bond0. But the switch has already forwarded it towards swp1 and
        swp2 (not to bond0, remember, that isn't offloaded, so as far as the
        switch is concerned, ports swp3 and swp4 are not looking up the FDB,
        and the entire bond0 is a destination that is strictly behind the
        CPU). But we don't want duplicated traffic towards swp1 and swp2, so
        it's not ok to set skb->offload_fwd_mark = 0.
      
      - If we set it to 1, then the bridge will not forward the skb towards
        the ports with the same switchdev mark, i.e. not to swp1, swp2 and
        bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
        have forwarded the skb there.
      
      So the real issue is that bond0 will be assigned the same hardware
      domain as {swp0,swp1,swp2}, because the function that assigns hardware
      domains to bridge ports, nbp_switchdev_add(), recurses through bond0's
      lower interfaces until it finds something that implements devlink (calls
      dev_get_port_parent_id with bool recurse = true). This is a problem
      because the fact that bond0 can be offloaded by swp3 and swp4 in our
      example is merely an assumption.
      
      A solution is to give the bridge explicit hints as to what hardware
      domain it should use for each port.
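The paradox and the proposed way out can be reproduced with a simplified Python model of the egress check described above (not the kernel implementation). It assumes that a hardware domain of 0 means "not offloaded" and that the source domain is only taken from the ingress port when the skb is marked as already forwarded. The function names and the domain tables are hypothetical.

```python
def allowed_egress(offload_fwd_mark, src_hwdom, egress_hwdom):
    """Software forwarding is allowed unless hardware already covered it."""
    return (not offload_fwd_mark) or src_hwdom != egress_hwdom


def software_egress(ingress, offload_fwd_mark, hwdom):
    """Ports the bridge forwards to in software for a flooded packet."""
    src_hwdom = hwdom[ingress] if offload_fwd_mark else 0
    return sorted(
        port
        for port, dom in hwdom.items()
        if port != ingress and allowed_egress(offload_fwd_mark, src_hwdom, dom)
    )


# Domains as derived by recursing through bond0's lowers: bond0 ends up
# in the same domain as the switch ports although it is not offloaded.
RECURSIVE = {"swp0": 1, "swp1": 1, "swp2": 1, "bond0": 1}

# Domains when the driver tells the bridge which ports it offloads.
EXPLICIT = {"swp0": 1, "swp1": 1, "swp2": 1, "bond0": 0}

if __name__ == "__main__":
    print(software_egress("swp0", 0, RECURSIVE))  # duplicates to swp1/swp2
    print(software_egress("swp0", 1, RECURSIVE))  # bond0 is missed
    print(software_egress("swp0", 1, EXPLICIT))   # only bond0, as desired
```

With the explicit table, a marked packet from swp0 is software-forwarded only towards bond0, which is the behaviour neither mark value can produce with the recursive assignment.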
      
      Currently, the bridging offload is very 'silent': a driver registers a
      netdevice notifier, which is put on the netns's notifier chain, and
      which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
      bridge, and the lower is an interface it knows about (one registered by
      this driver, normally). Then, from within that notifier, it does a bunch
      of stuff behind the bridge's back, without the bridge necessarily
      knowing that there's somebody offloading that port. It looks like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v
              call_netdevice_notifiers
                        |
                        v
             dsa_slave_netdevice_event
                        |
                        v
              oh, hey! it's for me!
                        |
                        v
                 .port_bridge_join
      
      What we do to solve the conundrum is to be less silent, and change the
      switchdev drivers to present themselves to the bridge. Something like this:
      
           ip link set swp0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  hardware domain for
                        v                        |  this port, and zero
             dsa_slave_netdevice_event           |  if I got nothing.
                        |                        |
                        v                        |
              oh, hey! it's for me!              |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                        |                        |
                        +------------------------+
                   switchdev_bridge_port_offload(swp0, swp0)
      
      Then stacked interfaces (like bond0 on top of swp3/swp4) would be
      treated differently in DSA, depending on whether we can or cannot
      offload them.
      
      The offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge: Aye! I'll use this
              call_netdevice_notifiers           ^  ppid as the
                        |                        |  switchdev mark for
                        v                        |        bond0.
             dsa_slave_netdevice_event           | Coincidentally (or not),
                        |                        | bond0 and swp0, swp1, swp2
                        v                        | all have the same switchdev
              hmm, it's not quite for me,        | mark now, since the ASIC
               but my driver has already         | is able to forward towards
                 called .port_lag_join           | all these ports in hw.
                for it, because I have           |
            a port with dp->lag_dev == bond0.    |
                        |                        |
                        v                        |
                 .port_bridge_join               |
                 for swp3 and swp4               |
                        |                        |
                        +------------------------+
                  switchdev_bridge_port_offload(bond0, swp3)
                  switchdev_bridge_port_offload(bond0, swp4)
      
      And the non-offload case:
      
          ip link set bond0 master br0
                        |
                        v
       br_add_if() calls netdev_master_upper_dev_link()
                        |
                        v                    bridge waiting:
              call_netdevice_notifiers           ^  huh, switchdev_bridge_port_offload
                        |                        |  wasn't called, okay, I'll use a
                        v                        |  hwdom of zero for this one.
             dsa_slave_netdevice_event           :  Then packets received on swp0 will
                        |                        :  not be software-forwarded towards
                        v                        :  swp1, but they will towards bond0.
               it's not for me, but
             bond0 is an upper of swp3
            and swp4, but their dp->lag_dev
             is NULL because they couldn't
                  offload it.
      
      Basically we can draw the conclusion that the lowers of a bridge port
      can come and go, so depending on the configuration of lowers for a
      bridge port, it can dynamically toggle between offloaded and unoffloaded.
      Therefore, we need an equivalent switchdev_bridge_port_unoffload too.
      
      This patch changes the way any switchdev driver interacts with the
      bridge. From now on, everybody needs to call switchdev_bridge_port_offload
      and switchdev_bridge_port_unoffload, otherwise the bridge will treat the
      port as non-offloaded and allow software flooding to other ports from
      the same ASIC.
      
      Note that these functions lay the ground for a more complex handshake
      between switchdev drivers and the bridge in the future.
      
      For drivers that will request a replay of the switchdev objects when
      they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we
      place the call to switchdev_bridge_port_unoffload() strategically inside
      the NETDEV_PRECHANGEUPPER notifier's code path, and not inside
      NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers
      need the netdev adjacency lists to be valid, and that is only true in
      NETDEV_PRECHANGEUPPER.
      
      Cc: Vadym Kochan <vkochan@marvell.com>
      Cc: Taras Chornyi <tchornyi@marvell.com>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression
      Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
      Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 21 Jul, 2021 3 commits
    • J
      ipv6: ioam: Support for IOAM injection with lwtunnels · 3edede08
      Justin Iurman committed
      Add support for the IOAM inline insertion (only for the host-to-host use case)
      which is per-route configured with lightweight tunnels. The target is iproute2
      and the patch is ready. It will be posted as soon as this patchset is merged.
      Here is an overview:
      
      $ ip -6 ro ad fc00::1/128 encap ioam6 trace type 0x800000 ns 1 size 12 dev eth0
      
      This example configures an IOAM Pre-allocated Trace option attached to the
      fc00::1/128 prefix. The IOAM namespace (ns) is 1, the size of the pre-allocated
      trace data block is 12 octets (size) and only the first IOAM data (bit 0:
      hop_limit + node id) is included in the trace (type) represented as a bitfield.
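The relation between "bit 0" and the value 0x800000 can be sketched in Python. The sketch assumes a 24-bit trace-type field in which bit 0 is the most significant bit, which is consistent with the example above; the helper names are hypothetical.

```python
TRACE_TYPE_WIDTH = 24  # assumed width of the trace-type bitfield


def trace_type_mask(bits):
    """Build the trace-type value from a list of bit positions (0 = MSB)."""
    mask = 0
    for bit in bits:
        if not 0 <= bit < TRACE_TYPE_WIDTH:
            raise ValueError(f"bit {bit} is outside the {TRACE_TYPE_WIDTH}-bit field")
        mask |= 1 << (TRACE_TYPE_WIDTH - 1 - bit)
    return mask


def trace_type_bits(mask):
    """Return the bit positions (0 = MSB) that are set in a trace-type value."""
    return [
        bit
        for bit in range(TRACE_TYPE_WIDTH)
        if mask & (1 << (TRACE_TYPE_WIDTH - 1 - bit))
    ]


if __name__ == "__main__":
    print(hex(trace_type_mask([0])))
    print(trace_type_bits(0x800000))
```

Under this assumption, requesting only the first IOAM data field corresponds to the `type 0x800000` argument in the command above.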
      
      The reason why the in-transit (IPv6-in-IPv6 encapsulation) use case is not
      implemented is explained on the patchset cover.
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      ipv6: ioam: IOAM Generic Netlink API · 8c6f6fa6
      Justin Iurman committed
      Add Generic Netlink commands to allow userspace to configure IOAM
      namespaces and schemas. The target is iproute2 and the patch is ready.
      It will be posted as soon as this patchset is merged. Here is an overview:
      
      $ ip ioam
      Usage:	ip ioam { COMMAND | help }
      	ip ioam namespace show
      	ip ioam namespace add ID [ data DATA32 ] [ wide DATA64 ]
      	ip ioam namespace del ID
      	ip ioam schema show
      	ip ioam schema add ID DATA
      	ip ioam schema del ID
      	ip ioam namespace set ID schema { ID | none }
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      ipv6: ioam: Data plane support for Pre-allocated Trace · 9ee11f0f
      Justin Iurman committed
      Implement support for processing the IOAM Pre-allocated Trace with IPv6,
      see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3].
      
      A new per-interface sysctl is introduced. The value is a boolean to accept (=1)
      or ignore (=0, by default) IPv6 IOAM options on ingress for an interface:
       - net.ipv6.conf.XXX.ioam6_enabled
      
      Two other sysctls are introduced to define IOAM IDs, represented by an integer.
      They are respectively per-namespace and per-interface:
       - net.ipv6.ioam6_id
       - net.ipv6.conf.XXX.ioam6_id
      
      The value of the first one represents the IOAM ID of the node itself (u32; max
      and default value = U32_MAX>>8, due to hop limit concatenation) while the other
      represents the IOAM ID of an interface (u16; max and default value = U16_MAX).
      
      Each "ioam6_id" sysctl has a "_wide" equivalent:
       - net.ipv6.ioam6_id_wide
       - net.ipv6.conf.XXX.ioam6_id_wide
      
      The value of the first one represents the wide IOAM ID of the node itself (u64;
      max and default value = U64_MAX>>8, due to hop limit concatenation) while the
      other represents the wide IOAM ID of an interface (u32; max and default value
      = U32_MAX).
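The limits U32_MAX>>8 and U64_MAX>>8 follow from sharing one word with the 8-bit hop limit. A minimal Python sketch, assuming the hop limit occupies the top 8 bits and the node ID the remaining bits (the function name and the example values are hypothetical):

```python
U32_MAX = 0xFFFFFFFF
U64_MAX = 0xFFFFFFFFFFFFFFFF


def pack_hop_limit_and_id(hop_limit, node_id, wide=False):
    """Concatenate an 8-bit hop limit with a node ID into one 32/64-bit word."""
    total_bits = 64 if wide else 32
    id_bits = total_bits - 8
    max_id = (1 << id_bits) - 1  # equals U32_MAX >> 8 or U64_MAX >> 8
    if not 0 <= hop_limit <= 0xFF:
        raise ValueError("hop limit must fit in 8 bits")
    if not 0 <= node_id <= max_id:
        raise ValueError(f"node id must not exceed {max_id:#x}")
    return (hop_limit << id_bits) | node_id


if __name__ == "__main__":
    print(hex(pack_hop_limit_and_id(64, U32_MAX >> 8)))
    print(hex(pack_hop_limit_and_id(64, 1, wide=True)))
```

An ID above the limit would spill into the bits reserved for the hop limit, which is why the sketch rejects it.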
      
      The use of short and wide equivalents is not exclusive, a deployment could
      choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format)
      could be an identifier for a physical interface, whereas
      net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a
      logical sub-interface. Documentation about new sysctls is provided at the end
      of this patchset.
      
      Two relativistic hash tables are used: one for IOAM namespaces, the other for
      IOAM schemas. A namespace can only have a single active schema and a schema
      can only be attached to a single namespace (1:1 relationship).
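One way to keep the 1:1 relationship is sketched below in Python, with plain dictionaries standing in for the two hash tables. The class and method names are hypothetical, and the choice to silently detach a previous partner when a schema is reassigned is an assumption of this sketch, not a statement about the kernel implementation.

```python
class IoamRegistry:
    """Toy model of namespaces and schemas with a 1:1 attachment."""

    def __init__(self):
        self.ns_schema = {}  # namespace id -> attached schema id or None
        self.schema_ns = {}  # schema id -> owning namespace id or None

    def add_namespace(self, ns_id):
        self.ns_schema.setdefault(ns_id, None)

    def add_schema(self, sc_id):
        self.schema_ns.setdefault(sc_id, None)

    def set_schema(self, ns_id, sc_id):
        """Attach schema sc_id to namespace ns_id; sc_id=None detaches."""
        if ns_id not in self.ns_schema:
            raise KeyError(f"unknown namespace {ns_id}")
        if sc_id is not None and sc_id not in self.schema_ns:
            raise KeyError(f"unknown schema {sc_id}")
        # Detach whatever schema the namespace currently has.
        old_sc = self.ns_schema[ns_id]
        if old_sc is not None:
            self.schema_ns[old_sc] = None
        self.ns_schema[ns_id] = None
        if sc_id is None:
            return
        # Detach the schema from its previous namespace, if any.
        old_ns = self.schema_ns[sc_id]
        if old_ns is not None:
            self.ns_schema[old_ns] = None
        self.ns_schema[ns_id] = sc_id
        self.schema_ns[sc_id] = ns_id


if __name__ == "__main__":
    reg = IoamRegistry()
    reg.add_namespace(1)
    reg.add_namespace(2)
    reg.add_schema(7)
    reg.set_schema(1, 7)
    reg.set_schema(2, 7)
    print(reg.ns_schema, reg.schema_ns)
```

Passing None as the schema mirrors the "none" keyword of the "ip ioam namespace set ID schema { ID | none }" command shown in the Generic Netlink commit above.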
      
        [1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options
        [2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data
        [3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 20 Jul, 2021 8 commits
    • X
      net: phy: add API to read 802.3-c45 IDs · 8b72b301
      Xu Liang committed
      Add API to read 802.3-c45 IDs so that C22/C45 mixed device can use
      C45 APIs without failing ID checks.
      Signed-off-by: Xu Liang <lxu@maxlinear.com>
      Acked-by: Hauke Mehrtens <hmehrtens@maxlinear.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: dsa: tag_8021q: add proper cross-chip notifier support · c64b9c05
      Vladimir Oltean committed
      The big problem which mandates cross-chip notifiers for tag_8021q is
      this:
      
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
      
      When the user runs:
      
      ip link add br0 type bridge
      ip link set sw0p0 master br0
      ip link set sw2p0 master br0
      
      It doesn't work.
      
      This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and
      "other_ds" are at most 1 hop away from each other, so it is sufficient
      to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice
      versa and presto, the cross-chip link works. When there is another
      switch in the middle, such as in this case switch 1 with its DSA links
      sw1p3 and sw1p4, somebody needs to tell it about these VLANs too.
      
      Which is exactly why the problem is quadratic: when a port joins a
      bridge, for each port in the tree that's already in that same bridge we
      notify a tag_8021q VLAN addition of that port's RX VLAN to the entire
      tree. It is a very complicated web of VLANs.
      
      It must be mentioned that currently we install tag_8021q VLANs on too
      many ports (DSA links - to be precise, on all of them). For example,
      when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the
      RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there
      isn't any port of switch 0 that is a member of br0 (at least yet).
      In theory we could notify only the switches which sit in between the
      port joining the bridge and the port reacting to that bridge_join event.
      But in practice that is impossible, because of the way 'link' properties
      are described in the device tree. The DSA bindings require DT writers to
      list out not only the real/physical DSA links, but in fact the entire
      routing table, like for example switch 0 above will have:
      
      	sw0p3: port@3 {
      		link = <&sw1p4 &sw2p4>;
      	};
      
      This was done because:
      
      /* TODO: ideally DSA ports would have a single dp->link_dp member,
       * and no dst->rtable nor this struct dsa_link would be needed,
       * but this would require some more complex tree walking,
       * so keep it stupid at the moment and list them all.
       */
      
      but it is a perfect example of a situation where too much information is
      actively detrimental, because we are now in the position where we
      cannot distinguish a real DSA link from one that is put there to avoid
      the 'complex tree walking'. And because DT is ABI, there is not much we
      can change.
      
      And because we do not know which DSA links are real and which ones
      aren't, we can't really know if DSA switch A is in the data path between
      switches B and C, in the general case.
      
      So this is why tag_8021q RX VLANs are added on all DSA links, and
      probably why it will never change.
      
      On the other hand, at least the number of additions/deletions is well
      balanced, and this means that once we implement reference counting at
      the cross-chip notifier level a la fdb/mdb, there is absolutely zero
      need for a struct dsa_8021q_crosschip_link, it's all self-managing.
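Why balanced additions and deletions make the bookkeeping self-managing can be shown with a small reference-counting sketch in Python. It is a generic illustration of the idea, not the DSA code: the class name, the port name and the VLAN ID 1024 are hypothetical.

```python
from collections import Counter


class RefcountedVlans:
    """Track how many users requested each (port, vid) pair."""

    def __init__(self):
        self.refs = Counter()

    def add(self, port, vid):
        """Return True only when the VLAN must really be installed."""
        self.refs[(port, vid)] += 1
        return self.refs[(port, vid)] == 1

    def delete(self, port, vid):
        """Return True only when the last user is gone and hardware is cleaned."""
        key = (port, vid)
        if self.refs[key] == 0:
            raise KeyError(f"{key} was never added")
        self.refs[key] -= 1
        if self.refs[key] == 0:
            del self.refs[key]
            return True
        return False


if __name__ == "__main__":
    vlans = RefcountedVlans()
    print(vlans.add("sw1p3", 1024))     # first request: install
    print(vlans.add("sw1p3", 1024))     # second request: only count
    print(vlans.delete("sw1p3", 1024))  # one user left: keep
    print(vlans.delete("sw1p3", 1024))  # last user gone: remove
```

As long as every addition is eventually matched by one deletion, the counter alone decides when an entry is installed or removed, and no separate record of which link requested it is needed.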
      
      In fact, with the tag_8021q notifiers emitted from the bridge join
      notifiers, it becomes so generic that sja1105 does not need to do
      anything anymore, we can just delete its implementation of the
      .crosschip_bridge_{join,leave} methods.
      
      Among other things we can simply delete is the home-grown implementation
      of sja1105_notify_crosschip_switches(). The reason why that is wrong is
      because it is not quadratic - it only covers remote switches to which we
      have a cross-chip bridging link and that does not cover in-between
      switches. This deletion is part of the same patch because sja1105 used
      to poke deep inside the guts of the tag_8021q context in order to do
      that. Because the cross-chip links went away, so must the sja1105 code.
      
      Last but not least, dsa_8021q_setup_port() is simplified (and also
      renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react
      on the CPU port too, the four dsa_8021q_vid_apply() calls:
      - 1 for RX VLAN on user port
      - 1 for the user port's RX VLAN on the CPU port
      - 1 for TX VLAN on user port
      - 1 for the user port's TX VLAN on the CPU port
      
      now get squashed into only 2 notifier calls via
      dsa_port_tag_8021q_vlan_add.
      
      And because the notifiers to add and to delete a tag_8021q VLAN are
      distinct, now we finally break up the port setup and teardown into
      separate functions instead of relying on a "bool enabled" flag which
      tells us what to do. Arguably it should have been this way from the
      get go.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register · 328621f6
      Vladimir Oltean committed
      Right now, setting up tag_8021q is a 2-step operation for a driver,
      first the context structure needs to be created, then the VLANs need to
      be installed on the ports. A similar thing is true for teardown.
      
      Merge the 2 steps into the register/unregister methods, to be as
      transparent as possible for the driver as to what tag_8021q does behind
      the scenes. This also gets rid of the funny "bool setup == true means
      setup, == false means teardown" API that tag_8021q used to expose.
      
      Note that dsa_tag_8021q_register() must be called at least in the
      .setup() driver method and never earlier (like in the driver probe
      function). This is because the DSA switch tree is not initialized at
      probe time, and the cross-chip notifiers will not work.
      
      For symmetry with .setup(), the unregister method should be put in
      .teardown().
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      328621f6
    • V
      net: dsa: make tag_8021q operations part of the core · 5da11eb4
      Vladimir Oltean committed
      Make tag_8021q a more central element of DSA and move the 2 driver
      specific operations outside of struct dsa_8021q_context (which is
      supposed to hold dynamic data and not really constant function
      pointers).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5da11eb4
    • V
      net: dsa: let the core manage the tag_8021q context · d7b1fd52
      Vladimir Oltean committed
      The basic problem description is as follows:
      
      Let there be 3 switches in a daisy chain topology:
      
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  user ] [  dsa  ]
      
      The CPU will not be able to ping through the user ports of the
      bottom-most switch (like for example sw2p0), simply because tag_8021q
      was not coded up for this scenario - it has always assumed DSA switch
      trees with a single switch.
      
      To add support for the topology above, we must admit that the RX VLAN of
      sw2p0 must be added on some ports of switches 0 and 1 as well. This is
      in fact a textbook example of a thing that can use the cross-chip notifier
      framework that DSA has set up in switch.c.
      
      There is only one problem: core DSA (switch.c) is not able right now to
      make the connection between a struct dsa_switch *ds and a struct
      dsa_8021q_context *ctx. Right now, it is drivers who call into
      tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer,
      and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del}
      methods.
      
      But with cross-chip notifiers, it is possible for tag_8021q to call
      drivers without drivers having ever asked for anything. A good example
      is right above: when sw2p0 wants to set itself up for tag_8021q,
      the .tag_8021q_vlan_add method needs to be called for switches 1 and 0,
      so that they transport sw2p0's VLANs towards the CPU without dropping
      them.
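
      As a rough illustration of the paragraph above, the following user
      space sketch models a notifier that visits every switch in the daisy
      chain and decides on which ports sw2p0's VLAN would be installed. The
      match rule (all DSA and CPU ports, plus the one user port that owns
      the VLAN) and the names vlan_match() and notify_vlan_add() are
      assumptions made for illustration; they are not taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum port_type { PORT_USER, PORT_DSA, PORT_CPU };

#define NUM_SWITCHES 3
#define NUM_PORTS 5

/* Port roles copied from the daisy chain diagram (sw0 at the top). */
static const enum port_type topology[NUM_SWITCHES][NUM_PORTS] = {
	{ PORT_USER, PORT_USER, PORT_USER, PORT_DSA,  PORT_CPU },
	{ PORT_USER, PORT_USER, PORT_USER, PORT_DSA,  PORT_DSA },
	{ PORT_USER, PORT_USER, PORT_USER, PORT_USER, PORT_DSA },
};

/* Assumed match rule: every DSA and CPU port carries the VLAN, while a
 * user port carries it only if it is the port the VLAN belongs to. */
static bool vlan_match(int sw, int port, int target_sw, int target_port)
{
	assert(sw >= 0 && sw < NUM_SWITCHES);
	assert(port >= 0 && port < NUM_PORTS);

	if (topology[sw][port] != PORT_USER)
		return true;

	return sw == target_sw && port == target_port;
}

/* Model of a cross-chip notifier: every switch in the tree is visited,
 * not only the one that owns the user port. Returns the number of ports
 * on which the VLAN ends up installed. */
static size_t notify_vlan_add(int target_sw, int target_port,
			      bool installed[NUM_SWITCHES][NUM_PORTS])
{
	size_t count = 0;

	for (int sw = 0; sw < NUM_SWITCHES; sw++) {
		for (int port = 0; port < NUM_PORTS; port++) {
			installed[sw][port] = vlan_match(sw, port, target_sw,
							 target_port);
			if (installed[sw][port])
				count++;
		}
	}

	return count;
}
```

      With this model, a request coming from sw2p0 also touches the DSA and
      CPU ports of switches 1 and 0, even though their drivers never asked
      for anything.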
      
      So instead of letting drivers manage the tag_8021q context, add a
      tag_8021q_ctx pointer inside of struct dsa_switch, which will be
      populated when dsa_tag_8021q_register() returns success.
      
      The patch is fairly long-winded because we are partly reverting commit
      5899ee36 ("net: dsa: tag_8021q: add a context structure") which made
      the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we
      can access "ctx" directly from "ds", this is no longer needed.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7b1fd52
    • V
      net: dsa: tag_8021q: create dsa_tag_8021q_{register,unregister} helpers · cedf4670
      Vladimir Oltean committed
      In preparation for moving tag_8021q to core DSA, move all initialization
      and teardown related to tag_8021q which is currently done by drivers into
      2 functions called "register" and "unregister". These will gather more
      functionality in future patches, which will better justify the chosen
      naming scheme.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cedf4670
    • V
      net: dsa: tag_8021q: remove struct packet_type declaration · 8afbea18
      Vladimir Oltean committed
      This is no longer necessary since tag_8021q doesn't register itself as a
      full-blown tagger anymore.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8afbea18
    • V
      net: dsa: sja1105: delete the best_effort_vlan_filtering mode · 0fac6aa0
      Vladimir Oltean committed
      Simply put, the best-effort VLAN filtering mode relied on VLAN retagging
      from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to
      decode the source port in the tagger, but the VLAN retagging
      implementation inside the sja1105 chips is not the best and we were
      relying on marginal operating conditions.
      
      The most notable limitation of the best-effort VLAN filtering mode is
      its incapacity to treat this case properly:
      
      ip link add br0 type bridge vlan_filtering 1
      ip link set swp2 master br0
      ip link set swp4 master br0
      bridge vlan del dev swp4 vid 1
      bridge vlan add dev swp4 vid 1 pvid
      
      When sending an untagged packet through swp2, the expectation is for it
      to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1
      on egress). But the switch will send it as egress-untagged.
      
      There was an attempt to fix this here:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210407201452.1703261-2-olteanv@gmail.com/
      
      but it failed miserably because it broke PTP RX timestamping, in a way
      that cannot be corrected due to hardware issues related to VLAN
      retagging.
      
      So with either PTP broken or pushing VLAN headers on egress for untagged
      packets being broken, the sad reality is that the best-effort VLAN
      filtering code is broken. Delete it.
      
      Note that this means there will be a temporary loss of functionality in
      this driver until it is replaced with something better (network stack
      RX/TX capability for "mode 2" as described in
      Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware
      bridge" case). We simply cannot keep this code until that driver rework
      is done, it is super bloated and tangled with tag_8021q.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0fac6aa0
  5. 16 Jul, 2021 9 commits
    • C
      sock_map: Relax config dependency to CONFIG_NET · 17edea21
      Cong Wang committed
      Currently sock_map still has a Kconfig dependency on CONFIG_INET,
      but there is no actual functional dependency on it since we
      introduced ->psock_update_sk_prot().
      
      We have to extend it to CONFIG_NET now as we are going to
      support AF_UNIX.
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
      17edea21
    • J
      bpf: Enable BPF_TRAMP_F_IP_ARG for trampolines with call_get_func_ip · 1e37392c
      Jiri Olsa committed
      Enable BPF_TRAMP_F_IP_ARG only for trampolines that actually need it.

      BPF_TRAMP_F_IP_ARG adds 3 extra instructions to the trampoline code
      and is used only by programs that call the bpf_get_func_ip helper,
      which is added in the following patch and sets the call_get_func_ip bit.
      
      This patch ensures that BPF_TRAMP_F_IP_ARG flag is used only for
      trampolines that have programs with call_get_func_ip set.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-3-jolsa@kernel.org
      1e37392c
    • J
      bpf, x86: Store caller's ip in trampoline stack · 7e6f3cd8
      Jiri Olsa committed
      Store the caller's ip in the trampoline's stack. Trampoline programs
      can reach the IP at the (ctx - 8) address, so there's no change in
      the program's arguments interface.

      The IP address is taken from [fp + 8], which is the return address
      from the initial 'call fentry' call to the trampoline.

      This IP address will be returned via the bpf_get_func_ip helper,
      which is added in the following patches.
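
      The (ctx - 8) layout can be pictured with a small user space model.
      This is only a sketch: the array stands in for the trampoline stack,
      and the names tramp_ctx() and get_func_ip() are hypothetical. Slot 0
      holds the caller's IP and the program's ctx points at slot 1, where
      the traced function's arguments begin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ctx handed to the program: one 8-byte slot past the stored IP. */
static const void *tramp_ctx(const uint64_t *stack)
{
	assert(stack != NULL);
	return &stack[1];
}

/* Read the caller's IP from the 8 bytes just below ctx. */
static uint64_t get_func_ip(const void *ctx)
{
	assert(ctx != NULL);
	return ((const uint64_t *)ctx)[-1];
}
```

      Because the IP sits below ctx, the arguments keep their usual
      positions at ctx[0], ctx[1] and so on.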
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-2-jolsa@kernel.org
      7e6f3cd8
    • A
      bpf: Teach stack depth check about async callbacks. · 7ddc80a4
      Alexei Starovoitov committed
      Teach max stack depth checking algorithm about async callbacks
      that don't increase bpf program stack size.
      Also add sanity check that bpf_tail_call didn't sneak into async cb.
      It's impossible, since PTR_TO_CTX is not available in async cb,
      hence the program cannot contain bpf_tail_call(ctx,...);
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
      7ddc80a4
    • A
      bpf: Implement verifier support for validation of async callbacks. · bfc6bb74
      Alexei Starovoitov committed
      bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
      PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
      and pass them into the helpers as function callbacks.
      In case of bpf_for_each_map_elem() the callback is invoked synchronously
      and the verifier treats it as a normal subprogram call by adding another
      bpf_func_state and new frame in __check_func_call().
      bpf_timer_set_callback() doesn't invoke the callback directly.
      The subprogram will be called asynchronously from bpf_timer_cb().
      Teach the verifier to validate such async callbacks as special kind
      of jump by pushing verifier state into stack and let pop_stack() process it.
      
      Special care needs to be taken during state pruning.
      The call insn doing bpf_timer_set_callback has to be a prune_point.
      Otherwise short timer callbacks might not have prune points in front of
      bpf_timer_set_callback(), which means is_state_visited() will be called
      after this call insn is processed in __check_func_call(). As a result,
      another async_cb state will be pushed to be walked later and the verifier
      will eventually hit the BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
      Since push_async_cb() looks like another push_stack() branch, the
      infinite loop detection will trigger a false positive. To recognize
      this case, mark such states as in_async_callback_fn.
      To distinguish an infinite loop in an async callback from the same
      callback called with different arguments for a different map and timer,
      add async_entry_cnt to bpf_func_state.

      Enforce a return value of zero from async callbacks.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
      bfc6bb74
    • A
      bpf: Prevent pointer mismatch in bpf_timer_init. · 3e8ce298
      Alexei Starovoitov committed
      bpf_timer_init() arguments are:
      1. pointer to a timer (which is embedded in map element).
      2. pointer to a map.
      Make sure that pointer to a timer actually belongs to that map.
      
      Use map_uid (which is unique id of inner map) to reject:
      inner_map1 = bpf_map_lookup_elem(outer_map, key1)
      inner_map2 = bpf_map_lookup_elem(outer_map, key2)
      if (inner_map1 && inner_map2) {
          timer = bpf_map_lookup_elem(inner_map1);
          if (timer)
              // mismatch would have been allowed
              bpf_timer_init(timer, inner_map2);
      }
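
      The check can be modelled in user space roughly as follows. This is a
      simplified sketch and not the verifier code: struct reg_info and
      check_timer_map() are hypothetical names, and it assumes that both
      inner map lookups are seen with the same map pointer, so that only
      map_uid can tell them apart.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for what is known about a register. */
struct reg_info {
	const void *map_ptr;	/* map the register refers to */
	uint32_t map_uid;	/* unique id of the inner map, 0 otherwise */
};

/* Accept the call only if the timer's map and the map argument agree
 * on both the map pointer and the map_uid. */
static int check_timer_map(const struct reg_info *timer_reg,
			   const struct reg_info *map_reg)
{
	assert(timer_reg != NULL && map_reg != NULL);

	if (timer_reg->map_ptr != map_reg->map_ptr ||
	    timer_reg->map_uid != map_reg->map_uid)
		return -EINVAL;

	return 0;
}
```

      In this model, the timer looked up in inner_map1 carries map_uid 1,
      so passing inner_map2 (map_uid 2) is rejected even though the map
      pointers compare equal.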
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
      3e8ce298
    • A
      bpf: Add map side support for bpf timers. · 68134668
      Alexei Starovoitov committed
      Restrict bpf timers to array, hash (both preallocated and kmalloced), and
      lru map types. The per-cpu maps with timers don't make sense, since 'struct
      bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
      the number of timers depends on number of possible cpus and timers would not be
      accessible from all cpus. lpm map support can be added in the future.
      The timers in inner maps are supported.
      
      The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
      bpf_timer in a given map element.
      
      Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
      that map element indeed contains 'struct bpf_timer'.
      
      Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
      map element data is reused in preallocated htab and lru maps.
      
      Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
      map element. There could be one of each, but not more than one. Due to 'one
      bpf_timer in one element' restriction do not support timers in global data,
      since global data is a map of single element, but from bpf program side it's
      seen as many global variables and restriction of single global timer would be
      odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
      timers, since user space could corrupt an mmap-ed element and crash the
      kernel. The maps with timers cannot be readonly. Due to these restrictions,
      search for bpf_timer in datasec BTF in case it was placed in the global data to
      report a clear error.
      
      The previous patch allowed 'struct bpf_timer' as a first field in a map
      element only. Relax this restriction.
      
      Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
      the timer when lru map deletes an element as a part of its eviction algorithm.

      Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
      The timer operations are done through helpers only.
      This is similar to 'struct bpf_spin_lock'.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
      68134668
    • A
      bpf: Introduce bpf timers. · b00628b1
      Alexei Starovoitov committed
      Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
      in hash/array/lru maps as a regular field and helpers to operate on it:
      
      // Initialize the timer.
      // First 4 bits of 'flags' specify clockid.
      // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
      long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);
      
      // Configure the timer to call 'callback_fn' static function.
      long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);
      
      // Arm the timer to expire 'nsec' nanoseconds from the current time.
      long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);
      
      // Cancel the timer and wait for callback_fn to finish if it was running.
      long bpf_timer_cancel(struct bpf_timer *timer);
      
      Here is what a BPF program might look like:
      struct map_elem {
          int counter;
          struct bpf_timer timer;
      };
      
      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1000);
          __type(key, int);
          __type(value, struct map_elem);
      } hmap SEC(".maps");
      
      static int timer_cb(void *map, int *key, struct map_elem *val);
      /* val points to particular map element that contains bpf_timer. */
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(test1, int a)
      {
          struct map_elem *val;
          int key = 0;
      
          val = bpf_map_lookup_elem(&hmap, &key);
          if (val) {
              bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
              bpf_timer_set_callback(&val->timer, timer_cb);
              bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */, 0);
          }
          return 0;
      }
      
      This patch adds helper implementations that rely on hrtimers
      to call bpf functions as timers expire.
      The following patches add necessary safety checks.
      
      Only programs with CAP_BPF are allowed to use bpf_timer.
      
      The number of timers used by the program is constrained by
      the memcg recorded at map creation time.
      
      The bpf_timer_init() helper needs an explicit 'map' argument because inner maps
      are dynamic and not known at load time. bpf_timer_set_callback(), in contrast,
      receives a hidden 'aux->prog' argument supplied by the verifier.
      
      The prog pointer is needed to do refcnting of bpf program to make sure that
      program doesn't get freed while the timer is armed. This approach relies on
      "user refcnt" scheme used in prog_array that stores bpf programs for
      bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
      paired with bpf_timer_cancel() that will drop the prog refcnt. The
      ops->map_release_uref is responsible for cancelling the timers and dropping
      prog refcnt when user space reference to a map reaches zero.
      This uref approach is done to make sure that Ctrl-C of user space process will
      not leave timers running forever unless the user space explicitly pinned a map
      that contained timers in bpffs.
      
      bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
      have user references (is not held by open file descriptor from user space and
      not pinned in bpffs).
      
      The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
      and free the timer if given map element had it allocated.
      "bpftool map update" command can be used to cancel timers.
      
      The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
      '__u64 :64' provides 8 bytes of padding but has only 1 byte alignment.
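
      The effect of the attribute can be checked with a small user space
      sketch. The struct names timer_like and map_elem are stand-ins chosen
      for illustration, a 4-byte int is assumed, and a struct made only of
      unnamed bit-fields relies on GCC and Clang accepting it. Only the
      aligned variant is asserted, since the alignment without the
      attribute depends on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the definition quoted in this commit message. */
struct timer_like {
	uint64_t :64;
	uint64_t :64;
} __attribute__((aligned(8)));

/* Map value from the example program, with the timer after an int. */
struct map_elem {
	int counter;
	struct timer_like timer;
};

_Static_assert(sizeof(struct timer_like) == 16, "two 64-bit slots");
_Static_assert(_Alignof(struct timer_like) == 8, "forced by the attribute");

/* Offset at which the timer lands inside the map value. */
static size_t timer_offset(void)
{
	return offsetof(struct map_elem, timer);
}
```

      With the attribute in place the timer is placed on an 8-byte
      boundary inside the map value regardless of the fields before it.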
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
      b00628b1
    • R
      bus: mhi: pci-generic: configurable network interface MRU · 5c2c8531
      Richard Laing committed
      The MRU value used by the MHI MBIM network interface affects
      the throughput performance of the interface. Different modem
      models use different default MRU sizes based on their bandwidth
      capabilities. Large values generally result in higher throughput
      for larger packet sizes.
      
      In addition, if the MRU used by the MHI device is larger than that
      specified in the MHI net device, the data is fragmented and needs
      to be re-assembled, which generates a (single) warning message about
      the fragmented packets. Setting the MRU on both ends avoids the
      extra processing to re-assemble the packets.
      
      This patch allows the documented MRU for a modem to be automatically
      set as the MHI net device MRU avoiding fragmentation and improving
      throughput performance.
      Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5c2c8531
  6. 15 Jul, 2021 1 commit
  7. 13 Jul, 2021 1 commit
  8. 12 Jul, 2021 1 commit
  9. 09 Jul, 2021 11 commits
    • J
      bpf: Track subprog poke descriptors correctly and fix use-after-free · f263a814
      John Fastabend committed
      Subprograms are calling map_poke_track(), but on program release there is no
      hook to call map_poke_untrack(). However, on program release, the aux memory
      (and poke descriptor table) is freed even though we still have a reference to
      it in the element list of the map aux data. When we run map_poke_run(), we then
      end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
      
        [...]
        [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
        [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
        [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
        [  402.824715] Call Trace:
        [  402.824719]  dump_stack+0x93/0xc2
        [  402.824727]  print_address_description.constprop.0+0x1a/0x140
        [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824744]  kasan_report.cold+0x7c/0xd8
        [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
        [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
        [...]
      
      The elements concerned are walked as follows:
      
          for (i = 0; i < elem->aux->size_poke_tab; i++) {
                 poke = &elem->aux->poke_tab[i];
          [...]
      
      The access to size_poke_tab is a 4 byte read, verified by checking offsets
      in the KASAN dump:
      
        [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                       which belongs to the cache kmalloc-1k of size 1024
        [  402.825008] The buggy address is located 320 bytes inside of
                       1024-byte region [ffff8881905a7800, ffff8881905a7c00)
      
      The pahole output of bpf_prog_aux:
      
        struct bpf_prog_aux {
          [...]
          /* --- cacheline 5 boundary (320 bytes) --- */
          u32                        size_poke_tab;        /*   320     4 */
          [...]
      
      In general, subprograms do not necessarily manage their own data structures.
      For example, BTF func_info and linfo are just pointers to the main program
      structure. This allows reference counting and cleanup to be done on the latter
      which simplifies their management a bit. The aux->poke_tab struct, however,
      did not follow this logic. The initial proposed fix for this use-after-free
      bug further embedded poke data tracking into the subprogram with proper
      reference counting. However, Daniel and Alexei questioned why we were treating
      these objects specially; I agree, it's unnecessary. The fix here removes the per
      subprogram poke table allocation and map tracking and instead simply points
      the aux->poke_tab pointer at the main program's poke table. This way, map
      tracking is simplified to the main program and we do not need to manage them
      per subprogram.
      
      This also means, bpf_prog_free_deferred(), which unwinds the program reference
      counting and kfrees objects, needs to ensure that we don't try to double free
      the poke_tab when free'ing the subprog structures. This is easily solved by
      NULL'ing the poke_tab pointer. The second detail is to ensure that per
      subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
      this, we add a pointer in the poke structure to point at the subprogram value
      so JITs can easily check while walking the poke_tab structure if the current
      entry belongs to the current program. The aux pointer is stable and therefore
      suitable for such comparison. On the jit_subprogs() error path, we omit
      cleaning up the poke->aux field because these are only ever referenced from
      the JIT side, but on error we will never make it to the JIT, so it's fine to
      leave them dangling. Removing these pointers would complicate the error path
      for no reason. However, we do need to untrack all poke descriptors from the
      main program as otherwise they could race with the freeing of JIT memory from
      the subprograms. Lastly, a748c697 ("bpf: propagate poke descriptors to
      subprograms") had an off-by-one on the subprogram instruction index range
      check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
      However, subprog_end is the next subprogram's start instruction.
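
      The off-by-one can be illustrated with a minimal sketch. The helper
      names are hypothetical; the point is only that [subprog_start,
      subprog_end) is a half-open range, because subprog_end is already the
      first instruction of the next subprogram.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Range check as quoted above: wrongly claims the instruction at
 * subprog_end, which belongs to the next subprogram. */
static bool insn_in_subprog_buggy(uint32_t insn_idx, uint32_t subprog_start,
				  uint32_t subprog_end)
{
	assert(subprog_start <= subprog_end);
	return insn_idx >= subprog_start && insn_idx <= subprog_end;
}

/* Half-open variant: subprog_end is excluded. */
static bool insn_in_subprog(uint32_t insn_idx, uint32_t subprog_start,
			    uint32_t subprog_end)
{
	assert(subprog_start <= subprog_end);
	return insn_idx >= subprog_start && insn_idx < subprog_end;
}
```

      With subprograms covering instructions 0..9 and 10..19, the inclusive
      check attributes instruction 10 to both of them, while the half-open
      check attributes it only to the second one.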
      
      Fixes: a748c697 ("bpf: propagate poke descriptors to subprograms")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
      f263a814
    • A
      mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t * · dc4875f0
      Aneesh Kumar K.V committed
      No functional change in this patch.
      
      [aneesh.kumar@linux.ibm.com: m68k build error reported by kernel robot]
        Link: https://lkml.kernel.org/r/87tulxnb2v.fsf@linux.ibm.com
      
      Link: https://lkml.kernel.org/r/20210615110859.320299-2-aneesh.kumar@linux.ibm.com
      Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc4875f0
    • A
      mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t * · 9cf6fa24
      Aneesh Kumar K.V committed
      No functional change in this patch.
      
      [aneesh.kumar@linux.ibm.com: fix]
        Link: https://lkml.kernel.org/r/87wnqtnb60.fsf@linux.ibm.com
      [sfr@canb.auug.org.au: another fix]
        Link: https://lkml.kernel.org/r/20210619134410.89559-1-aneesh.kumar@linux.ibm.com
      
      Link: https://lkml.kernel.org/r/20210615110859.320299-1-aneesh.kumar@linux.ibm.com
      Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cf6fa24
    • S
      kdump: use vmlinux_build_id to simplify · 44e8a5e9
      Stephen Boyd committed
      We can use the vmlinux_build_id array here now instead of open coding it.
      This mostly consolidates code.
      
      Link: https://lkml.kernel.org/r/20210511003845.2429846-14-swboyd@chromium.org
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44e8a5e9
    • S
      module: add printk formats to add module build ID to stacktraces · 9294523e
      Stephen Boyd committed
      Let's make kernel stacktraces easier to identify by including the build
      ID[1] of a module if the stacktrace is printing a symbol from a module.
      This makes it simpler for developers to locate a kernel module's full
      debuginfo for a particular stacktrace.  Combined with
      scripts/decode_stacktrace.sh, a developer can download the matching
      debuginfo from a debuginfod[2] server and find the exact file and line
      number for the functions plus offsets in a stacktrace that match the
      module.  This is especially useful for pstore crash debugging where the
      kernel crashes are recorded in something like console-ramoops and the
      recovery kernel/modules are different or the debuginfo doesn't exist on
      the device due to space concerns (the debuginfo can be too large for space
      limited devices).
      
      Originally, I put this on the %pS format, but that was quickly rejected
      given that %pS is used in other places such as ftrace where build IDs
      aren't meaningful.  There were some discussions on the list to put every
      module build ID into the "Modules linked in:" section of the stacktrace
      message but that quickly becomes very hard to read once you have more than
      three or four modules linked in.  It also provides too much information
      when we don't expect each module to be traversed in a stacktrace.  Having
      the build ID for modules that aren't important just makes things messy.
      Splitting it to multiple lines for each module quickly explodes the number
      of lines printed in an oops too, possibly wrapping the warning off the
      console.  And finally, trying to stash away each module used in a
      callstack to provide the ID of each symbol printed is cumbersome and would
      require changes to each architecture to stash away modules and return
      their build IDs once unwinding has completed.
      
      Instead, we opt for the simpler approach of introducing new printk formats
      '%pS[R]b' for "pointer symbolic backtrace with module build ID" and '%pBb'
      for "pointer backtrace with module build ID" and then updating the few
      places in the architecture layer where the stacktrace is printed to use
      this new format.
      
      Before:
      
       Call trace:
        lkdtm_WARNING+0x28/0x30 [lkdtm]
        direct_entry+0x16c/0x1b4 [lkdtm]
        full_proxy_write+0x74/0xa4
        vfs_write+0xec/0x2e8
      
      After:
      
       Call trace:
        lkdtm_WARNING+0x28/0x30 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
        direct_entry+0x16c/0x1b4 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
        full_proxy_write+0x74/0xa4
        vfs_write+0xec/0x2e8
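
The "After" format above can be consumed by tooling roughly as follows. This is a minimal userspace sketch, not part of the patch: the `parse_frame` helper and its regular expression are hypothetical and assume frames look exactly like the example lines (symbol, `+offset/size`, then an optional `[module build-id]` suffix).

```python
import re

# Matches frames such as "lkdtm_WARNING+0x28/0x30 [lkdtm 6c22...]".
# The module name and the build ID are both optional: symbols built into
# vmlinux carry neither, and older kernels print the module name only.
FRAME_RE = re.compile(
    r"^\s*(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[\w-]+)(?:\s+(?P<build_id>[0-9a-f]+))?\])?\s*$"
)


def parse_frame(line):
    """Split one stack trace frame into its fields, or return None."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    frame = match.groupdict()
    # Offsets and sizes are printed in hex; convert them to integers.
    frame["offset"] = int(frame["offset"], 16)
    frame["size"] = int(frame["size"], 16)
    return frame
```

Frames without a module, such as `vfs_write+0xec/0x2e8`, come back with `module` and `build_id` set to `None`, so a caller can fall back to the vmlinux build ID for them.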
      
      [akpm@linux-foundation.org: fix build with CONFIG_MODULES=n, tweak code layout]
      [rdunlap@infradead.org: fix build when CONFIG_MODULES is not set]
        Link: https://lkml.kernel.org/r/20210513171510.20328-1-rdunlap@infradead.org
      [akpm@linux-foundation.org: make kallsyms_lookup_buildid() static]
      [cuibixuan@huawei.com: fix build error when CONFIG_SYSFS is disabled]
        Link: https://lkml.kernel.org/r/20210525105049.34804-1-cuibixuan@huawei.com
      
      Link: https://lkml.kernel.org/r/20210511003845.2429846-6-swboyd@chromium.org
      Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1]
      Link: https://sourceware.org/elfutils/Debuginfod.html [2]
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9294523e
    • S
      dump_stack: add vmlinux build ID to stack traces · 22f4e66d
      Stephen Boyd committed
      Add the running kernel's build ID[1] to the stacktrace information header.
      This makes it simpler for developers to locate the vmlinux with full
      debuginfo for a particular kernel stacktrace.  Combined with
      scripts/decode_stacktrace.sh, a developer can download the correct
      vmlinux from a debuginfod[2] server and find the exact file and line
      number for the functions plus offsets in a stacktrace.
      
      This is especially useful for pstore crash debugging where the kernel
      crashes are recorded in the pstore logs and the recovery kernel is
      different or the debuginfo doesn't exist on the device due to space
      concerns (the data can be large and a security concern).  The stacktrace
      can be analyzed after the crash by using the build ID to find the matching
      vmlinux and understand where in the function something went wrong.
      
      Example stacktrace from lkdtm:
      
       WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 lkdtm_WARNING+0x28/0x30 [lkdtm]
       Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup uinput xt_MASQUERADE
       CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 aa23f7a1231c229de205662d5a9e0d4c580f19a1
       Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
       pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
       pc : lkdtm_WARNING+0x28/0x30 [lkdtm]
      
      The hex string aa23f7a1231c229de205662d5a9e0d4c580f19a1 is the build ID,
      following the kernel version number. Put it all behind a config option,
      STACKTRACE_BUILD_ID, so that kernel developers can remove this
      information if they decide it is too much.
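
As a rough illustration of how the header line can be used, the sketch below pulls the build ID out of the "CPU:" line and builds a debuginfod lookup URL from it. The helper names and the server address are hypothetical, and the regular expression assumes a 40-character hex ID at the end of the line, as in the example above. The `/buildid/<id>/debuginfo` path follows the debuginfod web API.

```python
import re

# Assumes the build ID is the last token on the "CPU:" header line and is a
# 40-character lowercase hex string, as in the example trace.
BUILD_ID_RE = re.compile(r"^\s*CPU: .*\s([0-9a-f]{40})\s*$")


def header_build_id(line):
    """Return the build ID from a stack trace header line, or None."""
    match = BUILD_ID_RE.match(line)
    return match.group(1) if match else None


def debuginfod_url(server, build_id, artifact="debuginfo"):
    """Build the debuginfod URL for a build ID on the given server."""
    return f"{server.rstrip('/')}/buildid/{build_id}/{artifact}"


if __name__ == "__main__":
    header = (" CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 "
              "aa23f7a1231c229de205662d5a9e0d4c580f19a1")
    # "debuginfod.example.org" is a placeholder, not a real server.
    print(debuginfod_url("https://debuginfod.example.org",
                         header_build_id(header)))
```

A header printed by a kernel built without STACKTRACE_BUILD_ID has no trailing hex string, so `header_build_id` returns `None` for it.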
      
      Link: https://lkml.kernel.org/r/20210511003845.2429846-5-swboyd@chromium.org
      Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1]
      Link: https://sourceware.org/elfutils/Debuginfod.html [2]
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22f4e66d
    • S
      buildid: stash away kernels build ID on init · 83cc6fa0
      Stephen Boyd committed
      Parse the kernel's build ID at initialization so that other code can print
      a hex format string representation of the running kernel's build ID.  This
      will be used in the kdump and dump_stack code so that developers can
      easily locate the vmlinux debug symbols for a crash/stacktrace.
      
      [swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()]
        Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83cc6fa0
    • S
      buildid: add API to parse build ID out of buffer · 7eaf3cf3
      Stephen Boyd committed
      Add an API that can parse the build ID out of a buffer, instead of a vma,
      to support printing a kernel module's build ID for stack traces.
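
For readers unfamiliar with the note layout, the following userspace sketch shows the general idea of scanning a buffer of ELF notes for the GNU build ID. It is an illustration under stated assumptions, not the kernel implementation: the function name is hypothetical, the buffer is assumed to hold only a little-endian note section, and each note is assumed to use the usual 4-byte padding for its name and descriptor.

```python
import struct

NT_GNU_BUILD_ID = 3  # ELF note type that carries the GNU build ID


def _align4(n):
    """Round n up to the next multiple of 4 (ELF note padding)."""
    return (n + 3) & ~3


def build_id_from_notes(buf, byteorder="<"):
    """Return the build ID in buf as a hex string, or None if absent."""
    header = struct.Struct(byteorder + "III")  # namesz, descsz, type
    offset = 0
    while offset + header.size <= len(buf):
        namesz, descsz, ntype = header.unpack_from(buf, offset)
        name_start = offset + header.size
        desc_start = name_start + _align4(namesz)
        desc_end = desc_start + descsz
        if desc_end > len(buf):
            return None  # truncated or malformed note
        name = buf[name_start:name_start + namesz]
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0":
            return buf[desc_start:desc_end].hex()
        # Skip to the next note; the descriptor is padded as well.
        offset = desc_start + _align4(descsz)
    return None
```

Working on a plain buffer rather than a mapped file is what makes this approach usable for module images that are already in memory, which is the use case the commit message describes.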
      
      Link: https://lkml.kernel.org/r/20210511003845.2429846-3-swboyd@chromium.org
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7eaf3cf3
    • K
      mm: add setup_initial_init_mm() helper · 5748fbc5
      Kefeng Wang committed
      Patch series "init_mm: cleanup ARCH's text/data/brk setup code", v3.
      
      Add setup_initial_init_mm() helper, then use it to cleanup the text, data
      and brk setup code.
      
      This patch (of 15):
      
      Add setup_initial_init_mm() helper to setup kernel text, data and brk.
      
      Link: https://lkml.kernel.org/r/20210608083418.137226-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20210608083418.137226-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5748fbc5
    • Z
      mm: fix spelling mistakes in header files · 06c88398
      Zhen Lei committed
      Fix some spelling mistakes in comments:
      successfull ==> successful
      potentialy ==> potentially
      alloced ==> allocated
      indicies ==> indices
      wont ==> won't
      resposible ==> responsible
      dirtyness ==> dirtiness
      droppped ==> dropped
      alread ==> already
      occured ==> occurred
      interupts ==> interrupts
      extention ==> extension
      slighly ==> slightly
      Dont't ==> Don't
      
      Link: https://lkml.kernel.org/r/20210531034849.9549-2-thunder.leizhen@huawei.com
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06c88398
    • M
      arch, mm: wire up memfd_secret system call where relevant · 7bb7f2ac
      Mike Rapoport committed
      Wire up memfd_secret system call on architectures that define
      ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.
      
      Link: https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Bottomley <jejb@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Will Deacon <will@kernel.org>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7bb7f2ac