1. 02 Nov, 2021 6 commits
    • J
      net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0
      James Prestwood committed
      This change introduces a new sysctl parameter, arp_evict_nocarrier.
      When set (default) the ARP cache will be cleared on a NOCARRIER event.
      This new option has been defaulted to '1' which maintains existing
      behavior.
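
A minimal sketch of how user space might toggle the new parameter for one interface. It assumes the knob is exposed per interface under /proc/sys/net/ipv4/conf/<ifname>/ (the usual location for ARP sysctls; the path is not stated in this commit message), and the helper names arp_evict_path() and arp_evict_set() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Build the per-interface sysctl path (assumed location, see lead-in).
 * Returns 0 on success, -1 if the path does not fit in buf. */
int arp_evict_path(char *buf, size_t len, const char *ifname)
{
	int n = snprintf(buf, len,
			 "/proc/sys/net/ipv4/conf/%s/arp_evict_nocarrier",
			 ifname);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write 1 (evict on NOCARRIER, the default) or 0 (keep the ARP cache).
 * Needs root and a kernel that carries this change. */
int arp_evict_set(const char *ifname, int enable)
{
	char path[128];
	FILE *f;
	int ok;

	if (arp_evict_path(path, sizeof(path), ifname) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", enable ? 1 : 0) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A wireless daemon could call arp_evict_set("wlan0", 0) before a roam and restore 1 afterwards; "wlan0" is only an example interface name.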
      
      Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
      
      commit 859bd2ef
      Author: David Ahern <dsahern@gmail.com>
      Date:   Thu Oct 11 20:33:49 2018 -0700
      
          net: Evict neighbor entries on carrier down
      
      The reason for this change is to prevent the ARP cache from being
      cleared when a wireless device roams. Specifically for wireless roams
      the ARP cache should not be cleared because the underlying network has not
      changed. Clearing the ARP cache in this case can introduce significant
      delays sending out packets after a roam.
      
      A user reported such a situation here:
      
      https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
      
      After some investigation it was found that the kernel was holding onto
      packets until ARP finished which resulted in this 1 second delay. It
      was also found that the first ARP who-has was never responded to,
      which is actually what causes the delay. This change is more or less
      working around this behavior, but again, there is no reason to clear
      the cache on a roam anyways.
      
      As for the unanswered who-has, we know the packet made it OTA since
      it was seen while monitoring. Why it never received a response is
      unknown. In any case, since this is a problem on the AP side of things
      all that can be done is to work around it until it is solved.
      
      Some background on testing/reproducing the packet delay:
      
      Hardware:
       - 2 access points configured for Fast BSS Transition (Though I don't
         see why regular reassociation wouldn't have the same behavior)
       - Wireless station running IWD as supplicant
       - A device on network able to respond to pings (I used one of the APs)
      
      Procedure:
       - Connect to first AP
       - Ping once to establish an ARP entry
       - Start a tcpdump
       - Roam to second AP
       - Wait for operstate UP event, and note the timestamp
       - Start pinging
      
      Results:
      
      Below is the tcpdump after UP. The interface was recorded going UP at
      10:42:01.432875.
      
      10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
      10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
      10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
      10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
      10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
      10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
      10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
      10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
      10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
      10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
      10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
      10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
      10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
      
      You can see the first ARP who-has went out very quickly after UP, but
      was never responded to. Nearly a second later the kernel retries and
      gets a response. Only then do the ping packets go out. If an ARP entry
      is manually added prior to UP (after the cache is cleared) it is seen
      that the first ping is never responded to, so it's not only an issue with
      ARP but with data packets in general.
      
      As mentioned prior, the wireless interface was also monitored to verify
      the ping/ARP packet made it OTA which was observed to be true.
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fcdb44d0
    • J
      net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c · 1d6d336f
      Jean Sacren committed
      In one if branch, (ec->rx_coalesce_usecs != 0) is checked.  When it is
      checked again in two more places, it is always false and has no effect
      on the whole check expression.  We should remove it in both places.
      
      In another if branch, (ec->use_adaptive_rx_coalesce != 0) is checked.
      When it is checked again, it is always false.  We should remove the
      entire branch with it.
      
      In addition we might as well let C precedence dictate by getting rid of
      two pairs of parentheses in the neighboring lines in order to keep
      expressions on both sides of '||' in balance with checkpatch warning
      silenced.
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Link: https://lore.kernel.org/r/20211031012728.8325-1-sakiwit@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1d6d336f
    • J
      Merge branch 'accurate-memory-charging-for-msg_zerocopy' · 8a75e30e
      Jakub Kicinski committed
      Talal Ahmad says:
      
      ====================
      Accurate Memory Charging For MSG_ZEROCOPY
      
      This series improves the accuracy of msg_zerocopy memory accounting.
      At present, when msg_zerocopy is used memory is charged twice for the
      data - once when user space allocates it, and then again within
      __zerocopy_sg_from_iter. The memory charging in the kernel is excessive
      because data is held in user pages and is never actually copied to skb
      fragments. This leads to incorrectly inflated memory statistics for
      programs passing MSG_ZEROCOPY.
      
      We reduce this inaccuracy by introducing the notion of "pure" zerocopy
      SKBs - where all the frags in the SKB are backed by pinned userspace
      pages, and none are backed by copied pages. For such SKBs, tracked via
      the new SKBFL_PURE_ZEROCOPY flag, we elide sk_mem_charge/uncharge
      calls, leading to more accurate accounting.
      
      However, SKBs can also be coalesced by the stack at present,
      potentially leading to "impure" SKBs. We restrict this coalescing so
      it can only happen within the sendmsg() system call itself, for the
      most recently allocated SKB. While this can lead to a small degree of
      double-charging of memory, this case does not arise often in practice
      for workloads that set MSG_ZEROCOPY.
      
      Testing verified that memory usage in the kernel is lowered.
      Instrumentation with counters also showed that charging and
      uncharging are balanced.
      ====================
      
      Link: https://lore.kernel.org/r/20211030020542.3870542-1-mailtalalahmad@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8a75e30e
    • T
      net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Talal Ahmad committed
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages which are already
      accounted to user. Before this change, they are charged again in
      kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
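
The pure-to-mixed transition described above can be modelled in user space. This is a toy sketch only: struct toy_sock, struct toy_skb and the toy_* helpers are hypothetical stand-ins, not the kernel's struct sock or struct sk_buff, and the header charge made for an empty skb is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the kernel structures. */
struct toy_sock {
	size_t charged;		/* bytes currently charged to the socket */
};

struct toy_skb {
	size_t zc_bytes;	/* bytes backed by pinned user pages */
	size_t copied_bytes;	/* bytes copied into kernel memory */
	bool pure_zerocopy;	/* models SKBFL_PURE_ZEROCOPY */
};

/* Zerocopy append: an empty skb becomes pure; pure skbs are not charged. */
void toy_append_zerocopy(struct toy_sock *sk, struct toy_skb *skb, size_t len)
{
	if (skb->zc_bytes == 0 && skb->copied_bytes == 0)
		skb->pure_zerocopy = true;
	if (!skb->pure_zerocopy)
		sk->charged += len;
	skb->zc_bytes += len;
}

/* Copy append: a pure skb is downgraded to mixed, and the zerocopy bytes
 * it already holds are charged at that moment. */
void toy_append_copy(struct toy_sock *sk, struct toy_skb *skb, size_t len)
{
	if (skb->pure_zerocopy) {
		sk->charged += skb->zc_bytes;
		skb->pure_zerocopy = false;
	}
	sk->charged += len;
	skb->copied_bytes += len;
}

/* Free: uncharge only what was charged for this skb. */
void toy_free(struct toy_sock *sk, struct toy_skb *skb)
{
	if (!skb->pure_zerocopy)
		sk->charged -= skb->zc_bytes + skb->copied_bytes;
	skb->zc_bytes = 0;
	skb->copied_bytes = 0;
	skb->pure_zerocopy = false;
}
```

In this model a pure skb never touches the charge counter, while a downgraded skb is charged in full, so charge and uncharge stay balanced in both cases.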
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f1a456f8
    • T
      tcp: rename sk_wmem_free_skb · 03271f3a
      Talal Ahmad committed
      sk_wmem_free_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration to
      include/net/tcp.h
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      03271f3a
    • J
      netdevsim: fix uninit value in nsim_drv_configure_vfs() · 047304d0
      Jakub Kicinski committed
      Build bot points out that I missed initializing ret
      after refactoring.
      Reported-by: kernel test robot <lkp@intel.com>
      Fixes: 1c401078 ("netdevsim: move details of vf config to dev")
      Link: https://lore.kernel.org/r/20211101221845.3188490-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      047304d0
  2. 01 Nov, 2021 34 commits
    • D
      Merge branch 'SMC-tracepoints' · d4a07dc5
      David S. Miller committed
      Tony Lu says:
      
      ====================
      Tracepoints for SMC
      
      This patch set introduces tracepoints for SMC, including the basic
      tracepoint code. The tracepoints help us track SMC's behavior with
      automated tools or BPF tools, and have zero overhead when not enabled.
      
      Compared with kprobes and other dynamic tools, tracepoints are
      considered a stable API, and offer lower tracing overhead with an
      easy-to-use API.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4a07dc5
    • T
      net/smc: Introduce tracepoint for smcr link down · a3a0e81b
      Tony Lu committed
      An SMC-R link down event is important for diagnosing link issues, so
      we should track it, especially in single-NIC mode, where it means the
      upper-layer connection will be shut down. The tracepoint lets us find
      the direct link-down reason promptly: it does not just increase the
      counter, it also records the code location that triggered the event.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3a0e81b
    • T
      net/smc: Introduce tracepoints for tx and rx msg · aff3083f
      Tony Lu committed
      This introduces two tracepoints for SMC tx and rx msg to help us
      diagnose data path issues. These two tracepoints don't cover the
      CORK or MSG_MORE path in tx, just the top half of the data path.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aff3083f
    • T
      net/smc: Introduce tracepoint for fallback · 48262608
      Tony Lu committed
      This introduces a tracepoint for SMC fallback to TCP, so that we can track
      which connection falls back and why, and map the clcsock pointer to
      /proc/net/tcp to find more details about the TCP connection. Compared with
      kprobes or other dynamic tracing, tracepoints are stable and easy to use.
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48262608
    • D
      Merge branch 'amt-driver' · 60088891
      David S. Miller committed
      Taehee Yoo says:
      
      ====================
      amt: add initial driver for Automatic Multicast Tunneling (AMT)
      
      This is an implementation of AMT (Automatic Multicast Tunneling), RFC 7450.
      https://datatracker.ietf.org/doc/html/rfc7450
      
      This implementation supports IGMPv2, IGMPv3, MLDv1, MLDv2, and IPv4
      underlay.
      
       Summary of RFC 7450
      The purpose of this protocol is to provide multicast tunneling.
      The main use case of this protocol is to deliver multicast
      traffic from a multicast-enabled network to sites that lack multicast
      connectivity to the source network.
      There are two roles in the AMT protocol: Gateway and Relay.
      The main purpose of Gateway mode is to forward multicast listening
      information (IGMP, MLD) to the source.
      The main purpose of Relay mode is to forward multicast data to listeners.
      This multicast traffic (IGMP, MLD, multicast data packets) is tunneled.
      
      Listeners are located behind Gateway endpoint.
      But gateway itself can be a listener too.
      Senders are located behind Relay endpoint.
      
          ___________       _________       _______       ________
         |           |     |         |     |       |     |        |
         | Listeners <-----> Gateway <-----> Relay <-----> Source |
         |___________|     |_________|     |_______|     |________|
            IGMP/MLD---------(encap)----------->
               <-------------(decap)--------(encap)------Multicast Data
      
       Usage of AMT interface
      1. Create gateway interface
      ip link add amtg type amt mode gateway local 10.0.0.1 discovery 10.0.0.2 \
      dev gw1_rt gateway_port 2268 relay_port 2268
      
      2. Create Relay interface
      ip link add amtr type amt mode relay local 10.0.0.2 dev relay_rt \
      relay_port 2268 max_tunnels 4
      
      v1 -> v2:
       - Eliminate sparse warnings.
         - Use bool type instead of __be16 for identifying v4/v6 protocol.
      
      v2 -> v3:
       - Fix compile warning due to unused variable.
       - Add missing spinlock comment.
       - Update help message of amt in Kconfig.
      
      v3 -> v4:
       - Split patch.
       - Use CHECKSUM_NONE instead of CHECKSUM_UNNECESSARY.
       - Fix compile error.
      
      v4 -> v5:
       - Remove unnecessary rcu_read_lock().
       - Remove unnecessary amt_change_mtu().
       - Change netlink error message.
       - Add validation for IFLA_AMT_LOCAL_IP and IFLA_AMT_DISCOVERY_IP.
       - Add comments in amt.h.
       - Add missing dev_put() in error path of amt_newlink().
       - Fix typo.
       - Add BUILD_BUG_ON() in amt_smb_cb().
       - Use macro instead of magic values.
       - Use kzalloc() instead of kmalloc().
       - Add selftest script.
      
      v5 -> v6:
       - Reset remote_ip in amt_dev_stop().
      
      v6 -> v7:
       - Fix compile error.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60088891
    • T
      selftests: add amt interface selftest script · c08e8bae
      Taehee Yoo committed
      This is a selftest script for the amt interface.
      This script includes a basic forwarding scenario and a torture scenario.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c08e8bae
    • T
      amt: add mld report message handler · b75f7095
      Taehee Yoo committed
      In the previous patch, igmp report handler was added.
      That handler can be used for mld too.
      So, it uses that common code to parse mld report message.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b75f7095
    • T
      amt: add multicast(IGMP) report message handler · bc54e49c
      Taehee Yoo committed
      The amt 'Relay' interface manages multicast groups (IGMP/MLD) and sources.
      To do so, it needs to parse IGMP/MLD report messages. This adds the
      logic that parses IGMP report messages and saves them in dedicated
      data structures.
      
         struct amt_group_node means one group(igmp/mld).
         struct amt_source_node means one source.
      
      The same source can't exist in the same group.
      The same group can exist in the same tunnel because it manages
      the host address too.
      
      The group information is used when forwarding multicast data.
      If there are no groups in the specific tunnel, Relay doesn't forward it.
      
      Although Relay manages sources, it doesn't support the source filtering
      feature, because sources are tracked only in order to manage groups
      more correctly.
      
      In the next patch, MLD part will be added.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc54e49c
    • T
      amt: add data plane of amt interface · cbc21dc1
      Taehee Yoo committed
      Before forwarding multicast traffic, a tunnel is established between
      the gateway and the relay. To establish it, AMT defines several message
      types, and the message flow looks like the below.
      
                            Gateway                  Relay
                            -------                  -----
                               :        Request        :
                           [1] |           N           |
                               |---------------------->|
                               |    Membership Query   | [2]
                               |    N,MAC,gADDR,gPORT  |
                               |<======================|
                           [3] |   Membership Update   |
                               |   ({G:INCLUDE({S})})  |
                               |======================>|
                               |                       |
          ---------------------:-----------------------:---------------------
         |                     |                       |                     |
         |                     |    *Multicast Data    |  *IP Packet(S,G)    |
         |                     |      gADDR,gPORT      |<-----------------() |
         |    *IP Packet(S,G)  |<======================|                     |
         | ()<-----------------|                       |                     |
         |                     |                       |                     |
          ---------------------:-----------------------:---------------------
                               ~                       ~
                               ~        Request        ~
                           [4] |           N'          |
                               |---------------------->|
                               |   Membership Query    | [5]
                               | N',MAC',gADDR',gPORT' |
                               |<======================|
                           [6] |                       |
                               |       Teardown        |
                               |   N,MAC,gADDR,gPORT   |
                               |---------------------->|
                               |                       | [7]
                               |   Membership Update   |
                               |  ({G:INCLUDE({S})})   |
                               |======================>|
                               |                       |
          ---------------------:-----------------------:---------------------
         |                     |                       |                     |
         |                     |    *Multicast Data    |  *IP Packet(S,G)    |
         |                     |     gADDR',gPORT'     |<-----------------() |
         |    *IP Packet (S,G) |<======================|                     |
         | ()<-----------------|                       |                     |
         |                     |                       |                     |
          ---------------------:-----------------------:---------------------
                               |                       |
                               :                       :
      
      1. Discovery
       - Sent by Gateway to Relay
       - To find Relay unique ip address
      2. Advertisement
       - Sent by Relay to Gateway
       - Contains the unique IP address
      3. Request
       - Sent by Gateway to Relay
       - Solicit to receive 'Query' message.
      4. Query
       - Sent by Relay to Gateway
       - Contains General Query message.
      5. Update
       - Sent by Gateway to Relay
       - Contains report message.
      6. Multicast Data
       - Sent by Relay to Gateway
       - encapsulated multicast traffic.
      7. Teardown
       - Not supported at this time.
      
      Except for the Teardown message, it supports all messages.
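
The message list above can be captured as a small lookup, sketched here in user-space C. The numeric values follow the type assignments in RFC 7450, and the first octet of an AMT message is taken to carry the version in its high nibble and the type in its low nibble. The TOY_AMT_* names and the toy_* helpers are hypothetical and are not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Message type values as assigned in RFC 7450 (names are illustrative). */
enum toy_amt_msg_type {
	TOY_AMT_DISCOVERY = 1,
	TOY_AMT_ADVERTISEMENT = 2,
	TOY_AMT_REQUEST = 3,
	TOY_AMT_MEMBERSHIP_QUERY = 4,
	TOY_AMT_MEMBERSHIP_UPDATE = 5,
	TOY_AMT_MULTICAST_DATA = 6,
	TOY_AMT_TEARDOWN = 7,
};

/* First octet: version in the high nibble, type in the low nibble. */
static inline unsigned int toy_amt_version(uint8_t first)
{
	return first >> 4;
}

static inline unsigned int toy_amt_type(uint8_t first)
{
	return first & 0x0f;
}

/* Per the list above: everything except Teardown is handled. */
static inline bool toy_amt_supported(unsigned int type)
{
	return type >= TOY_AMT_DISCOVERY && type <= TOY_AMT_MULTICAST_DATA;
}

/* Per the list above: Discovery, Request and Update go Gateway -> Relay;
 * Advertisement, Query and Multicast Data go Relay -> Gateway. */
static inline bool toy_amt_sent_by_gateway(unsigned int type)
{
	return type == TOY_AMT_DISCOVERY ||
	       type == TOY_AMT_REQUEST ||
	       type == TOY_AMT_MEMBERSHIP_UPDATE;
}
```

A receive path could use toy_amt_type() on the first payload byte and drop anything for which toy_amt_supported() is false.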
      
      In the next patch, IGMP/MLD logic will be added.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbc21dc1
    • T
      amt: add control plane of amt interface · b9022b53
      Taehee Yoo committed
      It adds definitions and control plane code for AMT.
      This is very similar to UDP tunneling interfaces such as gtp, vxlan, etc.
      In the next patch, data plane code will be added.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9022b53
    • D
      Merge branch 'netdevsim-device-and-bus' · 741948ff
      David S. Miller committed
      Jakub Kicinski says:
      
      ====================
      netdevsim: improve separation between device and bus
      
      VF config falls strangely in between device and bus
      responsibilities today. Because of this bus.c sticks fingers
      directly into struct nsim_dev and we look at nsim_bus_dev
      in many more places than necessary.
      
      Make bus.c contain pure interface code, and move
      the particulars of the logic (which touch on eswitch,
      devlink reloads etc) to dev.c. Rename the functions
      at the boundary of the interface to make the separation
      clearer.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      741948ff
    • J
      netdevsim: rename 'driver' entry points · a66f64b8
      Jakub Kicinski committed
      Rename functions serving as driver entry points
      from nsim_dev_... to nsim_drv_... this makes the
      API boundary between bus and dev clearer.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a66f64b8
    • J
      netdevsim: move max vf config to dev · a3353ec3
      Jakub Kicinski committed
      max_vfs is a strange little beast because the file
      hangs off of nsim's debugfs, but it configures a field
      in the bus device. Move it to dev.c, let's look at it
      as if the device driver was imposing VF limit based
      on FW info (like pci_sriov_set_totalvfs()).
      
      Again, when moving refactor the function not to hold
      the vfs lock pointlessly while parsing the input.
      Wrap the access from the read side in READ_ONCE()
      to appease concurrency checkers. Do not check if
      return value from snprintf() is negative...
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3353ec3
    • J
      netdevsim: move details of vf config to dev · 1c401078
      Jakub Kicinski committed
      Since "eswitch" configuration was added bus.c contains
      a lot of device details which really belong to dev.c.
      
      Restructure the code while moving it.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c401078
    • J
      netdevsim: move vfconfig to nsim_dev · 5e388f3d
      Jakub Kicinski committed
      When netdevsim got split into the faux bus vfconfig ended
      up in the bus device (think pci_dev) which is strange because
      it contains very networky not to say netdevy information.
      Move it to nsim_dev, which is the driver "priv" structure
      for the device.
      
      To make sure we don't race with probe/remove take
      the device lock (much like PCI).
      
      While at it remove the NULL-checking of vfconfigs.
      It appears to be pointless.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e388f3d
    • J
      netdevsim: take rtnl_lock when assigning num_vfs · 26c37d89
      Jakub Kicinski committed
      Legacy VF NDOs look at num_vfs and then based on that
      index into vfconfig. If we don't rtnl_lock() num_vfs
      may get set to 0 and vfconfig freed/replaced while
      the NDO is running.
      
      We don't need to protect replacing vfconfig since it's
      only done when num_vfs is 0.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26c37d89
    • D
      Merge branch 'devlink-locking' · 1adc58ea
      David S. Miller committed
      Jakub Kicinski says:
      
      ====================
      improve ethtool/rtnl vs devlink locking
      
      During ethtool netlink development we decided to move some of
      the commands to devlink. Since we don't want drivers to implement
      both devlink and ethtool version of the commands ethtool ioctl
      falls back to calling devlink. Unfortunately devlink locks must
      be taken before rtnl_lock. This results in a questionable
      dev_hold() / rtnl_unlock() / devlink / rtnl_lock() / dev_put()
      pattern.
      
      This method "works" but its working depends on drivers in question
      not doing much in ethtool_ops->begin / complete, and on the netdev
      not having needs_free_netdev set.
      
      Since commit 437ebfd9 ("devlink: Count struct devlink consumers")
      we can hold a reference on a devlink instance and prevent it from
      going away (sort of like netdev with dev_hold()). We can use this
      to create a more natural reference nesting where we get a ref on
      the devlink instance and make the devlink call entirely outside
      of the rtnl_lock section.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1adc58ea
    • J
      ethtool: don't drop the rtnl_lock half way thru the ioctl · 1af0a094
      Jakub Kicinski committed
      devlink compat code needs to drop rtnl_lock to take
      devlink->lock to ensure correct lock ordering.
      
      This is problematic because we're not strictly guaranteed
      that the netdev will not disappear after we re-lock.
      It may open a possibility of nested ->begin / ->complete
      calls.
      
      Instead of calling into devlink under rtnl_lock take
      a ref on the devlink instance and make the call after
      we've dropped rtnl_lock.
      
      We (continue to) assume that netdevs have an implicit
      reference on the devlink returned from ndo_get_devlink_port
      
      Note that ndo_get_devlink_port will now get called
      under rtnl_lock. That should be fine since none of
      the drivers seem to be taking serious locks inside
      ndo_get_devlink_port.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1af0a094
    • J
      devlink: expose get/put functions · 46db1b77
      Jakub Kicinski committed
      Allow those who hold implicit reference on a devlink instance
      to try to take a full ref on it. This will be used from netdev
      code which has an implicit ref because of driver call ordering.
      
      Note that after recent changes devlink_unregister() may happen
      before netdev unregister, but devlink_free() should still happen
      after, so we are safe to try, but we can't just refcount_inc()
      and assume it's not zero.
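
The "try to take a full ref" idea can be sketched in user space with C11 atomics. This is an illustration of the general increment-unless-zero pattern, not the devlink code; struct toy_obj, toy_try_get() and toy_put() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; stands in for a devlink-like instance. */
struct toy_obj {
	atomic_uint refs;
};

/* Take a reference only if the count is still non-zero. A plain increment
 * would be wrong here: it could resurrect an object whose last reference
 * was already dropped. */
bool toy_try_get(struct toy_obj *o)
{
	unsigned int old = atomic_load(&o->refs);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller released the last one
 * and is therefore responsible for freeing the object. */
bool toy_put(struct toy_obj *o)
{
	return atomic_fetch_sub(&o->refs, 1) == 1;
}
```

A caller holding only an implicit reference would call toy_try_get() first and skip the operation entirely when it returns false.
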
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46db1b77
    • J
      ethtool: handle info/flash data copying outside rtnl_lock · 095cfcfe
      Jakub Kicinski committed
      We need to increase the lifetime of the data for .get_info
      and .flash_update beyond their handlers inside rtnl_lock.
      
      Allocate a union on the heap and use it instead.
      
      Note that we now copy the ethcmd before we lookup dev,
      hopefully there is no crazy user space depending on error
      codes.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      095cfcfe
    • J
      ethtool: push the rtnl_lock into dev_ethtool() · f49deaa6
      Jakub Kicinski committed
      Don't take the lock in net/core/dev_ioctl.c,
      we'll have things to do outside rtnl_lock soon.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f49deaa6
    • D
      Merge branch 'mana-misc' · c6e03dbe
      David S. Miller committed
      Dexuan Cui says:
      
      ====================
      net: mana: some misc patches
      
      Patch 1 is a small fix.
      
      Patch 2 reports OS info to the PF driver.
      Before the patch, the req fields were all zeros.
      
      Patch 3 fixes and cleans up the error handling of HWC creation failure.
      
      Patch 4 adds the callbacks for hibernation/kexec. It's based on patch 3.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c6e03dbe
    • D
      net: mana: Support hibernation and kexec · 635096a8
      Dexuan Cui committed
      Implement the suspend/resume/shutdown callbacks for hibernation/kexec.
      
      Add mana_gd_setup() and mana_gd_cleanup() for some common code, and
      use them in the mana_gd_* callbacks.
      
      Reuse mana_probe/remove() for the hibernation path.
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: mana: Improve the HWC error handling · 62ea8b77
      Dexuan Cui committed
      Currently when the HWC creation fails, the error handling is flawed,
      e.g. if mana_hwc_create_channel() -> mana_hwc_establish_channel() fails,
      the resources acquired in mana_hwc_init_queues() are not released.
      
      Enhance mana_hwc_destroy_channel() to do the proper cleanup work and
      call it accordingly.
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
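The cleanup pattern described in the HWC commit above (a single destroy routine that copes with a partially built channel and is called from every failure path) might look roughly like the following. This is a userspace sketch, not the mana driver code: `struct channel`, `channel_create()`, `channel_destroy()` and the `live_objects` counter are hypothetical names introduced for illustration.

```c
#include <stdlib.h>

/* Counts outstanding allocations so leaks are visible in this sketch. */
static int live_objects;

static void *obj_alloc(void)
{
    void *p = malloc(16);
    if (p)
        live_objects++;
    return p;
}

static void obj_free(void *p)
{
    if (p) {
        free(p);
        live_objects--;
    }
}

/* Hypothetical channel with two resources acquired in sequence. */
struct channel {
    void *queues; /* acquired first */
    void *link;   /* acquired second, by the "establish" step */
};

/* Safe to call on a partially constructed channel: NULL members are skipped. */
void channel_destroy(struct channel *ch)
{
    if (!ch)
        return;
    obj_free(ch->link);
    obj_free(ch->queues);
    ch->link = NULL;
    ch->queues = NULL;
}

/* fail_establish simulates a failure after the queues were acquired. */
int channel_create(struct channel *ch, int fail_establish)
{
    ch->queues = NULL;
    ch->link = NULL;

    ch->queues = obj_alloc();
    if (!ch->queues)
        goto err;

    if (fail_establish)
        goto err;

    ch->link = obj_alloc();
    if (!ch->link)
        goto err;

    return 0;

err:
    /* One cleanup path releases whatever was acquired so far. */
    channel_destroy(ch);
    return -1;
}
```

Because the destroy routine tolerates NULL members, the create path needs only one error label, however far construction got before failing.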
    • D
      net: mana: Report OS info to the PF driver · 3c37f357
      Dexuan Cui committed
      The PF driver might use the OS info for statistical purposes.
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net: mana: Fix the netdev_err()'s vPort argument in mana_init_port() · 6c7ea696
      Dexuan Cui committed
      Use the correct port index rather than 0.
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'mptcp-selftests' · 986d2e3d
      David S. Miller committed
      Mat Martineau says:
      
      ====================
      mptcp: Some selftest improvements
      
      Here are a couple of selftest changes for MPTCP.
      
      Patch 1 fixes a mistake where the wrong protocol (TCP vs MPTCP) could be
      requested on the listening socket in some link failure tests.
      
      Patch 2 refactors the simultaneous flow tests to improve timing
      accuracy and give more consistent results.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      selftests: mptcp: more stable simult_flows tests · b6ab64b0
      Paolo Abeni committed
      Currently the simult_flows.sh self-tests are not very stable,
      especially when running on slow VMs.
      
      The tests measure runtime for transfers on multiple subflows
      and check that the time is near the theoretical maximum.
      
      The current test infra introduces a bit of jitter in test
      runtime, due to multiple explicit delays. Additionally the
      runtime is measured by the shell script wrapper. On a slow
      VM, the script overhead is measurable and subject to significant
      jitter.
      
      One solution to make the test more stable would be adding more
      slack to the expected time; that could possibly hide real
      regressions. Instead move the measurement inside the command
      doing the transfer, and drop most unneeded sleeps.
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
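Moving the measurement inside the command that does the transfer, as the simult_flows commit above describes, can be sketched like this. The code is an assumption-laden illustration and not the selftest's actual implementation: `timed_transfer_ms()`, `within_limit()` and `sample_transfer()` are hypothetical helpers. It reads a monotonic clock right around the transfer, so shell start-up and teardown time never enter the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic clock in milliseconds; unaffected by wall-clock adjustments. */
static int64_t now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical stand-in for the real data transfer: bumps a counter. */
void sample_transfer(void *arg)
{
    int *counter = arg;

    (*counter)++;
}

/* Time only the transfer itself, not process or script overhead. */
int64_t timed_transfer_ms(void (*transfer)(void *), void *arg)
{
    int64_t start = now_ms();

    transfer(arg);
    return now_ms() - start;
}

/* Pass if the run took no longer than the expected time plus slack. */
int within_limit(int64_t elapsed_ms, int64_t expected_ms, int64_t slack_ms)
{
    return elapsed_ms <= expected_ms + slack_ms;
}
```

With the timing taken this close to the work, the slack passed to `within_limit()` can stay small, which is the trade-off the commit message argues for over simply widening the tolerance.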
    • G
      selftests: mptcp: fix proto type in link_failure tests · 7c909a98
      Geliang Tang committed
      In listener_ns, we should pass srv_proto argument to mptcp_connect command,
      not cl_proto.
      
      Fixes: 7d1e6f16 ("selftests: mptcp: add testcase for active-back")
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      nfp: flower: Allow ipv6gretap interface for offloading · f7536ffb
      Yu Xiao committed
      The tunnel_type check only allows for "netif_is_gretap", but for
      OVS the port is actually "netif_is_ip6gretap" when setting up GRE
      for ipv6, which means the offloading request was rejected before.
      
      Therefore, add "netif_is_ip6gretap" to allow the ipv6gretap interface
      to be offloaded.
      Signed-off-by: Yu Xiao <yu.xiao@corigine.com>
      Signed-off-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
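The effect of the nfp change above can be illustrated with a toy model of the check. The helpers below only mimic the shape of the kernel's netif_is_gretap() and netif_is_ip6gretap() predicates; `struct fake_netdev`, `enum dev_kind` and both `gre_offload_allowed_*` functions are hypothetical and are not taken from the driver.

```c
#include <stdbool.h>

/* Hypothetical device model; the real checks inspect struct net_device. */
enum dev_kind { DEV_ETHER, DEV_GRETAP, DEV_IP6GRETAP };

struct fake_netdev {
    enum dev_kind kind;
};

static bool is_gretap(const struct fake_netdev *dev)
{
    return dev->kind == DEV_GRETAP;
}

static bool is_ip6gretap(const struct fake_netdev *dev)
{
    return dev->kind == DEV_IP6GRETAP;
}

/* Before the change: only IPv4 gretap devices pass the tunnel check. */
bool gre_offload_allowed_before(const struct fake_netdev *dev)
{
    return is_gretap(dev);
}

/* After the change: ip6gretap devices are accepted as well. */
bool gre_offload_allowed_after(const struct fake_netdev *dev)
{
    return is_gretap(dev) || is_ip6gretap(dev);
}
```

In this model an ip6gretap device fails the old predicate and passes the new one, while non-GRE devices are rejected by both.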
    • M
      net: dsa: populate supported_interfaces member · c07c6e8e
      Marek Behún committed
      Add a new DSA switch operation, phylink_get_interfaces, which should
      fill in which PHY_INTERFACE_MODE_* values are supported by a given port.
      
      Use this before phylink_create() to fill phylink's supported_interfaces
      member, allowing phylink to determine which PHY_INTERFACE_MODEs are
      supported.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      [tweaked patch and description to add more complete support -- rmk]
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · ebed1cf5
      David S. Miller committed
      Tony Nguyen says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2021-10-29
      
      This series contains updates to ice and iavf drivers and virtchnl header
      file.
      
      Brett removes the vlan_promisc argument from a function call in the ice driver.
      In the virtchnl header file he removes an unused, reserved define and
      converts raw value defines to instead use the BIT macro.
      
      Marcin adds syncing of MAC addresses when creating switchdev VFs to
      remove error messages on link up. He also stops showing buffer
      information for port representors, which removes duplicated entries
      displayed for the ice driver.
      
      Karen introduces a helper to go from pci_dev to iavf_adapter in the
      iavf driver.
      
      Przemyslaw fixes an issue where iavf was attempting to free IRQs before
      calling disable.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next · 06f1ecd4
      David S. Miller committed
      Steffen Klassert says:
      
      ====================
      pull request (net-next): ipsec-next 2021-10-30
      
      Just two minor changes this time:
      
      1) Remove some superfluous header files from xfrm4_tunnel.c
         From Mianhan Liu.
      
      2) Simplify some error checks in xfrm_input().
         From luo penghao.
      
      Please pull or let me know if there are problems.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 894d0844
      David S. Miller committed
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Use array_size() in ebtables, from Gustavo A. R. Silva.
      
      2) Attach IPS_ASSURED to internal UDP stream state, reported by
         Maciej Zenczykowski.
      
      3) Add NFT_META_IFTYPE to match on the interface type either
         from ingress or egress.
      
      4) Generalize pktinfo->tprot_set to flags field.
      
      5) Allow matching on inner headers / payload data.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
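Item 1 of the Netfilter list above refers to the kernel's array_size() helper, which multiplies two sizes and saturates at SIZE_MAX on overflow so that a following allocation fails instead of being undersized. A userspace analogue might look like the sketch below. `array_size_checked()` is a hypothetical name, and the code assumes a GCC- or Clang-compatible compiler for `__builtin_mul_overflow`; it is not the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace analogue of array_size(): returns a * b, or SIZE_MAX when
 * the multiplication would overflow size_t.
 */
size_t array_size_checked(size_t a, size_t b)
{
    size_t bytes;

    if (__builtin_mul_overflow(a, b, &bytes))
        return SIZE_MAX;
    return bytes;
}
```

Passing the saturated value to an allocator makes the allocation fail cleanly, whereas an unchecked `a * b` could wrap around to a small number and lead to a buffer that is too short.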