1. 10 March 2022, 3 commits
    • tcp: adjust TSO packet sizes based on min_rtt · 65466904
      Committed by Eric Dumazet
      Back when tcp_tso_autosize() and TCP pacing were introduced,
      our focus was really to reduce burst sizes for long distance
      flows.
      
      The simple heuristic of using sk_pacing_rate/1024 has worked
      well, but can lead to too small packets for hosts in the same
      rack/cluster, when thousands of flows compete for the bottleneck.
      
      Neal Cardwell had the idea of making the TSO burst size
      a function of both sk_pacing_rate and tcp_min_rtt().
      
      Indeed, for local flows, sending bigger bursts is better
      to reduce cpu costs, as occasional losses can be repaired
      quite fast.
      
      This patch is based on Neal Cardwell's implementation
      done more than two years ago.
      bbr adjusts max_pacing_rate based on measured bandwidth,
      while cubic would overestimate max_pacing_rate.
      
      /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
      this new feature, in logarithmic steps.
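      
      A minimal sketch of the resulting sizing logic (illustrative names
      and bounds; not the exact kernel code):
      
      /* Old budget: rate/1024.  New: add a bonus that grows as min_rtt
       * shrinks; tcp_tso_rtt_log tunes the scale, 0 disables the bonus. */
      static unsigned long tso_autosize_sketch(unsigned long pacing_rate, /* B/s */
                                               unsigned long min_rtt_us,
                                               unsigned long gso_max_size,
                                               unsigned int tso_rtt_log)
      {
              unsigned long bytes = pacing_rate >> 10; /* sk_pacing_rate/1024 */
      
              if (tso_rtt_log) {
                      /* each halving of min_rtt doubles the extra budget */
                      unsigned long r = min_rtt_us >> tso_rtt_log;
      
                      if (r < 64) /* keep the shift well-defined */
                              bytes += gso_max_size >> r;
              }
              return bytes < gso_max_size ? bytes : gso_max_size;
      }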
      
      Tested:
      
      100Gbit NIC, two hosts in the same rack, 4K MTU.
      600 flows rate-limited to 20000000 bytes per second.
      
      Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
      
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96005
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               65,945.29 msec task-clock                #    2.845 CPUs utilized
               1,314,632      context-switches          # 19935.279 M/sec
                   5,292      cpu-migrations            #   80.249 M/sec
                 940,641      page-faults               # 14264.023 M/sec
         201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
          17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
         136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
          53,809,530,436      instructions              #    0.27  insn per cycle
                                                        #    2.54  stalled cycles per insn  (83.36%)
           9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
             153,008,621      branch-misses             #    1.69% of all branches          (83.32%)
      
            23.182970846 seconds time elapsed
      
      TcpInSegs                       15648792           0.0
      TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
      TcpExtTCPDelivered              58654791           0.0
      TcpExtTCPDeliveredCE            19                 0.0
      
      After patch:
      
      ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96046
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               48,982.58 msec task-clock                #    2.104 CPUs utilized
                 186,014      context-switches          # 3797.599 M/sec
                   3,109      cpu-migrations            #   63.472 M/sec
                 941,180      page-faults               # 19214.814 M/sec
         153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
          12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
         120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
          36,803,672,106      instructions              #    0.24  insn per cycle
                                                        #    3.27  stalled cycles per insn  (83.18%)
           5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
              87,984,616      branch-misses             #    1.48% of all branches          (83.43%)
      
            23.281200256 seconds time elapsed
      
      TcpInSegs                       1434706            0.0
      TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
      TcpExtTCPDelivered              58878971           0.0
      TcpExtTCPDeliveredCE            9664               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/tls: Provide {__,}tls_driver_ctx() unconditionally · 77f09e66
      Committed by Dimitris Michailidis
      Having the definitions of {__,}tls_driver_ctx() under an #if
      guard means code referencing them also needs to rely on the
      preprocessor. The protection doesn't appear to be needed, so make
      the definitions unconditional.
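      
      The shape of the change, with generic names (illustrative, not the
      actual tls header):
      
      struct foo_context { void *priv; };
      
      /* Previously wrapped in "#ifdef CONFIG_FOO ... #endif", forcing the
       * same guard onto every caller.  Defined unconditionally, an unused
       * static inline simply compiles away when the option is off. */
      static inline void *foo_driver_ctx(struct foo_context *ctx)
      {
              return ctx->priv;
      }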
      
      Fixes: db37bc17 ("net/funeth: add the data path")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Dimitris Michailidis <dmichail@fungible.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: tcp: fix shim definition of tcp_inbound_md5_hash · 24055bb8
      Committed by Vladimir Oltean
      When CONFIG_TCP_MD5SIG isn't enabled, there is a compilation bug due to
      the fact that the static inline definition of tcp_inbound_md5_hash() has
      an unexpected semicolon. Remove it.
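      
      The broken shape, simplified (the stray semicolon turns the
      definition into a declaration followed by an orphan block, which
      fails to compile at file scope):
      
      static inline bool tcp_inbound_md5_hash_shim(void);  /* <-- stray ';' */
      {
              return false;
      }
      /* the fix: drop the ';' so the body attaches to the declaration */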
      
      Fixes: 1330b6ef ("skb: make drop reason booleanable")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220309122012.668986-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 09 March 2022, 2 commits
    • skb: make drop reason booleanable · 1330b6ef
      Committed by Jakub Kicinski
      We have a number of cases where a function returns a drop/no-drop
      decision as a boolean. Now that we want to report the reason code
      as well, we have to pass extra output arguments.
      
      We can instead make the reason code itself evaluate correctly as a bool.
      
      I believe we are free to reorder the reasons, as they are reported
      to user space as strings.
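      
      A sketch of the resulting shape (simplified):
      
      /* "Not dropped" becomes the zero value, so any real reason code is
       * nonzero and the enum itself tests correctly as a boolean. */
      enum skb_drop_reason {
              SKB_NOT_DROPPED_YET = 0,       /* evaluates as false */
              SKB_DROP_REASON_NOT_SPECIFIED, /* every real reason is nonzero */
              /* ... */
      };
      
      /* so a caller needs no extra output argument:
       *      reason = tcp_inbound_md5_hash(sk, skb, ...);
       *      if (reason)     // nonzero: drop, and we already know why
       *              goto discard;
       */
      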
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: avoid early deletion of host FDB entries · 7e580490
      Committed by Vladimir Oltean
      The Felix driver declares FDB isolation but puts all standalone ports in
      VID 0. This is mostly problem-free as discussed with Alvin here:
      https://patchwork.kernel.org/project/netdevbpf/cover/20220302191417.1288145-1-vladimir.oltean@nxp.com/#24763870
      
      however there is one catch. DSA still thinks that FDB entries are
      installed on the CPU port as many times as there are user ports, and
      this is problematic when multiple user ports share the same MAC address.
      
      Consider the default case where all user ports inherit their MAC address
      from the DSA master, and then the user runs:
      
      ip link set swp0 address 00:01:02:03:04:05
      
      The above will make dsa_slave_set_mac_address() call
      dsa_port_standalone_host_fdb_add() for 00:01:02:03:04:05 in port 0's
      standalone database, and dsa_port_standalone_host_fdb_del() for the old
      address of swp0, again in swp0's standalone database.
      
      Both the ->port_fdb_add() and ->port_fdb_del() will be propagated down
      to the felix driver, which will end up deleting the old MAC address from
      the CPU port. But this is still in use by other user ports, so we end up
      breaking unicast termination for them.
      
      There isn't a problem in the fact that DSA keeps track of host
      standalone addresses in the individual database of each user port: some
      drivers like sja1105 need this. There also isn't a problem in the fact
      that some drivers choose the same VID/FID for all standalone ports.
      It is just that the deletion of these host addresses must be delayed
      until they are known to not be in use any longer, and only the driver
      has this knowledge. Since DSA keeps these addresses in &cpu_dp->fdbs and
      &cpu_dp->mdbs, it is just a matter of walking over those lists and seeing
      whether the same MAC address is present on the CPU port in the port db
      of another user port.
      
      I have considered reusing the generic dsa_port_walk_fdbs() and
      dsa_port_walk_mdbs() schemes for this, but locking makes it difficult.
      In the ->port_fdb_add() method and co, &dp->addr_lists_lock is held, but
      dsa_port_walk_fdbs() also acquires that lock. Also, even assuming that
      we introduce an unlocked variant of the address iterator, we'd still
      need some relatively complex data structures, and a void *ctx in the
      dsa_fdb_walk_cb_t which we don't currently pass, such that drivers are
      able to figure out, after iterating, whether the same MAC address is or
      isn't present in the port db of another port.
      
      All the above, plus the fact that I expect other drivers to follow the
      same model as felix where all standalone ports use the same FID, made me
      conclude that a generic method provided by DSA is necessary:
      dsa_fdb_present_in_other_db() and the mdb equivalent. Felix calls this
      from the ->port_fdb_del() handler for the CPU port, when the database
      was classified to either a port db, or a LAG db.
      
      For symmetry, we also call this from ->port_fdb_add(), because if the
      address was installed once, then installing it a second time serves no
      purpose: it's already in hardware in VID 0 and it affects all standalone
      ports.
      
      This change moves dsa_db_equal() from switch.c to dsa.c, since it now
      has one more caller.
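      
      A sketch of the check the new helper performs (simplified; the real
      function also covers the MDB and LAG database cases):
      
      static bool fdb_present_in_other_db_sketch(struct dsa_port *cpu_dp,
                                                 const unsigned char *addr,
                                                 u16 vid, struct dsa_db db)
      {
              struct dsa_mac_addr *a;
      
              /* same MAC/VID referenced by a database other than @db? */
              list_for_each_entry(a, &cpu_dp->fdbs, list)
                      if (ether_addr_equal(a->addr, addr) && a->vid == vid &&
                          !dsa_db_equal(&a->db, &db))
                              return true;
      
              return false;
      }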
      
      Fixes: 54c31984 ("net: mscc: ocelot: enforce FDB isolation when VLAN-unaware")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 05 March 2022, 1 commit
  4. 04 March 2022, 3 commits
  5. 03 March 2022, 6 commits
  6. 02 March 2022, 1 commit
    • net/sched: act_ct: Fix flow table lookup failure with no originating ifindex · db6140e5
      Committed by Paul Blakey
      After the cited commit optimized hw insertion, flow table entries are
      populated with ifindex information which was intended to be used only
      for HW offload. This tuple ifindex is hashed into the flow table key,
      so it must be filled for a lookup to succeed. But the tuple ifindex is
      only relevant for the netfilter flowtables (nft), so it isn't filled
      by the act_ct flow table lookup, resulting in lookup failure, no SW
      offload, and no offload teardown for TCP connection FIN/RST packets.
      
      To fix this, add a new tc ifindex field to the tuple, which will
      only be used for offloading, not for lookup, as it will not be
      part of the tuple hash.
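      
      The layout idea, roughly (field names approximate):
      
      struct flow_offload_tuple_sketch {
              /* ... addresses, ports and the nft iifidx: everything here
               * is part of the flow table hash key ... */
      
              /* below the hashed region: carried along for offloading,
               * never consulted by lookups */
              struct {
                      u32 ifidx; /* filled by act_ct for HW offload only */
              } tc;
      };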
      
      Fixes: 9795ded7 ("net/sched: act_ct: Fill offloading tuple iifidx")
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  7. 01 March 2022, 5 commits
    • net/smc: add sysctl for autocorking · 12bbb0d1
      Committed by Dust Li
      This adds a new sysctl: net.smc.autocorking_size
      
      We can dynamically change the behaviour of autocorking by changing
      the value of autocorking_size. Setting it to 0 disables autocorking
      in SMC.
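      
      A rough sketch of how such a knob typically gates corking
      (illustrative only, not the actual SMC code):
      
      static bool smc_should_autocork_sketch(unsigned int queued_bytes,
                                             unsigned int autocorking_size)
      {
              if (!autocorking_size)  /* sysctl set to 0: never cork */
                      return false;
              /* keep corking while queued data stays below the knob */
              return queued_bytes < autocorking_size;
      }
      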
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: add sysctl interface for SMC · 462791bb
      Committed by Dust Li
      This patch adds a sysctl interface to support container environments
      for SMC, as discussed on the mailing list.
      
      Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com
      Co-developed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nf_queue: fix possible use-after-free · c3873070
      Committed by Florian Westphal
      Eric Dumazet says:
        The sock_hold() side seems suspect, because there is no guarantee
        that sk_refcnt is not already 0.
      
      On failure, we cannot queue the packet and need to indicate an
      error.  The packet will be dropped by the caller.
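      
      The safe pattern, sketched (the exact error code and context differ
      in the real patch):
      
      /* take the reference conditionally instead of sock_hold(), which
       * assumes sk_refcnt is already nonzero: */
      if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
              return -ENOENT; /* cannot queue; caller drops the packet */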
      
      v2: split skb prefetch hunk into separate change
      
      Fixes: 271b72c7 ("udp: RCU handling for Unicast packets.")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • drivers: vxlan: vnifilter: per vni stats · 4095e0e1
      Committed by Nikolay Aleksandrov
      Add per-vni statistics for vni filter mode, counting Rx/Tx
      bytes/packets/drops/errors at the appropriate places.
      
      This patch changes vxlan_vs_find_vni to also return the
      vxlan_vni_node in cases where the vni belongs to a vni-filtering
      vxlan device.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: vni filtering support on collect metadata device · f9c4bb0b
      Committed by Roopa Prabhu
      This patch adds vnifiltering support to collect metadata device.
      
      Motivation:
      You can only use a single vxlan collect metadata device for a given
      vxlan udp port in the system today. The vxlan collect metadata device
      terminates all received vxlan packets. As shown in the below diagram,
      there are use-cases where you need to support multiple such vxlan devices in
      independent bridge domains. Each vxlan device must terminate the vnis
      it is configured for.
      Example use case: in a service provider network, the provider
      typically supports multiple bridge domains with overlapping vlans,
      one bridge domain per customer. Vlans in each bridge domain are
      mapped to globally unique vxlan ranges assigned to each customer.
      
      vnifiltering support in collect metadata devices terminates only
      configured vnis. This is similar to vlan filtering in the bridge
      driver. The vni filtering capability is provided by a new flag on
      the collect metadata device.
      
      In the below pic:
      	- customer1 is mapped to br1 bridge domain
      	- customer2 is mapped to br2 bridge domain
      	- customer1 vlan 10-11 is mapped to vni 1001-1002
      	- customer2 vlan 10-11 is mapped to vni 2001-2002
      	- br1 and br2 are vlan filtering bridges
      	- vxlan1 and vxlan2 are collect metadata devices with
      	  vnifiltering enabled
      
      ┌──────────────────────────────────────────────────────────────────┐
      │  switch                                                          │
      │                                                                  │
      │         ┌───────────┐                 ┌───────────┐              │
      │         │           │                 │           │              │
      │         │   br1     │                 │   br2     │              │
      │         └┬─────────┬┘                 └──┬───────┬┘              │
      │     vlans│         │               vlans │       │               │
      │     10,11│         │                10,11│       │               │
      │          │     vlanvnimap:               │    vlanvnimap:        │
      │          │       10-1001,11-1002         │      10-2001,11-2002  │
      │          │         │                     │       │               │
      │   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
      │   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
      │   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
      │   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
      │       │         └────────────┘           │      │  2001,2002   │ │
      │       │                                  │      └──────────────┘ │
      │       │                                  │                       │
      └───────┼──────────────────────────────────┼───────────────────────┘
              │                                  │
              │                                  │
        ┌─────┴───────┐                          │
        │  customer1  │                    ┌─────┴──────┐
        │ host/VM     │                    │customer2   │
        └─────────────┘                    │ host/VM    │
                                           └────────────┘
      
      With this implementation, a vxlan dst metadata device can
      be associated with a range of vnis.
      struct vxlan_vni_node is introduced to represent
      a configured vni. We start with the vni and its
      associated remote_ip in this structure. The
      structure can be extended to bring in other
      per-vni attributes if there are use cases for it.
      A vni inherits an attribute from the base vxlan device
      if no per-vni attribute is defined.
      
      struct vxlan_dev gets a new rhashtable for
      vnis called vxlan_vni_group. vxlan_vnifilter.c
      implements the necessary netlink api, notifications
      and helper functions to process and manage the lifecycle
      of vxlan_vni_node.
      
      This patch also adds new helper functions in vxlan_multicast.c
      to handle per vni remote_ip multicast groups which are part
      of vxlan_vni_group.
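      
      A simplified sketch of the new per-vni object:
      
      struct vxlan_vni_node_sketch {
              struct rhash_head vnode;    /* keyed by vni in vxlan_vni_group */
              __be32 vni;
              union vxlan_addr remote_ip; /* per-vni remote; otherwise the
                                           * base device's remote is used */
              /* extended later with more per-vni attributes as needed */
      };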
      
      Fix build problems:
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 28 February 2022, 2 commits
  9. 27 February 2022, 2 commits
    • net: dsa: pass extack to .port_bridge_join driver methods · 06b9cce4
      Committed by Vladimir Oltean
      As FDB isolation cannot be enforced between VLAN-aware bridges in the
      absence of hardware assistance like extra FID bits, it seems plausible
      that many DSA switches cannot do it. Therefore, they need to reject configurations
      with multiple VLAN-aware bridges from the two code paths that can
      transition towards that state:
      
      - joining a VLAN-aware bridge
      - toggling VLAN awareness on an existing bridge
      
      The .port_vlan_filtering method already propagates the netlink extack to
      the driver, let's propagate it from .port_bridge_join too, to make sure
      that the driver can use the same function for both.
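      
      Roughly, the two method signatures end up symmetric (a sketch;
      arguments not touched by this patch are kept as they were):
      
      int (*port_bridge_join)(struct dsa_switch *ds, int port,
                              struct dsa_bridge bridge,
                              bool *tx_fwd_offload,
                              struct netlink_ext_ack *extack);
      int (*port_vlan_filtering)(struct dsa_switch *ds, int port,
                                 bool vlan_filtering,
                                 struct netlink_ext_ack *extack);
      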
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: request drivers to perform FDB isolation · c2693363
      Committed by Vladimir Oltean
      For DSA, encouraging drivers to perform FDB isolation simply means
      tracking which bridge each FDB and MDB entry belongs to. It then
      becomes the driver responsibility to use something that makes the FDB
      entry from one bridge not match the FDB lookup of ports from other
      bridges.
      
      The top-level functions where the bridge is determined are:
      - dsa_port_fdb_{add,del}
      - dsa_port_host_fdb_{add,del}
      - dsa_port_mdb_{add,del}
      - dsa_port_host_mdb_{add,del}
      
      aka the pre-crosschip-notifier functions.
      
      Changing the API to pass a reference to a bridge is not superfluous, and
      looking at the passed bridge argument is not the same as having the
      driver look at dsa_to_port(ds, port)->bridge from the ->port_fdb_add()
      method.
      
      DSA installs FDB and MDB entries on shared (CPU and DSA) ports as well,
      and those do not have any dp->bridge information to retrieve, because
      they are not in any bridge - they are merely the pipes that serve the
      user ports that are in one or multiple bridges.
      
      The struct dsa_bridge associated with each FDB/MDB entry is encapsulated
      in a larger "struct dsa_db" database. Although only databases associated
      to bridges are notified for now, this API will be the starting point for
      implementing IFF_UNICAST_FLT in DSA. There, the idea is to install FDB
      entries on the CPU port which belong to the corresponding user port's
      port database. These are supposed to match only when the port is
      standalone.
      
      It is better to introduce the API in its expected final form than to
      introduce it for bridges first, then to have to change drivers which may
      have made one or more assumptions.
      
      Drivers can use the provided bridge.num, but they can also use a
      different numbering scheme that is more convenient.
      
      DSA must perform refcounting on the CPU and DSA ports by also taking
      into account the bridge number. So if two bridges request the same local
      address, DSA must notify the driver twice, once for each bridge.
      
      In fact, if the driver supports FDB isolation, DSA must perform
      refcounting per bridge, but if the driver doesn't, DSA must refcount
      host addresses across all bridges, otherwise it would be telling the
      driver to delete an FDB entry for a bridge and the driver would delete
      it for all bridges. So introduce a bool fdb_isolation in drivers which
      would make all bridge databases passed to the cross-chip notifier have
      the same number (0). This makes dsa_mac_addr_find() -> dsa_db_equal()
      say that all bridge databases are the same database - which is
      essentially the legacy behavior.
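      
      Roughly, the database handle handed to drivers (a sketch):
      
      enum dsa_db_type_sketch { DSA_DB_PORT, DSA_DB_LAG, DSA_DB_BRIDGE };
      
      struct dsa_db_sketch {
              enum dsa_db_type_sketch type;
              union {
                      const struct dsa_port *dp; /* standalone port's db */
                      struct dsa_lag lag;
                      struct dsa_bridge bridge;  /* bridge.num forced to 0
                                                  * when fdb_isolation is
                                                  * not declared */
              };
      };
      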
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 25 February 2022, 9 commits
    • net/tcp: Merge TCP-MD5 inbound callbacks · 7bbb765b
      Committed by Dmitry Safonov
      The functions do essentially the same work to verify the TCP-MD5
      signature. The code can be merged into one family-independent function
      in order to reduce copy-and-paste and generated code.
      Later, when the TCP-AO option is added, this will allow creating one
      function responsible for segment verification that holds all the
      different checks for MD5/AO/non-signed packets, which in turn will make
      it possible to see the checks for all corner cases in one place, rather
      than spread across different families and functions.
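      
      The merged entry point takes the family as an argument, roughly
      (a sketch of the shape, not the verbatim declaration):
      
      /* saddr/daddr point to struct in_addr or struct in6_addr,
       * depending on @family: */
      bool tcp_inbound_md5_hash(const struct sock *sk,
                                const struct sk_buff *skb,
                                const void *saddr, const void *daddr,
                                int family, int dif, int sdif);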
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: support FDB events on offloaded LAG interfaces · e212fa7c
      Committed by Vladimir Oltean
      This change introduces support for installing static FDB entries towards
      a bridge port that is a LAG of multiple DSA switch ports, as well as
      support for filtering towards the CPU local FDB entries emitted for LAG
      interfaces that are bridge ports.
      
      Conceptually, host addresses on LAG ports are identical to what we do
      for plain bridge ports. Whereas FDB entries _towards_ a LAG can't simply
      be replicated towards all member ports like we do for multicast or VLAN.
      Instead we need new driver API. Hardware usually considers a LAG to be a
      "logical port", and sets the entire LAG as the forwarding destination.
      The physical egress port selection within the LAG is made by hashing
      policy, as usual.
      
      To represent the logical port corresponding to the LAG, we pass by value
      a copy of the dsa_lag structure to all switches in the tree that have at
      least one port in that LAG.
      
      To illustrate why a refcounted list of FDB entries is needed in struct
      dsa_lag, it is enough to say that:
      - a LAG may be a bridge port and may therefore receive FDB events even
        while it isn't yet offloaded by any DSA interface
      - DSA interfaces may be removed from a LAG while that is a bridge port;
        we don't want FDB entries lingering around, but we don't want to
        remove entries that are still in use, either
      
      For all the cases below to work, the idea is to always keep an FDB entry
      on a LAG with a reference count equal to the DSA member ports. So:
      - if a port joins a LAG, it requests the bridge to replay the FDB, and
        the FDB entries get created, or their refcount gets bumped by one
      - if a port leaves a LAG, the FDB replay deletes or decrements refcount
        by one
      - if an FDB is installed towards a LAG with ports already present, that
        entry is created (if it doesn't exist) and its refcount is bumped by
        the amount of ports already present in the LAG
      
      echo "Adding FDB entry to bond with existing ports"
      ip link del bond0
      ip link add bond0 type bond mode 802.3ad
      ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up
      ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up
      ip link del br0
      ip link add br0 type bridge
      ip link set bond0 master br0
      bridge fdb add dev bond0 00:01:02:03:04:05 master static
      
      ip link del br0
      ip link del bond0
      
      echo "Adding FDB entry to empty bond"
      ip link del bond0
      ip link add bond0 type bond mode 802.3ad
      ip link del br0
      ip link add br0 type bridge
      ip link set bond0 master br0
      bridge fdb add dev bond0 00:01:02:03:04:05 master static
      ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up
      ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up
      
      ip link del br0
      ip link del bond0
      
      echo "Adding FDB entry to empty bond, then removing ports one by one"
      ip link del bond0
      ip link add bond0 type bond mode 802.3ad
      ip link del br0
      ip link add br0 type bridge
      ip link set bond0 master br0
      bridge fdb add dev bond0 00:01:02:03:04:05 master static
      ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up
      ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up
      
      ip link set swp1 nomaster
      ip link set swp2 nomaster
      ip link del br0
      ip link del bond0
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: switchdev: remove lag_mod_cb from switchdev_handle_fdb_event_to_device · ec638740
      Committed by Vladimir Oltean
      When the switchdev_handle_fdb_event_to_device() event replication helper
      was created, my original thought was that FDB events on LAG interfaces
      should most likely be special-cased, not just replicated towards all
      switchdev ports beneath that LAG. So this replication helper currently
      does not recurse through switchdev lower interfaces of LAG bridge ports,
      but rather calls the lag_mod_cb() if that was provided.
      
      No switchdev driver uses this helper for FDB events on LAG interfaces
      yet, so that was an assumption which was yet to be tested. It is
      certainly usable for that purpose, as my RFC series shows:
      
      https://patchwork.kernel.org/project/netdevbpf/cover/20220210125201.2859463-1-vladimir.oltean@nxp.com/
      
      however this approach is slightly convoluted because:
      
      - the switchdev driver gets a "dev" that isn't its own net device, but
        rather the LAG net device. It must call switchdev_lower_dev_find(dev)
        in order to get a handle of any of its own net devices (the ones that
        pass check_cb).
      
      - in order for FDB entries on LAG ports to be correctly refcounted per
        the number of switchdev ports beneath that LAG, we haven't escaped the
        need to iterate through the LAG's lower interfaces. Except that is now
        the responsibility of the switchdev driver, because the replication
        helper just stopped half-way.
      
      So, even though yes, FDB events on LAG bridge ports must be
      special-cased, in the end it's simpler to let switchdev_handle_fdb_*
      just iterate through the LAG port's switchdev lowers, and let the
      switchdev driver figure out that those physical ports are under a LAG.
      
      The switchdev_handle_fdb_event_to_device() helper takes a
      "foreign_dev_check" callback so it can figure out whether @dev can
      autonomously forward to @foreign_dev. DSA fills this method properly:
      if the LAG is offloaded by another port in the same tree as @dev, then
      it isn't foreign. If it is a software LAG, it is foreign - forwarding
      happens in software.
      
      Whether an interface is foreign or not decides whether the replication
      helper will go through the LAG's switchdev lowers or not. Since
      lan966x doesn't properly fill this out, FDB events on software LAG
      uppers would now get delivered to it. By changing
      lan966x_foreign_dev_check(), we can suppress them.
      
      Whereas DSA will now start receiving FDB events for its offloaded LAG
      uppers, so we need to return -EOPNOTSUPP, since we currently don't do
      the right thing for them.
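      
      A sketch of a DSA callback implementing the rule above
      (dsa_tree_offloads_lag() is an illustrative helper name):
      
      static bool dsa_foreign_dev_check_sketch(const struct net_device *dev,
                                               const struct net_device *foreign)
      {
              struct dsa_port *dp = dsa_slave_to_port(dev);
      
              if (netif_is_lag_master(foreign))
                      /* offloaded by a port of the same tree: not foreign */
                      return !dsa_tree_offloads_lag(dp->ds->dst, foreign);
      
              return true; /* software LAGs and the rest: foreign */
      }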
      
      Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: create a dsa_lag structure · dedd6a00
      Committed by Vladimir Oltean
      The main purpose of this change is to create a data structure for a LAG
      as seen by DSA. This is similar to what we have for bridging - we pass a
      copy of this structure by value to ->port_lag_join and ->port_lag_leave.
      For now we keep the lag_dev, id and a reference count in it. Future
      patches will add a list of FDB entries for the LAG (these also need to
      be refcounted to work properly).
      
      The LAG structure is created using dsa_port_lag_create() and destroyed
      using dsa_port_lag_destroy(), just like we have for bridging.
      
      Because now, the dsa_lag itself is refcounted, we can simplify
      dsa_lag_map() and dsa_lag_unmap(). These functions need to keep a LAG in
      the dst->lags array only as long as at least one port uses it. The
      refcounting logic inside those functions can be removed now - they are
      called only when we should perform the operation.
      
      dsa_lag_dev() is renamed to dsa_lag_by_id() and now returns the dsa_lag
      structure instead of the lag_dev net_device.
      
      dsa_lag_foreach_port() now takes the dsa_lag structure as argument.
      
      dst->lags holds an array of dsa_lag structures.
      
      dsa_lag_map() now also saves the dsa_lag->id value, so that linear
      walking of dst->lags in drivers using dsa_lag_id() is no longer
      necessary. They can just look at lag.id.
      
      dsa_port_lag_id_get() is a helper, similar to dsa_port_bridge_num_get(),
      which can be used by drivers to get the LAG ID assigned by DSA to a
      given port.
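      
      The structure, roughly as described (a sketch):
      
      struct dsa_lag_sketch {
              struct net_device *dev; /* the bonding/team interface */
              unsigned int id;        /* one-based ID assigned by DSA */
              refcount_t refcount;    /* tree ports currently in the LAG */
              /* a refcounted list of FDB entries is added by a later patch */
      };
      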
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: make LAG IDs one-based · 3d4a0a2a
      Committed by Vladimir Oltean
      The DSA LAG API will be changed to become more similar with the bridge
      data structures, where struct dsa_bridge holds an unsigned int num,
      which is generated by DSA and is one-based. We have a similar thing
      going with the DSA LAG, except that it isn't stored anywhere; it is
      calculated dynamically by dsa_lag_id(), which iterates through dst->lags.
      
      The idea of encoding an invalid (or not requested) LAG ID as zero for
      the purpose of simplifying checks in drivers means that the LAG IDs
      passed by DSA to drivers need to be one-based too. So back-and-forth
      conversion is needed when indexing the dst->lags array, as well as in
      drivers which assume a zero-based index.
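      
      The conversion stays confined to the accessors, e.g. (a sketch):
      
      /* IDs given to drivers are one-based; dst->lags[] stays zero-based,
       * and 0 means "invalid / not requested": */
      static struct net_device *dsa_lag_by_id_sketch(struct dsa_switch_tree *dst,
                                                     unsigned int id)
      {
              return id ? dst->lags[id - 1] : NULL;
      }
      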
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: rename references to "lag" as "lag_dev" · 46a76724
      Committed by Vladimir Oltean
      In preparation for converting struct net_device *dp->lag_dev into a
      struct dsa_lag *dp->lag, we need to rename, for consistency purposes,
      all occurrences of the "lag" variable in the DSA core to "lag_dev".
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Bluetooth: hci_sync: Fix not using conn_timeout · a56a1138
      Committed by Luiz Augusto von Dentz
      When using hci_le_create_conn_sync, it shall wait for conn_timeout,
      since the connection complete event may take longer than just 2 seconds.
      
      Also fix the masking of HCI_EV_LE_ENHANCED_CONN_COMPLETE and
      HCI_EV_LE_CONN_COMPLETE so they are never both set, so we can predict
      which one the controller will use in case of HCI_OP_LE_CREATE_CONN.
      
      Fixes: 6cd29ec6 ("Bluetooth: hci_sync: Wait for proper events when connecting LE")
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • Bluetooth: Fix bt_skb_sendmmsg not allocating partial chunks · 29fb6083
      Committed by Luiz Augusto von Dentz
      Since bt_skb_sendmmsg can be used with the likes of SOCK_STREAM, it
      shall return the partial chunks it could allocate instead of freeing
      everything, as otherwise it can cause problems like the one below.
      
      Fixes: 81be03e0 ("Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg")
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Link: https://lore.kernel.org/r/d7206e12-1b99-c3be-84f4-df22af427ef5@molgen.mpg.de
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215594
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> (Nokia N9 (MeeGo/Harmattan))
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • openvswitch: Fix setting ipv6 fields causing hw csum failure · d9b5ae5c
      Committed by Paul Blakey
      Ipv6 ttl, label and tos fields are modified without first
      pulling/pushing the ipv6 header, which would have updated the hw
      csum (if available). This can make csum validation fail when the
      packet is later sent up the stack, as can be seen in the trace below.
      
      Fix this by updating skb->csum if available.
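      
      The idea of the fix, sketched (not the exact patch): when the device
      reported CHECKSUM_COMPLETE, fold the old and new values of the
      rewritten word into skb->csum before storing it:
      
      static void rewrite_word_update_csum(struct sk_buff *skb,
                                           __be32 *field, __be32 new_val)
      {
              if (skb->ip_summed == CHECKSUM_COMPLETE)
                      skb->csum = csum_add(csum_sub(skb->csum,
                                                    (__force __wsum)*field),
                                           (__force __wsum)new_val);
              *field = new_val;
      }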
      
      Trace resulting from an ipv6 ttl decrement followed by sending the
      packet to conntrack [actions: set(ipv6(hlimit=63)),ct(zone=99)]:
      [295241.900063] s_pf0vf2: hw csum failure
      [295241.923191] Call Trace:
      [295241.925728]  <IRQ>
      [295241.927836]  dump_stack+0x5c/0x80
      [295241.931240]  __skb_checksum_complete+0xac/0xc0
      [295241.935778]  nf_conntrack_tcp_packet+0x398/0xba0 [nf_conntrack]
      [295241.953030]  nf_conntrack_in+0x498/0x5e0 [nf_conntrack]
      [295241.958344]  __ovs_ct_lookup+0xac/0x860 [openvswitch]
      [295241.968532]  ovs_ct_execute+0x4a7/0x7c0 [openvswitch]
      [295241.979167]  do_execute_actions+0x54a/0xaa0 [openvswitch]
      [295242.001482]  ovs_execute_actions+0x48/0x100 [openvswitch]
      [295242.006966]  ovs_dp_process_packet+0x96/0x1d0 [openvswitch]
      [295242.012626]  ovs_vport_receive+0x6c/0xc0 [openvswitch]
      [295242.028763]  netdev_frame_hook+0xc0/0x180 [openvswitch]
      [295242.034074]  __netif_receive_skb_core+0x2ca/0xcb0
      [295242.047498]  netif_receive_skb_internal+0x3e/0xc0
      [295242.052291]  napi_gro_receive+0xba/0xe0
      [295242.056231]  mlx5e_handle_rx_cqe_mpwrq_rep+0x12b/0x250 [mlx5_core]
      [295242.062513]  mlx5e_poll_rx_cq+0xa0f/0xa30 [mlx5_core]
      [295242.067669]  mlx5e_napi_poll+0xe1/0x6b0 [mlx5_core]
      [295242.077958]  net_rx_action+0x149/0x3b0
      [295242.086762]  __do_softirq+0xd7/0x2d6
      [295242.090427]  irq_exit+0xf7/0x100
      [295242.093748]  do_IRQ+0x7f/0xd0
      [295242.096806]  common_interrupt+0xf/0xf
      [295242.100559]  </IRQ>
      [295242.102750] RIP: 0033:0x7f9022e88cbd
      [295242.125246] RSP: 002b:00007f9022282b20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda
      [295242.132900] RAX: 0000000000000005 RBX: 0000000000000010 RCX: 0000000000000000
      [295242.140120] RDX: 00007f9022282ba8 RSI: 00007f9022282a30 RDI: 00007f9014005c30
      [295242.147337] RBP: 00007f9014014d60 R08: 0000000000000020 R09: 00007f90254a8340
      [295242.154557] R10: 00007f9022282a28 R11: 0000000000000246 R12: 0000000000000000
      [295242.161775] R13: 00007f902308c000 R14: 000000000000002b R15: 00007f9022b71f40
      
      Fixes: 3fdbd1ce ("openvswitch: add ipv6 'set' action")
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Link: https://lore.kernel.org/r/20220223163416.24096-1-paulb@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  11. 21 February 2022, 5 commits
  12. 20 February 2022, 1 commit