1. 29 Jun, 2021 7 commits
    • Merge branch 'reset-mac' · 8eb517a2
      David S. Miller committed
      Guillaume Nault says:
      
      ====================
      net: reset MAC header consistently across L3 virtual devices
      
      Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
      reset the MAC header pointer after they parsed the outer headers. This
      accurately reflects the fact that the decapsulated packet is a pure L3
      packet, as that makes the MAC header 0 bytes long (the MAC and network
      header pointers are equal).
      
      However, many L3 devices only adjust the network header after
      decapsulation and leave the MAC header pointer to its original value.
      This can confuse other parts of the networking stack, like TC, which
      then considers the outer headers as one big MAC header.
      
      This patch series makes the following L3 tunnels behave like VXLAN-GPE:
      bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.
      
      The case of gre is a bit special. It already resets the MAC header
      pointer in collect_md mode, so only the classical mode needs to be
      adjusted. However, gre also has a special case that expects the MAC
      header pointer to keep pointing to the outer header even after
      decapsulation. Therefore, patch 4 keeps an exception for this case.
      
      Ideally, we'd centralise the call to skb_reset_mac_header() in
      ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
      sit (patch 3) and gre (patch 4). That's unfortunately not feasible
      currently, because of the gre special case discussed above that
      precludes us from resetting the MAC header unconditionally.
      
      The original motivation is to redirect bareudp packets to Ethernet
      devices (as described in patch 1). The rest of this series aims at
      bringing consistency across all L3 devices (apart from gre's special
      case unfortunately).
      
      Note: the gtp patch results from pure code inspection and has been
      compile tested only.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
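The cover letter's central point, that resetting the pointer makes the MAC header zero bytes long, can be illustrated with a toy model. This is a minimal Python sketch, not kernel code: the `Skb` class, its fields, and the 42-byte outer header layout are assumptions made for illustration, and only the names `skb_reset_mac_header` and `skb_mac_header_len` mirror kernel helpers.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    """Toy stand-in for struct sk_buff: header positions are byte offsets."""
    data: int            # offset of the current packet start
    mac_header: int      # offset where the MAC header is assumed to begin
    network_header: int  # offset where the network (L3) header begins


def skb_mac_header_len(skb: Skb) -> int:
    # Bytes between the MAC header pointer and the network header pointer.
    return skb.network_header - skb.mac_header


def skb_reset_mac_header(skb: Skb) -> None:
    # Point the MAC header at the current data offset.
    skb.mac_header = skb.data


# Assumed layout: 14-byte outer Ethernet + 20-byte outer IPv4 + 8-byte UDP.
OUTER_LEN = 14 + 20 + 8

# After decapsulation, data and network_header point at the inner L3 packet,
# but mac_header still points at the outer Ethernet header.
skb = Skb(data=OUTER_LEN, mac_header=0, network_header=OUTER_LEN)
stale_len = skb_mac_header_len(skb)  # outer headers look like one big MAC header

skb_reset_mac_header(skb)
reset_len = skb_mac_header_len(skb)  # MAC and network pointers are now equal
```

In this model the stale pointer makes the outer headers look like a 42-byte MAC header, which is the confusion the series describes for TC; after the reset the length is 0.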
    • gtp: reset mac_header after decap · b2d898c8
      Guillaume Nault committed
      For consistency with other L3 tunnel devices, reset the mac_header
      pointer after decapsulation. This makes the mac_header 0 bytes long,
      thus making it clear that this skb has no mac_header.
      
      Compile tested only.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_tunnel: allow redirecting ip6gre and ipxip6 packets to eth devices · da5a2e49
      Guillaume Nault committed
      Reset the mac_header pointer even when the tunnel transports only L3
      data (in the ARPHRD_ETHER case, this is already done by eth_type_trans).
      This prevents other parts of the stack from mistakenly accessing the
      outer header after the packet has been decapsulated.
      
      In practice, this allows to push an Ethernet header to ipip6, ip6ip6,
      mplsip6 or ip6gre packets and redirect them to an Ethernet device:
      
        $ tc filter add dev ip6tnl0 ingress matchall       \
            action vlan push_eth dst_mac 00:00:5e:00:53:01 \
                                 src_mac 00:00:5e:00:53:00 \
            action mirred egress redirect dev eth0
      
      Without this patch, push_eth refuses to add an ethernet header because
      the skb appears to already have a MAC header.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: let mac_header point to outer header only when necessary · aab1e898
      Guillaume Nault committed
      Commit e271c7b4 ("gre: do not keep the GRE header around in collect
      medata mode") did reset the mac_header for the collect_md case. Let's
      extend this behaviour to classical gre devices as well.
      
      ipgre_header_parse() seems to be the only case that requires mac_header
      to point to the outer header. We can detect this case accurately by
      checking ->header_ops. For all other cases, we can reset mac_header.
      
      This allows to push an Ethernet header to ipgre packets and redirect
      them to an Ethernet device:
      
        $ tc filter add dev gre0 ingress matchall          \
            action vlan push_eth dst_mac 00:00:5e:00:53:01 \
                                 src_mac 00:00:5e:00:53:00 \
            action mirred egress redirect dev eth0
      
      Before this patch, this worked only for collect_md gre devices.
      Now this works for regular gre devices as well. Only the special case
      of gre devices that use ipgre_header_ops isn't supported.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
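The gre exception described above amounts to a single decision on the device's `header_ops`. The following is a hedged Python sketch of that decision only; `IPGRE_HEADER_OPS`, `mac_header_after_decap` and the offsets are hypothetical names and values, and the real patch operates on `struct sk_buff` in C.

```python
IPGRE_HEADER_OPS = object()  # stand-in for the kernel's ipgre_header_ops


def mac_header_after_decap(header_ops, outer_offset, inner_offset):
    """Return the offset the MAC header pointer should hold after decapsulation."""
    if header_ops is IPGRE_HEADER_OPS:
        # Special case: ipgre_header_parse() expects the outer header.
        return outer_offset
    # Every other gre device: zero-length MAC header at the inner packet.
    return inner_offset


# Assumed offsets: the outer IPv4 header starts at 14, the inner packet at 38
# (14-byte Ethernet + 20-byte IPv4 + 4-byte GRE).
special = mac_header_after_decap(IPGRE_HEADER_OPS, 14, 38)
regular = mac_header_after_decap(None, 14, 38)
```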
    • sit: allow redirecting ip6ip, ipip and mplsip packets to eth devices · 730eed27
      Guillaume Nault committed
      Even though sit transports L3 data (IPv6, IPv4 or MPLS) packets, it
      needs to reset the mac_header pointer, so that other parts of the stack
      don't mistakenly access the outer header after the packet has been
      decapsulated. There are two rx handlers to modify: ipip6_rcv() for the
      ip6ip mode and sit_tunnel_rcv() which is used to re-implement the ipip
      and mplsip modes of ipip.ko.
      
      This allows to push an Ethernet header to sit packets and redirect
      them to an Ethernet device:
      
        $ tc filter add dev sit0 ingress matchall          \
            action vlan push_eth dst_mac 00:00:5e:00:53:01 \
                                 src_mac 00:00:5e:00:53:00 \
            action mirred egress redirect dev eth0
      
      Without this patch, push_eth refuses to add an ethernet header because
      the skb appears to already have a MAC header.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipip: allow redirecting ipip and mplsip packets to eth devices · 7ad136fd
      Guillaume Nault committed
      Even though ipip transports IPv4 or MPLS packets, it needs to reset the
      mac_header pointer, so that other parts of the stack don't mistakenly
      access the outer header after the packet has been decapsulated.
      
      This allows to push an Ethernet header to ipip or mplsip packets and
      redirect them to an Ethernet device:
      
        $ tc filter add dev ipip0 ingress matchall         \
            action vlan push_eth dst_mac 00:00:5e:00:53:01 \
                                 src_mac 00:00:5e:00:53:00 \
            action mirred egress redirect dev eth0
      
      Without this patch, push_eth refuses to add an ethernet header because
      the skb appears to already have a MAC header.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bareudp: allow redirecting bareudp packets to eth devices · 99c8719b
      Guillaume Nault committed
      Even though bareudp transports L3 data (typically IP or MPLS), it needs
      to reset the mac_header pointer, so that other parts of the stack don't
      mistakenly access the outer header after the packet has been
      decapsulated.
      
      This allows to push an Ethernet header to bareudp packets and redirect
      them to an Ethernet device:
      
        $ tc filter add dev bareudp0 ingress matchall      \
            action vlan push_eth dst_mac 00:00:5e:00:53:01 \
                                 src_mac 00:00:5e:00:53:00 \
            action mirred egress redirect dev eth0
      
      Without this patch, push_eth refuses to add an ethernet header because
      the skb appears to already have a MAC header.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 26 Jun, 2021 9 commits
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · ff8744b5
      David S. Miller committed
      Tony Nguyen says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2021-06-25
      
      This series contains updates to ice driver only.
      
      Jesse adds support for tracepoints to aid in debugging.
      
      Maciej adds support for PTP auxiliary pins.
      
      Victor removes the VSI info from the old aggregator when moving the VSI
      to another aggregator.
      
      Tony removes an unnecessary VSI assignment.
      
      Christophe Jaillet fixes a memory leak for failed allocation in
      ice_pf_dcb_cfg().
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Ensure correct state of the socket in send path · 17081633
      Guvenc Gulce committed
      When smc_sendmsg() is called before the SMC socket initialization has
      completed, smc_tx_sendmsg() will access un-initialized fields of the
      SMC socket which results in a null-pointer dereference.
      Fix this by checking the socket state first in smc_tx_sendmsg().
      
      Fixes: e0e4b8fa ("net/smc: Add SMC statistics support")
      Reported-by: syzbot+5dda108b672b54141857@syzkaller.appspotmail.com
      Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'wireless-drivers-next-2021-06-25' of... · 4e3db44a
      David S. Miller committed
      Merge tag 'wireless-drivers-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Kalle Valo says:
      
      ====================
      wireless-drivers-next patches for v5.14
      
      Second, and most likely the last, set of patches for v5.14. mt76 and
      iwlwifi have most patches in this round, but rtw88 also has some new
      features. Nothing special really standing out.
      
      mt76
      
      * mt7915 MSI support
      
      * disable ASPM on mt7915
      
      * mt7915 tx status reporting
      
      * mt7921 decap offload
      
      rtw88
      
      * beacon filter support
      
      * path diversity support
      
      * firmware crash information via devcoredump
      
      * quirks for disabling pci capabilities
      
      mt7601u
      
      * add USB ID for a XiaoDu WiFi Dongle
      
      ath11k
      
      * enable support for QCN9074 PCI devices
      
      brcmfmac
      
      * support parsing the country code map from DeviceTree
      
      iwlwifi
      
      * support for new hardware
      
      * support for BIOS control of 11ax enablement in Russia
      
      * support UNII4 band enablement from BIOS
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdiobus: withdraw fwnode_mdbiobus_register · ac53c264
      Marcin Wojtas committed
      The newly implemented fwnode_mdbiobus_register turned out to be
      problematic: when fwnode_/of_/acpi_mdio are built as modules, a
      dependency cycle can be observed during the depmod phase of
      modules_install, e.g.:
      
      depmod: ERROR: Cycle detected: fwnode_mdio -> of_mdio -> fwnode_mdio
      depmod: ERROR: Found 2 modules in dependency cycles!
      
      OR:
      
      depmod: ERROR: Cycle detected: acpi_mdio -> fwnode_mdio -> acpi_mdio
      depmod: ERROR: Found 2 modules in dependency cycles!
      
      A possible solution could be to rework fwnode_mdiobus_register
      to merge the contents of acpi_mdiobus_register and
      of_mdiobus_register. Although feasible, such a change would
      be very intrusive and affect a huge number of of_mdiobus_register
      users.
      
      Since there are currently 2 users of ACPI and MDIO
      (xgmac_mdio and mvmdio), withdraw the fwnode_mdbiobus_register
      and roll back to a simple 'if' condition in affected drivers.
      
      Fixes: 62a6ef6a ("net: mdiobus: Introduce fwnode_mdbiobus_register()")
      Signed-off-by: Marcin Wojtas <mw@semihalf.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ice: Fix a memory leak in an error handling path in 'ice_pf_dcb_cfg()' · b81c191c
      Christophe JAILLET committed
      If this 'kzalloc()' fails we must free some resources as in all the other
      error handling paths of this function.
      
      Fixes: 348048e7 ("ice: Implement iidc operations")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: remove unnecessary VSI assignment · 70fa0a07
      Tony Nguyen committed
      ice_get_vf_vsi() is being called twice for the same VSI. Remove the
      unnecessary call/assignment.
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
    • ice: remove the VSI info from previous agg · 37c59206
      Victor Raj committed
      Remove the VSI info from the previous aggregator after moving the VSI to a
      new aggregator.
      Signed-off-by: Victor Raj <victor.raj@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: add support for auxiliary input/output pins · 172db5f9
      Maciej Machnikowski committed
      The E810 device supports programmable pins for enabling both input and
      output events related to the PTP hardware clock. This includes both
      output signals with programmable period, as well as timestamping of
      events on input pins.
      
      Add support for enabling these using the CONFIG_PTP_1588_CLOCK
      interface.
      
      This allows programming the software defined pins to take advantage of
      the hardware clock features.
      Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Add Mellanox BlueField Gigabit Ethernet driver · f92e1869
      David Thompson committed
      This patch adds build and driver logic for the "mlxbf_gige"
      Ethernet driver from Mellanox Technologies. The second
      generation BlueField SoC from Mellanox supports an
      out-of-band Gigabit Ethernet management port to the Arm
      subsystem.  This driver supports TCP/IP network connectivity
      for that port, and provides back-end routines to handle
      basic ethtool requests.
      
      The driver interfaces to the Gigabit Ethernet block of
      BlueField SoC via MMIO accesses to registers, which contain
      control information or pointers describing transmit and
      receive resources.  There is a single transmit queue, and
      the port supports transmit ring sizes of 4 to 256 entries.
      There is a single receive queue, and the port supports
      receive ring sizes of 32 to 32K entries. The transmit and
      receive rings are allocated from DMA coherent memory. There
      is a 16-bit producer and consumer index per ring to denote
      software ownership and hardware ownership, respectively.
      
      The main driver logic such as probe(), remove(), and netdev
      ops are in "mlxbf_gige_main.c".  Logic in "mlxbf_gige_rx.c"
      and "mlxbf_gige_tx.c" handles the packet processing for
      receive and transmit respectively.
      
      The logic in "mlxbf_gige_ethtool.c" supports the handling
      of some basic ethtool requests: get driver info, get ring
      parameters, get registers, and get statistics.
      
      The logic in "mlxbf_gige_mdio.c" is the driver controlling
      the Mellanox BlueField hardware that interacts with a PHY
      device via MDIO/MDC pins.  This driver does the following:
        - At driver probe time, it configures several BlueField MDIO
          parameters such as sample rate, full drive, voltage and MDC
        - It defines functions to read and write MDIO registers and
          registers the MDIO bus.
        - It defines the phy interrupt handler reporting a
          link up/down status change
        - This driver's probe is invoked from the main driver logic
          while the phy interrupt handler is registered in ndo_open.
      
      Driver limitations
        - Only supports 1Gbps speed
        - Only supports GMII protocol
        - Supports maximum packet size of 2KB
        - Does not support scatter-gather buffering
      
      Testing
        - Successful build of kernel for ARM64, ARM32, X86_64
        - Tested ARM64 build on FastModels & Palladium
        - Tested ARM64 build on several Mellanox boards that are built with
          the BlueField-2 SoC.  The testing includes coverage in the areas
          of networking (e.g. ping, iperf, ifconfig, route), file transfers
          (e.g. SCP), and various ethtool options relevant to this driver.
      Signed-off-by: David Thompson <davthompson@nvidia.com>
      Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
      Reviewed-by: Liming Sun <limings@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
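The mlxbf_gige commit message mentions a 16-bit producer index and a 16-bit consumer index per ring. One common way such free-running indices are used is sketched below in Python. This is an illustration under stated assumptions, not the driver's logic: the `Ring` class, the power-of-two size requirement, and the slot mapping are all hypothetical.

```python
INDEX_MASK = 0xFFFF  # indices are 16 bits wide, per the commit message


class Ring:
    def __init__(self, entries: int):
        # Assumption: power-of-two size so a wrapped index still maps to a slot.
        assert entries > 0 and entries & (entries - 1) == 0
        self.entries = entries
        self.producer = 0  # advanced by software when it hands over an entry
        self.consumer = 0  # advanced when hardware has finished with an entry

    def in_flight(self) -> int:
        # Modular subtraction stays correct across the 16-bit wraparound.
        return (self.producer - self.consumer) & INDEX_MASK

    def is_full(self) -> bool:
        return self.in_flight() == self.entries

    def produce(self) -> int:
        if self.is_full():
            raise BufferError("ring full")
        slot = self.producer % self.entries
        self.producer = (self.producer + 1) & INDEX_MASK
        return slot

    def consume(self) -> int:
        if self.in_flight() == 0:
            raise BufferError("ring empty")
        slot = self.consumer % self.entries
        self.consumer = (self.consumer + 1) & INDEX_MASK
        return slot


# Start just below the wrap point to exercise the 16-bit rollover.
ring = Ring(4)
ring.producer = ring.consumer = 0xFFFE
slots = [ring.produce() for _ in range(4)]
```

Keeping the indices free-running rather than wrapping them at the ring size lets a full ring be told apart from an empty one without a separate counter.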
  3. 25 Jun, 2021 24 commits
    • ice: add tracepoints · 3089cf6d
      Jesse Brandeburg committed
      This patch is modeled after one by Scott Peterson for i40e.
      
      Add tracepoints to the driver, via a new file ice_trace.h and some new
      trace calls added in interesting places in the driver. Add some tracing
      for DIMLIB to help debug interrupt moderation problems.
      
      Performance should not be affected, and this can be very useful
      for debugging and adding new trace events to paths in the future.
      
      Note eBPF programs can attach to these events, as well as perf
      can count them since we're attaching to the events subsystem
      in the kernel.
      Co-developed-by: Ben Shelton <benjamin.h.shelton@intel.com>
      Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • net: bcmgenet: Add mdio-bcm-unimac soft dependency · 19938baf
      Jian-Hong Pan committed
      The Broadcom UniMAC MDIO bus from the mdio-bcm-unimac module comes up too
      late, so GENET cannot find the Ethernet PHY on the UniMAC MDIO bus. This
      causes GENET to fail to attach to the PHY, as the following log shows:
      
      bcmgenet fd580000.ethernet: GENET 5.0 EPHY: 0x0000
      ...
      could not attach to PHY
      bcmgenet fd580000.ethernet eth0: failed to connect to PHY
      uart-pl011 fe201000.serial: no DMA platform data
      libphy: bcmgenet MII bus: probed
      ...
      unimac-mdio unimac-mdio.-19: Broadcom UniMAC MDIO bus
      
      It is not just coming too late, there is also no way for the module
      loader to figure out the dependency between GENET and its MDIO bus
      driver unless we provide this MODULE_SOFTDEP hint.
      
      This patch adds the soft dependency to load mdio-bcm-unimac module
      before genet module to fix this issue.
      
      Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=213485
      Fixes: 9a4e7969 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver")
      Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: delete useless dst check in ip6_dst_lookup_tail · c305b9e6
      zhang kai committed
      The dst parameter always points to NULL.
      Signed-off-by: zhang kai <zhangkaiheb@126.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_env: Avoid unnecessary memcpy()s · 911bd1b1
      Ido Schimmel committed
      Simply get a pointer to the data in the register payload instead of
      copying it to a temporary buffer.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Fix warnings reported for DQO patchset · e8192476
      Bailey Forrest committed
      https://patchwork.kernel.org/project/netdevbpf/list/?series=506637&state=*
      
      - Remove unused variable
      - Use correct integer type for string formatting.
      - Remove `inline` in C files
      
      Fixes: 9c1a59a2 ("gve: DQO: Add ring allocation and initialization")
      Fixes: a57e5de4 ("gve: DQO: Add TX path")
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sctp-pmtud-convergence' · 1ed1fe24
      David S. Miller committed
      Xin Long says:
      
      ====================
      sctp: make the PLPMTUD probe more effective and efficient
      
      As David Laight noticed, it currently takes quite some time to find
      the optimal pmtu in the Search state, and also lacks the black hole
      detection in the Search Complete state. This patchset addresses both
      issues to make the PLPMTUD probe more effective and efficient.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: send the next probe immediately once the last one is acked · fea1d5b1
      Xin Long committed
      There is no need to wait for the 'interval' period before the next probe
      if the last probe is already acked in search state. The 'interval'
      period waiting should be only for probe failure timeout and the
      current pmtu check when it's in search complete state.
      
      This change will shorten the probe time a lot in search state, and
      also fix the document accordingly.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: do black hole detection in search complete state · 0dac127c
      Xin Long committed
      Currently the PLPMTUD probe will stop for a long period (interval * 30)
      after it enters search complete state. If there's a pmtu change on the
      route path, it takes a long time to notice when the ICMP TooBig packet
      is lost or filtered.
      
      As it says in rfc8899#section-4.3:
      
        "A DPLPMTUD method MUST NOT rely solely on this method."
        (ICMP PTB message).
      
      This patch is to enable the other method for search complete state:
      
        "A PL can use the DPLPMTUD probing mechanism to periodically
         generate probe packets of the size of the current PLPMTU."
      
      With this patch, the probe will continue with the current pmtu every
      'interval' until the PMTU_RAISE_TIMER 'timeout', which we implement
      by adding raise_count to raise the probe size when it counts to 30
      and removing the SCTP_PL_COMPLETE check for PMTU_RAISE_TIMER.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
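The search complete behaviour described in the commit above (probe the current pmtu every 'interval', then raise the probe size once raise_count reaches 30) can be sketched as a small simulation. This Python model is only an illustration: `search_complete_probes` and the `PROBE_STEP` value are hypothetical, and the kernel's timer and state handling are not modelled.

```python
RAISE_COUNT_LIMIT = 30  # from the commit message
PROBE_STEP = 4          # hypothetical step size in bytes, for illustration only


def search_complete_probes(pmtu, intervals):
    """Probe sizes sent on each 'interval' tick while in search complete state."""
    sizes = []
    raise_count = 0
    for _ in range(intervals):
        raise_count += 1
        if raise_count == RAISE_COUNT_LIMIT:
            # Time to look for a larger PMTU again; the search resumes here.
            sizes.append(pmtu + PROBE_STEP)
            break
        # Otherwise re-probe the current PMTU so a black hole is noticed early.
        sizes.append(pmtu)
    return sizes


probes = search_complete_probes(pmtu=1400, intervals=40)
```

In this model the first 29 ticks confirm the current pmtu and the 30th tick sends the larger probe, so a path change is noticed within one interval rather than after the whole interval * 30 period.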
    • Merge branch 'sja1110-doc' · 98ebad48
      David S. Miller committed
      Vladimir Oltean says:
      
      ====================
      Document the NXP SJA1110 switch as supported
      
      Now that most of the basic work for SJA1110 support has been done in the
      sja1105 DSA driver, let's add the missing documentation bits to make it
      clear that the driver can be used.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: document the SJA1110 in the Kconfig · 75e99470
      Vladimir Oltean committed
      Mention support for the SJA1110 in menuconfig.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Documentation: net: dsa: add details about SJA1110 · 44531076
      Vladimir Oltean committed
      Denote that the new switch generation is supported, detail its pin
      strapping options (with differences compared to SJA1105) and explain how
      MDIO access to the internal 100base-T1 and 100base-TX PHYs is performed.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'gve-dqo' · 89bddde3
      David S. Miller committed
      Bailey Forrest says:
      
      ====================
      gve: Introduce DQO descriptor format
      
      DQO is the descriptor format for our next generation virtual NIC. The existing
      descriptor format will be referred to as "GQI" in the patch set.
      
      One major change with DQO is it uses dual descriptor rings for both TX and RX
      queues.
      
      The TX path uses a TX queue to send descriptors to HW, and receives packet
      completion events on a TX completion queue.
      
      The RX path posts buffers to HW using an RX buffer queue and receives incoming
      packets on an RX queue.
      
      One important note is that DQO descriptors and doorbells are little endian. We
      continue to use the existing big endian control plane infrastructure.
      
      The general format of the patch series is:
      - Refactor existing code/data structures to be shared by DQO
      - Expand admin queues to support DQO device setup
      - Expand data structures and device setup to support DQO
      - Add logic to setup DQO queues
      - Implement datapath
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: DQO: Add RX path · 9b8dd5e5
      Bailey Forrest committed
      The RX queue has an array of `gve_rx_buf_state_dqo` objects. All
      allocated pages have an associated buf_state object. When a buffer is
      posted on the RX buffer queue, the buffer ID will be the buf_state's
      index into the RX queue's array.
      
      On packet reception, the RX queue will have one descriptor for each
      buffer associated with a received packet. Each RX descriptor will have
      a buffer_id that was posted on the buffer queue.
      
      Notable mentions:
      
      - We use a default buffer size of 2048 bytes. Based on page size, we
        may post separate sections of a single page as separate buffers.
      
      - The driver holds an extra reference on pages passed up the receive
        path with an skb and keeps these pages on a list. When posting new
        buffers to the NIC, we check if any of these pages has only our
        reference, or another buffer sized segment of the page has no
        references. If so, it is free to reuse. This page recycling approach
        is a common netdev optimization that reduces page alloc/free calls.
      
      - Pages in the free list have a page_count bias in order to avoid an
        atomic increment of pagecount every time we attempt to reuse a page.
        # references = page_count() - bias
      
      - In order to track when a page is safe to reuse, we keep track of the
        last offset which had a single SKB reference. When this occurs, it
        implies that every single other offset is reusable. Otherwise, we
        don't know if offsets can be safely reused.
      
      - We maintain two free lists of pages. List #1 (recycled_buf_states)
        contains pages we know can be reused right away. List #2
        (used_buf_states) contains pages which cannot be used right away. We
        only attempt to get pages from list #2 when list #1 is empty. We only
        attempt to use a small fixed number of pages from list #2 before giving
        up and allocating a new page. Both lists are FIFOs in hope that by the
        time we attempt to reuse a page, the references were dropped.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
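The page recycling rules in the RX commit above (references = page_count() - bias, and two FIFO lists with a bounded search of the second) can be sketched as follows. This is a minimal Python model, not the driver code: `BufState`, `get_buf_state`, the `MAX_USED_ATTEMPTS` cap, and the "reusable when no outside references remain" test are assumptions made for illustration.

```python
from collections import deque

MAX_USED_ATTEMPTS = 5  # hypothetical cap; the commit only says "small fixed number"


class BufState:
    """Toy buffer state tracking a page's refcount and the driver's bias."""

    def __init__(self, page_count, bias):
        self.page_count = page_count
        self.bias = bias

    def refs(self):
        # "# references = page_count() - bias" from the commit message.
        return self.page_count - self.bias


def get_buf_state(recycled, used, alloc_page):
    # List #1: pages known to be reusable right away.
    if recycled:
        return recycled.popleft()
    # List #2: try a bounded number of entries, oldest first (FIFO).
    for _ in range(min(MAX_USED_ATTEMPTS, len(used))):
        buf = used.popleft()
        if buf.refs() == 0:  # assumed reuse condition: no outside references
            return buf
        used.append(buf)     # still referenced, put it back at the tail
    # Give up and allocate a fresh page.
    return alloc_page()


busy = BufState(page_count=3, bias=1)  # two outside references remain
idle = BufState(page_count=2, bias=2)  # no outside references remain
used = deque([busy, idle])
picked = get_buf_state(deque(), used, lambda: BufState(1, 1))
```

The bias lets the model, like the scheme described above, decide reusability with a subtraction instead of touching the page count on every reuse attempt.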
    • gve: DQO: Add TX path · a57e5de4
      Bailey Forrest committed
      TX SKBs will have their buffers DMA mapped with the device. Each buffer
      will have at least one TX descriptor associated. Each SKB will also have
      a metadata descriptor.
      
      Each TX queue maintains an array of `gve_tx_pending_packet_dqo` objects.
      Every TX SKB will have an associated pending_packet object. A TX SKB's
      descriptors will use its pending_packet's index as the completion tag,
      which will be returned on the TX completion queue.
      
      The device implements a "flow-miss model". Most packets will simply
      receive a packet completion. The flow-miss system may choose to process
      a packet based on its contents. A TX packet which experiences a flow
      miss would receive a miss completion followed by a later reinjection
      completion. The miss-completion is received when the packet starts to be
      processed by the flow-miss system and the reinjection completion is
      received when the flow-miss system completes processing the packet and
      sends it on the wire.
      
      Notable mentions:
      
      - Buffers may be freed after receiving the miss-completion, but in order
        to avoid packet reordering, we do not complete the SKB until receiving
        the reinjection completion.
      
      - The driver must robustly handle the unlikely scenario where a miss
        completion does not have an associated reinjection completion. This is
        accomplished by maintaining a list of packets which have a pending
        reinjection completion. After a short timeout (5 seconds), the
        SKB and buffers are released and the pending_packet is moved to a
        second list which has a longer timeout (60 seconds), where the
        pending_packet will not be reused. When the longer timeout elapses,
        the driver may assume the reinjection completion would never be
        received and the pending_packet may be reused.
      
      - Completion handling is triggered by an interrupt and is done in the
        NAPI poll function. Because the TX path and completion exist in
        different threading contexts they maintain their own lists for free
        pending_packet objects. The TX path uses a lock-free approach to steal
        the list from the completion path.
      
      - Both the TSO context and general context descriptors have metadata
        bytes. The device requires that if multiple descriptors contain the
        same field, each descriptor must have the same value set for that
        field.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
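The two-stage timeout for a missing reinjection completion, described in the TX commit above, can be summarised as a function of elapsed time. This Python sketch is a simplification under stated assumptions: the state names and `pending_packet_state` are hypothetical, and it assumes the 60-second timeout starts when the 5-second one expires, which the commit message does not spell out.

```python
REINJECTION_TIMEOUT_S = 5  # short timeout from the commit message
DEALLOC_TIMEOUT_S = 60     # longer timeout from the commit message


def pending_packet_state(seconds_since_miss):
    """State of a pending_packet whose reinjection completion never arrives."""
    if seconds_since_miss < REINJECTION_TIMEOUT_S:
        # Still on the first list: SKB is held until reinjection completes.
        return "awaiting_reinjection"
    if seconds_since_miss < REINJECTION_TIMEOUT_S + DEALLOC_TIMEOUT_S:
        # SKB and buffers released, but the completion tag must not be reused yet.
        return "released_not_reusable"
    # The driver may assume the completion will never arrive.
    return "reusable"


timeline = [pending_packet_state(t) for t in (0, 5, 64, 65)]
```

Holding the pending_packet back for the second period guards against a late reinjection completion matching a tag that has already been handed to a new packet.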
    • gve: DQO: Configure interrupts on device up · 0dcc144a
      Bailey Forrest committed
      When interrupts are first enabled, we also set the ratelimits, which
      will be static for the entire usage of the device.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • B
      gve: DQO: Add ring allocation and initialization · 9c1a59a2
      Bailey Forrest committed
      Allocate the buffer and completion ring structures. Do not populate the
      rings yet; that will happen in the respective RX and TX datapath
      follow-on patches.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c1a59a2
    • B
      gve: DQO: Add core netdev features · 5e8c5adf
      Bailey Forrest committed
      Add napi netdev device registration, interrupt handling and initial tx
      and rx polling stubs. The stubs will be filled in follow-on patches.
      
      Also:
      - Add LRO feature advertisement and handling
      - Update ethtool logic
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e8c5adf
    • B
      gve: Update adminq commands to support DQO queues · 1f6228e4
      Bailey Forrest committed
      DQO queue creation requires additional parameters:
      - TX completion/RX buffer queue size
      - TX completion/RX buffer queue address
      - TX/RX queue size
      - RX buffer size
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f6228e4
    • B
      gve: Add DQO fields for core data structures · a4aa1f1e
      Bailey Forrest committed
      - Add new DQO datapath structures:
        - `gve_rx_buf_queue_dqo`
        - `gve_rx_compl_queue_dqo`
        - `gve_rx_buf_state_dqo`
        - `gve_tx_desc_dqo`
        - `gve_tx_pending_packet_dqo`
      
      - Incorporate these into the existing ring data structures:
        - `gve_rx_ring`
        - `gve_tx_ring`
      
      Noteworthy mentions:
      
      - `gve_rx_buf_state` represents an RX buffer which was posted to HW.
        Each RX queue has an array of these objects and the index into the
        array is used as the buffer_id when posted to HW.
      
      - `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The
        completion_tag is the index into the array.
      
      - These two structures have links for linked lists which are represented
        by 16b indexes into a contiguous array of these structures.
        This reduces memory footprint compared to 64b pointers.
      
      - We use unions for the writeable datapath structures to reduce cache
        footprint. GQI-specific members will be renamed like DQO members in a
        future patch.
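
      To illustrate the index-based linked lists mentioned above, here is a
      minimal sketch of a FIFO whose links are 16-bit indexes into a
      contiguous array. The names (`buf_state`, `index_list`, `INVALID_IDX`,
      `BUF_STATE_COUNT`) are hypothetical and chosen for the example; they
      are not the driver's definitions.

```c
#include <stdint.h>

#define BUF_STATE_COUNT 4     /* hypothetical array size */
#define INVALID_IDX     (-1)  /* hypothetical "no entry" marker */

struct buf_state {
    uint64_t dma_addr;  /* placeholder payload */
    int16_t next;       /* 2-byte link instead of an 8-byte pointer */
};

struct index_list {
    int16_t head;
    int16_t tail;
};

static void list_init(struct index_list *l)
{
    l->head = INVALID_IDX;
    l->tail = INVALID_IDX;
}

/* Append entry `id` (its index in `arr`) to the tail of the list. */
static void list_push_tail(struct buf_state *arr, struct index_list *l,
                           int16_t id)
{
    arr[id].next = INVALID_IDX;
    if (l->head == INVALID_IDX)
        l->head = id;
    else
        arr[l->tail].next = id;
    l->tail = id;
}

/* Remove and return the index at the head, or INVALID_IDX if empty. */
static int16_t list_pop_head(struct buf_state *arr, struct index_list *l)
{
    int16_t id = l->head;

    if (id == INVALID_IDX)
        return INVALID_IDX;
    l->head = arr[id].next;
    if (l->head == INVALID_IDX)
        l->tail = INVALID_IDX;
    return id;
}
```

      The index returned by `list_pop_head()` can double as the identifier
      handed to hardware, which is the same idea as using the array index as
      the buffer_id or completion_tag.
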
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4aa1f1e
    • B
      gve: Add dqo descriptors · 22319818
      Bailey Forrest committed
      General description of rings and descriptors:
      
      TX ring is used for sending TX packet buffers to the NIC. It has the
      following descriptors:
      - `gve_tx_pkt_desc_dqo` - Data buffer descriptor
      - `gve_tx_tso_context_desc_dqo` - TSO context descriptor
      - `gve_tx_general_context_desc_dqo` - Generic metadata descriptor
      
      Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo`
      which represents the logical interpretation of the metadata bytes. It's
      helpful to define this structure because the metadata bytes exist in
      multiple descriptor types (including `gve_tx_tso_context_desc_dqo`),
      and the device requires that the same field has the same value in all
      descriptors.
      
      The TX completion ring is used to receive completions from the NIC.
      Having a separate ring allows for completions to be out of order. The
      completion descriptor `gve_tx_compl_desc` has several different types;
      the most important are packet and descriptor completions. Descriptor
      completions are used to notify the driver when descriptors sent on the
      TX ring are done being consumed. The descriptor completion is only used
      to signal that space is cleared in the TX ring. A packet completion will
      be received when a packet transmitted on the TX queue is done being
      transmitted.
      
      In addition there are "miss" and "reinjection" completions. The device
      implements a "flow-miss model". Most packets will simply receive a
      packet completion. The flow-miss system may choose to process a packet
      based on its contents. A TX packet which experiences a flow miss would
      receive a miss completion followed by a later reinjection completion.
      The miss-completion is received when the packet starts to be processed
      by the flow-miss system and the reinjection completion is received when
      the flow-miss system completes processing the packet and sends it on the
      wire.
      
      The RX buffer ring is used to send buffers to HW via the
      `gve_rx_desc_dqo` descriptor.
      
      Received packets are put into the RX queue by the device, which
      populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors
      refer to buffers posted by the buffer queue. Received buffers may be
      returned out of order, such as when HW LRO is enabled.
      
      Important concepts:
      - "TX" and "RX buffer" queues, which send descriptors to the device, use
        MMIO doorbells to notify the device of new descriptors.
      
      - "RX" and "TX completion" queues, which receive descriptors from the
        device, use a "generation bit" to know when a descriptor was populated
        by the device. The driver initializes all bits with the "current
        generation". The device will populate received descriptors with the
        "next generation" which is inverted from the current generation. When
        the ring wraps, the current/next generation are swapped.
      
      - It's the driver's responsibility to ensure that the RX and TX
        completion queues are not overrun. This can be accomplished by
        limiting the number of descriptors posted to HW.
      
      - TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
        buffer_id. These will be returned on the TX completion and RX queues
        respectively to let the driver know which packet/buffer was completed.
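
      The generation-bit scheme above can be sketched as follows. This is a
      simplified model, not the driver's code: the names (`compl_desc`,
      `compl_ring`, `ring_poll`, `device_write`) and the 4-entry ring are
      assumptions, and `device_write()` merely simulates the NIC. A real
      driver would also need a DMA read barrier between checking the
      generation bit and reading the rest of the descriptor.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 4  /* hypothetical ring size */

struct compl_desc {
    uint8_t generation;  /* only bit 0 is used */
    uint16_t tag;        /* e.g. completion_tag or buffer_id */
};

struct compl_ring {
    struct compl_desc desc[RING_SIZE];
    uint32_t head;    /* next slot the driver will inspect */
    uint8_t cur_gen;  /* value meaning "not yet written by the device" */
};

/* Driver initializes every generation bit to the current generation. */
static void ring_init(struct compl_ring *r)
{
    r->head = 0;
    r->cur_gen = 0;
    for (int i = 0; i < RING_SIZE; i++) {
        r->desc[i].generation = r->cur_gen;
        r->desc[i].tag = 0;
    }
}

/* Simulated device: writes a descriptor with the given generation. */
static void device_write(struct compl_ring *r, uint32_t slot, uint8_t gen,
                         uint16_t tag)
{
    r->desc[slot].tag = tag;
    r->desc[slot].generation = gen;
}

/* Returns true and stores the tag if the head descriptor is new. */
static bool ring_poll(struct compl_ring *r, uint16_t *tag)
{
    struct compl_desc *d = &r->desc[r->head];

    if (d->generation == r->cur_gen)
        return false;  /* still the old generation: nothing new */

    *tag = d->tag;
    if (++r->head == RING_SIZE) {
        r->head = 0;
        r->cur_gen ^= 1;  /* ring wrapped: swap current and next */
    }
    return true;
}
```
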
      
      Bitfields are used to describe descriptor fields. This notation is more
      concise and readable than shift-and-mask. It is possible because the
      driver is restricted to little endian platforms.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22319818
    • B
      gve: Add support for DQO RX PTYPE map · c4b87ac8
      Bailey Forrest committed
      Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the
      packet. L3 and L4 types are necessary in order to set the hash and csum
      on RX SKBs correctly.
      
      DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map
      enables the device to tell the driver how to map from PTYPE index to
      L3/L4 type.
      
      The device doesn't provide any guarantees about the range of possible
      PTYPEs, so we just use a 1024 entry array to implement a fast mapping
      structure.
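
      A direct-indexed table of this kind might look like the sketch below.
      The type and enum names (`ptype`, `ptype_lut`, `l3_type`, `l4_type`,
      `ptype_lookup`) are made up for the example, and the way the table is
      filled in here is a stand-in for whatever the device reports.

```c
#include <stdint.h>

#define PTYPE_BITS  10
#define PTYPE_COUNT (1u << PTYPE_BITS)  /* 1024 entries */

/* Hypothetical L3/L4 classifications; zero means "unknown". */
enum l3_type { L3_UNKNOWN = 0, L3_IPV4, L3_IPV6 };
enum l4_type { L4_UNKNOWN = 0, L4_TCP, L4_UDP };

struct ptype {
    uint8_t l3;
    uint8_t l4;
};

/* One entry per possible 10-bit index: 2 KiB in total. */
struct ptype_lut {
    struct ptype entries[PTYPE_COUNT];
};

/* Constant-time lookup; masking keeps the index inside the table even
 * if the raw descriptor field carries stray upper bits. */
static struct ptype ptype_lookup(const struct ptype_lut *lut, uint16_t raw)
{
    return lut->entries[raw & (PTYPE_COUNT - 1)];
}
```

      Covering the full 10-bit range costs a little memory but removes any
      need for range checks or a search on the RX fast path.
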
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4b87ac8
    • B
      gve: adminq: DQO specific device descriptor logic · 5ca2265e
      Bailey Forrest committed
      - In addition to TX and RX queues, DQO has TX completion and RX buffer
        queues.
        - TX completions are received when the device has completed sending a
          packet on the wire.
        - RX buffers are posted on a separate queue from the RX completions.
      - DQO descriptor rings are allowed to be smaller than PAGE_SIZE.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ca2265e
    • B
      gve: Introduce per netdev `enum gve_queue_format` · a5886ef4
      Bailey Forrest committed
      The currently supported queue formats are:
      - GQI_RDA - GQI with raw DMA addressing
      - GQI_QPL - GQI with queue page list
      - DQO_RDA - DQO with raw DMA addressing
      
      The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we
      remove it in favor of just checking against GQI_RDA.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5886ef4
    • B
      gve: Introduce a new model for device options · 8a39d3e0
      Bailey Forrest committed
      The current model uses an integer ID and a fixed size struct for the
      parameters of each device option.
      
      The new model allows the device option structs to grow in size over
      time. A driver may assume that changes to device option structs will
      always be appended.
      
      New device options will also generally have a
      `supported_features_mask` so that the driver knows which fields within a
      particular device option are enabled.
      
      `gve_device_option.feat_mask` is changed to `required_features_mask`,
      and it is a bitmask which must match the value expected by the driver.
      This gives the device the ability to break backwards compatibility with
      old drivers for certain features by blocking the old drivers from trying
      to use the feature.
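
      A rough sketch of how a driver could consume an option under this
      model is shown below. The layouts and names (`dev_option_hdr`,
      `opt_example_v1`, `parse_option`) are invented for illustration and do
      not match the real wire format; in particular, byte-order conversion
      is left out. The two points it shows are the exact match on
      `required_features_mask` and reading only the prefix of the option
      that the driver knows about, so that appended fields are ignored.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical option header preceding each option's payload. */
struct dev_option_hdr {
    uint16_t id;
    uint16_t length;                  /* payload length in bytes */
    uint32_t required_features_mask;  /* must match the driver's value */
};

/* Hypothetical payload layout as known to this driver version. */
struct opt_example_v1 {
    uint32_t supported_features_mask;
    uint16_t field_a;
    uint16_t pad;
};

/* Returns true if the option is usable by this driver. */
static bool parse_option(const struct dev_option_hdr *hdr,
                         const void *payload,
                         uint32_t expected_required_mask,
                         struct opt_example_v1 *out)
{
    /* The device can lock out old drivers by changing this mask. */
    if (hdr->required_features_mask != expected_required_mask)
        return false;

    /* Shorter than the layout we know: cannot be parsed safely. */
    if (hdr->length < sizeof(*out))
        return false;

    /* Longer is fine: new fields are only ever appended, so the
     * prefix we understand keeps its meaning. */
    memcpy(out, payload, sizeof(*out));
    return true;
}
```
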
      
      We maintain ABI compatibility with the old model for
      GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which
      does not support the new model.
      
      This patch introduces some new terminology:
      
      RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA
            mapped and read/updated by the device.
      
      QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped
            with the device for read/write and data is copied from/to SKBs.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a39d3e0