1. 03 Jul, 2016 17 commits
    • Merge branch 'mlx5-next' · 513334e1
      David S. Miller committed
      Saeed Mahameed says:
      
      ====================
      Mellanox 100G SRIOV E-Switch offload and VF representors
      
      We are happy to announce SRIOV E-Switch offload and VF netdev representors.
      
      Or Gerlitz says:
      
      Currently, the way SR-IOV embedded switches are dealt with in Linux is limited
      in its expressiveness and flexibility, but this is not necessarily due to
      hardware limitations. The kernel software model for controlling the SR-IOV
      switch simply does not allow the configuration of anything more complex than
      MAC/VLAN based forwarding.
      
      Hence the benefits brought by SRIOV come at a price of management flexibility,
      when compared to software virtual switches which are used in Para-Virtual (PV)
      schemes and allow implementing complex policies and virtual topologies. Such
      SW switching typically involves complex per-packet processing within the host
      kernel using subsystems such as TC, Bridge, Netfilter and Open vSwitch.
      
      We'd like to change that and get the best of both worlds: the performance of SR-IOV
      with the management flexibility of software switches. This will eventually include
      a richer model for controlling the SR-IOV switch for flow-based switching and
      tunneling. Under this model, the e-switch is configured dynamically and a fallback
      to software exists in case the hardware is unable to offload all required flows.
      
      This series, from Hadar Hen-Zion and myself, is the first step in that direction.
      Specifically, it provides full control of the SRIOV embedded switching by host
      software and paves the way to offloading switching rules and policies in downstream
      patches.
      
      To allow for host-based SW control of the SRIOV HW switch, we introduce a per-VF
      representor host netdevice. The VF representor plays the same role as TAP devices
      in a PV setup. A packet sent through the VF representor on the host arrives at
      the VF, and a packet sent through the VF is received by its representor. The
      administrator can hook the representor netdev into a kernel switching component.
      Once they do that, packets from the VF are subject to steering (matching and
      actions) of that software component.
      
      Doing so indeed hurts the performance benefits of SRIOV as it forces all the
      traffic to go through the hypervisor. However, this SW representation is what
      would eventually allow us to introduce a hybrid model, where we offload steering
      for some of the VF/VM traffic to the HW while other VM traffic keeps going
      through the hypervisor. Examples of the latter are the first packets of flows, which
      are needed for SW switch learning and/or matching against a policy database, or
      types of traffic for which offloading is not desired or not supported by the
      current HW eswitch generation.
      
      The embedded switch is managed through a PCI device driver. As such, we introduce
      a devlink/pci based scheme for setting the mode of the e-switch. The current mode
      (where steering is done based on mac/vlan, etc) is referred to as "legacy" and the
      new mode as "offloads".
      
      For the mlx5 driver / ConnectX4 HW case, the VF representors implement a functional
      subset of mlx5e Ethernet netdevices using their own profile. This design buys us a robust
      implementation with code reuse and sharing.
      
      The representors are created by the host PCI driver when (1) in SRIOV and (2) the
      e-switch is set to offloads mode. Currently, in mlx5 the e-switch management is done
      through the PF vport (0) and hence the VF representors along with the existing PF
      netdev which represents the uplink share the PCI PF device instance.
      
      The series is built from two major components, the first relates to the e-switch
      management and the second to VF representors.
      
      We start with a refactoring that treats the existing SRIOV e-switch code as operating
      in legacy mode. Next, we add the code for the offloads mode which programs the e-switch
      to operate in a way which serves for software based switching:
      
      1. miss rule which matches all packets that do not match any other HW switching rule
      and forwards them to the e-switch management port (0) for further processing.
      
      2. infrastructure for send-to-vport rules which conceptually bypass other "normal"
      steering rules which are present in the e-switch datapath. Such rules apply only to packets
      that originate in the e-switch manager vport (0).
      
      Since all the VF reps run over the same e-switch port, we use more logic in the host PCI
      driver to do HW steering of missed packets into the HW queue opened by the respective VF
      representor. Finally, we add the devlink APIs to configure the e-switch mode.
      
      The second part from Hadar starts with some refactoring work which allows multiple
      mlx5e NIC instances to be created over the same PCI function, to use common resources
      and to avoid wrong loopbacks.
      
      Next comes the heart of the change, which is a profile definition that allows
      both "conventional" mlx5e NIC use cases such as native mode (non SRIOV), VF, PF and the VF
      representor to share the Ethernet driver code. This is done by a small surgery that ended up
      with a few internal callbacks that should be implemented by a profile instance. The profile
      for the conventional NIC is implemented to preserve the existing functionality.
      
      The last two patches add the e-switch registration API for the VF representors and the
      implementation of the VF representor netdevice profile. Being an mlx5e instance, the
      VF representor uses HW send/recv queues, completion queues and such. It currently doesn't
      support NIC offloads but some of them could be added later on. The VF representor has
      switchdev ops, where currently the only supported API is the one for the HW ID,
      which is needed to identify multiple representors belonging to the same e-switch.
      
      The architecture + solution (software and firmware) work was done by a team consisting
      of Ilya Lesokhin, Haggai Eran, Rony Efraim, Tal Anker, Natan Oppenheimer, Saeed Mahameed,
      Hadar and Or, thank you all!
      
      v1 --> v2 fixes:
      * removed unneeded variable (patch #3)
      * removed unused value DEVLINK_ESWITCH_MODE_NONE (patch #8)
      * changed the devlink mode name from "offloads" to "switchdev" which
         better describes what we are referring to here, using a known concept (patch #8)
      * correctly refer to devlink e-switch modes (patch #10)
      * use the correct mlx5e way to define the VF rep statistics  (patch #16)
      
      v2 --> v3 fixes:
      * Rebased on top of 6fde0e63 'be2net: signedness bug in be_msix_enable()'
      * Handled compilation error introduced by rebase on top "f5074d0c Merge branch 'mlx5-100G-fixes'"
      * This series applies perfectly even with 'mlx5 resiliency and xmit path fixes' merged to net-next
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Introduce SRIOV VF representors · cb67b832
      Hadar Hen Zion committed
      Implement the relevant profile functions to create an mlx5e driver instance
      serving as a VF representor. When SRIOV offloads mode is enabled, each VF
      will have a representor netdevice instance on the host.
      
      To do that, we also export a set of shared service functions from en_main.c,
      such that they can be used by both NIC and representor netdevs.
      
      The newly created representor netdevice has a basic set of net_device_ops
      which are the same ndo functions as the NIC netdevice and an ndo of its
      own for phys port name.
      
      The profiling infrastructure allows sharing code between the NIC and the
      vport representor even though the representor has only a subset of the
      NIC functionality.
      
      The VF reps and the PF, which is used in that mode to represent the uplink,
      expose switchdev ops. Currently the only op supported is attr get for the
      port parent ID, which here serves to identify net-devices belonging to the
      same HW E-Switch. Other than that, no offloading is implemented and hence
      switching functionality is achieved if one sets SW switching rules, e.g.
      using tc, bridge or ovs.
      
      Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
      the VF vport number to user-space and, along with the switchdev port parent
      id (phys_switch_id), to enable a udev-based consistent naming scheme:
      
      SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
              ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
      
      where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
      the name of the PF netdevice.
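
      As a rough illustration of how user space could consume the two attributes, the
      Python sketch below groups netdevs by phys_switch_id and builds a name of the form
      used in the rule above. The helper names are hypothetical, and the sysfs root is a
      parameter so that the sketch can be tried against a scratch directory.

```python
import os
from collections import defaultdict


def read_attr(path):
    """Return the stripped content of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except OSError:
        # Netdevs that do not implement the attribute fail the read.
        return None


def group_by_switch_id(sysfs_net="/sys/class/net"):
    """Map phys_switch_id -> {netdev name: phys_port_name or None}."""
    groups = defaultdict(dict)
    for dev in sorted(os.listdir(sysfs_net)):
        switch_id = read_attr(os.path.join(sysfs_net, dev, "phys_switch_id"))
        if switch_id is None:
            continue
        port_name = read_attr(os.path.join(sysfs_net, dev, "phys_port_name"))
        groups[switch_id][dev] = port_name
    return dict(groups)


def rep_name(pf_nic, phys_port_name):
    """Build the name the udev rule would assign: $PF_NIC + phys_port_name."""
    return f"{pf_nic}{phys_port_name}"
```

      Netdevs that end up under the same key belong to the same HW E-Switch.
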
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Add Representors registration API · 127ea380
      Hadar Hen Zion committed
      Introduce E-Switch representor registration/unregistration functions.
      
      Those functions are called by the mlx5e driver when the PF NIC is
      created upon pci probe action regardless of the E-Switch mode (NONE,
      LEGACY or OFFLOADS).
      
      Add a basic E-Switch database that will hold the vport representors
      upon creation.
      
      This patch doesn't add any new functionality.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Add support for multiple profiles · 6bfd390b
      Hadar Hen Zion committed
      To allow support for representor netdevices, where we create more than one
      netdevice per NIC, add profiles to the mlx5e driver. The profiling
      allows for creation of mlx5e instances with different characteristics.
      
      Each profile implements its own behavior using a set of function pointers
      defined in struct mlx5e_profile. This is done to avoid complex
      per-profile branching in the code.
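
      The idea can be modeled outside the kernel. The Python sketch below is only an
      analogy, not the layout of struct mlx5e_profile: the field names, the channel
      counts and the representor profile are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Profile:
    """Per-instance behavior, analogous in spirit to a table of function pointers."""
    name: str
    init_rx: Callable[[List[str]], None]
    max_channels: Callable[[], int]


def nic_init_rx(log):
    # Hypothetical NIC behavior: both RQT types are created.
    log.append("create direct RQTs")
    log.append("create indirect RQTs")


def rep_init_rx(log):
    # Hypothetical representor behavior: direct RQTs only.
    log.append("create direct RQTs")


# The channel counts below are made-up values for the example.
NIC_PROFILE = Profile("nic", nic_init_rx, lambda: 8)
REP_PROFILE = Profile("rep", rep_init_rx, lambda: 1)


def create_netdev(profile):
    """Shared creation path: no 'if nic ... else rep ...' branching here."""
    log = [f"alloc netdev with {profile.max_channels()} channel(s)"]
    profile.init_rx(log)
    return log
```

      The shared path calls through the profile, so a new instance type only has to
      supply its own callbacks.
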
      
      Currently only the profile for the conventional NIC is implemented,
      which is of use when a netdev is created upon pci probe.
      
      This patch doesn't add any new functionality.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Mark enabled RQTs instances explicitly · 398f3351
      Hadar Hen Zion committed
      In the current driver implementation two types of receive queue
      tables (RQTs) are in use - direct and indirect.
      
      Change the driver to mark each newly created RQT (direct or indirect)
      as "enabled". This behaviour is needed for introducing new mlx5e
      instances which serve to represent SRIOV VFs.
      
      The VF representors will have only one type of RQTs (direct).
      
      An "enabled" flag is added to each RQT to allow better handling
      and code sharing between the representors and the nic netdevices.
      
      This patch doesn't add any new functionality.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: TIRs management refactoring · 724b2aa1
      Hadar Hen Zion committed
      The current tirs self-loopback refresh mechanism refreshes all the tirs
      belonging to the same mlx5e instance to prevent self loopback by packets
      sent over any ring of that instance. This mechanism relies on all the
      tirs/tises of an instance being created with the same transport domain
      number (tdn).
      
      Change the driver to refresh all the tirs created under the same tdn
      regardless of which mlx5e netdev instance they belong to.
      
      This behaviour is needed for introducing new mlx5e instances which serve
      to represent SRIOV VFs. The representors and the PF share the vport used for
      E-Switch management, and we want to avoid NIC level HW loopback between
      them, e.g. when sending broadcast packets. To achieve that, both the
      representors and the PF NIC will share the tdn.
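
      A toy model of the change, assuming a registry keyed by tdn (the class and method
      names are hypothetical and do not mirror the driver):

```python
from collections import defaultdict


class TirRegistry:
    """Tracks TIRs per transport domain number (tdn), not per netdev instance."""

    def __init__(self):
        self._by_tdn = defaultdict(list)

    def add(self, tdn, owner, tirn):
        """Record that netdev instance 'owner' created TIR 'tirn' under 'tdn'."""
        self._by_tdn[tdn].append((owner, tirn))

    def refresh_self_loopback_block(self, tdn):
        """Return every TIR number in the domain that would be refreshed."""
        return [tirn for _owner, tirn in self._by_tdn[tdn]]
```

      Because the PF and its representors register under the same tdn, a single refresh
      covers all of them.
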
      
      This patch doesn't add any new functionality.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Create NIC global resources only once · b50d292b
      Hadar Hen Zion committed
      To allow creating more than one netdev over the same PCI function, we
      change the driver such that global NIC resources are created once and
      later shared amongst all the mlx5e netdevs running over that port.
      
      Move the CQ UAR, PD (pdn), Transport Domain (tdn), MKey resources from
      being kept in the mlx5e priv part to a new resources structure
      (mlx5e_resources) placed under the mlx5_core device.
      
      This patch doesn't add any new functionality.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Add devlink based SRIOV mode changes · c930a3ad
      Or Gerlitz committed
      Implement handlers for the devlink commands to get and set the SRIOV
      E-Switch mode.
      
      When turning to the switchdev/offloads mode, we disable the e-switch
      and enable it again in the new mode, create the NIC offloads table
      and create VF reps.
      
      When turning to legacy mode, we remove the VF reps and the offloads
      table, and re-initialize the e-switch in its legacy mode.
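
      The two transitions can be pictured as a small state machine. The Python sketch
      below is a toy model under stated assumptions: SRIOV is already enabled in legacy
      mode, a request for the current mode is a no-op, and all names are hypothetical.

```python
class ESwitch:
    """Toy model of the mode transitions described above (not driver code)."""

    def __init__(self, num_vfs):
        self.num_vfs = num_vfs
        self.mode = "legacy"  # assumption: SRIOV already enabled in legacy mode
        self.offloads_table = False
        self.vf_reps = []

    def set_mode(self, mode):
        if mode not in ("legacy", "switchdev"):
            raise ValueError(f"unsupported mode: {mode}")
        if mode == self.mode:
            return  # assumption: asking for the current mode changes nothing
        if mode == "switchdev":
            # Disable the e-switch, enable it in the new mode, then create
            # the NIC offloads table and the VF reps.
            self.mode = "switchdev"
            self.offloads_table = True
            self.vf_reps = [f"rep{vf}" for vf in range(self.num_vfs)]
        else:
            # Remove the VF reps and the offloads table, then re-initialize
            # the e-switch in legacy mode.
            self.vf_reps = []
            self.offloads_table = False
            self.mode = "legacy"
```
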
      
      The actual creation/removal of the VF reps is done in downstream patches.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Add devlink interface · feae9087
      Or Gerlitz committed
      The devlink interface is initially used to set/get the mode of the SRIOV e-switch.
      
      Currently, these are only stubs for get/set; a downstream patch will actually
      fill them in.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/devlink: Add E-Switch mode control · 08f4b591
      Or Gerlitz committed
      Add the commands to set and show the mode of the SRIOV E-Switch. Two modes
      are supported:
      
      * legacy: operating in the "old" L2 based mode (DMAC --> VF vport)
      
      * switchdev: the E-Switch is treated as a whitebox switch configured
      using standard tools such as tc, bridge, openvswitch etc. To allow
      working with the tools, for each VF, a VF representor netdevice is
      created by the E-Switch manager vendor device driver instance (e.g. PF).
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Add API to create vport rx rules · fed9ce22
      Or Gerlitz committed
      Add the API to create vport rx rules of the form
      
      	packet meta-data :: vport == $VPORT --> $TIR
      
      where the TIR is opened by this VF representor.
      
      This logic will be used for packets that didn't match any rule in the
      e-switch datapath and should be received into the host OS through the
      netdevice that represents the VF they were sent from.
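
      A minimal model of such a rule table, assuming the packet meta-data is available
      as a dictionary with a "source_vport" key (the class and key names are
      hypothetical):

```python
class OffloadsTable:
    """Toy table of rules of the form: source vport == VPORT --> TIR."""

    def __init__(self):
        self._rules = {}

    def add_vport_rx_rule(self, vport, tirn):
        """Steer packets that came from 'vport' to the representor's TIR."""
        self._rules[vport] = tirn

    def steer(self, packet_meta):
        """Return the TIR for a missed packet, or None if no rule matches."""
        return self._rules.get(packet_meta["source_vport"])
```

      Each VF representor would add one rule for the vport it represents, pointing at
      its own TIR.
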
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Add offloads table · c116c6ee
      Or Gerlitz committed
      The table belongs to the NIC offloads name-space and is to be used as part of the
      SRIOV offloads logic to steer packets that hit the e-switch miss rule
      to the TIR of the relevant VF representor.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Introduce offloads steering namespace · acbc2004
      Or Gerlitz committed
      Add a new namespace (MLX5_FLOW_NAMESPACE_OFFLOADS) to be populated
      with flow steering rules that have to
      be executed before the EN NIC steering rules are matched.
      
      The namespace is located after the bypass name-space and before the
      kernel name-space. Therefore, it precedes the HW processing done for
      rules set for the kernel NIC name-space.
      
      Under SRIOV, it would allow us to match on e-switch missed packets
      and forward them to the relevant VF representor TIR.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Amir Vadai <amir@vadai.me>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Add API to create send-to-vport rules · ab22be9b
      Or Gerlitz committed
      Add the API to create send-to-vport e-switch rules of the form
      
       packet meta-data :: send-queue-number == $SQN and source-vport == 0 --> $VPORT
      
      These rules are to be used for a send-to-vport logic which conceptually bypasses
      the "normal" steering rules currently present at the e-switch datapath.
      
      Such a rule should apply only to packets that originate in the e-switch manager
      vport (0) and are sent for a given SQN which is used by a given VF representor
      device, and hence the matching logic.
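
      The matching logic can be sketched as follows. This is an illustrative Python
      model only; the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class SendToVportRule(NamedTuple):
    sqn: int          # send queue number used by one VF representor
    dest_vport: int   # vport of the VF that the representor stands for


def forward(rules, source_vport: int, sqn: int) -> Optional[int]:
    """Return the destination vport, or None when no send-to-vport rule applies."""
    if source_vport != 0:
        # Only packets from the e-switch manager vport (0) are eligible.
        return None
    for rule in rules:
        if rule.sqn == sqn:
            return rule.dest_vport
    return None
```

      Matching on both the SQN and the source vport keeps traffic from other vports from
      hitting these rules, even if it happens to carry the same queue number.
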
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Add miss rule for offloads mode · 3aa33572
      Or Gerlitz committed
      In the sriov offloads mode, packets that are not matched by any other
      rule should be sent towards the e-switch manager for further processing.
      
      Add such a "miss" rule which matches ANY packet as the last rule in the
      e-switch FDB and programs the HW to send the packet to vport 0 where
      the e-switch manager runs.
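
      Why the miss rule has to come last can be seen in a first-match model of the FDB.
      The Python sketch below is an assumption-laden illustration: the ordered list, the
      predicate form and the specific DMAC rule are all hypothetical.

```python
def fdb_lookup(rules, packet):
    """Return the destination vport of the first rule whose predicate matches."""
    for matches, dest_vport in rules:
        if matches(packet):
            return dest_vport
    raise LookupError("no rule matched; the miss rule should prevent this")


MANAGER_VPORT = 0

fdb = [
    # A hypothetical specific rule, standing in for rules added by other code.
    (lambda p: p.get("dmac") == "aa:bb:cc:dd:ee:01", 1),
    # The miss rule: matches ANY packet and must stay last in the list.
    (lambda p: True, MANAGER_VPORT),
]
```

      If the catch-all were placed first, it would shadow every other rule and all
      traffic would land on vport 0.
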
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Add support for the sriov offloads mode · 69697b6e
      Or Gerlitz committed
      Unlike the legacy mode, here, forwarding rules are not learned by the
      driver per events on macs set by VFs/VMs into their vports, but rather
      should be programmed by higher-level SW entities.
      
      That said, in the offloads mode (SRIOV_OFFLOADS), two flow
      groups are created by the driver for management (slow path) purposes:
      
      The first group will be used for sending packets over e-switch vports
      from the host OS where the e-switch management code runs, to be
      received by VFs.
      
      The second group will be used by a miss rule which forwards packets toward
      the e-switch manager. Further logic will trap these packets such that
      the receiving net-device as seen by the networking stack is the representor
      of the vport that sent the packet over the e-switch data-path.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Add operational mode to the SRIOV e-Switch · 6ab36e35
      Or Gerlitz committed
      Define three modes for the SRIOV e-switch operation: none (SRIOV_NONE,
      none of the VF vports are enabled), legacy (SRIOV_LEGACY, the current mode)
      and sriov offloads (SRIOV_OFFLOADS). Currently, when in SRIOV, only the
      legacy mode is supported, where steering rules are of the form:
      
              destination mac --> VF vport
      
      This patch does not change any functionality.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 02 Jul, 2016 23 commits