1. 18 10月, 2017 7 次提交
    • J
      samples/bpf: add cpumap sample program xdp_redirect_cpu · fad3917e
      Jesper Dangaard Brouer 提交于
      This sample program show how to use cpumap and the associated
      tracepoints.
      
      It provides command line stats, which shows how the XDP-RX process,
      cpumap-enqueue and cpumap kthread dequeue is cooperating on a per CPU
      basis.  It also utilize the xdp_exception and xdp_redirect_err
      transpoints to allow users quickly to identify setup issues.
      
      One issue with ixgbe driver is that the driver reset the link when
      loading XDP.  This reset the procfs smp_affinity settings.  Thus,
      after loading the program, these must be reconfigured.  The easiest
      workaround it to reduce the RX-queue to e.g. two via:
      
       # ethtool --set-channels ixgbe1 combined 2
      
      And then add CPUs above 0 and 1, like:
      
       # xdp_redirect_cpu --dev ixgbe1 --prog 2 --cpu 2 --cpu 3 --cpu 4
      
      Another issue with ixgbe is that the page recycle mechanism is tied to
      the RX-ring size.  And the default setting of 512 elements is too
      small.  This is the same issue with regular devmap XDP_REDIRECT.
      To overcome this I've been using 1024 rx-ring size:
      
       # ethtool -G ixgbe1 rx 1024 tx 1024
      
      V3:
       - whitespace cleanups
       - bpf tracepoint cannot access top part of struct
      
      V4:
       - report on kthread sched events, according to tracepoint change
       - report average bulk enqueue size
      
      V5:
       - bpf_map_lookup_elem on cpumap not allowed from bpf_prog
         use separate map to mark CPUs not available
      
      V6:
       - correct kthread sched summary output
      
      V7:
       - Added a --stress-mode for concurrently changing underlying cpumap
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fad3917e
    • J
      bpf: cpumap add tracepoints · f9419f7b
      Jesper Dangaard Brouer 提交于
      This adds two tracepoint to the cpumap.  One for the enqueue side
      trace_xdp_cpumap_enqueue() and one for the kthread dequeue side
      trace_xdp_cpumap_kthread().
      
      To mitigate the tracepoint overhead, these are invoked during the
      enqueue/dequeue bulking phases, thus amortizing the cost.
      
      The obvious use-cases are for debugging and monitoring.  The
      non-intuitive use-case is using these as a feedback loop to know the
      system load.  One can imagine auto-scaling by reducing, adding or
      activating more worker CPUs on demand.
      
      V4: tracepoint remove time_limit info, instead add sched info
      
      V8: intro struct bpf_cpu_map_entry members cpu+map_id in this patch
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9419f7b
    • J
      bpf: cpumap xdp_buff to skb conversion and allocation · 1c601d82
      Jesper Dangaard Brouer 提交于
      This patch makes cpumap functional, by adding SKB allocation and
      invoking the network stack on the dequeuing CPU.
      
      For constructing the SKB on the remote CPU, the xdp_buff in converted
      into a struct xdp_pkt, and it mapped into the top headroom of the
      packet, to avoid allocating separate mem.  For now, struct xdp_pkt is
      just a cpumap internal data structure, with info carried between
      enqueue to dequeue.
      
      If a driver doesn't have enough headroom it is simply dropped, with
      return code -EOVERFLOW.  This will be picked up the xdp tracepoint
      infrastructure, to allow users to catch this.
      
      V2: take into account xdp->data_meta
      
      V4:
       - Drop busypoll tricks, keeping it more simple.
       - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei
      
      V5: correct RCU read protection around __netif_receive_skb_core.
      
      V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c601d82
    • J
      bpf: XDP_REDIRECT enable use of cpumap · 9c270af3
      Jesper Dangaard Brouer 提交于
      This patch connects cpumap to the xdp_do_redirect_map infrastructure.
      
      Still no SKB allocation are done yet.  The XDP frames are transferred
      to the other CPU, but they are simply refcnt decremented on the remote
      CPU.  This served as a good benchmark for measuring the overhead of
      remote refcnt decrement.  If driver page recycle cache is not
      efficient then this, exposes a bottleneck in the page allocator.
      
      A shout-out to MST's ptr_ring, which is the secret behind is being so
      efficient to transfer memory pointers between CPUs, without constantly
      bouncing cache-lines between CPUs.
      
      V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.
      
      V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet,
       as implementation require a separate upstream discussion.
      
      V5:
       - Fix a maybe-uninitialized pointed out by kbuild test robot.
       - Restrict bpf-prog side access to cpumap, open when use-cases appear
       - Implement cpu_map_enqueue() as a more simple void pointer enqueue
      
      V6:
       - Allow cpumap type for usage in helper bpf_redirect_map,
         general bpf-prog side restriction moved to earlier patch.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9c270af3
    • J
      bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112
      Jesper Dangaard Brouer 提交于
      The 'cpumap' is primarily used as a backend map for XDP BPF helper
      call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
      
      This patch implement the main part of the map.  It is not connected to
      the XDP redirect system yet, and no SKB allocation are done yet.
      
      The main concern in this patch is to ensure the datapath can run
      without any locking.  This adds complexity to the setup and tear-down
      procedure, which assumptions are extra carefully documented in the
      code comments.
      
      V2:
       - make sure array isn't larger than NR_CPUS
       - make sure CPUs added is a valid possible CPU
      
      V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
      
      V5:
       - Restrict map allocation to root / CAP_SYS_ADMIN
       - WARN_ON_ONCE if queue is not empty on tear-down
       - Return -EPERM on memlock limit instead of -ENOMEM
       - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
       - Moved cpu_map_enqueue() to next patch
      
      V6: all notice by Daniel Borkmann
       - Fix err return code in cpu_map_alloc() introduced in V5
       - Move cpu_possible() check after max_entries boundary check
       - Forbid usage initially in check_map_func_compatibility()
      
      V7:
       - Fix alloc error path spotted by Daniel Borkmann
       - Did stress test adding+removing CPUs from the map concurrently
       - Fixed refcnt issue on cpu_map_entry, kthread started too soon
       - Make sure packets are flushed during tear-down, involved use of
         rcu_barrier() and kthread_run only exit after queue is empty
       - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
      
      V8:
       - Nitpicking comments and gramma by Edward Cree
       - Fix missing semi-colon introduced in V7 due to rebasing
       - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6710e112
    • J
      net: ftgmac100: Request clock and set speed · 4b70c62b
      Joel Stanley 提交于
      According to the ASPEED datasheet, gigabit speeds require a clock of
      100MHz or higher. Other speeds require 25MHz or higher. This patch
      configures a 100MHz clock if the system has a direct-attached
      PHY, or 25MHz if the system is running NC-SI which is limited to 100MHz.
      
      There appear to be no other upstream users of the FTGMAC100 driver it is
      hard to know the clocking requirements of other platforms. Therefore a
      conservative approach was taken with enabling clocks. If the platform is
      not ASPEED, both requesting the clock and configuring the speed is
      skipped.
      Signed-off-by: NJoel Stanley <joel@jms.id.au>
      Tested-by: NAndrew Jeffery <andrew@aj.id.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b70c62b
    • H
      net: export netdev_txq_to_tc to allow sch_mqprio to compile as module · 8a5f2166
      Henrik Austad 提交于
      In commit 32302902 ("mqprio: Reserve last 32 classid values for HW
      traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
      to find the correct tc instead of dev->tc_to_txq[]
      
      However, when mqprio is compiled as a module, it cannot resolve the
      symbol, leading to this error:
      
           ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!
      
      This adds an EXPORT_SYMBOL() since the other user in the kernel
      (netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
      sysfs-callback.
      
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NHenrik Austad <haustad@cisco.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a5f2166
  2. 17 10月, 2017 30 次提交
  3. 16 10月, 2017 1 次提交
    • D
      Merge tag 'mlx5-updates-2017-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · af28f6f2
      David S. Miller 提交于
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2017-10-11: IPoIB Multi Pkey support
      
      This series provides the support for IPoIB Multi Pkey.
      InfiniBand Pkeys are the equivalent of Ethernet vlans.
      Currently IPoIB device driver supports only default Pkey and IPoIB Pkey child
      interfaces are not supported with IPoIB offloads mode, this series will add
      the support for that by allowing creating mlx5 multiple IPoIB netdevices with
      a non-default Pkey.
      
      mlx5 IPoIB Pkey child interface is smaller version of mlx5i IPoIB interfaces and shares
      most of its resources with the parent IPoIB interface, namely RX steering and ring
      queue resources.
      
      The only mlx5 resources a child Pkey interface will be creating are the TX rings,
      since they should be assigned to a specific Pkey.
      
      mlx5i Pkey netdev is implemented via new mlx5e netdev profile implemented in
      mlx5/core/ipoib/ipoib_vlan.c.
      
      The series starts with a refactoring of mlx5e PTP and mlx5 clock implementation
      to move the code to be part of mlx5 core rather than mlx5e netdevice, in order to
      make mlx5 clock and PTP registration part of the core to be shared with mlx5e
      master Ethernet netdev/IPoIB parent netdev and mlx5_ib in the near future.
      
      Add the support for attaching multiple underlay QPs for the different Pkeys
      in mlx5 core RX steering.
      
      Add Pkey index to rdma_netdev to add the ability to set PKEY index to lower
      IPoIB offload netdev.
      
      Use hash-table to map between DQPN (Destination QP number) to child netdev
      for the IPoIB parent netdev to forward RX packets to the corresponding
      child Pkey netdev, since the RX rings are shared.
      
      The reset of the series adds the ipoib child Pkey: mlx5e netdev profile,
      netdev nods implementation and minimal set of ethtool callbacks.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af28f6f2
  4. 15 10月, 2017 2 次提交
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · e4655e4a
      David S. Miller 提交于
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2017-10-13
      
      This series contains updates to mqprio and i40e.
      
      Amritha introduces a new hardware offload mode in tc/mqprio where the TCs,
      the queue configurations and bandwidth rate limits are offloaded to the
      hardware. The existing mqprio framework is extended to configure the queue
      counts and layout and also added support for rate limiting. This is
      achieved through new netlink attributes for the 'mode' option which takes
      values such as 'dcb' (default) and 'channel' and a 'shaper' option for
      QoS attributes such as bandwidth rate limits in hw mode 1.  Legacy devices
      can fall back to the existing setup supporting hw mode 1 without these
      additional options where only the TCs are offloaded and then the 'mode'
      and 'shaper' options defaults to DCB support.  The i40e driver enables the
      new mqprio hardware offload mechanism factoring the TCs, queue
      configuration and bandwidth rates by creating HW channel VSIs.
      In this new mode, the priority to traffic class mapping and the user
      specified queue ranges are used to configure the traffic class when the
      'mode' option is set to 'channel'. This is achieved by creating HW
      channels(VSI). A new channel is created for each of the traffic class
      configuration offloaded via mqprio framework except for the first TC (TC0)
      which is for the main VSI. TC0 for the main VSI is also reconfigured as
      per user provided queue parameters. Finally, bandwidth rate limits are set
      on these traffic classes through the shaper attribute by sending these
      rates in addition to the number of TCs and the queue configurations.
      
      Colin Ian King makes an array of constant values "constant".
      
      Alan fixes and issue where on some firmware versions, we were failing to
      actually fill out the phy_types which caused ethtool to not report any
      link types.  Also hardened against a potentially malicious VF by not
      letting the VF to reset itself after requesting to change the number of
      queues (via ethtool), let the PF reset the VF to institute the requested
      changes.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4655e4a
    • D
      Merge branch 'tc-testing-updates' · ae0783b1
      David S. Miller 提交于
      Lucas Bates says:
      
      ====================
      tc-testing: Test suite updates
      
      This patch series is a roundup of changes to the tc-testing
      suite:
      
       - Add test cases for police and mirred modules and some coverage
         in already-submitted test categories
       - Break the test case files down into more user-friendly sizes
       - Bug fix to the tdc.py script's handling of the -l argument
      
      v2: fix the lack of final newlines in two new files (thanks David)
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae0783b1