1. 29 May, 2018 5 commits
    • net: sched: mq: add simple offload notification · f971b132
      Committed by Jakub Kicinski
      mq offload is trivial, we just need to let the device know
      that the root qdisc is mq.  Alternative approach would be
      to export qdisc_lookup() and make drivers check the root
      type themselves, but notification via ndo_setup_tc is more
      in line with other qdiscs.
      
      Note that mq doesn't hold any stats of its own; it just
      adds up the stats of its children.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: add qstats.qlen to qlen · 6172abc1
      Committed by Jakub Kicinski
      AFAICT struct gnet_stats_queue.qlen is not used in Qdiscs.
      It may, however, be useful for offloads to report HW queue
      length there.  Add that value to the result of qdisc_qlen_sum().
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
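The change to qdisc_qlen_sum() described in this commit can be pictured with a small model. This is only a sketch in Python, not the kernel code: the class `QdiscModel`, its field names, and the split between a single software counter and per-CPU counters are assumptions made for illustration. The only point taken from the commit message is that the offload-reported `qstats.qlen` is added to the sum.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QdiscModel:
    """Toy stand-in for struct Qdisc; all field names are hypothetical."""
    sw_qlen: int = 0          # packets queued in software
    hw_qlen: int = 0          # queue length reported by an offload (qstats.qlen)
    percpu_qlen: List[int] = field(default_factory=list)  # assumed per-CPU counters
    nolock: bool = False      # assumed: lockless qdiscs count per CPU


def qlen_sum(q: QdiscModel) -> int:
    """Return the total queue length, including the offload-reported part."""
    total = q.hw_qlen  # the value this commit adds to the result
    if q.nolock:
        total += sum(q.percpu_qlen)
    else:
        total += q.sw_qlen
    return total
```

With this model, a qdisc holding 3 packets in software and 5 in a hardware queue reports a total of 8.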
    • net: sched: shrink struct Qdisc · e9be0e99
      Committed by Paolo Abeni
      The struct Qdisc has a lot of holes, especially after commit
      a53851e2 ("net: sched: explicit locking in gso_cpu fallback"),
      which, as a side effect, moved the fields just after 'busylock'
      onto a new cacheline.
      
      Since both 'padded' and 'refcnt' are not updated frequently, and
      there is a hole before 'gso_skb', we can move such fields there,
      saving a cacheline without any performance side effect.
      
      Before this commit:

      pahole -C Qdisc net/sched/sch_generic.o
              # ...
              /* size: 384, cachelines: 6, members: 25 */
              /* sum members: 236, holes: 3, sum holes: 92 */
              /* padding: 56 */

      After this commit:

      pahole -C Qdisc net/sched/sch_generic.o
              # ...
              /* size: 320, cachelines: 5, members: 25 */
              /* sum members: 236, holes: 2, sum holes: 28 */
              /* padding: 56 */
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
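The effect of moving fields into an existing hole can be reproduced on a toy structure. The sketch below uses Python's `ctypes` to lay out two hypothetical structures with the same members in a different order; the structure and field names are invented for illustration and have nothing to do with the real layout of struct Qdisc.

```python
import ctypes


class Holey(ctypes.Structure):
    # A 1-byte field followed by an 8-byte field forces alignment padding (a hole).
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("counter", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Same members, with the two small fields placed next to each other.
    _fields_ = [
        ("counter", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


def member_bytes(struct_type) -> int:
    """Sum of the member sizes, comparable to pahole's 'sum members'."""
    return sum(ctypes.sizeof(ftype) for _name, ftype in struct_type._fields_)


def wasted_bytes(struct_type) -> int:
    """Holes plus tail padding: total size minus the bytes the members occupy."""
    return ctypes.sizeof(struct_type) - member_bytes(struct_type)
```

Both structures carry 10 bytes of members, but the reordered one is smaller because less space is lost to alignment. The pahole figures quoted above follow the same accounting: 236 + 92 + 56 = 384 bytes before the change and 236 + 28 + 56 = 320 bytes after it, which is six versus five 64-byte cachelines.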
    • net: Introduce net_failover driver · cfc80d9a
      Committed by Sridhar Samudrala
      The net_failover driver provides an automated failover mechanism via APIs
      to create and destroy a failover master netdev, and manages the primary and
      standby slave netdevs that get registered via the generic failover
      infrastructure.
      
      The failover netdev acts as a master device and controls two slave devices. The
      original paravirtual interface gets registered as the 'standby' slave netdev and
      a passthru/VF device with the same MAC gets registered as the 'primary' slave
      netdev. Both the 'standby' and 'failover' netdevs are associated with the same
      'pci' device. The user accesses the network interface via the 'failover' netdev.
      The 'failover' netdev chooses the 'primary' netdev as the default for transmits when
      it is available with link up and running.
      
      This can be used by paravirtual drivers to enable an alternate low latency
      datapath. It also enables hypervisor controlled live migration of a VM with
      direct attached VF by failing over to the paravirtual datapath when the VF
      is unplugged.
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
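The transmit choice described in this commit message can be sketched as a small selection function. This is a Python model, not the driver: `SlaveDev`, `xmit_ready` and `pick_xmit_dev` are hypothetical names, and both the fallback to the standby device and the "no device" result are assumptions drawn from the failover behaviour the message describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlaveDev:
    """Toy model of a slave netdev; the fields are illustrative only."""
    name: str
    running: bool   # interface is up and running
    link_up: bool   # carrier is present

    def xmit_ready(self) -> bool:
        return self.running and self.link_up


def pick_xmit_dev(primary: Optional[SlaveDev],
                  standby: Optional[SlaveDev]) -> Optional[SlaveDev]:
    """Prefer the primary (VF) device; otherwise use the standby (paravirtual) one."""
    if primary is not None and primary.xmit_ready():
        return primary
    if standby is not None and standby.xmit_ready():
        return standby
    return None  # assumed: nothing usable, the caller would drop the packet
```

Passing `None` as the primary models the VF being unplugged during live migration, in which case traffic moves to the paravirtual datapath.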
    • net: Introduce generic failover module · 30c8bd5a
      Committed by Sridhar Samudrala
      The failover module provides a generic interface for paravirtual drivers
      to register a netdev and a set of ops with a failover instance. The ops
      are used as event handlers that get called to handle netdev register/
      unregister/link change/name change events on slave PCI Ethernet devices
      with the same MAC address as the failover netdev.
      
      This enables paravirtual drivers to use a VF as an accelerated low latency
      datapath. It also allows migration of VMs with direct attached VFs by
      failing over to the paravirtual datapath when the VF is unplugged.
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 25 May, 2018 3 commits
    • net_sched: switch to rcu_work · aaa908ff
      Committed by Cong Wang
      Commit 05f0fe6b ("RCU, workqueue: Implement rcu_work") introduces
      new APIs for dispatching work in an RCU callback. Now we can just
      switch to the new APIs for tc filters. This could get rid of a lot
      of code.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: introduce xdp_return_frame_rx_napi · 389ab7f0
      Committed by Jesper Dangaard Brouer
      When an xdp_frame is sent through an xdp_do_redirect call, error
      cases can occur where the xdp_frame needs to be dropped and
      returning an -errno code is no longer sufficient or possible
      (e.g. for the cpumap case). This is already fully supported by simply
      calling xdp_return_frame.
      
      This patch is an optimization that provides xdp_return_frame_rx_napi,
      a faster variant for these error cases. It takes advantage of
      the protection provided by XDP RX running under NAPI.
      
      This change is mostly relevant for drivers using the page_pool
      allocator as it can take advantage of this. (Tested with mlx5).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net/dcb: Add dcbnl buffer attribute · e549f6f9
      Committed by Huy Nguyen
      In this patch, we add a dcbnl buffer attribute to allow the user to
      change the NIC's buffer configuration, such as the priority
      to buffer mapping and the size of each individual buffer.
      
      This attribute, combined with the pfc attribute, allows advanced users to
      fine-tune the QoS settings for a specific priority queue. For example,
      the user can dedicate a buffer to one or more priorities or
      give a larger buffer to certain priorities.
      
      The dcb buffer configuration will be controlled by lldptool.
      lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
        maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6
      lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
        sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively
      
      After discussion on the mailing list with Jakub, Jiri, Ido and John, we agreed to
      choose dcbnl over the devlink interface, since this feature is intended to set
      port attributes which are governed by the netdev instance of that port, whereas
      the devlink API is more suitable for global ASIC configurations.
      
      We present a use case scenario where the dcbnl buffer attribute, configured
      by an advanced user, helps reduce the latency of messages of different sizes.
      
      Scenario description:
      On ConnectX-5, we run latency-sensitive traffic with
      small/medium message sizes ranging from 64B to 256KB and bandwidth-sensitive
      traffic with large message sizes of 512KB and 1MB. We group the small, medium,
      and large message sizes into their own PFC-enabled priorities as follows.
        Priorities 1 & 2 (64B, 256B and 1KB)
        Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
        Priorities 5 & 6 (512KB and 1MB)
      
      By default, ConnectX-5 maps all PFC-enabled priorities to a single
      lossless buffer with a fixed size of 50% of the total available buffer space. The
      other 50% is assigned to the lossy buffer. Using the dcbnl buffer attribute,
      we create three equal-size lossless buffers. Each buffer has 25% of the total
      available buffer space. Thus, the lossy buffer size is reduced to 25%. The priority
      to lossless buffer mappings are set as follows.
        Priorities 1 & 2 on lossless buffer #1
        Priorities 3 & 4 on lossless buffer #2
        Priorities 5 & 6 on lossless buffer #3
      
      We observe improvements in latency for small and medium message sizes
      as follows. Please note that the bandwidth for large message sizes is
      reduced, but the total bandwidth remains the same.
        256B message size (42% latency reduction)
        4K message size (21% latency reduction)
        64K message size (16% latency reduction)
      
      CC: Ido Schimmel <idosch@idosch.org>
      CC: Jakub Kicinski <jakub.kicinski@netronome.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Or Gerlitz <gerlitz.or@gmail.com>
      CC: Parav Pandit <parav@mellanox.com>
      CC: Aron Silverton <aron.silverton@oracle.com>
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
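The two lldptool arguments shown in this commit message are positional lists: entry i of the `prio` list is the receive buffer for priority i, and entry i of the `size` list is the size of buffer i. A minimal Python sketch of how such strings could be interpreted follows. The function names and the validation rules are assumptions made for illustration, not part of lldptool or dcbnl.

```python
from typing import Dict, List

NUM_PRIORITIES = 8  # the examples above list eight priorities, 0 through 7
NUM_BUFFERS = 8     # and eight receive buffers, 0 through 7


def parse_prio_to_buffer(arg: str) -> Dict[int, int]:
    """Map each priority (by list position) to its receive buffer index."""
    buffers = [int(tok) for tok in arg.split(",")]
    if len(buffers) != NUM_PRIORITIES:
        raise ValueError("expected one buffer index per priority")
    if any(not 0 <= b < NUM_BUFFERS for b in buffers):
        raise ValueError("buffer index out of range")
    return dict(enumerate(buffers))


def parse_buffer_sizes(arg: str) -> List[int]:
    """Return the size of each receive buffer, indexed by buffer number."""
    sizes = [int(tok) for tok in arg.split(",")]
    if len(sizes) != NUM_BUFFERS:
        raise ValueError("expected one size per buffer")
    if any(s < 0 for s in sizes):
        raise ValueError("buffer size must not be negative")
    return sizes
```

Applied to the example strings, `parse_prio_to_buffer("0,2,5,7,1,2,3,6")` puts priorities 1 and 5 on buffer 2, and `parse_buffer_sizes("87296,87296,0,87296,0,0,0,0")` gives three buffers of 87296 each with the remaining five set to 0.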
  3. 24 May, 2018 4 commits
    • bpf: Add IPv6 Segment Routing helpers · fe94cc29
      Committed by Mathieu Xhonneux
      The BPF seg6local hook should be powerful enough to enable users to
      implement most of the use cases one could think of. After some thinking,
      we figured out that the following actions should be possible on an SRv6
      packet, requiring three specific helpers:
          - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
          - bpf_lwt_seg6_adjust_srh: Grow or shrink an SRH
                                     (to add/delete TLVs)
          - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                                 (specifically End.X, End.T, End.B6 and
                                  End.B6.Encap)
      
      The specifications of these helpers are provided in the patch (see
      include/uapi/linux/bpf.h).
      
      The non-sensitive fields of the SRH are the following: flags, tag and
      TLVs. The other fields cannot be modified, to maintain the SRH
      integrity. Flags, tag and TLVs can easily be modified as their validity
      can be checked afterwards via seg6_validate_srh. It is not allowed to
      modify the segments directly. To add segments on the path,
      one should stack a new SRH using the End.B6 action via
      bpf_lwt_seg6_action.
      
      Growing, shrinking or editing TLVs via the helpers will flag the SRH as
      invalid, and it will have to be re-validated before re-entering the IPv6
      layer. This flag is stored in a per-CPU buffer, along with the current
      header length in bytes.
      
      Storing the SRH length in bytes in the control block is mandatory when using
      bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
      length rounded to 8 bytes (a padding TLV can be inserted to reach the 8-byte
      boundary). When adding/deleting TLVs within the BPF program, the SRH may
      temporarily be in an invalid state where its length is not a multiple of 8
      bytes, hence the need to store the length in bytes
      separately. The caller of the BPF program can then ensure that the SRH's
      final length is valid using this value. Again, a final SRH modified by a
      BPF program which doesn't respect the 8-byte boundary will be discarded,
      as it will be considered invalid.
      
      Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
      available from the LWT BPF IN hook, but not from the seg6local BPF one.
      This helper encapsulates a Segment Routing Header (either with
      a new outer IPv6 header, or by inlining it directly in the existing IPv6
      header) into a non-SRv6 packet. This helper is required if we want to
      offer the possibility to dynamically encapsulate an SRH for non-SRv6 packets,
      as the BPF seg6local hook only works on traffic already containing an SRH.
      This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
      the same purpose but with a static SRH per route.
      
      These helpers require CONFIG_IPV6=y (and not =m).
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
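The 8-byte constraint discussed in this commit message comes down to simple arithmetic on the SRH length in bytes. The sketch below is a Python illustration with hypothetical function names, not the kernel's validation code. It relies on the standard IPv6 extension header convention that the Hdr Ext Len field counts 8-octet units and excludes the first 8 octets.

```python
def srh_len_is_valid(len_bytes: int) -> bool:
    """An SRH length is representable only if it is a non-zero multiple of 8."""
    return len_bytes >= 8 and len_bytes % 8 == 0


def hdr_ext_len(len_bytes: int) -> int:
    """Value of the Hdr Ext Len field for an SRH of the given byte length."""
    if not srh_len_is_valid(len_bytes):
        raise ValueError("SRH length must be a non-zero multiple of 8 bytes")
    return len_bytes // 8 - 1  # 8-octet units, not counting the first 8 octets


def pad_bytes_needed(len_bytes: int) -> int:
    """Bytes of padding TLV required to reach the next 8-byte boundary."""
    return (-len_bytes) % 8
```

For example, an SRH with two 16-byte segments is 8 + 2 * 16 = 40 bytes long, so its Hdr Ext Len is 4. Adding a 5-byte TLV gives 45 bytes, which cannot be encoded in the field; this is the temporary state for which the byte length is tracked separately. Three bytes of padding bring the header to 48 bytes and a Hdr Ext Len of 5.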
    • ipv6: sr: export function lookup_nexthop · 1c1e761e
      Committed by Mathieu Xhonneux
      The function lookup_nexthop is essential to implement most of the seg6local
      actions. As we want to provide a BPF helper that applies some of these
      actions to the packet being processed, the helper should be able to call
      this function, hence the need to make it public.

      Moreover, if an argument is incorrect or if the next hop cannot be found,
      an error should be returned by the BPF helper so the BPF program can adapt
      its processing of the packet (return an error, properly force the drop,
      ...). This patch hence makes this function return dst->error to indicate a
      possible error.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • ipv6: sr: make seg6.h includable without IPv6 · 63526e1c
      Committed by Mathieu Xhonneux
      include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
      not enabled:
         include/net/seg6.h: In function 'seg6_pernet':
      >> include/net/seg6.h:52:14: error: 'struct net' has no member named
                                              'ipv6'; did you mean 'ipv4'?
           return net->ipv6.seg6_data;
                       ^~~~
                       ipv4
      
      This commit makes seg6_pernet return NULL if IPv6 is not compiled in, hence
      allowing seg6.h to be included regardless of the configuration.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • ipv4: support sport, dport and ip_proto in RTM_GETROUTE · 404eb77e
      Committed by Roopa Prabhu
      This is a follow-up to the fib rules sport, dport and ipproto
      match support. It only supports tcp, udp and icmp for ipproto.
      Used by the fib rule self tests.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 23 May, 2018 12 commits
  5. 22 May, 2018 3 commits
  6. 21 May, 2018 1 commit
  7. 20 May, 2018 3 commits
  8. 18 May, 2018 9 commits