1. 01 Jun, 2018 8 commits
  2. 30 May, 2018 1 commit
  3. 29 May, 2018 11 commits
  4. 26 May, 2018 2 commits
    • openvswitch: Support conntrack zone limit · 11efd5cb
      Committed by Yi-Hung Wei
      Currently, nf_conntrack_max is used to limit the maximum number of
      conntrack entries in the conntrack table for every network namespace.
      VMs and containers that reside in the same namespace share the same
      conntrack table, and the total number of conntrack entries for all of
      them is limited by nf_conntrack_max.  In this case, if one VM or
      container overuses conntrack entries, it blocks the others from
      committing valid conntrack entries into the conntrack table.  Even if
      we put each VM in a different network namespace, the current
      nf_conntrack_max configuration is rigid: we cannot give different
      VMs/containers different conntrack entry limits.
      
      To address this issue, this patch proposes a fine-grained mechanism
      that further limits the number of conntrack entries per zone.  For
      example, we can assign a different zone to each VM and set a conntrack
      limit on each zone.  With this isolation, a misbehaving VM only
      consumes the conntrack entries in its own zone and does not affect
      other well-behaved VMs.  Moreover, users can set different conntrack
      limits on different zones based on their preference.
      
      The proposed implementation utilizes Netfilter's nf_conncount backend
      to count the number of connections in a particular zone.  If the number of
      connections is above the configured limit, ovs will return ENOMEM to
      userspace.  If userspace does not configure the zone limit, the limit
      defaults to zero, meaning no limit, which is backward compatible with
      the behavior without this patch.
      
      The following high-level APIs are provided to userspace:
        - OVS_CT_LIMIT_CMD_SET:
          * set default connection limit for all zones
          * set the connection limit for a particular zone
        - OVS_CT_LIMIT_CMD_DEL:
          * remove the connection limit for a particular zone
        - OVS_CT_LIMIT_CMD_GET:
          * get the default connection limit for all zones
          * get the connection limit for a particular zone
      Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
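The per-zone limit semantics described in the openvswitch commit above (a default limit, optional per-zone overrides, zero meaning "no limit", and ENOMEM on overflow) can be illustrated with a small userspace model. This is only a sketch: the struct, the `toy_*` function names, and the fixed table size are assumptions made for illustration and are not the kernel's nf_conncount or OVS code.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ZONES 8 /* hypothetical table size; callers must pass zone < 8 */

/* Toy per-zone limit table; all names here are illustrative. */
struct toy_zone_limits {
    uint32_t default_limit;         /* 0 means "no limit" */
    uint32_t limit[TOY_MAX_ZONES];  /* per-zone limit, used when has_limit is set */
    int has_limit[TOY_MAX_ZONES];
    uint32_t count[TOY_MAX_ZONES];  /* connections currently counted per zone */
};

/* Models "set the connection limit for a particular zone". */
static void toy_limit_set(struct toy_zone_limits *zl, unsigned int zone,
                          uint32_t limit)
{
    zl->limit[zone] = limit;
    zl->has_limit[zone] = 1;
}

/* Models "remove the connection limit for a particular zone". */
static void toy_limit_del(struct toy_zone_limits *zl, unsigned int zone)
{
    zl->has_limit[zone] = 0;
}

/* Models "get": a zone without its own limit falls back to the default. */
static uint32_t toy_limit_get(const struct toy_zone_limits *zl,
                              unsigned int zone)
{
    return zl->has_limit[zone] ? zl->limit[zone] : zl->default_limit;
}

/* Try to commit one connection in `zone`; returns 0 or -ENOMEM. */
static int toy_commit(struct toy_zone_limits *zl, unsigned int zone)
{
    uint32_t limit = toy_limit_get(zl, zone);

    /* Assumption: a commit is refused once the zone already holds `limit`
     * connections. Other zones are unaffected. */
    if (limit != 0 && zl->count[zone] >= limit)
        return -ENOMEM;
    zl->count[zone]++;
    return 0;
}
```

In this model a zone that hits its limit only fails its own commits, which is the isolation property the commit message argues for.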
    • net: bridge: add support for port isolation · 7d850abd
      Committed by Nikolay Aleksandrov
      This patch adds support for a new port flag - BR_ISOLATED. If it is set
      then isolated ports cannot communicate with each other, but they can
      still communicate with non-isolated ports. The same can be achieved via
      ACLs but they don't scale with a large number of ports and the
      complexity of the rules grows. This feature can be used to achieve
      isolated vlan functionality (similar to pvlan) as well, though currently
      it will be port-wide (for all vlans on the port). The new test in
      should_deliver uses data that is already cache hot and the new boolean
      is used to avoid an additional source port test in should_deliver.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
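The forwarding rule stated in the port isolation commit above (isolated ports cannot talk to each other, but can talk to non-isolated ports) reduces to a single boolean test. The sketch below is a userspace model under that reading; `toy_port` and `toy_should_deliver` are hypothetical names and do not reproduce the bridge code.

```c
#include <stdbool.h>

/* Toy bridge port; `isolated` models the BR_ISOLATED flag. */
struct toy_port {
    bool isolated;
};

/* Delivery is refused only when both the source and the destination
 * port are isolated; every other combination is allowed. */
static bool toy_should_deliver(const struct toy_port *src,
                               const struct toy_port *dst)
{
    return !(src->isolated && dst->isolated);
}
```

Because the check only blocks the isolated-to-isolated case, an uplink port left non-isolated can still reach every isolated port, which is how the pvlan-like behavior mentioned in the commit would arise.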
  5. 25 May, 2018 12 commits
    • net/ipv4: Remove tracepoint in fib_validate_source · c949cbbb
      Committed by David Ahern
      The tracepoint does not add value: the call to fib_lookup that follows
      it shows the same information plus the fib lookup result.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Update fib6_table_lookup tracepoint · 30d444d3
      Committed by David Ahern
      Commit bb0ad198 ("ipv6: fib6_rules: support for match on sport, dport
      and ip proto") added support for protocol and ports to FIB rules.
      Update the FIB lookup tracepoint to dump the parameters.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4: Update fib_table_lookup tracepoint · 9f323973
      Committed by David Ahern
      Commit 4a2d73a4 ("ipv4: fib_rules: support match on sport, dport
      and ip proto") added support for protocol and ports to FIB rules.
      Update the FIB lookup tracepoint to dump the parameters.
      
      In addition, make the IPv4 tracepoint similar to the IPv6 one where
      the lookup parameters and result are dumped in 1 event. It is much
      easier to use and understand the outcome of the lookup.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: switch to rcu_work · aaa908ff
      Committed by Cong Wang
      Commit 05f0fe6b ("RCU, workqueue: Implement rcu_work") introduces
      new APIs for dispatching work in an RCU callback. Now we can just
      switch to the new APIs for tc filters. This could get rid of a lot
      of code.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: remove warning in ip_recv_error · 730c54d5
      Committed by Willem de Bruijn
      A precondition check in ip_recv_error triggered on an otherwise benign
      race. Remove the warning.
      
      The warning triggers when passing an ipv6 socket to this ipv4 error
      handling function. RaceFuzzer was able to trigger it due to a race
      in setsockopt IPV6_ADDRFORM.
      
        ---
        CPU0
          do_ipv6_setsockopt
            sk->sk_socket->ops = &inet_dgram_ops;
      
        ---
        CPU1
          sk->sk_prot->recvmsg
            udp_recvmsg
              ip_recv_error
                WARN_ON_ONCE(sk->sk_family == AF_INET6);
      
        ---
        CPU0
          do_ipv6_setsockopt
            sk->sk_family = PF_INET;
      
      This socket option converts a v6 socket that is connected to a v4 peer
      to a v4 socket. It updates the socket on the fly, changing fields in
      sk as well as other structs. This is inherently non-atomic. It races
      with the lockless udp_recvmsg path.
      
      No other code makes an assumption that these fields are updated
      atomically. It is benign here, too, as ip_recv_error cares only about
      the protocol of the skbs enqueued on the error queue, for which
      sk_family is not a precise predictor (thanks to another issue with
      IPV6_ADDRFORM).
      
      Link: http://lkml.kernel.org/r/20180518120826.GA19515@dragonet.kaist.ac.kr
      Fixes: 7ce875e5 ("ipv4: warn once on passing AF_INET6 socket to ip_recv_error")
      Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net : sched: cls_api: deal with egdev path only if needed · f8f4bef3
      Committed by Or Gerlitz
      When dealing with an ingress rule on a netdev, if the conventional
      path succeeded, there's no need to continue into the egdev route,
      and we can stop right there.
      
      Not doing so may cause a 2nd rule to be added by the cls api layer
      with the ingress being the egdev.
      
      For example, under the sriov switchdev scheme, a user rule of VFR A --> VFR B
      will end up with two HW rules: (1) VF A --> VF B and (2) uplink --> VF B.
      
      Fixes: 208c0f4b ('net: sched: use tc_setup_cb_call to call per-block callbacks')
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: fix reserve calculation · 9aad13b0
      Committed by Willem de Bruijn
      Commit b84bbaf7 ("packet: in packet_snd start writing at link
      layer allocation") ensures that packet_snd always starts writing
      the link layer header in reserved headroom allocated for this
      purpose.
      
      This is needed because packets may be shorter than hard_header_len,
      in which case the space up to hard_header_len may be zeroed. But
      that necessary padding is not accounted for in skb->len.
      
      The fix, however, is buggy. It calls skb_push, which grows skb->len
      when moving skb->data back. But in this case packet length should not
      change.
      
      Instead, call skb_reserve, which moves both skb->data and skb->tail
      back, without changing length.
      
      Fixes: b84bbaf7 ("packet: in packet_snd start writing at link layer allocation")
      Reported-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
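The difference between the two calls in the packet commit above can be seen in a toy buffer model: a push moves the data pointer back and grows the length, while a reserve with a negative offset moves data and tail back together and leaves the length alone. The struct and helpers below are simplified stand-ins written for illustration; they use integer offsets instead of pointers and are not the kernel's sk_buff API.

```c
/* Toy buffer: offsets into a backing store plus the accounted length. */
struct toy_skb {
    int data; /* offset of the first payload byte */
    int tail; /* offset one past the last payload byte */
    int len;  /* accounted packet length */
};

/* Push semantics: data moves back by n and the length grows by n. */
static void toy_skb_push(struct toy_skb *skb, int n)
{
    skb->data -= n;
    skb->len += n;
}

/* Reserve semantics: data and tail shift together by n (n may be
 * negative to move both back); the length is untouched. */
static void toy_skb_reserve(struct toy_skb *skb, int n)
{
    skb->data += n;
    skb->tail += n;
}
```

For a buffer starting at `{data = 16, tail = 76, len = 60}`, pushing 4 yields a length of 64, whereas reserving -4 keeps the length at 60 while still starting 4 bytes earlier, which matches the commit's point that the packet length should not change.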
    • xdp: change ndo_xdp_xmit API to support bulking · 735fc405
      Committed by Jesper Dangaard Brouer
      This patch changes the API for ndo_xdp_xmit to support bulking
      xdp_frames.
      
      When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
      Most of the slowdown is caused by DMA API indirect function calls, but
      the net_device->ndo_xdp_xmit() call also contributes.
      
      Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
      single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
      performance improved:
       for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
       for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
      
      With frames available as a bulk inside the driver ndo_xdp_xmit call,
      further optimizations are possible, like bulk DMA-mapping for TX.
      
      Testing without CONFIG_RETPOLINE shows the same performance for
      physical NIC drivers.
      
      The virtual NIC driver tun sees a huge performance boost, as it can
      avoid doing per frame producer locking, but instead amortize the
      locking cost over the bulk.
      
      V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
      V4: Isolated ndo, driver changes and callers.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
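The amortization argument in the bulking commit above (one lock acquisition per bulk instead of one per frame) can be modeled with a toy ring that counts how often its producer lock would be taken. Everything here is an illustrative assumption: `toy_ring`, `toy_xmit_one` and `toy_xmit_bulk` are hypothetical names, and the lock is represented only by a counter.

```c
/* Toy TX ring; lock_acquisitions counts how often the producer
 * lock would have been taken. */
struct toy_ring {
    int capacity;
    int enqueued;
    int lock_acquisitions;
};

/* Per-frame transmit: one lock round trip per frame.
 * Returns 0 on success, -1 if the ring is full. */
static int toy_xmit_one(struct toy_ring *r)
{
    r->lock_acquisitions++;
    if (r->enqueued >= r->capacity)
        return -1;
    r->enqueued++;
    return 0;
}

/* Bulk transmit: the lock is taken once for up to n frames.
 * Returns the number of frames actually enqueued. */
static int toy_xmit_bulk(struct toy_ring *r, int n)
{
    int sent = 0;

    r->lock_acquisitions++;
    while (sent < n && r->enqueued < r->capacity) {
        r->enqueued++;
        sent++;
    }
    return sent;
}
```

Sending 8 frames one at a time costs 8 lock acquisitions in this model, while a single bulk call costs 1; the bulk variant reports how many frames were accepted so the caller can handle the remainder.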
    • xdp: introduce xdp_return_frame_rx_napi · 389ab7f0
      Committed by Jesper Dangaard Brouer
      When sending an xdp_frame through an xdp_do_redirect call, error
      cases can happen where the xdp_frame needs to be dropped, and
      returning an -errno code is no longer sufficient or possible
      (e.g. for the cpumap case). This is already fully supported by simply
      calling xdp_return_frame.
      
      This patch is an optimization, which provides xdp_return_frame_rx_napi,
      a faster variant for these error cases.  It takes advantage of
      the protection provided by XDP RX running under NAPI.
      
      This change is mostly relevant for drivers using the page_pool
      allocator as it can take advantage of this. (Tested with mlx5).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: add tracepoint for devmap like cpumap have · 38edddb8
      Committed by Jesper Dangaard Brouer
      Notice how this allows us to get XDP statistics without affecting XDP
      performance, as the tracepoint is no longer activated on a per-packet basis.
      
      V5: Spotted by John Fastabend.
       Fix 'sent' also counted 'drops' in this patch, a later patch corrected
       this, but it was a mistake in this intermediate step.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: devmap introduce dev_map_enqueue · 67f29e07
      Committed by Jesper Dangaard Brouer
      Functionality is the same, but the ndo_xdp_xmit call is now
      simply invoked from inside the devmap.c code.
      
      V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
      
      V5: Cleanups requested by Daniel
       - Newlines before func definition
       - Use BUILD_BUG_ON checks
       - Remove unnecessary return value store in dev_map_enqueue
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net/dcb: Add dcbnl buffer attribute · e549f6f9
      Committed by Huy Nguyen
      In this patch, we add a dcbnl buffer attribute to allow the user to
      change the NIC's buffer configuration, such as the priority-to-buffer
      mapping and the size of each individual buffer.
      
      This attribute, combined with the pfc attribute, allows advanced users to
      fine-tune the qos settings for a specific priority queue. For example,
      a user can dedicate a buffer to one or more priorities, or give a
      large buffer to certain priorities.
      
      The dcb buffer configuration will be controlled by lldptool.
      lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
        maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6
      lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
        sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively
      
      After discussion on the mailing list with Jakub, Jiri, Ido and John, we agreed to
      choose dcbnl over the devlink interface since this feature is intended to set
      port attributes which are governed by the netdev instance of that port, whereas
      the devlink API is more suitable for global ASIC configurations.
      
      We present a use case scenario where the dcbnl buffer attribute, configured
      by an advanced user, helps reduce the latency of messages of different sizes.
      
      Scenario description:
      On ConnectX-5, we run latency-sensitive traffic with
      small/medium message sizes ranging from 64B to 256KB and bandwidth-sensitive
      traffic with large message sizes of 512KB and 1MB. We group small, medium,
      and large message sizes into their own pfc-enabled priorities as follows.
        Priorities 1 & 2 (64B, 256B and 1KB)
        Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
        Priorities 5 & 6 (512KB and 1MB)
      
      By default, ConnectX-5 maps all pfc-enabled priorities to a single
      lossless buffer with a fixed size of 50% of total available buffer space. The
      other 50% is assigned to the lossy buffer. Using the dcbnl buffer attribute,
      we create three equal-size lossless buffers. Each buffer has 25% of total
      available buffer space. Thus, the lossy buffer size reduces to 25%. Priority
      to lossless buffer mappings are set as follows.
        Priorities 1 & 2 on lossless buffer #1
        Priorities 3 & 4 on lossless buffer #2
        Priorities 5 & 6 on lossless buffer #3
      
      We observe improvements in latency for small and medium message sizes
      as follows. Please note that bandwidth performance for large message sizes is
      reduced but the total bandwidth remains the same.
        256B message size (42% latency reduction)
        4K message size (21% latency reduction)
        64K message size (16% latency reduction)
      
      CC: Ido Schimmel <idosch@idosch.org>
      CC: Jakub Kicinski <jakub.kicinski@netronome.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Or Gerlitz <gerlitz.or@gmail.com>
      CC: Parav Pandit <parav@mellanox.com>
      CC: Aron Silverton <aron.silverton@oracle.com>
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
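The two lldptool examples in the dcbnl commit above describe a priority-to-buffer table and a per-buffer size table. A minimal model of how the two combine is sketched below. The struct and function names are hypothetical, and applying both example value sets to one configuration at the same time is an assumption made for illustration; the commit presents them as separate commands.

```c
#define TOY_NUM_PRIO 8 /* priorities 0..7, as in the lldptool examples */
#define TOY_NUM_BUF  8 /* receive buffers 0..7 */

/* Toy buffer configuration: which buffer each priority uses, and how
 * large each buffer is (in the same units lldptool is given). */
struct toy_buffer_cfg {
    unsigned char prio2buffer[TOY_NUM_PRIO];
    unsigned int buffer_size[TOY_NUM_BUF];
};

/* Size of the receive buffer that traffic of `prio` lands in. */
static unsigned int toy_prio_buffer_size(const struct toy_buffer_cfg *cfg,
                                         unsigned int prio)
{
    return cfg->buffer_size[cfg->prio2buffer[prio]];
}

/* Number of priorities that share buffer `buf`. */
static int toy_buffer_users(const struct toy_buffer_cfg *cfg,
                            unsigned int buf)
{
    int users = 0;

    for (int p = 0; p < TOY_NUM_PRIO; p++)
        if (cfg->prio2buffer[p] == buf)
            users++;
    return users;
}
```

With `prio2buffer = {0,2,5,7,1,2,3,6}`, priorities 1 and 5 both land in buffer 2, so a lookup like this makes shared buffers easy to spot before sizes are tuned.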
  6. 24 May, 2018 6 commits
    • bpfilter: don't pass O_CREAT when opening console for debug · 13405468
      Committed by Jakub Kicinski
      Passing O_CREAT (00000100) to open means we should also pass file
      mode as the third parameter.  Creating /dev/console as a regular
      file may not be helpful anyway, so simply drop the flag when
      opening debug_fd.
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
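The point of the bpfilter O_CREAT commit above is that open() with O_CREAT needs a third mode argument, and that without O_CREAT the call simply fails when the path does not exist instead of creating a regular file. A minimal sketch of the flag-free variant follows; `toy_open_debug_fd` is a hypothetical helper, not the bpfilter code.

```c
#include <fcntl.h>

/* Open an existing file for writing. Without O_CREAT no mode argument
 * is needed, and a missing path makes open() fail with -1 rather than
 * creating a regular file in its place. */
static int toy_open_debug_fd(const char *path)
{
    return open(path, O_WRONLY);
}
```

If creation were actually wanted, the call would have to supply the mode explicitly, for example `open(path, O_WRONLY | O_CREAT, 0600)`.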
    • bpfilter: fix build dependency · 61a552eb
      Committed by Alexei Starovoitov
      BPFILTER could have been enabled without INET causing this build error:
      ERROR: "bpfilter_process_sockopt" [net/bpfilter/bpfilter.ko] undefined!
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: Add seg6local action End.BPF · 004d4b27
      Committed by Mathieu Xhonneux
      This patch adds the End.BPF action to the LWT seg6local infrastructure.
      This action works like any other seg6local End action, meaning that an IPv6
      header with SRH is needed, whose DA has to be equal to the SID of the
      action. It will also advance the SRH to the next segment; the BPF program
      does not have to take care of this.
      
      Since the BPF program must not be a source of instability in the kernel, it
      is important to ensure that the integrity of the packet is maintained
      before yielding it back to the IPv6 layer. The hook hence keeps track of
      whether the SRH has been altered through the helpers, and re-validates its
      content if needed with seg6_validate_srh. The state kept for validation is
      stored in a per-CPU buffer. The BPF program is not allowed to directly
      write into the packet, and only some fields of the SRH can be altered
      through the helper bpf_lwt_seg6_store_bytes.
      
      Performance profiling has shown that the SRH re-validation does not induce
      a significant overhead. If the altered SRH is deemed invalid, the packet
      is dropped.
      
      This validation is also done before executing any action through
      bpf_lwt_seg6_action, and will not be performed again if the SRH is not
      modified after calling the action.
      
      The BPF program may return 3 types of return codes:
          - BPF_OK: the End.BPF action will look up the next destination through
                   seg6_lookup_nexthop.
          - BPF_REDIRECT: if an action has been executed through the
                bpf_lwt_seg6_action helper, the BPF program should return this
                value, as the skb's destination is already set and the default
                lookup should not be performed.
          - BPF_DROP: the packet will be dropped.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
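The handling of the three return codes listed in the End.BPF commit above, together with the re-validation rule, can be summarized as a small decision function. This is a sketch only: the enums use made-up `TOY_` names and values rather than the kernel's constants, the order of checks is an assumption, and treating an unknown code as a drop is also an assumption.

```c
/* Illustrative stand-ins for the BPF return codes named in the commit. */
enum toy_bpf_ret { TOY_BPF_OK, TOY_BPF_REDIRECT, TOY_BPF_DROP };

/* What the hook does with the packet after the program has run. */
enum toy_verdict {
    TOY_LOOKUP_NEXTHOP, /* default lookup of the next destination */
    TOY_KEEP_DST,       /* destination was already set by an action */
    TOY_DROP_PACKET
};

/* srh_modified: the program altered the SRH through a helper.
 * srh_valid:    result of re-validating the altered SRH. */
static enum toy_verdict toy_end_bpf_verdict(enum toy_bpf_ret ret,
                                            int srh_modified, int srh_valid)
{
    if (ret == TOY_BPF_DROP)
        return TOY_DROP_PACKET;
    if (srh_modified && !srh_valid)
        return TOY_DROP_PACKET; /* altered SRH failed re-validation */
    if (ret == TOY_BPF_OK)
        return TOY_LOOKUP_NEXTHOP;
    if (ret == TOY_BPF_REDIRECT)
        return TOY_KEEP_DST;
    return TOY_DROP_PACKET;     /* assumption: unknown codes are dropped */
}
```

The re-validation check sits ahead of the OK and REDIRECT branches so that an invalid SRH can never reach the IPv6 layer through either path.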
    • bpf: Split lwt inout verifier structures · cd3092c7
      Committed by Mathieu Xhonneux
      The new bpf_lwt_push_encap helper should only be accessible within the
      LWT BPF IN hook, and not the OUT one, as this may lead to a skb under
      panic.
      
      At the moment, both LWT BPF IN and OUT share the same list of helpers,
      whose calls are authorized by the verifier. This patch separates the
      verifier ops for the IN and OUT hooks, and allows the IN hook to call the
      bpf_lwt_push_encap helper.
      
      This patch also takes the opportunity to put all lwt_*_func_proto functions
      together for clarity. At the moment, socks_op_func_proto is in the middle
      of lwt_inout_func_proto and lwt_xmit_func_proto.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add IPv6 Segment Routing helpers · fe94cc29
      Committed by Mathieu Xhonneux
      The BPF seg6local hook should be powerful enough to enable users to
      implement most of the use-cases one could think of. After some thinking,
      we figured out that the following actions should be possible on an SRv6
      packet, requiring 3 specific helpers:
          - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
          - bpf_lwt_seg6_adjust_srh: Grow or shrink an SRH
                                     (to add/delete TLVs)
          - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                                 (specifically End.X, End.T, End.B6 and
                                  End.B6.Encap)
      
      The specifications of these helpers are provided in the patch (see
      include/uapi/linux/bpf.h).
      
      The non-sensitive fields of the SRH are the following: flags, tag and
      TLVs. The other fields cannot be modified, to maintain the SRH
      integrity. Flags, tag and TLVs can easily be modified as their validity
      can be checked afterwards via seg6_validate_srh. It is not allowed to
      modify the segments directly. To add segments on the path, one
      should stack a new SRH using the End.B6 action via
      bpf_lwt_seg6_action.
      
      Growing, shrinking or editing TLVs via the helpers will flag the SRH as
      invalid, and it will have to be re-validated before re-entering the IPv6
      layer. This flag is stored in a per-CPU buffer, along with the current
      header length in bytes.
      
      Storing the SRH len in bytes in the control block is mandatory when using
      bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
      len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-byte
      boundary). When adding/deleting TLVs within the BPF program, the SRH may
      temporarily be in an invalid state where its length cannot be rounded to 8
      bytes without remainder, hence the need to store the length in bytes
      separately. The caller of the BPF program can then ensure that the SRH's
      final length is valid using this value. Again, a final SRH modified by a
      BPF program which doesn't respect the 8-byte boundary will be discarded
      as it will be considered invalid.
      
      Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
      available from the LWT BPF IN hook, but not from the seg6local BPF one.
      This helper allows encapsulating a Segment Routing Header (either with
      a new outer IPv6 header, or by inlining it directly in the existing IPv6
      header) into a non-SRv6 packet. This helper is required if we want to
      offer the possibility to dynamically encapsulate an SRH for non-SRv6 packets,
      as the BPF seg6local hook only works on traffic already containing an SRH.
      This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
      the same purpose but with a static SRH per route.
      
      These helpers require CONFIG_IPV6=y (and not =m).
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
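The 8-byte boundary rule discussed in the Segment Routing helpers commit above can be made concrete with a little arithmetic on the byte length that is tracked separately from the header field. The helpers below are illustrative only and use hypothetical `toy_` names; they assume the standard IPv6 extension header convention that the Hdr Ext Len field counts 8-octet units excluding the first 8 octets.

```c
/* A length in bytes is acceptable for a final SRH only if it is a
 * non-zero multiple of 8. */
static int toy_srh_len_is_valid(unsigned int len_bytes)
{
    return len_bytes >= 8 && len_bytes % 8 == 0;
}

/* Padding bytes needed to bring an intermediate length up to the next
 * 8-byte boundary (0 if it is already aligned). */
static unsigned int toy_srh_pad_bytes(unsigned int len_bytes)
{
    return (8 - len_bytes % 8) % 8;
}

/* Header field value for a valid length: 8-octet units, not counting
 * the first 8 octets. Only meaningful when toy_srh_len_is_valid(). */
static unsigned int toy_hdr_ext_len(unsigned int len_bytes)
{
    return len_bytes / 8 - 1;
}
```

For example, a 24-byte SRH is valid and encodes as a field value of 2, while an intermediate length of 27 bytes is not representable in the field and needs 5 bytes of padding before the header can be accepted, which is why the byte length has to be stored on the side.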
    • ipv6: sr: export function lookup_nexthop · 1c1e761e
      Committed by Mathieu Xhonneux
      The function lookup_nexthop is essential to implement most of the seg6local
      actions. As we want to provide a BPF helper that applies some of these
      actions to the packet being processed, the helper should be able to call
      this function, hence the need to make it public.
      
      Moreover, if one argument is incorrect or if the next hop cannot be found,
      an error should be returned by the BPF helper so the BPF program can adapt
      its processing of the packet (return an error, properly force the drop,
      ...). This patch hence makes this function return dst->error to indicate a
      possible error.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>