1. 22 Oct 2017, 2 commits
    • bpf: Adding helper function bpf_getsockops · cd86d1fd
      Lawrence Brakmo committed
      Adding support for helper function bpf_getsockops to socket_ops BPF
      programs. This patch only supports TCP_CONGESTION.
      Signed-off-by: Vlad Vysotsky <vlad@cs.ucla.edu>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
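      A minimal sketch (not part of the patch) of how a socket_ops program
      might call the helper, which programs see as bpf_getsockopt(); the
      SEC()/license conventions follow current libbpf style, and the two
      constants are defined locally for self-containment:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        #define SOL_TCP        6    /* IPPROTO_TCP */
        #define TCP_CONGESTION 13   /* from <linux/tcp.h> */

        SEC("sockops")
        int read_cc(struct bpf_sock_ops *skops)
        {
                char cc[16] = {};   /* TCP_CA_NAME_MAX bytes */

                /* Only TCP_CONGESTION is supported by this patch. */
                if (bpf_getsockopt(skops, SOL_TCP, TCP_CONGESTION,
                                   cc, sizeof(cc)))
                        return 1;   /* helper failed; nothing to read */
                /* cc now holds the current algorithm name, e.g. "cubic" */
                return 1;
        }

        char _license[] SEC("license") = "GPL";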
    • bpf: add support for BPF_SOCK_OPS_BASE_RTT · e6546ef6
      Lawrence Brakmo committed
      A congestion control algorithm can make a call to the BPF socket_ops
      program to request the base RTT. The base RTT can be congestion control
      dependent and is meant to represent a congestion threshold such that
      RTTs above it indicate congestion. This is especially useful for flows
      within a DC where the base RTT is easy to obtain.
      
      Being provided a base RTT solves a basic problem in RTT based congestion
      avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
      to get the base RTT when the network is not congested, it is very
      difficult to do when it is very congested. Newer connections get an
      inflated value of the base RTT, leading to unfairness (newer flows with a
      larger base RTT get more bandwidth). As a result, RTT based congestion
      avoidance algorithms tend to update their base RTTs to improve fairness.
      In very congested networks this can lead to base RTT inflation, reducing
      the ability of these RTT based congestion control algorithms to prevent
      congestion.
      
      Note that in my experiments with TCP-NV, the base RTT provided can be
      much larger than the actual hardware RTT. For example, experimenting
      with hosts within a rack where the hardware RTT is 16-20us, I've used
      base RTTs up to 150us. The effect of using a larger base RTT is that the
      congestion avoidance algorithm will allow more queueing. When there are
      only a few flows the main effect is larger measured RTTs and RPC
      latencies due to the increased queueing. When there are a lot of flows,
      a larger base RTT can lead to more congestion and more packet drops.
      For this case, where the hardware RTT is 20us, a base RTT of 80us
      produces good results.
      
      This patch only introduces BPF_SOCK_OPS_BASE_RTT; a later patch in this
      set adds support for using it in TCP-NV. Further study and testing are
      needed before support can be added to other delay based congestion
      avoidance algorithms.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
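      A hedged sketch of the program side: when the congestion control
      module issues the BASE_RTT query, a socket_ops program can answer
      through skops->reply (the 80us figure is the example value from the
      text above; includes and conventions as in the previous sketch):

        SEC("sockops")
        int base_rtt(struct bpf_sock_ops *skops)
        {
                int rv = 0;

                switch (skops->op) {
                case BPF_SOCK_OPS_BASE_RTT:
                        rv = 80;        /* base RTT in us for a ~20us fabric */
                        break;
                }
                skops->reply = rv;      /* handed back to the CC algorithm */
                return 1;
        }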
  2. 20 Oct 2017, 2 commits
  3. 18 Oct 2017, 1 commit
    • bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112
      Jesper Dangaard Brouer committed
      The 'cpumap' is primarily used as a backend map for XDP BPF helper
      call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
      
      This patch implements the main part of the map.  It is not connected to
      the XDP redirect system yet, and no SKB allocations are done yet.
      
      The main concern in this patch is to ensure the datapath can run
      without any locking.  This adds complexity to the setup and tear-down
      procedure, whose assumptions are carefully documented in the
      code comments.
      
      V2:
       - make sure array isn't larger than NR_CPUS
       - make sure CPUs added is a valid possible CPU
      
      V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
      
      V5:
       - Restrict map allocation to root / CAP_SYS_ADMIN
       - WARN_ON_ONCE if queue is not empty on tear-down
       - Return -EPERM on memlock limit instead of -ENOMEM
       - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
       - Moved cpu_map_enqueue() to next patch
      
      V6: all noticed by Daniel Borkmann
       - Fix err return code in cpu_map_alloc() introduced in V5
       - Move cpu_possible() check after max_entries boundary check
       - Forbid usage initially in check_map_func_compatibility()
      
      V7:
       - Fix alloc error path spotted by Daniel Borkmann
       - Did stress test adding+removing CPUs from the map concurrently
       - Fixed refcnt issue on cpu_map_entry, kthread started too soon
       - Make sure packets are flushed during tear-down; this involved use of
         rcu_barrier(), and the kthread now only exits after the queue is empty
       - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
      
      V8:
       - Nitpick and grammar fixes from Edward Cree
       - Fix missing semi-colon introduced in V7 due to rebasing
       - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
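      A hedged sketch of how an XDP program might declare and use such a
      map once the redirect wiring from the later patches is in place; the
      BTF-style map syntax follows current libbpf conventions, which
      postdate this patch:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_CPUMAP);
                __uint(max_entries, 64);   /* checked against NR_CPUS */
                __type(key, __u32);        /* CPU index */
                __type(value, __u32);      /* per-CPU kthread queue size */
        } cpu_map SEC(".maps");

        SEC("xdp")
        int redirect_cpu(struct xdp_md *ctx)
        {
                __u32 cpu = 2;             /* illustrative target CPU */

                return bpf_redirect_map(&cpu_map, cpu, 0);
        }

        char _license[] SEC("license") = "GPL";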
  4. 17 Oct 2017, 1 commit
  5. 14 Oct 2017, 1 commit
    • mqprio: Introduce new hardware offload mode and shaper in mqprio · 4e8b86c0
      Amritha Nambiar committed
      The offload types currently supported in mqprio are 0 (no offload) and
      1 (offload only TCs) by setting these values for the 'hw' option. If
      offloads are supported by setting the 'hw' option to 1, the default
      offload mode is 'dcb' where only the TC values are offloaded to the
      device. This patch introduces a new hardware offload mode called
      'channel' with 'hw' set to 1 in mqprio which makes full use of the
      mqprio options, the TCs, the queue configurations and the QoS parameters
      for the TCs. This is achieved through a new netlink attribute for the
      'mode' option which takes values such as 'dcb' (default) and 'channel'.
      The 'channel' mode also supports QoS attributes for traffic class such as
      minimum and maximum values for bandwidth rate limits.
      
      This patch enables configuring additional HW shaper attributes associated
      with a traffic class. Currently the shaper for bandwidth rate limiting is
      supported which takes options such as minimum and maximum bandwidth rates
      and are offloaded to the hardware in the 'channel' mode. The min and max
      limits for bandwidth rates are provided by the user along with the TCs
      and the queue configurations when creating the mqprio qdisc. The interface
      can be extended to support new HW shapers in future through the 'shaper'
      attribute.
      
      Introduces a new data structure, 'tc_mqprio_qopt_offload', for offloading
      mqprio queue options; it is shared between the kernel and the
      device driver. It contains a copy of the existing data structure
      for mqprio queue options. This new data structure can be extended when
      adding new attributes for traffic class such as mode, shaper, shaper
      parameters (bandwidth rate limits). The existing data structure for mqprio
      queue options will be shared between the kernel and userspace.
      
      Example:
        # tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1\
          queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
          min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit
      
      To dump the bandwidth rates:
      
      qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
                   queues:(0:3) (4:7)
                   mode:channel
                   shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit
      Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
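      A hedged, kernel-side sketch of the offload structure described
      above; the field names follow the description (mode, shaper, per-TC
      rate limits) and may differ in detail from the merged header:

        struct tc_mqprio_qopt_offload {
                struct tc_mqprio_qopt qopt; /* existing UAPI queue options */
                u16 mode;                   /* 'dcb' (default) or 'channel' */
                u16 shaper;                 /* e.g. bw_rlimit */
                u32 flags;
                u64 min_rate[TC_QOPT_MAX_QUEUE]; /* per-TC min bandwidth */
                u64 max_rate[TC_QOPT_MAX_QUEUE]; /* per-TC max bandwidth */
        };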
  6. 13 Oct 2017, 3 commits
    • tipc: receive group membership events via member socket · ae236fb2
      Jon Maloy committed
      Like with any other service, group members' availability can be
      subscribed for by connecting to the topology server. However, because
      the events arrive via a different socket than the member socket, there
      is a real risk that membership events may arrive out of sync with the
      actual JOIN/LEAVE action. I.e., it is possible to receive the first
      messages from a new member before the corresponding JOIN event arrives,
      just as it is possible to receive the last messages from a leaving
      member after the LEAVE event has already been received.
      
      Since each member socket is internally also subscribing for membership
      events, we now fix this problem by passing those events on to the user
      via the member socket. We leverage the already present member
      synchronization protocol to guarantee correct message/event order. An event
      is delivered to the user as an empty message where the two source
      addresses identify the new/lost member. Furthermore, we set the MSG_OOB
      bit in the message flags to mark it as an event. If the event is an
      indication about a member loss we also set the MSG_EOR bit, so it can
      be distinguished from a member addition event.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
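      A hedged userspace sketch of telling events apart from data with the
      flag scheme described above (sd is a member socket that has already
      joined a group; error handling is minimal):

        #include <stdio.h>
        #include <sys/socket.h>
        #include <sys/uio.h>

        static void poll_group(int sd)
        {
                char buf[1024];
                struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
                struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
                ssize_t n = recvmsg(sd, &msg, 0);

                if (n < 0)
                        return;
                if (msg.msg_flags & MSG_OOB) {
                        /* empty message marking a membership event */
                        if (msg.msg_flags & MSG_EOR)
                                printf("member left\n");
                        else
                                printf("member joined\n");
                } else {
                        printf("data message, %zd bytes\n", n);
                }
        }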
    • tipc: introduce communication groups · 75da2163
      Jon Maloy committed
      As a preparation for introducing flow control for multicast and datagram
      messaging we need a more strictly defined framework than we have now. A
      socket must be able to keep track of exactly how many and which other
      sockets it is allowed to communicate with at any moment, and keep the
      necessary state for those.
      
      We therefore introduce a new concept we have named Communication Group.
      Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
      The call takes four parameters: 'type' serves as group identifier,
      'instance' serves as a logical member identifier, and 'scope' indicates
      the visibility of the group (node/cluster/zone). Finally, 'flags' makes
      it possible to set certain properties for the member. For now, there is
      only one flag, indicating if the creator of the socket wants to receive
      a copy of broadcast or multicast messages it is sending via the socket,
      and whether it wants to be eligible as a destination for its own anycasts.
      
      A group is closed, i.e., sockets which have not joined a group will
      not be able to send messages to or receive messages from members of
      the group, and vice versa.
      
      Any member of a group can send multicast ('group broadcast') messages
      to all group members, optionally including itself, using the primitive
      send(). The messages are received via the recvmsg() primitive. A socket
      can only be member of one group at a time.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
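      A hedged sketch of the join call; the request structure follows the
      four parameters described above, the identifier values are
      illustrative, and the fallback define guards against older libc
      headers:

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/tipc.h>

        #ifndef SOL_TIPC
        #define SOL_TIPC 271   /* from <linux/socket.h> */
        #endif

        static int join_group(int sd)
        {
                struct tipc_group_req req;

                memset(&req, 0, sizeof(req));
                req.type = 42;                  /* group identifier */
                req.instance = 1;               /* this member's identifier */
                req.scope = TIPC_CLUSTER_SCOPE; /* node/cluster/zone */
                req.flags = 0;                  /* e.g. loopback of own sends */

                return setsockopt(sd, SOL_TIPC, TIPC_GROUP_JOIN,
                                  &req, sizeof(req));
        }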
    • sched: tc_mirred: Remove whitespaces · ad2d116c
      Florian Fainelli committed
      This file contains unnecessary whitespace in the form of extra newlines;
      remove them. Found by looking at what struct tc_mirred looks like.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 12 Oct 2017, 2 commits
  8. 11 Oct 2017, 2 commits
  9. 10 Oct 2017, 1 commit
  10. 09 Oct 2017, 2 commits
    • netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Shmulik Ladkani committed
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However, this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when the 2nd invocation
      occurs, userspace passes a bogus fd number, which causes
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace to perform an
      "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
      process-local fd per every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
      deprecating the xt_bpf_info_v1 ABI dealing with pinned eBPF objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
      Reported-by: Rafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood · 821f1b21
      Roopa Prabhu committed
      This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
      suppress arp and nd flood on bridge ports. It implements
      rfc7432, section 10.
      https://tools.ietf.org/html/rfc7432#section-10
      for ethernet VPN deployments. It is similar to the existing
      BR_PROXYARP* flags but has a few semantic differences to conform
      to EVPN standard. Unlike the existing flags, this new flag suppresses
      flood of all neigh discovery packets (arp and nd) to tunnel ports.
      It supports both vlan filtering and non-vlan filtering bridges.
      
      In case of EVPN, it is mainly used to avoid flooding
      of arp and nd packets to tunnel ports like vxlan.
      
      This patch adds netlink and sysfs support to set this bridge port
      flag.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
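      A hedged usage sketch with iproute2 (the flag name follows later
      iproute2 releases and is illustrative here):

        # bridge link set dev vxlan0 neigh_suppress on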
  11. 08 Oct 2017, 4 commits
  12. 06 Oct 2017, 1 commit
    • VSOCK: add sock_diag interface · 413a4317
      Stefan Hajnoczi committed
      This patch adds the sock_diag interface for querying sockets from
      userspace.  Tools like ss(8) and netstat(8) can use this interface to
      list open sockets.
      
      The userspace ABI is defined in <linux/vm_sockets_diag.h> and includes
      netlink request and response structs.  The request can query sockets
      based on their sk_state (e.g. listening sockets only) and the response
      contains socket information fields including the local/remote addresses,
      inode number, etc.
      
      This patch does not dump VMCI pending sockets because I have only tested
      the virtio transport, which does not use pending sockets.  Support can
      be added later by extending vsock_diag_dump() if needed by VMCI users.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
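      A hedged sketch of a dump request over NETLINK_SOCK_DIAG using the
      structs from <linux/vm_sockets_diag.h>; the listener state bit and
      the AF_VSOCK fallback are defined locally, and error handling is
      elided:

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/netlink.h>
        #include <linux/sock_diag.h>
        #include <linux/vm_sockets_diag.h>

        #ifndef AF_VSOCK
        #define AF_VSOCK 40
        #endif
        #define SS_LISTEN_BIT (1 << 10)  /* TCP_LISTEN in kernel numbering */

        /* nlsd: socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG) */
        static int vsock_dump_request(int nlsd)
        {
                struct {
                        struct nlmsghdr nlh;
                        struct vsock_diag_req req;
                } msg;

                memset(&msg, 0, sizeof(msg));
                msg.nlh.nlmsg_len = sizeof(msg);
                msg.nlh.nlmsg_type = SOCK_DIAG_BY_FAMILY;
                msg.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
                msg.req.sdiag_family = AF_VSOCK;
                msg.req.vdiag_states = SS_LISTEN_BIT;  /* listeners only */

                return send(nlsd, &msg, sizeof(msg), 0) < 0 ? -1 : 0;
        }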
  13. 05 Oct 2017, 3 commits
    • dev: advertise the new nsid when the netns iface changes · 6621dd29
      Nicolas Dichtel committed
      x-netns interfaces are bound to two netns: the link netns and the upper
      netns. Usually, this kind of interface is created in the link netns and
      then moved to the upper netns. In the end, the interface is visible only
      in the upper netns. The link nsid is advertised via netlink in the upper
      netns, so the user always knows where the link part is.
      
      There is no such mechanism in the link netns. When the interface is moved
      to another netns, the user cannot "follow" it.
      This patch adds a new netlink attribute which helps to follow an interface
      which moves to another netns. When the interface is unregistered, the new
      nsid is advertised. If the interface is an x-netns interface (i.e.,
      rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.
      
      CC: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Alexei Starovoitov committed
      Introduce the BPF_PROG_QUERY command to retrieve either the set of
      programs attached to a given cgroup or the set of effective programs
      that will execute for events within that cgroup.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
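      A hedged userspace sketch of the query; the attr layout (prog_ids
      buffer plus in/out count) follows the uapi this patch introduces,
      and the attach type is an illustrative choice:

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int query_cgroup_progs(int cgroup_fd)
        {
                __u32 prog_ids[64];
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.query.target_fd = cgroup_fd;
                attr.query.attach_type = BPF_CGROUP_INET_INGRESS;
                attr.query.prog_ids = (__u64)(unsigned long)prog_ids;
                attr.query.prog_cnt = 64;  /* in: capacity, out: count */

                if (syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr)))
                        return -1;
                printf("%u program(s) attached\n", attr.query.prog_cnt);
                return 0;
        }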
    • bpf: multi program support for cgroup+bpf · 324bda9e
      Alexei Starovoitov committed
      Introduce the BPF_F_ALLOW_MULTI flag, which can be used to attach multiple
      bpf programs to a cgroup.
      
      The difference between the three possible flags for the BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior. It only clarifies the semantics in relation
      to new flag.
      
      Only one program is allowed to be attached to a cgroup with
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first run first).
      The programs of sub-cgroup are executed first, then programs of
      this cgroup and then programs of parent cgroup.
      All eligible programs are executed regardless of return code from
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup,
      and to avoid penalizing cgroups without any programs attached,
      introduce 'struct bpf_prog_array', an RCU-protected array of
      pointers to bpf programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
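      A hedged sketch of attaching with the new flag via the raw bpf(2)
      syscall; the attach type is an illustrative choice:

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int attach_multi(int cgroup_fd, int prog_fd)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.target_fd = cgroup_fd;
                attr.attach_bpf_fd = prog_fd;
                attr.attach_type = BPF_CGROUP_INET_EGRESS;
                attr.attach_flags = BPF_F_ALLOW_MULTI; /* coexist with others */

                return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        }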
  14. 04 Oct 2017, 6 commits
  15. 02 Oct 2017, 1 commit
    • cfg80211/nl80211: add a port authorized event · 503c1fb9
      Avraham Stern committed
      Add an event that indicates that a connection is authorized
      (i.e. the 4 way handshake was performed by the driver). This event
      should be sent by the driver after sending a connect/roamed event.
      
      This is useful for networks that require 802.1X authentication.
      In cases where the driver supports 4 way handshake offload, but the
      802.1X authentication is managed by user space, the driver needs to
      inform user space right after the 802.11 association was completed
      so user space can initialize its 802.1X state machine etc.
      However, it is also possible that the AP will choose to skip the
      802.1X authentication (e.g. when PMKSA caching is used) and proceed
      with the 4 way handshake immediately. In this case the driver needs
      to inform user space that 802.1X authentication is no longer required
      (e.g. to prevent user space from disconnecting since it did not get
      any EAPOLs from the AP).
      
      This is also useful for roaming, in which case it is possible that
      the driver used the Fast Transition protocol so 802.1X is not
      required.
      
      Since there will now be a dedicated notification indicating that the
      connection is authorized, the authorized flag can be removed from the
      roamed event. Drivers can send the new port authorized event right
      after sending the roamed event to indicate the new AP is already
      authorized. This therefore reserves the old PORT_AUTHORIZED attribute.
      Signed-off-by: Avraham Stern <avraham.stern@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  16. 30 Sep 2017, 1 commit
    • net-ipv6: add support for sockopt(SOL_IPV6, IPV6_FREEBIND) · 84e14fe3
      Maciej Żenczykowski committed
      So far we've been relying on sockopt(SOL_IP, IP_FREEBIND) being usable
      even on IPv6 sockets.
      
      However, it turns out it is perfectly reasonable to want to set freebind
      on an AF_INET6 SOCK_RAW socket - but there is no way to set any SOL_IP
      socket option on such a socket (they're all blindly errored out).
      
      One use case for this is to allow spoofing src ip on a raw socket
      via sendmsg cmsg.
      
      Tested:
        built, and booted
        # python
        >>> import socket
        >>> SOL_IP = socket.SOL_IP
        >>> SOL_IPV6 = socket.IPPROTO_IPV6
        >>> IP_FREEBIND = 15
        >>> IPV6_FREEBIND = 78
        >>> s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0)
        >>> s.getsockopt(SOL_IP, IP_FREEBIND)
        0
        >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
        0
        >>> s.setsockopt(SOL_IPV6, IPV6_FREEBIND, 1)
        >>> s.getsockopt(SOL_IP, IP_FREEBIND)
        1
        >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
        1
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 29 Sep 2017, 3 commits
  18. 27 Sep 2017, 1 commit
    • bpf: add meta pointer for direct access · de8f3a83
      Daniel Borkmann committed
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
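      A hedged sketch of the XDP side: reserve 4 bytes of metadata in
      front of the packet for a later tc/clsact program (conventions as in
      current libbpf; the stored value is illustrative):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("xdp")
        int xdp_mark(struct xdp_md *ctx)
        {
                __u32 *meta;

                /* grow the meta area by 4 bytes in front of the data */
                if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                        return XDP_PASS;  /* no data_meta support; bail out */

                meta = (void *)(long)ctx->data_meta;
                /* the verifier demands this bounds check against data */
                if ((void *)(meta + 1) > (void *)(long)ctx->data)
                        return XDP_PASS;

                *meta = 0x42;  /* later readable from the skb meta area */
                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";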
  19. 26 Sep 2017, 2 commits
    • tun: enable napi_gro_frags() for TUN/TAP driver · 90e33d45
      Petar Penkov committed
      Add a TUN/TAP receive mode that exercises the napi_gro_frags()
      interface. This mode is available only in TAP mode, as the interface
      expects packets with Ethernet headers.
      
      Furthermore, packets follow the layout of the iovec_iter that was
      received. The first iovec is the linear data, and every one after the
      first is a fragment. If there are more fragments than the max number,
      drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
      dissector code and to verify that the header resides in the linear data.
      
      The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
      This is imposed because this mode is intended for testing via tools like
      syzkaller and packetdrill, and the increased flexibility it provides can
      introduce security vulnerabilities. This flag is accepted only if the
      device is in TAP mode and has the IFF_NAPI flag set as well. This is
      done because both of these are explicit requirements for correct
      operation in this mode.
      Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
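      A hedged userspace sketch of enabling the new mode; both IFF_NAPI
      and IFF_NAPI_FRAGS are set, matching the requirement described
      above (device name and error handling are minimal):

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/if.h>
        #include <linux/if_tun.h>

        static int open_tap_napi_frags(const char *name)
        {
                struct ifreq ifr;
                int fd = open("/dev/net/tun", O_RDWR);

                if (fd < 0)
                        return -1;
                memset(&ifr, 0, sizeof(ifr));
                strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
                ifr.ifr_flags = IFF_TAP | IFF_NO_PI |
                                IFF_NAPI | IFF_NAPI_FRAGS;
                if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                        close(fd);
                        return -1;
                }
                return fd;
        }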
    • tun: enable NAPI for TUN/TAP driver · 94317099
      Petar Penkov committed
      Changes TUN driver to use napi_gro_receive() upon receiving packets
      rather than netif_rx_ni(). Adds the IFF_NAPI flag to enable these
      changes; operation is not affected if the flag is not set.  SKBs
      are constructed upon packet arrival and are queued to be processed
      later.
      
      The new path was evaluated with a benchmark with the following setup:
      Open two tap devices and a receiver thread that reads in a loop for
      each device. Start one sender thread and pin all threads to different
      CPUs. Send 1M minimum UDP packets to each device and measure sending
      time for each of the sending methods:
      	napi_gro_receive():	4.90s
      	netif_rx_ni():		4.90s
      	netif_receive_skb():	7.20s
      Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: davem@davemloft.net
      Cc: ppenkov@stanford.edu
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 25 Sep 2017, 1 commit
    • dm ioctl: fix alignment of event number in the device list · 62e08243
      Mikulas Patocka committed
      The size of struct dm_name_list is different on 32-bit and 64-bit
      kernels (so "(nl + 1)" differs between 32-bit and 64-bit kernels).
      
      This mismatch caused a harmless difference in padding when using a 32-bit
      or 64-bit kernel. Commit 23d70c5e ("dm ioctl: report event number in
      DM_LIST_DEVICES") added reporting event number in the output of
      DM_LIST_DEVICES_CMD. This difference in padding makes it impossible for
      userspace to determine the location of the event number (the location
      would be different when running on 32-bit and 64-bit kernels).
      
      Fix the padding by using offsetof(struct dm_name_list, name) instead of
      sizeof(struct dm_name_list) to determine the location of entries.
      
      Also, the ioctl version number is incremented to 37 so that userspace
      can use the version number to determine that the event number is present
      and correctly located.
      
      In addition, a global event is now raised when a DM device is created,
      removed, or renamed, or when a table is swapped, so that the user can monitor
      for device changes.
      Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
      Fixes: 23d70c5e ("dm ioctl: report event number in DM_LIST_DEVICES")
      Cc: stable@vger.kernel.org # 4.13
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
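      A hedged userspace sketch of the walk this fix enables (ioctl
      version 37+): the event number is taken to be a 32-bit value stored
      8-byte aligned after the terminating NUL of each name, and the local
      ALIGN8 helper assumes the result buffer itself is 8-byte aligned:

        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <linux/dm-ioctl.h>

        #define ALIGN8(x) (((x) + 7) & ~(size_t)7)

        /* nl points into a DM_LIST_DEVICES_CMD result buffer */
        static void walk_names(struct dm_name_list *nl)
        {
                for (;;) {
                        size_t used = offsetof(struct dm_name_list, name)
                                      + strlen(nl->name) + 1;
                        uint32_t *event_nr =
                                (uint32_t *)((char *)nl + ALIGN8(used));

                        printf("%s event %u\n", nl->name, *event_nr);
                        if (!nl->next)
                                break;
                        nl = (struct dm_name_list *)((char *)nl + nl->next);
                }
        }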