1. 04 5月, 2018 6 次提交
    • B
      xsk: add Rx receive functions and poll support · c497176c
      Björn Töpel 提交于
      Here the actual receive functions of AF_XDP are implemented, that in a
      later commit, will be called from the XDP layers.
      
      There's one set of functions for the XDP_DRV side and another for
      XDP_SKB (generic).
      
      A new XDP API, xdp_return_buff, is also introduced.
      
      Adding xdp_return_buff, which is analogous to xdp_return_frame, but
      acts upon an struct xdp_buff. The API will be used by AF_XDP in future
      commits.
      
      Support for the poll syscall is also implemented.
      
      v2: xskq_validate_id did not update cons_tail.
          The entries variable was calculated twice in xskq_nb_avail.
          Squashed xdp_return_buff commit.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c497176c
    • M
      xsk: add support for bind for Rx · 965a9909
      Magnus Karlsson 提交于
      Here, the bind syscall is added. Binding an AF_XDP socket, means
      associating the socket to an umem, a netdev and a queue index. This
      can be done in two ways.
      
      The first way, creating a "socket from scratch". Create the umem using
      the XDP_UMEM_REG setsockopt and an associated fill queue with
      XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
      setsockopt. Call bind passing ifindex and queue index ("channel" in
      ethtool speak).
      
      The second way to bind a socket, is simply skipping the
      umem/netdev/queue index, and passing another already setup AF_XDP
      socket. The new socket will then have the same umem/netdev/queue index
      as the parent so it will share the same umem. You must also set the
      flags field in the socket address to XDP_SHARED_UMEM.
      
      v2: Use PTR_ERR instead of passing error variable explicitly.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      965a9909
    • B
      xsk: add Rx queue setup and mmap support · b9b6b68e
      Björn Töpel 提交于
      Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
      a queue, where the kernel can pass completed Rx frames from the kernel
      to user process.
      
      The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b9b6b68e
    • M
      xsk: add umem fill queue support and mmap · 423f3832
      Magnus Karlsson 提交于
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
      ask the kernel to allocate a queue (ring buffer) and also mmap it
      (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      user process to the kernel. These frames will in a later patch be
      filled in with Rx packet data by the kernel.
      
      v2: Fixed potential crash in xsk_mmap.
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      423f3832
    • B
      xsk: add user memory registration support sockopt · c0c77d8f
      Björn Töpel 提交于
      In this commit the base structure of the AF_XDP address family is set
      up. Further, we introduce the abilty register a window of user memory
      to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
      window is viewed by an AF_XDP socket as a set of equally large
      frames. After a user memory registration all frames are "owned" by the
      user application, and not the kernel.
      
      v2: More robust checks on umem creation and unaccount on error.
          Call set_page_dirty_lock on cleanup.
          Simplified xdp_umem_reg.
      Co-authored-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c0c77d8f
    • B
      net: initial AF_XDP skeleton · 68e8b849
      Björn Töpel 提交于
      Buildable skeleton of AF_XDP without any functionality. Just what it
      takes to register a new address family.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      68e8b849
  2. 27 4月, 2018 8 次提交
    • N
      bpf: fix xdp_generic for bpf_adjust_tail usecase · f7613120
      Nikita V. Shirokov 提交于
      When bpf_adjust_tail was introduced for generic xdp, it changed skb's tail
      pointer, so it was pointing to the new "end of the packet". However skb's
      len field wasn't properly modified, so on the wire ethernet frame had
      original (or even bigger, if adjust_head was used) size. This diff is
      fixing this.
      
      Fixes: 198d83bb (" bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail")
      Signed-off-by: NNikita V. Shirokov <tehnerd@tehnerd.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f7613120
    • W
      udp: add gso support to virtual devices · 83aa025f
      Willem de Bruijn 提交于
      Virtual devices such as tunnels and bonding can handle large packets.
      Only segment packets when reaching a physical or loopback device.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      83aa025f
    • W
      udp: add gso segment cmsg · 2e8de857
      Willem de Bruijn 提交于
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2e8de857
    • W
      udp: paged allocation with gso · 15e36f5b
      Willem de Bruijn 提交于
      When sending large datagrams that are later segmented, store data in
      page frags to avoid copying from linear in skb_segment.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15e36f5b
    • W
      udp: better wmem accounting on gso · ad405857
      Willem de Bruijn 提交于
      skb_segment by default transfers allocated wmem from the gso skb
      to the tail of the segment list. This underreports real truesize
      of the list, especially if the tail might be dropped.
      
      Similar to tcp_gso_segment, update wmem_alloc with the aggregate
      list truesize and make each segment responsible for its own
      share by setting skb->destructor.
      
      Clear gso_skb->destructor prior to calling skb_segment to skip
      the default assignment to tail.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad405857
    • W
      udp: generate gso with UDP_SEGMENT · bec1f6f6
      Willem de Bruijn 提交于
      Support generic segmentation offload for udp datagrams. Callers can
      concatenate and send at once the payload of multiple datagrams with
      the same destination.
      
      To set segment size, the caller sets socket option UDP_SEGMENT to the
      length of each discrete payload. This value must be smaller than or
      equal to the relevant MTU.
      
      A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
      per send call basis.
      
      Total byte length may then exceed MTU. If not an exact multiple of
      segment size, the last segment will be shorter.
      
      The implementation adds a gso_size field to the udp socket, ip(v6)
      cmsg cookie and inet_cork structure to be able to set the value at
      setsockopt or cmsg time and to work with both lockless and corked
      paths.
      
      Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
      
          tcp tso
           3197 MB/s 54232 msg/s 54232 calls/s
               6,457,754,262      cycles
      
          tcp gso
           1765 MB/s 29939 msg/s 29939 calls/s
              11,203,021,806      cycles
      
          tcp without tso/gso *
            739 MB/s 12548 msg/s 12548 calls/s
              11,205,483,630      cycles
      
          udp
            876 MB/s 14873 msg/s 624666 calls/s
              11,205,777,429      cycles
      
          udp gso
           2139 MB/s 36282 msg/s 36282 calls/s
              11,204,374,561      cycles
      
         [*] after reverting commit 0a6b2a1d
             ("tcp: switch to GSO being always on")
      
      Measured total system cycles ('-a') for one core while pinning both
      the network receive path and benchmark process to that core:
      
        perf stat -a -C 12 -e cycles \
          ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
      
      Note the reduction in calls/s with GSO. Bytes per syscall drops
      increases from 1470 to 61818.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bec1f6f6
    • W
      udp: add udp gso · ee80d1eb
      Willem de Bruijn 提交于
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
      from UFO (SKB_UDP_GSO).
      
      IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
      registered.
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee80d1eb
    • W
      udp: expose inet cork to udp · 1cd7884d
      Willem de Bruijn 提交于
      UDP segmentation offload needs access to inet_cork in the udp layer.
      Pass the struct to ip(6)_make_skb instead of allocating it on the
      stack in that function itself.
      
      This patch is a noop otherwise.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cd7884d
  3. 26 4月, 2018 9 次提交
  4. 25 4月, 2018 12 次提交
    • W
      bpf: clear the ip_tunnel_info. · 5540fbf4
      William Tu 提交于
      The percpu metadata_dst might carry the stale ip_tunnel_info
      and cause incorrect behavior.  When mixing tests using ipv4/ipv6
      bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
      ipv4's src ip addr as its ipv6 src address, because the previous
      tunnel info does not clean up.  The patch zeros the fields in
      ip_tunnel_info.
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Reported-by: NYifeng Sun <pkusunyifeng@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5540fbf4
    • E
      bpf: add helper for getting xfrm states · 12bed760
      Eyal Birger 提交于
      This commit introduces a helper which allows fetching xfrm state
      parameters by eBPF programs attached to TC.
      
      Prototype:
      bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
      
      skb: pointer to skb
      index: the index in the skb xfrm_state secpath array
      xfrm_state: pointer to 'struct bpf_xfrm_state'
      size: size of 'struct bpf_xfrm_state'
      flags: reserved for future extensions
      
      The helper returns 0 on success. Non zero if no xfrm state at the index
      is found - or non exists at all.
      
      struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
      address and the reqid; it can be further extended by adding elements to
      its end - indicating the populated fields by the 'size' argument -
      keeping backwards compatibility.
      
      Typical usage:
      
      struct bpf_xfrm_state x = {};
      bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
      ...
      Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      12bed760
    • E
      net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt() · 091311de
      Eric Dumazet 提交于
      rt6_remove_exception_rt() is called under rcu_read_lock() only.
      
      We lock rt6_exception_lock a bit later, so we do not hold
      rt6_exception_lock yet.
      
      Fixes: 8a14e46f ("net/ipv6: Fix missing rcu dereferences on from")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: David Ahern <dsahern@gmail.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      091311de
    • C
      net/tls: remove redundant second null check on sgout · 95ad7544
      Colin Ian King 提交于
      A duplicated null check on sgout is redundant as it is known to be
      already true because of the identical earlier check. Remove it.
      Detected by cppcheck:
      
      net/tls/tls_sw.c:696: (warning) Identical inner 'if' condition is always
      true.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95ad7544
    • C
      ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers · c04d2cb2
      Chris Novakovic 提交于
      Distributed filesystems are most effective when the server and client
      clocks are synchronised. Embedded devices often use NFS for their
      root filesystem but typically do not contain an RTC, so the clocks of
      the NFS server and the embedded device will be out-of-sync when the root
      filesystem is mounted (and may not be synchronised until late in the
      boot process).
      
      Extend ipconfig with the ability to export IP addresses of NTP servers
      it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as
      follows:
      
       - If ipconfig is configured manually via the "ip=" or "nfsaddrs="
         kernel command line parameters, one NTP server can be specified in
         the new "<ntp0-ip>" parameter.
       - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in
         the DHCPDISCOVER message, and record the IP addresses of up to three
         NTP servers sent by the responding DHCP server in the subsequent
         DHCPOFFER message.
      
      ipconfig will only write the NTP server IP addresses it discovers to
      /proc/net/ipconfig/ntp_servers, one per line (in the order received from
      the DHCP server, if DHCP autoconfiguration is used); making use of these
      NTP servers is the responsibility of a user space process (e.g. an
      initrd/initram script that invokes an NTP client before mounting an NFS
      root filesystem).
      Signed-off-by: NChris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c04d2cb2
    • C
      ipconfig: Create /proc/net/ipconfig directory · 4d019b3f
      Chris Novakovic 提交于
      To allow ipconfig to report IP configuration details to user space
      processes without cluttering /proc/net, create a new subdirectory
      /proc/net/ipconfig. All files containing IP configuration details should
      be written to this directory.
      Signed-off-by: NChris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d019b3f
    • C
      ipconfig: Correctly initialise ic_nameservers · 300eec7c
      Chris Novakovic 提交于
      ic_nameservers, which stores the list of name servers discovered by
      ipconfig, is initialised (i.e. has all of its elements set to NONE, or
      0xffffffff) by ic_nameservers_predef() in the following scenarios:
      
       - before the "ip=" and "nfsaddrs=" kernel command line parameters are
         parsed (in ip_auto_config_setup());
       - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in
         order to clear any values that may have been set after parsing "ip="
         or "nfsaddrs=" and are no longer needed.
      
      This means that ic_nameservers_predef() is not called when neither "ip="
      nor "nfsaddrs=" is specified on the kernel command line. In this
      scenario, every element in ic_nameservers remains set to 0x00000000,
      which is indistinguishable from ANY and causes pnp_seq_show() to write
      the following (bogus) information to /proc/net/pnp:
      
        #MANUAL
        nameserver 0.0.0.0
        nameserver 0.0.0.0
        nameserver 0.0.0.0
      
      This is potentially problematic for systems that blindly link
      /etc/resolv.conf to /proc/net/pnp.
      
      Ensure that ic_nameservers is also initialised when neither "ip=" nor
      "nfsaddrs=" are specified by calling ic_nameservers_predef() in
      ip_auto_config(), but only when ip_auto_config_setup() was not called
      earlier. This causes the following to be written to /proc/net/pnp, and
      is consistent with what gets written when ipconfig is configured
      manually but no name servers are specified on the kernel command line:
      
        #MANUAL
      Signed-off-by: NChris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      300eec7c
    • C
      ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers · de1fa15b
      Chris Novakovic 提交于
      When ipconfig is autoconfigured via BOOTP, the request packet
      initialised by ic_bootp_init_ext() always allocates 8 bytes for the name
      server option, limiting the BOOTP server to responding with at most 2
      name servers even though ipconfig in fact supports an arbitrary number
      of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently
      3).
      
      Only request name servers in the request packet if CONF_NAMESERVERS_MAX
      is positive (to comply with [1, §3.8]), and allocate enough space in the
      packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum
      number we can accept in response.
      
      [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
          https://tools.ietf.org/rfc/rfc2132.txtSigned-off-by: NChris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de1fa15b
    • C
      ipconfig: BOOTP: Don't request IEN-116 name servers · 4e1a8af2
      Chris Novakovic 提交于
      When ipconfig is autoconfigured via BOOTP, the request packet
      initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name
      Server" [1, §3.7]), but tag 5 in the response isn't processed by
      ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name
      Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears
      to have been the intended tag to request.
      
      This won't cause any breakage for existing users, as tag 5 responses
      provided by BOOTP servers weren't being processed anyway.
      
      [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
          https://tools.ietf.org/rfc/rfc2132.txtSigned-off-by: NChris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e1a8af2
    • C
      ipconfig: Tidy up reporting of name servers · e18bdc83
      Chris Novakovic 提交于
      Commit 5e953778 ("ipconfig: add
      nameserver IPs to kernel-parameter ip=") adds the IP addresses of
      discovered name servers to the summary printed by ipconfig when
      configuration is complete. It appears the intention in ip_auto_config()
      was to print the name servers on a new line (especially given the
      spacing and lack of comma before "nameserver0="), but they're actually
      printed on the same line as the NFS root filesystem configuration
      summary:
      
        [    0.686186] IP-Config: Complete:
        [    0.686226]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
        [    0.686328]      host=test, domain=example.com, nis-domain=(none)
        [    0.686386]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=     nameserver0=10.0.0.1
      
      This makes it harder to read and parse ipconfig's output. Instead, print
      the name servers on a separate line:
      
        [    0.791250] IP-Config: Complete:
        [    0.791289]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
        [    0.791407]      host=test, domain=example.com, nis-domain=(none)
        [    0.791475]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
        [    0.791476]      nameserver0=10.0.0.1
      Signed-off-by: NChris Novakovic <chris@chrisn.me.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e18bdc83
    • E
      tcp: md5: only call tp->af_specific->md5_lookup() for md5 sockets · 8c2320e8
      Eric Dumazet 提交于
      RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive,
      given they have no result.
      We can omit the calls for sockets that have no md5 keys.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c2320e8
    • W
      packet: fix bitfield update race · a6361f0c
      Willem de Bruijn 提交于
      Updates to the bitfields in struct packet_sock are not atomic.
      Serialize these read-modify-write cycles.
      
      Move po->running into a separate variable. Its writes are protected by
      po->bind_lock (except for one startup case at packet_create). Also
      replace a textual precondition warning with lockdep annotation.
      
      All others are set only in packet_setsockopt. Serialize these
      updates by holding the socket lock. Analogous to other field updates,
      also hold the lock when testing whether a ring is active (pg_vec).
      
      Fixes: 8dc41944 ("[PACKET]: Add optional checksum computation for recvmsg")
      Reported-by: NDaeRyong Jeong <threeearcat@gmail.com>
      Reported-by: NByoungyoung Lee <byoungyoung@purdue.edu>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6361f0c
  5. 24 4月, 2018 5 次提交