1. 08 Nov 2018, 5 commits
    • udp: cope with UDP GRO packet misdirection · cf329aa4
      By Paolo Abeni
      In some scenarios, the GRO engine can assemble a UDP GRO packet
      that ultimately lands on a non-GRO-enabled socket.
      This patch addresses the issue by explicitly checking the UDP
      socket's features before enqueuing the packet and, if necessary,
      segmenting the unexpected GRO packet.
      
      We must also cope with re-insertion requests: after segmentation the
      UDP code calls the helper introduced by the previous patches, as needed.
      
      Segmentation is performed by a common helper, which takes care of
      updating socket and protocol stats in case of failure (see the
      sketch after this entry).
      
      rfc v3 -> v1
       - fix compile issues with rxrpc
       - when gso_segment returns NULL, treat it as an error
       - added 'ipv4' argument to udp_rcv_segment()
      
      rfc v2 -> rfc v3
       - moved udp_rcv_segment() into net/udp.h, account errors to socket
         and ns, always return NULL or segs list
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
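      A condensed sketch of the common helper this changelog describes
      (simplified from the pattern above; MIB accounting and the v4/v6
      split are elided, so names and details are approximate):

          /* Split a GRO packet that landed on a non-GRO socket back into
           * individual datagrams; account and drop on failure. */
          static struct sk_buff *udp_rcv_segment(struct sock *sk,
                                                 struct sk_buff *skb, bool ipv4)
          {
                  struct sk_buff *segs;

                  segs = __skb_gso_segment(skb, NETIF_F_SG, false);
                  if (IS_ERR_OR_NULL(segs)) {
                          /* NULL is treated as an error, per the changelog */
                          atomic_add(skb_shinfo(skb)->gso_segs, &sk->sk_drops);
                          kfree_skb(skb);
                          return NULL;  /* socket/protocol stats updated here */
                  }

                  consume_skb(skb);     /* the aggregate skb is no longer needed */
                  return segs;          /* caller enqueues each segment in turn */
          }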
    • udp: add support for UDP_GRO cmsg · bcd1665e
      By Paolo Abeni
      When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
      datagram size. User space can use this info to reconstruct the
      original packet layout, as the sketch below shows.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
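      A user-space sketch of consuming this cmsg (error handling trimmed;
      the UDP_GRO fallback define mirrors the kernel's UAPI value and
      covers older libc headers):

          #include <netinet/in.h>
          #include <string.h>
          #include <sys/socket.h>

          #ifndef UDP_GRO
          #define UDP_GRO 104     /* socket option and cmsg type */
          #endif

          /* Receive a possibly aggregated datagram; *gso_size is set to the
           * wire segment size, or 0 if the datagram was not aggregated. */
          static ssize_t recv_with_gro(int fd, char *buf, size_t len, int *gso_size)
          {
                  union {
                          char buf[CMSG_SPACE(sizeof(int))];
                          struct cmsghdr align;
                  } ctrl;
                  struct iovec iov = { .iov_base = buf, .iov_len = len };
                  struct msghdr msg = {
                          .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf,
                          .msg_controllen = sizeof(ctrl.buf),
                  };
                  struct cmsghdr *cm;
                  ssize_t ret = recvmsg(fd, &msg, 0);

                  *gso_size = 0;
                  for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
                          if (cm->cmsg_level == IPPROTO_UDP /* SOL_UDP */ &&
                              cm->cmsg_type == UDP_GRO)
                                  memcpy(gso_size, CMSG_DATA(cm), sizeof(*gso_size));
                  return ret;
          }

      UDP_GRO must first be enabled on the socket; see the snippet after
      the next entry.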
    • udp: implement GRO for plain UDP sockets. · e20cf8d3
      By Paolo Abeni
      This is the RX counterpart of commit bec1f6f6 ("udp: generate gso
      with UDP_SEGMENT"). When UDP_GRO is enabled, such a socket is also
      eligible for GRO in the rx path: UDP segments directed to it are
      assembled into a larger GSO_UDP_L4 packet.
      
      The core UDP GRO support is enabled with setsockopt(UDP_GRO).
      
      Initial benchmark numbers:
      
      Before:
      udp rx:   1079 MB/s   769065 calls/s
      
      After:
      udp rx:   1466 MB/s    24877 calls/s
      
      This change introduces a side effect with respect to UDP tunnels:
      once a UDP tunnel is created, the kernel now performs a lookup per
      ingress UDP packet, whereas before such a lookup happened only if the
      ingress packet carried a valid internal header csum.
      
      rfc v2 -> rfc v3:
       - fixed typos in macro name and comments
       - really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1
       - acquire socket lock in UDP_GRO setsockopt
      
      rfc v1 -> rfc v2:
       - use a new option to enable UDP GRO
       - use static keys to protect the UDP GRO socket lookup
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
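      Opting in from user space is a single socket option (a sketch; `fd`
      is an open UDP socket and the fallback define mirrors the UAPI
      value):

          #ifndef UDP_GRO
          #define UDP_GRO 104
          #endif

          int on = 1;

          /* Allow the rx path to hand this socket aggregated GSO_UDP_L4 packets. */
          if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, &on, sizeof(on)) < 0)
                  perror("setsockopt(UDP_GRO)");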
    • udp: implement complete book-keeping for encap_needed · 60fb9567
      By Paolo Abeni
      The *encap_needed static keys are enabled by UDP tunnels
      and several UDP encapsulation types, but they are never
      turned off. This can cause unneeded overall performance
      degradation on systems where such features are used only
      transiently.
      
      This patch introduces complete book-keeping for such keys,
      decreasing the usage count at socket destruction time, if needed,
      and preventing the same socket from increasing the key usage
      count multiple times.
      
      rfc v3 -> v1:
       - add socket lock around udp_tunnel_encap_enable()
      
      rfc v2 -> rfc v3:
       - use udp_tunnel_encap_enable() in setsockopt()
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
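      A sketch of the book-keeping idea (simplified; `encap_enabled`
      stands for the per-socket flag the patch adds, and the enable path
      is assumed to run under the socket lock, per the v1 note):

          static DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);

          /* Bump the key at most once per socket. */
          static void udp_tunnel_encap_enable(struct sock *sk)
          {
                  if (udp_sk(sk)->encap_enabled)
                          return;
                  udp_sk(sk)->encap_enabled = 1;
                  static_branch_inc(&udp_encap_needed_key);
          }

          /* Paired decrement at socket destruction, so transient tunnel users
           * don't leave the key (and its rx-path overhead) enabled forever. */
          static void udp_encap_disable(struct sock *sk)
          {
                  if (!udp_sk(sk)->encap_enabled)
                          return;
                  udp_sk(sk)->encap_enabled = 0;
                  static_branch_dec(&udp_encap_needed_key);
          }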
    • net: ensure unbound datagram socket to be chosen when not in a VRF · 6da5b0f0
      By Mike Manning
      Ensure an unbound datagram skt is chosen when not in a VRF. The
      device match in compute_score() for UDP must now always be checked:
      when the skt is bound to a device that does not match the packet's,
      a failure is returned. This ensures that bound sockets are never
      selected for the wrong device, even if there is no unbound socket.
      
      Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
      packets are currently blocked, as flowi6_oif was set to that of the
      master vrf device, and the ipi6_ifindex is that of the slave device.
      Allow these packets to be sent by checking the device with ipi6_ifindex
      has the same L3 scope as that of the bound device of the skt, which is
      the master vrf device. Note that this check always succeeds if the skt
      is unbound.
      
      Even though the right datagram skt is now selected by compute_score(),
      a different skt is being returned that is bound to the wrong vrf. The
      difference between these and stream sockets is the handling of the skt
      option for SO_REUSEPORT. While the handling when adding a skt for reuse
      correctly checks that the bound device of the skt is a match, the skts
      in the hashslot are already incorrect. So for the same hash, a skt for
      the wrong vrf may be selected for the required port. The root cause is
      that the skt is immediately placed into a slot when it is created,
      but when the skt is then bound using SO_BINDTODEVICE, it remains in the
      same slot. The solution is to move the skt to the correct slot by
      forcing a rehash.
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
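      A condensed sketch of the compute_score() device check described
      above (the real function also scores address, port and family
      matches; the helper name is illustrative):

          /* Fail a bound socket outright on a device mismatch, so it is never
           * chosen for traffic outside its VRF, even with no unbound socket. */
          static int compute_score_dev(struct sock *sk, int dif, int sdif)
          {
                  bool dev_match;

                  if (!sk->sk_bound_dev_if)
                          return 0;               /* unbound: fine anywhere */

                  /* dif is the ingress device, sdif the VRF slave (if any). */
                  dev_match = sk->sk_bound_dev_if == dif ||
                              sk->sk_bound_dev_if == sdif;
                  return dev_match ? 4 : -1;      /* mismatch: fail the socket */
          }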
  2. 31 Oct 2018, 1 commit
  3. 27 Oct 2018, 1 commit
  4. 25 Oct 2018, 1 commit
    • net: udp: fix handling of CHECKSUM_COMPLETE packets · db4f1be3
      By Sean Tranchetti
      Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
      incorrect for any packet that has an incorrect checksum value.
      
      udp4/6_csum_init() will both make a call to
      __skb_checksum_validate_complete() to initialize/validate the csum
      field when receiving a CHECKSUM_COMPLETE packet. When this packet
      fails validation, skb->csum will be overwritten with the pseudoheader
      checksum so the packet can be fully validated by software, but the
      skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
      the stack can later warn the user about their hardware spewing bad
      checksums. Unfortunately, leaving the SKB in this state can cause
      problems later on in the checksum calculation.
      
      Since the packet is still marked as CHECKSUM_COMPLETE,
      udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
      from skb->csum instead of adding it, leaving us with a garbage value
      in that field. Once we try to copy the packet to userspace in
      udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
      to checksum the packet data and add it to the garbage skb->csum value
      to perform our final validation check.
      
      Since the value we're validating is not the proper checksum, it's possible
      that the folded value could come out to 0, causing us not to drop the
      packet. Instead, we believe that the packet was checksummed incorrectly
      by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
      to warn the user with netdev_rx_csum_fault(skb->dev).
      
      Unfortunately, since this is the UDP path, skb->dev has been overwritten
      by skb->dev_scratch and is no longer a valid pointer, so we end up
      reading invalid memory.
      
      This patch addresses this problem in two ways:
      	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
      	   from skb_copy_and_csum_datagram_msg(). Since this gets called
      	   from the UDP path where skb->dev has been overwritten, we have
      	   no way of knowing if the pointer is still valid. Also for the
      	   sake of consistency with the other uses of
      	   netdev_rx_csum_fault(), don't attempt to call it if the
      	   packet was checksummed by software.
      
      	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
      	   If we receive a packet that's CHECKSUM_COMPLETE that fails
      	   verification (i.e. skb->csum_valid == 0), check who performed
      	   the calculation. It's possible that the checksum was done in
      	   software by the network stack earlier (such as Netfilter's
      	   CONNTRACK module), and if that says the checksum is bad,
      	   we can drop the packet immediately instead of waiting until
      	   we try and copy it to userspace. Otherwise, we need to
      	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
      	   no longer contains the full packet checksum after the
      	   call to __skb_checksum_validate_complete().
      
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Fixes: c84d9490 ("udp: copy skb->truesize in the first cache line")
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
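      A sketch of the improved handling in point 2 (simplified control
      flow, placed after __skb_checksum_validate_complete() has run in
      udp4/6_csum_init(); not the verbatim fix):

          if (!skb->csum_valid && skb->ip_summed == CHECKSUM_COMPLETE) {
                  /* csum_complete_sw means software (e.g. conntrack) computed
                   * the full checksum: the packet really is bad, drop now. */
                  if (skb->csum_complete_sw)
                          return 1;       /* caller discards the packet */

                  /* The hardware value failed the pseudo-header check; skb->csum
                   * no longer holds the full packet checksum, so fall back to
                   * CHECKSUM_NONE and let the copy path re-checksum in software. */
                  skb->ip_summed = CHECKSUM_NONE;
          }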
  5. 08 Oct 2018, 1 commit
    • udp: Unbreak modules that rely on external __skb_recv_udp() availability · 7e823644
      By Jiri Kosina
      Commit 2276f58a ("udp: use a separate rx queue for packet reception")
      turned the static inline __skb_recv_udp() from a trivial helper around
      __skb_recv_datagram() into a UDP-specific implementation, making it
      EXPORT_SYMBOL_GPL() at the same time.
      
      There are external modules that got broken by __skb_recv_udp() not being
      visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().
      
      Rationale (one of several) for why this is actually the "technically
      correct" thing to do: __skb_recv_udp() used to be an inline wrapper
      around __skb_recv_datagram(), which itself (still, and correctly so,
      I believe) is EXPORT_SYMBOL().
      
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 06 Oct 2018, 1 commit
  7. 03 Oct 2018, 1 commit
  8. 17 Sep 2018, 1 commit
  9. 11 Aug 2018, 2 commits
    • bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65
      By Martin KaFai Lau
      This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
      SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
      the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
      when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
      The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
      handle this case and return NULL immediately instead of continuing the
      sk search from its hashtable.
      
      It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
      BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
      if the attaching bpf prog is in the new SK_REUSEPORT or the existing
      SOCKET_FILTER type and then check different things accordingly.
      
      One level of "__reuseport_attach_prog()" call is removed.  The
      "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
      back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
      seems to have more knowledge on those test requirements than filter.c.
      In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
      the old_prog (if any) is also directly freed instead of returning the
      old_prog to the caller and asking the caller to free.
      
      The sysctl_optmem_max check is moved back to the
      "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
      As with other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
      bounded by the usual "bpf_prog_charge_memlock()" at load time,
      instead of by both bpf_prog_charge_memlock and sysctl_optmem_max.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
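      Attaching such a program reuses the existing option; a user-space
      sketch (`prog_fd` is assumed to be a loaded
      BPF_PROG_TYPE_SK_REUSEPORT program, `addr` a prepared local
      address):

          int one = 1;

          /* The socket joins a reuseport group before the program is attached. */
          setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
          bind(fd, (struct sockaddr *)&addr, sizeof(addr));

          /* Same setsockopt as classic reuseport BPF; with this patch it
           * also accepts BPF_PROG_TYPE_SK_REUSEPORT program fds. */
          setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
                     &prog_fd, sizeof(prog_fd));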
    • bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      By Martin KaFai Lau
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non-SK_FILTER/CGROUP_SKB program types, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At SO_REUSEPORT sk lookup time, we are in the middle of transiting
      from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp).  At this
      point, it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving and restoring the cb[].  Even
      putting aside the differences between ipv4-vs-ipv6 and udp-vs-tcp,
      it is not clear whether the lower layer will only ever be ipv4 and
      ipv6, nor whether it will refrain from touching the cb[] again
      before transiting to the upper layer.
      
      For example, udp_gro_receive() uses the 48-byte NAPI_GRO_CB
      instead of IP[6]CB, and it may still modify the cb[] after calling
      udp[46]_lib_lookup_skb().  For the above reasons, if skb->cb were
      used for the bpf ctx, saving-and-restoring would be needed, and
      likely the whole 48-byte cb[] would have to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also inline with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bound to a local
      INANY address, which cannot be learned from the skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it has, it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
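      A sketch of such a program in libbpf style (the map name and the
      hash-based key choice are illustrative; the map type's UAPI
      spelling is BPF_MAP_TYPE_REUSEPORT_SOCKARRAY):

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>

          struct {
                  __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
                  __uint(max_entries, 16);
                  __type(key, __u32);
                  __type(value, __u64);
          } reuseport_map SEC(".maps");

          SEC("sk_reuseport")
          int select_sock(struct sk_reuseport_md *md)
          {
                  __u32 key = md->hash % 16;  /* pick a slot from the flow hash */

                  /* Selects the sk stored at 'key'; non-zero means no usable sk
                   * (e.g. reuseport_id mismatch), so tell the kernel to drop. */
                  if (bpf_sk_select_reuseport(md, &reuseport_map, &key, 0))
                          return SK_DROP;
                  return SK_PASS;             /* kernel uses the selected sk */
          }

          char _license[] SEC("license") = "GPL";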
  10. 07 Jul 2018, 2 commits
    • ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6 · 678ca42d
      By Willem de Bruijn
      skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
      after modification by __sock_cmsg_send, by calling sock_tx_timestamp.
      
      The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
      individual protocols that support tx timestamps call this function
      and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
      called in __ip6_append_data.
      
      There is no need to store both tx_flags and ts_flags in the cookie,
      as one is derived from the other. Convert when setting up the cork
      and remove the redundant field. This makes IPv4 similar to IPv6:
      the conversion happens only once per datagram, in ip(6)_setup_cork
      (see the sketch after this entry).
      
      Also change __ip6_append_data to match __ip_append_data. Only update
      tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
      redundant: only valid protocols can have non-zero cork->tx_flags.
      
      After this change the IPv4 and IPv6 logic is the same.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
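      The single conversion point then looks roughly like this (a sketch
      of the cork-setup step; surrounding code elided):

          /* ip_setup_cork() / ip6_setup_cork(): derive tx_flags once per
           * datagram from the (possibly cmsg-modified) timestamp flags. */
          sock_tx_timestamp(sk, ipc->sockc.tsflags, &cork->tx_flags);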
    • ipv4: ipcm_cookie initializers · 35178206
      By Willem de Bruijn
      Initialize the cookie in one location to reduce code duplication and
      avoid bugs from inconsistent initialization, such as that fixed in
      commit 9887cba1 ("ip: limit use of gso_size to udp").
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
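      The initializers take roughly this shape (a sketch based on the
      description above; the exact field set is abbreviated):

          /* One place for the cookie defaults... */
          static inline void ipcm_init(struct ipcm_cookie *ipcm)
          {
                  *ipcm = (struct ipcm_cookie) { .tos = -1 };
          }

          /* ...and one for inheriting per-socket defaults on top of them. */
          static inline void ipcm_init_sk(struct ipcm_cookie *ipcm,
                                          const struct inet_sock *inet)
          {
                  ipcm_init(ipcm);
                  ipcm->sockc.tsflags = inet->sk.sk_tsflags;
                  ipcm->oif = inet->sk.sk_bound_dev_if;
                  ipcm->addr = inet->inet_saddr;
          }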
  11. 04 Jul 2018, 1 commit
  12. 29 Jun 2018, 1 commit
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      By Linus Torvalds
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 09 Jun 2018, 1 commit
    • udp: fix rx queue len reported by diag and proc interface · 6c206b20
      By Paolo Abeni
      After commit 6b229cf7 ("udp: add batching to udp_rmem_release()"),
      the sk_rmem_alloc field no longer measures the receive queue length
      exactly, because we batch the rmem release. The issue became really
      apparent only after commit 0d4a6608 ("udp: do rmem bulk free even if
      the rx sk queue is empty"): user space can easily observe an empty
      socket with a non-zero queue length reported by the 'ss' tool or the
      procfs interface.
      
      We need a custom UDP helper to report the correct queue length,
      taking the forward allocation deficit into account (sketched below).
      
      Reported-by: trevor.francis@46labs.com
      Fixes: 6b229cf7 ("UDP: add batching to udp_rmem_release()")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
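      The helper amounts to subtracting the not-yet-released batch from
      the raw counter; a sketch (assuming the forward_deficit field
      introduced by the batching commit):

          /* Queue length as diag/proc should report it: raw rmem minus the
           * bytes already consumed but not yet returned by the batched release. */
          static int udp_rqueue_get(struct sock *sk)
          {
                  return sk_rmem_alloc_get(sk) -
                         READ_ONCE(udp_sk(sk)->forward_deficit);
          }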
  14. 05 Jun 2018, 1 commit
  15. 28 May 2018, 1 commit
    • bpf: Hooks for sys_sendmsg · 1cedee13
      By Andrey Ignatov
      In addition to already existing BPF hooks for sys_bind and sys_connect,
      the patch provides new hooks for sys_sendmsg.
      
      It leverages the existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
      that provides access to the socket itself (properties like family, type,
      protocol) and the user-passed `struct sockaddr *`, so that a BPF program
      can override the destination IP and port for system calls such as
      sendto(2) or sendmsg(2) and/or assign a source IP to the socket.
      
      The hooks are implemented as two new attach types:
      `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
      UDPv6 respectively.
      
      UDPv4 and UDPv6 get separate attach types for the same reason as the
      sys_bind and sys_connect hooks, i.e. to prevent reading from / writing
      to e.g. user_ip6 fields when the user passes a sockaddr_in, since that
      would be out-of-bounds.
      
      The difference from the already existing hooks is that the sys_sendmsg
      hooks are implemented only for unconnected UDP.
      
      For TCP it doesn't make sense to change user-provided `struct sockaddr *`
      at sendto(2)/sendmsg(2) time since socket either was already connected
      and has source/destination set or wasn't connected and call to
      sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
      
      Connected UDP is already handled by sys_connect hooks that can override
      source/destination at connect time and use fast-path later, i.e. these
      hooks don't affect UDP fast-path.
      
      Rewriting the source IP is implemented differently than in the
      sys_connect hooks. When sys_sendmsg is used with unconnected UDP, it
      doesn't work to just bind the socket to a desired local IP address,
      since the source IP can be set on a per-packet basis using ancillary
      data (cmsg(3)). So no matter whether the socket is bound or not, the
      source IP has to be rewritten on every call to sys_sendmsg.
      
      To do so, two new fields are added to the UAPI `struct bpf_sock_addr`:
      * `msg_src_ip4` to set the source IPv4 for UDPv4;
      * `msg_src_ip6` to set the source IPv6 for UDPv6.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
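      A sketch of a BPF_CGROUP_UDP4_SENDMSG program using these fields
      (addresses are illustrative; ctx values are in network byte order):

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_endian.h>

          SEC("cgroup/sendmsg4")
          int rewrite_dst(struct bpf_sock_addr *ctx)
          {
                  /* Transparently redirect 10.0.0.1:53 to 10.0.0.2:5353 and
                   * pin the source IP to 10.0.0.10 for every sendmsg/sendto. */
                  if (ctx->user_ip4 == bpf_htonl(0x0a000001) &&
                      ctx->user_port == bpf_htons(53)) {
                          ctx->user_ip4    = bpf_htonl(0x0a000002);
                          ctx->user_port   = bpf_htons(5353);
                          ctx->msg_src_ip4 = bpf_htonl(0x0a00000a);
                  }
                  return 1;   /* allow the call to proceed */
          }

          char _license[] SEC("license") = "GPL";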
  16. 26 May 2018, 1 commit
  17. 24 May 2018, 1 commit
  18. 16 May 2018, 2 commits
  19. 12 May 2018, 1 commit
  20. 11 May 2018, 2 commits
  21. 09 May 2018, 1 commit
  22. 02 May 2018, 1 commit
    • udp: disable gso with no_check_tx · a8c744a8
      By Willem de Bruijn
      Syzbot managed to send a udp gso packet without checksum offload into
      the gso stack by disabling tx checksumming (UDP_NO_CHECK6_TX). This
      triggered the skb_warn_bad_offload warning.
      
        RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
         skb_gso_segment include/linux/netdevice.h:4038 [inline]
         validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
         __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
         dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
      
      UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
      udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
      catch and fail in this case.
      
      After the integrity checks jump directly to the CHECKSUM_PARTIAL case
      to avoid reading the no_check_tx flags again (a TOCTTOU race).
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
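      The extended check amounts to the following shape in the
      udp_send_skb() gso branch (a sketch; the surrounding length checks
      are elided):

          if (cork->gso_size) {
                  /* A gso skb must still carry checksum offload; UDP_NO_CHECK6_TX
                   * would already have downgraded ip_summed by this point. */
                  if (skb->ip_summed != CHECKSUM_PARTIAL)
                          return -EIO;

                  /* ... existing gso integrity checks ... */

                  goto csum_partial;  /* don't re-read no_check_tx (TOCTTOU) */
          }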
  23. 27 Apr 2018, 3 commits
    • udp: add gso segment cmsg · 2e8de857
      By Willem de Bruijn
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
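      A user-space sketch of the per-call variant (the cmsg payload is a
      16-bit segment size; `fd` is a connected UDP socket and the
      fallback define mirrors the UAPI value):

          #include <netinet/in.h>
          #include <stdint.h>
          #include <string.h>
          #include <sys/socket.h>

          #ifndef UDP_SEGMENT
          #define UDP_SEGMENT 103
          #endif

          static ssize_t send_gso(int fd, const void *payload, size_t len,
                                  uint16_t seg_size)
          {
                  union {
                          char buf[CMSG_SPACE(sizeof(uint16_t))];
                          struct cmsghdr align;
                  } ctrl = { 0 };
                  struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
                  struct msghdr msg = {
                          .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf,
                          .msg_controllen = sizeof(ctrl.buf),
                  };
                  struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

                  cm->cmsg_level = IPPROTO_UDP;   /* SOL_UDP */
                  cm->cmsg_type  = UDP_SEGMENT;
                  cm->cmsg_len   = CMSG_LEN(sizeof(uint16_t));
                  memcpy(CMSG_DATA(cm), &seg_size, sizeof(seg_size));

                  /* The kernel slices 'len' bytes into seg_size-sized datagrams. */
                  return sendmsg(fd, &msg, 0);
          }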
    • udp: generate gso with UDP_SEGMENT · bec1f6f6
      By Willem de Bruijn
      Support generic segmentation offload for udp datagrams. Callers can
      concatenate the payloads of multiple datagrams with the same
      destination and send them at once.
      
      To set segment size, the caller sets socket option UDP_SEGMENT to the
      length of each discrete payload. This value must be smaller than or
      equal to the relevant MTU.
      
      A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
      per send call basis.
      
      Total byte length may then exceed MTU. If not an exact multiple of
      segment size, the last segment will be shorter.
      
      The implementation adds a gso_size field to the udp socket, ip(v6)
      cmsg cookie and inet_cork structure to be able to set the value at
      setsockopt or cmsg time and to work with both lockless and corked
      paths.
      
      Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
      
          tcp tso
           3197 MB/s 54232 msg/s 54232 calls/s
               6,457,754,262      cycles
      
          tcp gso
           1765 MB/s 29939 msg/s 29939 calls/s
              11,203,021,806      cycles
      
          tcp without tso/gso *
            739 MB/s 12548 msg/s 12548 calls/s
              11,205,483,630      cycles
      
          udp
            876 MB/s 14873 msg/s 624666 calls/s
              11,205,777,429      cycles
      
          udp gso
           2139 MB/s 36282 msg/s 36282 calls/s
              11,204,374,561      cycles
      
         [*] after reverting commit 0a6b2a1d
             ("tcp: switch to GSO being always on")
      
      Measured total system cycles ('-a') for one core while pinning both
      the network receive path and benchmark process to that core:
      
        perf stat -a -C 12 -e cycles \
          ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
      
      Note the reduction in calls/s with GSO. Bytes per syscall
      increases from 1470 to 61818.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
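      Per-socket usage is a single option; a sketch (`fd` is a connected
      UDP socket and 1400 must fit within the path MTU):

          #ifndef UDP_SEGMENT
          #define UDP_SEGMENT 103     /* UAPI value, for older headers */
          #endif

          int gso = 1400;             /* wire size of each segment */

          setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso, sizeof(gso));

          /* One syscall, many datagrams: 63000 bytes become 45 segments. */
          send(fd, buf, 63000, 0);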
    • udp: expose inet cork to udp · 1cd7884d
      By Willem de Bruijn
      UDP segmentation offload needs access to inet_cork in the udp layer.
      Pass the struct to ip(6)_make_skb instead of allocating it on the
      stack in that function itself.
      
      This patch is a noop otherwise.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 31 Mar 2018, 1 commit
    • bpf: Hooks for sys_connect · d74bad4e
      By Andrey Ignatov
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the 2nd
      part of the problem: making an outgoing connection from a desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      The local end of the connection can be bound to a desired IP using the
      newly introduced BPF helper `bpf_bind()`. It only allows binding to an
      IP, though, and doesn't support binding to a port, i.e. it leverages
      the `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for the remote end (the `struct sockaddr *` passed by the user),
      both parts of it can be overridden: remote IP and remote port. This is
      useful if an application inside a cgroup wants to connect to another
      application inside the same cgroup, or to itself, but knows nothing
      about the IP assigned to the cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for the same reason as the
      sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6
      fields when the user passes a sockaddr_in, since that would be
      out-of-bounds.
      
      == Implementation notes ==
      
      The patch introduces a new field in `struct proto`: `pre_connect`, a
      pointer to a function with the same signature as `connect` that is
      called before it. The reason is that in some cases BPF hooks should be
      called well before control is passed to `sk->sk_prot->connect`.
      Specifically, `inet_dgram_connect` autobinds the socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at the desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to an IP and the autobind in `inet_dgram_connect`
      binds only the port, there is no chance of a double bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind only to an IP,
      regardless of the value of the `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
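      A sketch of a BPF_CGROUP_INET4_CONNECT program using `bpf_bind()`
      (the address is illustrative):

          #include <linux/bpf.h>
          #include <linux/in.h>
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_endian.h>

          #ifndef AF_INET
          #define AF_INET 2
          #endif

          SEC("cgroup/connect4")
          int pin_source(struct bpf_sock_addr *ctx)
          {
                  /* Bind the local end to 10.0.0.10; port 0 defers port selection
                   * (IP_BIND_ADDRESS_NO_PORT semantics, as described above). */
                  struct sockaddr_in sa = {
                          .sin_family = AF_INET,
                          .sin_addr.s_addr = bpf_htonl(0x0a00000a),
                  };

                  if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)))
                          return 0;   /* reject the connect on failure */

                  /* ctx->user_ip4 / ctx->user_port could be rewritten here too. */
                  return 1;
          }

          char _license[] SEC("license") = "GPL";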
  25. 28 Mar 2018, 1 commit
  26. 27 Mar 2018, 1 commit
  27. 24 Mar 2018, 1 commit
  28. 17 Mar 2018, 1 commit
  29. 17 Feb 2018, 1 commit
    • udplite: fix partial checksum initialization · 15f35d49
      By Alexey Kodanev
      Since UDP-Lite always uses checksums, the following path is
      triggered when calculating the pseudo header for it:
      
        udp4_csum_init() or udp6_csum_init()
          skb_checksum_init_zero_check()
            __skb_checksum_validate_complete()
      
      The problem can appear if skb->len is less than CHECKSUM_BREAK. In
      this particular case __skb_checksum_validate_complete() also invokes
      __skb_checksum_complete(skb). If UDP-Lite uses a partial checksum
      that covers only part of the packet, the function will report a bad
      checksum and the packet will be dropped.
      
      It can be fixed if we skip skb_checksum_init_zero_check() and only
      set the required pseudo header checksum for UDP-Lite with a partial
      checksum, before the udp4_csum_init()/udp6_csum_init() functions
      return (see the sketch after this entry).
      
      Fixes: ed70fcfc ("net: Call skb_checksum_init in IPv4")
      Fixes: e4f45b7f ("net: Call skb_checksum_init in IPv6")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
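      A sketch of the fix's idea in udp4_csum_init() (the partial-coverage
      test is hypothetical shorthand; the real patch derives it from the
      UDP-Lite header before the generic path runs):

          /* UDP-Lite with partial coverage: seed skb->csum with just the
           * pseudo header and return early, skipping
           * skb_checksum_init_zero_check() and its full-packet validation. */
          if (proto == IPPROTO_UDPLITE && partial_coverage) {
                  skb->csum = inet_compute_pseudo(skb, proto);
                  return 0;
          }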
  30. 13 Feb 2018, 1 commit
    • net: Convert pernet_subsys, registered from inet_init() · f84c6821
      By Kirill Tkhai
      arp_net_ops just adds/removes a /proc entry.
      
      devinet_ops allocates and frees a duplicate of the init_net tables
      and (un)registers sysctl entries.
      
      fib_net_ops allocates and frees pernet tables, creates/destroys
      netlink socket and (un)initializes /proc entries. Foreign
      pernet_operations do not touch them.
      
      ip_rt_proc_ops only modifies pernet /proc entries.
      
      xfrm_net_ops creates/destroys /proc entries, allocates/frees
      pernet statistics, hashes and tables, and (un)initializes
      sysctl files. These are not touched by foreign pernet_operations.
      
      xfrm4_net_ops allocates/frees private pernet memory, and
      configures sysctls.
      
      sysctl_route_ops creates/destroys sysctls.
      
      rt_genid_ops only initializes fields of just allocated net.
      
      ipv4_inetpeer_ops allocated/frees net private memory.
      
      igmp_net_ops just creates/destroys /proc files and a socket that
      no one else is interested in.
      
      tcp_sk_ops seems to be safe, because tcp_sk_init() does not
      depend on any other pernet_operations' modifications. Iteration
      over the hash table in inet_twsk_purge() is done under the RCU
      lock, and it's safe to iterate the table this way. Removal from
      the table happens in inet_twsk_deschedule_put(), but this
      function is safe without any external locks, as it is synchronized
      internally. There are many examples; it's used in different
      contexts. So, it's safe to leave tcp_sk_exit_batch() unlocked.
      
      tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
      
      udplite4_net_ops only creates/destroys pernet /proc file.
      
      icmp_sk_ops creates percpu sockets, not touched by foreign
      pernet_operations.
      
      ipmr_net_ops creates/destroys pernet fib tables, (un)registers
      fib rules and /proc files. This seem to be safe to execute
      in parallel with foreign pernet_operations.
      
      af_inet_ops just sets up default parameters of newly created net.
      
      ipv4_mib_ops creates and destroys pernet percpu statistics.
      
      raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
      and ip_proc_ops only create/destroy pernet /proc files.
      
      ip4_frags_ops creates and destroys sysctl file.
      
      So, it's safe to make the pernet_operations async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
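      The conversion itself is mechanical; per reviewed ops struct it
      amounts to setting the async flag (a sketch using udplite4_net_ops;
      the `.async` member matches the pernet_operations API of this
      development cycle):

          static struct pernet_operations udplite4_net_ops = {
                  .init  = udplite4_proc_init_net,
                  .exit  = udplite4_proc_exit_net,
                  .async = true,  /* safe: only touches its own /proc entries */
          };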