1. 27 Feb 2019 (2 commits)
  2. 25 Feb 2019 (5 commits)
  3. 22 Feb 2019 (2 commits)
  4. 18 Feb 2019 (5 commits)
  5. 17 Feb 2019 (1 commit)
    • sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values · 4057765f
      Authored by Guillaume Nault
      SO_SNDBUF and SO_RCVBUF (and their *BUFFORCE versions) may overflow or
      underflow their input value. This patch provides explicit handling of
      these extreme cases, to get a clear behaviour even with values bigger
      than INT_MAX / 2 or lower than INT_MIN / 2.
      
      For simplicity, only SO_SNDBUF and SO_SNDBUFFORCE are described here,
      but the same explanation and fix apply to SO_RCVBUF and SO_RCVBUFFORCE
      (with 'SNDBUF' replaced by 'RCVBUF' and 'wmem_max' by 'rmem_max').
      
      Overflow of positive values
      ===========================
      
      When handling SO_SNDBUF or SO_SNDBUFFORCE, if 'val' exceeds
      INT_MAX / 2, the buffer size is set to its minimum value because
      'val * 2' overflows, and max_t() considers that it's smaller than
      SOCK_MIN_SNDBUF. For SO_SNDBUF, this can only happen with
      net.core.wmem_max > INT_MAX / 2.
      
      SO_SNDBUF and SO_SNDBUFFORCE are actually designed to let users probe
      for the maximum buffer size by setting an arbitrarily large number that
      gets capped to the maximum allowed/possible size. Having the upper
      half of the positive integer space instead shrink the buffer size to
      its minimum value defeats this purpose.
      
      This patch caps the base value to INT_MAX / 2, so that bigger values
      no longer overflow and instead keep setting the buffer size to its
      maximum.
      
      Underflow of negative values
      ============================
      
      For negative numbers, SO_SNDBUF always considers them bigger than
      net.core.wmem_max, which is bounded by [SOCK_MIN_SNDBUF, INT_MAX].
      Therefore such values are set to net.core.wmem_max and we're back to
      the behaviour of positive integers described above (return maximum
      buffer size if wmem_max <= INT_MAX / 2, return SOCK_MIN_SNDBUF
      otherwise).
      
      However, SO_SNDBUFFORCE behaves differently. The user value is
      directly multiplied by two and compared with SOCK_MIN_SNDBUF. If
      'val * 2' doesn't underflow, or if it underflows to a value smaller
      than SOCK_MIN_SNDBUF, then the buffer size is set to its minimum
      value. Otherwise the buffer size is set to the underflowed value.
      
      This patch treats negative values passed to SO_SNDBUFFORCE as zero, to
      prevent underflows. Therefore negative values now always set the
      buffer size to its minimum value.
      
      Even though SO_SNDBUF behaves inconsistently by setting buffer size to
      the maximum value when passed a negative number, no attempt is made to
      modify this behaviour. There may exist some programs that rely on using
      negative numbers to set the maximum buffer size. Avoiding overflows
      because of extreme net.core.wmem_max values is the most we can do here.
      
      Summary of altered behaviours
      =============================
      
      val      : user-space value passed to setsockopt()
      val_uf   : the underflowed value resulting from doubling val when
                 val < INT_MIN / 2
      wmem_max : short for net.core.wmem_max
      val_cap  : min(val, wmem_max)
      min_len  : minimal buffer length (that is, SOCK_MIN_SNDBUF)
      max_len  : maximal possible buffer length, regardless of wmem_max (that
                 is, INT_MAX - 1)
      ^^^^     : altered behaviour
      
      SO_SNDBUF:
      +-------------------------+-------------+------------+----------------+
      |       CONDITION         | OLD RESULT  | NEW RESULT |    COMMENT     |
      +-------------------------+-------------+------------+----------------+
      | val < 0 &&              |             |            | No overflow,   |
      | wmem_max <= INT_MAX/2   | wmem_max*2  | wmem_max*2 | keep original  |
      |                         |             |            | behaviour      |
      +-------------------------+-------------+------------+----------------+
      | val < 0 &&              |             |            | Cap wmem_max   |
      | INT_MAX/2 < wmem_max    | min_len     | max_len    | to prevent     |
      |                         |             | ^^^^^^^    | overflow       |
      +-------------------------+-------------+------------+----------------+
      | 0 <= val <= min_len/2   | min_len     | min_len    | Ordinary case  |
      +-------------------------+-------------+------------+----------------+
      | min_len/2 < val &&      | val_cap*2   | val_cap*2  | Ordinary case  |
      | val_cap <= INT_MAX/2    |             |            |                |
      +-------------------------+-------------+------------+----------------+
      | min_len/2 < val &&      |             |            | Cap val_cap    |
      | INT_MAX/2 < val_cap     | min_len     | max_len    | again to       |
      | (implies that           |             | ^^^^^^^    | prevent        |
      | INT_MAX/2 < wmem_max)   |             |            | overflow       |
      +-------------------------+-------------+------------+----------------+
      
      SO_SNDBUFFORCE:
      +------------------------------+---------+---------+------------------+
      |          CONDITION           | BEFORE  | AFTER   |     COMMENT      |
      |                              | PATCH   | PATCH   |                  |
      +------------------------------+---------+---------+------------------+
      | val < INT_MIN/2 &&           | min_len | min_len | Underflow with   |
      | val_uf <= min_len            |         |         | no consequence   |
      +------------------------------+---------+---------+------------------+
      | val < INT_MIN/2 &&           | val_uf  | min_len | Set val to 0 to  |
      | val_uf > min_len             |         | ^^^^^^^ | avoid underflow  |
      +------------------------------+---------+---------+------------------+
      | INT_MIN/2 <= val < 0         | min_len | min_len | No underflow     |
      +------------------------------+---------+---------+------------------+
      | 0 <= val <= min_len/2        | min_len | min_len | Ordinary case    |
      +------------------------------+---------+---------+------------------+
      | min_len/2 < val <= INT_MAX/2 | val*2   | val*2   | Ordinary case    |
      +------------------------------+---------+---------+------------------+
      | INT_MAX/2 < val              | min_len | max_len | Cap val to       |
      |                              |         | ^^^^^^^ | prevent overflow |
      +------------------------------+---------+---------+------------------+
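      
      A minimal C sketch of the resulting setsockopt() logic (hedged: the
      label name and surrounding flow are illustrative, not the exact diff):
      
      	case SO_SNDBUF:
      		/* Negative values wrap to huge u32s, so they too end up
      		 * capped to wmem_max here (unchanged behaviour).
      		 */
      		val = min_t(u32, val, sysctl_wmem_max);
      set_sndbuf:
      		/* Cap to INT_MAX / 2 so that 'val * 2' below cannot
      		 * overflow and fall through to SOCK_MIN_SNDBUF.
      		 */
      		val = min_t(int, val, INT_MAX / 2);
      		sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
      		sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
      		break;
      
      	case SO_SNDBUFFORCE:
      		/* Treat negative values as 0 to prevent 'val * 2' underflow. */
      		if (val < 0)
      			val = 0;
      		goto set_sndbuf;
      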
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 16 Feb 2019 (1 commit)
    • net: Fix for_each_netdev_feature on Big endian · 3b89ea9c
      Authored by Hauke Mehrtens
      The features attribute is of type u64 and stored in the native
      endianness of the system. The for_each_set_bit() macro takes a pointer
      to a 32-bit array and iterates over the bits in that area. On
      little-endian systems this also works for a u64, as the most
      significant bits sit at the highest address, but on big-endian systems
      the two 32-bit words are swapped: when we expect bit 15 here, we get
      bit 47 (15 + 32).
      
      This patch replaces it with, more or less, its own for_each_set_bit()
      implementation that works directly on 64-bit integers. That keeps
      everything in host endianness and works as expected.
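      
      A minimal user-space sketch of the idea, iterating the set bits of a
      u64 directly so that word order never matters (handle_feature() is a
      hypothetical stand-in for the loop body):
      
      	#include <stdint.h>
      
      	static void for_each_feature_bit(uint64_t features,
      					 void (*handle_feature)(int bit))
      	{
      		while (features) {
      			/* Lowest set bit; endian-independent on a 64-bit value. */
      			int bit = __builtin_ctzll(features);
      
      			handle_feature(bit);
      			features &= features - 1;	/* clear that bit */
      		}
      	}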
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Signed-off-by: Hauke Mehrtens <hauke.mehrtens@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 15 Feb 2019 (3 commits)
  8. 14 Feb 2019 (7 commits)
    • page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappings · 13f16d9d
      Authored by Jesper Dangaard Brouer
      As pointed out by Alexander Duyck, the DMA mapping done in page_pool
      needs to use the DMA attribute DMA_ATTR_SKIP_CPU_SYNC, because the
      principle behind page_pool keeping the pages mapped is that the
      driver takes over the DMA-sync steps.
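      
      A hedged sketch of such a mapping call (the pool parameter names are
      illustrative, mirroring how page_pool is described here):
      
      	dma_addr_t dma;
      
      	/* Map without an implicit CPU sync; the driver syncs explicitly
      	 * around each use of the page instead.
      	 */
      	dma = dma_map_page_attrs(pool->p.dev, page, 0,
      				 PAGE_SIZE << pool->p.order,
      				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
      	if (dma_mapping_error(pool->p.dev, dma))
      		return NULL;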
      Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: page_pool: don't use page->private to store dma_addr_t · 1567b85e
      Authored by Ilias Apalodimas
      As pointed out by David Miller, the current page_pool implementation
      stores a dma_addr_t in page->private. This won't work on 32-bit
      platforms with 64-bit DMA addresses, since page->private is an
      unsigned long while dma_addr_t is a u64.
      
      A previous patch added a dma_addr_t field to struct page to
      accommodate this. This patch adapts the page_pool related functions
      to use the newly added field for storing and retrieving DMA addresses
      from network drivers.
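      
      A hedged sketch of the resulting accessors (helper names are
      illustrative; the patch may simply access page->dma_addr directly):
      
      	static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
      	{
      		return page->dma_addr;	/* new field, wide enough for u64 DMA */
      	}
      
      	static inline void page_pool_set_dma_addr(struct page *page,
      						  dma_addr_t addr)
      	{
      		page->dma_addr = addr;
      	}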
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix possible overflow in __sk_mem_raise_allocated() · 5bf325a5
      Authored by Eric Dumazet
      With many active TCP sockets, fat TCP sockets could fool
      __sk_mem_raise_allocated() thanks to an overflow.
      
      They would increase their share of the memory, instead
      of decreasing it.
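      
      A hedged sketch of the overflow pattern being fixed (names follow the
      surrounding kernel code; the exact diff may differ): widening 'alloc'
      to 64 bits keeps the product below from wrapping around:
      
      	u64 alloc = sk_sockets_allocated_read_positive(sk);
      
      	/* With a 32-bit 'alloc', this product could overflow and wrongly
      	 * pass the admission check for a socket with huge queues.
      	 */
      	if (sk_prot_mem_limits(sk, 2) > alloc * sk_mem_pages(size))
      		return 1;	/* still within the per-protocol limit */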
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c · 3bd0b152
      Authored by Peter Oskolkov
      This patch builds on top of the previous patch in the patchset,
      which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the
      encapping can result in the skb needing to go via a different
      interface/route/dst, bpf programs can indicate this by returning
      BPF_LWT_REROUTE, which triggers a new route lookup for the skb.
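      
      A hedged sketch of a program using the new return code (the section
      name and header construction are illustrative):
      
      	SEC("lwt_in")
      	int do_encap_and_reroute(struct __sk_buff *skb)
      	{
      		struct iphdr hdr = {};	/* outer header, filled by the program */
      
      		if (bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr, sizeof(hdr)))
      			return BPF_DROP;
      
      		/* The outer header may change the egress path; ask the
      		 * stack to do a fresh route lookup for this skb.
      		 */
      		return BPF_LWT_REROUTE;
      	}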
      
      v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but
         IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N,
         call IPV6 routing functions only if they are built-in.
      
      v9 changes:
         - fixed a kbuild test robot compiler warning;
         - call IPV6 routing functions via ipv6_stub.
      
      v10 changes: removed unnecessary IS_ENABLED and pr_warn_once.
      
      v11 changes: fixed a potential dst leak.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: handle GSO in bpf_lwt_push_encap · ca78801a
      Authored by Peter Oskolkov
      This patch adds handling of GSO packets in bpf_lwt_push_ip_encap()
      (called from bpf_lwt_push_encap):
      
      * IPIP, GRE, and UDP encapsulation types are deduced by looking
        into iphdr->protocol or ipv6hdr->nexthdr;
      * SCTP GSO packets are not supported (as bpf_skb_proto_4_to_6
        and similar do);
      * UDP_L4 GSO packets are also not supported (although they are
        not blocked in bpf_skb_proto_4_to_6 and similar), as
        skb_decrease_gso_size() would break them;
      * the SKB_GSO_DODGY bit is set.
      
      Note: it may be possible to support SCTP and UDP_L4 GSO packets,
            but as these cases do not seem to be well handled by other
            tunneling/encapping code paths either, the solution should
            be generic enough to apply to all tunneling/encapping code.
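      
      A hedged fragment of what the TCP-only whitelist could look like (not
      the exact patch):
      
      	if (skb_is_gso(skb)) {
      		/* Whitelist TCP GSO; SCTP, UDP_L4 and anything newer is
      		 * rejected rather than silently mis-segmented.
      		 */
      		if (!(skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
      			return -ENOTSUPP;
      
      		/* Have the GSO parameters revalidated on the way out. */
      		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
      	}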
      
      v8 changes:
         - make sure that if GRE or UDP encap is detected, there is
           enough of pushed bytes to cover both IP[v6] + GRE|UDP headers;
         - do not reject double-encapped packets;
         - whitelist TCP GSO packets rather than block SCTP GSO and
           UDP GSO.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap · 52f27877
      Authored by Peter Oskolkov
      Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper.
      It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and
      BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
      to packets (e.g. IP/GRE, GUE, IPIP).
      
      This is useful when thousands of different short-lived flows should
      be encapped, each with a different, dynamically determined
      destination. Although lwtunnels can be used in some of these
      scenarios, the ability to dynamically generate encap headers adds
      more flexibility, e.g. when routing depends on the state of the host
      (reflected in global bpf maps).
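      
      A hedged usage sketch building an IPIP outer header (addresses are
      placeholders, bpf_htonl comes from bpf_endian.h; depending on the
      implementation the program may also need to fill tot_len and the
      checksum):
      
      	struct iphdr hdr = {
      		.version  = 4,
      		.ihl      = 5,			/* 20-byte header, no options */
      		.ttl      = 64,
      		.protocol = IPPROTO_IPIP,	/* inner packet is IPv4 */
      		.saddr    = bpf_htonl(0x0a000001),	/* 10.0.0.1, placeholder */
      		.daddr    = bpf_htonl(0x0a000002),	/* 10.0.0.2, placeholder */
      	};
      
      	if (bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr, sizeof(hdr)))
      		return BPF_DROP;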
      
      v7 changes:
       - added a call skb_clear_hash();
       - removed calls to skb_set_transport_header();
       - refuse to encap GSO-enabled packets.
      
      v8 changes:
       - fix build errors when LWT is not enabled.
      
      Note: the next patch in the patchset will deal with GSO-enabled
      packets, which are currently rejected when encap is attempted.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap · 3e0bd37c
      Authored by Peter Oskolkov
      This patch adds all the needed plumbing in preparation for allowing
      bpf programs to do IP encapsulation via bpf_lwt_push_encap. The
      actual implementation is added in the next patch in the patchset.
      
      Of note:
      - bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
        prog types in addition to BPF_PROG_TYPE_LWT_IN;
      - if the skb being encapped has GSO set, encapsulation is limited
        to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
      - as route lookups are different for ingress vs egress, the single
        external bpf_lwt_push_encap BPF helper is routed internally to
        either the bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap
        BPF_CALL, depending on prog type (see the sketch below).
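      
      A hedged sketch of that dispatch (simplified: the real BPF_CALLs also
      handle the existing seg6 encap modes, and the ingress flag on the
      internal bpf_lwt_push_ip_encap() is an assumed convention):
      
      	BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type,
      		   void *, hdr, u32, len)
      	{
      		if (type == BPF_LWT_ENCAP_IP)
      			return bpf_lwt_push_ip_encap(skb, hdr, len, true /* ingress */);
      		return -EINVAL;
      	}
      
      	BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type,
      		   void *, hdr, u32, len)
      	{
      		if (type == BPF_LWT_ENCAP_IP)
      			return bpf_lwt_push_ip_encap(skb, hdr, len, false /* egress */);
      		return -EINVAL;
      	}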
      
      v8 changes: fixed a typo.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. 13 Feb 2019 (2 commits)
  10. 12 Feb 2019 (3 commits)
  11. 11 Feb 2019 (5 commits)
    • bpf: only adjust gso_size on bytestream protocols · b90efd22
      Authored by Willem de Bruijn
      bpf_skb_change_proto and bpf_skb_adjust_room change skb header length.
      For GSO packets they adjust gso_size to maintain the same MTU.
      
      The gso size can only be safely adjusted on bytestream protocols.
      Commit d02f51cb ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat
      to deal with gso sctp skbs") excluded SKB_GSO_SCTP.
      
      Since then type SKB_GSO_UDP_L4 has been added, whose contents are one
      gso_size unit per datagram. Also exclude these.
      
      Move from a blacklist to a whitelist check to future-proof against
      additional such new GSO types, e.g., for fraglist based GRO.
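      
      A hedged sketch of such a whitelist helper (name and placement are
      illustrative):
      
      	static inline bool skb_is_gso_tcp(const struct sk_buff *skb)
      	{
      		/* Only bytestream GSO types may have gso_size adjusted;
      		 * datagram types (SCTP, UDP_L4, ...) would be corrupted.
      		 */
      		return skb_is_gso(skb) &&
      		       (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6));
      	}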
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock · 655a51e5
      Authored by Martin KaFai Lau
      This patch adds a helper function, BPF_FUNC_tcp_sock, which is
      currently available to cg_skb and sched_(cls|act) programs:
      
      struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_tcp_sock *tp;
      	struct bpf_sock *sk;
      	__u32 snd_cwnd;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	tp = bpf_tcp_sock(sk);
      	if (!tp)
      		return 1;
      
      	snd_cwnd = tp->snd_cwnd;
      	/* ... */
      
      	return 1;
      }
      
      A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
      read-only access.  bpf_tcp_sock has all the existing tcp_sock fields
      that have already been exposed by bpf_sock_ops,
      i.e. no new tcp_sock fields are exposed in bpf.h.
      
      This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
      or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
      returns NULL.  Hence, the caller needs to check for NULL before
      accessing it.
      
      The current use case is to expose members from tcp_sock
      to allow a cg_skb_bpf_prog to provide per cgroup traffic
      policing/shaping.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Refactor sock_ops_convert_ctx_access · 9b1f3d6e
      Authored by Martin KaFai Lau
      The next patch will introduce a new "struct bpf_tcp_sock" which
      exposes the same tcp_sock's fields already exposed in
      "struct bpf_sock_ops".
      
      This patch refactors the existing convert_ctx_access() code for
      "struct bpf_sock_ops" to get it ready to be reused for
      "struct bpf_tcp_sock".  The "rtt_min" field is not refactored
      in this patch because its handling is different from the other
      fields.
      
      The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD
      changes are code move only.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock · aa65d696
      Authored by Martin KaFai Lau
      This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the
      bpf_sock.  The userspace has already been using "state",
      e.g. inet_diag (ss -t) and getsockopt(TCP_INFO).
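      
      A hedged fragment showing the new fields from a program that already
      holds a valid "struct bpf_sock *sk" (bpf_htonl comes from
      bpf_endian.h; the address is a placeholder):
      
      	/* Only look at established connections. */
      	if (sk->state != BPF_TCP_ESTABLISHED)
      		return 1;
      
      	/* dst_ip4 is in network byte order, like the other IP fields. */
      	if (sk->dst_ip4 == bpf_htonl(0x0a000001))	/* 10.0.0.1 */
      		return 1;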
      
      This patch also allows narrow loads on the following existing fields:
      "family", "type", "protocol" and "src_port".  Unlike the IP address
      fields, the load offset is restricted to the first byte for them, but
      it can be relaxed later if there is a use case.
      
      This patch also folds __sock_filter_check_size() into
      bpf_sock_is_valid_access(), since it is not called
      anywhere else.  All bpf_sock checking is now in
      one place.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper · 46f8bc92
      Authored by Martin KaFai Lau
      In the kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
      before accessing the fields in the sock.  For example, in __netdev_pick_tx:
      
      static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
      			    struct net_device *sb_dev)
      {
      	/* ... */
      
      	struct sock *sk = skb->sk;
      
      		if (queue_index != new_index && sk &&
      		    sk_fullsock(sk) &&
      		    rcu_access_pointer(sk->sk_dst_cache))
      			sk_tx_queue_set(sk, new_index);
      
      	/* ... */
      
      	return queue_index;
      }
      
      This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
      where a few of the convert_ctx_access() in filter.c has already been
      accessing the skb->sk sock_common's fields,
      e.g. sock_ops_convert_ctx_access().
      
      "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
      Some of the fileds in "bpf_sock" will not be directly
      accessible through the "__sk_buff->sk" pointer.  It is limited
      by the new "bpf_sock_common_is_valid_access()".
      e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
           are not allowed.
      
      The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
      can be used to get a sk with all accessible fields in "bpf_sock".
      This helper is added to both cg_skb and sched_(cls|act).
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_sock *sk;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	sk = bpf_sk_fullsock(sk);
      	if (!sk)
      		return 1;
      
      	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
      		return 1;
      
      	/* some_traffic_shaping(); */
      
      	return 1;
      }
      
      (1) The sk is read-only.
      
      (2) There is no new "struct bpf_sock_common" introduced.
      
      (3) Future kernel sock's members could be added to bpf_sock only
          instead of repeatedly adding at multiple places like currently
          in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
      
      (4) After "sk = skb->sk", the reg holding sk is in type
          PTR_TO_SOCK_COMMON_OR_NULL.
      
      (5) After bpf_sk_fullsock(), the return type will be in type
          PTR_TO_SOCKET_OR_NULL which is the same as the return type of
          bpf_sk_lookup_xxx().
      
          However, bpf_sk_fullsock() does not take a refcnt.
          acquire_reference_state() currently depends only on the return
          type; to avoid taking a reference here, a new is_acquire_function()
          check is done before calling acquire_reference_state().
      
      (6) The WARN_ON in "release_reference_state()" is no longer an
          internal verifier bug.
      
          When reg->id is not found in state->refs[], it means the
          bpf_prog does something wrong like
          "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where a reference has
          never been acquired by calling "bpf_sk_fullsock(skb->sk)".
      
          A -EINVAL and a verbose message are returned instead of the
          WARN_ON.  A test is added to test_verifier in a later patch.
      
          Since the WARN_ON in "release_reference_state()" is no longer
          needed, "__release_reference_state()" is folded into
          "release_reference_state()" also.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  12. 09 Feb 2019 (2 commits)
  13. 08 Feb 2019 (2 commits)