1. 03 Jul, 2017 17 commits
  2. 02 Jul, 2017 23 commits
    • Merge branch 'bpf-Add-support-for-sock_ops' · bcdb239b
      David S. Miller committed
      Lawrence Brakmo says:
      
      ====================
      bpf: Add support for sock_ops
      
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.) and to set
      connection parameters such as buffer sizes, initial window, SYN/SYN-ACK
      RTOs, etc.
      
      Unlike current BPF program types that expect to be called at a particular
      place in the network stack code, a SOCK_OPS program can be called at
      different places and uses an "op" field to indicate the context. There
      are currently two types of operations: those whose effect is through
      their return value and those whose effect is through the new
      bpf_setsockopt BPF helper function.
      
      Example operations of the first type are:
        BPF_SOCK_OPS_TIMEOUT_INIT
        BPF_SOCK_OPS_RWND_INIT
        BPF_SOCK_OPS_NEEDS_ECN
      
      Example operations of the second type are:
        BPF_SOCK_OPS_TCP_CONNECT_CB
        BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB
        BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
      
      Current operations are only invoked during connection establishment, so
      there should not be any BPF overhead after connection establishment. The
      main idea is to use connection information from both hosts, such as IP
      addresses and ports, to allow setting per-connection parameters that
      optimize the connection's performance.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (e.g. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
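
The probabilistic evaluation idea above can be sketched in plain C. This is a user-space illustration, not code from the patch set: flow_hash() and in_experiment() are hypothetical names, and the hash is an FNV-1a style mix chosen only for brevity. A real BPF program could draw a random number instead of hashing the tuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: mix the connection 4-tuple into a 32-bit value
 * using an FNV-1a style byte-wise hash.
 */
static uint32_t flow_hash(uint32_t remote_ip4, uint32_t local_ip4,
                          uint32_t remote_port, uint32_t local_port)
{
        const uint32_t words[4] = { remote_ip4, local_ip4, remote_port, local_port };
        uint32_t h = 2166136261u;               /* FNV offset basis */

        for (int i = 0; i < 4; i++) {
                for (int shift = 0; shift < 32; shift += 8) {
                        h ^= (words[i] >> shift) & 0xffu;
                        h *= 16777619u;         /* FNV prime */
                }
        }
        return h;
}

/* Returns true for roughly `percent` out of every 100 flows.
 * The same flow always gets the same answer, so a connection
 * stays in one group for its whole lifetime.
 */
static bool in_experiment(uint32_t remote_ip4, uint32_t local_ip4,
                          uint32_t remote_port, uint32_t local_port,
                          unsigned int percent)
{
        assert(percent <= 100);
        return flow_hash(remote_ip4, local_ip4, remote_port, local_port) % 100 < percent;
}
```

With percent set to 10, flows for which in_experiment() returns true would get the alternative parameters and the rest would keep the defaults.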
      
      It uses the existing bpf cgroups infrastructure so the programs can be
      attached per cgroup with full inheritance support. Although the bpf cgroup
      framework already contains a sock related program type (BPF_PROG_TYPE_CGROUP_SOCK),
      I created the new type (BPF_PROG_TYPE_SOCK_OPS) because the existing type
      expects to be called only once during the connection's lifetime. In contrast,
      the new program type will be called multiple times from different places in the
      network stack code, for example before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set congestion
      control, etc. As a result it has an "op" field to specify the type of operation
      requested.
      
      This patch set also includes sample BPF programs to demonstrate the different
      features.
      
      v2: Formatting changes, rebased to latest net-next
      
      v3: Fixed build issues, changed socket_ops to sock_ops throughout,
          fixed formatting issues, removed the syscall to load sock_ops
          program and added functionality to use existing bpf attach and
          bpf detach system calls, removed reader/writer locks in
          sock_bpfops.c (used when saving sock_ops global program)
          and fixed missing module refcount increment.
      
      v4: Removed global sock_ops program and instead used existing cgroup bpf
          infrastructure to support a new BPF_CGROUP_ATTCH type.
      
      v5: fixed kbuild warning happening in bpf-cgroup.h
          removed automatic conversion to host byte order from some sock_ops
            fields (ipv4 and ipv6 addresses, remote port)
          Added conversion to host byte order in some of the sample programs
          Added to sample BPF program comments about using load_sock_ops to load
          Removed is_req_sock field from bpf_sock_ops_kern and related places,
            using sk_fullsock() instead.
      
      v6: fixes to BPF helper function setsockopt (possible NULL dereferencing, etc.)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: update tools/include/uapi/linux/bpf.h · 04df41e3
      Lawrence Brakmo committed
      Update tools/include/uapi/linux/bpf.h to include changes related to the new
      bpf sock_ops program type.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Sample bpf program to set sndcwnd clamp · 6c4a01b2
      Lawrence Brakmo committed
      Sample BPF program, tcp_clamp_kern.c, to demonstrate the use
      of setting the sndcwnd clamp. This program assumes that if the
      first 5.5 bytes of the hosts' IPv6 addresses are the same, then
      the hosts are in the same datacenter and sets sndcwnd clamp to
      100 packets, SYN and SYN-ACK RTOs to 10ms and send/receive buffer
      sizes to 150KB.
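
The 5.5-byte prefix test described above can be sketched as a standalone C function. This is an illustration under stated assumptions, not the code in tcp_clamp_kern.c: same_datacenter() and DC_PREFIX_WORD1_MASK are hypothetical names, and the addresses are assumed to be four 32-bit words in network byte order, as in the bpf_sock_ops struct. 5.5 bytes is 44 bits, so the check covers all of the first word plus the top 12 bits of the second.

```c
#include <arpa/inet.h>   /* ntohl */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Top 12 bits of the second 32-bit word (in host byte order). */
#define DC_PREFIX_WORD1_MASK 0xfff00000u

/* Both arrays hold an IPv6 address as four words in network byte order. */
static bool same_datacenter(const uint32_t local_ip6[4], const uint32_t remote_ip6[4])
{
        assert(local_ip6 != NULL && remote_ip6 != NULL);

        /* First 32 bits: equality does not depend on byte order. */
        if (local_ip6[0] != remote_ip6[0])
                return false;

        /* Next 12 bits: convert to host order before masking. */
        return (ntohl(local_ip6[1]) & DC_PREFIX_WORD1_MASK) ==
               (ntohl(remote_ip6[1]) & DC_PREFIX_WORD1_MASK);
}
```

Whether a shared 44-bit prefix really means "same datacenter" depends on the site's addressing plan; the commit message states it as an assumption of the sample.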
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Adds support for setting sndcwnd clamp · 13bf9641
      Lawrence Brakmo committed
      Adds a new bpf_setsockopt option for TCP sockets, TCP_BPF_SNDCWND_CLAMP,
      which sets the sndcwnd clamp (an upper bound on the send congestion
      window). It is useful to limit the sndcwnd when the hosts are close to
      each other (small RTT).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Sample BPF program to set initial cwnd · 7bc62e28
      Lawrence Brakmo committed
      Sample BPF program that assumes hosts are far away (i.e. large RTTs)
      and sets initial cwnd and initial receive window to 40 packets,
      send and receive buffers to 1.5MB.

      In practice there would be a test to ensure the hosts are actually
      far enough away.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Adds support for setting initial cwnd · fc747810
      Lawrence Brakmo committed
      Adds a new bpf_setsockopt option for TCP sockets, TCP_BPF_IW, which sets
      the initial congestion window. This can be used when the hosts are far
      apart (large RTTs) and it is safe to start with a large initial cwnd.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Sample BPF program to set congestion control · bb56d444
      Lawrence Brakmo committed
      Sample BPF program that sets congestion control to dctcp when both hosts
      are within the same datacenter. In this example that is assumed to be
      when the first 5.5 bytes of their IPv6 addresses are the same.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add support for changing congestion control · 91b5b21c
      Lawrence Brakmo committed
      Added support for changing congestion control for SOCK_OPS bpf
      programs through the setsockopt bpf helper function. It also adds
      a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
      congestion controls, like dctcp, that need to enable ECN in the
      SYN packets.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Sample BPF program to set buffer sizes · d9925368
      Lawrence Brakmo committed
      This patch contains a BPF program to set initial receive window to
      40 packets and send and receive buffers to 1.5MB. This would usually
      be done after doing appropriate checks that indicate the hosts are
      far enough away (i.e. large RTT).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add TCP connection BPF callbacks · 9872a4bd
      Lawrence Brakmo committed
      Added callbacks to BPF SOCK_OPS type program before an active
      connection is initialized and after a passive or active connection is
      established.

      The following patch demonstrates how they can be used to set send and
      receive buffer sizes.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add setsockopt helper function to bpf · 8c4b4c7e
      Lawrence Brakmo committed
      Added support for calling a subset of socket setsockopts from
      BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
      than making the changes to call the socket setsockopt function because
      the changes required would have been larger.

      The ops supported are:
        SO_RCVBUF
        SO_SNDBUF
        SO_MAX_PACING_RATE
        SO_PRIORITY
        SO_RCVLOWAT
        SO_MARK
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Sample bpf program to set initial window · c400296b
      Lawrence Brakmo committed
      The sample bpf program, tcp_rwnd_kern.c, sets the initial
      advertised window to 40 packets in an environment where
      distinct IPv6 prefixes indicate that both hosts are not
      in the same data center.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Support for setting initial receive window · 13d3b1eb
      Lawrence Brakmo committed
      This patch adds support for setting the initial advertised window from
      within a BPF_SOCK_OPS program. This can be used to support larger
      initial cwnd values in environments where it is known to be safe.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Sample bpf program to set SYN/SYN-ACK RTOs · 61bc4d8d
      Lawrence Brakmo committed
      The sample BPF program, tcp_synrto_kern.c, sets the SYN and SYN-ACK
      RTOs to 10ms when both hosts are within the same datacenter (i.e.
      small RTTs) in an environment where common IPv6 prefixes indicate
      both hosts are in the same data center.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Support for per connection SYN/SYN-ACK RTOs · 8550f328
      Lawrence Brakmo committed
      This patch adds support for setting per-connection SYN and
      SYN-ACK RTOs from within a BPF_SOCK_OPS program, for example
      to set small RTOs when it is known both hosts are within a
      datacenter.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: program to load and attach sock_ops BPF progs · ae16189e
      Lawrence Brakmo committed
      The program load_sock_ops can be used to load sock_ops bpf programs and
      to attach them to an existing (v2) cgroup. It can also be used to detach
      sock_ops programs.
      
      Examples:
          load_sock_ops [-l] <cg-path> <prog filename>
      	Loads and attaches a sock_ops program at the specified cgroup.
      	If "-l" is used, the program will continue to run to output the
      	BPF log buffer.
      	If the specified filename does not end in ".o", it appends
      	"_kern.o" to the name.
      
          load_sock_ops -r <cg-path>
      	Detaches the currently attached sock_ops program from the
      	specified cgroup.
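
The file name rule above (append "_kern.o" unless the name already ends in ".o") can be sketched as follows. This is a standalone illustration and not the code in load_sock_ops; resolve_prog_filename() is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the object file name for `name` into `out` (capacity `outlen`).
 * Returns `out` on success, or NULL if the result does not fit.
 */
static char *resolve_prog_filename(const char *name, char *out, size_t outlen)
{
        assert(name != NULL && out != NULL);

        size_t len = strlen(name);
        int ends_in_dot_o = len >= 2 && strcmp(name + len - 2, ".o") == 0;

        /* Append the suffix only when the name does not already end in ".o". */
        int n = snprintf(out, outlen, "%s%s", name, ends_in_dot_o ? "" : "_kern.o");

        return (n < 0 || (size_t)n >= outlen) ? NULL : out;
}
```

Under this rule, "tcp_synrto" resolves to "tcp_synrto_kern.o", while "tcp_synrto_kern.o" is used unchanged.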
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: BPF support for sock_ops · 40304b2a
      Lawrence Brakmo committed
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connection parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (e.g. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code, for example before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has an "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use Facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type; the following patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type, BPF_PROG_TYPE_SOCK_OPS.
      Programs of this type are attached and detached with the existing bpf
      attach and detach system calls rather than a dedicated load command.
      
      The patch adds two corresponding structs (one for the kernel, one for the
      user/BPF program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sock_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
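
The two kinds of ops described above can be sketched as a single dispatch function. This is a simplified user-space illustration, not the kernel code: the sketch_* names, the same_datacenter flag, the sndbuf field and the numeric values are all assumptions made for the example, and the real struct bpf_sock_ops has the fields listed earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for some of the op codes; the numeric values are illustrative only. */
enum sketch_op {
        SKETCH_TIMEOUT_INIT,
        SKETCH_RWND_INIT,
        SKETCH_ACTIVE_ESTABLISHED_CB,
        SKETCH_PASSIVE_ESTABLISHED_CB,
        SKETCH_UNKNOWN_OP,
};

/* Hypothetical, simplified context. */
struct sketch_sock_ops {
        uint32_t op;
        int same_datacenter;    /* assumed to be computed elsewhere */
        uint32_t sndbuf;        /* stands in for state set through a setsockopt helper */
};

static int handle_sock_op(struct sketch_sock_ops *ctx)
{
        assert(ctx != NULL);

        switch (ctx->op) {
        /* First type: the caller uses the returned value. */
        case SKETCH_TIMEOUT_INIT:
                return ctx->same_datacenter ? 10 : -1;  /* stands in for a short RTO */
        case SKETCH_RWND_INIT:
                return ctx->same_datacenter ? -1 : 40;  /* initial window in packets */
        /* Second type: change state; the caller ignores the return value. */
        case SKETCH_ACTIVE_ESTABLISHED_CB:
        case SKETCH_PASSIVE_ESTABLISHED_CB:
                if (!ctx->same_datacenter)
                        ctx->sndbuf = 1500000;          /* larger buffer for distant hosts */
                return 0;
        default:
                return -1;      /* operation not supported */
        }
}
```

Returning a negative value for anything the program does not handle matches the "operation is not supported" convention described above, so the caller can fall back to its defaults.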
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-upstream' of... · 57a53a0b
      David S. Miller committed
      Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
      
      Johan Hedberg says:
      
      ====================
      pull request: bluetooth-next 2017-07-01
      
      Here are some more Bluetooth patches for the 4.13 kernel:
      
       - Added support for Broadcom BCM43430 controllers
       - Added sockaddr length checks before accessing sa_family
       - Fixed possible "might sleep" errors in bnep, cmtp and hidp modules
       - A few other minor fixes
      
      Please let me know if there are any issues pulling. Thanks.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: Add peeloff-flags socket option · 2cb5c8e3
      Neil Horman committed
      Based on a request raised on the sctp devel list, there is a need to
      augment the sctp_peeloff operation while specifying the O_CLOEXEC and
      O_NONBLOCK flags (similar to the socket syscall).  Since modifying the
      SCTP_SOCKOPT_PEELOFF socket option would break user space ABI for existing
      programs, this patch creates a new socket option
      SCTP_SOCKOPT_PEELOFF_FLAGS, which accepts a third flags parameter to
      allow atomic assignment of the socket descriptor flags.
      
      Tested successfully by myself and the requestor
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Andreas Steinmetz <ast@domdv.de>
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sfc-MCDI-cleanups' · 15324b25
      David S. Miller committed
      Edward Cree says:

      ====================
      sfc: small MCDI cleanups

      Giving the full MCDI event rather than just the code can aid in
      debugging.  While fixing this I noticed an outdated comment.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: correct comment on efx_mcdi_process_event · 53172d9b
      Edward Cree committed
      Fix out-of-date comment.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
    • net/mlx5: fix spelling mistake: "Allodating" -> "Allocating" · 4120dab0
      Colin Ian King committed
      Trivial fix to spelling mistake in mlx5_core_dbg debug message
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Ilan Tayari <ilant@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>