1. 03 7月, 2017 2 次提交
  2. 02 7月, 2017 5 次提交
    • L
      bpf: Adds support for setting sndcwnd clamp · 13bf9641
      Lawrence Brakmo 提交于
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
      sets the initial congestion window. It is useful to limit the sndcwnd
      when the host are close to each other (small RTT).
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13bf9641
    • L
      bpf: Adds support for setting initial cwnd · fc747810
      Lawrence Brakmo 提交于
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
      initial congestion window. This can be used when the hosts are far
      apart (large RTTs) and it is safe to start with a large inital cwnd.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc747810
    • L
      bpf: Add support for changing congestion control · 91b5b21c
      Lawrence Brakmo 提交于
      Added support for changing congestion control for SOCK_OPS bpf
      programs through the setsockopt bpf helper function. It also adds
      a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
      congestion controls, like dctcp, that need to enable ECN in the
      SYN packets.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91b5b21c
    • L
      bpf: Add setsockopt helper function to bpf · 8c4b4c7e
      Lawrence Brakmo 提交于
      Added support for calling a subset of socket setsockopts from
      BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
      than making the changes to call the socket setsockopt function because
      the changes required would have been larger.
      
      The ops supported are:
        SO_RCVBUF
        SO_SNDBUF
        SO_MAX_PACING_RATE
        SO_PRIORITY
        SO_RCVLOWAT
        SO_MARK
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c4b4c7e
    • L
      bpf: BPF support for sock_ops · 40304b2a
      Lawrence Brakmo 提交于
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connections parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Alghough there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it oes not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
      only once during the connections's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type, following patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte horder */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sockt_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40304b2a
  3. 25 6月, 2017 1 次提交
    • J
      net: store port/representator id in metadata_dst · 3fcece12
      Jakub Kicinski 提交于
      Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
      representators and control messages over single set of hardware queues.
      Control messages and muxed traffic may need ordered delivery.
      
      Those requirements make it hard to comfortably use TC infrastructure today
      unless we have a way of attaching metadata to skbs at the upper device.
      Because single set of queues is used for many netdevs stopping TC/sched
      queues of all of them reliably is impossible and lower device has to
      retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
      the fastpath.
      
      This patch attempts to enable port/representative devs to attach metadata
      to skbs which carry port id.  This way representatives can be queueless and
      all queuing can be performed at the lower netdev in the usual way.
      
      Traffic arriving on the port/representative interfaces will be have
      metadata attached and will subsequently be queued to the lower device for
      transmission.  The lower device should recognize the metadata and translate
      it to HW specific format which is most likely either a special header
      inserted before the network headers or descriptor/metadata fields.
      
      Metadata is associated with the lower device by storing the netdev pointer
      along with port id so that if TC decides to redirect or mirror the new
      netdev will not try to interpret it.
      
      This is mostly for SR-IOV devices since switches don't have lower netdevs
      today.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3fcece12
  4. 24 6月, 2017 1 次提交
    • Y
      bpf: possibly avoid extra masking for narrower load in verifier · 23994631
      Yonghong Song 提交于
      Commit 31fd8581 ("bpf: permits narrower load from bpf program
      context fields") permits narrower load for certain ctx fields.
      The commit however will already generate a masking even if
      the prog-specific ctx conversion produces the result with
      narrower size.
      
      For example, for __sk_buff->protocol, the ctx conversion
      loads the data into register with 2-byte load.
      A narrower 2-byte load should not generate masking.
      For __sk_buff->vlan_present, the conversion function
      set the result as either 0 or 1, essentially a byte.
      The narrower 2-byte or 1-byte load should not generate masking.
      
      To avoid unnecessary masking, prog-specific *_is_valid_access
      now passes converted_op_size back to verifier, which indicates
      the valid data width after perceived future conversion.
      Based on this information, verifier is able to avoid
      unnecessary marking.
      
      Since we want more information back from prog-specific
      *_is_valid_access checking, all of them are packed into
      one data structure for more clarity.
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23994631
  5. 15 6月, 2017 1 次提交
    • Y
      bpf: permits narrower load from bpf program context fields · 31fd8581
      Yonghong Song 提交于
      Currently, verifier will reject a program if it contains an
      narrower load from the bpf context structure. For example,
              __u8 h = __sk_buff->hash, or
              __u16 p = __sk_buff->protocol
              __u32 sample_period = bpf_perf_event_data->sample_period
      which are narrower loads of 4-byte or 8-byte field.
      
      This patch solves the issue by:
        . Introduce a new parameter ctx_field_size to carry the
          field size of narrower load from prog type
          specific *__is_valid_access validator back to verifier.
        . The non-zero ctx_field_size for a memory access indicates
          (1). underlying prog type specific convert_ctx_accesses
               supporting non-whole-field access
          (2). the current insn is a narrower or whole field access.
        . In verifier, for such loads where load memory size is
          less than ctx_field_size, verifier transforms it
          to a full field load followed by proper masking.
        . Currently, __sk_buff and bpf_perf_event_data->sample_period
          are supporting narrowing loads.
        . Narrower stores are still not allowed as typical ctx stores
          are just normal stores.
      
      Because of this change, some tests in verifier will fail and
      these tests are removed. As a bonus, rename some out of bound
      __sk_buff->cb access to proper field name and remove two
      redundant "skb cb oob" tests.
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31fd8581
  6. 11 6月, 2017 2 次提交
  7. 01 6月, 2017 1 次提交
  8. 26 5月, 2017 1 次提交
  9. 01 5月, 2017 1 次提交
  10. 22 4月, 2017 1 次提交
  11. 21 4月, 2017 1 次提交
  12. 18 4月, 2017 1 次提交
  13. 12 4月, 2017 2 次提交
  14. 02 4月, 2017 1 次提交
    • A
      bpf: introduce BPF_PROG_TEST_RUN command · 1cf1cae9
      Alexei Starovoitov 提交于
      development and testing of networking bpf programs is quite cumbersome.
      Despite availability of user space bpf interpreters the kernel is
      the ultimate authority and execution environment.
      Current test frameworks for TC include creation of netns, veth,
      qdiscs and use of various packet generators just to test functionality
      of a bpf program. XDP testing is even more complicated, since
      qemu needs to be started with gro/gso disabled and precise queue
      configuration, transferring of xdp program from host into guest,
      attaching to virtio/eth0 and generating traffic from the host
      while capturing the results from the guest.
      
      Moreover analyzing performance bottlenecks in XDP program is
      impossible in virtio environment, since cost of running the program
      is tiny comparing to the overhead of virtio packet processing,
      so performance testing can only be done on physical nic
      with another server generating traffic.
      
      Furthermore ongoing changes to user space control plane of production
      applications cannot be run on the test servers leaving bpf programs
      stubbed out for testing.
      
      Last but not least, the upstream llvm changes are validated by the bpf
      backend testsuite which has no ability to test the code generated.
      
      To improve this situation introduce BPF_PROG_TEST_RUN command
      to test and performance benchmark bpf programs.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cf1cae9
  15. 24 3月, 2017 2 次提交
  16. 23 3月, 2017 1 次提交
  17. 24 2月, 2017 1 次提交
  18. 18 2月, 2017 1 次提交
  19. 03 2月, 2017 1 次提交
  20. 25 1月, 2017 3 次提交
  21. 12 1月, 2017 2 次提交
    • D
      bpf: allow b/h/w/dw access for bpf's cb in ctx · 62c7989b
      Daniel Borkmann 提交于
      When structs are used to store temporary state in cb[] buffer that is
      used with programs and among tail calls, then the generated code will
      not always access the buffer in bpf_w chunks. We can ease programming
      of it and let this act more natural by allowing for aligned b/h/w/dw
      sized access for cb[] ctx member. Various test cases are attached as
      well for the selftest suite. Potentially, this can also be reused for
      other program types to pass data around.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62c7989b
    • D
      bpf: pass original insn directly to convert_ctx_access · 6b8cc1d1
      Daniel Borkmann 提交于
      Currently, when calling convert_ctx_access() callback for the various
      program types, we pass in insn->dst_reg, insn->src_reg, insn->off from
      the original instruction. This information is needed to rewrite the
      instruction that is based on the user ctx structure into a kernel
      representation for the ctx. As we'd like to allow access size beyond
      just BPF_W, we'd need also insn->code for that in order to decode the
      original access size. Given that, lets just pass insn directly to the
      convert_ctx_access() callback and work on that to not clutter the
      callback with even more arguments we need to pass when everything is
      already contained in insn. So lets go through that once, no functional
      change.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b8cc1d1
  22. 10 1月, 2017 1 次提交
  23. 28 12月, 2016 1 次提交
  24. 25 12月, 2016 1 次提交
  25. 18 12月, 2016 1 次提交
  26. 09 12月, 2016 1 次提交
  27. 06 12月, 2016 1 次提交
  28. 03 12月, 2016 2 次提交