1. 02 Jul, 2017 14 commits
    •
      bpf: Support for setting initial receive window · 13d3b1eb
      Lawrence Brakmo committed
      This patch adds support for setting the initial advertised window from
      within a BPF_SOCK_OPS program. This can be used to support larger
      initial cwnd values in environments where it is known to be safe.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      bpf: Support for per connection SYN/SYN-ACK RTOs · 8550f328
      Lawrence Brakmo committed
      This patch adds support for setting per-connection SYN and
      SYN-ACK RTOs from within a BPF_SOCK_OPS program, for example
      to set small RTOs when it is known that both hosts are within
      the same datacenter.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      bpf: BPF support for sock_ops · 40304b2a
      Lawrence Brakmo committed
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connection parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has an "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type; the following patches add the functionality to set various
      connection parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel, one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sock_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
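
      As a rough illustration of the first kind of op (the program returns a
      value that the caller uses, or a negative value when the op is not
      supported), the user-space sketch below mimics a handler that picks a
      small SYN RTO when both endpoints share an IPv6 prefix. The op codes
      OP_TIMEOUT_INIT and OP_RWND_INIT, the 64-bit prefix rule and the reply
      value 10 are assumptions made for illustration, not names or values taken
      from this patch. A real program would be compiled as BPF and attached to
      a cgroup.

```c
#include <stdint.h>
#include <string.h>

/* Mirror of the user-visible struct quoted in the commit message. */
struct bpf_sock_ops {
	uint32_t op;
	union {
		uint32_t reply;
		uint32_t replylong[4];
	};
	uint32_t family;
	uint32_t remote_ip4;
	uint32_t local_ip4;
	uint32_t remote_ip6[4];
	uint32_t local_ip6[4];
	uint32_t remote_port;
	uint32_t local_port;
};

/* Hypothetical op codes; the patch text does not name them. */
enum { OP_TIMEOUT_INIT = 1, OP_RWND_INIT = 2 };

#define FAMILY_INET6 10 /* numeric value of AF_INET6 on Linux */

/* Assumption: hosts sharing the first 64 address bits are in one datacenter. */
static int same_datacenter(const struct bpf_sock_ops *ops)
{
	return ops->family == FAMILY_INET6 &&
	       memcmp(ops->local_ip6, ops->remote_ip6, 8) == 0;
}

/* Returns a value for the caller to use, or -1 for "op not supported". */
static int sock_ops_handler(const struct bpf_sock_ops *ops)
{
	switch (ops->op) {
	case OP_TIMEOUT_INIT:
		/* 10 is an arbitrary placeholder for a "small" timeout. */
		return same_datacenter(ops) ? 10 : -1;
	default:
		return -1;
	}
}
```

      The single "op" switch reflects the point made above: one program is
      invoked from several places in the stack and dispatches on the requested
      operation.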
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: Add peeloff-flags socket option · 2cb5c8e3
      Neil Horman committed
      Based on a request raised on the sctp devel list, there is a need to
      augment the sctp_peeloff operation to allow specifying the O_CLOEXEC and
      O_NONBLOCK flags (similar to the socket syscall).  Since modifying the
      SCTP_SOCKOPT_PEELOFF socket option would break user space ABI for existing
      programs, this patch creates a new socket option
      SCTP_SOCKOPT_PEELOFF_FLAGS, which accepts a third flags parameter to
      allow atomic assignment of the socket descriptor flags.

      Tested successfully by myself and the requestor.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Andreas Steinmetz <ast@domdv.de>
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      datapath: Avoid using stack larger than 1024. · 9cc9a5cb
      Tonghao Zhang committed
      When compiling OvS-master on a 4.4.0-81 kernel,
      there is a warning:
      
          CC [M]  /root/ovs/datapath/linux/datapath.o
          /root/ovs/datapath/linux/datapath.c: In function
          'ovs_flow_cmd_set':
          /root/ovs/datapath/linux/datapath.c:1221:1: warning:
          the frame size of 1040 bytes is larger than 1024 bytes
          [-Wframe-larger-than=]
      
      This patch factors out match-init and action-copy to avoid the
      "-Wframe-larger-than=1024" warning. Because mask is only
      used to get actions, we add a new function to save some
      stack space.
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_init_chunk_t · 01a992be
      Xin Long committed
      This patch removes the typedef sctp_init_chunk_t and replaces it
      with struct sctp_init_chunk wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_inithdr_t · 4ae70c08
      Xin Long committed
      This patch removes the typedef sctp_inithdr_t and replaces it
      with struct sctp_inithdr wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_data_chunk_t · 9f8d3147
      Xin Long committed
      This patch removes the typedef sctp_data_chunk_t and replaces it
      with struct sctp_data_chunk wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_datahdr_t · 3583df1a
      Xin Long committed
      This patch removes the typedef sctp_datahdr_t and replaces it with
      struct sctp_datahdr wherever the typedef is used.

      It also uses sizeof(variable) instead of sizeof(type).
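
      The sizeof(variable) idiom keeps a size expression tied to the object it
      describes, so it stays correct if the object's type is later renamed or
      changed. A minimal sketch, using a made-up struct demo_datahdr rather
      than the real SCTP header layout:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header, for illustration only (not the SCTP layout). */
struct demo_datahdr {
	uint32_t tsn;
	uint16_t stream;
	uint16_t ssn;
	uint32_t ppid;
};

/* Zero a header; the size follows the pointee type automatically. */
static void demo_clear(struct demo_datahdr *hdr)
{
	memset(hdr, 0, sizeof(*hdr)); /* rather than sizeof(struct demo_datahdr) */
}

/* Size of the header as seen through a variable. */
static size_t demo_hdr_len(const struct demo_datahdr *hdr)
{
	return sizeof(*hdr);
}
```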
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_param_t · 34b4e29b
      Xin Long committed
      This patch removes the typedef sctp_param_t and replaces it with
      struct sctp_paramhdr wherever the typedef is used.

      It also removes the useless declaration of sctp_addip_addr_config
      and adds the missing parameters to some other function declarations.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_paramhdr_t · 3c918704
      Xin Long committed
      This patch removes the typedef sctp_paramhdr_t and replaces it
      with struct sctp_paramhdr wherever the typedef is used.

      It also fixes some indents and uses sizeof(variable) instead
      of sizeof(type).
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_cid_t · 6d85e68f
      Xin Long committed
      This patch removes the typedef sctp_cid_t and replaces it
      with struct sctp_cid wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_chunkhdr_t · 922dbc5b
      Xin Long committed
      This patch removes the typedef sctp_chunkhdr_t and replaces it
      with struct sctp_chunkhdr wherever the typedef is used.

      It also fixes some indents and uses sizeof(variable) instead
      of sizeof(type), especially in sctp_new.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      sctp: remove the typedef sctp_sctphdr_t · ae146d9b
      Xin Long committed
      This patch removes the typedef sctp_sctphdr_t and replaces it
      with struct sctphdr wherever the typedef is used.

      It also fixes some indents and uses sizeof(variable) instead
      of sizeof(type).
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 01 Jul, 2017 18 commits
  3. 30 Jun, 2017 5 commits
    •
      net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() · e44699d2
      Michal Kubeček committed
      Recently I started seeing warnings about pages with refcount -1. The
      problem was traced to packets being reused after their head was merged into
      a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
      commit c21b48cc ("net: adjust skb->truesize in ___pskb_trim()") and
      I have never seen it on a kernel with it reverted, I believe the real
      problem appeared earlier when the option to merge head frag in GRO was
      implemented.
      
      Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
      branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
      and head is merged (which in my case happens after the skb_condense()
      call added by the commit mentioned above), the skb is reused including the
      head that has been merged. As a result, we release the page reference
      twice and eventually end up with negative page refcount.
      
      To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
      the same way it's done in napi_skb_finish().
      
      Fixes: d7e8883c ("net: make GRO aware of skb->head_frag")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: bridge: constify attribute_group structures. · cddbb79f
      Arvind Yadav committed
      attribute_groups are not supposed to change at runtime. All functions
      working with attribute_groups provided by <linux/sysfs.h> work with const
      attribute_group. So mark the non-const structs as const.
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
         2645	    896	      0	   3541	    dd5	net/bridge/br_sysfs_br.o
      
      File size After adding 'const':
         text	   data	    bss	    dec	    hex	filename
         2701	    832	      0	   3533	    dcd	net/bridge/br_sysfs_br.o
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: constify attribute_group structures. · 38ef00cc
      Arvind Yadav committed
      attribute_groups are not supposed to change at runtime. All functions
      working with attribute_groups provided by <linux/device.h> work with const
      attribute_group. So mark the non-const structs as const.
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
         9968	   3168	     16	  13152	   3360	net/core/net-sysfs.o
      
      File size After adding 'const':
         text	   data	    bss	    dec	    hex	filename
        10160	   2976	     16	  13152	   3360	net/core/net-sysfs.o
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: ipmr: Add ipmr_rtm_getroute · 4f75ba69
      Donald Sharp committed
      Add to RTNL_FAMILY_IPMR, RTM_GETROUTE the ability
      to retrieve one S,G mroute from a specified table.

      *,G will return mroute information for just that
      particular mroute if it exists.  This is because
      it is entirely possible to have more S's than
      can fit in one skb to return to the requesting
      process.
      Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: sched: Fix one possible panic when no destroy callback · c1a4872e
      Gao Feng committed
      When a qdisc fails to init, qdisc_create invokes the destroy callback
      to clean up, but it does not check whether the callback actually exists.
      This causes a panic for qdiscs that define no destroy callback, such as
      codel, fq, and so on.

      Take codel as an example:
      When a malicious user constructs an invalid netlink msg, it causes
      codel_init->codel_change->nla_parse_nested to fail.
      The kernel then invokes the destroy callback directly, but qdisc codel
      doesn't define one, which results in a panic.

      Now add a check for the destroy callback to avoid the possible panic.
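
      The shape of such a guard can be sketched in plain user-space C. The
      demo_* types and functions below are simplified, hypothetical stand-ins
      and not the kernel's Qdisc structures or the actual patch. The point is
      only that an optional callback is tested for NULL before it is invoked
      on the error path.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's qdisc types. */
struct demo_qdisc;

struct demo_qdisc_ops {
	int  (*init)(struct demo_qdisc *q);
	void (*destroy)(struct demo_qdisc *q); /* optional, may be NULL */
};

struct demo_qdisc {
	const struct demo_qdisc_ops *ops;
	int destroyed;
};

/* Returns 0 on success, or the init error after best-effort cleanup. */
static int demo_qdisc_create(struct demo_qdisc *q)
{
	int err = q->ops->init ? q->ops->init(q) : 0;

	if (err && q->ops->destroy) /* the added NULL check */
		q->ops->destroy(q);
	return err;
}

/* Helpers for trying the sketch: an init that always fails (-22 is -EINVAL). */
static int demo_failing_init(struct demo_qdisc *q) { (void)q; return -22; }
static void demo_destroy(struct demo_qdisc *q) { q->destroyed = 1; }
```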
      
      Fixes: 87b60cfa ("net_sched: fix error recovery at qdisc creation")
      Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 29 Jun, 2017 2 commits
    •
      Bluetooth: Add sockaddr length checks before accessing sa_family in bind and connect handlers · d2ecfa76
      Mateusz Jurczyk committed
      Verify that the caller-provided sockaddr structure is large enough to
      contain the sa_family field, before accessing it in bind() and connect()
      handlers of the Bluetooth sockets. Since neither syscall enforces a minimum
      size of the corresponding memory region, very short sockaddrs (zero or one
      byte long) result in operating on uninitialized memory while referencing
      sa_family.
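
      A length check of this kind might look like the following user-space
      sketch. The helper name demo_addr_ok is hypothetical and the code is not
      taken from the patch. It only shows that the caller-supplied length is
      compared against the bytes needed to cover sa_family before the field is
      read.

```c
#include <stddef.h>
#include <sys/socket.h>

/* AF_BLUETOOTH is 31 on Linux; defined here in case the libc lacks it. */
#ifndef AF_BLUETOOTH
#define AF_BLUETOOTH 31
#endif

/* Returns 1 if sa_family may be read and names a Bluetooth address. */
static int demo_addr_ok(const struct sockaddr *addr, size_t addr_len)
{
	/* Bytes needed to cover sa_family completely. */
	size_t need = offsetof(struct sockaddr, sa_family) +
		      sizeof(addr->sa_family);

	if (!addr || addr_len < need)
		return 0; /* too short: never touch sa_family */
	return addr->sa_family == AF_BLUETOOTH;
}
```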
      Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    •
      bluetooth: remove WQ_MEM_RECLAIM from hci workqueues · 29e2dd0d
      Tejun Heo committed
      Bluetooth hci uses ordered HIGHPRI, MEM_RECLAIM workqueues.  It's
      likely that the flags came from mechanical conversion from
      create_singlethread_workqueue().  Bluetooth shouldn't be depended upon
      for memory reclaim and the spurious MEM_RECLAIM flag can trigger the
      following warning.  Remove WQ_MEM_RECLAIM and convert to
      alloc_ordered_workqueue() while at it.
      
        workqueue: WQ_MEM_RECLAIM hci0:hci_power_off is flushing !WQ_MEM_RECLAIM events:btusb_work
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 14231 at /home/brodo/local/kernel/git/linux/kernel/workqueue.c:2423 check_flush_dependency+0xb3/0x100
        Modules linked in:
        CPU: 2 PID: 14231 Comm: kworker/u9:4 Not tainted 4.12.0-rc6+ #3
        Hardware name: Dell Inc. XPS 13 9343/0TM99H, BIOS A11 12/08/2016
        Workqueue: hci0 hci_power_off
        task: ffff9432dad58000 task.stack: ffff986d43790000
        RIP: 0010:check_flush_dependency+0xb3/0x100
        RSP: 0018:ffff986d43793c90 EFLAGS: 00010086
        RAX: 000000000000005a RBX: ffff943316810820 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000001
        RBP: ffff986d43793cb0 R08: 0000000000000775 R09: ffffffff85bdd5c0
        R10: 0000000000000040 R11: 0000000000000000 R12: ffffffff84d596e0
        R13: ffff9432dad58000 R14: ffff94321c640320 R15: ffff9432dad58000
        FS:  0000000000000000(0000) GS:ffff94331f500000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007b8bca242000 CR3: 000000014f60a000 CR4: 00000000003406e0
        Call Trace:
         flush_work+0x8a/0x1c0
         ? flush_work+0x184/0x1c0
         ? skb_free_head+0x21/0x30
         __cancel_work_timer+0x124/0x1b0
         ? hci_dev_do_close+0x2a4/0x4d0
         cancel_work_sync+0x10/0x20
         btusb_close+0x23/0x100
         hci_dev_do_close+0x2ca/0x4d0
         hci_power_off+0x1e/0x50
         process_one_work+0x184/0x3e0
         worker_thread+0x4a/0x3a0
         ? preempt_count_sub+0x9b/0x100
         ? preempt_count_sub+0x9b/0x100
         kthread+0x125/0x140
         ? process_one_work+0x3e0/0x3e0
         ? __kthread_create_on_node+0x1a0/0x1a0
         ? do_syscall_64+0x58/0xd0
         ret_from_fork+0x27/0x40
        Code: 00 75 bf 49 8b 56 18 48 8d 8b b0 00 00 00 48 81 c6 b0 00 00 00 4d 89 e0 48 c7 c7 20 23 6b 85 c6 05 83 cd 31 01 01 e8 bf c4 0c 00 <0f> ff eb 93 80 3d 74 cd 31 01 00 75 a5 65 48 8b 04 25 00 c5 00
        ---[ end trace b88fd2f77754bfec ]---
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
  5. 28 Jun, 2017 1 commit