1. 14 Apr 2016 (2 commits)
    • net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put · f9a7cbbf
      Committed by Denys Vlasenko
      Sometimes gcc mysteriously doesn't inline
      very small functions we expect to be inlined; see
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Arguably gcc should do better, but the gcc developers aren't willing
      to invest time into it and suggest using __always_inline instead.
      
      With this .config:
      http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
      the following functions get deinlined many times.
      
      netif_tx_stop_queue: 207 copies, 590 calls:
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 8f e0 01 00 00 01 lock orb $0x1,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      netif_tx_start_queue: 47 copies, 111 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 a7 e0 01 00 00 fe lock andb $0xfe,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      sock_hold: 39 copies, 124 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 87 80 00 00 00    lock incl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      __sock_put: 6 copies, 13 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 8f 80 00 00 00    lock decl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
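
      A minimal sketch of the shape of the change, for one of the four
      helpers (details per the description above; assume the usual
      kernel headers):

	/* include/linux/netdevice.h: was "static inline" */
	static __always_inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
	{
		set_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state);
	}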
      
      Code size decrease after the patch is ~2.5k:
      
              text      data       bss        dec     hex filename
          56719876  56364551  36196352  149280779 8e5d80b vmlinux_before
          56717440  56364551  36196352  149278343 8e5ce87 vmlinux
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: tighten lockdep checks for sock_owned_by_user · fafc4e1e
      Committed by Hannes Frederic Sowa
      sock_owned_by_user should not be used without the socket lock held.
      It seems to be common practice to check .owned before lock
      reclassification, so provide a little helper to abstract this check
      away.
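
      A minimal sketch of such a helper (the check this patch abstracts
      away; the exact form in the patch may differ):

	static inline bool sock_allow_reclassification(const struct sock *csk)
	{
		struct sock *sk = (struct sock *)csk;

		return !sk->sk_lock.owned && !spin_is_locked(&sk->sk_lock.slock);
	}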
      
      Cc: linux-cifs@vger.kernel.org
      Cc: linux-bluetooth@vger.kernel.org
      Cc: linux-nfs@vger.kernel.org
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Apr 2016 (4 commits)
  3. 06 Apr 2016 (3 commits)
  4. 05 Apr 2016 (6 commits)
    • tcp: increment sk_drops for dropped rx packets · 532182cd
      Committed by Eric Dumazet
      Now that ss can report sk_drops, we can instruct TCP to increment
      this per-socket counter when it drops an incoming frame, to refine
      monitoring and debugging.
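
      A hedged sketch of the pattern (helper name from this series;
      simplified, e.g. the real counter may account for GSO segments):

	/* tcp_input.c: count the drop, then free the skb */
	static void tcp_drop(struct sock *sk, struct sk_buff *skb)
	{
		atomic_inc(&sk->sk_drops);
		__kfree_skb(skb);
	}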
      
      A following patch takes care of listener drops.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Committed by Eric Dumazet
      Tom Herbert would prefer not to touch the UDP socket refcnt for
      encapsulated traffic. For this to happen, we need to use normal RCU
      rules, with a grace period before freeing a socket. UDP sockets are
      not short-lived in the high-usage case, so the added cost of
      call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in the UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      The same remark applies to functions used by the xt_socket and
      xt_TPROXY netfilter modules, but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus.
      
      A simple udp_rx test using a recvfrom() loop:
      438 kpps instead of 374 kpps: a 17% increase in peak rate.
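
      A schematic of the receive-path lookup this enables (names
      approximate): under rcu_read_lock(), a socket found in the hash
      can be used without touching sk_refcnt:

	rcu_read_lock();
	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
	if (sk)
		ret = udp_queue_rcv_skb(sk, skb);	/* no sock_hold()/sock_put() */
	rcu_read_unlock();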
      
      v2: Addressed Willem de Bruijn feedback in multicast handling
       - keep early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add SOCK_RCU_FREE socket flag · a4298e45
      Committed by Eric Dumazet
      We want a generic way to insert an RCU grace period before socket
      freeing for cases where SLAB_DESTROY_BY_RCU adds too much
      overhead.
      
      SLAB_DESTROY_BY_RCU's strict rules force us to take a reference
      on the socket's sk_refcnt, and this is a performance problem for UDP
      encapsulation and TCP synflood behavior, as many CPUs might
      attempt atomic operations on a shared sk_refcnt.
      
      UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
      lookup can use traditional RCU rules, without refcount changes.
      They can set the flag only once hashed and visible to other cpus.
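
      A minimal sketch of the freeing side (as described above; details
      may differ):

	void sk_destruct(struct sock *sk)
	{
		if (sock_flag(sk, SOCK_RCU_FREE))
			call_rcu(&sk->sk_rcu, __sk_destruct);
		else
			__sk_destruct(&sk->sk_rcu);
	}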
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Tested-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: enable timestamping using control messages · c14ac945
      Committed by Soheil Hassas Yeganeh
      Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using the tsflags added to `struct sockcm_cookie` (added in the
      previous patches in this series) to set the tx_flags of the last skb
      created in a sendmsg. With this patch, the timestamp recording bits
      in the skbuff's tx_flags are overridden if SO_TIMESTAMPING is passed
      in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamp flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options, and then ask for SOF_TIMESTAMPING_TX_* flags
      via control messages per sendmsg to sample timestamps for each
      write.
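
      A hedged userspace sketch of a per-write timestamp request
      (assumes fd, buf and len exist and reporting flags were already
      set via setsockopt):

	char control[CMSG_SPACE(sizeof(__u32))] = {};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = control,
			      .msg_controllen = sizeof(control) };
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SO_TIMESTAMPING;
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
	*(__u32 *)CMSG_DATA(cmsg) = SOF_TIMESTAMPING_TX_SOFTWARE;
	sendmsg(fd, &msg, 0);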
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: accept SO_TIMESTAMPING flags in socket cmsg · 3dd17e63
      Committed by Soheil Hassas Yeganeh
      Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
      as a basis to accept timestamping requests per write.
      
      This implementation only accepts TX recording flags (i.e.,
      SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
      SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
      control messages. Users need to set reporting flags (e.g.,
      SOF_TIMESTAMPING_OPT_ID) per socket via socket options.
      
      This commit adds a tsflags field to sockcm_cookie, set in
      __sock_cmsg_send. It only overrides the SOF_TIMESTAMPING_TX_*
      bits in sockcm_cookie.tsflags, allowing the control message
      to override the recording behavior per write while maintaining
      the value of the other flags.
      
      This patch implements validation of the control message and the
      setting of tsflags in struct sockcm_cookie. The next commits in this
      series actually implement timestamping per write for different
      protocols.
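
      A hedged sketch of the validation logic, wrapped in a hypothetical
      helper for illustration (the real code lives in __sock_cmsg_send):

	static int sock_cmsg_set_tsflags(struct cmsghdr *cmsg,
					 struct sockcm_cookie *sockc)
	{
		u32 tsflags;

		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
			return -EINVAL;
		tsflags = *(u32 *)CMSG_DATA(cmsg);
		if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK)
			return -EINVAL;	/* only TX recording flags */
		sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
		sockc->tsflags |= tsflags;
		return 0;
	}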
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop · 39771b12
      Committed by Willem de Bruijn
      To process cmsgs of the SOL_SOCKET level in addition to
      cmsgs of another level, protocols can call sock_cmsg_send().
      This causes a double walk of the cmsghdr list: one for SOL_SOCKET
      and one for the other level.
      
      Extract the inner demultiplex logic from the loop that walks the
      list, so that it can be called directly from a walker in the
      protocol-specific code.
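
      A minimal sketch of the resulting split (simplified):

	int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
			   struct sockcm_cookie *sockc)
	{
		struct cmsghdr *cmsg;
		int ret;

		for_each_cmsghdr(cmsg, msg) {
			if (!CMSG_OK(msg, cmsg))
				return -EINVAL;
			if (cmsg->cmsg_level != SOL_SOCKET)
				continue;	/* other levels: caller's own walker */
			ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
			if (ret)
				return ret;
		}
		return 0;
	}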
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 11 Feb 2016 (1 commit)
  6. 22 Jan 2016 (1 commit)
  7. 15 Jan 2016 (7 commits)
  8. 05 Jan 2016 (1 commit)
    • soreuseport: define reuseport groups · ef456144
      Committed by Craig Gallek
      struct sock_reuseport is an optional shared structure referenced by each
      socket belonging to a reuseport group.  When a socket is bound to an
      address/port not yet in use and the reuseport flag has been set, the
      structure will be allocated and attached to the newly bound socket.
      When subsequent calls to bind are made for the same address/port, the
      shared structure will be updated to include the new socket and the
      newly bound socket will reference the group structure.
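
      A hedged sketch of the shared structure (fields per the description
      above; the exact layout may differ):

	struct sock_reuseport {
		struct rcu_head	rcu;
		u16		max_socks;	/* length of socks[] */
		u16		num_socks;	/* elements in socks[] */
		struct sock	*socks[0];	/* all sockets in the group */
	};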
      
      Previously, when an incoming packet was destined for a reuseport
      group, all sockets in the same group needed to be considered before
      a dispatch decision was made.  With this structure, an appropriate
      socket can be found after looking up just one socket in the group.
      
      This shared structure will also allow more complicated decisions to
      be made when selecting a socket (e.g., a BPF filter).
      
      This work is based on a similar implementation written by
      Ying Cai <ycai@google.com> for policy-based reuseport
      selection.
      Signed-off-by: Craig Gallek <kraig@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 17 Dec 2015 (1 commit)
  10. 16 Dec 2015 (2 commits)
  11. 15 Dec 2015 (2 commits)
    • net: fix IP early demux races · 5037e9ef
      Committed by Eric Dumazet
      David Wilder reported crashes caused by dst reuse.
      
      <quote David>
        I am seeing a crash on a distro V4.2.3 kernel caused by a double
        release of a dst_entry.  In ipv4_dst_destroy() the call to
        list_empty() finds a poisoned next pointer, indicating the dst_entry
        has already been removed from the list and freed. The crash occurs
        18 to 24 hours into a run of a network stress exerciser.
      </quote>
      
      Thanks to his detailed report and analysis, we were able to understand
      the core issue.
      
      IP early demux can associate a dst with an skb after a lookup in
      TCP/UDP sockets.
      
      When the socket's dst cache is not already set, we want to store the
      dst in sk->sk_dst_cache for future IP early demux lookups, acquiring
      a stable refcount on the dst.
      
      The problem is that this acquisition simply uses atomic_inc(), which
      works well unless the dst was already queued for destruction by
      dst_release() after its refcount reached zero (when DST_NOCACHE
      is set on the dst).
      
      We need to make sure the current refcount is not zero before
      incrementing it, or we risk the double free David reported.
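
      A minimal sketch of the safe acquisition (as described; the helper
      in the patch may differ in detail):

	static inline bool dst_hold_safe(struct dst_entry *dst)
	{
		if (dst->flags & DST_NOCACHE)
			return atomic_inc_not_zero(&dst->__refcnt);
		dst_hold(dst);
		return true;
	}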
      
      This patch, being a stable candidate, adds two new helpers and uses
      them only in the problematic IP early demux paths.

      It might be possible to merge skb_dst_force() and
      skb_dst_force_safe() in net-next, but I prefer having the smallest
      patch for stable kernels: maybe some skb_dst_force() callers do not
      expect skb->dst to be suddenly cleared.
      
      This can probably be backported as far back as linux-3.6 kernels.
      Reported-by: David J. Wilder <dwilder@us.ibm.com>
      Tested-by: David J. Wilder <dwilder@us.ibm.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add validation for the socket syscall protocol argument · 79462ad0
      Committed by Hannes Frederic Sowa
      郭永刚 reported that one could simply crash the kernel as root by
      using a simple program:
      
	int socket_fd;
	struct sockaddr_in addr;

	addr.sin_port = 0;
	addr.sin_addr.s_addr = INADDR_ANY;
	addr.sin_family = 10;

	socket_fd = socket(10, 3, 0x40000000);
	connect(socket_fd, (struct sockaddr *)&addr, 16);
      
      AF_INET and AF_INET6 sockets actually only support 8-bit protocol
      identifiers. inet_sock's skc_protocol field is sized accordingly, so
      larger protocol identifiers simply have their higher bits cut off,
      leaving a zero in the protocol field.

      This could lead to, e.g., a NULL function pointer dereference: as a
      result of the truncation, inet_num is zero and we call down to
      inet_autobind, which is NULL for raw sockets.
      
      kernel: Call Trace:
      kernel:  [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
      kernel:  [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
      kernel:  [<ffffffff81645069>] SYSC_connect+0xd9/0x110
      kernel:  [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
      kernel:  [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
      kernel:  [<ffffffff81645e0e>] SyS_connect+0xe/0x10
      kernel:  [<ffffffff81779515>] tracesys_phase2+0x84/0x89
      
      I found no particular commit which introduced this problem.
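
      The fix boils down to a range check at socket creation time; a
      sketch (placement at the top of inet_create()/inet6_create() is an
      assumption based on the description):

	if (protocol < 0 || protocol >= IPPROTO_MAX)
		return -EINVAL;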
      
      CVE: CVE-2015-8543
      Cc: Cong Wang <cwang@twopensource.com>
      Reported-by: 郭永刚 <guoyonggang@360.cn>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 12 Dec 2015 (1 commit)
  13. 09 Dec 2015 (2 commits)
    • net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct · 2a56a1fe
      Committed by Tejun Heo
      Introduce sock->sk_cgrp_data, which is a struct sock_cgroup_data.
      ->sk_cgrp_prioidx and ->sk_classid are moved into it.  The struct
      and its accessors are defined in cgroup-defs.h.  This is to prepare
      for overloading the fields with a cgroup pointer.
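
      A hedged sketch of the wrapper struct (fields per the description
      above):

	struct sock_cgroup_data {
		u16	prioidx;	/* was sk->sk_cgrp_prioidx */
		u32	classid;	/* was sk->sk_classid */
	};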
      
      This patch mostly performs equivalent conversions, but the following
      are noteworthy.
      
      * Equality test before updating classid is removed from
        sock_update_classid().  This shouldn't make any noticeable
        difference and a similar test will be implemented on the helper side
        later.
      
      * sock_update_netprioidx() now takes struct sock_cgroup_data and can
        be moved to netprio_cgroup.h without causing an include dependency
        loop.  Moved.
      
      * The dummy version of sock_update_netprioidx() is converted to a
        static inline function while at it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netprio_cgroup: limit the maximum css->id to USHRT_MAX · 297dbde1
      Committed by Tejun Heo
      netprio builds a per-netdev contiguous priomap array which is
      indexed by css->id.  The array is allocated using kzalloc(),
      effectively limiting the maximum supported ID to the low thousands.
      This patch caps the maximum supported css->id at USHRT_MAX, which
      should be way above what is actually usable.
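
      A minimal sketch of the cap (hypothetical helper name; the exact
      hook placement is an assumption):

	static int netprio_check_css_id(struct cgroup_subsys_state *css)
	{
		if (css->id > USHRT_MAX)
			return -ENOSPC;	/* would not fit the u16 priomap index */
		return 0;
	}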
      
      This allows reducing sock->sk_cgrp_prioidx from u32 to u16.  The
      freed-up part will be used to overload the cgroup-related fields.
      sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
      two cgroup-related fields are adjacent.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 06 Dec 2015 (1 commit)
  15. 04 Dec 2015 (1 commit)
    • ipv6: kill sk_dst_lock · 6bd4f355
      Committed by Eric Dumazet
      While testing the np->opt RCU conversion, I found that UDP/IPv6 was
      using a mixture of xchg() and sk_dst_lock to protect concurrent changes
      to sk->sk_dst_cache, leading to possible corruptions and crashes.
      
      ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
      way to fix the mess is to remove sk_dst_lock completely, as we did for
      IPv4.
      
      __ip6_dst_store() and ip6_dst_store() now share the same
      implementation.

      Since sk_setup_caps() can be called with or without the socket lock
      held, we have to use sk_dst_set() instead of __sk_dst_set().
      
      Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
      in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
      
      This is because ip6_dst_store() can be called from process context,
      without any lock held.
      
      As soon as the dst is installed in sk->sk_dst_cache, it can be freed
      by another cpu doing a concurrent ip6_dst_store().

      Dereferencing the dst before installing it is needed to make sure no
      use-after-free can trigger.
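
      A simplified sketch of the resulting ordering (the real helper also
      caches addresses):

	struct rt6_info *rt = (struct rt6_info *)dst;

	np->dst_cookie = rt6_get_cookie(rt);	/* read from dst first... */
	sk_setup_caps(sk, dst);			/* ...then publish it */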
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 03 Dec 2015 (1 commit)
    • tcp: suppress too verbose messages in tcp_send_ack() · 7450aaf6
      Committed by Eric Dumazet
      If tcp_send_ack() cannot allocate an skb, we properly handle this
      and set up a timer to try again later.
      
      Use __GFP_NOWARN to avoid polluting syslog when the host is under
      memory pressure, so that pertinent messages are not lost in a flood
      of useless information.
      
      sk_gfp_atomic() can now use its gfp_mask argument (all callers were
      using GFP_ATOMIC before this patch).

      We rename sk_gfp_atomic() to sk_gfp_mask() to clearly express that
      this function now takes its second argument (gfp_mask) into account.
      
      Note that when tcp_transmit_skb() is called with clone_it set to
      false, we do not attempt memory allocations, so we can pass a 0
      gfp_mask, which most compilers can emit faster than a nonzero value.
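
      A hedged sketch of the renamed helper and its new call site (details
      may differ):

	static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
	{
		return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
	}

	/* tcp_send_ack(): don't warn on allocation failure */
	buff = alloc_skb(MAX_TCP_HEADER,
			 sk_gfp_mask(sk, GFP_ATOMIC | __GFP_NOWARN));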
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 02 Dec 2015 (2 commits)
    • net: fix sock_wake_async() rcu protection · ceb5d58b
      Committed by Eric Dumazet
      Dmitry provided a syzkaller (http://github.com/google/syzkaller)
      program triggering a fault in sock_wake_async() when async IO is
      requested.
      
      Said program stressed af_unix sockets, but the issue is generic
      and should be addressed in the core networking stack.
      
      The problem is that by the time sock_wake_async() is called, we
      should not access the @flags field of 'struct socket', as the inode
      containing this socket might be freed without further notice, and
      without an RCU grace period.
      
      We already maintain an RCU protected structure, "struct socket_wq"
      so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
      is the safe route.
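
      A hedged sketch of where the flags end up (layout approximate):

	struct socket_wq {
		wait_queue_head_t	wait;
		struct fasync_struct	*fasync_list;
		unsigned long		flags;	/* SOCKWQ_ASYNC_NOSPACE, etc. */
		struct rcu_head		rcu;
	};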
      
      It also reduces the number of cache lines needing dirtying, so it
      might provide a performance improvement anyway.
      
      In followup patches, we might move the remaining flags (SOCK_NOSPACE,
      SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and make 'struct socket'
      mostly read-only, letting it be shared between cpus.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Committed by Eric Dumazet
      This patch is a cleanup to make the following patch easier to
      review.
      
      The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to (struct socket_wq)->flags, to
      benefit from RCU protection in sock_wake_async().

      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int nr, struct sock *sk), are added so that the
      following patch can change their implementation.
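
      A minimal sketch of the two helpers as introduced (trivial
      pass-throughs at this point):

	static inline void sk_set_bit(int nr, struct sock *sk)
	{
		set_bit(nr, &sk->sk_socket->flags);
	}

	static inline void sk_clear_bit(int nr, struct sock *sk)
	{
		clear_bit(nr, &sk->sk_socket->flags);
	}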
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 01 Dec 2015 (1 commit)
  19. 16 Nov 2015 (1 commit)