1. 13 10月, 2015 1 次提交
  2. 04 10月, 2015 1 次提交
    • E
      tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets · e96f78ab
      Eric Dumazet 提交于
      Before letting request sockets being put in TCP/DCCP regular
      ehash table, we need to add either :
      
      - SLAB_DESTROY_BY_RCU flag to their kmem_cache
      - add RCU grace period before freeing them.
      
      Since we carefully respected the SLAB_DESTROY_BY_RCU protocol
      like ESTABLISH and TIMEWAIT sockets, use it here.
      
      req_prot_init() being only used by TCP and DCCP, I did not add
      a new slab_flags into their rsk_prot, but reuse prot->slab_flags
      
      Since all reqsk_alloc() users are correctly dealing with a failure,
      add the __GFP_NOWARN flag to avoid traces under pressure.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e96f78ab
  3. 16 9月, 2015 1 次提交
  4. 28 8月, 2015 1 次提交
  5. 31 7月, 2015 1 次提交
  6. 27 7月, 2015 1 次提交
  7. 01 7月, 2015 1 次提交
    • C
      sock_diag: don't broadcast kernel sockets · b922622e
      Craig Gallek 提交于
      Kernel sockets do not hold a reference for the network namespace to
      which they point.  Socket destruction broadcasting relies on the
      network namespace and will cause the splat below when a kernel socket
      is destroyed.
      
      This fix simply ignores kernel sockets when they are destroyed.
      
      Reported as:
      general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1
      Workqueue: sock_diag_events sock_diag_broadcast_destroy_work
      Stack:
       ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90
       ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90
       ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8
      Call Trace:
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag]
       [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160
       [<ffffffff8408ea97>] process_one_work+0x147/0x420
       [<ffffffff8408f0f9>] worker_thread+0x69/0x470
       [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0
       [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320
       [<ffffffff84093cd7>] kthread+0x107/0x120
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
       [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
      
      Tested:
        Using a debug kernel while 'ss -E' is running:
        ip netns add test-ns
        ip netns delete test-ns
      
      Fixes: eb4cb008 sock_diag: define destruction multicast groups
      Fixes: 26abe143 net: Modify sk_alloc to not reference count the
        netns of kernel sockets.
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b922622e
  8. 29 6月, 2015 1 次提交
  9. 16 6月, 2015 1 次提交
  10. 12 6月, 2015 1 次提交
    • S
      net: don't wait for order-3 page allocation · fb05e7a8
      Shaohua Li 提交于
      We saw excessive direct memory compaction triggered by skb_page_frag_refill.
      This causes performance issues and add latency. Commit 5640f768
      introduces the order-3 allocation. According to the changelog, the order-3
      allocation isn't a must-have but to improve performance. But direct memory
      compaction has high overhead. The benefit of order-3 allocation can't
      compensate the overhead of direct memory compaction.
      
      This patch makes the order-3 page allocation atomic. If there is no memory
      pressure and memory isn't fragmented, the alloction will still success, so we
      don't sacrifice the order-3 benefit here. If the atomic allocation fails,
      direct memory compaction will not be triggered, skb_page_frag_refill will
      fallback to order-0 immediately, hence the direct memory compaction overhead is
      avoided. In the allocation failure case, kswapd is waken up and doing
      compaction, so chances are allocation could success next time.
      
      alloc_skb_with_frags is the same.
      
      The mellanox driver does similar thing, if this is accepted, we must fix
      the driver too.
      
      V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
      V2: make the changelog clearer
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Debabrata Banerjee <dbavatar@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb05e7a8
  11. 11 6月, 2015 1 次提交
    • M
      net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap · 5d753610
      Mel Gorman 提交于
      Jeff Layton reported the following;
      
       [   74.232485] ------------[ cut here ]------------
       [   74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
       [   74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
       [   74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
       [   74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       [   74.245546]  0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
       [   74.246786]  0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
       [   74.248175]  0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
       [   74.249483] Call Trace:
       [   74.249872]  [<ffffffff8179263d>] dump_stack+0x45/0x57
       [   74.250703]  [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0
       [   74.251655]  [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20
       [   74.252585]  [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80
       [   74.253519]  [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc]
       [   74.254537]  [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
       [   74.255610]  [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs]
       [   74.256582]  [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80
       [   74.257496]  [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0
       [   74.258318]  [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140
       [   74.259253]  [<ffffffff81798dae>] system_call_fastpath+0x12/0x71
       [   74.260158] ---[ end trace 2530722966429f10 ]---
      
      The warning in question was unnecessary but with Jeff's series the rules
      are also clearer.  This patch removes the warning and updates the comment
      to explain why sk_mem_reclaim() may still be called.
      
      [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon
                points out that it's not needed.]
      
      Cc: Leon Romanovsky <leon@leon.nu>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d753610
  12. 27 5月, 2015 1 次提交
  13. 18 5月, 2015 1 次提交
  14. 11 5月, 2015 3 次提交
  15. 04 5月, 2015 1 次提交
  16. 13 4月, 2015 1 次提交
  17. 07 4月, 2015 1 次提交
    • H
      ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org 提交于
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket ontop of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f60e5990
  18. 24 3月, 2015 1 次提交
  19. 21 3月, 2015 1 次提交
    • E
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet 提交于
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa76ce73
  20. 17 3月, 2015 2 次提交
  21. 13 3月, 2015 3 次提交
  22. 12 3月, 2015 1 次提交
    • E
      net: add real socket cookies · 33cf7c90
      Eric Dumazet 提交于
      A long standing problem in netlink socket dumps is the use
      of kernel socket addresses as cookies.
      
      1) It is a security concern.
      
      2) Sockets can be reused quite quickly, so there is
         no guarantee a cookie is used once and identify
         a flow.
      
      3) request sock, establish sock, and timewait socks
         for a given flow have different cookies.
      
      Part of our effort to bring better TCP statistics requires
      to switch to a different allocator.
      
      In this patch, I chose to use a per network namespace 64bit generator,
      and to use it only in the case a socket needs to be dumped to netlink.
      (This might be refined later if needed)
      
      Note that I tried to carry cookies from request sock, to establish sock,
      then timewait sockets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Eric Salo <salo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33cf7c90
  23. 11 3月, 2015 1 次提交
  24. 03 3月, 2015 1 次提交
  25. 02 3月, 2015 1 次提交
  26. 03 2月, 2015 1 次提交
    • W
      net-timestamp: no-payload only sysctl · b245be1f
      Willem de Bruijn 提交于
      Tx timestamps are looped onto the error queue on top of an skb. This
      mechanism leaks packet headers to processes unless the no-payload
      options SOF_TIMESTAMPING_OPT_TSONLY is set.
      
      Add a sysctl that optionally drops looped timestamp with data. This
      only affects processes without CAP_NET_RAW.
      
      The policy is checked when timestamps are generated in the stack.
      It is possible for timestamps with data to be reported after the
      sysctl is set, if these were queued internally earlier.
      
      No vulnerability is immediately known that exploits knowledge
      gleaned from packet headers, but it may still be preferable to allow
      administrators to lock down this path at the cost of possible
      breakage of legacy applications.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        (v1 -> v2)
        - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
        (rfc -> v1)
        - document the sysctl in Documentation/sysctl/net.txt
        - fix access control race: read .._OPT_TSONLY only once,
              use same value for permission check and skb generation.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b245be1f
  27. 06 12月, 2014 1 次提交
    • A
      net: sock: allow eBPF programs to be attached to sockets · 89aa0758
      Alexei Starovoitov 提交于
      introduce new setsockopt() command:
      
      setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))
      
      where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
      and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER
      
      setsockopt() calls bpf_prog_get() which increments refcnt of the program,
      so it doesn't get unloaded while socket is using the program.
      
      The same eBPF program can be attached to multiple sockets.
      
      User task exit automatically closes socket which calls sk_filter_uncharge()
      which decrements refcnt of eBPF program
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89aa0758
  28. 25 11月, 2014 1 次提交
    • D
      crypto: algif - add and use sock_kzfree_s() instead of memzero_explicit() · 79e88659
      Daniel Borkmann 提交于
      Commit e1bd95bf ("crypto: algif - zeroize IV buffer") and
      2a6af25b ("crypto: algif - zeroize message digest buffer")
      added memzero_explicit() calls on buffers that are later on
      passed back to sock_kfree_s().
      
      This is a discussed follow-up that, instead, extends the sock
      API and adds sock_kzfree_s(), which internally uses kzfree()
      instead of kfree() for passing the buffers back to slab.
      
      Having sock_kzfree_s() allows to keep the changes more minimal
      by just having a drop-in replacement instead of adding
      memzero_explicit() calls everywhere before sock_kfree_s().
      
      In kzfree(), the compiler is not allowed to optimize the memset()
      away and thus there's no need for memzero_explicit(). Both,
      sock_kfree_s() and sock_kzfree_s() are wrappers for
      __sock_kfree_s() and call into kfree() resp. kzfree(); here,
      __sock_kfree_s() needs to be explicitly inlined as we want the
      compiler to optimize the call and condition away and thus it
      produces e.g. on x86_64 the _same_ assembler output for
      sock_kfree_s() before and after, and thus also allows for
      avoiding code duplication.
      
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      79e88659
  29. 12 11月, 2014 1 次提交
    • E
      net: introduce SO_INCOMING_CPU · 2c8c56e1
      Eric Dumazet 提交于
      Alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of million of sockets into worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering on socket structure which cpu delivered last packet
      is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing around
      processes, applications can use :
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      And use this information to put the socket into the right silo
      for optimal performance, as all networking stack should run
      on the appropriate cpu, without need to send IPI (RPS/RFS).
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c8c56e1
  30. 06 11月, 2014 1 次提交
    • D
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller 提交于
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51f3d02b
  31. 15 10月, 2014 1 次提交
  32. 28 9月, 2014 1 次提交
  33. 20 9月, 2014 1 次提交
  34. 09 9月, 2014 1 次提交
  35. 06 9月, 2014 1 次提交