1. 12 Jun 2015, 1 commit
    • net: don't wait for order-3 page allocation · fb05e7a8
      Authored by Shaohua Li
      We saw excessive direct memory compaction triggered by skb_page_frag_refill.
      This causes performance issues and adds latency. Commit 5640f768
      introduced the order-3 allocation. According to its changelog, the order-3
      allocation isn't a must-have; it is only there to improve performance. But direct
      memory compaction has high overhead, and the benefit of the order-3 allocation
      can't compensate for that overhead.
      
      This patch makes the order-3 page allocation atomic. If there is no memory
      pressure and memory isn't fragmented, the allocation will still succeed, so we
      don't sacrifice the order-3 benefit here. If the atomic allocation fails,
      direct memory compaction is not triggered and skb_page_frag_refill falls
      back to order-0 immediately, so the direct memory compaction overhead is
      avoided. In the allocation-failure case, kswapd is woken up to do
      compaction, so chances are the allocation will succeed next time.
      
      alloc_skb_with_frags is changed in the same way.
      
      The Mellanox driver does a similar thing; if this is accepted, we must fix
      that driver too.
      
      V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
      V2: make the changelog clearer
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Debabrata Banerjee <dbavatar@gmail.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
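      
      A minimal sketch of the resulting strategy, modeled on skb_page_frag_refill()
      (GFP flag names follow the 4.1-era API; the exact mask in the patch may differ):
      
          /* One atomic high-order attempt, then an immediate order-0 fallback.
           * Clearing __GFP_WAIT keeps the failure path out of direct reclaim and
           * compaction; kswapd is still woken to compact in the background. */
          if (SKB_FRAG_PAGE_ORDER) {
                  pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
                                            __GFP_NOWARN | __GFP_NORETRY,
                                            SKB_FRAG_PAGE_ORDER);
                  if (likely(pfrag->page)) {
                          pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
                          return true;
                  }
          }
          /* order-3 failed or is disabled: fall back to a plain order-0 page */
          pfrag->page = alloc_page(gfp);
          if (likely(pfrag->page)) {
                  pfrag->size = PAGE_SIZE;
                  return true;
          }
          return false;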
  2. 11 Jun 2015, 1 commit
    • net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap · 5d753610
      Authored by Mel Gorman
      Jeff Layton reported the following;
      
       [   74.232485] ------------[ cut here ]------------
       [   74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
       [   74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
       [   74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
       [   74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       [   74.245546]  0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
       [   74.246786]  0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
       [   74.248175]  0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
       [   74.249483] Call Trace:
       [   74.249872]  [<ffffffff8179263d>] dump_stack+0x45/0x57
       [   74.250703]  [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0
       [   74.251655]  [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20
       [   74.252585]  [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80
       [   74.253519]  [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc]
       [   74.254537]  [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
       [   74.255610]  [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs]
       [   74.256582]  [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80
       [   74.257496]  [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0
       [   74.258318]  [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140
       [   74.259253]  [<ffffffff81798dae>] system_call_fastpath+0x12/0x71
       [   74.260158] ---[ end trace 2530722966429f10 ]---
      
      The warning in question was unnecessary, but with Jeff's series the rules
      are also clearer.  This patch removes the warning and updates the comment
      to explain why sk_mem_reclaim() may still be called.
      
      [jlayton: remove the if (sk->sk_forward_alloc) conditional, as Leon
                points out it's not needed.]
      
      Cc: Leon Romanovsky <leon@leon.nu>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
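      
      For reference, a sketch of what sk_clear_memalloc() looks like after this
      change (comment paraphrased; see net/core/sock.c for the exact wording):
      
          void sk_clear_memalloc(struct sock *sk)
          {
                  sock_reset_flag(sk, SOCK_MEMALLOC);
                  sk->sk_allocation &= ~__GFP_MEMALLOC;
                  static_key_slow_dec(&memalloc_socks);
      
                  /* SOCK_MEMALLOC was allowed to ignore rmem limits to guarantee
                   * forward progress while swapping over the network. Once swap is
                   * deactivated, reclaim those reserves so the socket obeys the
                   * normal rmem limits again; this is expected, so no warning. */
                  sk_mem_reclaim(sk);
          }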
  3. 27 May 2015, 1 commit
  4. 18 May 2015, 1 commit
  5. 11 May 2015, 3 commits
  6. 04 May 2015, 1 commit
  7. 13 Apr 2015, 1 commit
  8. 07 Apr 2015, 1 commit
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      Authored by hannes@stressinduktion.org
      We should not consult skb->sk for output decisions at xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. the tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
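      
      A sketch of the resulting guard, reusing the per-cpu xmit_recursion counter
      (helper name as added by this patch; the surrounding caller code is illustrative):
      
          /* include/linux/netdevice.h */
          static inline int dev_recursion_level(void)
          {
                  return this_cpu_read(xmit_recursion);
          }
      
          /* e.g. in the ipv6 output path: only trust skb->sk at the outermost
           * xmit level, never while recursing through tunnel encapsulation */
          struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
                                  inet6_sk(skb->sk) : NULL;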
  9. 24 Mar 2015, 1 commit
  10. 21 Mar 2015, 1 commit
    • inet: get rid of central tcp/dccp listener timer · fa76ce73
      Authored by Eric Dumazet
      One of the major issues for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awfully long times, with the socket lock held,
      meaning that other cpus needing this lock have to spin for hundreds of ms.
      
      SYNACKs are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With a following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      The host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more than we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
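      
      Illustrative sketch of the new per-request timer (names believed to match the
      patch, but treat the details as approximate):
      
          /* firing the timer only touches the request sock, so the handler never
           * takes the listener lock and can run on any cpu in parallel */
          static void reqsk_timer_handler(unsigned long data)
          {
                  struct request_sock *req = (struct request_sock *)data;
      
                  /* retransmit the SYNACK, or expire and drop the request */
          }
      
          /* when the request sock is created and hashed: */
          setup_timer(&req->rsk_timer, reqsk_timer_handler, (unsigned long)req);
          mod_timer_pinned(&req->rsk_timer, jiffies + timeout);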
  11. 17 Mar 2015, 2 commits
  12. 13 Mar 2015, 3 commits
  13. 12 Mar 2015, 1 commit
    • net: add real socket cookies · 33cf7c90
      Authored by Eric Dumazet
      A long standing problem in netlink socket dumps is the use
      of kernel socket addresses as cookies.
      
      1) It is a security concern.
      
      2) Sockets can be reused quite quickly, so there is
         no guarantee a cookie is used only once and identifies
         a flow.
      
      3) Request socks, established socks, and timewait socks
         for a given flow have different cookies.
      
      Part of our effort to bring better TCP statistics requires
      switching to a different allocator.
      
      In this patch, I chose to use a per-network-namespace 64-bit generator,
      and to use it only when a socket needs to be dumped to netlink.
      (This might be refined later if needed.)
      
      Note that I tried to carry cookies from request sock to established sock,
      and then to timewait sockets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Eric Salo <salo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
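      
      A sketch of the lazy per-netns cookie assignment described above (field names
      are illustrative; the idea is a 64-bit counter in struct net plus an atomic64_t
      cookie in the socket):
      
          static u64 sock_gen_cookie(struct sock *sk)
          {
                  for (;;) {
                          u64 res = atomic64_read(&sk->sk_cookie);
      
                          if (res)        /* already assigned */
                                  return res;
                          /* draw the next value from the per-namespace generator and
                           * publish it, unless another cpu won the race */
                          res = atomic64_inc_return(&sock_net(sk)->cookie_gen);
                          atomic64_cmpxchg(&sk->sk_cookie, 0, res);
                  }
          }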
  14. 11 Mar 2015, 1 commit
  15. 03 Mar 2015, 1 commit
  16. 02 Mar 2015, 1 commit
  17. 03 Feb 2015, 1 commit
    • net-timestamp: no-payload only sysctl · b245be1f
      Authored by Willem de Bruijn
      Tx timestamps are looped onto the error queue on top of an skb. This
      mechanism leaks packet headers to processes unless the no-payload
      option SOF_TIMESTAMPING_OPT_TSONLY is set.
      
      Add a sysctl that optionally drops looped timestamps carrying data. This
      only affects processes without CAP_NET_RAW.
      
      The policy is checked when timestamps are generated in the stack.
      It is possible for timestamps with data to be reported after the
      sysctl is set, if these were queued internally earlier.
      
      No vulnerability is immediately known that exploits knowledge
      gleaned from packet headers, but it may still be preferable to allow
      administrators to lock down this path at the cost of possible
      breakage of legacy applications.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        (v1 -> v2)
        - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
        (rfc -> v1)
        - document the sysctl in Documentation/sysctl/net.txt
        - fix access control race: read .._OPT_TSONLY only once,
              use same value for permission check and skb generation.
      Signed-off-by: David S. Miller <davem@davemloft.net>
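      
      Sketch of the check performed where the tx timestamp is generated (the sysctl
      added here is net.core.tstamp_allow_data; the helper shape is approximate):
      
          static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly)
          {
                  bool ret;
      
                  /* always ok if no payload is looped, or if the admin allows it */
                  if (tsonly || sysctl_tstamp_allow_data)
                          return true;
      
                  /* otherwise only sockets opened with CAP_NET_RAW may receive
                   * looped packet headers and payload on the error queue */
                  read_lock_bh(&sk->sk_callback_lock);
                  ret = sk->sk_socket && sk->sk_socket->file &&
                        file_ns_capable(sk->sk_socket->file, &init_user_ns, CAP_NET_RAW);
                  read_unlock_bh(&sk->sk_callback_lock);
                  return ret;
          }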
  18. 06 Dec 2014, 1 commit
    • net: sock: allow eBPF programs to be attached to sockets · 89aa0758
      Authored by Alexei Starovoitov
      Introduce a new setsockopt() command:
      
      setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))
      
      where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
      and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER
      
      setsockopt() calls bpf_prog_get(), which increments the refcnt of the program
      so that it doesn't get unloaded while a socket is using it.
      
      The same eBPF program can be attached to multiple sockets.
      
      User task exit automatically closes the socket, which calls sk_filter_uncharge(),
      which in turn decrements the refcnt of the eBPF program.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
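      
      A hedged userspace sketch of loading and attaching such a program (error
      handling omitted; on pre-3.19 headers __NR_bpf and SO_ATTACH_BPF may need to
      be defined by hand):
      
          #include <linux/bpf.h>
          #include <sys/socket.h>
          #include <sys/syscall.h>
          #include <unistd.h>
          #include <string.h>
      
          int main(void)
          {
                  /* trivial filter: "return -1", i.e. accept the whole packet */
                  struct bpf_insn prog[] = {
                          { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = 0, .imm = -1 },
                          { .code = BPF_JMP | BPF_EXIT },
                  };
                  union bpf_attr attr;
                  int sock, prog_fd;
      
                  memset(&attr, 0, sizeof(attr));
                  attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
                  attr.insns     = (__u64)(unsigned long)prog;
                  attr.insn_cnt  = sizeof(prog) / sizeof(prog[0]);
                  attr.license   = (__u64)(unsigned long)"GPL";
      
                  prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
                  sock = socket(AF_INET, SOCK_DGRAM, 0);
      
                  /* the program now filters every packet queued to this socket;
                   * bpf_prog_get() keeps it loaded for as long as it is attached */
                  setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
                  return 0;
          }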
  19. 25 Nov 2014, 1 commit
    • crypto: algif - add and use sock_kzfree_s() instead of memzero_explicit() · 79e88659
      Authored by Daniel Borkmann
      Commit e1bd95bf ("crypto: algif - zeroize IV buffer") and
      2a6af25b ("crypto: algif - zeroize message digest buffer")
      added memzero_explicit() calls on buffers that are later on
      passed back to sock_kfree_s().
      
      This is a discussed follow-up that, instead, extends the sock
      API and adds sock_kzfree_s(), which internally uses kzfree()
      instead of kfree() for passing the buffers back to slab.
      
      Having sock_kzfree_s() keeps the changes minimal by providing a
      drop-in replacement instead of adding memzero_explicit() calls
      everywhere before sock_kfree_s().
      
      In kzfree(), the compiler is not allowed to optimize the memset()
      away, so there's no need for memzero_explicit(). Both sock_kfree_s()
      and sock_kzfree_s() are wrappers around __sock_kfree_s() and call
      into kfree() resp. kzfree(). __sock_kfree_s() needs to be explicitly
      inlined because we want the compiler to optimize the call and the
      condition away; this produces, e.g. on x86_64, the _same_ assembler
      output for sock_kfree_s() before and after, while still avoiding
      code duplication.
      
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
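      
      The resulting helpers, roughly (sketch; see net/core/sock.c for the exact
      version):
      
          static inline void __sock_kfree_s(struct sock *sk, void *mem, int size,
                                            const bool nullify)
          {
                  if (nullify)
                          kzfree(mem);    /* memset that cannot be optimized away, then kfree */
                  else
                          kfree(mem);
                  atomic_sub(size, &sk->sk_omem_alloc);
          }
      
          void sock_kfree_s(struct sock *sk, void *mem, int size)
          {
                  __sock_kfree_s(sk, mem, size, false);
          }
      
          void sock_kzfree_s(struct sock *sk, void *mem, int size)
          {
                  __sock_kfree_s(sk, mem, size, true);
          }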
  20. 12 Nov 2014, 1 commit
    • net: introduce SO_INCOMING_CPU · 2c8c56e1
      Authored by Eric Dumazet
      An alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of millions of sockets across worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know, after accept() or connect(), on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering in the socket structure which cpu delivered the last
      packet is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing around
      processes, applications can use:
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      And use this information to put the socket into the right silo
      for optimal performance, as the whole networking stack should then run
      on the appropriate cpu, without the need to send IPIs (RPS/RFS).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
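      
      A small userspace sketch of the dispatch pattern this enables
      (worker_for_cpu() and enqueue_fd() are hypothetical application helpers,
      not kernel or libc API):
      
          static void dispatch(int fd)
          {
                  int cpu = -1;
                  socklen_t len = sizeof(cpu);
      
                  if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) || cpu < 0)
                          cpu = 0;        /* unknown: fall back to an arbitrary worker */
      
                  /* hand the socket to the worker thread pinned to the cpu that
                   * received it, so the whole stack keeps running locally */
                  enqueue_fd(worker_for_cpu(cpu), fd);
          }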
  21. 06 Nov 2014, 1 commit
    • net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      Authored by David S. Miller
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be fewer places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: David S. Miller <davem@davemloft.net>
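      
      The helper itself is a thin wrapper (sketch of the added inline, matching the
      call signature described above):
      
          static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
                                                  struct msghdr *msg, int size)
          {
                  /* exactly the call every caller used to open-code */
                  return skb_copy_datagram_iovec(from, offset, msg->msg_iov, size);
          }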
  22. 15 Oct 2014, 1 commit
  23. 28 Sep 2014, 1 commit
  24. 20 Sep 2014, 1 commit
  25. 09 Sep 2014, 1 commit
  26. 06 Sep 2014, 3 commits
  27. 02 Sep 2014, 1 commit
    • sock: deduplicate errqueue dequeue · 364a9e93
      Authored by Willem de Bruijn
      sk->sk_error_queue is dequeued in four locations. All share the
      exact same logic. Deduplicate.
      
      Also collapse the two critical sections for dequeue (at the top of
      the recv handler) and signal (at the bottom).
      
      This moves signal generation for the next packet forward, which should
      be harmless.
      
      It also changes the behavior if the recv handler exits early with an
      error. Previously, a signal for follow-up packets on the errqueue
      would then not be scheduled. The new behavior, to always signal, is
      arguably a bug fix.
      
      For rxrpc, the change causes the same function to be called repeatedly
      for each queued packet (because the recv handler == sk_error_report).
      It is likely that all packets will fail for the same reason (e.g.,
      memory exhaustion).
      
      This code runs without the sk_lock held, so it is not safe to trust that
      sk->sk_err is immutable in between releasing q->lock and the subsequent
      test. Introduce int err just to avoid this potential race.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
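      
      A sketch of the deduplicated dequeue helper described above (close to the
      sock_dequeue_err_skb() added by this patch):
      
          struct sk_buff *sock_dequeue_err_skb(struct sock *sk)
          {
                  struct sk_buff_head *q = &sk->sk_error_queue;
                  struct sk_buff *skb, *skb_next;
                  unsigned long flags;
                  int err = 0;
      
                  spin_lock_irqsave(&q->lock, flags);
                  skb = __skb_dequeue(q);
                  if (skb && (skb_next = skb_peek(q)) != NULL)
                          err = SKB_EXT_ERR(skb_next)->ee.ee_errno;
                  spin_unlock_irqrestore(&q->lock, flags);
      
                  /* use the local err: sk->sk_err could change once q->lock is
                   * dropped, so it must not be re-read for the signal decision */
                  sk->sk_err = err;
                  if (err)
                          sk->sk_error_report(sk);
      
                  return skb;
          }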
  28. 30 Aug 2014, 1 commit
    • net: attempt a single high order allocation · d9b2938a
      Authored by Eric Dumazet
      In commit ed98df33 ("net: use __GFP_NORETRY for high order
      allocations") we tried to address one issue caused by order-3
      allocations.
      
      We still observe high latencies and system overhead in situations where
      compaction is not successful.
      
      Instead of trying order-3, order-2, and order-1, do a single order-3
      best-effort attempt and immediately fall back to plain order-0.
      
      This mimics the SLUB strategy of falling back to the slab minimum order
      if the high-order allocation used for performance fails.
      
      Order-3 allocations give a performance boost only if they can be done
      without a recurring and expensive memory scan.
      
      Quoting David :
      
      The page allocator relies on synchronous (sync light) memory compaction
      after direct reclaim for allocations that don't retry and deferred
      compaction doesn't work with this strategy because the allocation order
      is always decreasing from the previous failed attempt.
      
      This means sync light compaction will always be encountered if memory
      cannot be defragmented or reclaimed several times during the
      skb_page_frag_refill() iteration.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 23 Aug 2014, 1 commit
  30. 06 Aug 2014, 3 commits
    • net-timestamp: TCP timestamping · 4ed2d765
      Authored by Willem de Bruijn
      TCP timestamping extends SO_TIMESTAMPING to bytestreams.
      
      Bytestreams do not have a 1:1 relationship between send() buffers and
      network packets. The feature interprets a send call on a bytestream as
      a request for a timestamp for the last byte in that send() buffer.
      
      The choice corresponds to a request for a timestamp when all bytes in
      the buffer have been sent. That assumption depends on in-order kernel
      transmission. This is the common case. That said, it is possible to
      construct a traffic shaping tree that would result in reordering.
      The guarantee is strong, then, but not ironclad.
      
      This implementation supports send and sendpages (splice). GSO replaces
      one large packet with multiple smaller packets. This patch also copies
      the option into the correct smaller packet.
      
      This patch does not yet support timestamping on data in an initial TCP
      Fast Open SYN, because that takes a very different data path.
      
      If ID generation in ee_data is enabled, bytestream timestamps return a
      byte offset, instead of the packet counter for datagrams.
      
      The implementation supports a single timestamp per packet. It silently
      replaces requests for previous timestamps. To avoid missing tstamps,
      flush the tcp queue by disabling Nagle, cork and autocork. Missing
      tstamps can be detected by offset when the ee_data ID is enabled.
      
      Implementation details:
      
      - On GSO, the timestamping code can be included in the main loop. I
      moved it into its own loop to reduce the impact on the common case
      to a single branch.
      
      - To avoid leaking the absolute seqno to userspace, the offset
      returned in ee_data must always be relative. It is an offset between
      an skb and sk field. The first is always set (also for GSO & ACK).
      The second must also never be uninitialized. Only allow the ID
      option on sockets in the ESTABLISHED state, for which the seqno
      is available. Never reset it to zero (instead, move it to the
      current seqno when reenabling the option).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
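      
      A hedged userspace sketch of enabling bytestream timestamps on an already
      connected TCP socket (flag combination illustrative; tcp_fd is assumed to be
      that socket):
      
          #include <linux/net_tstamp.h>
          #include <sys/socket.h>
      
          static void enable_tx_timestamps(int tcp_fd)
          {
                  /* request software tx timestamps plus the per-send IDs described
                   * in the next entry; each send() then asks for a timestamp of
                   * its last byte */
                  int val = SOF_TIMESTAMPING_TX_SOFTWARE |
                            SOF_TIMESTAMPING_SOFTWARE |
                            SOF_TIMESTAMPING_OPT_ID;
      
                  setsockopt(tcp_fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
          }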
    • net-timestamp: add key to disambiguate concurrent datagrams · 09c2d251
      Authored by Willem de Bruijn
      Datagrams timestamped on transmission can coexist in the kernel stack
      and be reordered in packet scheduling. When reading looped datagrams
      from the socket error queue, it is not always possible to uniquely
      correlate looped data with the original send() call (for application-
      level retransmits). Even if possible, it may be expensive and complex,
      requiring packet inspection.
      
      Introduce a data-independent ID mechanism to associate timestamps with
      send calls. Pass an ID alongside the timestamp in field ee_data of
      sock_extended_err.
      
      The ID is a simple 32 bit unsigned int that is associated with the
      socket and incremented on each send() call for which software tx
      timestamp generation is enabled.
      
      The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
      avoid changing ee_data for existing applications that expect it to be 0.
      The counter is reset each time the flag is reenabled. Reenabling
      does not change the ID of already submitted data. It is possible
      to receive out-of-order IDs if the timestamp stream is not quiesced
      first.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
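      
      And a sketch of reading a looped timestamp notification back and extracting
      the ID from ee_data (error handling trimmed; tcp_fd as in the sketch above;
      the timestamp value itself arrives in a separate SCM_TIMESTAMPING cmsg):
      
          #include <errno.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/socket.h>
          #include <netinet/in.h>
          #include <linux/errqueue.h>
      
          static void read_one_tstamp_id(int tcp_fd)
          {
                  char ctrl[512];
                  struct msghdr msg;
                  struct cmsghdr *cm;
      
                  memset(&msg, 0, sizeof(msg));
                  msg.msg_control    = ctrl;
                  msg.msg_controllen = sizeof(ctrl);
      
                  if (recvmsg(tcp_fd, &msg, MSG_ERRQUEUE) < 0)
                          return;
      
                  for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                          struct sock_extended_err *serr;
      
                          if (cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR)
                                  continue;
                          serr = (struct sock_extended_err *)CMSG_DATA(cm);
                          if (serr->ee_errno == ENOMSG &&
                              serr->ee_origin == SO_EE_ORIGIN_TIMESTAMPING)
                                  /* ee_data: byte offset for TCP streams,
                                   * send counter for datagram sockets */
                                  printf("timestamp for id %u\n", serr->ee_data);
                  }
          }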
    • net-timestamp: move timestamp flags out of sk_flags · b9f40e21
      Authored by Willem de Bruijn
      sk_flags is reaching its limit. New timestamping options will not fit.
      Move all of them into a new field sk->sk_tsflags.
      
      An added benefit is that this removes boilerplate code to convert between
      SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.
      
      SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
      timestamp logic (netstamp_needed). That can be simplified and this
      last key removed, but I will leave that for a separate patch.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
      though that scatters tstamp fields throughout the struct.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 03 Aug 2014, 1 commit