1. 14 7月, 2016 1 次提交
    • W
      dccp: limit sk_filter trim to payload · 4f0c40d9
      Willem de Bruijn 提交于
      Dccp verifies packet integrity, including length, at initial rcv in
      dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.
      
      A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
      skb_copy_datagram_msg interprets this as a negative value, so
      (correctly) fails with EFAULT. The negative length is reported in
      ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.
      
      Introduce an sk_receive_skb variant that caps how small a filter
      program can trim packets, and call this in dccp with the header
      length. Excessively trimmed packets are now processed normally and
      queued for reception as 0B payloads.
      
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f0c40d9
  2. 12 7月, 2016 1 次提交
  3. 04 5月, 2016 1 次提交
    • E
      net: add __sock_wfree() helper · 1d2077ac
      Eric Dumazet 提交于
      Hosts sending lot of ACK packets exhibit high sock_wfree() cost
      because of cache line miss to test SOCK_USE_WRITE_QUEUE
      
      We could move this flag close to sk_wmem_alloc but it is better
      to perform the atomic_sub_and_test() on a clean cache line,
      as it avoid one extra bus transaction.
      
      skb_orphan_partial() can also have a fast track for packets that either
      are TCP acks, or already went through another skb_orphan_partial()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d2077ac
  4. 03 5月, 2016 2 次提交
    • E
      tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Eric Dumazet 提交于
      Large sendmsg()/write() hold socket lock for the duration of the call,
      unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
      are parked into socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out of order queue with additional cpu
      overhead, and also possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt since the backlog can become
      quite big if the copy from user space triggers IO (page faults)
      
      Some applications learnt to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      Kernel should know better, right ?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested) and checking socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership,
      so is not equivalent to a {release_sock(sk); lock_sock(sk);},
      to ensure implicit atomicity rules that sendmsg() was
      giving to (possibly buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d41a69f1
    • E
      net: do not block BH while processing socket backlog · 5413d1ba
      Eric Dumazet 提交于
      Socket backlog processing is a major latency source.
      
      With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
      holding cpu for more than 5 ms, and packets being dropped by the NIC
      once ring buffer is filled.
      
      All users are now ready to be called from process context,
      we can unblock BH and let interrupts be serviced faster.
      
      cond_resched_softirq() could be removed, as it has no more user.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5413d1ba
  5. 08 4月, 2016 1 次提交
  6. 07 4月, 2016 1 次提交
  7. 06 4月, 2016 2 次提交
  8. 05 4月, 2016 5 次提交
  9. 18 3月, 2016 1 次提交
    • J
      mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim 提交于
      The success of CMA allocation largely depends on the success of
      migration and key factor of it is page reference count.  Until now, page
      reference is manipulated by direct calling atomic functions so we cannot
      follow up who and where manipulate it.  Then, it is hard to find actual
      reason of CMA allocation failure.  CMA allocation should be guaranteed
      to succeed so finding offending place is really important.
      
      In this patch, call sites where page reference is manipulated are
      converted to introduced wrapper function.  This is preparation step to
      add tracepoint to each page reference manipulation function.  With this
      facility, we can easily find reason of CMA allocation failure.  There is
      no functional change in this patch.
      
      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempt to direct access to it (Suggested by Andrew).
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe896d18
  10. 26 2月, 2016 1 次提交
    • T
      net: Facility to report route quality of connected sockets · a87cb3e4
      Tom Herbert 提交于
      This patch add the SO_CNX_ADVICE socket option (setsockopt only). The
      purpose is to allow an application to give feedback to the kernel about
      the quality of the network path for a connected socket. The value
      argument indicates the type of quality report. For this initial patch
      the only supported advice is a value of 1 which indicates "bad path,
      please reroute"-- the action taken by the kernel is to call
      dst_negative_advice which will attempt to choose a different ECMP route,
      reset the TX hash for flow label and UDP source port in encapsulation,
      etc.
      
      This facility should be useful for connected UDP sockets where only the
      application can provide any feedback about path quality. It could also
      be useful for TCP applications that have additional knowledge about the
      path outside of the normal TCP control loop.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a87cb3e4
  11. 11 2月, 2016 1 次提交
    • C
      soreuseport: Prep for fast reuseport TCP socket selection · fa463497
      Craig Gallek 提交于
      Both of the lines in this patch probably should have been included
      in the initial implementation of this code for generic socket
      support, but weren't technically necessary since only UDP sockets
      were supported.
      
      First, the sk_reuseport_cb points to a structure which assumes
      each socket in the group has this pointer assigned at the same
      time it's added to the array in the structure.  The sk_clone_lock
      function breaks this assumption.  Since a child socket shouldn't
      implicitly be in a reuseport group, the simple fix is to clear
      the field in the clone.
      
      Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that
      SO_REUSEPORT also be set first.  For UDP sockets, this is easily
      enforced at bind-time since that process both puts the socket in
      the appropriate receive hlist and updates the reuseport structures.
      Since these operations can happen at two different times for TCP
      sockets (bind and listen) it must be explicitly checked to enforce
      the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the
      setsockopt call.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa463497
  12. 15 1月, 2016 4 次提交
  13. 05 1月, 2016 1 次提交
  14. 18 12月, 2015 1 次提交
    • W
      net: check both type and procotol for tcp sockets · ac5cc977
      WANG Cong 提交于
      Dmitry reported the following out-of-bound access:
      
      Call Trace:
       [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40
      mm/kasan/report.c:294
       [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
       [<     inline     >] SYSC_setsockopt net/socket.c:1746
       [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
       [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a
      arch/x86/entry/entry_64.S:185
      
      This is because we mistake a raw socket as a tcp socket.
      We should check both sk->sk_type and sk->sk_protocol to ensure
      it is a tcp socket.
      
      Willem points out __skb_complete_tx_timestamp() needs to fix as well.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac5cc977
  15. 12 12月, 2015 1 次提交
  16. 09 12月, 2015 2 次提交
    • T
      sock, cgroup: add sock->sk_cgroup · bd1060a1
      Tejun Heo 提交于
      In cgroup v1, dealing with cgroup membership was difficult because the
      number of membership associations was unbound.  As a result, cgroup v1
      grew several controllers whose primary purpose is either tagging
      membership or pull in configuration knobs from other subsystems so
      that cgroup membership test can be avoided.
      
      net_cls and net_prio controllers are examples of the latter.  They
      allow configuring network-specific attributes from cgroup side so that
      network subsystem can avoid testing cgroup membership; unfortunately,
      these are not only cumbersome but also problematic.
      
      Both net_cls and net_prio aren't properly hierarchical.  Both inherit
      configuration from the parent on creation but there's no interaction
      afterwards.  An ancestor doesn't restrict the behavior in its subtree
      in anyway and configuration changes aren't propagated downwards.
      Especially when combined with cgroup delegation, this is problematic
      because delegatees can mess up whatever network configuration
      implemented at the system level.  net_prio would allow the delegatees
      to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
      the same for classid.
      
      While it is possible to solve these issues from controller side by
      implementing hierarchical allowable ranges in both controllers, it
      would involve quite a bit of complexity in the controllers and further
      obfuscate network configuration as it becomes even more difficult to
      tell what's actually being configured looking from the network side.
      While not much can be done for v1 at this point, as membership
      handling is sane on cgroup v2, it'd be better to make cgroup matching
      behave like other network matches and classifiers than introducing
      further complications.
      
      In preparation, this patch updates sock->sk_cgrp_data handling so that
      it points to the v2 cgroup that sock was created in until either
      net_prio or net_cls is used.  Once either of the two is used,
      sock->sk_cgrp_data reverts to its previous role of carrying prioidx
      and classid.  This is to avoid adding yet another cgroup related field
      to struct sock.
      
      As the mode switching can happen at most once per boot, the switching
      mechanism is aimed at lowering hot path overhead.  It may leak a
      finite, likely small, number of cgroup refs and report spurious
      prioidx or classid on switching; however, dynamic updates of prioidx
      and classid have always been racy and lossy - socks between creation
      and fd installation are never updated, config changes don't update
      existing sockets at all, and prioidx may index with dead and recycled
      cgroup IDs.  Non-critical inaccuracies from small race windows won't
      make any noticeable difference.
      
      This patch doesn't make use of the pointer yet.  The following patch
      will implement netfilter match for cgroup2 membership.
      
      v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
          cgroup specific field.
      
      v3: Add comments explaining why sock_data_prioidx() and
          sock_data_classid() use different fallback values.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd1060a1
    • T
      net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct · 2a56a1fe
      Tejun Heo 提交于
      Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
      ->sk_cgroup_prioidx and ->sk_classid are moved into it.  The struct
      and its accessors are defined in cgroup-defs.h.  This is to prepare
      for overloading the fields with a cgroup pointer.
      
      This patch mostly performs equivalent conversions but the followings
      are noteworthy.
      
      * Equality test before updating classid is removed from
        sock_update_classid().  This shouldn't make any noticeable
        difference and a similar test will be implemented on the helper side
        later.
      
      * sock_update_netprioidx() now takes struct sock_cgroup_data and can
        be moved to netprio_cgroup.h without causing include dependency
        loop.  Moved.
      
      * The dummy version of sock_update_netprioidx() converted to a static
        inline function while at it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a56a1fe
  17. 06 12月, 2015 1 次提交
  18. 04 12月, 2015 1 次提交
    • E
      ipv6: kill sk_dst_lock · 6bd4f355
      Eric Dumazet 提交于
      While testing the np->opt RCU conversion, I found that UDP/IPv6 was
      using a mixture of xchg() and sk_dst_lock to protect concurrent changes
      to sk->sk_dst_cache, leading to possible corruptions and crashes.
      
      ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
      way to fix the mess is to remove sk_dst_lock completely, as we did for
      IPv4.
      
      __ip6_dst_store() and ip6_dst_store() share same implementation.
      
      sk_setup_caps() being called with socket lock being held or not,
      we have to use sk_dst_set() instead of __sk_dst_set()
      
      Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
      in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
      
      This is because ip6_dst_store() can be called from process context,
      without any lock held.
      
      As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
      from another cpu doing a concurrent ip6_dst_store()
      
      Doing the dst dereference before doing the install is needed to make
      sure no use after free would trigger.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bd4f355
  19. 02 12月, 2015 1 次提交
    • E
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet 提交于
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cd3e072
  20. 01 12月, 2015 1 次提交
  21. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  22. 03 11月, 2015 1 次提交
  23. 28 10月, 2015 1 次提交
  24. 13 10月, 2015 2 次提交
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet 提交于
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch incoming cpu handling a particular TCP flow after accept()
      
      This commits adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows to build very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keep cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da268b
    • E
      sock: support per-packet fwmark · f28ea365
      Edward Jee 提交于
      It's useful to allow users to set fwmark for an individual packet,
      without changing the socket state. The function this patch adds in
      sock layer can be used by the protocols that need such a feature.
      Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f28ea365
  25. 04 10月, 2015 1 次提交
    • E
      tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets · e96f78ab
      Eric Dumazet 提交于
      Before letting request sockets being put in TCP/DCCP regular
      ehash table, we need to add either :
      
      - SLAB_DESTROY_BY_RCU flag to their kmem_cache
      - add RCU grace period before freeing them.
      
      Since we carefully respected the SLAB_DESTROY_BY_RCU protocol
      like ESTABLISH and TIMEWAIT sockets, use it here.
      
      req_prot_init() being only used by TCP and DCCP, I did not add
      a new slab_flags into their rsk_prot, but reuse prot->slab_flags
      
      Since all reqsk_alloc() users are correctly dealing with a failure,
      add the __GFP_NOWARN flag to avoid traces under pressure.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e96f78ab
  26. 16 9月, 2015 1 次提交
  27. 28 8月, 2015 1 次提交
  28. 31 7月, 2015 1 次提交
  29. 27 7月, 2015 1 次提交