1. 05 Apr, 2016 (1 commit)
  2. 18 Mar, 2016 (1 commit)
    • mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim authored
      The success of CMA allocation largely depends on the success of
      migration, and a key factor in that is the page reference count.  Until
      now, the page reference count has been manipulated by calling atomic
      functions directly, so we cannot track who manipulates it and where.
      That makes it hard to find the actual reason for a CMA allocation
      failure.  CMA allocation should be guaranteed to succeed, so finding the
      offending place is really important.
      
      In this patch, call sites where the page reference count is manipulated
      are converted to the newly introduced wrapper functions.  This is a
      preparation step for adding a tracepoint to each page reference
      manipulation function.  With this facility, we can easily find the
      reason for a CMA allocation failure.  There is no functional change in
      this patch.
      
      In addition, this patch also converts reference read sites.  It will
      help the second step, which renames page._count to something else and
      prevents later attempts to access it directly (suggested by Andrew).
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe896d18
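      
      For reference, a minimal sketch of the wrapper style this patch
      introduces (modelled on the helpers added in include/linux/page_ref.h;
      simplified, and without the tracepoints that a later patch attaches):
      
          /* Call sites stop touching page->_count with atomic_*() directly
           * and go through named helpers, so a tracepoint can later be
           * attached in one place.
           */
          static inline int page_ref_count(struct page *page)
          {
              return atomic_read(&page->_count);
          }
      
          static inline void page_ref_inc(struct page *page)
          {
              atomic_inc(&page->_count);
          }
      
          /* Typical call-site conversion:
           *   before:  atomic_inc(&page->_count);
           *   after:   page_ref_inc(page);
           */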
  3. 26 Feb, 2016 (1 commit)
    • net: Facility to report route quality of connected sockets · a87cb3e4
      Tom Herbert authored
      This patch adds the SO_CNX_ADVICE socket option (setsockopt only).  The
      purpose is to allow an application to give feedback to the kernel about
      the quality of the network path for a connected socket.  The value
      argument indicates the type of quality report.  For this initial patch
      the only supported advice is a value of 1, which indicates "bad path,
      please reroute"; the action taken by the kernel is to call
      dst_negative_advice, which will attempt to choose a different ECMP
      route, reset the TX hash for the flow label and UDP source port in
      encapsulation, etc.
      
      This facility should be useful for connected UDP sockets where only the
      application can provide any feedback about path quality. It could also
      be useful for TCP applications that have additional knowledge about the
      path outside of the normal TCP control loop.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a87cb3e4
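      
      As a userspace illustration (a sketch, not part of the patch; the
      fallback #define is an assumption for headers that predate this
      option), advising the kernel that the current path looks bad could
      look like this:
      
          #include <sys/socket.h>
      
          #ifndef SO_CNX_ADVICE
          #define SO_CNX_ADVICE 53    /* assumed value for older headers */
          #endif
      
          /* Tell the kernel "bad path, please reroute" for a connected fd. */
          static int advise_bad_path(int fd)
          {
              int advice = 1;    /* only advice value supported by this patch */
      
              return setsockopt(fd, SOL_SOCKET, SO_CNX_ADVICE,
                                &advice, sizeof(advice));
          }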
  4. 11 Feb, 2016 (1 commit)
    • soreuseport: Prep for fast reuseport TCP socket selection · fa463497
      Craig Gallek authored
      Both of the lines in this patch probably should have been included
      in the initial implementation of this code for generic socket
      support, but weren't technically necessary since only UDP sockets
      were supported.
      
      First, the sk_reuseport_cb points to a structure which assumes
      each socket in the group has this pointer assigned at the same
      time it's added to the array in the structure.  The sk_clone_lock
      function breaks this assumption.  Since a child socket shouldn't
      implicitly be in a reuseport group, the simple fix is to clear
      the field in the clone.
      
      Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that
      SO_REUSEPORT also be set first.  For UDP sockets, this is easily
      enforced at bind-time since that process both puts the socket in
      the appropriate receive hlist and updates the reuseport structures.
      Since these operations can happen at two different times for TCP
      sockets (bind and listen), it must be explicitly checked in the
      setsockopt call to enforce the use of SO_REUSEPORT with
      SO_ATTACH_REUSEPORT_xBPF.
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa463497
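      
      A userspace sketch of the ordering the second point enforces (error
      handling reduced to return codes; the classic BPF program in prog is
      assumed to be prepared by the caller):
      
          #include <sys/socket.h>
          #include <linux/filter.h>
      
          #ifndef SO_ATTACH_REUSEPORT_CBPF
          #define SO_ATTACH_REUSEPORT_CBPF 51    /* assumed for older headers */
          #endif
      
          /* SO_REUSEPORT must already be set when the program is attached;
           * for TCP this is now checked in setsockopt() because bind() and
           * listen() are separate steps.
           */
          static int attach_reuseport_prog(int fd, struct sock_fprog *prog)
          {
              int one = 1;
      
              if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
                  return -1;
      
              return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                                prog, sizeof(*prog));
          }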
  5. 15 Jan, 2016 (4 commits)
  6. 05 Jan, 2016 (1 commit)
  7. 18 Dec, 2015 (1 commit)
    • net: check both type and protocol for tcp sockets · ac5cc977
      WANG Cong authored
      Dmitry reported the following out-of-bounds access:
      
      Call Trace:
       [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40
      mm/kasan/report.c:294
       [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
       [<     inline     >] SYSC_setsockopt net/socket.c:1746
       [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
       [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a
      arch/x86/entry/entry_64.S:185
      
      This is because we mistake a raw socket for a TCP socket.
      We should check both sk->sk_type and sk->sk_protocol to ensure
      it is a TCP socket.
      
      Willem points out that __skb_complete_tx_timestamp() needs the same
      fix as well.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac5cc977
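      
      The stricter test looks roughly like this (a sketch of the pattern; the
      helper name is illustrative, the actual change open-codes the check):
      
          /* A socket created with socket(AF_INET, SOCK_RAW, IPPROTO_TCP) has
           * sk_protocol == IPPROTO_TCP but is not a TCP socket, so both
           * fields must be tested before treating sk as one.
           */
          static bool sk_fully_tcp(const struct sock *sk)
          {
              return sk->sk_type == SOCK_STREAM &&
                     sk->sk_protocol == IPPROTO_TCP;
          }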
  8. 12 Dec, 2015 (1 commit)
  9. 09 Dec, 2015 (2 commits)
    • sock, cgroup: add sock->sk_cgroup · bd1060a1
      Tejun Heo authored
      In cgroup v1, dealing with cgroup membership was difficult because the
      number of membership associations was unbounded.  As a result, cgroup v1
      grew several controllers whose primary purpose is either tagging
      membership or pulling in configuration knobs from other subsystems so
      that cgroup membership tests can be avoided.
      
      net_cls and net_prio controllers are examples of the latter.  They
      allow configuring network-specific attributes from the cgroup side so
      that the network subsystem can avoid testing cgroup membership;
      unfortunately, these are not only cumbersome but also problematic.
      
      Neither net_cls nor net_prio is properly hierarchical.  Both inherit
      configuration from the parent on creation but there's no interaction
      afterwards.  An ancestor doesn't restrict the behavior in its subtree
      in any way and configuration changes aren't propagated downwards.
      Especially when combined with cgroup delegation, this is problematic
      because delegatees can mess up whatever network configuration is
      implemented at the system level.  net_prio would allow delegatees
      to set any priority value regardless of CAP_NET_ADMIN, and net_cls
      the same for classid.
      
      While it is possible to solve these issues from the controller side by
      implementing hierarchical allowable ranges in both controllers, it
      would involve quite a bit of complexity in the controllers and further
      obfuscate network configuration, as it becomes even more difficult to
      tell what's actually being configured when looking from the network
      side.  While not much can be done for v1 at this point, as membership
      handling is sane on cgroup v2, it'd be better to make cgroup matching
      behave like other network matches and classifiers than to introduce
      further complications.
      
      In preparation, this patch updates sock->sk_cgrp_data handling so that
      it points to the v2 cgroup that the sock was created in until either
      net_prio or net_cls is used.  Once either of the two is used,
      sock->sk_cgrp_data reverts to its previous role of carrying prioidx
      and classid.  This is to avoid adding yet another cgroup-related field
      to struct sock.
      
      As the mode switching can happen at most once per boot, the switching
      mechanism is aimed at lowering hot path overhead.  It may leak a
      finite, likely small, number of cgroup refs and report spurious
      prioidx or classid on switching; however, dynamic updates of prioidx
      and classid have always been racy and lossy - socks between creation
      and fd installation are never updated, config changes don't update
      existing sockets at all, and prioidx may index with dead and recycled
      cgroup IDs.  Non-critical inaccuracies from small race windows won't
      make any noticeable difference.
      
      This patch doesn't make use of the pointer yet.  The following patch
      will implement netfilter match for cgroup2 membership.
      
      v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
          cgroup specific field.
      
      v3: Add comments explaining why sock_data_prioidx() and
          sock_data_classid() use different fallback values.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd1060a1
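      
      Conceptually, the overloading works like the sketch below (a simplified
      illustration only; the real struct sock_cgroup_data in cgroup-defs.h
      packs these fields more carefully and handles the one-way mode switch):
      
          /* Default mode: carry a pointer to the v2 cgroup the socket was
           * created in.  Once net_prio or net_cls is used, switch (once,
           * system-wide) to carrying prioidx/classid instead.
           */
          struct sock_cgroup_data_sketch {
              union {
                  struct {
                      u16 prioidx;            /* v1 net_prio index */
                      u32 classid;            /* v1 net_cls classid */
                  };
                  struct cgroup *cgrp;        /* v2 cgroup, default mode */
              };
              bool is_v1;    /* set when net_prio or net_cls is first used */
          };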
    • net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct · 2a56a1fe
      Tejun Heo authored
      Introduce sock->sk_cgrp_data, which is a struct sock_cgroup_data.
      ->sk_cgroup_prioidx and ->sk_classid are moved into it.  The struct
      and its accessors are defined in cgroup-defs.h.  This is to prepare
      for overloading the fields with a cgroup pointer.
      
      This patch mostly performs equivalent conversions, but the following
      points are noteworthy.
      
      * The equality test before updating classid is removed from
        sock_update_classid().  This shouldn't make any noticeable
        difference, and a similar test will be implemented on the helper side
        later.
      
      * sock_update_netprioidx() now takes struct sock_cgroup_data and can
        be moved to netprio_cgroup.h without causing an include dependency
        loop.  Moved.
      
      * The dummy version of sock_update_netprioidx() is converted to a
        static inline function while at it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a56a1fe
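      
      A sketch of the accessor style this wrapping enables (names modelled on
      the cgroup-defs.h helpers; simplified):
      
          /* Callers stop touching sk->sk_cgrp_prioidx / sk->sk_classid
           * directly and go through accessors, which is what later allows
           * overloading the storage with a v2 cgroup pointer.
           */
          static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
          {
              return skcd->prioidx;
          }
      
          static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
                                                     u16 prioidx)
          {
              skcd->prioidx = prioidx;
          }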
  10. 06 Dec, 2015 (1 commit)
  11. 04 Dec, 2015 (1 commit)
    • ipv6: kill sk_dst_lock · 6bd4f355
      Eric Dumazet authored
      While testing the np->opt RCU conversion, I found that UDP/IPv6 was
      using a mixture of xchg() and sk_dst_lock to protect concurrent changes
      to sk->sk_dst_cache, leading to possible corruptions and crashes.
      
      ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
      way to fix the mess is to remove sk_dst_lock completely, as we did for
      IPv4.
      
      __ip6_dst_store() and ip6_dst_store() share the same implementation.
      
      Since sk_setup_caps() can be called with the socket lock held or not,
      we have to use sk_dst_set() instead of __sk_dst_set().
      
      Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
      in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
      
      This is because ip6_dst_store() can be called from process context,
      without any lock held.
      
      As soon as the dst is installed in sk->sk_dst_cache, the dst can be
      freed by another cpu doing a concurrent ip6_dst_store().
      
      Doing the dst dereference before doing the install is needed to make
      sure no use-after-free can trigger.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bd4f355
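      
      The ordering constraint described above looks roughly like this (a
      sketch with a simplified argument list; the real ip6_dst_store()
      derives rt from dst):
      
          /* Read everything we need from the dst *before* publishing it with
           * sk_dst_set(): once installed in sk->sk_dst_cache, a concurrent
           * ip6_dst_store() on another cpu may release it.
           */
          static void ip6_dst_store_sketch(struct sock *sk, struct dst_entry *dst,
                                           struct rt6_info *rt)
          {
              struct ipv6_pinfo *np = inet6_sk(sk);
      
              np->dst_cookie = rt6_get_cookie(rt);  /* dereference first...           */
              sk_setup_caps(sk, dst);               /* ...then install via sk_dst_set() */
          }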
  12. 02 Dec, 2015 (1 commit)
    • net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet authored
      This patch is a cleanup to make the following patch easier to
      review.
      
      The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async().
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int nr, struct sock *sk), are added so that
      the following patch can change their implementation.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9cd3e072
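      
      At this stage the helpers are thin wrappers, roughly (a sketch matching
      the description; the following patch changes where the flags live):
      
          /* The bits still live in sk->sk_socket->flags here; routing all
           * set/clear operations through these helpers lets the follow-up
           * patch move them into struct socket_wq without touching every
           * call site again.
           */
          static inline void sk_set_bit(int nr, struct sock *sk)
          {
              set_bit(nr, &sk->sk_socket->flags);
          }
      
          static inline void sk_clear_bit(int nr, struct sock *sk)
          {
              clear_bit(nr, &sk->sk_socket->flags);
          }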
  13. 01 Dec, 2015 (1 commit)
  14. 07 Nov, 2015 (1 commit)
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Mel Gorman authored
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be
      referred to as the "atomic reserve".  __GFP_HIGH users get access to the
      first lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0164adc
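      
      The helper mentioned in the third bullet reduces to a single flag test,
      essentially:
      
          /* __GFP_DIRECT_RECLAIM is the one flag that means "may sleep /
           * may enter direct reclaim", so "can this allocation block?" is
           * just that bit.
           */
          static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
          {
              return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
          }
      
          /* Replacing the historical check:
           *   before:  if (gfp_mask & __GFP_WAIT) ...
           *   after:   if (gfpflags_allow_blocking(gfp_mask)) ...
           */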
  15. 03 Nov, 2015 (1 commit)
  16. 28 Oct, 2015 (1 commit)
  17. 13 Oct, 2015 (2 commits)
    • net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet authored
      SO_INCOMING_CPU, as added in commit 2c8c56e1, was a getsockopt() command
      to fetch the incoming cpu handling a particular TCP flow after accept().
      
      This commit adds setsockopt() support and extends the SO_REUSEPORT
      selection logic: if a TCP listener or UDP socket has this option set, a
      packet is delivered to this socket only if the CPU handling the packet
      matches the specified one.
      
      This allows building very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keeps cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners.  The following patch puts sk_refcnt in a different cache
      line to let this iteration hit only shared and read-mostly cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70da268b
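      
      A per-RX-queue listener thread might pin itself like this (userspace
      sketch; error handling elided, and the fallback #define is an
      assumption for older headers):
      
          #include <sys/socket.h>
      
          #ifndef SO_INCOMING_CPU
          #define SO_INCOMING_CPU 49    /* assumed value for older headers */
          #endif
      
          /* The thread servicing this listener runs on `cpu`; with
           * SO_INCOMING_CPU set, SO_REUSEPORT selection only delivers flows
           * whose softirq processing happened on that same cpu.
           */
          static int pin_listener_to_cpu(int fd, int cpu)
          {
              return setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU,
                                &cpu, sizeof(cpu));
          }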
    • sock: support per-packet fwmark · f28ea365
      Edward Jee authored
      It's useful to allow users to set the fwmark for an individual packet,
      without changing the socket state.  The function this patch adds in the
      sock layer can be used by protocols that need such a feature.
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f28ea365
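      
      A userspace sketch, assuming the per-packet mark is passed as a
      SOL_SOCKET/SO_MARK control message on sendmsg() (setting marks
      requires CAP_NET_ADMIN):
      
          #include <string.h>
          #include <stdint.h>
          #include <sys/socket.h>
      
          static ssize_t send_with_mark(int fd, const void *buf, size_t len,
                                        uint32_t mark)
          {
              char cbuf[CMSG_SPACE(sizeof(mark))];
              struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
              struct msghdr msg = {
                  .msg_iov = &iov, .msg_iovlen = 1,
                  .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
              };
              struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      
              /* per-packet fwmark, overriding the socket-wide SO_MARK */
              cmsg->cmsg_level = SOL_SOCKET;
              cmsg->cmsg_type = SO_MARK;
              cmsg->cmsg_len = CMSG_LEN(sizeof(mark));
              memcpy(CMSG_DATA(cmsg), &mark, sizeof(mark));
      
              return sendmsg(fd, &msg, 0);
          }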
  18. 04 Oct, 2015 (1 commit)
    • tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets · e96f78ab
      Eric Dumazet authored
      Before letting request sockets be put in the regular TCP/DCCP ehash
      table, we need to either:
      
      - add the SLAB_DESTROY_BY_RCU flag to their kmem_cache, or
      - add an RCU grace period before freeing them.
      
      Since we carefully respected the SLAB_DESTROY_BY_RCU protocol for
      ESTABLISHED and TIMEWAIT sockets, use it here as well.
      
      Since req_prot_init() is only used by TCP and DCCP, I did not add
      a new slab_flags field to their rsk_prot, but reuse prot->slab_flags.
      
      Since all reqsk_alloc() users correctly deal with a failure,
      add the __GFP_NOWARN flag to avoid traces under pressure.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e96f78ab
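      
      The flag is passed at kmem_cache creation time; a sketch (cache name
      and size are illustrative, the real code reuses prot->slab_flags in
      req_prot_init()):
      
          /* With SLAB_DESTROY_BY_RCU the slab pages are only returned to the
           * page allocator after an RCU grace period, so a lockless lookup
           * that races with a free can still safely read the object and must
           * then re-validate it after taking a reference.
           */
          static struct kmem_cache *reqsk_cachep;
      
          static int reqsk_cache_init(void)
          {
              reqsk_cachep = kmem_cache_create("request_sock_TCP",
                                               sizeof(struct tcp_request_sock), 0,
                                               SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU,
                                               NULL);
              return reqsk_cachep ? 0 : -ENOMEM;
          }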
  19. 16 Sep, 2015 (1 commit)
  20. 28 Aug, 2015 (1 commit)
  21. 31 Jul, 2015 (1 commit)
  22. 27 Jul, 2015 (1 commit)
  23. 01 Jul, 2015 (1 commit)
    • sock_diag: don't broadcast kernel sockets · b922622e
      Craig Gallek authored
      Kernel sockets do not hold a reference for the network namespace to
      which they point.  Socket destruction broadcasting relies on the
      network namespace and will cause the splat below when a kernel socket
      is destroyed.
      
      This fix simply ignores kernel sockets when they are destroyed.
      
      Reported as:
      general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1
      Workqueue: sock_diag_events sock_diag_broadcast_destroy_work
      Stack:
       ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90
       ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90
       ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8
      Call Trace:
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag]
       [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20
       [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
       [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160
       [<ffffffff8408ea97>] process_one_work+0x147/0x420
       [<ffffffff8408f0f9>] worker_thread+0x69/0x470
       [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0
       [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320
       [<ffffffff84093cd7>] kthread+0x107/0x120
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
       [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70
       [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
      
      Tested:
        Using a debug kernel while 'ss -E' is running:
        ip netns add test-ns
        ip netns delete test-ns
      
      Fixes: eb4cb008 ("sock_diag: define destruction multicast groups")
      Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b922622e
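      
      A sketch of the guard this implies (the exact predicate used in the
      patch may differ; sk_net_refcnt is the flag added by commit 26abe143
      that records whether the socket holds a netns reference):
      
          /* Kernel sockets do not pin the netns they point to, so skip the
           * destruction broadcast for them instead of dereferencing a
           * namespace that may already be gone.
           */
          static bool sock_destroy_broadcast_allowed(const struct sock *sk)
          {
              return sk->sk_net_refcnt != 0;
          }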
  24. 29 Jun, 2015 (1 commit)
  25. 16 Jun, 2015 (1 commit)
  26. 12 Jun, 2015 (1 commit)
    • net: don't wait for order-3 page allocation · fb05e7a8
      Shaohua Li authored
      We saw excessive direct memory compaction triggered by
      skb_page_frag_refill.  This causes performance issues and adds latency.
      Commit 5640f768 introduced the order-3 allocation.  According to the
      changelog, the order-3 allocation isn't a must-have but an attempt to
      improve performance.  But direct memory compaction has high overhead.
      The benefit of the order-3 allocation can't compensate for the overhead
      of direct memory compaction.
      
      This patch makes the order-3 page allocation atomic.  If there is no
      memory pressure and memory isn't fragmented, the allocation will still
      succeed, so we don't sacrifice the order-3 benefit here.  If the atomic
      allocation fails, direct memory compaction will not be triggered and
      skb_page_frag_refill will fall back to order-0 immediately, hence the
      direct memory compaction overhead is avoided.  In the allocation failure
      case, kswapd is woken up to do compaction, so chances are the allocation
      will succeed next time.
      
      alloc_skb_with_frags is handled the same way.
      
      The Mellanox driver does a similar thing; if this is accepted, we must
      fix that driver too.
      
      V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
      V2: make the changelog clearer
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Debabrata Banerjee <dbavatar@gmail.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fb05e7a8
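      
      The pattern described above (try the high-order allocation without
      entering direct reclaim/compaction, then fall back to order-0) looks
      roughly like this sketch, using the flag names of this kernel, which
      still has __GFP_WAIT:
      
          static struct page *frag_alloc_sketch(gfp_t gfp)
          {
              struct page *page;
      
              /* Opportunistic order-3 attempt: dropping __GFP_WAIT avoids
               * direct reclaim/compaction, __GFP_NORETRY avoids retry loops;
               * kswapd is still woken to do background compaction.
               */
              page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
                                 __GFP_NOWARN | __GFP_NORETRY,
                                 3 /* SKB_FRAG_PAGE_ORDER */);
              if (page)
                  return page;
      
              /* immediate order-0 fallback, no compaction stall */
              return alloc_pages(gfp, 0);
          }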
  27. 11 Jun, 2015 (1 commit)
    • net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap · 5d753610
      Mel Gorman authored
      Jeff Layton reported the following:
      
       [   74.232485] ------------[ cut here ]------------
       [   74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
       [   74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
       [   74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
       [   74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       [   74.245546]  0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
       [   74.246786]  0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
       [   74.248175]  0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
       [   74.249483] Call Trace:
       [   74.249872]  [<ffffffff8179263d>] dump_stack+0x45/0x57
       [   74.250703]  [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0
       [   74.251655]  [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20
       [   74.252585]  [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80
       [   74.253519]  [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc]
       [   74.254537]  [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
       [   74.255610]  [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs]
       [   74.256582]  [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80
       [   74.257496]  [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0
       [   74.258318]  [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140
       [   74.259253]  [<ffffffff81798dae>] system_call_fastpath+0x12/0x71
       [   74.260158] ---[ end trace 2530722966429f10 ]---
      
      The warning in question was unnecessary, but with Jeff's series the rules
      are also clearer.  This patch removes the warning and updates the comment
      to explain why sk_mem_reclaim() may still be called.
      
      [jlayton: remove the if (sk->sk_forward_alloc) conditional.  As Leon
                points out, it's not needed.]
      
      Cc: Leon Romanovsky <leon@leon.nu>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d753610
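      
      After this change the function is essentially (a sketch of the shape
      described above):
      
          void sk_clear_memalloc(struct sock *sk)
          {
              sock_reset_flag(sk, SOCK_MEMALLOC);
              sk->sk_allocation &= ~__GFP_MEMALLOC;
              static_key_slow_dec(&memalloc_socks);
      
              /* Previously only called under a WARN_ON(sk->sk_forward_alloc)
               * conditional; now called unconditionally, for the reasons the
               * updated comment explains (see the changelog above).
               */
              sk_mem_reclaim(sk);
          }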
  28. 27 May, 2015 (1 commit)
  29. 18 May, 2015 (1 commit)
  30. 11 May, 2015 (3 commits)
  31. 04 May, 2015 (1 commit)
  32. 13 Apr, 2015 (1 commit)
  33. 07 Apr, 2015 (1 commit)
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org authored
      We should not consult skb->sk for output decisions at xmit recursion
      levels > 0 in the stack.  Otherwise local socket settings could influence
      the result of e.g. the tunnel encapsulation process.
      
      ipv6 does not conform to this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f60e5990
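      
      A sketch of the guarded pattern for case 1) above (helper name
      illustrative; dev_recursion_level() is assumed to be the accessor this
      patch provides around xmit_recursion):
      
          static unsigned int ip6_frag_size_sketch(const struct sk_buff *skb,
                                                   unsigned int mtu)
          {
              /* Only consult the socket at xmit recursion level 0: inside
               * tunnel encapsulation the skb no longer belongs to the
               * originating socket's flow, so its settings must not leak in.
               */
              struct ipv6_pinfo *np = (skb->sk && !dev_recursion_level()) ?
                                          inet6_sk(skb->sk) : NULL;
      
              if (np && np->frag_size)
                  mtu = min_t(unsigned int, mtu, np->frag_size);
              return mtu;
          }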