1. 02 8月, 2012 1 次提交
  2. 01 8月, 2012 10 次提交
    • M
      netvm: prevent a stream-specific deadlock · c76562b6
      Mel Gorman 提交于
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option so if swap if
      required then swapping over the network is considered.  The two likely
      scenarios are when blade servers are used as part of a cluster where the
      form factor or maintenance costs do not allow the use of disks and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information to what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great with a swap stress test
      taking roughly twice as long to complete than if the swap device was
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
      this change it applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
      If this happens, a warning is generated and the tokens reclaimed to avoid
      accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c76562b6
    • M
      netvm: set PF_MEMALLOC as appropriate during SKB processing · b4b9e355
      Mel Gorman 提交于
      In order to make sure pfmemalloc packets receive all memory needed to
      proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
      This is limited to a subset of protocols that are expected to be used for
      writing to swap.  Taps are not allowed to use PF_MEMALLOC as these are
      expected to communicate with userspace processes which could be paged out.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [jslaby@suse.cz: Lock imbalance fix]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4b9e355
    • M
      netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Mel Gorman 提交于
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c93bdd0e
    • M
      netvm: allow the use of __GFP_MEMALLOC by specific sockets · 7cb02404
      Mel Gorman 提交于
      Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
      for their allocations.  These sockets will be able to go below watermarks
      and allocate from the emergency reserve.  Such sockets are to be used to
      service the VM (iow.  to swap over).  They must be handled kernel side,
      exposing such a socket to user-space is a bug.
      
      There is a risk that the reserves be depleted so for now, the
      administrator is responsible for increasing min_free_kbytes as necessary
      to prevent deadlock for their workloads.
      
      [a.p.zijlstra@chello.nl: Original patches]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cb02404
    • M
      net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket · 99a1dec7
      Mel Gorman 提交于
      Introduce sk_gfp_atomic(), this function allows to inject sock specific
      flags to each sock related allocation.  It is only used on allocation
      paths that may be required for writing pages back to network storage.
      
      [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99a1dec7
    • A
      memcg: rename config variables · c255a458
      Andrew Morton 提交于
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c255a458
    • D
      ipv4: Properly purge netdev references on uncached routes. · caacf05e
      David S. Miller 提交于
      When a device is unregistered, we have to purge all of the
      references to it that may exist in the entire system.
      
      If a route is uncached, we currently have no way of accomplishing
      this.
      
      So create a global list that is scanned when a network device goes
      down.  This mirrors the logic in net/core/dst.c's dst_ifdown().
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      caacf05e
    • D
      c5038a83
    • E
      ipv4: percpu nh_rth_output cache · d26b3a7c
      Eric Dumazet 提交于
      Input path is mostly run under RCU and doesnt touch dst refcnt
      
      But output path on forwarding or UDP workloads hits
      badly dst refcount, and we have lot of false sharing, for example
      in ipv4_mtu() when reading rt->rt_pmtu
      
      Using a percpu cache for nh_rth_output gives a nice performance
      increase at a small cost.
      
      24 udpflood test on my 24 cpu machine (dummy0 output device)
      (each process sends 1.000.000 udp frames, 24 processes are started)
      
      before : 5.24 s
      after : 2.06 s
      For reference, time on linux-3.5 : 6.60 s
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Tested-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d26b3a7c
    • E
      ipv4: Restore old dst_free() behavior. · 54764bb6
      Eric Dumazet 提交于
      commit 404e0a8b (net: ipv4: fix RCU races on dst refcounts) tried
      to solve a race but added a problem at device/fib dismantle time :
      
      We really want to call dst_free() as soon as possible, even if sockets
      still have dst in their cache.
      dst_release() calls in free_fib_info_rcu() are not welcomed.
      
      Root of the problem was that now we also cache output routes (in
      nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
      rt_free(), because output route lookups are done in process context.
      
      Based on feedback and initial patch from David Miller (adding another
      call_rcu_bh() call in fib, but it appears it was not the right fix)
      
      I left the inet_sk_rx_dst_set() helper and added __rcu attributes
      to nh_rth_output and nh_rth_input to better document what is going on in
      this code.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54764bb6
  3. 31 7月, 2012 2 次提交
    • E
      ipv4: remove rt_cache_rebuild_count · 0c7462a2
      Eric Dumazet 提交于
      After IP route cache removal, rt_cache_rebuild_count is no longer
      used.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c7462a2
    • E
      net: ipv4: fix RCU races on dst refcounts · 404e0a8b
      Eric Dumazet 提交于
      commit c6cffba4 (ipv4: Fix input route performance regression.)
      added various fatal races with dst refcounts.
      
      crashes happen on tcp workloads if routes are added/deleted at the same
      time.
      
      The dst_free() calls from free_fib_info_rcu() are clearly racy.
      
      We need instead regular dst refcounting (dst_release()) and make
      sure dst_release() is aware of RCU grace periods :
      
      Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
      before dst destruction for cached dst
      
      Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
      to make sure we dont increase a zero refcount (On a dst currently
      waiting an rcu grace period before destruction)
      
      rt_cache_route() must take a reference on the new cached route, and
      release it if was not able to install it.
      
      With this patch, my machines survive various benchmarks.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      404e0a8b
  4. 27 7月, 2012 2 次提交
  5. 24 7月, 2012 2 次提交
    • D
      ipv4: Change rt->rt_iif encoding. · 13378cad
      David S. Miller 提交于
      On input packet processing, rt->rt_iif will be zero if we should
      use skb->dev->ifindex.
      
      Since we access rt->rt_iif consistently via inet_iif(), that is
      the only spot whose interpretation have to adjust.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13378cad
    • D
      ipv4: Prepare for change of rt->rt_iif encoding. · 92101b3b
      David S. Miller 提交于
      Use inet_iif() consistently, and for TCP record the input interface of
      cached RX dst in inet sock.
      
      rt->rt_iif is going to be encoded differently, so that we can
      legitimately cache input routes in the FIB info more aggressively.
      
      When the input interface is "use SKB device index" the rt->rt_iif will
      be set to zero.
      
      This forces us to move the TCP RX dst cache installation into the ipv4
      specific code, and as well it should since doing the route caching for
      ipv6 is pointless at the moment since it is not inspected in the ipv6
      input paths yet.
      
      Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
      obsolete set to a non-zero value to force invocation of the check
      callback.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92101b3b
  6. 23 7月, 2012 4 次提交
    • E
      tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet 提交于
      ICMP messages generated in output path if frame length is bigger than
      mtu are actually lost because socket is owned by user (doing the xmit)
      
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      
      Problem of such fix is that it relied on retransmit timers, so short tcp
      sessions paid a too big latency increase price.
      
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Diagnosed-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      563d34d0
    • A
      get rid of ->scm_work_list · 6120d3db
      Al Viro 提交于
      recursion in __scm_destroy() will be cut by delaying final fput()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6120d3db
    • J
      net: netprio_cgroup: rework update socket logic · 406a3c63
      John Fastabend 提交于
      Instead of updating the sk_cgrp_prioidx struct field on every send
      this only updates the field when a task is moved via cgroup
      infrastructure.
      
      This allows sockets that may be used by a kernel worker thread
      to be managed. For example in the iscsi case today a user can
      put iscsid in a netprio cgroup and control traffic will be sent
      with the correct sk_cgrp_prioidx value set but as soon as data
      is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
      is updated with the kernel worker threads value which is the
      default case.
      
      It seems more correct to only update the field when the user
      explicitly sets it via control group infrastructure. This allows
      the users to manage sockets that may be used with other threads.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      406a3c63
    • N
      sctp: Implement quick failover draft from tsvwg · 5aa93bcf
      Neil Horman 提交于
      I've seen several attempts recently made to do quick failover of sctp transports
      by reducing various retransmit timers and counters.  While its possible to
      implement a faster failover on multihomed sctp associations, its not
      particularly robust, in that it can lead to unneeded retransmits, as well as
      false connection failures due to intermittent latency on a network.
      
      Instead, lets implement the new ietf quick failover draft found here:
      http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
      
      This will let the sctp stack identify transports that have had a small number of
      errors, and avoid using them quickly until their reliability can be
      re-established.  I've tested this out on two virt guests connected via multiple
      isolated virt networks and believe its in compliance with the above draft and
      works well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: Sridhar Samudrala <sri@us.ibm.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: linux-sctp@vger.kernel.org
      CC: joe@perches.com
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5aa93bcf
  7. 21 7月, 2012 19 次提交