1. 02 8月, 2012 1 次提交
  2. 01 8月, 2012 5 次提交
    • M
      netvm: prevent a stream-specific deadlock · c76562b6
      Mel Gorman 提交于
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option so if swap if
      required then swapping over the network is considered.  The two likely
      scenarios are when blade servers are used as part of a cluster where the
      form factor or maintenance costs do not allow the use of disks and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information to what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great with a swap stress test
      taking roughly twice as long to complete than if the swap device was
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
      this change it applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
      If this happens, a warning is generated and the tokens reclaimed to avoid
      accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c76562b6
    • M
      netvm: set PF_MEMALLOC as appropriate during SKB processing · b4b9e355
      Mel Gorman 提交于
      In order to make sure pfmemalloc packets receive all memory needed to
      proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
      This is limited to a subset of protocols that are expected to be used for
      writing to swap.  Taps are not allowed to use PF_MEMALLOC as these are
      expected to communicate with userspace processes which could be paged out.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [jslaby@suse.cz: Lock imbalance fix]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4b9e355
    • M
      netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Mel Gorman 提交于
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c93bdd0e
    • M
      netvm: allow the use of __GFP_MEMALLOC by specific sockets · 7cb02404
      Mel Gorman 提交于
      Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
      for their allocations.  These sockets will be able to go below watermarks
      and allocate from the emergency reserve.  Such sockets are to be used to
      service the VM (iow.  to swap over).  They must be handled kernel side,
      exposing such a socket to user-space is a bug.
      
      There is a risk that the reserves be depleted so for now, the
      administrator is responsible for increasing min_free_kbytes as necessary
      to prevent deadlock for their workloads.
      
      [a.p.zijlstra@chello.nl: Original patches]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cb02404
    • A
      memcg: rename config variables · c255a458
      Andrew Morton 提交于
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c255a458
  3. 23 7月, 2012 1 次提交
    • J
      net: netprio_cgroup: rework update socket logic · 406a3c63
      John Fastabend 提交于
      Instead of updating the sk_cgrp_prioidx struct field on every send
      this only updates the field when a task is moved via cgroup
      infrastructure.
      
      This allows sockets that may be used by a kernel worker thread
      to be managed. For example in the iscsi case today a user can
      put iscsid in a netprio cgroup and control traffic will be sent
      with the correct sk_cgrp_prioidx value set but as soon as data
      is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
      is updated with the kernel worker threads value which is the
      default case.
      
      It seems more correct to only update the field when the user
      explicitly sets it via control group infrastructure. This allows
      the users to manage sockets that may be used with other threads.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      406a3c63
  4. 12 7月, 2012 1 次提交
    • E
      tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet 提交于
      This introduce TSQ (TCP Small Queues)
      
      TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
      device queues), to reduce RTT and cwnd bias, part of the bufferbloat
      problem.
      
      sk->sk_wmem_alloc not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
      given time.
      
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      
      As a side effect, setting the limit to 40000 automatically reduces the
      standard gso max limit (65536) to 40000/2 : It can help to reduce
      latencies of high prio packets, having smaller TSO packets.
      
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      
      I no longer have 4 MBytes backlogged in qdisc by a single netperf
      session, and both side socket autotuning no longer use 4 Mbytes.
      
      As skb destructor cannot restart xmit itself ( as qdisc lock might be
      taken at this point ), we delegate the work to a tasklet. We use one
      tasklest per cpu for performance reasons.
      
      If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments.
      
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46d3ceab
  5. 20 6月, 2012 1 次提交
    • D
      ipv4: Early TCP socket demux. · 41063e9d
      David S. Miller 提交于
      Input packet processing for local sockets involves two major demuxes.
      One for the route and one for the socket.
      
      But we can optimize this down to one demux for certain kinds of local
      sockets.
      
      Currently we only do this for established TCP sockets, but it could
      at least in theory be expanded to other kinds of connections.
      
      If a TCP socket is established then it's identity is fully specified.
      
      This means that whatever input route was used during the three-way
      handshake must work equally well for the rest of the connection since
      the keys will not change.
      
      Once we move to established state, we cache the receive packet's input
      route to use later.
      
      Like the existing cached route in sk->sk_dst_cache used for output
      packets, we have to check for route invalidations using dst->obsolete
      and dst->ops->check().
      
      Early demux occurs outside of a socket locked section, so when a route
      invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
      actually inside of established state packet processing and thus have
      the socket locked.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41063e9d
  6. 01 6月, 2012 1 次提交
  7. 17 5月, 2012 2 次提交
  8. 09 5月, 2012 1 次提交
  9. 03 5月, 2012 2 次提交
  10. 01 5月, 2012 1 次提交
  11. 29 4月, 2012 1 次提交
  12. 27 4月, 2012 1 次提交
  13. 24 4月, 2012 1 次提交
    • E
      net: add a limit parameter to sk_add_backlog() · f545a38f
      Eric Dumazet 提交于
      sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
      memory limit. We need to make this limit a parameter for TCP use.
      
      No functional change expected in this patch, all callers still using the
      old sk_rcvbuf limit.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Rick Jones <rick.jones2@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f545a38f
  14. 22 4月, 2012 1 次提交
  15. 16 4月, 2012 1 次提交
  16. 11 4月, 2012 1 次提交
    • G
      cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg · 1d62e436
      Glauber Costa 提交于
      The only reason cgroup was used, was to be consistent with the populate()
      interface. Now that we're getting rid of it, not only we no longer need
      it, but we also *can't* call it this way.
      
      Since we will no longer rely on populate(), this will be called from
      create(). During create, the association between struct mem_cgroup
      and struct cgroup does not yet exist, since cgroup internals hasn't
      yet initialized its bookkeeping. This means we would not be able
      to draw the memcg pointer from the cgroup pointer in these
      functions, which is highly undesirable.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      CC: Li Zefan <lizefan@huawei.com>
      CC: Johannes Weiner <hannes@cmpxchg.org>
      CC: Michal Hocko <mhocko@suse.cz>
      1d62e436
  17. 29 3月, 2012 1 次提交
  18. 25 2月, 2012 1 次提交
  19. 24 2月, 2012 2 次提交
  20. 22 2月, 2012 1 次提交
    • P
      sock: Introduce the SO_PEEK_OFF sock option · ef64a54f
      Pavel Emelyanov 提交于
      This one specifies where to start MSG_PEEK-ing queue data from. When
      set to negative value means that MSG_PEEK works as ususally -- peeks
      from the head of the queue always.
      
      When some bytes are peeked from queue and the peeking offset is non
      negative it is moved forward so that the next peek will return next
      portion of data.
      
      When non-peeking recvmsg occurs and the peeking offset is non negative
      is is moved backward so that the next peek will still peek the proper
      data (i.e. the one that would have been picked if there were no non
      peeking recv in between).
      
      The offset is set using per-proto opteration to let the protocol handle
      the locking issues and to check whether the peeking offset feature is
      supported by the protocol the socket belongs to.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef64a54f
  21. 11 2月, 2012 1 次提交
  22. 03 2月, 2012 1 次提交
    • L
      cgroup: remove cgroup_subsys argument from callbacks · 761b3ef5
      Li Zefan 提交于
      The argument is not used at all, and it's not necessary, because
      a specific callback handler of course knows which subsys it
      belongs to.
      
      Now only ->pupulate() takes this argument, because the handlers of
      this callback always call cgroup_add_file()/cgroup_add_files().
      
      So we reduce a few lines of code, though the shrinking of object size
      is minimal.
      
       16 files changed, 113 insertions(+), 162 deletions(-)
      
         text    data     bss     dec     hex filename
      5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
      5486170  656987 7039960 13183117         c9288d vmlinux.o
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      761b3ef5
  23. 23 1月, 2012 1 次提交
    • G
      net: introduce res_counter_charge_nofail() for socket allocations · 0e90b31f
      Glauber Costa 提交于
      There is a case in __sk_mem_schedule(), where an allocation
      is beyond the maximum, but yet we are allowed to proceed.
      It happens under the following condition:
      
      	sk->sk_wmem_queued + size >= sk->sk_sndbuf
      
      The network code won't revert the allocation in this case,
      meaning that at some point later it'll try to do it. Since
      this is never communicated to the underlying res_counter
      code, there is an inbalance in res_counter uncharge operation.
      
      I see two ways of fixing this:
      
      1) storing the information about those allocations somewhere
         in memcg, and then deducting from that first, before
         we start draining the res_counter,
      2) providing a slightly different allocation function for
         the res_counter, that matches the original behavior of
         the network code more closely.
      
      I decided to go for #2 here, believing it to be more elegant,
      since #1 would require us to do basically that, but in a more
      obscure way.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      CC: Tejun Heo <tj@kernel.org>
      CC: Li Zefan <lizf@cn.fujitsu.com>
      CC: Laurent Chavey <chavey@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e90b31f
  24. 10 1月, 2012 1 次提交
  25. 09 1月, 2012 1 次提交
  26. 08 1月, 2012 1 次提交
    • G
      net: fix sock_clone reference mismatch with tcp memcontrol · f3f511e1
      Glauber Costa 提交于
      Sockets can also be created through sock_clone. Because it copies
      all data in the sock structure, it also copies the memcg-related pointer,
      and all should be fine. However, since we now use reference counts in
      socket creation, we are left with some sockets that have no reference
      counts. It matters when we destroy them, since it leads to a mismatch.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Greg Thelen <gthelen@google.com>
      CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Laurent Chavey <chavey@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3f511e1
  27. 23 12月, 2011 1 次提交
  28. 17 12月, 2011 1 次提交
    • G
      net: fix sleeping while atomic problem in sock mem_cgroup. · 36b77a52
      Glauber Costa 提交于
      We can't scan the proto_list to initialize sock cgroups, as it
      holds a rwlock, and we also want to keep the code generic enough to
      avoid calling the initialization functions of protocols directly,
      
      Convert proto_list_lock into a mutex, so we can sleep and do the
      necessary allocations. This lock is seldom taken, so there shouldn't
      be any performance penalties associated with that
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Stephen Rothwell <sfr@canb.auug.org.au>
      CC: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      36b77a52
  29. 13 12月, 2011 3 次提交
  30. 29 11月, 2011 1 次提交
  31. 23 11月, 2011 1 次提交
    • N
      net: add network priority cgroup infrastructure (v4) · 5bc1421e
      Neil Horman 提交于
      This patch adds in the infrastructure code to create the network priority
      cgroup.  The cgroup, in addition to the standard processes file creates two
      control files:
      
      1) prioidx - This is a read-only file that exports the index of this cgroup.
      This is a value that is both arbitrary and unique to a cgroup in this subsystem,
      and is used to index the per-device priority map
      
      2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
      <name:priority> where name is the name of a network interface and priority is
      indicates the priority assigned to frames egresessing on the named interface and
      originating from a pid in this cgroup
      
      This cgroup allows for skb priority to be set prior to a root qdisc getting
      selected. This is benenficial for DCB enabled systems, in that it allows for any
      application to use dcb configured priorities so without application modification
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      CC: Robert Love <robert.w.love@intel.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5bc1421e