1. 19 Dec 2012 (2 commits)
    • sl[au]b: always get the cache from its page in kmem_cache_free() · b9ce5ef4
      Authored by Glauber Costa
      struct page already has this information.  If we start chaining caches,
      this information will always be more trustworthy than whatever is passed
      into the function.  (A short sketch of the idea follows this entry.)
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9ce5ef4
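      A minimal sketch of the idea, assuming the SLUB layout of that era
      (virt_to_head_page() and the page->slab_cache back-pointer); the helper
      name cache_from_obj comes from the patch, but the body here is trimmed
      for illustration:

        /* Sketch: resolve the cache from the object's page, not the caller. */
        static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s,
                                                        void *x)
        {
                struct page *page = virt_to_head_page(x);

                /* The owning page records which cache its objects belong to. */
                return page->slab_cache ? page->slab_cache : s;
        }

        void kmem_cache_free(struct kmem_cache *s, void *x)
        {
                s = cache_from_obj(s, x); /* may differ once caches are chained */
                slab_free(s, virt_to_head_page(x), x, _RET_IP_);
        }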
    • slab/slub: consider a memcg parameter in kmem_create_cache · 2633d7a0
      Authored by Glauber Costa
      Allow a memcg parameter to be passed during cache creation.  When the slub
      allocator is used, it will only merge caches that belong to the same
      memcg.  We do this by scanning the global list and then translating the
      cache to a memcg-specific cache.

      The default function is created as a wrapper that passes NULL to the memcg
      version.  (A sketch of the wrapper follows this entry.)

      A helper, memcg_css_id, is provided because slub needs a unique cache name
      for sysfs.  Since this entry is visible but is not the canonical location
      for slab data, the cache name is not reused; the css_id should suffice.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2633d7a0
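      A minimal sketch of the wrapper shape described above; the memcg-aware
      entry point is assumed to be called kmem_cache_create_memcg, and its
      memcg-specific handling is elided:

        /* Sketch: the existing API becomes a wrapper that passes no memcg. */
        struct kmem_cache *
        kmem_cache_create(const char *name, size_t size, size_t align,
                          unsigned long flags, void (*ctor)(void *))
        {
                return kmem_cache_create_memcg(NULL, name, size, align,
                                               flags, ctor);
        }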
  2. 12 Dec 2012 (1 commit)
    • slub, hotplug: ignore unrelated node's hot-adding and hot-removing · b9d5ab25
      Authored by Lai Jiangshan
      SLUB only focuses on the nodes which have normal memory, and it ignores
      other nodes' hot-adding and hot-removing.

      That is: if memory is onlined on a node that previously had no online
      memory, but the newly onlined memory is not normal memory (for example,
      highmem), we should not allocate a kmem_cache_node for SLUB.

      And if the last normal memory of a node is offlined but the node still has
      memory, we should remove the kmem_cache_node for that node.  (The current
      code delays this until all of the node's memory is offlined.)

      So we only do something when marg->status_change_nid_normal > 0;
      marg->status_change_nid is not suitable here.

      The same problem does not exist in SLAB, because SLAB allocates a
      kmem_list3 for every node even if the node has no normal memory; SLAB
      tolerates kmem_list3 on alien nodes.  SLUB only focuses on nodes that have
      normal memory and does not tolerate alien kmem_cache_node structures.
      This patch makes SLUB self-consistent and avoids WARNs and BUGs in rare
      conditions.  (A sketch of the notifier check follows this entry.)
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Rob Landley <rob@landley.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9d5ab25
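      A minimal sketch of the check described above, assuming the memory-hotplug
      notifier data of that era (struct memory_notify and its
      status_change_nid_normal field); the allocation itself is elided:

        static int slab_mem_going_online_callback(void *arg)
        {
                struct memory_notify *marg = arg;
                int nid = marg->status_change_nid_normal;

                /*
                 * Only act when a node gains its first normal memory;
                 * onlining highmem-only ranges is ignored.
                 */
                if (nid < 0)
                        return 0;

                /* ... allocate and register a kmem_cache_node for nid ... */
                return 0;
        }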
  3. 11 Dec 2012 (4 commits)
  4. 31 Oct 2012 (2 commits)
  5. 24 Oct 2012 (4 commits)
  6. 19 Oct 2012 (1 commit)
    • slub: remove one code path and reduce lock contention in __slab_free() · 837d678d
      Authored by Joonsoo Kim
      When we try to free an object, there are cases where we need to take the
      node lock.  This is a necessary step to prevent a race.  After taking the
      lock, we try cmpxchg_double_slab().  But there is a possible scenario in
      which cmpxchg_double_slab() fails even though we hold the lock.  The
      following example shows it.

      CPU A               CPU B
      need lock
      ...                 need lock
      ...                 lock!!
      lock..but spin      free success
      spin...             unlock
      lock!!
      free fail

      In this case, CPU A retries while still holding the lock.  For CPU A,
      "release the lock first, and re-take it only if necessary" is the
      preferable approach, for two reasons.

      First, it makes __slab_free()'s logic simpler.  With this patch, the
      'was_frozen = 1' case is always handled without taking a lock, so one
      code path can be removed.

      Second, it may reduce lock contention.  When we retry, the state of the
      slab has already changed, so in almost every case we no longer need the
      lock.  The "release the lock first, and re-take it only if necessary"
      policy helps here.  (A structural sketch of the retry loop follows this
      entry.)
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      837d678d
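      A structural sketch of the retry loop after the change, assuming the
      slub.c helpers of that era (cmpxchg_double_slab(), get_node(),
      set_freepointer()); the per-cpu partial handling and list moves are
      trimmed, so treat this as an outline rather than the exact patch:

        do {
                if (unlikely(n)) {
                        /*
                         * Release the node lock before recomputing; it is
                         * re-taken below only if this attempt still needs it.
                         */
                        spin_unlock_irqrestore(&n->list_lock, flags);
                        n = NULL;
                }
                prior = page->freelist;
                counters = page->counters;
                set_freepointer(s, object, prior);
                new.counters = counters;
                was_frozen = new.frozen;
                new.inuse--;
                if ((!new.inuse || !prior) && !was_frozen) {
                        /* List manipulation will be needed: take the lock. */
                        n = get_node(s, page_to_nid(page));
                        spin_lock_irqsave(&n->list_lock, flags);
                }
        } while (!cmpxchg_double_slab(s, page,
                        prior, counters,
                        object, new.counters,
                        "__slab_free"));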
  7. 03 Oct 2012 (1 commit)
  8. 25 Sep 2012 (1 commit)
  9. 19 Sep 2012 (1 commit)
  10. 18 Sep 2012 (1 commit)
    • slub: consider pfmemalloc_match() in get_partial_node() · 8ba00bb6
      Authored by Joonsoo Kim
      get_partial() currently does not check pfmemalloc_match(), which means
      that pfmemalloc pages can leak to non-pfmemalloc users.  This is a problem
      in the following situation.  Assume that there is a request from a normal
      allocation and there are no objects in the per-cpu cache and no
      node-partial slab.

      In this case, slab_alloc enters the slow path and new_slab_objects() is
      called, which may return a PFMEMALLOC page.  As the current user is not
      allowed to access the PFMEMALLOC page, deactivate_slab() is called
      ([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc
      checks]) and an object from the PFMEMALLOC page is returned.

      Next time, when we get another request from a normal allocation,
      slab_alloc() enters the slow path and calls new_slab_objects().  In
      new_slab_objects(), we call get_partial() and get a partial slab which
      was just deactivated but is a pfmemalloc page.  We extract one object
      from it and re-deactivate.

        "deactivate -> re-get in get_partial -> re-deactivate" occurs repeatedly.

      As a result, access to the PFMEMALLOC page is not properly restricted and
      the frequent deactivation can cause a performance degradation.

      This patch changes get_partial_node() to take pfmemalloc_match() into
      account and prevents the "deactivate -> re-get in get_partial()" scenario.
      Instead, new_slab() is called.  (A sketch of the added check follows this
      entry.)
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ba00bb6
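      A minimal sketch of the check added to get_partial_node(), assuming the
      era's walk over n->partial and the acquire_slab() helper; the bookkeeping
      for the acquired objects is elided:

        list_for_each_entry_safe(page, page2, &n->partial, lru) {
                void *t;

                /*
                 * Skip pfmemalloc slabs for callers that are not entitled to
                 * the reserves; they will get a fresh slab from new_slab().
                 */
                if (!pfmemalloc_match(page, flags))
                        continue;

                t = acquire_slab(s, n, page, object == NULL);
                if (!t)
                        break;

                /* ... account for the acquired objects ... */
        }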
  11. 10 Sep 2012 (1 commit)
    • slub: Zero initial memory segment for kmem_cache and kmem_cache_node · 9df53b15
      Authored by Christoph Lameter
      Tony Luck reported the following problem on IA-64:
      
        Worked fine yesterday on next-20120905, crashes today. First sign of
        trouble was an unaligned access, then a NULL dereference. SL*B related
        bits of my config:
      
        CONFIG_SLUB_DEBUG=y
        # CONFIG_SLAB is not set
        CONFIG_SLUB=y
        CONFIG_SLABINFO=y
        # CONFIG_SLUB_DEBUG_ON is not set
        # CONFIG_SLUB_STATS is not set
      
        And the console log:
      
        PID hash table entries: 4096 (order: 1, 32768 bytes)
        Dentry cache hash table entries: 262144 (order: 7, 2097152 bytes)
        Inode-cache hash table entries: 131072 (order: 6, 1048576 bytes)
        Memory: 2047920k/2086064k available (13992k code, 38144k reserved,
        6012k data, 880k init)
        kernel unaligned access to 0xca2ffc55fb373e95, ip=0xa0000001001be550
        swapper[0]: error during unaligned kernel access
         -1 [1]
        Modules linked in:
      
        Pid: 0, CPU 0, comm:              swapper
        psr : 00001010084a2018 ifs : 800000000000060f ip  :
        [<a0000001001be550>]    Not tainted (3.6.0-rc4-zx1-smp-next-20120906)
        ip is at new_slab+0x90/0x680
        unat: 0000000000000000 pfs : 000000000000060f rsc : 0000000000000003
        rnat: 9666960159966a59 bsps: a0000001001441c0 pr  : 9666960159965a59
        ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
        csd : 0000000000000000 ssd : 0000000000000000
        b0  : a0000001001be500 b6  : a00000010112cb20 b7  : a0000001011660a0
        f6  : 0fff7f0f0f0f0e54f0000 f7  : 0ffe8c5c1000000000000
        f8  : 1000d8000000000000000 f9  : 100068800000000000000
        f10 : 10005f0f0f0f0e54f0000 f11 : 1003e0000000000000078
        r1  : a00000010155eef0 r2  : 0000000000000000 r3  : fffffffffffc1638
        r8  : e0000040600081b8 r9  : ca2ffc55fb373e95 r10 : 0000000000000000
        r11 : e000004040001646 r12 : a000000101287e20 r13 : a000000101280000
        r14 : 0000000000004000 r15 : 0000000000000078 r16 : ca2ffc55fb373e75
        r17 : e000004040040000 r18 : fffffffffffc1646 r19 : e000004040001646
        r20 : fffffffffffc15f8 r21 : 000000000000004d r22 : a00000010132fa68
        r23 : 00000000000000ed r24 : 0000000000000000 r25 : 0000000000000000
        r26 : 0000000000000001 r27 : a0000001012b8500 r28 : a00000010135f4a0
        r29 : 0000000000000000 r30 : 0000000000000000 r31 : 0000000000000001
        Unable to handle kernel NULL pointer dereference (address
        0000000000000018)
        swapper[0]: Oops 11003706212352 [2]
        Modules linked in:
      
        Pid: 0, CPU 0, comm:              swapper
        psr : 0000121008022018 ifs : 800000000000cc18 ip  :
        [<a0000001004dc8f1>]    Not tainted (3.6.0-rc4-zx1-smp-next-20120906)
        ip is at __copy_user+0x891/0x960
        unat: 0000000000000000 pfs : 0000000000000813 rsc : 0000000000000003
        rnat: 0000000000000000 bsps: 0000000000000000 pr  : 9666960159961765
        ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
        csd : 0000000000000000 ssd : 0000000000000000
        b0  : a00000010004b550 b6  : a00000010004b740 b7  : a00000010000c750
        f6  : 000000000000000000000 f7  : 1003e9e3779b97f4a7c16
        f8  : 1003e0a00000010001550 f9  : 100068800000000000000
        f10 : 10005f0f0f0f0e54f0000 f11 : 1003e0000000000000078
        r1  : a00000010155eef0 r2  : a0000001012870b0 r3  : a0000001012870b8
        r8  : 0000000000000298 r9  : 0000000000000013 r10 : 0000000000000000
        r11 : 9666960159961a65 r12 : a000000101287010 r13 : a000000101280000
        r14 : a000000101287068 r15 : a000000101287080 r16 : 0000000000000298
        r17 : 0000000000000010 r18 : 0000000000000018 r19 : a000000101287310
        r20 : 0000000000000290 r21 : 0000000000000000 r22 : 0000000000000000
        r23 : a000000101386f58 r24 : 0000000000000000 r25 : 000000007fffffff
        r26 : a000000101287078 r27 : a0000001013c69b0 r28 : 0000000000000000
        r29 : 0000000000000014 r30 : 0000000000000000 r31 : 0000000000000813
      
      Sedat Dilek and Hugh Dickins reported similar problems as well.
      
      Earlier patches in the common set moved the zeroing of the kmem_cache
      structure into common code. See "Move allocation of kmem_cache into
      common code".
      
      The allocation for the two special structures is still done from
      SLUB-specific code, but no zeroing is done, since the cache creation
      functions used to take care of it.  This now needs to be updated so that
      the structures are zeroed during allocation in kmem_cache_init().
      Otherwise random pointer values may be followed.  (A sketch of the fix
      follows this entry.)
      Reported-by: Tony Luck <tony.luck@intel.com>
      Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      9df53b15
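      One plausible shape of the fix, assuming the boot-time kmem_cache and
      kmem_cache_node structures are carved out of raw pages in
      kmem_cache_init(); the exact call site and the boot_size computation are
      illustrative assumptions:

        void __init kmem_cache_init(void)
        {
                unsigned long boot_size;
                int order;

                /* Room for the two boot caches, cache-line aligned. */
                boot_size = ALIGN(sizeof(struct kmem_cache), cache_line_size());
                order = get_order(2 * boot_size);

                /*
                 * Zero the memory at allocation time; the cache creation path
                 * no longer does it, and stale bits here would later be
                 * followed as pointers.
                 */
                kmem_cache = (void *)__get_free_pages(GFP_NOWAIT | __GFP_ZERO,
                                                      order);
                kmem_cache_node = (void *)kmem_cache + boot_size;

                /* ... continue bootstrapping the two boot caches ... */
        }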
  12. 05 Sep 2012 (14 commits)
  13. 16 Aug 2012 (3 commits)
  14. 01 Aug 2012 (2 commits)
    • mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks · 5091b74a
      Authored by Christoph Lameter
      This patch removes the check for pfmemalloc from the alloc hotpath and
      puts the logic after the election of a new per-cpu slab.  For a pfmemalloc
      page we do not use the fast path but force the use of the slow path, which
      is also used for the debug case.

      This has the side effect of weakening pfmemalloc processing in the
      following way:

      1. A process that is allocating for network swap calls __slab_alloc.
         pfmemalloc_match is true, so the freelist is loaded and c->freelist is
         now pointing to a pfmemalloc page.

      2. A process that is attempting normal allocations calls slab_alloc,
         finds the pfmemalloc page on the freelist and uses it because it did
         not check pfmemalloc_match().

      The patch allows non-pfmemalloc allocations to use pfmemalloc pages, with
      the kmalloc slabs being the most vulnerable caches on the grounds that
      they are most likely to have a mix of pfmemalloc and !pfmemalloc requests.
      A later patch will still protect the system, as processes will get
      throttled if the pfmemalloc reserves get depleted, but performance will
      not degrade as smoothly.  (A sketch of the relocated check follows this
      entry.)
      
      [mgorman@suse.de: Expanded changelog]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5091b74a
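      A minimal sketch of where the check lands after this change, assuming the
      __slab_alloc()/deactivate_slab() flow of that era; helper names come from
      slub.c but their signatures are simplified here:

        static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags,
                                  int node, unsigned long addr,
                                  struct kmem_cache_cpu *c)
        {
                struct page *page;

        new_slab:
                /* ... elect a new per-cpu slab into c->page ... */
                page = c->page;

                /*
                 * The pfmemalloc test now happens here, after electing the
                 * slab, rather than on the fast path.  A reserve-backed slab
                 * handed to an ordinary caller is deactivated and we go round
                 * again, sharing the slow path with the debug case.
                 */
                if (unlikely(!pfmemalloc_match(page, gfpflags))) {
                        deactivate_slab(s, page, c->freelist);
                        c->page = NULL;
                        c->freelist = NULL;
                        goto new_slab;
                }

                /* ... load the freelist and return an object ... */
                return c->freelist;
        }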
    • mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages · 072bb0aa
      Authored by Mel Gorman
      When a user or administrator requires swap for their application, they
      create a swap partition or file, format it with mkswap and activate it
      with swapon.  Swap over the network is considered as an option in diskless
      systems.  The two likely scenarios are blade servers used as part of a
      cluster, where the form factor or maintenance costs do not allow the use
      of disks, and thin clients.

      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap according to the manual at
      https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
      There is also documentation and tutorials on how to set up swap over NBD
      at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
      The nbd-client tool also documents the use of NBD as swap.  Despite this,
      the fact is that a machine using NBD for swap can deadlock within minutes
      if swap is used intensively.  This patch series addresses the problem.
      
      The core issue is that network block devices do not use mempools like
      normal block devices do.  As the host cannot control where they receive
      packets from, they cannot reliably work out in advance how much memory
      they might need.  Some years ago, Peter Zijlstra developed a series of
      patches that supported swap over NFS, which at least one distribution is
      carrying within its kernels.  This patch series borrows very heavily
      from Peter's work to support swapping over NBD as a pre-requisite to
      supporting swap-over-NFS.  The bulk of the complexity is concerned with
      preserving memory that is allocated from the PFMEMALLOC reserves for use
      by the network layer which is needed for both NBD and NFS.
      
      Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
      	preserve access to pages allocated under low memory situations
      	to callers that are freeing memory.
      
      Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
      
      Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
      	reserves without setting PFMEMALLOC.
      
      Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
      	for later use by network packet processing.
      
      Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
      
      Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
      
      Patches 7-12 allows network processing to use PFMEMALLOC reserves when
      	the socket has been marked as being used by the VM to clean pages. If
      	packets are received and stored in pages that were allocated under
      	low-memory situations and are unrelated to the VM, the packets
      	are dropped.
      
      	Patch 11 reintroduces __skb_alloc_page which the networking
	folk may object to but is needed in some cases to propagate
      	pfmemalloc from a newly allocated page to an skb. If there is a
      	strong objection, this patch can be dropped with the impact being
      	that swap-over-network will be slower in some cases but it should
      	not fail.
      
      Patch 13 is a micro-optimisation to avoid a function call in the
      	common case.
      
      Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
      	PFMEMALLOC if necessary.
      
      Patch 15 notes that it is still possible for the PFMEMALLOC reserve
      	to be depleted. To prevent this, direct reclaimers get throttled on
      	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
      	expected that kswapd and the direct reclaimers already running
      	will clean enough pages for the low watermark to be reached and
      	the throttled processes are woken up.
      
      Patch 16 adds a statistic to track how often processes get throttled
      
      Some basic performance testing was run using kernel builds, netperf on
      loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
      sysbench.  Each of them was expected to use the sl*b allocators
      reasonably heavily but there did not appear to be significant performance
      variances.
      
      For testing swap-over-NBD, a machine was booted with 2G of RAM with a
      swapfile backed by NBD.  8*NUM_CPU processes were started that create
      anonymous memory mappings and read them linearly in a loop.  The total
      size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
      memory pressure.
      
      Without the patches and using SLUB, the machine locks up within minutes;
      with them applied, it runs to completion.  With SLAB the story is
      different, as an unpatched kernel runs to completion.  However, the
      patched kernel completed the test 45% faster.
      
      MICRO
                                               3.5.0-rc2   3.5.0-rc2
                                                 vanilla     swapnbd
      Unrecognised test vmscan-anon-mmap-write
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             197.80      173.07
      User+Sys Time Running Test (seconds)        206.96      182.03
      Total Elapsed Time (seconds)               3240.70     1762.09
      
      This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
      
      Allocations of pages below the min watermark run a risk of the machine
      hanging due to a lack of memory.  To prevent this, only callers who have
      PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
      allowed to allocate with ALLOC_NO_WATERMARKS.  Once they are allocated to
      a slab though, nothing prevents other callers consuming free objects
      within those slabs.  This patch limits access to slab pages that were
      allocated from the PFMEMALLOC reserves.
      
      When this patch is applied, pages allocated from below the low watermark
      are returned with page->pfmemalloc set and it is up to the caller to
      determine how the page should be protected.  SLAB restricts access to any
      page with page->pfmemalloc set to callers which are known to be able to
      access the PFMEMALLOC reserve.  If one is not available, an attempt is
      made to allocate a new page rather than use a reserve.  SLUB is a bit more
      relaxed in that it only records if the current per-CPU page was allocated
      from PFMEMALLOC reserve and uses another partial slab if the caller does
      not have the necessary GFP or process flags.  This was found to be
      sufficient in tests to avoid hangs due to SLUB generally maintaining
      smaller lists than SLAB.
      
      In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
      a slab allocation even though free objects are available because they are
      being preserved for callers that are freeing pages.  (A sketch of the
      SLUB side follows this entry.)
      
      [a.p.zijlstra@chello.nl: Original implementation]
      [sebastian@breakpoint.cc: Correct order of page flag clearing]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      072bb0aa
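      A minimal sketch of the SLUB side, assuming the helpers this series
      introduces (SetPageSlabPfmemalloc(), gfp_pfmemalloc_allowed()) and the
      page->pfmemalloc field; the allocation path is trimmed to the two
      relevant touch points:

        /*
         * When a new slab page is allocated, remember whether it came from
         * the emergency reserves.
         */
        static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
        {
                struct page *page = allocate_slab(s, flags, node);

                if (page && page->pfmemalloc)
                        SetPageSlabPfmemalloc(page);

                /* ... initialise the freelist as usual ... */
                return page;
        }

        /*
         * Only callers entitled to the reserves may take objects from a
         * reserve-backed slab; everyone else is given a different slab.
         */
        static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
        {
                if (unlikely(PageSlabPfmemalloc(page)))
                        return gfp_pfmemalloc_allowed(gfpflags);

                return true;
        }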
  15. 11 Jul 2012 (1 commit)
  16. 09 Jul 2012 (1 commit)