1. 26 7月, 2011 22 次提交
  2. 24 7月, 2011 1 次提交
    • M
      [S390] reference bit testing for unmapped pages · 50a15981
      Martin Schwidefsky 提交于
      On x86 a page without a mapper is by definition not referenced / old.
      The s390 architecture keeps the reference bit in the storage key and
      the current code will check the storage key for page without a mapper.
      This leads to an interesting effect: the first time an s390 system
      needs to write pages to swap it only finds referenced pages. This
      causes a lot of pages to get added and written to the swap device.
      To avoid this behaviour change page_referenced to query the storage
      key only if there is a mapper of the page.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      50a15981
  3. 22 7月, 2011 1 次提交
  4. 21 7月, 2011 5 次提交
  5. 20 7月, 2011 5 次提交
    • D
      vmscan: add customisable shrinker batch size · e9299f50
      Dave Chinner 提交于
      For shrinkers that have their own cond_resched* calls, having
      shrink_slab break the work down into small batches is not
      paticularly efficient. Add a custom batchsize field to the struct
      shrinker so that shrinkers can use a larger batch size if they
      desire.
      
      A value of zero (uninitialised) means "use the default", so
      behaviour is unchanged by this patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e9299f50
    • D
      vmscan: reduce wind up shrinker->nr when shrinker can't do work · 3567b59a
      Dave Chinner 提交于
      When a shrinker returns -1 to shrink_slab() to indicate it cannot do
      any work given the current memory reclaim requirements, it adds the
      entire total_scan count to shrinker->nr. The idea ehind this is that
      whenteh shrinker is next called and can do work, it will do the work
      of the previously aborted shrinker call as well.
      
      However, if a filesystem is doing lots of allocation with GFP_NOFS
      set, then we get many, many more aborts from the shrinkers than we
      do successful calls. The result is that shrinker->nr winds up to
      it's maximum permissible value (twice the current cache size) and
      then when the next shrinker call that can do work is issued, it
      has enough scan count built up to free the entire cache twice over.
      
      This manifests itself in the cache going from full to empty in a
      matter of seconds, even when only a small part of the cache is
      needed to be emptied to free sufficient memory.
      
      Under metadata intensive workloads on ext4 and XFS, I'm seeing the
      VFS caches increase memory consumption up to 75% of memory (no page
      cache pressure) over a period of 30-60s, and then the shrinker
      empties them down to zero in the space of 2-3s. This cycle repeats
      over and over again, with the shrinker completely trashing the inode
      and dentry caches every minute or so the workload continues.
      
      This behaviour was made obvious by the shrink_slab tracepoints added
      earlier in the series, and made worse by the patch that corrected
      the concurrent accounting of shrinker->nr.
      
      To avoid this problem, stop repeated small increments of the total
      scan value from winding shrinker->nr up to a value that can cause
      the entire cache to be freed. We still need to allow it to wind up,
      so use the delta as the "large scan" threshold check - if the delta
      is more than a quarter of the entire cache size, then it is a large
      scan and allowed to cause lots of windup because we are clearly
      needing to free lots of memory.
      
      If it isn't a large scan then limit the total scan to half the size
      of the cache so that windup never increases to consume the whole
      cache. Reducing the total scan limit further does not allow enough
      wind-up to maintain the current levels of performance, whilst a
      higher threshold does not prevent the windup from freeing the entire
      cache under sustained workloads.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3567b59a
    • D
      vmscan: shrinker->nr updates race and go wrong · acf92b48
      Dave Chinner 提交于
      shrink_slab() allows shrinkers to be called in parallel so the
      struct shrinker can be updated concurrently. It does not provide any
      exclusio for such updates, so we can get the shrinker->nr value
      increasing or decreasing incorrectly.
      
      As a result, when a shrinker repeatedly returns a value of -1 (e.g.
      a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
      sometimes updating with the scan count that wasn't used, sometimes
      losing it altogether. Worse is when a shrinker does work and that
      update is lost due to racy updates, which means the shrinker will do
      the work again!
      
      Fix this by making the total_scan calculations independent of
      shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
      other updates via cmpxchg loops.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      acf92b48
    • D
      vmscan: add shrink_slab tracepoints · 09576073
      Dave Chinner 提交于
      It is impossible to understand what the shrinkers are actually doing
      without instrumenting the code, so add a some tracepoints to allow
      insight to be gained.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      09576073
    • S
      vmscan: fix a livelock in kswapd · 4746efde
      Shaohua Li 提交于
      I'm running a workload which triggers a lot of swap in a machine with 4
      nodes.  After I kill the workload, I found a kswapd livelock.  Sometimes
      kswapd3 or kswapd2 are keeping running and I can't access filesystem,
      but most memory is free.
      
      This looks like a regression since commit 08951e54 ("mm: vmscan:
      correct check for kswapd sleeping in sleeping_prematurely").
      
      Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
      for classzone_idx.  The reason is end_zone in balance_pgdat() is 0 by
      default, if all zones have watermark ok, end_zone will keep 0.
      
      Later sleeping_prematurely() always returns true.  Because this is an
      order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
      present_pages in pgdat_balanced() are 0.  We add a special case here.
      If a zone has no page, we think it's balanced.  This fixes the livelock.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4746efde
  6. 18 7月, 2011 1 次提交
    • H
      slab: fix DEBUG_SLAB build · c225150b
      Hugh Dickins 提交于
      Fix CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y build error and warnings.
      
      Now that ARCH_SLAB_MINALIGN defaults to __alignof__(unsigned long long),
      it is always defined (when slab.h included), but cannot be used in #if:
      mm/slab.c: In function `cache_alloc_debugcheck_after':
      mm/slab.c:3156:5: warning: "__alignof__" is not defined
      mm/slab.c:3156:5: error: missing binary operator before token "("
      make[1]: *** [mm/slab.o] Error 1
      
      So just remove the #if and #endif lines, but then 64-bit build warns:
      mm/slab.c: In function `cache_alloc_debugcheck_after':
      mm/slab.c:3156:6: warning: cast from pointer to integer of different size
      mm/slab.c:3158:10: warning: format `%d' expects type `int', but argument
                                  3 has type `long unsigned int'
      Fix those with casts, whatever the actual type of ARCH_SLAB_MINALIGN.
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      c225150b
  7. 13 7月, 2011 1 次提交
    • T
      x86, numa: Implement pfn -> nid mapping granularity check · 1e01979c
      Tejun Heo 提交于
      SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use
      sections array to map pfn to nid which is limited in granularity.  If
      NUMA nodes are laid out such that the mapping cannot be accurate, boot
      will fail triggering BUG_ON() in mminit_verify_page_links().
      
      On 32bit, it's 512MiB w/ PAE and SPARSEMEM.  This seems to have been
      granular enough until commit 2706a0bf (x86, NUMA: Enable
      CONFIG_AMD_NUMA on 32bit too).  Apparently, there is a machine which
      aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT.  This
      led to the following BUG_ON().
      
       On node 0 totalpages: 2096615
         DMA zone: 32 pages used for memmap
         DMA zone: 0 pages reserved
         DMA zone: 3927 pages, LIFO batch:0
         Normal zone: 1740 pages used for memmap
         Normal zone: 220978 pages, LIFO batch:31
         HighMem zone: 16405 pages used for memmap
         HighMem zone: 1853533 pages, LIFO batch:31
       BUG: Int 6: CR2   (null)
            EDI   (null)  ESI 00000002  EBP 00000002  ESP c1543ecc
            EBX f2400000  EDX 00000006  ECX   (null)  EAX 00000001
            err   (null)  EIP c16209aa   CS 00000060  flg 00010002
       Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
                (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe   (null)
              f7200b80 c16395f0 00200a02 f7200a80   (null) 000375fe 00000002   (null)
       Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0bf #17
       Call Trace:
        [<c136b1e5>] ? early_fault+0x2e/0x2e
        [<c16209aa>] ? mminit_verify_page_links+0x12/0x42
        [<c1620613>] ? memmap_init_zone+0xaf/0x10c
        [<c1620929>] ? free_area_init_node+0x2b9/0x2e3
        [<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
        [<c1601d80>] ? paging_init+0x112/0x118
        [<c15f578d>] ? setup_arch+0x791/0x82f
        [<c15f43d9>] ? start_kernel+0x6a/0x257
      
      This patch implements node_map_pfn_alignment() which determines
      maximum internode alignment and update numa_register_memblks() to
      reject NUMA configuration if alignment exceeds the pfn -> nid mapping
      granularity of the memory model as determined by PAGES_PER_SECTION.
      
      This makes the problematic machine boot w/ flatmem by rejecting the
      NUMA config and provides protection against crazy NUMA configurations.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
      LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com>
      Reported-and-Tested-by: NHans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Conny Seidel <conny.seidel@amd.com>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      1e01979c
  8. 09 7月, 2011 4 次提交