1. 10 October 2014 (24 commits)
    • V
      mm, compaction: reduce zone checking frequency in the migration scanner · 7d49d886
      Vlastimil Babka authored
      The unification of the migrate and free scanner families of function has
      highlighted a difference in how the scanners ensure they only isolate
      pages of the intended zone.  This is important for taking zone lock or lru
      lock of the correct zone.  Due to nodes overlapping, it is however
      possible to encounter a different zone within the range of the zone being
      compacted.
      
      The free scanner, since its inception by commit 748446bb ("mm:
      compaction: memory compaction core"), has been checking the zone of the
      first valid page in a pageblock, and skipping the whole pageblock if the
      zone does not match.
      
      This checking was completely missing from the migration scanner at first,
      and later added by commit dc908600 ("mm: compaction: check for
      overlapping nodes during isolation for migration") in a reaction to a bug
      report.  But the zone comparison in the migration scanner is done once per
      scanned page, which is more defensive and thus more costly than a check
      per pageblock.
      
      This patch unifies the checking done in both scanners to once per
      pageblock, through a new pageblock_pfn_to_page() function, which also
      includes pfn_valid() checks.  It is more defensive than the current free
      scanner checks, as it checks both the first and last page of the
      pageblock, but less defensive than the migration scanner's per-page checks.
      It assumes that node overlapping may result (on some architecture) in a
      boundary between two nodes falling into the middle of a pageblock, but
      that there cannot be a node0 node1 node0 interleaving within a single
      pageblock.
      
      The result is more code being shared and a bit less per-page CPU cost in
      the migration scanner.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d49d886
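
      A standalone user-space sketch of the pageblock_pfn_to_page() idea from the
      commit above.  The pageblock size, pfn_valid() and the zone lookup are
      simplified stand-ins, not the kernel's mm/compaction.c code:

      #include <stdio.h>
      #include <stdbool.h>

      #define PAGEBLOCK_PAGES 512UL          /* stand-in for pageblock_nr_pages */
      #define MAX_PFN         4096UL

      struct page { int zone_id; };
      static struct page pages[MAX_PFN];

      static bool pfn_valid(unsigned long pfn) { return pfn < MAX_PFN; }
      static struct page *pfn_to_page(unsigned long pfn) { return &pages[pfn]; }

      /*
       * Model of pageblock_pfn_to_page(): return the block's first page only if
       * both the first and the last pfn are valid and lie in the zone being
       * compacted; otherwise the caller skips the whole pageblock.
       */
      static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                                unsigned long end_pfn, int zone_id)
      {
              if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn - 1))
                      return NULL;
              if (pfn_to_page(start_pfn)->zone_id != zone_id ||
                  pfn_to_page(end_pfn - 1)->zone_id != zone_id)
                      return NULL;
              return pfn_to_page(start_pfn);
      }

      int main(void)
      {
              unsigned long pfn;

              /* Pretend pfns [1024, 1536) belong to another node's zone. */
              for (pfn = 0; pfn < MAX_PFN; pfn++)
                      pages[pfn].zone_id = (pfn >= 1024 && pfn < 1536) ? 1 : 0;

              for (pfn = 0; pfn < MAX_PFN; pfn += PAGEBLOCK_PAGES)
                      printf("pageblock at pfn %4lu: %s\n", pfn,
                             pageblock_pfn_to_page(pfn, pfn + PAGEBLOCK_PAGES, 0)
                             ? "scan" : "skip");
              return 0;
      }
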
    • V
      mm, compaction: move pageblock checks up from isolate_migratepages_range() · edc2ca61
      Vlastimil Babka authored
      isolate_migratepages_range() is the main function of the compaction
      scanner, called either on a single pageblock by isolate_migratepages()
      during regular compaction, or on an arbitrary range by CMA's
      __alloc_contig_migrate_range().  It currently performs two pageblock-wide
      compaction suitability checks, and because of the CMA callpath, it tracks
      if it crossed a pageblock boundary in order to repeat those checks.
      
      However, closer inspection shows that those checks are always true for CMA:
      - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
      - migrate_async_suitable() check is skipped because CMA uses sync compaction
      
      We can therefore move the compaction-specific checks to
      isolate_migratepages() and simplify isolate_migratepages_range().
      Furthermore, we can mimic the freepage scanner family of functions, which
      has isolate_freepages_block() function called both by compaction from
      isolate_freepages() and by CMA from isolate_freepages_range(), where each
      use case adds its own specific glue code.  This allows further code
      simplification.
      
      Thus, we rename isolate_migratepages_range() to
      isolate_migratepages_block() and limit its functionality to a single
      pageblock (or its subset).  For CMA, a new different
      isolate_migratepages_range() is created as a CMA-specific wrapper for the
      _block() function.  The checks specific to compaction are moved to
      isolate_migratepages().  As part of the unification of these two families
      of functions, we remove the redundant zone parameter where applicable,
      since zone pointer is already passed in cc->zone.
      
      Furthermore, going back to compact_zone() and compact_finished() when a
      pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
      - the checks are meant to skip pageblocks quickly.  The patch therefore
      also introduces a simple loop into isolate_migratepages() so that it does
      not return immediately on failed pageblock checks, but keeps going until
      isolate_migratepages_block() gets called once.  Similarly to
      isolate_freepages(), the function periodically checks if it needs to
      reschedule or abort async compaction.
      
      [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edc2ca61
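
      A rough, self-contained model of the split described in the commit above: a
      per-pageblock isolator plus a CMA-facing range wrapper that walks the range
      one pageblock at a time.  The types and helpers are simplified stand-ins,
      not the actual mm/compaction.c interfaces:

      #include <stdio.h>

      #define PAGEBLOCK_PAGES 512UL

      struct compact_control { unsigned long nr_migratepages; };

      /* Stand-in for isolate_migratepages_block(): isolates one block (or the
       * tail of the range) and returns the pfn it stopped at, 0 on failure. */
      static unsigned long isolate_migratepages_block(struct compact_control *cc,
                                                      unsigned long low_pfn,
                                                      unsigned long end_pfn)
      {
              cc->nr_migratepages += end_pfn - low_pfn;   /* pretend all isolated */
              return end_pfn;
      }

      /* Shape of the new CMA-facing isolate_migratepages_range(). */
      static unsigned long isolate_migratepages_range(struct compact_control *cc,
                                                      unsigned long start_pfn,
                                                      unsigned long end_pfn)
      {
              unsigned long pfn = start_pfn;

              while (pfn < end_pfn) {
                      unsigned long block_end =
                              (pfn + PAGEBLOCK_PAGES) & ~(PAGEBLOCK_PAGES - 1);

                      if (block_end > end_pfn)
                              block_end = end_pfn;
                      pfn = isolate_migratepages_block(cc, pfn, block_end);
                      if (!pfn)
                              return 0;               /* caller treats as failure */
              }
              return pfn;
      }

      int main(void)
      {
              struct compact_control cc = { 0 };
              unsigned long ret = isolate_migratepages_range(&cc, 100, 2000);

              printf("stopped at pfn %lu, isolated %lu pages\n",
                     ret, cc.nr_migratepages);
              return 0;
      }
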
    • V
      mm, compaction: do not recheck suitable_migration_target under lock · f8224aa5
      Vlastimil Babka authored
      isolate_freepages_block() rechecks if the pageblock is suitable to be a
      target for migration after it has taken the zone->lock.  However, the
      check has been optimized to occur only once per pageblock, and
      compact_checklock_irqsave() might be dropping and reacquiring lock, which
      means somebody else might have changed the pageblock's migratetype
      meanwhile.
      
      Furthermore, nothing prevents the migratetype from changing right after
      isolate_freepages_block() has finished isolating.  Given how imperfect
      this is, it's simpler to just rely on the check done in
      isolate_freepages() without lock, and not pretend that the recheck under
      lock guarantees anything.  It is just a heuristic after all.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8224aa5
    • V
      mm, compaction: do not count compact_stall if all zones skipped compaction · 98dd3b48
      Vlastimil Babka authored
      The compact_stall vmstat counter counts the number of allocations stalled
      by direct compaction.  It does not count when all attempted zones had
      deferred compaction, but it does count when all zones skipped compaction.
      The skipping is decided based on a very early check of
      compaction_suitable(), which looks at watermarks and memory fragmentation.
      Therefore it makes sense not to count skipped compactions as stalls.
      Moreover, compact_success or compact_fail is also already not being
      counted when compaction was skipped, so this patch changes the
      compact_stall counting to match the other two.
      
      Additionally, restructure __alloc_pages_direct_compact() code for better
      readability.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98dd3b48
    • V
      mm, compaction: defer each zone individually instead of preferred zone · 53853e2d
      Vlastimil Babka authored
      When direct sync compaction is often unsuccessful, it may become deferred
      for some time to avoid further useless attempts, both sync and async.
      Successful high-order allocations un-defer compaction, while further
      unsuccessful compaction attempts prolong the compaction deferred period.
      
      Currently the checking and setting deferred status is performed only on
      the preferred zone of the allocation that invoked direct compaction.  But
      compaction itself is attempted on all eligible zones in the zonelist, so
      the behavior is suboptimal and may lead both to scenarios where 1)
      compaction is attempted uselessly, or 2) where it's not attempted despite
      good chances of succeeding, as shown on the examples below:
      
      1) A direct compaction with Normal preferred zone failed and set
         deferred compaction for the Normal zone.  Another unrelated direct
         compaction with DMA32 as preferred zone will attempt to compact DMA32
         zone even though the first compaction attempt also included DMA32 zone.
      
         In another scenario, compaction with Normal preferred zone failed to
         compact Normal zone, but succeeded in the DMA32 zone, so it will not
         defer compaction.  In the next attempt, it will try Normal zone which
         will fail again, instead of skipping Normal zone and trying DMA32
         directly.
      
      2) Kswapd will balance DMA32 zone and reset defer status based on
         watermarks looking good.  A direct compaction with preferred Normal
         zone will skip compaction of all zones including DMA32 because Normal
         was still deferred.  The allocation might have succeeded in DMA32 had
         it been compacted, but it won't be.
      
      This patch makes compaction deferring work on individual zone basis
      instead of preferred zone.  For each zone, it checks compaction_deferred()
      to decide if the zone should be skipped.  If watermarks fail after
      compacting the zone, defer_compaction() is called.  The zone where
      watermarks passed can still be deferred when the allocation attempt is
      unsuccessful.  When allocation is successful, compaction_defer_reset() is
      called for the zone containing the allocated page.  This approach should
      approximate calling defer_compaction() only on zones where compaction was
      attempted and did not yield allocated page.  There might be corner cases
      but that is inevitable as long as the decision to stop compacting does not
      guarantee that a page will be allocated.
      
      Due to a new COMPACT_DEFERRED return value, some functions relying
      implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
      more accurate.  The did_some_progress output parameter of
      __alloc_pages_direct_compact() is removed completely, as the caller
      actually does not use it after compaction sets it - it is only considered
      when direct reclaim sets it.
      
      During testing on a two-node machine with a single very small Normal zone
      on node 1, this patch has improved success rates in the stress-highalloc
      mmtests benchmark.  The success rates here were previously made worse by commit
      3a025760 ("mm: page_alloc: spill to remote nodes before waking
      kswapd") as kswapd was no longer resetting often enough the deferred
      compaction for the Normal zone, and DMA32 zones on both nodes were thus
      not considered for compaction.  On a different machine, success rates were
      improved with __GFP_NO_KSWAPD allocations.
      
      [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53853e2d
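
      A toy model of the per-zone deferral flow described in the commit above.
      The zone records, compaction_deferred()/defer_compaction()/
      compaction_defer_reset() stubs and the allocation outcome are all
      illustrative stand-ins, not the kernel's implementation:

      #include <stdio.h>
      #include <stdbool.h>

      struct zone {
              const char *name;
              bool deferred;          /* stands in for the real defer counters */
              bool compaction_helps;  /* pretend outcome of compacting this zone */
      };

      static bool compaction_deferred(struct zone *z)    { return z->deferred; }
      static void defer_compaction(struct zone *z)       { z->deferred = true; }
      static void compaction_defer_reset(struct zone *z) { z->deferred = false; }

      /*
       * Walk the zonelist: skip zones whose compaction is deferred, defer a
       * zone that still fails after being compacted, and un-defer only the
       * zone that actually satisfied the allocation.
       */
      static struct zone *direct_compact(struct zone *zones, int nr)
      {
              for (int i = 0; i < nr; i++) {
                      struct zone *z = &zones[i];

                      if (compaction_deferred(z))
                              continue;
                      if (z->compaction_helps) {
                              compaction_defer_reset(z);
                              return z;          /* allocation succeeded here */
                      }
                      defer_compaction(z);       /* watermarks still failing */
              }
              return NULL;
      }

      int main(void)
      {
              struct zone zones[] = {
                      { "Normal", false, false },   /* small zone, keeps failing */
                      { "DMA32",  false, true  },   /* compaction works here     */
              };
              struct zone *got = direct_compact(zones, 2);

              printf("allocated from %s; Normal now deferred: %d\n",
                     got ? got->name : "nowhere", zones[0].deferred);
              return 0;
      }
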
    • V
      mm, THP: don't hold mmap_sem in khugepaged when allocating THP · 8b164568
      Vlastimil Babka authored
      When allocating huge page for collapsing, khugepaged currently holds
      mmap_sem for reading on the mm where collapsing occurs.  Afterwards the
      read lock is dropped before write lock is taken on the same mmap_sem.
      
      Holding mmap_sem during whole huge page allocation is therefore useless,
      the vma needs to be rechecked after taking the write lock anyway.
      Furthermore, huge page allocation might involve a rather long sync
      compaction, and can thus block mmap_sem writers and, for example, affect
      workloads that perform frequent m(un)map or mprotect operations.
      
      This patch simply releases the read lock before allocating a huge page.
      It also deletes an outdated comment that assumed vma must be stable, as it
      was using alloc_hugepage_vma().  This is no longer true since commit
      9f1b868a ("mm: thp: khugepaged: add policy for finding target node").
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b164568
    • V
      mm: page_alloc: determine migratetype only once · 21bb9bd1
      Vlastimil Babka authored
      The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype
      from gfp_mask in each retry pass, although the migratetype variable
      already has the value determined and it does not change.  Use the variable
      and perform the check only once.  Also convert #ifdef CONFIG_CMA to
      IS_ENABLED.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21bb9bd1
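
      A tiny illustration of the pattern applied by the commit above: derive the
      migratetype from the gfp mask once, outside the retry loop, instead of on
      every pass.  The flag value and helpers are stand-ins, not the page
      allocator's:

      #include <stdio.h>
      #include <stdbool.h>

      #define __GFP_MOVABLE      0x01u
      #define CONFIG_CMA_ENABLED 1            /* stand-in for IS_ENABLED(CONFIG_CMA) */

      enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE };

      static enum migratetype gfpflags_to_migratetype(unsigned int gfp_mask)
      {
              return (gfp_mask & __GFP_MOVABLE) ? MIGRATE_MOVABLE : MIGRATE_UNMOVABLE;
      }

      static bool try_alloc(bool use_cma, int attempt)
      {
              (void)use_cma;
              return attempt == 2;            /* pretend the third pass succeeds */
      }

      int main(void)
      {
              unsigned int gfp_mask = __GFP_MOVABLE;
              /* Determined once, before the retry loop, not on every pass. */
              enum migratetype migratetype = gfpflags_to_migratetype(gfp_mask);
              bool use_cma = CONFIG_CMA_ENABLED && migratetype == MIGRATE_MOVABLE;

              for (int attempt = 0; attempt < 5; attempt++) {
                      if (try_alloc(use_cma, attempt)) {
                              printf("allocated on attempt %d (CMA eligible: %d)\n",
                                     attempt, use_cma);
                              return 0;
                      }
              }
              return 1;
      }
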
    • M
      mm: cma: adjust address limit to avoid hitting low/high memory boundary · f7426b98
      Marek Szyprowski authored
      Russell King recently noticed that limiting default CMA region only to low
      memory on ARM architecture causes serious memory management issues with
      machines having a lot of memory (which is mainly available as high
      memory).  More information can be found in the following thread:
      http://thread.gmane.org/gmane.linux.ports.arm.kernel/348441/
      
      These two patches remove this limit, letting the kernel put the default
      CMA region into high memory when possible (there is enough high memory
      available and the architecture-specific DMA limit permits it).
      
      This should solve strange OOM issues on systems with lots of RAM (i.e.
      >1GiB) and large (>256M) CMA area.
      
      This patch (of 2):
      
      Automatically allocated regions should not cross low/high memory boundary,
      because such regions cannot later be correctly initialized, due to spanning
      two memory zones.  This patch adds a check for this case and simple code
      for moving the region to low memory if the automatically selected address
      might not fit completely into high memory.
      Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Daniel Drake <drake@endlessm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7426b98
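
      A small arithmetic sketch of the clamping described in the commit above.
      The boundary, size and limit values are made up; on a real system they come
      from the memblock layout and the architecture's DMA limit:

      #include <stdio.h>

      #define SZ_1M (1024UL * 1024UL)

      int main(void)
      {
              unsigned long highmem_start = 760UL * SZ_1M;  /* low/high boundary */
              unsigned long size  = 256UL * SZ_1M;          /* requested CMA size */
              unsigned long limit = 900UL * SZ_1M;          /* upper address limit */

              /*
               * If the window [limit - size, limit) would straddle the boundary,
               * pull the limit down so the whole region stays in one zone (low
               * memory), as the patch does for automatic placement.
               */
              if (limit > highmem_start && limit - size < highmem_start)
                      limit = highmem_start;

              printf("place %lu MiB CMA region below %lu MiB\n",
                     size / SZ_1M, limit / SZ_1M);
              return 0;
      }
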
    • Z
      memory-hotplug: add sysfs valid_zones attribute · ed2f2400
      Zhang Zhen authored
      Currently memory-hotplug has two limits:
      
      1. If the memory block is in ZONE_NORMAL, you can change it to
         ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.
      
      2. If the memory block is in ZONE_MOVABLE, you can change it to
         ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.
      
      With this patch, we can easily see which zone a memory block can be
      onlined to, without needing to know the above two limits.
      
      Updated the related Documentation.
      
      [akpm@linux-foundation.org: use conventional comment layout]
      [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
      [akpm@linux-foundation.org: remove unused local zone_prev]
      Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed2f2400
    • V
      mm/mmap.c: whitespace fixes · cc71aba3
      vishnu.ps authored
      Signed-off-by: Nvishnu.ps <vishnu.ps@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc71aba3
    • J
      mm/slab: use percpu allocator for cpu cache · bf0dea23
      Joonsoo Kim authored
      Because of a chicken-and-egg problem, initialization of SLAB is really
      complicated.  We need to allocate the cpu cache through SLAB to make
      kmem_cache work, but before kmem_cache is initialized, allocation through
      SLAB is impossible.
      
      On the other hand, SLUB does its initialization in a simpler way.  It uses
      the percpu allocator to allocate the cpu cache, so there is no
      chicken-and-egg problem.
      
      So, this patch switches SLAB to the percpu allocator as well.  This
      simplifies the initialization step in SLAB so that we can maintain the
      SLAB code more easily.
      
      In my testing there is no performance difference.
      
      This implementation relies on the percpu allocator.  Because the percpu
      allocator uses vmalloc address space, that address space could be
      exhausted by this change on a many-cpu system with a *32 bit* kernel.
      This implementation can cover 1024 cpus in the worst case, per the
      following calculation.
      
      Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
      	120 objects per cpu_cache = 140 MB
      Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
      	80 objects per cpu_cache = 46 MB
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf0dea23
    • J
      mm/slab: support slab merge · 12220dea
      Joonsoo Kim authored
      Slab merge is a good feature for reducing fragmentation.  If a newly
      created slab cache has a similar size and properties to an existing one,
      this feature reuses it rather than creating a new one.  As a result,
      objects are packed into fewer slabs and fragmentation is reduced.
      
      Below is the result of my testing.
      
      * After boot, sleep 20; cat /proc/meminfo | grep Slab
      
      <Before>
      Slab: 25136 kB
      
      <After>
      Slab: 24364 kB
      
      We can save 3% of the memory used by slab.
      
      To support this feature in SLAB, we need to implement SLAB-specific
      kmem_cache_flags() and __kmem_cache_alias(), because SLUB implements
      SLUB-specific processing related to debug flags and object size changes
      in these functions.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12220dea
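
      A simplified user-space model of what slab merging does: reuse an existing
      cache when the aligned object size and flags match, instead of creating a
      new one.  The structure and criteria here are illustrative, not the actual
      find_mergeable()/__kmem_cache_alias() code:

      #include <stdio.h>
      #include <stddef.h>
      #include <string.h>

      struct cache { char name[32]; size_t object_size; size_t align; unsigned flags; };

      #define MAX_CACHES 16
      static struct cache caches[MAX_CACHES];
      static int nr_caches;

      static size_t align_up(size_t n, size_t a) { return (n + a - 1) & ~(a - 1); }

      /* Return an existing cache whose aligned object size and flags match,
       * or create a new one. */
      static struct cache *cache_create(const char *name, size_t size,
                                        size_t align, unsigned flags)
      {
              for (int i = 0; i < nr_caches; i++) {
                      struct cache *c = &caches[i];

                      if (c->flags == flags && align <= c->align &&
                          align_up(size, c->align) == c->object_size)
                              return c;     /* merged: alias of an existing cache */
              }
              if (nr_caches == MAX_CACHES)
                      return NULL;
              caches[nr_caches] = (struct cache){ .object_size = align_up(size, align),
                                                  .align = align, .flags = flags };
              strncpy(caches[nr_caches].name, name, sizeof(caches[0].name) - 1);
              return &caches[nr_caches++];
      }

      int main(void)
      {
              struct cache *a = cache_create("foo", 96, 8, 0);
              struct cache *b = cache_create("bar", 90, 8, 0);  /* rounds up to 96 */

              printf("bar %s foo\n", a == b ? "merged with" : "kept separate from");
              return 0;
      }
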
    • J
      mm/slab_common: commonize slab merge logic · 423c929c
      Joonsoo Kim authored
      Slab merge is a good feature for reducing fragmentation.  Currently it is
      only applied to SLUB, but it would be good to apply it to SLAB as well.
      This patch is a preparation step that commonizes the slab merge logic.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      423c929c
    • M
      slab: fix for_each_kmem_cache_node() · 9163582c
      Mikulas Patocka authored
      Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node().  The
      for loop reads the array "node" before verifying that the index is within
      the range.  This results in a kmemcheck warning.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9163582c
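
      An analogous user-space iterator showing the bug class fixed by the commit
      above: the buggy macro reads the array element before the bounds test, the
      fixed one tests the index first.  This is an illustration only, not the
      kernel's for_each_kmem_cache_node() definition:

      #include <stdio.h>

      #define NR_NODES 4
      static int node_data[NR_NODES] = { 10, 0, 30, 0 };

      /*
       * Buggy shape: the element is read before the bounds test, so the final
       * iteration reads node_data[NR_NODES], one past the end of the array.
       */
      #define for_each_node_buggy(i, v) \
              for ((i) = 0, (v) = node_data[(i)]; (i) < NR_NODES; \
                   (i)++, (v) = node_data[(i)])

      /* Fixed shape: test the index first, only then read the element. */
      #define for_each_node_fixed(i, v) \
              for ((i) = 0; (i) < NR_NODES && ((v) = node_data[(i)], 1); (i)++)

      int main(void)
      {
              int i, v;

              for_each_node_fixed(i, v)
                      printf("node %d -> %d\n", i, v);
              return 0;
      }
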
    • J
      slub: fall back to node_to_mem_node() node if allocating on memoryless node · a561ce00
      Joonsoo Kim authored
      Update the SLUB code to search for partial slabs on the nearest node with
      memory in the presence of memoryless nodes.  Additionally, do not consider
      it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
      memoryless-node specified allocation goes off-node.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a561ce00
    • J
      topology: add support for node_to_mem_node() to determine the fallback node · ad2c8144
      Joonsoo Kim authored
      Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
      on ppc LPARs with memoryless nodes, a large amount of memory was consumed
      by slabs and was marked unreclaimable.  He tracked it down to slab
      deactivations in the SLUB core when we allocate remotely, leading to
      consistently poor efficiency whenever memoryless nodes are present.
      
      After much discussion, Joonsoo provided a few patches that help
      significantly.  They don't resolve the problem altogether:
      
       - memory hotplug still needs testing, that is when a memoryless node
         becomes memory-ful, we want to dtrt
       - there are other reasons for going off-node than memoryless nodes,
         e.g., fully exhausted local nodes
      
      Neither case is resolved with this series, but I don't think that should
      block their acceptance, as they can be explored/resolved with follow-on
      patches.
      
      The series consists of:
      
      [1/3] topology: add support for node_to_mem_node() to determine the
            fallback node
      
      [2/3] slub: fallback to node_to_mem_node() node if allocating on
            memoryless node
      
            - Joonsoo's patches to cache the nearest node with memory for each
              NUMA node
      
      [3/3] Partial revert of 81c98869 ("kthread: ensure locality of
            task_struct allocations")
      
       - At Tejun's request, keep the knowledge of memoryless node fallback
         to the allocator core.
      
      This patch (of 3):
      
      We need to determine the fallback node in the slub allocator if the
      allocation target node is a memoryless node.  Without it, SLUB wrongly
      selects a node which has no memory and can't use a partial slab, because
      of the node mismatch.  The newly introduced function, node_to_mem_node(X),
      returns a node Y with memory that is nearest to X.  If X is a memoryless
      node, it returns the nearest node with memory; if X is a normal node, it
      returns X itself.
      
      We will use this function in the following patch to determine the fallback
      node.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2c8144
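
      A standalone model of the node_to_mem_node() idea: precompute, for every
      node, the nearest node that actually has memory, with memoryful nodes
      mapping to themselves.  The topology, distance table and helper names are
      made up for illustration:

      #include <stdio.h>
      #include <stdbool.h>

      #define NR_NODES 3

      /* Made-up topology: node 1 is memoryless. */
      static const int node_distance[NR_NODES][NR_NODES] = {
              { 10, 20, 40 },
              { 20, 10, 20 },
              { 40, 20, 10 },
      };
      static const bool node_has_memory[NR_NODES] = { true, false, true };

      static int _node_numa_mem[NR_NODES];

      /* Cache the nearest node with memory for each node. */
      static void init_numa_mem(void)
      {
              for (int n = 0; n < NR_NODES; n++) {
                      int best = n, best_dist = -1;

                      if (!node_has_memory[n]) {
                              for (int m = 0; m < NR_NODES; m++) {
                                      if (!node_has_memory[m])
                                              continue;
                                      if (best_dist < 0 ||
                                          node_distance[n][m] < best_dist) {
                                              best_dist = node_distance[n][m];
                                              best = m;
                                      }
                              }
                      }
                      _node_numa_mem[n] = best;
              }
      }

      static int node_to_mem_node(int node) { return _node_numa_mem[node]; }

      int main(void)
      {
              init_numa_mem();
              for (int n = 0; n < NR_NODES; n++)
                      printf("node %d -> mem node %d\n", n, node_to_mem_node(n));
              return 0;
      }
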
    • C
      slub: disable tracing and failslab for merged slabs · c9e16131
      Christoph Lameter authored
      Tracing of mergeable slabs as well as uses of failslab are confusing since
      the objects of multiple slab caches will be affected.  Moreover this
      creates a situation where a mergeable slab will become unmergeable.
      
      If tracing or failslab testing is desired then it may be best to switch
      merging off for starters.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NWANG Chao <chaowang@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9e16131
    • J
      mm/slab: factor out unlikely part of cache_free_alien() · 25c4f304
      Joonsoo Kim authored
      cache_free_alien() is a rarely used function, called only on a node
      mismatch.  But it is defined with the inline attribute, so it is inlined
      into __cache_free(), which is the core free function of the slab
      allocator.  It uselessly makes the kmem_cache_free()/kfree() functions
      large.  What we really need to inline is just the node-match check, so
      this patch factors out the other parts of cache_free_alien() to reduce
      the code size of kmem_cache_free()/kfree().
      
      <Before>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      <After>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      0000000000001110 00000000000001b5 T kfree
      0000000000000750 0000000000000181 T kmem_cache_free
      
      You can see slightly reduced size of text: 0x228->0x1b5, 0x216->0x181.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25c4f304
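
      A toy example of the shape of this change: keep only the cheap node-match
      test inline in the hot free path and push the rare remote-free work into a
      noinline helper.  The function names are illustrative, not the slab
      allocator's:

      #include <stdio.h>

      static int this_node = 0;
      static long local_frees, remote_frees;

      /* Rare path: kept out of line so it does not bloat every free site. */
      static __attribute__((noinline)) void free_remote_obj(void *obj, int node)
      {
              (void)obj; (void)node;
              remote_frees++;
      }

      /* Hot path: only the cheap node-match check stays inline. */
      static inline void cache_free(void *obj, int node)
      {
              if (node == this_node) {        /* common case */
                      local_frees++;
                      return;
              }
              free_remote_obj(obj, node);     /* rare, out-of-line */
      }

      int main(void)
      {
              int dummy;

              for (int i = 0; i < 1000; i++)
                      cache_free(&dummy, i % 100 ? 0 : 1);
              printf("local %ld remote %ld\n", local_frees, remote_frees);
              return 0;
      }
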
    • J
      mm/slab: noinline __ac_put_obj() · d3aec344
      Joonsoo Kim authored
      Our intention for __ac_put_obj() is that it doesn't affect anything if
      sk_memalloc_socks() is disabled.  But because __ac_put_obj() is so small,
      the compiler inlines it into ac_put_obj(), which affects the code size of
      the free path.  This patch adds the noinline keyword to __ac_put_obj() so
      that it does not disrupt the normal free path at all.
      
      <Before>
      nm -S slab-orig.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e80 00000000000002f5 t cache_alloc_refill
      0000000000001230 0000000000000258 T kfree
      0000000000000690 000000000000024c T kmem_cache_free
      
      <After>
      nm -S slab-patched.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e00 00000000000002e5 t cache_alloc_refill
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      cache_alloc_refill: 0x2f5->0x2e5
      kfree: 0x258->0x228
      kmem_cache_free: 0x24c->0x216
      
      The code size of each function is slightly reduced.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3aec344
    • J
      mm/slab: move cache_flusharray() out of unlikely.text section · 3d880194
      Joonsoo Kim authored
      Currently, due to the likely keyword, the compiled code of
      cache_flusharray() ends up in the unlikely.text section.  Although it is
      an uncommon case compared to freeing to the cpu cache, it is more common
      than free_block(), yet free_block() is in the normal text section.  This
      patch fixes this odd situation by removing the likely keyword.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d880194
    • J
      mm/sl[ao]b: always track caller in kmalloc_(node_)track_caller() · 61f47105
      Joonsoo Kim authored
      Currently, we track the caller only if tracing or slab debugging is
      enabled.  If they are disabled, we could save the overhead of passing one
      argument by calling __kmalloc(_node)(), but I think that saving would be
      marginal.  Furthermore, the default slab allocator, SLUB, doesn't use this
      technique, so I think it's okay to change this behaviour.
      
      After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full
      kernel rebuild and remove some complicated '#if' definitions.  That looks
      more beneficial to me.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f47105
    • J
      mm/slab_common: move kmem_cache definition to internal header · 07f361b2
      Joonsoo Kim authored
      We don't need to keep the kmem_cache definition in include/linux/slab.h
      if we don't need to inline kmem_cache_size().  According to my code
      inspection, this function is only called from lc_create() in
      lib/lru_cache.c, which may be called during some initialization phase, so
      we don't need to inline it.  Therefore, move it to slab_common.c and move
      the kmem_cache definition to an internal header.
      
      After this change, we can modify the kmem_cache definition easily without
      a full kernel rebuild.  For instance, we can turn CONFIG_SLUB_STATS on/off
      without a full rebuild.
      
      [akpm@linux-foundation.org: export kmem_cache_size() to modules]
      [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07f361b2
    • A
      mm/slab_common.c: suppress warning · 3aa24f51
      Andrew Morton authored
      False positive:
      
      mm/slab_common.c: In function 'kmem_cache_create':
      mm/slab_common.c:204: warning: 's' may be used uninitialized in this function
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3aa24f51
    • O
      proc/maps: make vm_is_stack() logic namespace-friendly · 58cb6548
      Oleg Nesterov authored
      - Rename vm_is_stack() to task_of_stack() and change it to return
        "struct task_struct *" rather than the global (and thus wrong in
        general) pid_t.
      
      - Add the new pid_of_stack() helper which calls task_of_stack() and
        uses the right namespace to report the correct pid_t.
      
        Unfortunately we need to define this helper twice, in task_mmu.c
        and in task_nommu.c.  Perhaps it makes sense to add fs/proc/util.c
        and move at least pid_of_stack/task_of_stack there to avoid the
        code duplication.
      
      - Change show_map_vma() and show_numa_map() to use the new helper.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58cb6548
  2. 03 October 2014 (4 commits)
    • J
      mm: page_alloc: fix zone allocation fairness on UP · abe5f972
      Johannes Weiner authored
      The zone allocation batches can easily underflow due to higher-order
      allocations or spills to remote nodes.  On SMP that's fine, because
      underflows are expected from concurrency and dealt with by returning 0.
      But on UP, zone_page_state will just return a wrapped unsigned long,
      which will get past the <= 0 check and then consider the zone eligible
      until its watermarks are hit.
      
      Commit 3a025760 ("mm: page_alloc: spill to remote nodes before
      waking kswapd") already made the counter-resetting use
      atomic_long_read() to accommodate underflows from remote spills, but it
      didn't go all the way with it.
      
      Make it clear that these batches are expected to go negative regardless
      of concurrency, and use atomic_long_read() everywhere.
      
      Fixes: 81c0a2bb ("mm: page_alloc: fair zone allocator policy")
      Reported-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NLeon Romanovsky <leon@leon.nu>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      abe5f972
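
      A user-space demonstration of the underflow described above: once an
      unsigned per-zone batch counter wraps, a "<= 0" test can never fire, while
      reading the same bits as a signed long (what atomic_long_read() hands the
      caller) keeps the test meaningful.  Variable names are illustrative, not
      the kernel's vmstat code:

      #include <stdio.h>

      int main(void)
      {
              /* The zone's batch is 8 pages, but a higher-order allocation
               * consumes 32, driving the counter negative. */
              unsigned long batch = 8;
              batch -= 32;

              /* UP-style read of the raw unsigned counter: the wrapped value
               * is huge, so the "exhausted" test never fires. */
              if (batch <= 0)
                      printf("unsigned read: batch exhausted\n");
              else
                      printf("unsigned read: batch looks like %lu, zone stays eligible\n",
                             batch);

              /* Signed read behaves as intended. */
              if ((long)batch <= 0)
                      printf("signed read: batch exhausted (%ld), reset fairness batches\n",
                             (long)batch);
              return 0;
      }
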
    • J
      mm: memcontrol: do not iterate uninitialized memcgs · 2f7dd7a4
      Johannes Weiner authored
      The cgroup iterators yield css objects that have not yet gone through
      css_online(), but they are not complete memcgs at this point and so the
      memcg iterators should not return them.  Commit d8ad3055 ("mm/memcg:
      iteration skip memcgs not yet fully initialized") set out to implement
      exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does
      not meet the ordering requirements for memcg, and so the iterator may
      skip over initialized groups, or return partially initialized memcgs.
      
      The cgroup core can not reasonably provide a clear answer on whether the
      object around the css has been fully initialized, as that depends on
      controller-specific locking and lifetime rules.  Thus, introduce a
      memcg-specific flag that is set after the memcg has been initialized in
      css_online(), and read before mem_cgroup_iter() callers access the memcg
      members.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f7dd7a4
    • M
      mm: numa: Do not mark PTEs pte_numa when splitting huge pages · abc40bd2
      Mel Gorman authored
      This patch reverts 1ba6e0b5 ("mm: numa: split_huge_page: transfer the
      NUMA type from the pmd to the pte"). If a huge page is being split due to
      a protection change and the tail will be in a PROT_NONE vma then NUMA
      hinting PTEs are temporarily created in the protected VMA.
      
       VM_RW|VM_PROTNONE
      |-----------------|
            ^
            split here
      
      In the specific case above, it should get fixed up by change_pte_range()
      but there is a window of opportunity for weirdness to happen. Similarly,
      if a huge page is shrunk and split during a protection update but before
      pmd_numa is cleared then a pte_numa can be left behind.
      
      Instead of adding complexity trying to deal with the case, this patch
      will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
      will not be triggered which is marginal in comparison to the complexity
      in dealing with the corner cases during THP split.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      abc40bd2
    • M
      mm: migrate: Close race between migration completion and mprotect · d3cb8bf6
      Mel Gorman authored
      A migration entry is marked as write if pte_write was true at the time the
      entry was created. The VMA protections are not double checked when migration
      entries are being removed as mprotect marks write-migration-entries as
      read. It means that potentially we take a spurious fault to mark PTEs write
      again but it's straight-forward. However, there is a race between write
      migrations being marked read and migrations finishing. This potentially
      allows a PTE to be writable that should have been read-only. Close this race by
      double checking the VMA permissions using maybe_mkwrite when migration
      completes.
      
      [torvalds@linux-foundation.org: use maybe_mkwrite]
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3cb8bf6
  3. 27 September 2014 (2 commits)
    • M
      fuse: honour max_read and max_write in direct_io mode · 2c80929c
      Miklos Szeredi authored
      The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
      bytes a caller asked to pack into the fuse request.  This value may be
      less than the capacity of the fuse request or the iov_iter, so
      fuse_get_user_pages() must
      ensure that *nbytesp won't grow.
      
      Now that the helper iov_iter_get_pages() performs all the hard work of
      extracting pages from the iov_iter, this can be done by passing a properly
      calculated "maxsize" to the helper.
      
      The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
      this capability, so pass LONG_MAX as the maxsize argument here.
      
      Fixes: c9c37e2e ("fuse: switch to iov_iter_get_pages()")
      Reported-by: NWerner Baumann <werner.baumann@onlinehome.de>
      Tested-by: NMaxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2c80929c
    • M
      shmem: fix nlink for rename overwrite directory · b928095b
      Miklos Szeredi authored
      If overwriting an empty directory with rename, then we need to drop the
      extra nlink.
      
      Test prog:
      
      #include <stdio.h>
      #include <fcntl.h>
      #include <err.h>
      #include <sys/stat.h>
      
      int main(void)
      {
      	const char *test_dir1 = "test-dir1";
      	const char *test_dir2 = "test-dir2";
      	int res;
      	int fd;
      	struct stat statbuf;
      
      	res = mkdir(test_dir1, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir1);
      
      	res = mkdir(test_dir2, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir2);
      
      	fd = open(test_dir2, O_RDONLY);
      	if (fd == -1)
      		err(1, "open(\"%s\")", test_dir2);
      
      	res = rename(test_dir1, test_dir2);
      	if (res == -1)
      		err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
      
      	res = fstat(fd, &statbuf);
      	if (res == -1)
      		err(1, "fstat(%i)", fd);
      
      	if (statbuf.st_nlink != 0) {
      		fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
      		return 1;
      	}
      
      	return 0;
      }
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b928095b
  4. 26 September 2014 (2 commits)
  5. 25 September 2014 (3 commits)
  6. 24 September 2014 (2 commits)
    • A
      kvm: Fix page ageing bugs · 57128468
      Andres Lagar-Cavilla authored
      1. We were calling clear_flush_young_notify in unmap_one, but we are
      within an mmu notifier invalidate range scope. The spte exists no more
      (due to range_start) and the accessed bit info has already been
      propagated (due to kvm_pfn_set_accessed). Simply call
      clear_flush_young.
      
      2. We clear_flush_young on a primary MMU PMD, but this may be mapped
      as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
      This required expanding the interface of the clear_flush_young mmu
      notifier, so a lot of code has been trivially touched.
      
      3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
      the access bit by blowing the spte. This requires proper synchronizing
      with MMU notifier consumers, like every other removal of spte's does.
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57128468
    • A
      kvm: Faults which trigger IO release the mmap_sem · 234b239b
      Andres Lagar-Cavilla authored
      When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
      has been swapped out or is behind a filemap, this will trigger async
      readahead and return immediately. The rationale is that KVM will kick
      back the guest with an "async page fault" and allow for some other
      guest process to take over.
      
      If async PFs are enabled the fault is retried asap from an async
      workqueue. If not, it's retried immediately in the same code path. In
      either case the retry will not relinquish the mmap semaphore and will
      block on the IO. This is a bad thing, as other mmap semaphore users
      now stall as a function of swap or filemap latency.
      
      This patch ensures both the regular and async PF path re-enter the
      fault allowing for the mmap semaphore to be relinquished in the case
      of IO wait.
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      234b239b
  7. 19 September 2014 (1 commit)
  8. 14 September 2014 (1 commit)
  9. 11 September 2014 (1 commit)