1. 10 10月, 2014 19 次提交
    • V
      mm, THP: don't hold mmap_sem in khugepaged when allocating THP · 8b164568
      Vlastimil Babka 提交于
      When allocating huge page for collapsing, khugepaged currently holds
      mmap_sem for reading on the mm where collapsing occurs.  Afterwards the
      read lock is dropped before write lock is taken on the same mmap_sem.
      
      Holding mmap_sem during whole huge page allocation is therefore useless,
      the vma needs to be rechecked after taking the write lock anyway.
      Furthemore, huge page allocation might involve a rather long sync
      compaction, and thus block any mmap_sem writers and i.e.  affect workloads
      that perform frequent m(un)map or mprotect oterations.
      
      This patch simply releases the read lock before allocating a huge page.
      It also deletes an outdated comment that assumed vma must be stable, as it
      was using alloc_hugepage_vma().  This is no longer true since commit
      9f1b868a ("mm: thp: khugepaged: add policy for finding target node").
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b164568
    • V
      mm: page_alloc: determine migratetype only once · 21bb9bd1
      Vlastimil Babka 提交于
      The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype
      from gfp_mask in each retry pass, although the migratetype variable
      already has the value determined and it does not change.  Use the variable
      and perform the check only once.  Also convert #ifdef CONFIG_CMA to
      IS_ENABLED.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21bb9bd1
    • M
      mm: cma: adjust address limit to avoid hitting low/high memory boundary · f7426b98
      Marek Szyprowski 提交于
      Russell King recently noticed that limiting default CMA region only to low
      memory on ARM architecture causes serious memory management issues with
      machines having a lot of memory (which is mainly available as high
      memory).  More information can be found the following thread:
      http://thread.gmane.org/gmane.linux.ports.arm.kernel/348441/
      
      Those two patches removes this limit letting kernel to put default CMA
      region into high memory when this is possible (there is enough high memory
      available and architecture specific DMA limit fits).
      
      This should solve strange OOM issues on systems with lots of RAM (i.e.
      >1GiB) and large (>256M) CMA area.
      
      This patch (of 2):
      
      Automatically allocated regions should not cross low/high memory boundary,
      because such regions cannot be later correctly initialized due to spanning
      across two memory zones.  This patch adds a check for this case and a
      simple code for moving region to low memory if automatically selected
      address might not fit completely into high memory.
      Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Daniel Drake <drake@endlessm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7426b98
    • Z
      memory-hotplug: add sysfs valid_zones attribute · ed2f2400
      Zhang Zhen 提交于
      Currently memory-hotplug has two limits:
      
      1. If the memory block is in ZONE_NORMAL, you can change it to
         ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.
      
      2. If the memory block is in ZONE_MOVABLE, you can change it to
         ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.
      
      With this patch, we can easy to know a memory block can be onlined to
      which zone, and don't need to know the above two limits.
      
      Updated the related Documentation.
      
      [akpm@linux-foundation.org: use conventional comment layout]
      [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
      [akpm@linux-foundation.org: remove unused local zone_prev]
      Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed2f2400
    • V
      mm/mmap.c: whitespace fixes · cc71aba3
      vishnu.ps 提交于
      Signed-off-by: Nvishnu.ps <vishnu.ps@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc71aba3
    • J
      mm/slab: use percpu allocator for cpu cache · bf0dea23
      Joonsoo Kim 提交于
      Because of chicken and egg problem, initialization of SLAB is really
      complicated.  We need to allocate cpu cache through SLAB to make the
      kmem_cache work, but before initialization of kmem_cache, allocation
      through SLAB is impossible.
      
      On the other hand, SLUB does initialization in a more simple way.  It uses
      percpu allocator to allocate cpu cache so there is no chicken and egg
      problem.
      
      So, this patch try to use percpu allocator in SLAB.  This simplifies the
      initialization step in SLAB so that we could maintain SLAB code more
      easily.
      
      In my testing there is no performance difference.
      
      This implementation relies on percpu allocator.  Because percpu allocator
      uses vmalloc address space, vmalloc address space could be exhausted by
      this change on many cpu system with *32 bit* kernel.  This implementation
      can cover 1024 cpus in worst case by following calculation.
      
      Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
      	120 objects per cpu_cache = 140 MB
      Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
      	80 objects per cpu_cache = 46 MB
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf0dea23
    • J
      mm/slab: support slab merge · 12220dea
      Joonsoo Kim 提交于
      Slab merge is good feature to reduce fragmentation.  If new creating slab
      have similar size and property with exsitent slab, this feature reuse it
      rather than creating new one.  As a result, objects are packed into fewer
      slabs so that fragmentation is reduced.
      
      Below is result of my testing.
      
      * After boot, sleep 20; cat /proc/meminfo | grep Slab
      
      <Before>
      Slab: 25136 kB
      
      <After>
      Slab: 24364 kB
      
      We can save 3% memory used by slab.
      
      For supporting this feature in SLAB, we need to implement SLAB specific
      kmem_cache_flag() and __kmem_cache_alias(), because SLUB implements some
      SLUB specific processing related to debug flag and object size change on
      these functions.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12220dea
    • J
      mm/slab_common: commonize slab merge logic · 423c929c
      Joonsoo Kim 提交于
      Slab merge is good feature to reduce fragmentation.  Now, it is only
      applied to SLUB, but, it would be good to apply it to SLAB.  This patch is
      preparation step to apply slab merge to SLAB by commonizing slab merge
      logic.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      423c929c
    • M
      slab: fix for_each_kmem_cache_node() · 9163582c
      Mikulas Patocka 提交于
      Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node().  The
      for loop reads the array "node" before verifying that the index is within
      the range.  This results in kmemcheck warning.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9163582c
    • J
      slub: fall back to node_to_mem_node() node if allocating on memoryless node · a561ce00
      Joonsoo Kim 提交于
      Update the SLUB code to search for partial slabs on the nearest node with
      memory in the presence of memoryless nodes.  Additionally, do not consider
      it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
      memoryless-node specified allocation goes off-node.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a561ce00
    • J
      topology: add support for node_to_mem_node() to determine the fallback node · ad2c8144
      Joonsoo Kim 提交于
      Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
      on ppc LPARs with memoryless nodes, a large amount of memory was consumed
      by slabs and was marked unreclaimable.  He tracked it down to slab
      deactivations in the SLUB core when we allocate remotely, leading to poor
      efficiency always when memoryless nodes are present.
      
      After much discussion, Joonsoo provided a few patches that help
      significantly.  They don't resolve the problem altogether:
      
       - memory hotplug still needs testing, that is when a memoryless node
         becomes memory-ful, we want to dtrt
       - there are other reasons for going off-node than memoryless nodes,
         e.g., fully exhausted local nodes
      
      Neither case is resolved with this series, but I don't think that should
      block their acceptance, as they can be explored/resolved with follow-on
      patches.
      
      The series consists of:
      
      [1/3] topology: add support for node_to_mem_node() to determine the
            fallback node
      
      [2/3] slub: fallback to node_to_mem_node() node if allocating on
            memoryless node
      
            - Joonsoo's patches to cache the nearest node with memory for each
              NUMA node
      
      [3/3] Partial revert of 81c98869 (""kthread: ensure locality of
            task_struct allocations")
      
       - At Tejun's request, keep the knowledge of memoryless node fallback
         to the allocator core.
      
      This patch (of 3):
      
      We need to determine the fallback node in slub allocator if the allocation
      target node is memoryless node.  Without it, the SLUB wrongly select the
      node which has no memory and can't use a partial slab, because of node
      mismatch.  Introduced function, node_to_mem_node(X), will return a node Y
      with memory that has the nearest distance.  If X is memoryless node, it
      will return nearest distance node, but, if X is normal node, it will
      return itself.
      
      We will use this function in following patch to determine the fallback
      node.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2c8144
    • C
      slub: disable tracing and failslab for merged slabs · c9e16131
      Christoph Lameter 提交于
      Tracing of mergeable slabs as well as uses of failslab are confusing since
      the objects of multiple slab caches will be affected.  Moreover this
      creates a situation where a mergeable slab will become unmergeable.
      
      If tracing or failslab testing is desired then it may be best to switch
      merging off for starters.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NWANG Chao <chaowang@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9e16131
    • J
      mm/slab: factor out unlikely part of cache_free_alien() · 25c4f304
      Joonsoo Kim 提交于
      cache_free_alien() is rarely used function when node mismatch.  But, it is
      defined with inline attribute so it is inlined to __cache_free() which is
      core free function of slab allocator.  It uselessly makes
      kmem_cache_free()/kfree() functions large.  What we really need to inline
      is just checking node match so this patch factor out other parts of
      cache_free_alien() to reduce code size of kmem_cache_free()/ kfree().
      
      <Before>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      <After>
      nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
      0000000000001110 00000000000001b5 T kfree
      0000000000000750 0000000000000181 T kmem_cache_free
      
      You can see slightly reduced size of text: 0x228->0x1b5, 0x216->0x181.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25c4f304
    • J
      mm/slab: noinline __ac_put_obj() · d3aec344
      Joonsoo Kim 提交于
      Our intention of __ac_put_obj() is that it doesn't affect anything if
      sk_memalloc_socks() is disabled.  But, because __ac_put_obj() is too
      small, compiler inline it to ac_put_obj() and affect code size of free
      path.  This patch add noinline keyword for __ac_put_obj() not to distrupt
      normal free path at all.
      
      <Before>
      nm -S slab-orig.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e80 00000000000002f5 t cache_alloc_refill
      0000000000001230 0000000000000258 T kfree
      0000000000000690 000000000000024c T kmem_cache_free
      
      <After>
      nm -S slab-patched.o |
      	grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
      
      0000000000001e00 00000000000002e5 t cache_alloc_refill
      00000000000011e0 0000000000000228 T kfree
      0000000000000670 0000000000000216 T kmem_cache_free
      
      cache_alloc_refill: 0x2f5->0x2e5
      kfree: 0x256->0x228
      kmem_cache_free: 0x24c->0x216
      
      code size of each function is reduced slightly.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3aec344
    • J
      mm/slab: move cache_flusharray() out of unlikely.text section · 3d880194
      Joonsoo Kim 提交于
      Now, due to likely keyword, compiled code of cache_flusharray() is on
      unlikely.text section.  Although it is uncommon case compared to free to
      cpu cache case, it is common case than free_block().  But, free_block() is
      on normal text section.  This patch fix this odd situation to remove
      likely keyword.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d880194
    • J
      mm/sl[ao]b: always track caller in kmalloc_(node_)track_caller() · 61f47105
      Joonsoo Kim 提交于
      Now, we track caller if tracing or slab debugging is enabled.  If they are
      disabled, we could save one argument passing overhead by calling
      __kmalloc(_node)().  But, I think that it would be marginal.  Furthermore,
      default slab allocator, SLUB, doesn't use this technique so I think that
      it's okay to change this situation.
      
      After this change, we can turn on/off CONFIG_DEBUG_SLAB without full
      kernel build and remove some complicated '#if' defintion.  It looks more
      benefitial to me.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f47105
    • J
      mm/slab_common: move kmem_cache definition to internal header · 07f361b2
      Joonsoo Kim 提交于
      We don't need to keep kmem_cache definition in include/linux/slab.h if we
      don't need to inline kmem_cache_size().  According to my code inspection,
      this function is only called at lc_create() in lib/lru_cache.c which may
      be called at initialization phase of something, so we don't need to inline
      it.  Therfore, move it to slab_common.c and move kmem_cache definition to
      internal header.
      
      After this change, we can change kmem_cache definition easily without full
      kernel build.  For instance, we can turn on/off CONFIG_SLUB_STATS without
      full kernel build.
      
      [akpm@linux-foundation.org: export kmem_cache_size() to modules]
      [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07f361b2
    • A
      mm/slab_common.c: suppress warning · 3aa24f51
      Andrew Morton 提交于
      False positive:
      
      mm/slab_common.c: In function 'kmem_cache_create':
      mm/slab_common.c:204: warning: 's' may be used uninitialized in this function
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3aa24f51
    • O
      proc/maps: make vm_is_stack() logic namespace-friendly · 58cb6548
      Oleg Nesterov 提交于
      - Rename vm_is_stack() to task_of_stack() and change it to return
        "struct task_struct *" rather than the global (and thus wrong in
        general) pid_t.
      
      - Add the new pid_of_stack() helper which calls task_of_stack() and
        uses the right namespace to report the correct pid_t.
      
        Unfortunately we need to define this helper twice, in task_mmu.c
        and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
        and move at least pid_of_stack/task_of_stack there to avoid the
        code duplication.
      
      - Change show_map_vma() and show_numa_map() to use the new helper.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58cb6548
  2. 03 10月, 2014 4 次提交
    • J
      mm: page_alloc: fix zone allocation fairness on UP · abe5f972
      Johannes Weiner 提交于
      The zone allocation batches can easily underflow due to higher-order
      allocations or spills to remote nodes.  On SMP that's fine, because
      underflows are expected from concurrency and dealt with by returning 0.
      But on UP, zone_page_state will just return a wrapped unsigned long,
      which will get past the <= 0 check and then consider the zone eligible
      until its watermarks are hit.
      
      Commit 3a025760 ("mm: page_alloc: spill to remote nodes before
      waking kswapd") already made the counter-resetting use
      atomic_long_read() to accomodate underflows from remote spills, but it
      didn't go all the way with it.
      
      Make it clear that these batches are expected to go negative regardless
      of concurrency, and use atomic_long_read() everywhere.
      
      Fixes: 81c0a2bb ("mm: page_alloc: fair zone allocator policy")
      Reported-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NLeon Romanovsky <leon@leon.nu>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      abe5f972
    • J
      mm: memcontrol: do not iterate uninitialized memcgs · 2f7dd7a4
      Johannes Weiner 提交于
      The cgroup iterators yield css objects that have not yet gone through
      css_online(), but they are not complete memcgs at this point and so the
      memcg iterators should not return them.  Commit d8ad3055 ("mm/memcg:
      iteration skip memcgs not yet fully initialized") set out to implement
      exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does
      not meet the ordering requirements for memcg, and so the iterator may
      skip over initialized groups, or return partially initialized memcgs.
      
      The cgroup core can not reasonably provide a clear answer on whether the
      object around the css has been fully initialized, as that depends on
      controller-specific locking and lifetime rules.  Thus, introduce a
      memcg-specific flag that is set after the memcg has been initialized in
      css_online(), and read before mem_cgroup_iter() callers access the memcg
      members.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f7dd7a4
    • M
      mm: numa: Do not mark PTEs pte_numa when splitting huge pages · abc40bd2
      Mel Gorman 提交于
      This patch reverts 1ba6e0b5 ("mm: numa: split_huge_page: transfer the
      NUMA type from the pmd to the pte"). If a huge page is being split due
      a protection change and the tail will be in a PROT_NONE vma then NUMA
      hinting PTEs are temporarily created in the protected VMA.
      
       VM_RW|VM_PROTNONE
      |-----------------|
            ^
            split here
      
      In the specific case above, it should get fixed up by change_pte_range()
      but there is a window of opportunity for weirdness to happen. Similarly,
      if a huge page is shrunk and split during a protection update but before
      pmd_numa is cleared then a pte_numa can be left behind.
      
      Instead of adding complexity trying to deal with the case, this patch
      will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
      will not be triggered which is marginal in comparison to the complexity
      in dealing with the corner cases during THP split.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      abc40bd2
    • M
      mm: migrate: Close race between migration completion and mprotect · d3cb8bf6
      Mel Gorman 提交于
      A migration entry is marked as write if pte_write was true at the time the
      entry was created. The VMA protections are not double checked when migration
      entries are being removed as mprotect marks write-migration-entries as
      read. It means that potentially we take a spurious fault to mark PTEs write
      again but it's straight-forward. However, there is a race between write
      migrations being marked read and migrations finishing. This potentially
      allows a PTE to be write that should have been read. Close this race by
      double checking the VMA permissions using maybe_mkwrite when migration
      completes.
      
      [torvalds@linux-foundation.org: use maybe_mkwrite]
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3cb8bf6
  3. 27 9月, 2014 2 次提交
    • M
      fuse: honour max_read and max_write in direct_io mode · 2c80929c
      Miklos Szeredi 提交于
      The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
      bytes a caller asked to pack into fuse request. This value may be lesser
      than capacity of fuse request or iov_iter.  So fuse_get_user_pages() must
      ensure that *nbytesp won't grow.
      
      Now, when helper iov_iter_get_pages() performs all hard work of extracting
      pages from iov_iter, it can be done by passing properly calculated
      "maxsize" to the helper.
      
      The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
      this capability, so pass LONG_MAX as the maxsize argument here.
      
      Fixes: c9c37e2e ("fuse: switch to iov_iter_get_pages()")
      Reported-by: NWerner Baumann <werner.baumann@onlinehome.de>
      Tested-by: NMaxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2c80929c
    • M
      shmem: fix nlink for rename overwrite directory · b928095b
      Miklos Szeredi 提交于
      If overwriting an empty directory with rename, then need to drop the extra
      nlink.
      
      Test prog:
      
      #include <stdio.h>
      #include <fcntl.h>
      #include <err.h>
      #include <sys/stat.h>
      
      int main(void)
      {
      	const char *test_dir1 = "test-dir1";
      	const char *test_dir2 = "test-dir2";
      	int res;
      	int fd;
      	struct stat statbuf;
      
      	res = mkdir(test_dir1, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir1);
      
      	res = mkdir(test_dir2, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir2);
      
      	fd = open(test_dir2, O_RDONLY);
      	if (fd == -1)
      		err(1, "open(\"%s\")", test_dir2);
      
      	res = rename(test_dir1, test_dir2);
      	if (res == -1)
      		err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
      
      	res = fstat(fd, &statbuf);
      	if (res == -1)
      		err(1, "fstat(%i)", fd);
      
      	if (statbuf.st_nlink != 0) {
      		fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
      		return 1;
      	}
      
      	return 0;
      }
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b928095b
  4. 26 9月, 2014 2 次提交
  5. 25 9月, 2014 3 次提交
  6. 24 9月, 2014 2 次提交
    • A
      kvm: Fix page ageing bugs · 57128468
      Andres Lagar-Cavilla 提交于
      1. We were calling clear_flush_young_notify in unmap_one, but we are
      within an mmu notifier invalidate range scope. The spte exists no more
      (due to range_start) and the accessed bit info has already been
      propagated (due to kvm_pfn_set_accessed). Simply call
      clear_flush_young.
      
      2. We clear_flush_young on a primary MMU PMD, but this may be mapped
      as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
      This required expanding the interface of the clear_flush_young mmu
      notifier, so a lot of code has been trivially touched.
      
      3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
      the access bit by blowing the spte. This requires proper synchronizing
      with MMU notifier consumers, like every other removal of spte's does.
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57128468
    • A
      kvm: Faults which trigger IO release the mmap_sem · 234b239b
      Andres Lagar-Cavilla 提交于
      When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
      has been swapped out or is behind a filemap, this will trigger async
      readahead and return immediately. The rationale is that KVM will kick
      back the guest with an "async page fault" and allow for some other
      guest process to take over.
      
      If async PFs are enabled the fault is retried asap from an async
      workqueue. If not, it's retried immediately in the same code path. In
      either case the retry will not relinquish the mmap semaphore and will
      block on the IO. This is a bad thing, as other mmap semaphore users
      now stall as a function of swap or filemap latency.
      
      This patch ensures both the regular and async PF path re-enter the
      fault allowing for the mmap semaphore to be relinquished in the case
      of IO wait.
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      234b239b
  7. 19 9月, 2014 1 次提交
  8. 14 9月, 2014 1 次提交
  9. 11 9月, 2014 2 次提交
  10. 09 9月, 2014 1 次提交
  11. 05 9月, 2014 1 次提交
  12. 30 8月, 2014 2 次提交
    • H
      x86,mm: fix pte_special versus pte_numa · b38af472
      Hugh Dickins 提交于
      Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at
      mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
      where zap_pte_range() checks page->mapping to see if PageAnon(page).
      
      Those addresses fit struct pages for pfns d2001 and d2000, and in each
      dump a register or a stack slot showed d2001730 or d2000730: pte flags
      0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
      a hole between cfffffff and 100000000, which would need special access.
      
      Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on
      the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
      pte no longer passes the pte_special() test, so zap_pte_range() goes on
      to try to access a non-existent struct page.
      
      Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to
      complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).  A
      hint that this was a problem was that c46a7c81 added pte_numa() test
      to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast
      path: This was papering over a pte_special() snag when the zero page was
      encountered during zap.  This patch reverts vm_normal_page() to how it
      was before, relying on pte_special().
      
      It still appears that this patch may be incomplete: aren't there other
      places which need to be handling PROTNONE along with PRESENT?  For
      example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a
      PROT_NONE area, that would make it pte_special().  This is side-stepped
      by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there
      are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be
      interesting.
      
      Fixes: c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: <stable@vger.kernel.org>	[3.16]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b38af472
    • M
      hugetlb_cgroup: use lockdep_assert_held rather than spin_is_locked · 7ea8574e
      Michal Hocko 提交于
      spin_lock may be an empty struct for !SMP configurations and so
      arch_spin_is_locked may return unconditional 0 and trigger the VM_BUG_ON
      even when the lock is held.
      
      Replace spin_is_locked by lockdep_assert_held.  We will not BUG anymore
      but it is questionable whether crashing makes a lot of sense in the
      uncharge path.  Uncharge happens after the last page reference was
      released so nobody should touch the page and the function doesn't update
      any shared state except for res counter which uses synchronization of
      its own.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ea8574e