1. 08 4月, 2014 36 次提交
    • K
      mm: use 'const char *' insted of 'char *' for reason in dump_page() · d230dec1
      Kirill A. Shutemov 提交于
      I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
      warning:
      
        warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
      
      Let's convert 'reason' to 'const char *' in dump_page() and friends: we
      shouldn't modify it anyway.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d230dec1
    • G
      mm/vmalloc.c: enhance vm_map_ram() comment · 36437638
      Gioh Kim 提交于
      vm_map_ram() has a fragmentation problem when it cannot purge a
      chunk(ie, 4M address space) if there is a pinning object in that
      addresss space.  So it could consume all VMALLOC address space easily.
      
      We can fix the fragmentation problem by using vmap instead of
      vm_map_ram() but vmap() is known to be slow compared to vm_map_ram().
      Minchan said vm_map_ram is 5 times faster than vmap in his tests.  So I
      thought we should fix fragment problem of vm_map_ram because our
      proprietary GPU driver has used it heavily.
      
      On second thought, it's not an easy because we should reuse freed space
      for solving the problem and it could make more IPI and bitmap operation
      for searching hole.  It could mitigate API's goal which is very fast
      mapping.  And even fragmentation problem wouldn't show in 64 bit
      machine.
      
      Another option is that the user should separate long-life and short-life
      object and use vmap for long-life but vm_map_ram for short-life.  If we
      inform the user about the characteristic of vm_map_ram the user can
      choose one according to the page lifetime.
      
      Let's add some notice messages to user.
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: NGioh Kim <gioh.kim@lge.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36437638
    • C
    • M
      mempool: add unlikely and likely hints · eb9a3c62
      Mikulas Patocka 提交于
      Add unlikely and likely hints to the function mempool_free.  It lays out
      the code in such a way that the common path is executed straighforward and
      saves a cache line.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb9a3c62
    • D
      mm, compaction: determine isolation mode only once · da1c67a7
      David Rientjes 提交于
      The conditions that control the isolation mode in
      isolate_migratepages_range() do not change during the iteration, so
      extract them out and only define the value once.
      
      This actually does have an effect, gcc doesn't optimize it itself because
      of cc->sync.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da1c67a7
    • D
      res_counter: remove interface for locked charging and uncharging · 539a13b4
      David Rientjes 提交于
      The res_counter_{charge,uncharge}_locked() variants are not used in the
      kernel outside of the resource counter code itself, so remove the
      interface.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      539a13b4
    • D
      mm, mempolicy: remove per-process flag · f0432d15
      David Rientjes 提交于
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
      
      Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
      64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource so we should free them up whenever
      possible and make them available.  We'll be using it shortly for memcg oom
      reserves.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0432d15
    • D
      mm, mempolicy: rename slab_node for clarity · 2a389610
      David Rientjes 提交于
      slab_node() is actually a mempolicy function, so rename it to
      mempolicy_slab_node() to make it clearer that it used for processes with
      mempolicies.
      
      At the same time, cleanup its code by saving numa_mem_id() in a local
      variable (since we require a node with memory, not just any node) and
      remove an obsolete comment that assumes the mempolicy is actually passed
      into the function.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a389610
    • D
      fork: collapse copy_flags into copy_process · 514ddb44
      David Rientjes 提交于
      copy_flags() does not use the clone_flags formal and can be collapsed
      into copy_process() for cleaner code.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      514ddb44
    • G
      mm: use macros from compiler.h instead of __attribute__((...)) · 3b32123d
      Gideon Israel Dsouza 提交于
      To increase compiler portability there is <linux/compiler.h> which
      provides convenience macros for various gcc constructs.  Eg: __weak for
      __attribute__((weak)).  I've replaced all instances of gcc attributes with
      the right macro in the memory management (/mm) subsystem.
      
      [akpm@linux-foundation.org: while-we're-there consistency tweaks]
      Signed-off-by: NGideon Israel Dsouza <gidisrael@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b32123d
    • D
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso 提交于
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
    • N
      mm: implement ->map_pages for shmem/tmpfs · d7c17551
      Ning Qu 提交于
      In shmem/tmpfs, we also use the generic filemap_map_pages, seems the
      additional checking is not worth a separate version of map_pages for it.
      Signed-off-by: NNing Qu <quning@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7c17551
    • K
      mm: add debugfs tunable for fault_around_order · 1592eef0
      Kirill A. Shutemov 提交于
      Let's allow people to tweak faultaround at runtime.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1592eef0
    • K
      mm: cleanup size checks in filemap_fault() and filemap_map_pages() · 99e3e53f
      Kirill A. Shutemov 提交于
      Minor cleanups:
       - 'size' variable is now in bytes, not pages;
       - use round_up(): it should be easier to read.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99e3e53f
    • K
      mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov 提交于
      filemap_map_pages() is generic implementation of ->map_pages() for
      filesystems who uses page cache.
      
      It should be safe to use filemap_map_pages() for ->map_pages() if
      filesystem use filemap_fault() for ->fault().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1820361
    • K
      mm: introduce vm_ops->map_pages() · 8c6e50b0
      Kirill A. Shutemov 提交于
      Here's new version of faultaround patchset.  It took a while to tune it
      and collect performance data.
      
      First patch adds new callback ->map_pages to vm_operations_struct.
      
      ->map_pages() is called when VM asks to map easy accessible pages.
      Filesystem should find and map pages associated with offsets from
      "pgoff" till "max_pgoff".  ->map_pages() is called with page table
      locked and must not block.  If it's not possible to reach a page without
      blocking, filesystem should skip it.  Filesystem should use do_set_pte()
      to setup page table entry.  Pointer to entry associated with offset
      "pgoff" is passed in "pte" field in vm_fault structure.  Pointers to
      entries for other offsets should be calculated relative to "pte".
      
      Currently VM use ->map_pages only on read page fault path.  We try to
      map FAULT_AROUND_PAGES a time.  FAULT_AROUND_PAGES is 16 for now.
      Performance data for different FAULT_AROUND_ORDER is below.
      
      TODO:
       - implement ->map_pages() for shmem/tmpfs;
       - modify get_user_pages() to be able to use ->map_pages() and implement
         mmap(MAP_POPULATE|MAP_NONBLOCK) on top.
      
      =========================================================================
      Tested on 4-socket machine (120 threads) with 128GiB of RAM.
      
      Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
      somewhere between 3 and 5. Let's say 4 :)
      
      Linux build (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		283,301,572	247,151,987	212,215,789	204,772,882	199,568,944	194,703,779	193,381,485
      	time, seconds		151.227629483	153.920996480	151.356125472	150.863792049	150.879207877	151.150764954	151.450962358
      Linux rebuild (make -j60)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		5,396,854	4,148,444	2,855,286	2,577,282	2,361,957	2,169,573	2,112,643
      	time, seconds		27.404543757	27.559725591	27.030057426	26.855045126	26.678618635	26.974523490	26.761320095
      Git test suite (make -j60 test)
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
      	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
      	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413
      
      Two synthetic tests: access every word in file in sequential/random order.
      It doesn't improve much after FAULT_AROUND_ORDER == 4.
      
      Sequential access 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		4,195,437	2,098,275	525,068		262,251		131,170		32,856		8,282
      	time, seconds		7.250461742	6.461711074	5.493859139	5.488488147	5.707213983	5.898510832	5.109232856
       8 threads
      	minor-faults		33,557,540	16,892,728	4,515,848	2,366,999	1,423,382	442,732		142,339
      	time, seconds		16.649304881	9.312555263	6.612490639	6.394316732	6.669827501	6.75078944	6.371900528
       32 threads
      	minor-faults		134,228,222	67,526,810	17,725,386	9,716,537	4,763,731	1,668,921	537,200
      	time, seconds		49.164430543	29.712060103	12.938649729	10.175151004	11.840094583	9.594081325	9.928461797
       60 threads
      	minor-faults		251,687,988	126,146,952	32,919,406	18,208,804	10,458,947	2,733,907	928,217
      	time, seconds		86.260656897	49.626551828	22.335007632	17.608243696	16.523119035	16.339489186	16.326390902
       120 threads
      	minor-faults		503,352,863	252,939,677	67,039,168	35,191,827	19,170,091	4,688,357	1,471,862
      	time, seconds		124.589206333	79.757867787	39.508707872	32.167281632	29.972989292	28.729834575	28.042251622
      Random access 1GiB file
       1 thread
      	minor-faults		262,636		132,743		34,369		17,299		8,527		3,451		1,222
      	time, seconds		15.351890914	16.613802482	16.569227308	15.179220992	16.557356122	16.578247824	15.365266994
       8 threads
      	minor-faults		2,098,948	1,061,871	273,690		154,501		87,110		25,663		7,384
      	time, seconds		15.040026343	15.096933500	14.474757288	14.289129964	14.411537468	14.296316837	14.395635804
       32 threads
      	minor-faults		8,390,734	4,231,023	1,054,432	528,847		269,242		97,746		26,881
      	time, seconds		20.430433109	21.585235358	22.115062928	14.872878951	14.880856305	14.883370649	14.821261690
       60 threads
      	minor-faults		15,733,258	7,892,809	1,973,393	988,266		594,789		164,994		51,691
      	time, seconds		26.577302548	25.692397770	18.728863715	20.153026398	21.619101933	17.745086260	17.613215273
       120 threads
      	minor-faults		31,471,111	15,816,616	3,959,209	1,978,685	1,008,299	264,635		96,010
      	time, seconds		41.835322703	40.459786095	36.085306105	35.313894834	35.814445675	36.552633793	34.289210594
      
      Touch only one page in page table in 16GiB file
      FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
       1 thread
      	minor-faults		8,372		8,324		8,270		8,260		8,249		8,239		8,237
      	time, seconds		0.039892712	0.045369149	0.051846126	0.063681685	0.079095975	0.17652406	0.541213386
       8 threads
      	minor-faults		65,731		65,681		65,628		65,620		65,608		65,599		65,596
      	time, seconds		0.124159196	0.488600638	0.156854426	0.191901957	0.242631486	0.543569456	1.677303984
       32 threads
      	minor-faults		262,388		262,341		262,285		262,276		262,266		262,257		263,183
      	time, seconds		0.452421421	0.488600638	0.565020946	0.648229739	0.789850823	1.651584361	5.000361559
       60 threads
      	minor-faults		491,822		491,792		491,723		491,711		491,701		491,691		491,825
      	time, seconds		0.763288616	0.869620515	0.980727360	1.161732354	1.466915814	3.04041448	9.308612938
       120 threads
      	minor-faults		983,466		983,655		983,366		983,372		983,363		984,083		984,164
      	time, seconds		1.595846553	1.667902182	2.008959376	2.425380942	2.941368804	5.977807890	18.401846125
      
      This patch (of 2):
      
      Introduce new vm_ops callback ->map_pages() and uses it for mapping easy
      accessible pages around fault address.
      
      On read page fault, if filesystem provides ->map_pages(), we try to map up
      to FAULT_AROUND_PAGES pages around page fault address in hope to reduce
      number of minor page faults.
      
      We call ->map_pages first and use ->fault() as fallback if page by the
      offset is not ready to be mapped (cold page cache or something).
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c6e50b0
    • A
      drivers/lguest/page_tables.c: rename do_set_pte() · 179e0963
      Andrew Morton 提交于
      "mm: introduce vm_ops->map_pages()" wants to export a do_set_pte() from core
      kernel.  Rename lguest's do_set_pte() to something more lguest-specific.
      
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      179e0963
    • K
      tools/vm/page-types.c: page-cache sniffing feature · 65a6a410
      Konstantin Khlebnikov 提交于
      After this patch 'page-types' can walk over a file's mappings and
      analyze populated page cache pages mostly without disturbing its state.
      
      It maps chunk of file, marks VMA as MADV_RANDOM to turn off readahead,
      pokes VMA via mincore() to determine cached pages, triggers page-fault
      only for them, and finally gathers information via pagemap/kpageflags.
      Before unmap it marks VMA as MADV_SEQUENTIAL for ignoring reference
      bits.
      
      usage: page-types -f <path>
      
      If <path> is directory it will analyse all files in all subdirectories.
      
      Symlinks are not followed as well as mount points.  Hardlinks aren't
      handled, they'll be dumped as many times as they are found.  Recursive
      walk brings all dentries into dcache and populates page cache of
      block-devices aka 'Buffers'.
      
      Probably it's worth to add ioctl for dumping file page cache as array of
      PFNs as a replacement for this hackish juggling with
      mmap/madvise/mincore/pagemap.  Also recursive walk could be replaced
      with dumping cached inodes via some ioctl or debugfs interface followed
      by openning them via open_by_handle_at, this would fix hardlinks
      handling and unneeded population of dcache and buffers.  This interface
      might be used as data source for constructing readahead plans and for
      background optimizations of actively used files.
      
      collateral changes:
      + fix 64-bit LFS: define _FILE_OFFSET_BITS instead of _LARGEFILE64_SOURCE
      + replace lseek + read with single pread
      + make show_page_range() reusable after flush
      
      usage example:
      
        ~/src/linux/tools/vm$ sudo ./page-types -L -f page-types
        foffset offset    flags
        page-types       Inode: 2229277       Size: 89065 (22 pages)
        Modify: Tue Feb 25 12:00:59 2014 (162 seconds ago)
        Access: Tue Feb 25 12:01:00 2014 (161 seconds ago)
        0       3cbf3b     __RU_lA____M________________________
        1       38946a     __RU_lA____M________________________
        2       1a3cec     __RU_lA____M________________________
        3       1a8321     __RU_lA____M________________________
        4       3af7cc     __RU_lA____M________________________
        5       1ed532     __RU_lA_____________________________
        6       2e436a     __RU_lA_____________________________
        7       29a35e     ___U_lA_____________________________
        8       2de86e     ___U_lA_____________________________
        9       3bdfb4     ___U_lA_____________________________
        10      3cd8a3     ___U_lA_____________________________
        11      2afa50     ___U_lA_____________________________
        12      2534c2     ___U_lA_____________________________
        13      1b7a40     ___U_lA_____________________________
        14      17b0be     ___U_lA_____________________________
        15      392b0c     ___U_lA_____________________________
        16      3ba46a     __RU_lA_____________________________
        17      397dc8     ___U_lA_____________________________
        18      1f2a36     ___U_lA_____________________________
        19      21fd30     __RU_lA_____________________________
        20      2c35ba     __RU_l______________________________
        21      20f181     __RU_l______________________________
      
                     flags page-count   MB  symbolic-flags                        long-symbolic-flags
        0x000000000000002c          2    0  __RU_l______________________________  referenced,uptodate,lru
        0x0000000000000068         11    0  ___U_lA_____________________________  uptodate,lru,active
        0x000000000000006c          4    0  __RU_lA_____________________________  referenced,uptodate,lru,active
        0x000000000000086c          5    0  __RU_lA____M________________________  referenced,uptodate,lru,active,mmap
                     total         22    0
      
        ~/src/linux/tools/vm$ sudo ./page-types -f /
                     flags page-count     MB  symbolic-flags                        long-symbolic-flags
        0x0000000000000028      21761     85  ___U_l______________________________  uptodate,lru
        0x000000000000002c     127279    497  __RU_l______________________________  referenced,uptodate,lru
        0x0000000000000068      74160    289  ___U_lA_____________________________  uptodate,lru,active
        0x000000000000006c      84469    329  __RU_lA_____________________________  referenced,uptodate,lru,active
        0x000000000000007c          1      0  __RUDlA_____________________________  referenced,uptodate,dirty,lru,active
        0x0000000000000228        370      1  ___U_l___I__________________________  uptodate,lru,reclaim
        0x0000000000000828         49      0  ___U_l_____M________________________  uptodate,lru,mmap
        0x000000000000082c        126      0  __RU_l_____M________________________  referenced,uptodate,lru,mmap
        0x0000000000000868        137      0  ___U_lA____M________________________  uptodate,lru,active,mmap
        0x000000000000086c      12890     50  __RU_lA____M________________________  referenced,uptodate,lru,active,mmap
                     total     321242   1254
      Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65a6a410
    • K
      mm: disable split page table lock for !MMU · 9164550e
      Kirill A. Shutemov 提交于
      There's no reason to enable split page table lock if don't have page
      tables.
      
      It also triggers build error at least on ARM since we don't define
      pmd_page() for !MMU.
      
        In file included from arch/arm/kernel/asm-offsets.c:14:0:
        include/linux/mm.h: In function 'pte_lockptr':
        include/linux/mm.h:1392:2: error: implicit declaration of function 'pmd_page' [-Werror=implicit-function-declaration]
        include/linux/mm.h:1392:2: warning: passing argument 1 of 'ptlock_ptr' makes pointer from integer without a cast [enabled by default]
        include/linux/mm.h:1384:27: note: expected 'struct page *' but argument is of type 'int'
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9164550e
    • A
      exec: kill the unnecessary mm->def_flags setting in load_elf_binary() · ab0e113f
      Alex Thorlton 提交于
      load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
      is always zero.  Not only this looks strange, this is unnecessary
      because mm_init() has already set ->def_flags = 0.
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab0e113f
    • A
      mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE · a0715cc2
      Alex Thorlton 提交于
      Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs.  It
      also adds a prctl control which allows us to set the THP disable bit in
      mm->def_flags so that VMs will pick up the setting as they are created.
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0715cc2
    • A
      mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags" · 1e1836e8
      Alex Thorlton 提交于
      The main motivation behind this patch is to provide a way to disable THP
      for jobs where the code cannot be modified, and using a malloc hook with
      madvise is not an option (i.e.  statically allocated data).  This patch
      allows us to do just that, without affecting other jobs running on the
      system.
      
      We need to do this sort of thing for jobs where THP hurts performance,
      due to the possibility of increased remote memory accesses that can be
      created by situations such as the following:
      
      When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
      be handed out, and the THP will be stuck on whatever node the chunk was
      originally referenced from.  If many remote nodes need to do work on
      that same chunk, they'll be making remote accesses.
      
      With THP disabled, 4K pages can be handed out to separate nodes as
      they're needed, greatly reducing the amount of remote accesses to
      memory.
      
      This patch is based on some of my work combined with some
      suggestions/patches given by Oleg Nesterov.  The main goal here is to
      add a prctl switch to allow us to disable to THP on a per mm_struct
      basis.
      
      Here's a bit of test data with the new patch in place...
      
      First with the flag unset:
      
        # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 0
        Set thp_disabled state to 0
        Process pid = 18027
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
                 1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
                     7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
                 1,698,932 page-faults               #    0.000 M/sec
        355,222,544,890,379 cycles                   #    1.298 GHz                     [100.00%]
        536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle    [100.00%]
        409,110,531,310,223 stalled-cycles-backend   #  115.17% backend  cycles idle    [100.00%]
        148,286,797,266,411 instructions             #    0.42  insns per cycle
                                                     #    3.62  stalled cycles per insn [100.00%]
        27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
             1,188,655,196 branch-misses             #    0.00% of all branches
      
             427.001706337 seconds time elapsed
      
      Now with the flag set:
      
        # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 1
        Set thp_disabled state to 1
        Process pid = 144957
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
                   534,205 context-switches          #    0.000 M/sec                   [100.00%]
                     4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
                63,133,119 page-faults               #    0.000 M/sec
        147,977,747,269,768 cycles                   #    1.066 GHz                     [100.00%]
        200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle    [100.00%]
        105,175,163,716,388 stalled-cycles-backend   #   71.07% backend  cycles idle    [100.00%]
        180,916,213,503,160 instructions             #    1.22  insns per cycle
                                                     #    1.11  stalled cycles per insn [100.00%]
        26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
               714,066,351 branch-misses             #    0.00% of all branches
      
             216.196778807 seconds time elapsed
      
      As with previous versions of the patch, We're getting about a 2x
      performance increase here.  Here's a link to the test case I used, along
      with the little wrapper to activate the flag:
      
        http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz
      
      This patch (of 3):
      
      Revert commit 8e72033f and add in code to fix up any issues caused
      by the revert.
      
      The revert is necessary because hugepage_madvise would return -EINVAL
      when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
      patch set.
      
      Here's a snip of an e-mail from Gerald detailing the original purpose of
      this code, and providing justification for the revert:
      
        "The intent of commit 8e72033f was to guard against any future
         programming errors that may result in an madvice(MADV_HUGEPAGE) on
         guest mappings, which would crash the kernel.
      
         Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
         8e72033f was to be reverted, because that check will also prevent
         a kernel crash in the case described above, it will now send a
         SIGSEGV instead.
      
         This would now also allow to do the madvise on other parts, if
         needed, so it is a more flexible approach.  One could also say that
         it would have been better to do it this way right from the
         beginning..."
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e1836e8
    • J
      mm/compaction: clean-up code on success of ballon isolation · b6c75016
      Joonsoo Kim 提交于
      It is just for clean-up to reduce code size and improve readability.
      There is no functional change.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6c75016
    • J
      mm/compaction: check pageblock suitability once per pageblock · c122b208
      Joonsoo Kim 提交于
      isolation_suitable() and migrate_async_suitable() is used to be sure
      that this pageblock range is fine to be migragted.  It isn't needed to
      call it on every page.  Current code do well if not suitable, but, don't
      do well when suitable.
      
      1) It re-checks isolation_suitable() on each page of a pageblock that was
         already estabilished as suitable.
      2) It re-checks migrate_async_suitable() on each page of a pageblock that
         was not entered through the next_pageblock: label, because
         last_pageblock_nr is not otherwise updated.
      
      This patch fixes situation by 1) calling isolation_suitable() only once
      per pageblock and 2) always updating last_pageblock_nr to the pageblock
      that was just checked.
      
      Additionally, move PageBuddy() check after pageblock unit check, since
      pageblock check is the first thing we should do and makes things more
      simple.
      
      [vbabka@suse.cz: rephrase commit description]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c122b208
    • J
      mm/compaction: change the timing to check to drop the spinlock · be1aa03b
      Joonsoo Kim 提交于
      It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th
      pfn page.  This may results in below situation while isolating
      migratepage.
      
      1. try isolate 0x0 ~ 0x200 pfn pages.
      2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
         the spinlock.
      3. Then, to complete isolating, retry to aquire the lock.
      
      I think that it is better to use SWAP_CLUSTER_MAX th pfn for checking the
      criteria about dropping the lock.  This has no harm 0x0 pfn, because, at
      this time, locked variable would be false.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be1aa03b
    • J
      mm/compaction: do not call suitable_migration_target() on every page · 01ead534
      Joonsoo Kim 提交于
      suitable_migration_target() checks that pageblock is suitable for
      migration target.  In isolate_freepages_block(), it is called on every
      page and this is inefficient.  So make it called once per pageblock.
      
      suitable_migration_target() also checks if page is highorder or not, but
      it's criteria for highorder is pageblock order.  So calling it once
      within pageblock range has no problem.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01ead534
    • J
      mm/compaction: disallow high-order page for migration target · 7d348b9e
      Joonsoo Kim 提交于
      Purpose of compaction is to get a high order page.  Currently, if we
      find high-order page while searching migration target page, we break it
      to order-0 pages and use them as migration target.  It is contrary to
      purpose of compaction, so disallow high-order page to be used for
      migration target.
      
      Additionally, clean-up logic in suitable_migration_target() to simplify
      the code.  There is no functional changes from this clean-up.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d348b9e
    • M
      mm: exclude memoryless nodes from zone_reclaim · 70ef57e6
      Michal Hocko 提交于
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and a tons of anonymous memory
      which could be swapped out.  In the end it turned out that the OOM was a
      side effect of zone reclaim which wasn't unmapping and swapping out and
      so the system was pushed to the OOM.  Although this sounds like a bug
      somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
      interaction numactl on the said hardware suggests that the zone reclaim
      should not have been set in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep the allocations local but this doesn't
      make any sense on the memoryless nodes.  So let's exclude such nodes for
      init_zone_allows_reclaim which evaluates zone reclaim behavior and
      suitable reclaim_nodes.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ef57e6
    • D
      mm/memory.c: update comment in unmap_single_vma() · 7aa6b4ad
      Davidlohr Bueso 提交于
      The described issue now occurs inside mmap_region().  And unfortunately
      is still valid.
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aa6b4ad
    • W
      mm/vmscan: do not check compaction_ready on promoted zones · 9bbc04ee
      Weijie Yang 提交于
      We abort direct reclaim if we find the zone is ready for compaction.
      Sometimes the zone is just a promoted highmem zone to force a scan of
      highmem, which is not the intended zone the caller want to allocate a
      page from.  In this situation, setting aborted_reclaim to indicate the
      caller turned back to retry the allocation is waste of time and could
      cause a loop in __alloc_pages_slowpath().
      
      This patch does not check compaction_ready() on promoted zones to avoid
      the above situation.  Only set aborted_reclaim if the caller intended
      zone is ready for compaction.
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9bbc04ee
    • W
      mm/vmscan: restore sc->gfp_mask after promoting it to __GFP_HIGHMEM · 619d0d76
      Weijie Yang 提交于
      We promote sc->gfp_mask to __GFP_HIGHMEM to forcibly scan highmem if
      there are too many buffer_heads pinning highmem.  See cc715d99 ("mm:
      vmscan: forcibly scan highmem if there are too many buffer_heads pinning
      highmem").
      
      This patch restores sc->gfp_mask to its caller original value after
      finishing the scan job, to avoid the impact on other invocations from
      its upper caller, such as vmpressure_prio(), shrink_slab().
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      619d0d76
    • R
      mm: move mmu notifier call from change_protection to change_pmd_range · a5338093
      Rik van Riel 提交于
      The NUMA scanning code can end up iterating over many gigabytes of
      unpopulated memory, especially in the case of a freshly started KVM
      guest with lots of memory.
      
      This results in the mmu notifier code being called even when there are
      no mapped pages in a virtual address range.  The amount of time wasted
      can be enough to trigger soft lockup warnings with very large KVM
      guests.
      
      This patch moves the mmu notifier call to the pmd level, which
      represents 1GB areas of memory on x86-64.  Furthermore, the mmu notifier
      code is only called from the address in the PMD where present mappings
      are first encountered.
      
      The hugetlbfs code is left alone for now; hugetlb mappings are not
      relocatable, and as such are left alone by the NUMA code, and should
      never trigger this problem to begin with.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: NXing Gang <gang.xing@hp.com>
      Tested-by: NChegu Vinod <chegu_vinod@hp.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5338093
    • M
      mm: numa: recheck for transhuge pages under lock during protection changes · 1ad9f620
      Mel Gorman 提交于
      Sasha reported the following bug using trinity
      
        kernel BUG at mm/mprotect.c:149!
        invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G        W    3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105
        task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000
        RIP: change_protection_range+0x3b3/0x500
        Call Trace:
          change_protection+0x25/0x30
          change_prot_numa+0x1b/0x30
          task_numa_work+0x279/0x360
          task_work_run+0xae/0xf0
          do_notify_resume+0x8e/0xe0
          retint_signal+0x4d/0x92
      
      The VM_BUG_ON was added in -mm by the patch "mm,numa: reorganize
      change_pmd_range".  The race existed without the patch but was just
      harder to hit.
      
      The problem is that a transhuge check is made without holding the PTL.
      It's possible at the time of the check that a parallel fault clears the
      pmd and inserts a new one which then triggers the VM_BUG_ON check.  This
      patch removes the VM_BUG_ON but fixes the race by rechecking transhuge
      under the PTL when marking page tables for NUMA hinting and bailing if a
      race occurred.  It is not a problem for calls to mprotect() as they hold
      mmap_sem for write.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ad9f620
    • R
      mm,numa: reorganize change_pmd_range() · 88a9ab6e
      Rik van Riel 提交于
      Reorganize the order of ifs in change_pmd_range a little, in preparation
      for the next patch.
      
      [akpm@linux-foundation.org: fix indenting, per David]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: NXing Gang <gang.xing@hp.com>
      Tested-by: NChegu Vinod <chegu_vinod@hp.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88a9ab6e
    • N
      mm/hugetlb.c: add NULL check of return value of huge_pte_offset · a9af0c5d
      Naoya Horiguchi 提交于
      huge_pte_offset() could return NULL, so we need NULL check to avoid
      potential NULL pointer dereferences.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9af0c5d
    • F
      ntfs: logging clean-up · 87c1b497
      Fabian Frederick 提交于
      - Convert spinlock/static array to va_format (inspired by Joe Perches
        help on previous logging patches).
      
      - Convert printk(KERN_ERR to pr_warn in __ntfs_warning.
      
      - Convert printk(KERN_ERR to pr_err in __ntfs_error.
      
      - Convert printk(KERN_DEBUG to pr_debug in __ntfs_debug.  (Note that
        __ntfs_debug is still guarded by #if DEBUG)
      
      - Improve !DEBUG to parse all arguments (Joe Perches).
      
      - Sparse pr_foo() conversions in super.c
      
      NTFS, NTFS-fs prefixes as well as 'warning' and 'error' were removed :
      pr_foo() automatically adds module name and error level is already
      specified.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      87c1b497
  2. 07 4月, 2014 2 次提交
    • L
      Merge tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 2b3a8fd7
      Linus Torvalds 提交于
      Pull NFS client updates from Trond Myklebust:
       "Highlights include:
      
         - Stable fix for a use after free issue in the NFSv4.1 open code
         - Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
         - Optimise usage of readdirplus when confronted with 'ls -l' situations
         - Soft mount bugfixes
         - NFS over RDMA bugfixes
         - NFSv4 close locking fixes
         - Various NFSv4.x client state management optimisations
         - Rename/unlink code cleanups"
      
      * tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
        nfs: pass string length to pr_notice message about readdir loops
        NFSv4: Fix a use-after-free problem in open()
        SUNRPC: rpc_restart_call/rpc_restart_call_prepare should clear task->tk_status
        SUNRPC: Don't let rpc_delay() clobber non-timeout errors
        SUNRPC: Ensure call_connect_status() deals correctly with SOFTCONN tasks
        SUNRPC: Ensure call_status() deals correctly with SOFTCONN tasks
        NFSv4: Ensure we respect soft mount timeouts during trunking discovery
        NFSv4: Schedule recovery if nfs40_walk_client_list() is interrupted
        NFS: advertise only supported callback netids
        SUNRPC: remove KERN_INFO from dprintk() call sites
        SUNRPC: Fix large reads on NFS/RDMA
        NFS: Clean up: revert increase in READDIR RPC buffer max size
        SUNRPC: Ensure that call_bind times out correctly
        SUNRPC: Ensure that call_connect times out correctly
        nfs: emit a fsnotify_nameremove call in sillyrename codepath
        nfs: remove synchronous rename code
        nfs: convert nfs_rename to use async_rename infrastructure
        nfs: make nfs_async_rename non-static
        nfs: abstract out code needed to complete a sillyrename
        NFSv4: Clear the open state flags if the new stateid does not match
        ...
      2b3a8fd7
    • L
      Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux · 6f4c98e1
      Linus Torvalds 提交于
      Pull module updates from Rusty Russell:
       "Nothing major: the stricter permissions checking for sysfs broke a
        staging driver; fix included.  Greg KH said he'd take the patch but
        hadn't as the merge window opened, so it's included here to avoid
        breaking build"
      
      * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
        staging: fix up speakup kobject mode
        Use 'E' instead of 'X' for unsigned module taint flag.
        VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
        kallsyms: fix percpu vars on x86-64 with relocation.
        kallsyms: generalize address range checking
        module: LLVMLinux: Remove unused function warning from __param_check macro
        Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
        module: remove MODULE_GENERIC_TABLE
        module: allow multiple calls to MODULE_DEVICE_TABLE() per module
        module: use pr_cont
      6f4c98e1
  3. 06 4月, 2014 2 次提交
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile · 18a1a7a1
      Linus Torvalds 提交于
      Pull arch/tile updates from Chris Metcalf:
       "These fix a few stray build issues seen in linux-next, and also add
        the minimal required support for perf to tilegx"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
        arch/tile: remove unused variable 'devcap'
        tile: Fix vDSO compilation issue with allyesconfig
        perf tools: Allow building for tile
        tile/perf: Support perf_events on tilegx and tilepro
        tile: Enable NMIs on return from handle_nmi() without errors
        tile: Add support for handling PMC hardware
        tile: don't use __get_cpu_var() with structure-typed arguments
        tile: avoid overflow in ns2cycles
      18a1a7a1
    • L
      Merge tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 04535d27
      Linus Torvalds 提交于
      Pull device mapper changes from Mike Snitzer:
      
       - Fix dm-cache corruption caused by discard_block_size > cache_block_size
      
       - Fix a lock-inversion detected by LOCKDEP in dm-cache
      
       - Fix a dangling bio bug in the dm-thinp target's process_deferred_bios
         error path
      
       - Fix corruption due to non-atomic transaction commit which allowed a
         metadata superblock to be written before all other metadata was
         successfully written -- this is common to all targets that use the
         persistent-data library's transaction manager (dm-thinp, dm-cache and
         dm-era).
      
       - Various small cleanups in the DM core
      
       - Add the dm-era target which is useful for keeping track of which
         blocks were written within a user defined period of time called an
         'era'.  Use cases include tracking changed blocks for backup
         software, and partially invalidating the contents of a cache to
         restore cache coherency after rolling back a vendor snapshot.
      
       - Improve the on-disk layout of multithreaded writes to the
         dm-thin-pool by splitting the pool's deferred bio list to be a
         per-thin device list and then sorting that list using an rb_tree.
         The subsequent read throughput of the data written via multiple
         threads improved by ~70%.
      
       - Simplify the multipath target's handling of queuing IO by pushing
         requests back to the request queue rather than queueing the IO
         internally.
      
      * tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
        dm cache: fix a lock-inversion
        dm thin: sort the per thin deferred bios using an rb_tree
        dm thin: use per thin device deferred bio lists
        dm thin: simplify pool_is_congested
        dm thin: fix dangling bio in process_deferred_bios error path
        dm mpath: print more useful warnings in multipath_message()
        dm-mpath: do not activate failed paths
        dm mpath: remove extra nesting in map function
        dm mpath: remove map_io()
        dm mpath: reduce memory pressure when requeuing
        dm mpath: remove process_queued_ios()
        dm mpath: push back requests instead of queueing
        dm table: add dm_table_run_md_queue_async
        dm mpath: do not call pg_init when it is already running
        dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
        dm: stop using bi_private
        dm: remove dm_get_mapinfo
        dm: make dm_table_alloc_md_mempools static
        dm: take care to copy the space map roots before locking the superblock
        dm transaction manager: fix corruption due to non-atomic transaction commit
        ...
      04535d27