1. 24 1月, 2014 7 次提交
    • A
      numa: add a sysctl for numa_balancing · 54a43d54
      Andi Kleen 提交于
      Add a working sysctl to enable/disable automatic numa memory balancing
      at runtime.
      
      This allows us to track down performance problems with this feature and
      is generally a good idea.
      
      This was possible earlier through debugfs, but only with special
      debugging options set.  Also fix the boot message.
      
      [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a43d54
    • P
      mm: free memblock.memory in free_all_bootmem · 5e270e25
      Philipp Hachtmann 提交于
      When calling free_all_bootmem() the free areas under memblock's control
      are released to the buddy allocator.  Additionally the reserved list is
      freed if it was reallocated by memblock.  The same should apply for the
      memory list.
      Signed-off-by: NPhilipp Hachtmann <phacht@linux.vnet.ibm.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e270e25
    • V
      memcg, slab: RCU protect memcg_params for root caches · f8570263
      Vladimir Davydov 提交于
      We relocate root cache's memcg_params whenever we need to grow the
      memcg_caches array to accommodate all kmem-active memory cgroups.
      Currently on relocation we free the old version immediately, which can
      lead to use-after-free, because the memcg_caches array is accessed
      lock-free (see cache_from_memcg_idx()).  This patch fixes this by making
      memcg_params RCU-protected for root caches.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8570263
    • V
      memcg, slab: clean up memcg cache initialization/destruction · 1aa13254
      Vladimir Davydov 提交于
      Currently, we have rather a messy function set relating to per-memcg
      kmem cache initialization/destruction.
      
      Per-memcg caches are created in memcg_create_kmem_cache().  This
      function calls kmem_cache_create_memcg() to allocate and initialize a
      kmem cache and then "registers" the new cache in the
      memcg_params::memcg_caches array of the parent cache.
      
      During its work-flow, kmem_cache_create_memcg() executes the following
      memcg-related functions:
      
       - memcg_alloc_cache_params(), to initialize memcg_params of the newly
         created cache;
       - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
         list.
      
      On the other hand, kmem_cache_destroy() called on a cache destruction
      only calls memcg_release_cache(), which does all the work: it cleans the
      reference to the cache in its parent's memcg_params::memcg_caches,
      removes the cache from the memcg_slab_caches list, and frees
      memcg_params.
      
      Such an inconsistency between destruction and initialization paths make
      the code difficult to read, so let's clean this up a bit.
      
      This patch moves all the code relating to registration of per-memcg
      caches (adding to memcg list, setting the pointer to a cache from its
      parent) to the newly created memcg_register_cache() and
      memcg_unregister_cache() functions making the initialization and
      destruction paths look symmetrical.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1aa13254
    • V
      memcg, slab: kmem_cache_create_memcg(): fix memleak on fail path · 363a044f
      Vladimir Davydov 提交于
      We do not free the cache's memcg_params if __kmem_cache_create fails.
      Fix this.
      
      Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
      because it actually does not register the cache anywhere, but simply
      initialize kmem_cache::memcg_params.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      363a044f
    • S
      mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Sasha Levin 提交于
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      I've recently noticed based on the requests to add a small piece of code
      that dumps the page to various VM_BUG_ON sites that the page dump is
      quite useful to people debugging issues in mm.
      
      This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      309381fe
    • D
      mm: print more details for bad_page() · f0b791a3
      Dave Hansen 提交于
      bad_page() is cool in that it prints out a bunch of data about the page.
      But, I can never remember which page flags are good and which are bad,
      or whether ->index or ->mapping is required to be NULL.
      
      This patch allows bad/dump_page() callers to specify a string about why
      they are dumping the page and adds explanation strings to a number of
      places.  It also adds a 'bad_flags' argument to bad_page(), which it
      then dumps out separately from the flags which are actually set.
      
      This way, the messages will show specifically why the page was bad,
      *specifically* which flags it is complaining about, if it was a page
      flag combination which was the problem.
      
      [akpm@linux-foundation.org: switch to pr_alert]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0b791a3
  2. 23 1月, 2014 1 次提交
    • M
      fuse: fix pipe_buf_operations · 28a625cb
      Miklos Szeredi 提交于
      Having this struct in module memory could Oops when if the module is
      unloaded while the buffer still persists in a pipe.
      
      Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
      merge them into nosteal_pipe_buf_ops (this is the same as
      default_pipe_buf_ops except stealing the page from the buffer is not
      allowed).
      Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
      28a625cb
  3. 22 1月, 2014 32 次提交
    • L
      neighbour.h: fix comment · c04e7da0
      Li Zhong 提交于
      Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      c04e7da0
    • M
      sched: Fix warning on make htmldocs caused by wait.h · f434f7af
      Masanari Iida 提交于
      Missing "@" in include/linux/wait.h cause "make htmldocs" failed
      with following warning messages.
      
      Warning(/home/iida/Repo/linux-next//include/linux/wait.h:304):
      No description found for parameter 'cmd1'
      Warning(/home/iida/Repo/linux-next//include/linux/wait.h:304):
      No description found for parameter 'cmd2'
      Signed-off-by: NMasanari Iida <standby24x7@gmail.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      f434f7af
    • D
      dm log userspace: allow mark requests to piggyback on flush requests · 5066a4df
      Dongmao Zhang 提交于
      In the cluster evironment, cluster write has poor performance because
      userspace_flush() has to contact a userspace program (cmirrord) for
      clear/mark/flush requests.  But both mark and flush requests require
      cmirrord to communicate the message to all the cluster nodes for each
      flush call.  This behaviour is really slow.
      
      To address this we now merge mark and flush requests together to reduce
      the kernel-userspace-kernel time.  We allow a new directive,
      "integrated_flush" that can be used to instruct the kernel log code to
      combine flush and mark requests when directed by userspace.  If not
      directed by userspace (due to an older version of the userspace code
      perhaps), the kernel will function as it did previously - preserving
      backwards compatibility.  Additionally, flush requests are performed
      lazily when only clear requests exist.
      Signed-off-by: NDongmao Zhang <dmzhang@suse.com>
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      5066a4df
    • J
      mm/migrate: remove unused function, fail_migrate_page() · 78d5506e
      Joonsoo Kim 提交于
      fail_migrate_page() isn't used anywhere, so remove it.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78d5506e
    • J
      mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages · 59c82b70
      Joonsoo Kim 提交于
      Some part of putback_lru_pages() and putback_movable_pages() is
      duplicated, so it could confuse us what we should use.  We can remove
      putback_lru_pages() since it is not really needed now.  This makes us
      undestand and maintain the code more easily.
      
      And comment on putback_movable_pages() is stale now, so fix it.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59c82b70
    • V
      mm: compaction: encapsulate defer reset logic · de6c60a6
      Vlastimil Babka 提交于
      Currently there are several functions to manipulate the deferred
      compaction state variables.  The remaining case where the variables are
      touched directly is when a successful allocation occurs in direct
      compaction, or is expected to be successful in the future by kswapd.
      Here, the lowest order that is expected to fail is updated, and in the
      case of successful allocation, the deferred status and counter is reset
      completely.
      
      Create a new function compaction_defer_reset() to encapsulate this
      functionality and make it easier to understand the code.  No functional
      change.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de6c60a6
    • M
      mm: compaction: trace compaction begin and end · 0eb927c0
      Mel Gorman 提交于
      The broad goal of the series is to improve allocation success rates for
      huge pages through memory compaction, while trying not to increase the
      compaction overhead.  The original objective was to reintroduce
      capturing of high-order pages freed by the compaction, before they are
      split by concurrent activity.  However, several bugs and opportunities
      for simple improvements were found in the current implementation, mostly
      through extra tracepoints (which are however too ugly for now to be
      considered for sending).
      
      The patches mostly deal with two mechanisms that reduce compaction
      overhead, which is caching the progress of migrate and free scanners,
      and marking pageblocks where isolation failed to be skipped during
      further scans.
      
      Patch 1 (from mgorman) adds tracepoints that allow calculate time spent in
              compaction and potentially debug scanner pfn values.
      
      Patch 2 encapsulates the some functionality for handling deferred compactions
              for better maintainability, without a functional change
              type is not determined without being actually needed.
      
      Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
              they have been read to initialize a compaction run.
      
      Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
              and can lead to multiple compaction attempts quitting early without
              doing any work.
      
      Patch 5 improves the chances of sync compaction to process pageblocks that
              async compaction has skipped due to being !MIGRATE_MOVABLE.
      
      Patch 6 improves the chances of sync direct compaction to actually do anything
              when called after async compaction fails during allocation slowpath.
      
      The impact of patches were validated using mmtests's stress-highalloc
      benchmark with mmtests's stress-highalloc benchmark on a x86_64 machine
      with 4GB memory.
      
      Due to instability of the results (mostly related to the bugs fixed by
      patches 2 and 3), 10 iterations were performed, taking min,mean,max
      values for success rates and mean values for time and vmstat-based
      metrics.
      
      First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
      patches stacked on top of v3.13-rc2.  Patch 2 is OK to serve as baseline
      due to no functional changes in 1 and 2.  Comments below.
      
      stress-highalloc
                                   3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                                    2-nothp               3-nothp               4-nothp               5-nothp               6-nothp
      Success 1 Min          9.00 (  0.00%)       10.00 (-11.11%)       43.00 (-377.78%)       43.00 (-377.78%)       33.00 (-266.67%)
      Success 1 Mean        27.50 (  0.00%)       25.30 (  8.00%)       45.50 (-65.45%)       45.90 (-66.91%)       46.30 (-68.36%)
      Success 1 Max         36.00 (  0.00%)       36.00 (  0.00%)       47.00 (-30.56%)       48.00 (-33.33%)       52.00 (-44.44%)
      Success 2 Min         10.00 (  0.00%)        8.00 ( 20.00%)       46.00 (-360.00%)       45.00 (-350.00%)       35.00 (-250.00%)
      Success 2 Mean        26.40 (  0.00%)       23.50 ( 10.98%)       47.30 (-79.17%)       47.60 (-80.30%)       48.10 (-82.20%)
      Success 2 Max         34.00 (  0.00%)       33.00 (  2.94%)       48.00 (-41.18%)       50.00 (-47.06%)       54.00 (-58.82%)
      Success 3 Min         65.00 (  0.00%)       63.00 (  3.08%)       85.00 (-30.77%)       84.00 (-29.23%)       85.00 (-30.77%)
      Success 3 Mean        76.70 (  0.00%)       70.50 (  8.08%)       86.20 (-12.39%)       85.50 (-11.47%)       86.00 (-12.13%)
      Success 3 Max         87.00 (  0.00%)       86.00 (  1.15%)       88.00 ( -1.15%)       87.00 (  0.00%)       87.00 (  0.00%)
      
                  3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                   2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
      User         6437.72     6459.76     5960.32     5974.55     6019.67
      System       1049.65     1049.09     1029.32     1031.47     1032.31
      Elapsed      1856.77     1874.48     1949.97     1994.22     1983.15
      
                                    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                                     2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
      Minor Faults                 253952267   254581900   250030122   250507333   250157829
      Major Faults                       420         407         506         530         530
      Swap Ins                             4           9           9           6           6
      Swap Outs                          398         375         345         346         333
      Direct pages scanned            197538      189017      298574      287019      299063
      Kswapd pages scanned           1809843     1801308     1846674     1873184     1861089
      Kswapd pages reclaimed         1806972     1798684     1844219     1870509     1858622
      Direct pages reclaimed          197227      188829      298380      286822      298835
      Kswapd efficiency                  99%         99%         99%         99%         99%
      Kswapd velocity                953.382     970.449     952.243     934.569     922.286
      Direct efficiency                  99%         99%         99%         99%         99%
      Direct velocity                104.058     101.832     153.961     143.200     148.205
      Percentage direct scans             9%          9%         13%         13%         13%
      Zone normal velocity           347.289     359.676     348.063     339.933     332.983
      Zone dma32 velocity            710.151     712.605     758.140     737.835     737.507
      Zone dma velocity                0.000       0.000       0.000       0.000       0.000
      Page writes by reclaim         557.600     429.000     353.600     426.400     381.800
      Page writes file                   159          53           7          79          48
      Page writes anon                   398         375         345         346         333
      Page reclaim immediate             825         644         411         575         420
      Sector Reads                   2781750     2769780     2878547     2939128     2910483
      Sector Writes                 12080843    12083351    12012892    12002132    12010745
      Page rescued immediate               0           0           0           0           0
      Slabs scanned                  1575654     1545344     1778406     1786700     1794073
      Direct inode steals               9657       10037       15795       14104       14645
      Kswapd inode steals              46857       46335       50543       50716       51796
      Kswapd skipped wait                  0           0           0           0           0
      THP fault alloc                     97          91          81          71          77
      THP collapse alloc                 456         506         546         544         565
      THP splits                           6           5           5           4           4
      THP fault fallback                   0           1           0           0           0
      THP collapse fail                   14          14          12          13          12
      Compaction stalls                 1006         980        1537        1536        1548
      Compaction success                 303         284         562         559         578
      Compaction failures                702         696         974         976         969
      Page migrate success           1177325     1070077     3927538     3781870     3877057
      Page migrate failure                 0           0           0           0           0
      Compaction pages isolated      2547248     2306457     8301218     8008500     8200674
      Compaction migrate scanned    42290478    38832618   153961130   154143900   159141197
      Compaction free scanned       89199429    79189151   356529027   351943166   356326727
      Compaction cost                   1566        1426        5312        5156        5294
      NUMA PTE updates                     0           0           0           0           0
      NUMA hint faults                     0           0           0           0           0
      NUMA hint local faults               0           0           0           0           0
      NUMA hint local percent            100         100         100         100         100
      NUMA pages migrated                  0           0           0           0           0
      AutoNUMA cost                        0           0           0           0           0
      
      Observations:
      
      - The "Success 3" line is allocation success rate with system idle
        (phases 1 and 2 are with background interference).  I used to get stable
        values around 85% with vanilla 3.11.  The lower min and mean values came
        with 3.12.  This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
        zone allocator policy") As explained in comment for patch 3, I don't
        think the commit is wrong, but that it makes the effect of compaction
        bugs worse.  From patch 3 onwards, the results are OK and match the 3.11
        results.
      
      - Patch 4 also clearly helps phases 1 and 2, and exceeds any results
        I've seen with 3.11 (I didn't measure it that thoroughly then, but it
        was never above 40%).
      
      - Compaction cost and number of scanned pages is higher, especially due
        to patch 4.  However, keep in mind that patches 3 and 4 fix existing
        bugs in the current design of compaction overhead mitigation, they do
        not change it.  If overhead is found unacceptable, then it should be
        decreased differently (and consistently, not due to random conditions)
        than the current implementation does.  In contrast, patches 5 and 6
        (which are not strictly bug fixes) do not increase the overhead (but
        also not success rates).  This might be a limitation of the
        stress-highalloc benchmark as it's quite uniform.
      
      Another set of results is when configuring stress-highalloc t allocate
      with similar flags as THP uses:
       (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)
      
      stress-highalloc
                                   3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                                      2-thp                 3-thp                 4-thp                 5-thp                 6-thp
      Success 1 Min          2.00 (  0.00%)        7.00 (-250.00%)       18.00 (-800.00%)       19.00 (-850.00%)       26.00 (-1200.00%)
      Success 1 Mean        19.20 (  0.00%)       17.80 (  7.29%)       29.20 (-52.08%)       29.90 (-55.73%)       32.80 (-70.83%)
      Success 1 Max         27.00 (  0.00%)       29.00 ( -7.41%)       35.00 (-29.63%)       36.00 (-33.33%)       37.00 (-37.04%)
      Success 2 Min          3.00 (  0.00%)        8.00 (-166.67%)       21.00 (-600.00%)       21.00 (-600.00%)       32.00 (-966.67%)
      Success 2 Mean        19.30 (  0.00%)       17.90 (  7.25%)       32.20 (-66.84%)       32.60 (-68.91%)       35.70 (-84.97%)
      Success 2 Max         27.00 (  0.00%)       30.00 (-11.11%)       36.00 (-33.33%)       37.00 (-37.04%)       39.00 (-44.44%)
      Success 3 Min         62.00 (  0.00%)       62.00 (  0.00%)       85.00 (-37.10%)       75.00 (-20.97%)       64.00 ( -3.23%)
      Success 3 Mean        66.30 (  0.00%)       65.50 (  1.21%)       85.60 (-29.11%)       83.40 (-25.79%)       83.50 (-25.94%)
      Success 3 Max         70.00 (  0.00%)       69.00 (  1.43%)       87.00 (-24.29%)       86.00 (-22.86%)       87.00 (-24.29%)
      
                  3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                     2-thp       3-thp       4-thp       5-thp       6-thp
      User         6547.93     6475.85     6265.54     6289.46     6189.96
      System       1053.42     1047.28     1043.23     1042.73     1038.73
      Elapsed      1835.43     1821.96     1908.67     1912.74     1956.38
      
                                    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                                       2-thp       3-thp       4-thp       5-thp       6-thp
      Minor Faults                 256805673   253106328   253222299   249830289   251184418
      Major Faults                       395         375         423         434         448
      Swap Ins                            12          10          10          12           9
      Swap Outs                          530         537         487         455         415
      Direct pages scanned             71859       86046      153244      152764      190713
      Kswapd pages scanned           1900994     1870240     1898012     1892864     1880520
      Kswapd pages reclaimed         1897814     1867428     1894939     1890125     1877924
      Direct pages reclaimed           71766       85908      153167      152643      190600
      Kswapd efficiency                  99%         99%         99%         99%         99%
      Kswapd velocity               1029.000    1067.782    1000.091     991.049     951.218
      Direct efficiency                  99%         99%         99%         99%         99%
      Direct velocity                 38.897      49.127      80.747      79.983      96.468
      Percentage direct scans             3%          4%          7%          7%          9%
      Zone normal velocity           351.377     372.494     348.910     341.689     335.310
      Zone dma32 velocity            716.520     744.414     731.928     729.343     712.377
      Zone dma velocity                0.000       0.000       0.000       0.000       0.000
      Page writes by reclaim         669.300     604.000     545.700     538.900     429.900
      Page writes file                   138          66          58          83          14
      Page writes anon                   530         537         487         455         415
      Page reclaim immediate             806         655         772         548         517
      Sector Reads                   2711956     2703239     2811602     2818248     2839459
      Sector Writes                 12163238    12018662    12038248    11954736    11994892
      Page rescued immediate               0           0           0           0           0
      Slabs scanned                  1385088     1388364     1507968     1513292     1558656
      Direct inode steals               1739        2564        4622        5496        6007
      Kswapd inode steals              47461       46406       47804       48013       48466
      Kswapd skipped wait                  0           0           0           0           0
      THP fault alloc                    110          82          84          69          70
      THP collapse alloc                 445         482         467         462         539
      THP splits                           6           5           4           5           3
      THP fault fallback                   3           0           0           0           0
      THP collapse fail                   15          14          14          14          13
      Compaction stalls                  659         685        1033        1073        1111
      Compaction success                 222         225         410         427         456
      Compaction failures                436         460         622         646         655
      Page migrate success            446594      439978     1085640     1095062     1131716
      Page migrate failure                 0           0           0           0           0
      Compaction pages isolated      1029475     1013490     2453074     2482698     2565400
      Compaction migrate scanned     9955461    11344259    24375202    27978356    30494204
      Compaction free scanned       27715272    28544654    80150615    82898631    85756132
      Compaction cost                    552         555        1344        1379        1436
      NUMA PTE updates                     0           0           0           0           0
      NUMA hint faults                     0           0           0           0           0
      NUMA hint local faults               0           0           0           0           0
      NUMA hint local percent            100         100         100         100         100
      NUMA pages migrated                  0           0           0           0           0
      AutoNUMA cost                        0           0           0           0           0
      
      There are some differences from the previous results for THP-like allocations:
      
      - Here, the bad result for unpatched kernel in phase 3 is much more
        consistent to be between 65-70% and not related to the "regression" in
        3.12.  Still there is the improvement from patch 4 onwards, which brings
        it on par with simple GFP_HIGHUSER_MOVABLE allocations.
      
      - Compaction costs have increased, but nowhere near as much as the
        non-THP case.  Again, the patches should be worth the gained
        determininsm.
      
      - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
         This is most likely due to __GFP_NO_KSWAPD flag, which means the cached
        pfn's and pageblock skip bits are not reset by kswapd that often (at
        least in phase 3 where no concurrent activity would wake up kswapd) and
        the patches thus help the sync-after-async compaction.  It doesn't
        however show that the sync compaction would help so much with success
        rates, which can be again seen as a limitation of the benchmark
        scenario.
      
      This patch (of 6):
      
      Add two tracepoints for compaction begin and end of a zone.  Using this it
      is possible to calculate how much time a workload is spending within
      compaction and potentially debug problems related to cached pfns for
      scanning.  In combination with the direct reclaim and slab trace points it
      should be possible to estimate most allocation-related overhead for a
      workload.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0eb927c0
    • M
      sched: add tracepoints related to NUMA task migration · 286549dc
      Mel Gorman 提交于
      This patch adds three tracepoints
       o trace_sched_move_numa	when a task is moved to a node
       o trace_sched_swap_numa	when a task is swapped with another task
       o trace_sched_stick_numa	when a numa-related migration fails
      
      The tracepoints allow the NUMA scheduler activity to be monitored and the
      following high-level metrics can be calculated
      
       o NUMA migrated stuck	 nr trace_sched_stick_numa
       o NUMA migrated idle	 nr trace_sched_move_numa
       o NUMA migrated swapped nr trace_sched_swap_numa
       o NUMA local swapped	 trace_sched_swap_numa src_nid == dst_nid (should never happen)
       o NUMA remote swapped	 trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
       o NUMA group swapped	 trace_sched_swap_numa src_ngid == dst_ngid
      			 Maybe a small number of these are acceptable
      			 but a high number would be a major surprise.
      			 It would be even worse if bounces are frequent.
       o NUMA avg task migs.	 Average number of migrations for tasks
       o NUMA stddev task mig	 Self-explanatory
       o NUMA max task migs.	 Maximum number of migrations for a single task
      
      In general the intent of the tracepoints is to help diagnose problems
      where automatic NUMA balancing appears to be doing an excessive amount
      of useless work.
      
      [akpm@linux-foundation.org: remove semicolon-after-if, repair coding-style]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      286549dc
    • M
      mm: numa: trace tasks that fail migration due to rate limiting · af1839d7
      Mel Gorman 提交于
      A low local/remote numa hinting fault ratio is potentially explained by
      failed migrations.  This patch adds a tracepoint that fires when
      migration fails due to migration rate limitation.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af1839d7
    • M
      mm: numa: limit scope of lock for NUMA migrate rate limiting · 1c5e9c27
      Mel Gorman 提交于
      NUMA migrate rate limiting protects a migration counter and window using
      a lock but in some cases this can be a contended lock.  It is not
      critical that the number of pages be perfect, lost updates are
      acceptable.  Reduce the importance of this lock.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c5e9c27
    • S
      mm/memblock: add memblock memory allocation apis · 26f09e9b
      Santosh Shilimkar 提交于
      Introduce memblock memory allocation APIs which allow to support PAE or
      LPAE extension on 32 bits archs where the physical memory start address
      can be beyond 4GB.  In such cases, existing bootmem APIs which operate
      on 32 bit addresses won't work and needs memblock layer which operates
      on 64 bit addresses.
      
      So we add equivalent APIs so that we can replace usage of bootmem with
      memblock interfaces.  Architectures already converted to NO_BOOTMEM use
      these new memblock interfaces.  The architectures which are still not
      converted to NO_BOOTMEM continue to function as is because we still
      maintain the fal lback option of bootmem back-end supporting these new
      interfaces.  So no functional change as such.
      
      In long run, once all the architectures moves to NO_BOOTMEM, we can get
      rid of bootmem layer completely.  This is one step to remove the core
      code dependency with bootmem and also gives path for architectures to
      move away from bootmem.
      
      The proposed interface will became active if both CONFIG_HAVE_MEMBLOCK
      and CONFIG_NO_BOOTMEM are specified by arch.  In case
      !CONFIG_NO_BOOTMEM, the memblock() wrappers will fallback to the
      existing bootmem apis so that arch's not converted to NO_BOOTMEM
      continue to work as is.
      
      The meaning of MEMBLOCK_ALLOC_ACCESSIBLE and MEMBLOCK_ALLOC_ANYWHERE
      is kept same.
      
      [akpm@linux-foundation.org: s/depricated/deprecated/]
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26f09e9b
    • G
      mm/memblock: switch to use NUMA_NO_NODE instead of MAX_NUMNODES · b1154233
      Grygorii Strashko 提交于
      It's recommended to use NUMA_NO_NODE everywhere to select "process any
      node" behavior or to indicate that "no node id specified".
      
      Hence, update __next_free_mem_range*() API's to accept both NUMA_NO_NODE
      and MAX_NUMNODES, but emit warning once on MAX_NUMNODES, and correct
      corresponding API's documentation to describe new behavior.  Also,
      update other memblock/nobootmem APIs where MAX_NUMNODES is used
      dirrectly.
      
      The change was suggested by Tejun Heo.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1154233
    • G
      mm/memblock: reorder parameters of memblock_find_in_range_node · 87029ee9
      Grygorii Strashko 提交于
      Reorder parameters of memblock_find_in_range_node to be consistent with
      other memblock APIs.
      
      The change was suggested by Tejun Heo <tj@kernel.org>.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      87029ee9
    • G
      mm/bootmem: remove duplicated declaration of __free_pages_bootmem() · 10e89523
      Grygorii Strashko 提交于
      The __free_pages_bootmem is used internally by MM core and already
      defined in internal.h.  So, remove duplicated declaration.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10e89523
    • O
      introduce for_each_thread() to replace the buggy while_each_thread() · 0c740d0a
      Oleg Nesterov 提交于
      while_each_thread() and next_thread() should die, almost every lockless
      usage is wrong.
      
      1. Unless g == current, the lockless while_each_thread() is not safe.
      
         while_each_thread(g, t) can loop forever if g exits, next_thread()
         can't reach the unhashed thread in this case. Note that this can
         happen even if g is the group leader, it can exec.
      
      2. Even if while_each_thread() itself was correct, people often use
         it wrongly.
      
         It was never safe to just take rcu_read_lock() and loop unless
         you verify that pid_alive(g) == T, even the first next_thread()
         can point to the already freed/reused memory.
      
      This patch adds signal_struct->thread_head and task->thread_node to
      create the normal rcu-safe list with the stable head.  The new
      for_each_thread(g, t) helper is always safe under rcu_read_lock() as
      long as this task_struct can't go away.
      
      Note: of course it is ugly to have both task_struct->thread_node and the
      old task_struct->thread_group, we will kill it later, after we change
      the users of while_each_thread() to use for_each_thread().
      
      Perhaps we can kill it even before we convert all users, we can
      reimplement next_thread(t) using the new thread_head/thread_node.  But
      we can't do this right now because this will lead to subtle behavioural
      changes.  For example, do/while_each_thread() always sees at least one
      task, while for_each_thread() can do nothing if the whole thread group
      has died.  Or thread_group_empty(), currently its semantics is not clear
      unless thread_group_leader(p) and we need to audit the callers before we
      can change it.
      
      So this patch adds the new interface which has to coexist with the old
      one for some time, hopefully the next changes will be more or less
      straightforward and the old one will go away soon.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NSergey Dyasly <dserrg@gmail.com>
      Tested-by: NSergey Dyasly <dserrg@gmail.com>
      Reviewed-by: NSameer Nanda <snanda@chromium.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: "Ma, Xindong" <xindong.ma@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c740d0a
    • J
      mm/rmap: use rmap_walk() in page_referenced() · 9f32624b
      Joonsoo Kim 提交于
      Now, we have an infrastructure in rmap_walk() to handle difference from
      variants of rmap traversing functions.
      
      So, just use it in page_referenced().
      
      In this patch, I change following things.
      
      1. remove some variants of rmap traversing functions.
      	cf> page_referenced_ksm, page_referenced_anon,
      	page_referenced_file
      
      2. introduce new struct page_referenced_arg and pass it to
         page_referenced_one(), main function of rmap_walk, in order to count
         reference, to store vm_flags and to check finish condition.
      
      3. mechanical change to use rmap_walk() in page_referenced().
      
      [liwanp@linux.vnet.ibm.com: fix BUG at rmap_walk]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f32624b
    • J
      mm/rmap: use rmap_walk() in try_to_munlock() · e8351ac9
      Joonsoo Kim 提交于
      Now, we have an infrastructure in rmap_walk() to handle difference from
      variants of rmap traversing functions.
      
      So, just use it in try_to_munlock().
      
      In this patch, I change following things.
      
      1. remove some variants of rmap traversing functions.
      	cf> try_to_unmap_ksm, try_to_unmap_anon, try_to_unmap_file
      2. mechanical change to use rmap_walk() in try_to_munlock().
      3. copy and paste comments.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8351ac9
    • J
      mm/rmap: use rmap_walk() in try_to_unmap() · 52629506
      Joonsoo Kim 提交于
      Now, we have an infrastructure in rmap_walk() to handle difference from
      variants of rmap traversing functions.
      
      So, just use it in try_to_unmap().
      
      In this patch, I change following things.
      
      1. enable rmap_walk() if !CONFIG_MIGRATION.
      2. mechanical change to use rmap_walk() in try_to_unmap().
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52629506
    • J
      mm/rmap: extend rmap_walk_xxx() to cope with different cases · 0dd1c7bb
      Joonsoo Kim 提交于
      There are a lot of common parts in traversing functions, but there are
      also a little of uncommon parts in it.  By assigning proper function
      pointer on each rmap_walker_control, we can handle these difference
      correctly.
      
      Following are differences we should handle.
      
      1. difference of lock function in anon mapping case
      2. nonlinear handling in file mapping case
      3. prechecked condition:
      	checking memcg in page_referenced(),
      	checking VM_SHARE in page_mkclean()
      	checking temporary vma in try_to_unmap()
      4. exit condition:
      	checking page_mapped() in try_to_unmap()
      
      So, in this patch, I introduce 4 function pointers to handle above
      differences.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0dd1c7bb
    • J
      mm/rmap: make rmap_walk to get the rmap_walk_control argument · 051ac83a
      Joonsoo Kim 提交于
      In each rmap traverse case, there is some difference so that we need
      function pointers and arguments to them in order to handle these
      
      For this purpose, struct rmap_walk_control is introduced in this patch,
      and will be extended in following patch.  Introducing and extending are
      separate, because it clarify changes.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      051ac83a
    • T
      memblock, mem_hotplug: make memblock skip hotpluggable regions if needed · 55ac590c
      Tang Chen 提交于
      Linux kernel cannot migrate pages used by the kernel.  As a result,
      hotpluggable memory used by the kernel won't be able to be hot-removed.
      To solve this problem, the basic idea is to prevent memblock from
      allocating hotpluggable memory for the kernel at early time, and arrange
      all hotpluggable memory in ACPI SRAT(System Resource Affinity Table) as
      ZONE_MOVABLE when initializing zones.
      
      In the previous patches, we have marked hotpluggable memory regions with
      MEMBLOCK_HOTPLUG flag in memblock.memory.
      
      In this patch, we make memblock skip these hotpluggable memory regions
      in the default top-down allocation function if movable_node boot option
      is specified.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55ac590c
    • T
      memblock: make memblock_set_node() support different memblock_type · e7e8de59
      Tang Chen 提交于
      [sfr@canb.auug.org.au: fix powerpc build]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7e8de59
    • T
      memblock, mem_hotplug: introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions · 66b16edf
      Tang Chen 提交于
      In find_hotpluggable_memory, once we find out a memory region which is
      hotpluggable, we want to mark them in memblock.memory.  So that we could
      control memblock allocator not to allocte hotpluggable memory for the
      kernel later.
      
      To achieve this goal, we introduce MEMBLOCK_HOTPLUG flag to indicate the
      hotpluggable memory regions in memblock and a function
      memblock_mark_hotplug() to mark hotpluggable memory if we find one.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66b16edf
    • T
      memblock, numa: introduce flags field into memblock · 66a20757
      Tang Chen 提交于
      There is no flag in memblock to describe what type the memory is.
      Sometimes, we may use memblock to reserve some memory for special usage.
      And we want to know what kind of memory it is.  So we need a way to
      
      In hotplug environment, we want to reserve hotpluggable memory so the
      kernel won't be able to use it.  And when the system is up, we have to
      free these hotpluggable memory to buddy.  So we need to mark these
      memory first.
      
      In order to do so, we need to mark out these special memory in memblock.
      In this patch, we introduce a new "flags" member into memblock_region:
      
         struct memblock_region {
                 phys_addr_t base;
                 phys_addr_t size;
                 unsigned long flags;		/* This is new. */
         #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
                 int nid;
         #endif
         };
      
      This patch does the following things:
      1) Add "flags" member to memblock_region.
      2) Modify the following APIs' prototype:
      	memblock_add_region()
      	memblock_insert_region()
      3) Add memblock_reserve_region() to support reserve memory with flags, and keep
         memblock_reserve()'s prototype unmodified.
      4) Modify other APIs to support flags, but keep their prototype unmodified.
      
      The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.
      Suggested-by: NWen Congyang <wency@cn.fujitsu.com>
      Suggested-by: NLiu Jiang <jiang.liu@huawei.com>
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66a20757
    • J
      mm: add overcommit_kbytes sysctl variable · 49f0ce5f
      Jerome Marchand 提交于
      Some applications that run on HPC clusters are designed around the
      availability of RAM and the overcommit ratio is fine tuned to get the
      maximum usage of memory without swapping.  With growing memory, the
      1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
      for these workload (on a 2TB machine it represents no less than 20GB).
      
      This patch adds the new overcommit_kbytes sysctl variable that allow a
      much finer grain.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49f0ce5f
    • M
      mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNT · aec6a888
      Mel Gorman 提交于
      Commit 4b59e6c4 ("mm, show_mem: suppress page counts in
      non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to
      suppress PFN walks on large memory machines.  Commit c78e9363 ("mm:
      do not walk all of system memory during show_mem") avoided a PFN walk in
      the generic show_mem helper which removes the requirement for
      SHOW_MEM_FILTER_PAGE_COUNT in that case.
      
      This patch removes PFN walkers from the arch-specific implementations
      that report on a per-node or per-zone granularity.  ARM and unicore32
      still do a PFN walk as they report memory usage on each bank which is a
      much finer granularity where the debugging information may still be of
      use.  As the remaining arches doing PFN walks have relatively small
      amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.
      
      [akpm@linux-foundation.org: fix parisc]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Bottomley <jejb@parisc-linux.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aec6a888
    • D
      mm, mempolicy: remove unneeded functions for UMA configs · d80be7c7
      David Rientjes 提交于
      Mempolicies only exist for CONFIG_NUMA configurations.  Therefore, a
      certain class of functions are unneeded in configurations where
      CONFIG_NUMA is disabled such as functions that duplicate existing
      mempolicies, lookup existing policies, set certain mempolicy traits, or
      test mempolicies for certain attributes.
      
      Remove the unneeded functions so that any future callers get a compile-
      time error and protect their code with CONFIG_NUMA as required.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d80be7c7
    • K
      mm: create a separate slab for page->ptl allocation · b35f1819
      Kirill A. Shutemov 提交于
      If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
      is 72 bytes.  For page->ptl they will be allocated from kmalloc-96 slab,
      so we loose 24 on each.  An average system can easily allocate few tens
      thousands of page->ptl and overhead is significant.
      
      Let's create a separate slab for page->ptl allocation to solve this.
      
      To make sure that it really works this time, some numbers from my test
      machine (just booted, no load):
      
      Before:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        kmalloc-96         31987  32190    128   30    1 : tunables  120   60    8 : slabdata   1073   1073     92
      After:
        # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
        page->ptl          27516  28143     72   53    1 : tunables  120   60    8 : slabdata    531    531      9
        kmalloc-96          3853   5280    128   30    1 : tunables  120   60    8 : slabdata    176    176      0
      
      Note that the patch is useful not only for debug case, but also for
      PREEMPT_RT, where spinlock_t is always bloated.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b35f1819
    • Y
      mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve · 943dca1a
      Yasuaki Ishimatsu 提交于
      Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on
      9TB memory machine since onlining memory sections is too slow.  And we
      found out setup_zone_migrate_reserve spent >90% of the time.
      
      The problem is, setup_zone_migrate_reserve scans all pageblocks
      unconditionally, but it is only necessary if the number of reserved
      block was reduced (i.e.  memory hot remove).
      
      Moreover, maximum MIGRATE_RESERVE per zone is currently 2.  It means
      that the number of reserved pageblocks is almost always unchanged.
      
      This patch adds zone->nr_migrate_reserve_block to maintain the number of
      MIGRATE_RESERVE pageblocks and it reduces the overhead of
      setup_zone_migrate_reserve dramatically.  The following table shows time
      of onlining a memory section.
      
        Amount of memory     | 128GB | 192GB | 256GB|
        ---------------------------------------------
        linux-3.12           |  23.9 |  31.4 | 44.5 |
        This patch           |   8.3 |   8.3 |  8.6 |
        Mel's proposal patch |  10.9 |  19.2 | 31.3 |
        ---------------------------------------------
                                         (millisecond)
      
        128GB : 4 nodes and each node has 32GB of memory
        192GB : 6 nodes and each node has 32GB of memory
        256GB : 8 nodes and each node has 32GB of memory
      
        (*1) Mel proposed his idea by the following threads.
             https://lkml.org/lkml/2013/10/30/272
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reported-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      943dca1a
    • O
      mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() · 5eaf1a9e
      Oleg Nesterov 提交于
      get_huge_page_tail()->compound_head() looks confusing.  Every caller
      must check PageTail(page), otherwise atomic_inc(&page->_mapcount) is
      simply wrong if this page is compound-trans-head.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5eaf1a9e
    • A
      mm: tail page refcounting optimization for slab and hugetlbfs · 44518d2b
      Andrea Arcangeli 提交于
      This skips the _mapcount mangling for slab and hugetlbfs pages.
      
      The main trouble in doing this is to guarantee that PageSlab and
      PageHeadHuge remains constant for all get_page/put_page run on the tail
      of slab or hugetlbfs compound pages.  Otherwise if they're set during
      get_page but not set during put_page, the _mapcount of the tail page
      would underflow.
      
      PageHeadHuge will remain true until the compound page is released and
      enters the buddy allocator so it won't risk to change even if the tail
      page is the last reference left on the page.
      
      PG_slab instead is cleared before the slab frees the head page with
      put_page, so if the tail pin is released after the slab freed the page,
      we would have a problem.  But in the slab case the tail pin cannot be
      the last reference left on the page.  This is because the slab code is
      free to reuse the compound page after a kfree/kmem_cache_free without
      having to check if there's any tail pin left.  In turn all tail pins
      must be always released while the head is still pinned by the slab code
      and so we know PG_slab will be still set too.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44518d2b
    • A
      mm: thp: optimize compound_trans_huge · ca641514
      Andrea Arcangeli 提交于
      Currently we don't clobber page_tail->first_page during split_huge_page,
      so compound_trans_head can be set to compound_head without adverse
      effects, and this mostly optimizes away a smp_rmb.
      
      It looks worthwhile to keep around the implementation that doesn't relay
      on page_tail->first_page not to be clobbered, because it would be
      necessary if we'll decide to enforce page->private to zero at all times
      whenever PG_private is not set, also for anonymous pages.  For anonymous
      pages enforcing such an invariant doesn't matter as anonymous pages
      don't use page->private so we can get away with this microoptimization.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca641514