1. 16 November 2017, 14 commits
    • mm, sysctl: make NUMA stats configurable · 4518085e
      Committed by Kemi Wang
      This is the second step, which introduces a tunable interface that allows
      the NUMA stats to be made configurable, for optimizing zone_statistics(),
      as suggested by Dave Hansen and Ying Huang.
      
      =========================================================================
      
      When page allocation performance becomes a bottleneck and you can
      tolerate some possible tool breakage and decreased numa counter
      precision, you can do:
      
      	echo 0 > /proc/sys/vm/numa_stat
      
      In this case, NUMA counter updates are skipped.  We can see about a
      *4.8%*(185->176) drop of cpu cycles per single page allocation and
      reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
      drop of cpu cycles per single page allocation and reclaim on Jesper's
      page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
      (88 threads, 126G memory).
      
      Benchmark link provided by Jesper D Brouer (loop count increased to
      10000000):
      
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      =========================================================================
      
      When page allocation performance is not a bottleneck and you want all
      tooling to work, you can do:
      
      	echo 1 > /proc/sys/vm/numa_stat
      
      This is the system default setting.
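
      A rough sketch of how such a runtime switch can be wired into the allocator
      is shown below: the counter updates are gated behind a static branch, so the
      disabled case costs only a patched-out jump.  The key name vm_numa_stat_key
      and the simplified zone_statistics() signature are illustrative assumptions,
      not necessarily what the patch itself uses.

        /* Illustrative sketch only; identifiers are assumptions. */
        DEFINE_STATIC_KEY_TRUE(vm_numa_stat_key);       /* stats enabled by default */

        static void zone_statistics(struct zone *preferred_zone, struct zone *z)
        {
                /* "echo 0 > /proc/sys/vm/numa_stat" would disable the key. */
                if (!static_branch_likely(&vm_numa_stat_key))
                        return;

                /* ... update numa_hit / numa_miss / numa_foreign counters ... */
        }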
      
      Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
      for comments to help improve the original patch.
      
      [keescook@chromium.org: make sure mutex is a global static]
        Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
      Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Suggested-by: Ying Huang <ying.huang@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4518085e
    • mm, page_alloc: simplify list handling in rmqueue_bulk() · 0fac3ba5
      Committed by Vlastimil Babka
      rmqueue_bulk() fills an empty pcplist with pages from the free list.  It
      tries to preserve increasing order by pfn to the caller, because it
      leads to better performance with some I/O controllers, as explained in
      commit e084b2d9 ("page-allocator: preserve PFN ordering when
      __GFP_COLD is set").
      
      To preserve the order, it's sufficient to add pages to the tail of the
      list as they are retrieved.  The current code instead adds to the head
      of the list, but then updates the list head pointer to the last added
      page, in each step.  This does result in the same order, but is
      needlessly confusing and potentially wasteful, with no apparent benefit.
      This patch simplifies the code and adjusts the comment accordingly.
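
      The simplified loop amounts to the sketch below.  It is schematic rather
      than the literal patch, but the point is that list_add_tail() keeps the
      pcplist in the order pages are taken from the free list, with no need to
      keep moving the list head:

        /* Schematic version of the simplified rmqueue_bulk() inner loop. */
        for (i = 0; i < count; ++i) {
                struct page *page = __rmqueue(zone, order, migratetype);

                if (unlikely(page == NULL))
                        break;
                /* Tail-add preserves the ascending-pfn retrieval order. */
                list_add_tail(&page->lru, list);
        }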
      
      Link: http://lkml.kernel.org/r/f6505442-98a9-12e4-b2cd-0fa83874c159@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fac3ba5
    • mm: remove __GFP_COLD · 453f85d4
      Committed by Mel Gorman
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Judging from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no significant difference, which is not surprising
      given that the __GFP_COLD branches are a minuscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      453f85d4
    • mm: remove cold parameter from free_hot_cold_page* · 2d4894b5
      Committed by Mel Gorman
      Most callers of free_hot_cold_page claim the pages being released
      are cache hot.  The exception is the page reclaim paths where it is
      likely that enough pages will be freed in the near future that the
      per-cpu lists are going to be recycled and the cache hotness information
      is lost.  As no one really cares about the hotness of pages being
      released to the allocator, just ditch the parameter.
      
      The APIs are renamed to indicate that it's no longer about hot/cold
      pages.  It should also be less confusing as there are subtle differences
      between them.  __free_pages drops a reference and frees a page when the
      refcount reaches zero.  free_hot_cold_page handled pages whose refcount
      was already zero which is non-obvious from the name.  free_unref_page
      should be more obvious.
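
      The distinction reads roughly like the sketch below (schematic and
      simplified from the order-0 path; an illustration rather than the exact
      kernel code):

        void __free_pages(struct page *page, unsigned int order)
        {
                /* Drops one reference; frees only once the refcount hits zero. */
                if (put_page_testzero(page)) {
                        if (order == 0)
                                free_unref_page(page);  /* refcount already zero here */
                        else
                                __free_pages_ok(page, order);
                }
        }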
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      [mgorman@techsingularity.net: add pages to head, not tail]
        Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d4894b5
    • mm, page_alloc: enable/disable IRQs once when freeing a list of pages · 9cca35d4
      Committed by Mel Gorman
      Patch series "Follow-up for speed up page cache truncation", v2.
      
      This series is a follow-on to Jan Kara's "Speed up page cache
      truncation" series.  We both ended up looking at the same problem but
      saw different problems based on the same data.  This series builds upon
      his work.
      
      A variety of workloads were compared on four separate machines but each
      machine showed gains albeit at different levels.  Minimally, some of the
      differences are due to NUMA where truncating data from a remote node is
      slower than a local node.  The workloads checked were
      
      o sparse truncate microbenchmark, tiny
      o sparse truncate microbenchmark, large
      o reaim-io disk workfile
      o dbench4 (modified by mmtests to produce more stable results)
      o filebench varmail configuration for small memory size
      o bonnie, directory operations, working set size 2*RAM
      
      reaim-io, dbench and filebench all showed minor gains.  Truncation does
      not dominate those workloads, but they were tested to ensure no other
      regressions.  They will not be reported further.
      
      The sparse truncate microbench was written by Jan.  It creates a number
      of files and then times how long it takes to truncate each one.  The
      "tiny" configuraiton creates a number of files that easily fits in
      memory and times how long it takes to truncate files with page cache.
      The large configuration uses enough files to have data that is twice the
      size of memory and so timings there include truncating page cache and
      working set shadow entries in the radix tree.
      
      Patches 1-4 are the most relevant parts of this series.  Patches 5-8 are
      optional as they are deleting code that is essentially useless but has a
      negligible performance impact.
      
      The changelogs have more information on performance but just for bonnie
      delete options, the main comparison is
      
      bonnie
                                            4.14.0-rc5             4.14.0-rc5             4.14.0-rc5
                                                jan-v2                vanilla                 mel-v2
      Hmean     SeqCreate ops         76.20 (   0.00%)       75.80 (  -0.53%)       76.80 (   0.79%)
      Hmean     SeqCreate read        85.00 (   0.00%)       85.00 (   0.00%)       85.00 (   0.00%)
      Hmean     SeqCreate del      13752.31 (   0.00%)    12090.23 ( -12.09%)    15304.84 (  11.29%)
      Hmean     RandCreate ops        76.00 (   0.00%)       75.60 (  -0.53%)       77.00 (   1.32%)
      Hmean     RandCreate read       96.80 (   0.00%)       96.80 (   0.00%)       97.00 (   0.21%)
      Hmean     RandCreate del     13233.75 (   0.00%)    11525.35 ( -12.91%)    14446.61 (   9.16%)
      
      Jan's series is the baseline and the vanilla kernel is 12% slower, whereas
      this series on top gains another 11%.  This is from a different
      machine than the data in the changelogs but the detailed data was not
      collected as there was no substantial change in v2.
      
      This patch (of 8):
      
      Freeing a list of pages currently enables/disables IRQs for each page
      freed.  This patch splits freeing a list of pages into two operations --
      preparing the pages for freeing and the actual freeing.  This is a
      tradeoff - we're taking two passes of the list to free in exchange for
      avoiding multiple enable/disable of IRQs.
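
      Schematically, the two-pass structure looks like the sketch below.  The
      helper names are placeholders for the prepare/commit split, and (as the
      note further down mentions) page_private() is used to cache each page's
      pfn between the two passes:

        /* Pass 1: prepare each page for freeing with IRQs still enabled. */
        list_for_each_entry_safe(page, next, list, lru) {
                unsigned long pfn = page_to_pfn(page);

                if (!prepare_page_for_free(page))       /* placeholder name */
                        list_del(&page->lru);           /* rejected, skip it */
                set_page_private(page, pfn);            /* remember pfn for pass 2 */
        }

        /* Pass 2: one IRQ toggle around the actual freeing of the whole list. */
        local_irq_save(flags);
        list_for_each_entry_safe(page, next, list, lru) {
                unsigned long pfn = page_private(page);

                set_page_private(page, 0);
                list_del(&page->lru);
                free_page_to_pcplist(page, pfn);        /* placeholder name */
        }
        local_irq_restore(flags);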
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                 janbatch-v1r1            oneirq-v1r1
      Min          Time      149.00 (   0.00%)      141.00 (   5.37%)
      1st-qrtle    Time      150.00 (   0.00%)      142.00 (   5.33%)
      2nd-qrtle    Time      151.00 (   0.00%)      142.00 (   5.96%)
      3rd-qrtle    Time      151.00 (   0.00%)      143.00 (   5.30%)
      Max-90%      Time      153.00 (   0.00%)      144.00 (   5.88%)
      Max-95%      Time      155.00 (   0.00%)      147.00 (   5.16%)
      Max-99%      Time      201.00 (   0.00%)      195.00 (   2.99%)
      Max          Time      236.00 (   0.00%)      230.00 (   2.54%)
      Amean        Time      152.65 (   0.00%)      144.37 (   5.43%)
      Stddev       Time        9.78 (   0.00%)       10.44 (  -6.72%)
      Coeff        Time        6.41 (   0.00%)        7.23 ( -12.84%)
      Best99%Amean Time      152.07 (   0.00%)      143.72 (   5.50%)
      Best95%Amean Time      150.75 (   0.00%)      142.37 (   5.56%)
      Best90%Amean Time      150.59 (   0.00%)      142.19 (   5.58%)
      Best75%Amean Time      150.36 (   0.00%)      141.92 (   5.61%)
      Best50%Amean Time      150.04 (   0.00%)      141.69 (   5.56%)
      Best25%Amean Time      149.85 (   0.00%)      141.38 (   5.65%)
      
      With a tiny number of files, each file truncated has resident page cache,
      and it shows that the time to truncate is reduced by roughly 5-6%, with
      some minor jitter.
      
                                            4.14.0-rc4             4.14.0-rc4
                                         janbatch-v1r1            oneirq-v1r1
      Hmean     SeqCreate ops         65.27 (   0.00%)       81.86 (  25.43%)
      Hmean     SeqCreate read        39.48 (   0.00%)       47.44 (  20.16%)
      Hmean     SeqCreate del      24963.95 (   0.00%)    26319.99 (   5.43%)
      Hmean     RandCreate ops        65.47 (   0.00%)       82.01 (  25.26%)
      Hmean     RandCreate read       42.04 (   0.00%)       51.75 (  23.09%)
      Hmean     RandCreate del     23377.66 (   0.00%)    23764.79 (   1.66%)
      
      As expected, there is a small gain for the delete operation.
      
      [mgorman@techsingularity.net: use page_private and set_page_private helpers]
        Link: http://lkml.kernel.org/r/20171018101547.mjycw7zreb66jzpa@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cca35d4
    • mm/page_alloc: make sure __rmqueue() etc are always inline · 85ccc8fa
      Committed by Aaron Lu
      __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
      __rmqueue_cma_fallback() are all in the page allocator's hot path and are
      better finished as soon as possible.  One way to make them faster is by
      making them inline.  But as Andrew Morton and Andi Kleen pointed out:
      
        https://lkml.org/lkml/2017/10/10/1252
        https://lkml.org/lkml/2017/10/10/1279
      
      To make sure they are inlined, we should use __always_inline for them.
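
      The change itself is mechanical; roughly (declarations shown for
      illustration only):

        /* Before: "inline" is only a hint that gcc's heuristics may ignore. */
        static inline
        struct page *__rmqueue(struct zone *zone, unsigned int order, int migratetype);

        /* After: inlining is forced regardless of the compiler's size heuristics. */
        static __always_inline
        struct page *__rmqueue(struct zone *zone, unsigned int order, int migratetype);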
      
      With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
      processes to stress buddy, the results for will-it-scale.processes with
      and without the patch are:
      
      On a 2-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7       6496131     6911823 +6.4%
        gcc-4.9.4       7225110     7731072 +7.0%
        gcc-5.4.1       7054224     7688146 +9.0%
        gcc-6.2.0       7059794     7651675 +8.4%
      
      On a 4-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7      13162890    13508193 +2.6%
        gcc-4.9.4      14997463    15484353 +3.2%
        gcc-5.4.1      14708711    15449805 +5.0%
        gcc-6.2.0      14574099    15349204 +5.3%
      
      The above 4 compilers are used because I've done the tests through
      Intel's Linux Kernel Performance (LKP) infrastructure and they are the
      compilers available there.
      
      The benefit is smaller on the 4-socket machine because the lock
      contention there (perf-profile/native_queued_spin_lock_slowpath=81%) is
      less severe than on the 2-socket machine (85%).
      
      What the benchmark does is: it forks nr_cpu processes and then each
      process does the following:
          1 mmap() 128M anonymous space;
          2 writes to each page there to trigger actual page allocation;
          3 munmap() it.
      in a loop.
      
        https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
      
      Binary size wise, I have locally built them with different compilers:
      
      [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
         text    data     bss     dec     hex filename
        37409    9904    8524   55837    da1d gcc-4.9.4/base/mm/page_alloc.o
        38273    9904    8524   56701    dd7d gcc-4.9.4/head/mm/page_alloc.o
        37465    9840    8428   55733    d9b5 gcc-5.5.0/base/mm/page_alloc.o
        38169    9840    8428   56437    dc75 gcc-5.5.0/head/mm/page_alloc.o
        37573    9840    8428   55841    da21 gcc-6.4.0/base/mm/page_alloc.o
        38261    9840    8428   56529    dcd1 gcc-6.4.0/head/mm/page_alloc.o
        36863    9840    8428   55131    d75b gcc-7.2.0/base/mm/page_alloc.o
        37711    9840    8428   55979    daab gcc-7.2.0/head/mm/page_alloc.o
      
      Text size increased about 800 bytes for mm/page_alloc.o.
      
      [aaron@aaronlu obj]$ size */*/vmlinux
         text    data     bss     dec       hex     filename
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/base/vmlinux
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/head/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/base/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/head/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/base/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/head/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/base/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/head/vmlinux
      
      Text size for vmlinux has no change though, probably due to function
      alignment.
      
      Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85ccc8fa
    • mm: stop zeroing memory during allocation in vmemmap · f7f99100
      Committed by Pavel Tatashin
      vmemmap_alloc_block() will no longer zero the block, so zero memory at
      its call sites for everything except struct pages.  Struct page memory
      is zeroed by struct page initialization.

      Replace the allocators in sparse-vmemmap with the non-zeroing version.
      This way we get a performance improvement by zeroing the memory in
      parallel when struct pages are zeroed.
      
      Add struct page zeroing as a part of initialization of other fields in
      __init_single_page().
      
      These single-thread performance numbers were collected on: Intel(R) Xeon(R)
      CPU E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
      
                               BASE            FIX
      sparse_init     11.244671836s   0.007199623s
      zone_sizes_init  4.879775891s   8.355182299s
                        --------------------------
      Total           16.124447727s   8.362381922s
      
      sparse_init is where memory for struct pages is zeroed, and the zeroing
      part is moved later in this patch into __init_single_page(), which is
      called from zone_sizes_init().
      
      [akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
      Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7f99100
    • mm: zero reserved and unavailable struct pages · a4a3ede2
      Committed by Pavel Tatashin
      Some memory is reserved but unavailable: not present in memblock.memory
      (because not backed by physical pages), but present in memblock.reserved.
      Such memory has backing struct pages, but they are not initialized by
      going through __init_single_page().
      
      In some cases these struct pages are accessed even if they do not
      contain any data.  One example is page_to_pfn() might access page->flags
      if this is where section information is stored (CONFIG_SPARSEMEM,
      SECTION_IN_PAGE_FLAGS).
      
      One example of such memory: trim_low_memory_range() unconditionally
      reserves from pfn 0, but e820__memblock_setup() might provide the
      existing memory from pfn 1 (i.e. KVM).
      
      Since struct pages are zeroed in __init_single_page(), and not during
      allocation time, we must zero such struct pages explicitly.
      
      The patch involves adding a new memblock iterator:
      	for_each_resv_unavail_range(i, p_start, p_end)
      
      It iterates through the reserved && !memory ranges, and we zero such struct
      pages explicitly by calling mm_zero_struct_page().
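
      A schematic use of the new iterator (simplified; the surrounding function
      and variable setup are omitted, so treat this as an illustration):

        phys_addr_t start, end;
        u64 i;

        /* Zero struct pages for ranges that are reserved but not in memblock.memory. */
        for_each_resv_unavail_range(i, &start, &end) {
                unsigned long pfn;

                for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
                        mm_zero_struct_page(pfn_to_page(pfn));
        }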
      
      ===
      
      Here is more detailed example of problem that this patch is addressing:
      
      The test was run on qemu with the following arguments:
      
      	-enable-kvm -cpu kvm64 -m 512 -smp 2
      
      This patch reports that there are 98 unavailable pages.
      
      They are: pfn 0 and pfns in range [159, 255].
      
      Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
      not reserve [159, 255] ones.
      
      e820__memblock_setup() reports to Linux that the following physical ranges
      are available:
          [  1,    158]
          [256, 130783]
      
      Notice that exactly the unavailable pfns are missing!

      Now, let's check what we have in zone 0: [1, 131039]

      pfn 0 is not part of the zone, but pfns [1, 158] are.
      
      However, the bigger problem if we do not initialize these struct pages is
      with memory hotplug.  That path operates at 2M boundaries (section_nr) and
      checks whether a 2M range of pages is hot removable.  It starts with the
      first pfn from the zone, rounds it down to the 2M boundary (struct pages
      are allocated at 2M boundaries when vmemmap is created), and checks if that
      section is hot removable.  In this case it starts with pfn 1 and rounds it
      down to pfn 0.  Later the pfn is converted to a struct page, and some
      fields are checked.  Now, if we do not zero struct pages, we get
      unpredictable results.
      
      In fact, when CONFIG_DEBUG_VM is enabled and we explicitly set all
      vmemmap memory to ones, the following panic is observed when testing a
      kernel without this patch applied:
      
        BUG: unable to handle kernel NULL pointer dereference at          (null)
        IP: is_pageblock_removable_nolock+0x35/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT
        ...
        task: ffff88001f4e2900 task.stack: ffffc90000314000
        RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
        Call Trace:
         ? is_mem_section_removable+0x5a/0xd0
         show_mem_removable+0x6b/0xa0
         dev_attr_show+0x1b/0x50
         sysfs_kf_seq_show+0xa1/0x100
         kernfs_seq_show+0x22/0x30
         seq_read+0x1ac/0x3a0
         kernfs_fop_read+0x36/0x190
         ? security_file_permission+0x90/0xb0
         __vfs_read+0x16/0x30
         vfs_read+0x81/0x130
         SyS_read+0x44/0xa0
         entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4a3ede2
    • mm: define memblock_virt_alloc_try_nid_raw · ea1f5f37
      Committed by Pavel Tatashin
      * A new variant of memblock_virt_alloc_* allocations:
      memblock_virt_alloc_try_nid_raw()
          - Does not zero the allocated memory
          - Does not panic if request cannot be satisfied
      
      * optimize early system hash allocations
      
      Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
      specify that the memory allocated for the system hash needs to be
      zeroed; otherwise the memory does not need to be zeroed and the client
      will initialize it.
      
      If the memory does not need to be zeroed, the new
      memblock_virt_alloc_raw() interface is called, which improves boot
      performance.
      
      * debug for the raw allocator
      
      When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
      returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
      callers expect zeroed memory.
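
      On the caller side the choice boils down to something like the following
      sketch (simplified from the alloc_large_system_hash() behaviour described
      above, not the literal code):

        void *table;

        if (flags & HASH_ZERO)
                table = memblock_virt_alloc(size, 0);           /* zeroing variant */
        else
                table = memblock_virt_alloc_raw(size, 0);       /* caller initializes */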
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea1f5f37
    • mm: deferred_init_memmap improvements · 2f47a91f
      Committed by Pavel Tatashin
      Patch series "complete deferred page initialization", v12.
      
      SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
      option, which defers initializing struct pages until all cpus have been
      started so it can be done in parallel.
      
      However, this feature is sub-optimal, because the deferred page
      initialization code expects that the struct pages have already been
      zeroed, and the zeroing is done early in boot with a single thread only.
      Also, we access that memory and set flags before struct pages are
      initialized.  All of this is fixed in this patchset.
      
      In this work we do the following:
       - Never read-access a struct page until it has been initialized
       - Never set any fields in a struct page before it is initialized
       - Zero the struct page at the beginning of struct page initialization
      
      ==========================================================================
      Performance improvements on x86 machine with 8 nodes:
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                              TIME          SPEED UP
      base no deferred:       95.796233s
      fix no deferred:        79.978956s    19.77%
      
      base deferred:          77.254713s
      fix deferred:           55.050509s    40.34%
      ==========================================================================
      SPARC M6 3600 MHz with 15T of memory
                              TIME          SPEED UP
      base no deferred:       358.335727s
      fix no deferred:        302.320936s   18.52%
      
      base deferred:          237.534603s
      fix deferred:           182.103003s   30.44%
      ==========================================================================
      Raw dmesg output with timestamps:
      x86 base no deferred:    https://hastebin.com/ofunepurit.scala
      x86 base deferred:       https://hastebin.com/ifazegeyas.scala
      x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
      x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
      sparc base no deferred:  https://hastebin.com/ibobeteken.go
      sparc base deferred:     https://hastebin.com/fariqimiyu.go
      sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
      sparc fix deferred:      https://hastebin.com/xadinobutu.go
      
      This patch (of 11):
      
      deferred_init_memmap() is called when struct pages are initialized later
      in boot by slave CPUs.  This patch simplifies and optimizes this
      function, and also fixes a couple of issues (described below).
      
      The main change is that now we are iterating through free memblock areas
      instead of all configured memory.  Thus, we do not have to check if the
      struct page has already been initialized.
      
      =====
      In deferred_init_memmap() where all deferred struct pages are
      initialized we have a check like this:
      
        if (page->flags) {
      	VM_BUG_ON(page_zone(page) != zone);
      	goto free_range;
        }
      
      This way we are checking if the current deferred page has already been
      initialized.  It works because memory for struct pages has been zeroed,
      and the only way the flags are non-zero is if the page went through
      __init_single_page() before.  But, once we change the current behavior
      and no longer zero the memory in the memblock allocator, we cannot trust
      anything inside a "struct page" until it is initialized.  This patch
      fixes this.
      
      The deferred_init_memmap() is re-written to loop through only free
      memory ranges provided by memblock.
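
      The shape of the rewritten loop is roughly the following; deferred_init_range()
      is a placeholder name for the per-range initialization, not necessarily the
      helper the patch adds:

        phys_addr_t spa, epa;
        unsigned long nr_pages = 0;
        u64 i;

        /* Walk only the free ranges that memblock knows about for this node. */
        for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
                unsigned long spfn = PFN_UP(spa);
                unsigned long epfn = PFN_DOWN(epa);

                nr_pages += deferred_init_range(nid, zid, spfn, epfn);  /* placeholder */
        }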
      
      Note, this first issue is relevant only when the following change is
      merged:
      
      =====
      This patch fixes another existing issue on systems that have holes in
      zones, i.e. when CONFIG_HOLES_IN_ZONE is defined.
      
      In for_each_mem_pfn_range() we have code like this:
      
        if (!pfn_valid_within(pfn))
                goto free_range;
      
      Note: 'page' is not set to NULL and is not incremented, but 'pfn'
      advances.  This means that if deferred struct pages are enabled on systems
      with this kind of hole, linux would get memory corruption.  I have
      fixed this issue by defining a new macro that performs all the necessary
      operations when we free the current set of pages.
      
      [pasha.tatashin@oracle.com: buddy page accessed before initialized]
        Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f47a91f
    • kmemcheck: remove annotations · 49502766
      Committed by Levin, Alexander (Sasha Levin)
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitation of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49502766
    • mm, page_alloc: fail has_unmovable_pages when seeing reserved pages · d7ab3672
      Committed by Michal Hocko
      Reserved pages should be completely ignored by the core mm because they
      have a special meaning for their owners.  has_unmovable_pages doesn't
      check those so we rely on other tests (reference count, or PageLRU) to
      fail on such pages.  Although this happens to work, it is safer to
      simply check for them explicitly and not rely on fields that the owner
      of the page may reuse for special purposes.
      
      Please note that this is more of a further fortification of the code
      rather than a fix of an existing issue.
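
      Concretely, the fortification amounts to an explicit early check inside
      has_unmovable_pages(), along these lines (schematic fragment):

        struct page *page = pfn_to_page(pfn);

        /* Reserved pages belong to their owner; never consider them movable. */
        if (PageReserved(page))
                return true;    /* report the range as containing unmovable pages */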
      
      Link: http://lkml.kernel.org/r/20171013120756.jeopthigbmm3c7bl@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7ab3672
    • mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages() · 4da2ce25
      Committed by Michal Hocko
      Joonsoo has noticed that "mm: drop migrate type checks from
      has_unmovable_pages" would break CMA allocator because it relies on
      has_unmovable_pages returning false even for CMA pageblocks which in
      fact don't have to be movable:
      
       alloc_contig_range
         start_isolate_page_range
           set_migratetype_isolate
             has_unmovable_pages
      
      This is a result of the code sharing between CMA and memory hotplug
      while each one has a different idea of what has_unmovable_pages should
      return.  This is unfortunate but fixing it properly would require a lot
      of code duplication.
      
      Fix the issue by introducing the requested migrate type argument and
      special case MIGRATE_CMA case where CMA page blocks are handled
      properly.  This will work for memory hotplug because it requires
      MIGRATE_MOVABLE.
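
      Schematically, the special case reads like this (illustrative fragment, not
      the full function); memory hotplug keeps passing MIGRATE_MOVABLE, so it
      never takes the exception:

        /* The caller now tells has_unmovable_pages() what it is isolating for. */
        if (is_migrate_cma_page(page) && migratetype == MIGRATE_CMA)
                return false;   /* a CMA pageblock is fine for the CMA allocator */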
      
      Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
      Tested-by: Ran Wang <ran.wang_1@nxp.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4da2ce25
    • mm: drop migrate type checks from has_unmovable_pages · d7b236e1
      Committed by Michal Hocko
      Michael has noticed that the memory offline tries to migrate kernel code
      pages when doing
      
       echo 0 > /sys/devices/system/memory/memory0/online
      
      The current implementation will fail the operation after several failed
      page migration attempts, but we shouldn't even attempt to migrate that
      memory; we should fail right away because this memory is clearly not
      migratable.  This will become a real problem when we drop the retry
      loop counter and the timeout.
      
      The real problem is in has_unmovable_pages in fact.  We should fail if
      there are any non-migratable pages in the area.  In order to guarantee
      that, remove the migrate type checks, because MIGRATE_MOVABLE is not
      guaranteed to contain only migratable pages.  It is merely a heuristic.
      Similarly, MIGRATE_CMA does guarantee that the page allocator doesn't
      allocate any non-migratable pages from the block, but CMA allocations
      themselves are unlikely to be migratable.  Therefore remove both checks.
      
      [akpm@linux-foundation.org: remove unused local `mt']
      Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Tested-by: Ran Wang <ran.wang_1@nxp.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7b236e1
  2. 07 November 2017, 1 commit
  3. 20 October 2017, 1 commit
  4. 04 October 2017, 2 commits
  5. 09 September 2017, 2 commits
    • mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt · f19360f0
      Committed by Tetsuo Handa
      We are erroneously initializing alloc_flags before gfp_allowed_mask is
      applied.  This could cause problems after pm_restrict_gfp_mask() is called
      during a suspend operation.  Apply gfp_allowed_mask before initializing
      alloc_flags so that the first allocation attempt uses the correct flags.
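
      The fix is purely an ordering change, roughly as sketched below;
      derive_alloc_flags() is a placeholder for however alloc_flags is computed,
      not a real kernel function:

        /* Before: alloc_flags derived from the unfiltered gfp_mask. */
        alloc_flags = derive_alloc_flags(gfp_mask);     /* placeholder helper */
        gfp_mask &= gfp_allowed_mask;

        /* After: restrict the mask first, then derive alloc_flags from it. */
        gfp_mask &= gfp_allowed_mask;
        alloc_flags = derive_alloc_flags(gfp_mask);     /* placeholder helper */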
      
      Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
      Fixes: 83d4ca81 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19360f0
    • mm: change the call sites of numa statistics items · 3a321d2a
      Committed by Kemi Wang
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call to
      zone_statistics().  As discussed at the 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very rarely
      consumed.  The significant overhead comes from cache bouncing caused by
      the zone counters (NUMA associated counters) being updated in parallel
      during multi-threaded page allocation (pointed out by Dave Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates the NUMA statistics from
      the zone statistics framework and updates the NUMA counter threshold to a
      fixed size of MAX_U16 - 2, as a small threshold greatly increases the update
      frequency of the global counter from the local per-cpu counters (suggested
      by Ying Huang).  The rationale is that these statistics counters don't
      need to be read often, unlike other VM counters, so it's not a problem
      to use a large threshold and make readers more expensive.
      
      With this patchset, we see a 31.3% drop of CPU cycles (537-->369, see
      below) per single page allocation and reclaim on Jesper's
      page_bench03 benchmark.  Meanwhile, this patchset keeps the same style
      of virtual memory statistics with little end-user-visible effect (it only
      moves the numa stats to be shown after the zone page stats; see the first
      patch for details).
      
      I did an experiment of single page allocation and reclaim running
      concurrently, using Jesper's page_bench03 benchmark on a 2-Socket
      Broadwell-based server (88 processors with 126G memory), with different
      sizes of the pcp counter threshold.
      
      Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, the NUMA statistics are separated from the zone statistics
      framework, and all the call sites of NUMA stats are changed to use
      numa-stats-specific functions.  There is no functional change
      except that the NUMA stats are shown after the zone page stats
      when users *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a321d2a
  6. 07 September 2017, 10 commits
    • mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Committed by Michal Hocko
      For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are a few shortcomings of this implementation,
      though.
      
      First of all and the most serious one is that the full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially do last emergency steps to move on.
      
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task but
      that doesn't really solve this problem while it surely opens up a room
      for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of the parallel
      memory freeing.  We still want some access to reserves because we do not
      want other consumers to eat up the victim's freed memory.  oom victims
      will still contend with __GFP_HIGH users but those shouldn't be so
      aggressive to starve oom victims completely.
      
      Introduce the ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
      to half of the reserves.  This makes the access to reserves independent
      of which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it by
      tsk_is_oom_victim as well which will make page_alloc.c completely
      TIF_MEMDIE free finally.
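
      In watermark terms, "half of the reserves" can be pictured like the sketch
      below (schematic; the real __zone_watermark_ok() logic has more cases and
      orders the checks differently):

        /* Schematic watermark relaxation for a single allocation attempt. */
        if (alloc_flags & ALLOC_OOM)            /* tsk_is_oom_victim() caller */
                min -= min / 2;                 /* may dip into half of the reserve */
        else if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;                 /* the smaller pre-existing bonus */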
      
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
      
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd04ae1e
    • mm: rename global_page_state to global_zone_page_state · c41f012a
      Committed by Michal Hocko
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seem to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Committed by Michal Hocko
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  New users
      have grown since then.
      
      Notably setup_per_zone_wmarks wants to prevent races between memory
      hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes")).  Let's
      add a private lock for that purpose.  This will not prevent us from seeing
      a halfway-completed memory hotplug operation, but that shouldn't be a big
      deal because memory hotplug will update the watermarks explicitly, so we
      will eventually get the full picture.  The lock just makes sure we won't
      race when updating watermarks, which could lead to weird results.
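
      A private lock here can be as simple as a function-local static lock,
      roughly:

        static void setup_per_zone_wmarks(void)
        {
                static DEFINE_SPINLOCK(lock);   /* private to watermark updates */

                spin_lock(&lock);
                __setup_per_zone_wmarks();
                spin_unlock(&lock);
        }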
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b93e0f32
    • mm, page_alloc: remove stop_machine from build_all_zonelists · 11cd8638
      Committed by Michal Hocko
      build_all_zonelists has been (ab)using stop_machine to make sure that
      zonelists do not change while somebody is looking at them.  This is
      just a gross hack because a) it complicates the context from which we
      can call build_all_zonelists (see 3f906ba2 ("mm/memory-hotplug:
      switch locking to a percpu rwsem")), b) it is not really necessary,
      especially after "mm, page_alloc: simplify zonelist initialization", and
      c) it doesn't really provide the protection it claims (see below).
      
      Updates of the zonelists happen very seldom, basically only when a zone
      becomes populated during memory online or when it loses all the memory
      during offline.  A racing iteration over zonelists could either miss a
      zone or try to work on one zone twice.  Both of these are something we
      can live with occasionally because there will always be at least one
      zone visible so we are not likely to fail allocation too easily for
      example.
      
      Please note that the original stop_machine approach doesn't really
      provide better exclusion because the iteration might be interrupted
      halfway through (unless the whole iteration runs with preemption disabled,
      which is not the case most of the time), so some zones could still be seen
      twice or a zone could be missed.
      
      I have run the pathological online/offline of the single memblock in the
      movable zone while stressing the same small node with some memory
      pressure.
      
      Node 1, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 943, 943, 943)
      Node 1, zone    DMA32
        pages free     227310
              min      8294
              low      10367
              high     12440
              spanned  262112
              present  262112
              managed  241436
              protection: (0, 0, 0, 0)
      Node 1, zone   Normal
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 0, 1024)
      Node 1, zone  Movable
        pages free     32722
              min      85
              low      117
              high     149
              spanned  32768
              present  32768
              managed  32768
              protection: (0, 0, 0, 0)
      
      root@test1:/sys/devices/system/node/node1# while true
      do
      	echo offline > memory34/state
      	echo online_movable > memory34/state
      done
      
      root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4
      
      and it survived without any unexpected behavior.  While this is not
      really a great testing coverage it should exercise the allocation path
      quite a lot.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11cd8638
    • M
      mm, page_alloc: simplify zonelist initialization · 9d3be21b
      Michal Hocko committed
      build_zonelists gradually builds zonelists from the nearest to the
      most distant node.  As we do not know how many populated zones we
      will have in each node, we rely on the zonerefs to terminate the
      initialized part of the zonelist with a NULL zone.  While this is
      functionally correct, it is quite suboptimal because we cannot allow
      updaters to race with zonelist users: they could see an empty
      zonelist and fail the allocation or, in the worst case, hit the OOM
      killer.
      
      We can do much better, though.  We can store the node ordering into
      the already existing node_order array, then hand this array to
      build_zonelists_in_node_order and do the whole initialization at
      once.  Zonelist consumers might still see a halfway-initialized
      state, but that should be much more tolerable because the list will
      not be empty; in the worst case they would either see some zone twice
      or skip over some zone(s), which shouldn't lead to immediate
      failures.
      
      While at it, let's simplify build_zonelists_node, which is rather
      confusing now.  It gets an index into the zoneref array and returns
      the updated index for the next iteration.  Let's rename the function
      to build_zonerefs_node to better reflect its purpose and give it the
      zoneref array to update.  The function doesn't need the index
      anymore.  It just returns the number of added zones so that the
      caller can advance the zoneref array start for the next update.
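
      A sketch of the reworked helper's shape (simplified, assuming the
      existing managed_zone() and zoneref_set_zone() helpers; not a
      verbatim excerpt of the patch):

        static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
        {
            struct zone *zone;
            enum zone_type zone_type = MAX_NR_ZONES;
            int nr_zones = 0;

            do {
                zone_type--;
                zone = pgdat->node_zones + zone_type;
                if (managed_zone(zone))
                    zoneref_set_zone(zone, &zonerefs[nr_zones++]);
            } while (zone_type);

            /* the caller advances its zoneref cursor by the returned count */
            return nr_zones;
        }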
      
      This patch alone doesn't introduce any functional change yet; it is
      merely preparatory work for later changes.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d3be21b
    • M
      mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Michal Hocko committed
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of the build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call
      it from its only user directly.  This also removes a pointless
      zonelist rebuild, which is always good.
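
      Roughly, the call site that onlines a previously empty zone then does
      the pageset setup itself (an illustrative sketch of the resulting
      shape in online_pages(), not a verbatim excerpt):

        /* the zone gains memory for the first time: it needs pcp lists
         * and a zonelist rebuild, but not the rest of build_all_zonelists() */
        if (!populated_zone(zone)) {
            need_zonelists_rebuild = 1;
            setup_zone_pageset(zone);
        }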
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72675e13
    • M
      mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization · d9c9a0b9
      Michal Hocko committed
      __build_all_zonelists reinitializes each online cpu's local node for
      CONFIG_HAVE_MEMORYLESS_NODES.  This makes sense because previously
      memoryless nodes could gain some memory during memory hotplug and so
      the local node should be changed for CPUs close to such a node.  It
      makes less sense to do that unconditionally for a newly created NUMA
      node which is still offline and without any memory.
      
      Let's also simplify the cpu loop and use for_each_online_cpu instead of
      an explicit cpu_online check for all possible cpus.
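
      A sketch of the simplified loop (illustrative; the memoryless-node
      handling is the part guarded by the config option):

        int cpu;

        for_each_online_cpu(cpu) {
        #ifdef CONFIG_HAVE_MEMORYLESS_NODES
            /*
             * Point CPUs near a (formerly) memoryless node at their
             * nearest node that actually has memory.
             */
            set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
        #endif
        }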
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9c9a0b9
    • M
      mm, page_alloc: remove boot pageset initialization from memory hotplug · afb6ebb3
      Michal Hocko committed
      boot_pageset is a boot time hack which gets superseded by normal
      pagesets later in the boot process.  It makes zero sense to reinitialize
      it again and again during memory hotplug.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afb6ebb3
    • M
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Michal Hocko committed
      Patch series "cleanup zonelists initialization", v1.
      
      This is aimed at cleaning up the zonelists initialization code we
      have.  The primary motivation was bug report [2], which got resolved,
      but the usage of stop_machine is just too ugly to live.  Most patches
      are straightforward but 3 of them need special consideration.
      
      Patch 1 removes zone-ordered zonelists completely.  I am CCing
      linux-api because this is a user-visible change.  As I argue in the
      patch description, I do not think we have a strong usecase for it
      these days.  I have kept the sysctl in place and warn in the log if
      somebody tries to configure zone-ordered zonelists.  If somebody has
      a real usecase for it we can revert this patch, but I do not expect
      anybody will actually notice runtime differences.  This patch is not
      strictly needed for the rest but it made patch 6 easier to implement.
      
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      anything.
      
      Patch 8 then removes zonelists_mutex which is kind of ugly as well and
      not really needed AFAICS but a care should be taken when double checking
      my thinking.
      
      This patch (of 9):
      
      Supporting zone-ordered zonelists costs us a lot of code while the
      usefulness is arguable, if it exists at all.  Mel has already made
      node ordering the default on 64b systems.  32b systems are still
      using ZONELIST_ORDER_ZONE because it is considered better to fall
      back to a different NUMA node rather than consume precious lowmem
      zones.

      This argument is, however, weakened by the fact that memory reclaim
      has been reworked to be node rather than zone oriented.  This means
      that lowmem requests already have to skip over all highmem pages on
      the LRUs, so zone ordering doesn't save much reclaim time.  The only
      remaining advantage of zone ordering is under light memory pressure,
      when highmem requests never fall back into lowmem zones and lowmem
      therefore doesn't come under reclaim pressure.
      
      Considering that 32b NUMA systems are rather suboptimal already and
      it is generally advisable to run a 64b kernel on such HW, I believe
      we should rather care about code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep the sysctl in place and warn if
      somebody tries to set zone ordering either from the kernel command
      line or via the sysctl.
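
      The warning can be as simple as the following sketch (approximate;
      only node ordering, "n"/"N", and the "d"/"D" default alias are still
      accepted):

        static int __parse_numa_zonelist_order(char *s)
        {
            /* only node ordering is supported; warn rather than fail silently */
            if (*s != 'd' && *s != 'D' && *s != 'n' && *s != 'N') {
                pr_warn("Ignoring unsupported numa_zonelist_order value: %s\n", s);
                return -EINVAL;
            }
            return 0;
        }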
      
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9bff3ee
    • W
      mm/memory_hotplug: just build zonelist for newly added node · c1152583
      Wei Yang committed
      Commit 9adb62a5 ("mm/hotplug: correctly setup fallback zonelists
      when creating new pgdat") tries to build the correct zonelist for a
      newly added node, while it is not necessary to rebuild it for already
      existing nodes.

      In build_zonelists(), it will iterate over nodes with memory.  A
      newly added node will not have memory until node_states_set_node()
      is called in online_pages().
      
      This patch avoids rebuilding the zonelists for already existing nodes.
      
      build_zonelists_node() uses managed_zone(zone) checks, so it should not
      include empty zones anyway.  So effectively we avoid some pointless work
      under stop_machine().
      
      [akpm@linux-foundation.org: tweak comment text]
      [akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
      Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1152583
  7. 01 Sep, 2017 (1 commit)
  8. 26 Aug, 2017 (1 commit)
    • C
      PM/hibernate: touch NMI watchdog when creating snapshot · 556b969a
      Chen Yu committed
      Counting the pages for the hibernation snapshot can take a
      significant amount of time, especially on systems with large memory.
      Since the counting job is performed with irqs disabled, this might
      lead to an NMI lockup.  The following warning was found on a system
      with 1.5TB DRAM:
      
        Freezing user space processes ... (elapsed 0.002 seconds) done.
        OOM killer disabled.
        PM: Preallocating image memory...
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
        CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
        task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
        RIP: 0010:memory_bm_find_bit+0xf4/0x100
        Call Trace:
         swsusp_set_page_free+0x2b/0x30
         mark_free_pages+0x147/0x1c0
         count_data_pages+0x41/0xa0
         hibernate_preallocate_memory+0x80/0x450
         hibernation_snapshot+0x58/0x410
         hibernate+0x17c/0x310
         state_store+0xdf/0xf0
         kobj_attr_store+0xf/0x20
         sysfs_kf_write+0x37/0x40
         kernfs_fop_write+0x11c/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xb1/0x1a0
         SyS_write+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        ...
        done (allocated 6590003 pages)
        PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
      
      It took nearly 20 seconds (2.10GHz CPU), thus the NMI lockup was
      triggered.  With the NMI watchdog timeout set to 1 second, a safe
      interval would in theory be 6590003/20 = 320k pages.  However, there
      might also be platforms running at a lower frequency, so feed the
      watchdog every 100k pages.
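
      A simplified sketch of the resulting loop in mark_free_pages() (the
      merged version uses a 128k countdown, per the note below, to avoid a
      modulus; excerpt-style, not self-contained):

        #define WD_PAGE_COUNT   (128 * 1024)

        /* inside mark_free_pages(), walking one order's free list */
        unsigned long page_count = WD_PAGE_COUNT;
        unsigned long i;

        for (i = 0; i < (1UL << order); i++) {
            if (!--page_count) {
                touch_nmi_watchdog();   /* pet the hard-lockup detector */
                page_count = WD_PAGE_COUNT;
            }
            swsusp_set_page_free(pfn_to_page(pfn + i));
        }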
      
      [yu.c.chen@intel.com: simplification]
        Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
      [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
      Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
      Signed-off-by: NChen Yu <yu.c.chen@intel.com>
      Reported-by: NJan Filipcewicz <jan.filipcewicz@intel.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      556b969a
  9. 19 Aug, 2017 (1 commit)
  10. 11 Aug, 2017 (2 commits)
    • J
      mm: ratelimit PFNs busy info message · 75dddef3
      Jonathan Toppins committed
      The RDMA subsystem can generate several thousand of these messages
      per second, eventually leading to a kernel crash.  Ratelimit these
      messages to prevent this crash.
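
      The fix boils down to switching the message to the ratelimited printk
      variant, along the lines of this sketch (context in
      alloc_contig_range() simplified):

        /* the range could not be fully isolated: report, but rate limited */
        if (test_pages_isolated(outer_start, end, false)) {
            pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
                                __func__, outer_start, end);
            ret = -EBUSY;
            goto done;
        }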
      
      Doug said:
       "I've been carrying a version of this for several kernel versions. I
        don't remember when they started, but we have one (and only one) class
        of machines: Dell PE R730xd, that generate these errors. When it
        happens, without a rate limit, we get rcu timeouts and kernel oopses.
        With the rate limit, we just get a lot of annoying kernel messages but
        the machine continues on, recovers, and eventually the memory
        operations all succeed"
      
      And:
       "> Well... why are all these EBUSY's occurring? It sounds inefficient
        > (at least) but if it is expected, normal and unavoidable then
        > perhaps we should just remove that message altogether?
      
        I don't have an answer to that question. To be honest, I haven't
        looked real hard. We never had this at all, then it started out of the
        blue, but only on our Dell 730xd machines (and it hits all of them),
        but no other classes or brands of machines. And we have our 730xd
        machines loaded up with different brands and models of cards (for
        instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
        ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
        meant it wasn't tied to any particular brand/model of RDMA hardware.
        To me, it always smelled of a hardware oddity specific to maybe the
        CPUs or mainboard chipsets in these machines, so given that I'm not an
        mm expert anyway, I never chased it down.
      
        A few other relevant details: it showed up somewhere around 4.8/4.9 or
        thereabouts. It never happened before, but the printk has been there
        since the 3.18 days, so possibly the test to trigger this message was
        changed, or something else in the allocator changed such that the
        situation started happening on these machines?
      
        And, like I said, it is specific to our 730xd machines (but they are
        all identical, so that could mean it's something like their specific
        ram configuration is causing the allocator to hit this on these
        machine but not on other machines in the cluster, I don't want to say
        it's necessarily the model of chipset or CPU, there are other bits of
        identicalness between these machines)"
      
      Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
      Signed-off-by: NJonathan Toppins <jtoppins@redhat.com>
      Reviewed-by: NDoug Ledford <dledford@redhat.com>
      Tested-by: NDoug Ledford <dledford@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75dddef3
    • J
      mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner committed
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters in OOM/allocation failure info dumps, can cause early
      -ENOMEM from overcommit protection, and miscalculates image size
      requirements during suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
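
      For example, meminfo-style readers need the node-level accessor now
      that the slab counters live there (a sketch, not the full patch):

        /* the slab counters are node stats; sum them at the node level */
        unsigned long slab_reclaimable =
                global_node_page_state(NR_SLAB_RECLAIMABLE);
        unsigned long slab_unreclaimable =
                global_node_page_state(NR_SLAB_UNRECLAIMABLE);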
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
  11. 10 Aug, 2017 (1 commit)
    • P
      locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Peter Zijlstra committed
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc. I think this should
      be possible, which allows reducing the 'irq' states and will reduce
      the number of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 fewer __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
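
      Conceptually the annotation becomes a single static lockdep map that
      reclaim-capable allocations "acquire" and "release" around reclaim
      (sketch; __need_fs_reclaim() stands in for the gfp-mask check and is
      not spelled out here):

        static struct lockdep_map __fs_reclaim_map =
                STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

        void fs_reclaim_acquire(gfp_t gfp_mask)
        {
            /* only annotate allocations that may recurse into FS/IO reclaim */
            if (__need_fs_reclaim(gfp_mask))
                lock_map_acquire(&__fs_reclaim_map);
        }

        void fs_reclaim_release(gfp_t gfp_mask)
        {
            if (__need_fs_reclaim(gfp_mask))
                lock_map_release(&__fs_reclaim_map);
        }
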
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d92a8cfc
  12. 03 Aug, 2017 (1 commit)
    • H
      mm: take memory hotplug lock within numa_zonelist_order_handler() · 167d0f25
      Heiko Carstens committed
      Andre Wild reported the following warning:
      
        WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
        Modules linked in:
        CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57 #10
        Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
        task: 00000000701d8100 task.stack: 0000000073594000
        Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
        ...
        Call Trace:
         lockdep_assert_cpus_held+0x42/0x60)
         stop_machine_cpuslocked+0x62/0xf0
         build_all_zonelists+0x92/0x150
         numa_zonelist_order_handler+0x102/0x150
         proc_sys_call_handler.isra.12+0xda/0x118
         proc_sys_write+0x34/0x48
         __vfs_write+0x3c/0x178
         vfs_write+0xbc/0x1a0
         SyS_write+0x66/0xc0
         system_call+0xc4/0x2b0
         locks held by bash/1205:
         #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
         #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
         #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
        Last Breaking-Event-Address:
          lockdep_assert_cpus_held+0x48/0x60
      
      This can be easily triggered with e.g.
      
          echo n > /proc/sys/vm/numa_zonelist_order
      
      In commit 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu
      rwsem") memory hotplug locking was changed to fix a potential deadlock.
      
      This also switched the stop_machine() invocation within
      build_all_zonelists() to stop_machine_cpuslocked() which now expects
      that online cpus are locked when being called.
      
      This assumption is not true if build_all_zonelists() is being called
      from numa_zonelist_order_handler().
      
      In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
      pair to numa_zonelist_order_handler().
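
      In other words, the handler wraps the rebuild in the hotplug lock
      (sketch; build_all_zonelists() still took two arguments at the time
      of this fix):

        /* in numa_zonelist_order_handler(), once the new order is parsed */
        mem_hotplug_begin();
        build_all_zonelists(NULL, NULL);
        mem_hotplug_done();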
      
      Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
      Fixes: 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu rwsem")
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: NAndre Wild <wild@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      167d0f25
  13. 13 Jul, 2017 (1 commit)
    • M
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko committed
      __GFP_REPEAT was designed to allow a retry-but-eventually-fail
      semantic in the page allocator.  This has been true, but only for
      allocation requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has
      always been ignored for smaller sizes.  This is a bit unfortunate
      because there is no way to express the same semantic for those
      requests; they are considered too important to fail, so they might
      end up looping in the page allocator forever, similarly to GFP_NOFAIL
      requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is
      no promise of success.  This works independent of the order and
      overrides the default allocator behavior.  Page allocator users have
      several levels of guarantee vs. cost options (take GFP_KERNEL as an
      example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had this semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than retrying forever in the page
      allocator.
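
      A hypothetical caller of the new flag might look like this
      (illustrative only; the buffer and its size are made up):

        void *buf;

        /* a large, optional buffer: try hard, but tolerate failure
         * rather than trigger the OOM killer */
        buf = kvmalloc(64UL << 20, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
        if (!buf)
            return -ENOMEM; /* degrade gracefully, e.g. use a smaller buffer */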
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  14. 11 Jul, 2017 (2 commits)