1. 11 Aug, 2017 2 commits
    • mm: ratelimit PFNs busy info message · 75dddef3
      Jonathan Toppins committed
      The RDMA subsystem can generate several thousand of these messages per
      second eventually leading to a kernel crash.  Ratelimit these messages
      to prevent this crash.
      
      Doug said:
       "I've been carrying a version of this for several kernel versions. I
        don't remember when they started, but we have one (and only one) class
        of machines: Dell PE R730xd, that generate these errors. When it
        happens, without a rate limit, we get rcu timeouts and kernel oopses.
        With the rate limit, we just get a lot of annoying kernel messages but
        the machine continues on, recovers, and eventually the memory
        operations all succeed"
      
      And:
       "> Well... why are all these EBUSY's occurring? It sounds inefficient
        > (at least) but if it is expected, normal and unavoidable then
        > perhaps we should just remove that message altogether?
      
        I don't have an answer to that question. To be honest, I haven't
        looked real hard. We never had this at all, then it started out of the
        blue, but only on our Dell 730xd machines (and it hits all of them),
        but no other classes or brands of machines. And we have our 730xd
        machines loaded up with different brands and models of cards (for
        instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
        ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
        meant it wasn't tied to any particular brand/model of RDMA hardware.
        To me, it always smelled of a hardware oddity specific to maybe the
        CPUs or mainboard chipsets in these machines, so given that I'm not an
        mm expert anyway, I never chased it down.
      
        A few other relevant details: it showed up somewhere around 4.8/4.9 or
        thereabouts. It never happened before, but the printk has been there
        since the 3.18 days, so possibly the test to trigger this message was
        changed, or something else in the allocator changed such that the
        situation started happening on these machines?
      
        And, like I said, it is specific to our 730xd machines (but they are
        all identical, so that could mean it's something like their specific
        ram configuration is causing the allocator to hit this on these
        machine but not on other machines in the cluster, I don't want to say
        it's necessarily the model of chipset or CPU, there are other bits of
        identicalness between these machines)"
      
      Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Reviewed-by: Doug Ledford <dledford@redhat.com>
      Tested-by: Doug Ledford <dledford@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
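The rate limiting described above can be pictured with a minimal userspace sketch of an interval/burst limiter. This is not the kernel's implementation; the struct, its field names, and `ratelimit_allow()` are hypothetical, and time is just a caller-supplied tick count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of an interval/burst rate limiter. */
struct ratelimit {
    long interval;  /* window length, in caller-supplied ticks */
    int burst;      /* messages allowed per window */
    long begin;     /* start tick of the current window */
    int printed;    /* messages emitted in the current window */
    int missed;     /* messages suppressed in the current window */
};

/* Return true if a message may be emitted at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, long now)
{
    if (now - rl->begin >= rl->interval) {
        /* Window expired: start a new one and reset the counters. */
        rl->begin = now;
        rl->printed = 0;
        rl->missed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->missed++;
    return false;
}
```

A caller would guard each message with `if (ratelimit_allow(&rl, now))`, so a flood of thousands of identical messages per second collapses to at most `burst` messages per window.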
    • mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner committed
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters in OOM/allocation failure info dumps, can cause early -ENOMEM
      from overcommit protection, and can miscalculate image size
      requirements during suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
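The bug described above (counters moved to the node level while the global accessor still summed zone data) can be illustrated with a small userspace model. The types and function names below are hypothetical stand-ins, not the kernel's vmstat API.

```c
#include <assert.h>

#define NR_NODES 2
#define ZONES_PER_NODE 2

/* Hypothetical model: after the move, slab pages are counted per node only. */
struct zone_model {
    long nr_slab;   /* no longer updated after the move */
};

struct node_model {
    long nr_slab;   /* the counter that is actually maintained */
    struct zone_model zones[ZONES_PER_NODE];
};

/* Stale accessor: sums zone counters, which no longer receive updates. */
static long global_zone_slab(const struct node_model *nodes)
{
    long sum = 0;

    for (int n = 0; n < NR_NODES; n++)
        for (int z = 0; z < ZONES_PER_NODE; z++)
            sum += nodes[n].zones[z].nr_slab;
    return sum;
}

/* Fixed accessor: sums the node counters instead. */
static long global_node_slab(const struct node_model *nodes)
{
    long sum = 0;

    for (int n = 0; n < NR_NODES; n++)
        sum += nodes[n].nr_slab;
    return sum;
}
```

With 100 and 50 slab pages accounted on two nodes, the stale accessor reports 0 while the fixed one reports 150, which mirrors the "nearly 0kB" symptom quoted above.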
  2. 03 Aug, 2017 1 commit
    • mm: take memory hotplug lock within numa_zonelist_order_handler() · 167d0f25
      Heiko Carstens committed
      Andre Wild reported the following warning:
      
        WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
        Modules linked in:
        CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57 #10
        Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
        task: 00000000701d8100 task.stack: 0000000073594000
        Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
        ...
        Call Trace:
         lockdep_assert_cpus_held+0x42/0x60
         stop_machine_cpuslocked+0x62/0xf0
         build_all_zonelists+0x92/0x150
         numa_zonelist_order_handler+0x102/0x150
         proc_sys_call_handler.isra.12+0xda/0x118
         proc_sys_write+0x34/0x48
         __vfs_write+0x3c/0x178
         vfs_write+0xbc/0x1a0
         SyS_write+0x66/0xc0
         system_call+0xc4/0x2b0
         locks held by bash/1205:
         #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
         #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
         #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
        Last Breaking-Event-Address:
          lockdep_assert_cpus_held+0x48/0x60
      
      This can be easily triggered with e.g.
      
          echo n > /proc/sys/vm/numa_zonelist_order
      
      In commit 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu
      rwsem") memory hotplug locking was changed to fix a potential deadlock.
      
      This also switched the stop_machine() invocation within
      build_all_zonelists() to stop_machine_cpuslocked() which now expects
      that online cpus are locked when being called.
      
      This assumption is not true if build_all_zonelists() is being called
      from numa_zonelist_order_handler().
      
      In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
      pair to numa_zonelist_order_handler().
      
      Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
      Fixes: 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu rwsem")
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Andre Wild <wild@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
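The locking assumption behind this fix can be sketched in userspace: a callee expects a lock to be held already, and the handler has to take it before calling in. All names below carry a `_model` suffix or are otherwise hypothetical; they only mimic the shape of the kernel functions mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

static bool cpus_locked;  /* models "online CPUs are locked" */
static int warnings;      /* counts lockdep-style warnings */

static void mem_hotplug_begin_model(void) { cpus_locked = true; }
static void mem_hotplug_done_model(void)  { cpus_locked = false; }

/* Models a callee that expects the lock to be held by its caller. */
static void build_all_zonelists_model(void)
{
    if (!cpus_locked)
        warnings++;
}

/* Before the fix: the handler calls in without holding the lock. */
static void handler_before_fix(void)
{
    build_all_zonelists_model();
}

/* After the fix: the call is wrapped in a begin/done pair. */
static void handler_after_fix(void)
{
    mem_hotplug_begin_model();
    build_all_zonelists_model();
    mem_hotplug_done_model();
}
```

Calling `handler_before_fix()` bumps the warning counter, while `handler_after_fix()` leaves it unchanged and releases the lock on exit.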
  3. 13 Jul, 2017 1 commit
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko committed
      __GFP_REPEAT was designed to give the page allocator a
      retry-but-eventually-fail semantic.  In practice this has only been
      true for allocation requests larger than PAGE_ALLOC_COSTLY_ORDER; the
      flag has always been ignored for smaller sizes.  This is a bit
      unfortunate because there is no way to express the same semantic for
      those requests, and they are considered too important to fail, so they
      might end up looping in the page allocator forever, similarly to
      GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
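The retry levels listed above can be summarized as a coarse decision function. This is only a sketch under loud assumptions: the flag bits, `COSTLY_ORDER`, and `should_retry()` are hypothetical, and the real slow path weighs many more conditions (watermarks, compaction, OOM state).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real values live in the kernel's gfp headers. */
#define F_NORETRY        0x1u
#define F_RETRY_MAYFAIL  0x2u
#define F_NOFAIL         0x4u

#define COSTLY_ORDER 3u  /* stand-in for PAGE_ALLOC_COSTLY_ORDER */

/* Should the allocator keep retrying after a failed reclaim round? */
static bool should_retry(unsigned flags, unsigned order, bool made_progress)
{
    if (flags & F_NOFAIL)
        return true;            /* loops until the allocation succeeds */
    if (flags & F_NORETRY)
        return false;           /* fail early, no disruptive reclaim */
    if (flags & F_RETRY_MAYFAIL)
        return made_progress;   /* try hard, give up without progress */
    /* Default: small requests keep retrying, costly ones may fail. */
    return order <= COSTLY_ORDER;
}
```

The point of the new flag shows in the third branch: the outcome depends on reclaim progress rather than on the request order, so even an order-0 request can fail and hand control back to the caller's fallback.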
  4. 11 Jul, 2017 3 commits
    • mm/memory-hotplug: switch locking to a percpu rwsem · 3f906ba2
      Thomas Gleixner committed
      Andrey reported a potential deadlock with the memory hotplug lock and
      the cpu hotplug lock.
      
      The reason is that memory hotplug takes the memory hotplug lock and then
      calls stop_machine() which calls get_online_cpus().  That's the reverse
      lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c
      
      The problem has been there forever.  The reason why this was never
      reported is that the cpu hotplug locking had this homebrewed recursive
      reader writer semaphore construct which due to the recursion evaded the
      full lockdep coverage.  The memory hotplug code copied that construct
      verbatim and therefore has similar issues.
      
      Three steps to fix this:
      
      1) Convert the memory hotplug locking to a per cpu rwsem so the
         potential issues get reported properly by lockdep.
      
      2) Lock the online cpus in mem_hotplug_begin() before taking the memory
         hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
         code to avoid recursive locking.
      
      3) The cpu hotplug locking in #2 causes a recursive locking of the cpu
         hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
         by invoking lru_add_drain_all_cpuslocked() instead.
      
      Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallback · b002529d
      Rasmus Villemoes committed
      Since current_order starts as MAX_ORDER-1 and is then only decremented,
      the second half of the loop condition seems superfluous.  However, if
      order is 0, we may decrement current_order past 0, making it UINT_MAX.
      This is obviously too subtle ([1], [2]).
      
      Since we need to add some comment anyway, change the two variables to
      signed, making the counting-down for loop look more familiar, and
      apparently also making gcc generate slightly smaller code.
      
      [1] https://lkml.org/lkml/2016/6/20/493
      [2] https://lkml.org/lkml/2017/6/19/345
      
      [akpm@linux-foundation.org: fix up reject fixupping]
      Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reported-by: Hao Lee <haolee.swjtu@gmail.com>
      Acked-by: Wei Yang <weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
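The subtlety described above is easy to reproduce in plain C. The sketch below contrasts an unsigned counting-down loop, which needs a second bound check to terminate when `order` is 0, with the signed version. `MAX_ORDER_MODEL` and both function names are illustrative assumptions, not the kernel's code.

```c
#include <assert.h>
#include <limits.h>

#define MAX_ORDER_MODEL 11  /* stand-in value for MAX_ORDER */

/*
 * Unsigned version: when order == 0, decrementing cur past 0 wraps to
 * UINT_MAX, so the upper-bound check is what actually ends the loop.
 */
static int count_unsigned(unsigned int order)
{
    int iterations = 0;

    for (unsigned int cur = MAX_ORDER_MODEL - 1;
         cur >= order && cur <= MAX_ORDER_MODEL - 1; --cur)
        iterations++;
    return iterations;
}

/* Signed version: an ordinary counting-down loop, no extra check needed. */
static int count_signed(int order)
{
    int iterations = 0;

    for (int cur = MAX_ORDER_MODEL - 1; cur >= order; --cur)
        iterations++;
    return iterations;
}
```

Both variants visit the same orders; the signed one simply makes the termination condition obvious to the reader.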
    • mm, page_alloc: fallback to smallest page when not stealing whole pageblock · 7a8f58f3
      Vlastimil Babka committed
      Since commit 3bc48f96 ("mm, page_alloc: split smallest stolen page
      in fallback") we pick the smallest (but sufficient) page of all that
      have been stolen from a pageblock of different migratetype.  However,
      there are cases when we decide not to steal the whole pageblock.
      
      Practically in the current implementation it means that we are trying to
      fallback for a MIGRATE_MOVABLE allocation of order X, go through the
      freelists from MAX_ORDER-1 down to X, and find free page of order Y.  If
      Y is less than pageblock_order / 2, we decide not to steal all pages
      from the pageblock.  When Y > X, it means we are potentially splitting a
      larger page than we need, as there might be other pages of order Z,
      where X <= Z < Y.  Since Y is already too small to steal whole
      pageblock, picking smallest available Z will result in the same decision
      and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or
      MIGRATE_RECLAIMABLE pageblock.
      
      This patch therefore changes the fallback algorithm so that in the
      situation described above, we switch the fallback search strategy to go
      from order X upwards to find the smallest suitable fallback.  In theory
      there shouldn't be a downside of this change wrt fragmentation.
      
      This has been tested with mmtests' stress-highalloc performing
      GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint
      statistics:
      
                                                            4.12.0-rc2  4.12.0-rc2
                                                             1-kernel4   2-kernel4
        Page alloc extfrag event                              25640976    69680977
        Extfrag fragmenting                                   25621086    69661364
        Extfrag fragmenting for unmovable                        74409       73204
        Extfrag fragmenting unmovable placed with movable        69003       67684
        Extfrag fragmenting unmovable placed with reclaim.        5406        5520
        Extfrag fragmenting for reclaimable                       6398        8467
        Extfrag fragmenting reclaimable placed with movable        869         884
        Extfrag fragmenting reclaimable placed with unmov.        5529        7583
        Extfrag fragmenting for movable                       25540279    69579693
      
      Since we force movable allocations to steal the smallest available page
      (which we then practically always split), we steal less per fallback, so
      the number of fallbacks increases and steals potentially happen from
      different pageblocks.  This is however not an issue for movable pages
      that can be compacted.
      
      Importantly, the "unmovable placed with movable" statistics is lower,
      which is the result of less fragmentation in the unmovable pageblocks.
      The effect on reclaimable allocation is a bit unclear.
      
      Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
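The changed search strategy for a movable request can be sketched as a two-pass lookup over a table of available fallback orders. This is a simplified model: `free_at`, the `_MODEL` constants, and `find_fallback_order()` are hypothetical, and the real code works on per-migratetype free lists rather than a boolean array.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER_MODEL 11       /* stand-in for MAX_ORDER */
#define PAGEBLOCK_ORDER_MODEL 9  /* stand-in for pageblock_order */

/*
 * free_at[o] is true if a fallback free page of order o exists.
 * Returns the order to take for a movable request of order `order`,
 * or -1 if no fallback page is available.
 */
static int find_fallback_order(const bool free_at[MAX_ORDER_MODEL], int order)
{
    int cur;

    /* First pass: look for the largest available page, top down. */
    for (cur = MAX_ORDER_MODEL - 1; cur >= order; --cur)
        if (free_at[cur])
            break;
    if (cur < order)
        return -1;

    /* Large enough to steal the whole pageblock: keep this page. */
    if (cur >= PAGEBLOCK_ORDER_MODEL / 2)
        return cur;

    /* Otherwise take the smallest sufficient page, from order upwards. */
    for (cur = order; cur < MAX_ORDER_MODEL; ++cur)
        if (free_at[cur])
            return cur;
    return -1;
}
```

With free pages at orders 1 and 3 and a request of order 0, the top-down pass finds order 3 (Y), decides it is too small to steal the pageblock, and the upward pass then settles on order 1 (Z), leaving the order-3 page unsplit.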
  5. 07 Jul, 2017 9 commits
    • mm, memory_hotplug: drop CONFIG_MOVABLE_NODE · f70029bb
      Michal Hocko committed
      Commit 20b2f52b ("numa: add CONFIG_MOVABLE_NODE for
      movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
      good explanation on why it is actually useful.
      
      It makes a lot of sense to make movable node semantic opt in but we
      already have that because the feature has to be explicitly enabled on
      the kernel command line.  A config option on top only makes the
      configuration space larger without a good reason.  It also adds an
      additional ifdefery that pollutes the code.
      
      Just drop the config option and make it de-facto always enabled.  This
      shouldn't introduce any change to the semantic.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmstat: move slab statistics from zone to node counters · 385386cf
      Johannes Weiner committed
      Patch series "mm: per-lruvec slab stats"
      
      Josef is working on a new approach to balancing slab caches and the page
      cache.  For this to work, he needs slab cache statistics on the lruvec
      level.  These patches implement that by adding infrastructure that
      allows updating and reading generic VM stat items per lruvec, then
      switches some existing VM accounting sites, including the slab
      accounting ones, to this new cgroup-aware API.
      
      I'll follow up with more patches on this, because there is actually
      substantial simplification that can be done to the memory controller
      when we replace private memcg accounting with making the existing VM
      accounting sites cgroup-aware.  But this is enough for Josef to base his
      slab reclaim work on, so here goes.
      
      This patch (of 5):
      
      To re-implement slab cache vs.  page cache balancing, we'll need the
      slab counters at the lruvec level, which, ever since lru reclaim was
      moved from the zone to the node, is the intersection of the node, not
      the zone, and the memcg.
      
      We could retain the per-zone counters for when the page allocator dumps
      its memory information on failures, and have counters on both levels -
      which on all but NUMA node 0 is usually redundant.  But let's keep it
      simple for now and just move them.  If anybody complains we can restore
      the per-zone counters.
      
      [hannes@cmpxchg.org: fix oops]
        Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
      Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: pass preferred nid instead of zonelist to allocator · 04ec6264
      Vlastimil Babka committed
      The main allocator function __alloc_pages_nodemask() takes a zonelist
      pointer as one of its parameters.  All of its callers directly or
      indirectly obtain the zonelist via node_zonelist() using a preferred
      node id and gfp_mask.  We can make the code a bit simpler by doing the
      zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
      id instead (gfp_mask is already another parameter).
      
      There are some code size benefits thanks to removal of inlined
      node_zonelist():
      
        bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)
      
      This will also make things simpler if we proceed with converting cpusets
      to zonelists.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: fix more premature OOM due to race with cpuset update · 902b6281
      Vlastimil Babka committed
      I would like to stress that this patchset aims to fix issues and cleanup
      the code *within the existing documented semantics*, i.e.  patch 1
      ignores mempolicy restrictions if the set of allowed nodes has no
      intersection with set of nodes allowed by cpuset.  I believe discussing
      potential changes of the semantics can be better done once we have a
      baseline with no known bugs of the current semantics.
      
      I've recently summarized the cpuset/mempolicy issues in a LSF/MM
      proposal [1] and the discussion itself [2].  I've been trying to rewrite
      the handling as proposed, with the idea that changing semantics to make
      all mempolicies static wrt cpuset updates (and discarding the relative
      and default modes) can be tried on top, as there's a high risk of being
      rejected/reverted because somebody might still care about the removed
      modes.
      
      However I haven't yet figured out how to properly:
      
      1) make mempolicies swappable instead of rebinding in place. I thought
         mbind() already works that way and uses refcounting to avoid
         use-after-free of the old policy by a parallel allocation, but turns
         out true refcounting is only done for shared (shmem) mempolicies, and
         the actual protection for mbind() comes from mmap_sem. Extending the
         refcounting means more overhead in allocator hot path. Also swapping
         whole mempolicies means that we have to allocate the new ones, which
         can fail, and reverting of the partially done work also means
         allocating (note that mbind() doesn't care and will just leave part
         of the range updated and part not updated when returning -ENOMEM...).
      
      2) make cpuset's task->mems_allowed also swappable (after converting it
         from nodemask to zonelist, which is the easy part) for mostly the
         same reasons.
      
      The good news is that while trying to do the above, I've at least
      figured out how to hopefully close the remaining premature OOM's, and do
      a bunch of cleanups on top, removing quite some of the code that was also
      supposed to prevent the cpuset update races, but doesn't work anymore
      nowadays.  This should fix the most pressing concerns with this topic
      and give us a better baseline before either proceeding with the original
      proposal, or pushing a change of semantics that removes the problem 1)
      above.  I'd be then fine with trying to change the semantic first and
      rewrite later.
      
      Patchset has been tested with the LTP cpuset01 stress test.
      
      [1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
      [2] https://lwn.net/Articles/717797/
      [3] https://marc.info/?l=linux-mm&m=149191957922828&w=2
      
      This patch (of 6):
      
      Commit e47483bc ("mm, page_alloc: fix premature OOM when racing with
      cpuset mems update") has fixed known recent regressions found by LTP's
      cpuset01 testcase.  I have however found that by modifying the testcase
      to use per-vma mempolicies via mbind(2) instead of per-task mempolicies
      via set_mempolicy(2), the premature OOM still happens and the issue is
      much older.
      
      The root of the problem is that the cpuset's mems_allowed and
      mempolicy's nodemask can temporarily have no intersection, thus
      get_page_from_freelist() cannot find any usable zone.  The current
      semantic for empty intersection is to ignore mempolicy's nodemask and
      honour cpuset restrictions.  This is checked in node_zonelist(), but the
      racy update can happen after we already passed the check.  Such races
      should be protected by the seqlock task->mems_allowed_seq, but it
      doesn't work here, because 1) mpol_rebind_mm() does not happen under
      seqlock for write, and doing so would lead to deadlock, as it takes
      mmap_sem for write, while the allocation can have mmap_sem for read when
      it's taking the seqlock for read.  And 2) the seqlock cookie of callers
      of node_zonelist() (alloc_pages_vma() and alloc_pages_current()) is
      different than the one of __alloc_pages_slowpath(), so there's still a
      potential race window.
      
      This patch fixes the issue by having __alloc_pages_slowpath() check for
      empty intersection of cpuset and ac->nodemask before OOM or allocation
      failure.  If it's indeed empty, the nodemask is ignored and allocation
      retried, which mimics node_zonelist().  This works fine, because almost
      all callers of __alloc_pages_nodemask are obtaining the nodemask via
      node_zonelist().  The only exception is new_node_page() from hotplug,
      where the potential violation of nodemask isn't an issue, as there's
      already a fallback allocation attempt without any nodemask.  If there's
      a future caller that needs to have its specific nodemask honoured over
      task's cpuset restrictions, we'll have to e.g.  add a gfp flag for that.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
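The "empty intersection" rule described above can be modeled with plain bitmasks, where bit n stands for node n. The type and function names are hypothetical; the kernel uses `nodemask_t` and its own helpers, and performs this check in the allocator slow path.

```c
#include <assert.h>
#include <stdint.h>

/* Node sets as bitmasks: bit n set means node n is allowed. */
typedef uint32_t nodemask_model;

/* Return the set of nodes the allocation should actually try. */
static nodemask_model effective_nodes(nodemask_model cpuset_allowed,
                                      nodemask_model policy_mask)
{
    nodemask_model both = cpuset_allowed & policy_mask;

    /* Empty intersection: ignore the policy mask, honour the cpuset. */
    return both ? both : cpuset_allowed;
}
```

If the cpuset allows only node 0 while the mempolicy asks for node 1, the result is node 0 rather than an empty set, so the allocation is retried on a usable node instead of failing into a premature OOM.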
    • mm/page_alloc.c: mark bad_range() and meminit_pfn_in_nid() as __maybe_unused · d73d3c9f
      Matthias Kaehlcke committed
      The functions are not used in some configurations.  Adding the attribute
      fixes the following warnings when building with clang:
      
        mm/page_alloc.c:409:19: error: function 'bad_range' is not needed and
            will not be emitted [-Werror,-Wunneeded-internal-declaration]
      
        mm/page_alloc.c:1106:30: error: unused function 'meminit_pfn_in_nid'
            [-Werror,-Wunused-function]
      
      Link: http://lkml.kernel.org/r/20170518182030.165633-1-mka@chromium.org
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
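A userspace sketch of the same idea: a static helper that is only called in some configurations is annotated so that compilers do not warn when it ends up unused. The macro `maybe_unused_model`, the `DEBUG_CHECKS` switch, and both functions are hypothetical stand-ins; the attribute itself is the GCC/clang `unused` attribute.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's __maybe_unused annotation. */
#define maybe_unused_model __attribute__((unused))

#define DEBUG_CHECKS 0  /* hypothetical config switch */

/*
 * Only called when DEBUG_CHECKS is enabled; the attribute silences
 * unused-function warnings in the other configuration.
 */
static int maybe_unused_model bad_range_model(long pfn, long start, long end)
{
    return pfn < start || pfn >= end;
}

/* Returns 0 if the pfn is accepted, -1 if a debug check rejects it. */
static int check_pfn(long pfn)
{
#if DEBUG_CHECKS
    if (bad_range_model(pfn, 0, 1024))
        return -1;
#endif
    (void)pfn;
    return 0;
}
```

With `DEBUG_CHECKS` set to 0 the helper has no callers in `check_pfn()`, yet the file still builds cleanly under `-Werror -Wunused-function`.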
    • P
      mm: adaptive hash table scaling · 9017217b
      Pavel Tatashin committed
      Allow hash tables to scale with memory, but at a slower pace: when
      HASH_ADAPT is provided, every time memory quadruples the sizes of hash
      tables only double instead of quadrupling as well.  This algorithm
      starts working only when memory size reaches a certain point, currently
      set to 64G.
      
      This is an example of the dentry hash table size, before and after,
      for various memory configurations:
      
      MEMORY     SCALE      HASH_SIZE
                old  new      old      new
          8G     13   13       8M       8M
         16G     13   13      16M      16M
         32G     13   13      32M      32M
         64G     13   13      64M      64M
        128G     13   14     128M      64M
        256G     13   14     256M     128M
        512G     13   15     512M     128M
       1024G     13   15    1024M     256M
       2048G     13   16    2048M     256M
       4096G     13   16    4096M     512M
       8192G     13   17    8192M     512M
      16384G     13   17   16384M    1024M
      32768G     13   18   32768M    1024M
      65536G     13   18   65536M    2048M
      
      The effect of this change on runtime is undetectable as filesystem
      growth is not proportional to machine memory size as is currently
      assumed.  The change affects only large-memory machines.  Additional
      tuning might be needed, but that can be done by the clients of the
      kmem_cache_create interface, not the generic cache allocator itself.
      
      The adaptive hashing is disabled on 32-bit systems to avoid confusion
      about whether the base should be different for smaller systems, and to
      avoid overflows.
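
      The scaling rule and the table above can be reproduced with a short
      model.  This is only a sketch in Python, not the kernel code: the
      function names are made up for illustration, the starting scale of 13
      is taken from the table, and the 8-byte bucket size is an assumption
      inferred from the table (8G of memory at scale 13 giving an 8M table).

```python
GIB = 1 << 30
MIB = 1 << 20

ADAPT_BASE = 64 * GIB   # threshold named in the commit message
ADAPT_SHIFT = 2         # one extra scale step each time memory quadruples
BUCKET_BYTES = 8        # assumption: one pointer-sized bucket on 64-bit


def adaptive_scale(mem_bytes, scale=13):
    """Return the scale after adaptive adjustment (hypothetical helper)."""
    adapt = ADAPT_BASE
    while adapt < mem_bytes:
        scale += 1              # halves the table size relative to memory
        adapt <<= ADAPT_SHIFT   # next step only after memory quadruples
    return scale


def hash_table_bytes(mem_bytes, scale):
    """One bucket per 2**scale bytes of memory."""
    return (mem_bytes >> scale) * BUCKET_BYTES


if __name__ == "__main__":
    for gib in (8, 64, 128, 256, 512, 1024):
        mem = gib * GIB
        new_scale = adaptive_scale(mem)
        old_mib = hash_table_bytes(mem, 13) // MIB
        new_mib = hash_table_bytes(mem, new_scale) // MIB
        print(f"{gib:>6}G  scale 13 -> {new_scale}  {old_mib}M -> {new_mib}M")
```

      Shifting the threshold left by two bits per step is what makes the
      table double only once for every quadrupling of memory.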
      
      [mhocko@suse.com: drop HASH_ADAPT]
        Link: http://lkml.kernel.org/r/20170509094607.GG6481@dhcp22.suse.cz
      [pasha.tatashin@oracle.com: UL -> ULL fix]
        Link: http://lkml.kernel.org/r/1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com
      [pasha.tatashin@oracle.com: disable adaptive hash on 32 bit systems]
        Link: http://lkml.kernel.org/r/1495469329-755807-2-git-send-email-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/1488432825-92126-5-git-send-email-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Babu Moger <babu.moger@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: zero hash tables in allocator · 3749a8f0
      Pavel Tatashin committed
      Add a new flag HASH_ZERO which, when provided, guarantees that the hash
      table returned by alloc_large_system_hash() is zeroed.  In most
      cases that is what is needed by the caller.  Use the page level
      allocator's __GFP_ZERO flag to zero the memory.  It uses memset(), which
      is an efficient method to zero memory and is optimized for most platforms.
      
      Link: http://lkml.kernel.org/r/1488432825-92126-3-git-send-email-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: consider zone which is not fully populated to have holes · 2d070eab
      Michal Hocko committed
      __pageblock_pfn_to_page has two users currently, set_zone_contiguous
      which checks whether the given zone contains holes and
      pageblock_pfn_to_page which then carefully returns a first valid page
      from the given pfn range for the given zone.  This doesn't handle zones
      which are not fully populated though.  Memory pageblocks can be offlined
      or might not have been onlined yet.  In such a case the zone should be
      considered to have holes otherwise pfn walkers can touch and play with
      offline pages.
      
      Current callers of pageblock_pfn_to_page in compaction seem to work
      properly right now because they only isolate PageBuddy
      (isolate_freepages_block) or PageLRU and __PageMovable
      (isolate_migratepages_block), which will always be false for these pages.
      It would be safer to skip these pages altogether, though.
      
      To do this, the patch adds a new memory section state
      (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
      in online_pages_range during memory hotplug.  Similarly,
      offline_mem_sections clears the bit and is called when the memory
      range is offlined.
      
      A pfn_to_online_page helper is then added which checks the mem section
      and only returns a page if it is already online.
      
      Use the new helper in __pageblock_pfn_to_page and skip the whole page
      block in such a case.
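
      The idea can be illustrated with a small model.  This is a sketch in
      Python under stated assumptions, not the kernel implementation: a set
      of section numbers stands in for the SECTION_IS_ONLINE bit, a pfn
      stands in for its struct page, and the section size (32768 pages) and
      pageblock size (512 pages) are assumed values.  SectionMap and
      walk_pageblocks are hypothetical names.

```python
PAGES_PER_SECTION = 32768   # assumption: 128 MiB sections of 4 KiB pages
PAGEBLOCK_NR_PAGES = 512    # assumption: 2 MiB pageblocks


class SectionMap:
    """Tracks which memory sections are online (stand-in for SECTION_IS_ONLINE)."""

    def __init__(self):
        self.online = set()

    @staticmethod
    def _sections(start_pfn, end_pfn):
        # Section numbers overlapping the half-open range [start_pfn, end_pfn).
        first = start_pfn // PAGES_PER_SECTION
        last = (end_pfn - 1) // PAGES_PER_SECTION
        return range(first, last + 1)

    def online_mem_sections(self, start_pfn, end_pfn):
        for nr in self._sections(start_pfn, end_pfn):
            self.online.add(nr)

    def offline_mem_sections(self, start_pfn, end_pfn):
        for nr in self._sections(start_pfn, end_pfn):
            self.online.discard(nr)

    def pfn_to_online_page(self, pfn):
        """Return pfn (standing in for a page) or None if its section is offline."""
        if pfn // PAGES_PER_SECTION in self.online:
            return pfn
        return None


def walk_pageblocks(sections, start_pfn, end_pfn):
    """Return the first pfn of every pageblock a pfn walker may touch."""
    visited = []
    for pfn in range(start_pfn, end_pfn, PAGEBLOCK_NR_PAGES):
        if sections.pfn_to_online_page(pfn) is None:
            continue  # offline section: skip the whole pageblock
        visited.append(pfn)
    return visited
```

      With sections 0 and 2 online and section 1 offline, a walk over all
      three sections only visits pageblocks in sections 0 and 2, which is the
      "zone has holes" behaviour the message asks for.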
      
      [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
       mark sections online after all struct pages are initialized in
       online_pages_range (Vlastimil)]
        Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: remove return value from init_currently_empty_zone · dc0bbf3b
      Michal Hocko committed
      Patch series "mm: make movable onlining suck less", v4.
      
      Movable onlining is a real hack with many downsides - mainly
      reintroduction of lowmem/highmem issues we used to have on 32b systems -
      but it is the only way to make the memory hotremove more reliable which
      is something that people are asking for.
      
      The current semantic of movable memory onlining is really cumbersome,
      however.  The main reason for this is that the udev driven approach is
      basically unusable because udev races with the memory probing while only
      the last memory block or the one adjacent to the existing zone_movable
      are allowed to be onlined movable.  In short, the criterion for a
      successful online_movable changes under udev's feet.  A reliable udev
      approach would require a two-phase approach where the first successful
      movable online would have to check all the previous blocks and online
      them in descending order.  This can hardly be considered sane.
      
      This patchset aims at making the onlining semantic more usable.  First
      of all, it allows memory to be onlined movable as long as it doesn't
      clash with the existing ZONE_NORMAL.  That means that ZONE_NORMAL and
      ZONE_MOVABLE cannot overlap.  Currently I preserve the original ordering
      semantic so the normal zone always precedes the movable zone, but I have
      plans to remove this restriction in the future because it is not really
      necessary.
      
      First 3 patches are cleanups which should be ready to be merged right
      away (unless I have missed something subtle of course).
      
      Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
      
      Patch 5 deals with implicit assumptions of register_one_node on pgdat
      initialization.
      
      Patches 6-10 deal with offline holes in the zone for pfn walkers.  I
      hope I got all of them right but people familiar with compaction should
      double check this.
      
      Patch 11 is the core of the change.  To make it easier to review I have
      tried to keep it as minimal as possible, and the large code removal is
      moved to patch 14.
      
      Patch 12 is a trivial follow up cleanup.  Patch 13 fixes sparse warnings
      and finally patch 14 removes the unused code.
      
      I have tested the patches in kvm:
        # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
      
      and then probed the additional memory by
        (qemu) object_add memory-backend-ram,id=mem1,size=1G
        (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      
      Then I have used this simple script to probe the memory block by hand
        # cat probe_memblock.sh
        #!/bin/sh
      
        BLOCK_NR=$1
      
        echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
      
        # for i in $(seq 10); do sh probe_memblock.sh $i; done
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
        /sys/devices/system/memory/memory35/valid_zones:Normal Movable
        /sys/devices/system/memory/memory36/valid_zones:Normal Movable
        /sys/devices/system/memory/memory37/valid_zones:Normal Movable
        /sys/devices/system/memory/memory38/valid_zones:Normal Movable
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
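
      The memoryNN numbers in the listing follow from the address arithmetic
      in the script.  The sketch below, in Python, assumes that the sysfs
      block index is simply the physical address divided by the 128MB block
      size used by the script; the function names are made up for
      illustration.

```python
BLOCK_SIZE = 128 << 20      # block size used by probe_memblock.sh
PROBE_BASE = 0x100000000    # base address used by probe_memblock.sh (4 GiB)


def probe_address(block_nr):
    """Physical address the script writes to the probe file."""
    return PROBE_BASE + block_nr * BLOCK_SIZE


def memory_block_index(phys_addr):
    """Assumed mapping from physical address to the memoryNN sysfs index."""
    return phys_addr // BLOCK_SIZE


if __name__ == "__main__":
    for nr in range(1, 11):  # same range as `seq 10`
        addr = probe_address(nr)
        print(f"block {nr:>2}: {addr:#x} -> memory{memory_block_index(addr)}")
```

      Under that assumption, blocks 1 to 10 map to memory33 to memory42, and
      the memory3? glob in the grep command shows only memory33 to memory39.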
      
      The main difference from the original implementation is that all new
      memblocks can initially be onlined either as online_kernel or as
      online_movable, because there is obviously no clash.  For comparison,
      the original implementation would have
      
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal
        /sys/devices/system/memory/memory35/valid_zones:Normal
        /sys/devices/system/memory/memory36/valid_zones:Normal
        /sys/devices/system/memory/memory37/valid_zones:Normal
        /sys/devices/system/memory/memory38/valid_zones:Normal
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      Now
        # echo online_movable > /sys/devices/system/memory/memory34/state
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
        /sys/devices/system/memory/memory36/valid_zones:Movable
        /sys/devices/system/memory/memory37/valid_zones:Movable
        /sys/devices/system/memory/memory38/valid_zones:Movable
        /sys/devices/system/memory/memory39/valid_zones:Movable
      
      Block 33 can still be onlined as either kernel or movable, while all
      the remaining blocks can only be onlined movable.
      
      /proc/zoneinfo says
        Node 0, zone   Normal
          pages free     0
                min      0
                low      0
                high     0
                spanned  0
                present  0
        --
        Node 0, zone  Movable
          pages free     32753
                min      85
                low      117
                high     149
                spanned  32768
                present  32768
      
      A new memblock at a lower address will result in a new memblock (32)
      which will still allow both Normal and Movable.
      
        # sh probe_memblock.sh 0
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      and online_kernel will convert it to the zone normal properly
      while 33 can be still onlined both ways.
      
        # echo online_kernel > /sys/devices/system/memory/memory32/state
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
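
      The valid_zones listings above are consistent with a very simple rule:
      an offline block may become Normal if it lies below every movable
      block, and Movable if it lies above every normal block.  The Python
      sketch below is a toy model of that rule, not the kernel's logic; it
      tracks only the hot-added blocks and ignores boot memory, and
      valid_zones() here is a hypothetical function named after the sysfs
      file.

```python
def valid_zones(block, online):
    """Return the zones a block could belong to.

    `online` maps block number -> "Normal" or "Movable" for blocks that
    are already online.  Only hot-added blocks are modelled.
    """
    if block in online:
        return [online[block]]
    normal = [b for b, zone in online.items() if zone == "Normal"]
    movable = [b for b, zone in online.items() if zone == "Movable"]
    zones = []
    if not movable or block < min(movable):
        zones.append("Normal")    # still below the movable zone
    if not normal or block > max(normal):
        zones.append("Movable")   # still above the normal zone
    return zones


if __name__ == "__main__":
    state = {32: "Normal", 34: "Movable"}
    for block in range(32, 36):
        print(f"memory{block}/valid_zones:{' '.join(valid_zones(block, state))}")
```

      The model keeps the ordering constraint from the cover letter (the
      normal zone precedes the movable zone), which is why block 33 remains
      usable both ways in every listing.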
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     65441
                min      165
                low      230
                high     295
                spanned  65536
                present  65536
        --
        Node 0, zone  Movable
          pages free     32740
                min      82
                low      114
                high     146
                spanned  32768
                present  32768
      
      so both zones have one memblock spanned and present.
      
      Onlining 39 should associate this block to the movable zone
      
        # echo online > /sys/devices/system/memory/memory39/state
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     32765
                min      80
                low      112
                high     144
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     65501
                min      160
                low      225
                high     290
                spanned  196608
                present  65536
      
      so we will have a movable zone which spans 6 memblocks, 2 present and 4
      representing a hole.
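
      The spanned and present figures for the movable zone can be checked
      with a little arithmetic.  The Python sketch below assumes 4KB pages,
      so that one 128MB block holds 32768 pages; zone_span() is a
      hypothetical helper and only describes a zone with at least one online
      block.

```python
PAGE_SIZE = 4096                            # assumption: 4 KiB pages
BLOCK_SIZE = 128 << 20                      # block size from probe_memblock.sh
PAGES_PER_BLOCK = BLOCK_SIZE // PAGE_SIZE   # 32768


def zone_span(online_blocks):
    """Return (spanned, present) pages for a zone built from these blocks."""
    blocks = set(online_blocks)
    if not blocks:
        raise ValueError("this model needs at least one online block")
    spanned = (max(blocks) - min(blocks) + 1) * PAGES_PER_BLOCK
    present = len(blocks) * PAGES_PER_BLOCK
    return spanned, present


if __name__ == "__main__":
    spanned, present = zone_span([34, 39])
    hole_blocks = (spanned - present) // PAGES_PER_BLOCK
    print(spanned, present, hole_blocks)
```

      For blocks 34 and 39 this gives 196608 spanned and 65536 present pages,
      that is six blocks spanned with four of them forming the hole.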
      
      Offlining both movable blocks will lead to the zone with no present
      pages which is the expected behavior I believe.
      
        # echo offline > /sys/devices/system/memory/memory39/state
        # echo offline > /sys/devices/system/memory/memory34/state
        # grep -A6 "Movable\|Normal" /proc/zoneinfo
        Node 0, zone   Normal
          pages free     32735
                min      90
                low      122
                high     154
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     0
                min      0
                low      0
                high     0
                spanned  196608
                present  0
      
      As a bonus we will get a nice cleanup in the memory hotplug codebase.
      
      This patch (of 16):
      
      init_currently_empty_zone doesn't have any error to return yet it is
      still an int and callers try to be defensive and try to handle potential
      error.  Remove this nonsense and simplify all callers.
      
      This patch shouldn't have any visible effect.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 03 Jun, 2017 2 commits
    • M
      mm: consider memblock reservations for deferred memory initialization sizing · 864b9a39
      Michal Hocko committed
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      The reason is that the managed memory is too low (only 110MB) while the
      rest of the 50GB is still waiting for the deferred initialization to
      be done.  update_defer_init estimates the initial memory to initialize
      to 2GB at least but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is then added on top
      of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes and so reservation might
      be above the initial estimation which we ignore but let's make the
      logic simpler until we really need to handle more complicated cases.
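
      The estimation described above can be sketched as follows.  This is a
      Python model under stated assumptions, not the kernel code: it works in
      bytes rather than pages, ignores nodes, and represents reserved blocks
      as (base, size) tuples.  initial_init_size() is a hypothetical name;
      memblock_reserved_memory mirrors the helper named in the message.

```python
MIB = 1 << 20
GIB = 1 << 30


def memblock_reserved_memory(reserved, limit):
    """Sum the sizes of all reserved blocks that start below `limit`."""
    return sum(size for base, size in reserved if base < limit)


def initial_init_size(reserved, base_estimate=2 * GIB):
    """Initial estimate plus reservations starting inside that range."""
    return base_estimate + memblock_reserved_memory(reserved, base_estimate)


if __name__ == "__main__":
    # The crashkernel reservation from the report: 4096MB at 128MB.
    reserved = [(128 * MIB, 4096 * MIB)]
    print(initial_init_size(reserved) // GIB, "GiB initialised up front")
```

      With the reservation from the report, the model grows the initial 2GB
      estimate to 6GB, so the crashkernel area no longer eats the whole
      early-initialised range.  A reservation that starts above the estimate
      is ignored, as the message notes.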
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once · c288983d
      Tetsuo Handa committed
      Roman Gushchin has reported that the OOM killer can trivially select
      the next OOM victim when a thread doing memory allocation from the page
      fault path was selected as the first OOM victim.
      
          allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           __alloc_pages_slowpath+0xc84/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
          Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
          allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           __alloc_pages_slowpath+0xd32/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
          ...
          allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           pagefault_out_of_memory+0x68/0x80
           mm_fault_error+0x8f/0x190
           ? handle_mm_fault+0xf3/0x210
           __do_page_fault+0x4b2/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
          Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
      
      There is a race window in which the OOM reaper completes reclaiming the
      first victim's memory while nothing but mutex_trylock() prevents the
      first victim from calling out_of_memory() from pagefault_out_of_memory()
      after memory allocation for the page fault path failed due to being
      selected as an OOM victim.
      
      This is a side effect of commit 9a67f648 ("mm: consolidate
      GFP_NOFAIL checks in the allocator slowpath") because that commit
      silently changed the behavior from
      
          /* Avoid allocations with no watermarks from looping endlessly */
      
      to
      
          /*
           * Give up allocations without trying memory reserves if selected
           * as an OOM victim
           */
      
      in __alloc_pages_slowpath() by moving the location of the TIF_MEMDIE
      flag check.  I had noticed this change but I didn't post a patch because
      I thought it was an acceptable change, other than the noise from
      warn_alloc(), because !__GFP_NOFAIL allocations are allowed to fail.
      But we overlooked that failing a memory allocation from the page fault
      path makes a difference due to the race window explained above.
      
      While it might be possible to add a check to pagefault_out_of_memory()
      that prevents the first victim from calling out_of_memory() or remove
      out_of_memory() from pagefault_out_of_memory(), changing
      pagefault_out_of_memory() does not suppress the noise from warn_alloc()
      when the allocating thread was selected as an OOM victim.  There is
      little point in printing similar backtraces and memory information from
      both out_of_memory() and warn_alloc().
      
      Instead, if we guarantee that current thread can try allocations with no
      watermarks once when current thread looping inside
      __alloc_pages_slowpath() was selected as an OOM victim, we can follow "who
      can use memory reserves" rules and suppress noise by warn_alloc() and
      prevent memory allocations from page fault path from calling
      pagefault_out_of_memory().
      
      If we take the comment literally, this patch would do
      
        -    if (test_thread_flag(TIF_MEMDIE))
        -        goto nopage;
        +    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
        +        goto nopage;
      
      because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
      given.  But if I recall correctly (I couldn't find the message), the
      condition is meant to apply to only OOM victims despite the comment.
      Therefore, this patch preserves TIF_MEMDIE check.
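
      One way to read the last two paragraphs is that the literal condition
      from the comment is combined with the preserved TIF_MEMDIE check.  The
      Python sketch below models that reading only; it is an interpretation
      of the message, not a copy of the patch, and the numeric flag values
      and the should_give_up() name are made up for illustration.

```python
ALLOC_NO_WATERMARKS = 0x04    # illustrative value, not taken from the kernel
GFP_NOMEMALLOC = 0x10000      # illustrative stand-in for __GFP_NOMEMALLOC


def should_give_up(is_oom_victim, alloc_flags, gfp_mask):
    """True when the slowpath model would fail the allocation (goto nopage).

    An OOM victim gives up only after it has already tried an allocation
    with no watermarks, or when it is not allowed to use reserves at all.
    """
    if not is_oom_victim:            # stands in for test_thread_flag(TIF_MEMDIE)
        return False
    already_tried_reserves = alloc_flags == ALLOC_NO_WATERMARKS
    reserves_forbidden = bool(gfp_mask & GFP_NOMEMALLOC)
    return already_tried_reserves or reserves_forbidden


if __name__ == "__main__":
    # A freshly selected victim that has not used the reserves yet retries.
    print(should_give_up(True, alloc_flags=0, gfp_mask=0))
    # After one attempt with no watermarks it is allowed to fail.
    print(should_give_up(True, alloc_flags=ALLOC_NO_WATERMARKS, gfp_mask=0))
```

      Under this reading a victim always gets one attempt with no watermarks
      before the allocation is allowed to fail, which is the guarantee named
      in the commit title.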
      
      Fixes: 9a67f648 ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
      Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Roman Gushchin <guro@fb.com>
      Tested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.11]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 09 May, 2017 5 commits
    • V
      mm: introduce memalloc_noreclaim_{save,restore} · 499118e9
      Vlastimil Babka committed
      The previous patch ("mm: prevent potential recursive reclaim due to
      clearing PF_MEMALLOC") has shown that simply setting and clearing
      PF_MEMALLOC in current->flags can result in wrongly clearing a
      pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
      Let's introduce helpers that support proper nesting by saving the
      previous state of the flag, similar to the existing memalloc_noio_* and
      memalloc_nofs_* helpers.  Convert existing setting/clearing of
      PF_MEMALLOC within mm to the new helpers.
      
      There are no known issues with the converted code, but the change makes
      it more robust.
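
      The nesting problem and the save/restore pattern can be shown with a
      small model.  This Python sketch is not the kernel code: the Task class
      stands in for current, the helpers take the task as an explicit
      argument, and the flag value is only illustrative.

```python
PF_MEMALLOC = 0x800   # illustrative bit value


class Task:
    """Stand-in for the current task and its flags word."""

    def __init__(self, flags=0):
        self.flags = flags


def memalloc_noreclaim_save(task):
    """Set PF_MEMALLOC and return the previous state of that bit."""
    old = task.flags & PF_MEMALLOC
    task.flags |= PF_MEMALLOC
    return old


def memalloc_noreclaim_restore(task, old):
    """Put the PF_MEMALLOC bit back to the saved state."""
    task.flags = (task.flags & ~PF_MEMALLOC) | old


def naive_section(task):
    """Old pattern: set the flag, do the work, clear it unconditionally."""
    task.flags |= PF_MEMALLOC
    task.flags &= ~PF_MEMALLOC   # also clears a flag set by an outer caller


def nested_section(task):
    """New pattern: restore whatever state the caller had."""
    old = memalloc_noreclaim_save(task)
    memalloc_noreclaim_restore(task, old)


if __name__ == "__main__":
    outer = Task(PF_MEMALLOC)   # caller already runs with PF_MEMALLOC set
    naive_section(outer)
    print("naive keeps flag:", bool(outer.flags & PF_MEMALLOC))
    outer = Task(PF_MEMALLOC)
    nested_section(outer)
    print("nested keeps flag:", bool(outer.flags & PF_MEMALLOC))
```

      Returning the old bit instead of a boolean lets the restore step be a
      single mask-and-or, and it leaves all other flag bits untouched.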
      
      Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC · 62be1511
      Vlastimil Babka committed
      Patch series "more robust PF_MEMALLOC handling"
      
      This series aims to unify the setting and clearing of PF_MEMALLOC, which
      prevents recursive reclaim.  There are some places that clear the flag
      unconditionally from current->flags, which may result in clearing a
      pre-existing flag.  This already resulted in a bug report that Patch 1
      fixes (without the new helpers, to make backporting easier).  Patch 2
      introduces the new helpers, modelled after existing memalloc_noio_* and
      memalloc_nofs_* helpers, and converts mm core to use them.  Patches 3
      and 4 convert non-mm code.
      
      This patch (of 4):
      
      __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
      during page migration by lock_page() (see the comment in
      __unmap_and_move()).  Then it unconditionally clears the flag, which can
      clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
      This was not a problem until commit a8161d1e ("mm, page_alloc:
      restructure direct compaction handling in slowpath"), because direct
      compaction was called only after direct reclaim, which was skipped when
      PF_MEMALLOC flag was set.
      
      Even now it's only a theoretical issue, as the new callsite of
      __alloc_pages_direct_compact() is reached only for costly orders and
      when gfp_pfmemalloc_allowed() is true, which means either
      __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true.  There is no
      such known context, but let's play it safe and make
      __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
      already set.
      
      Fixes: a8161d1e ("mm, page_alloc: restructure direct compaction handling in slowpath")
      Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      mm, compaction: restrict async compaction to pageblocks of same migratetype · 282722b0
      Vlastimil Babka committed
      The migrate scanner in async compaction is currently limited to
      MIGRATE_MOVABLE pageblocks.  This is a heuristic intended to reduce
      latency, based on the assumption that non-MOVABLE pageblocks are
      unlikely to contain movable pages.
      
      However, with the exception of THPs, most high-order allocations are
      not movable.  Should the async compaction succeed, this increases the
      chance that the non-MOVABLE allocations will fall back to a MOVABLE
      pageblock, making the long-term fragmentation worse.
      
      This patch attempts to help the situation by changing async direct
      compaction so that the migrate scanner only scans the pageblocks of the
      requested migratetype.  If it's a non-MOVABLE type and there are such
      pageblocks that do contain movable pages, chances are that the
      allocation can succeed within one of such pageblocks, removing the need
      for a fallback.  If that fails, the subsequent sync attempt will ignore
      this restriction.
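
      As a rough illustration of the restriction described above, the
      following user-space Python sketch (a toy model, not kernel code)
      filters pageblocks the way the async migrate scanner is described to.
      The function names, the string-valued migratetypes and the mode names
      are hypothetical, chosen only for this example.

```python
MOVABLE, UNMOVABLE, RECLAIMABLE = "movable", "unmovable", "reclaimable"

# Toy zone: (pageblock index, pageblock migratetype). Purely illustrative.
BLOCKS = [(0, MOVABLE), (1, UNMOVABLE), (2, MOVABLE), (3, RECLAIMABLE)]


def suitable_migration_source(block_type, requested_type, mode):
    """Return True if the migrate scanner should scan this pageblock."""
    if mode != "async":
        # The subsequent sync attempt ignores the restriction.
        return True
    # Async: only scan pageblocks of the requested migratetype.
    return block_type == requested_type


def scanned_blocks(blocks, requested_type, mode):
    """List the pageblock indexes the scanner would visit."""
    return [idx for idx, block_type in blocks
            if suitable_migration_source(block_type, requested_type, mode)]
```

      In this model an async attempt for an unmovable allocation visits only
      pageblock 1, while the sync retry visits all four pageblocks.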
      
      In testing based on 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch has reduced
      the number of unmovable allocations falling back to movable pageblocks
      by 30%.  The number of movable allocations falling back is reduced by
      12%.
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      282722b0
    • V
      mm, page_alloc: count movable pages when stealing from pageblock · 02aa0cdd
      Vlastimil Babka authored
      When stealing pages from pageblock of a different migratetype, we count
      how many free pages were stolen, and change the pageblock's migratetype
      if more than half of the pageblock was free.  This might be too
      conservative, as there might be other pages that are not free, but were
      allocated with the same migratetype as our allocation requested.
      
      While we cannot determine the migratetype of allocated pages precisely
      (at least without the page_owner functionality enabled), we can count
      pages that compaction would try to isolate for migration - those are
      either on LRU or __PageMovable().  The rest can be assumed to be
      MIGRATE_RECLAIMABLE or MIGRATE_UNMOVABLE, which we cannot easily
      distinguish.  This counting can be done as part of free page stealing
      with little additional overhead.
      
      The page stealing code is changed so that it considers free pages plus
      pages of the "good" migratetype for the decision whether to change
      pageblock's migratetype.
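
      The decision described above can be sketched as a small user-space
      Python model. This is an assumption-laden illustration, not the kernel
      implementation: the pageblock size of 512 pages, the function name and
      the string migratetypes are made up for the example, and the threshold
      follows the "more than half" wording of this message.

```python
PAGEBLOCK_NR_PAGES = 512  # assumed pageblock size, for illustration only


def should_convert_pageblock(free_pages, movable_pages,
                             requested_type, old_block_type):
    """Decide whether to change the pageblock's migratetype when stealing."""
    if requested_type == "movable":
        # Pages that compaction could isolate (LRU or __PageMovable)
        # count as compatible with a movable request.
        alike_pages = movable_pages
    elif old_block_type == "movable":
        # Everything that is neither free nor movable is assumed to be
        # unmovable or reclaimable, which the model cannot tell apart.
        alike_pages = PAGEBLOCK_NR_PAGES - (free_pages + movable_pages)
    else:
        # Assumption: no extra credit when neither side is movable.
        alike_pages = 0
    # "More than half" as stated in the text; the kernel's exact boundary
    # condition may differ.
    return free_pages + alike_pages > PAGEBLOCK_NR_PAGES // 2
```

      With 100 free and 200 movable pages, a movable request would convert
      the pageblock in this model (300 > 256), whereas counting free pages
      alone would not.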
      
      The result should be more accurate migratetype of pageblocks wrt the
      actual pages in the pageblocks, when stealing from semi-occupied
      pageblocks.  This should help the efficiency of page grouping by
      mobility.
      
      In testing based on 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch has reduced
      the number of unmovable allocations falling back to movable pageblocks
      by 47%.  The number of movable allocations falling back to other
      pageblocks is increased by 55%, but these events don't cause permanent
      fragmentation, so the tradeoff should be positive.  Later patches also
      offset the movable fallback increase to some extent.
      
      [akpm@linux-foundation.org: merge fix]
      Link: http://lkml.kernel.org/r/20170307131545.28577-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02aa0cdd
    • V
      mm, page_alloc: split smallest stolen page in fallback · 3bc48f96
      Vlastimil Babka authored
      The __rmqueue_fallback() function is called when there's no free page of
      requested migratetype, and we need to steal from a different one.
      
      There are various heuristics to make this event infrequent and reduce
      permanent fragmentation.  The main one is to try stealing from a
      pageblock that has the most free pages, and possibly steal them all at
      once and convert the whole pageblock.  Precise searching for such a
      pageblock would be expensive, so instead the heuristic walks the free
      lists from MAX_ORDER down to requested order and assumes that the block
      with highest-order free page is likely to also have the most free pages
      in total.
      
      Chances are that together with the highest-order page, we steal also
      pages of lower orders from the same block.  But then we still split the
      highest order page.  This is wasteful and can contribute to
      fragmentation instead of avoiding it.
      
      This patch thus changes __rmqueue_fallback() to just steal the page(s)
      and put them on the freelist of the requested migratetype, and only
      report whether it was successful.  Then we pick (and eventually split)
      the smallest page with __rmqueue_smallest().  This all happens under
      zone lock, so nobody can steal it from us in the process.  This should
      reduce fragmentation due to fallbacks.  At worst we are only stealing a
      single highest-order page and waste some cycles by moving it between
      lists and then removing it, but fallback is not exactly hot path so that
      should not be a concern.  As a side benefit the patch removes some
      duplicate code by reusing __rmqueue_smallest().
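
      To make the new flow concrete, here is a heavily simplified user-space
      Python model of "steal first, then pick the smallest page". It is a
      sketch under stated assumptions, not the kernel code: the toy
      MAX_ORDER and pageblock order, the list layout and the function names
      are hypothetical, and the model always steals every free page of the
      chosen pageblock, which the kernel does only when its heuristics allow.

```python
from collections import defaultdict

MAX_ORDER = 4        # toy value: orders 0..3
PAGEBLOCK_ORDER = 3  # toy value: one pageblock covers 8 pages


def new_lists():
    """free lists: migratetype -> order -> list of starting pfns."""
    return defaultdict(lambda: defaultdict(list))


def rmqueue_smallest(lists, mt, order):
    """Take the smallest free page >= order from mt, splitting it down."""
    for cur in range(order, MAX_ORDER):
        if lists[mt][cur]:
            pfn = lists[mt][cur].pop(0)
            while cur > order:
                cur -= 1
                lists[mt][cur].append(pfn + (1 << cur))  # upper half stays free
            return pfn
    return None


def rmqueue_fallback(lists, mt, order):
    """Move all free pages of one foreign pageblock to mt; report success."""
    for cur in range(MAX_ORDER - 1, order - 1, -1):  # highest order first
        for other in list(lists):
            if other == mt or not lists[other][cur]:
                continue
            block = lists[other][cur][0] >> PAGEBLOCK_ORDER
            for o in range(MAX_ORDER):
                pages = lists[other][o]
                stolen = [p for p in pages if p >> PAGEBLOCK_ORDER == block]
                lists[other][o] = [p for p in pages if p >> PAGEBLOCK_ORDER != block]
                lists[mt][o].extend(stolen)
            return True
    return False


def rmqueue(lists, mt, order):
    """Allocate from mt, falling back to stealing when mt has nothing."""
    page = rmqueue_smallest(lists, mt, order)
    if page is None and rmqueue_fallback(lists, mt, order):
        page = rmqueue_smallest(lists, mt, order)
    return page
```

      If the movable lists hold an order-2 page at pfn 0 and an order-0 page
      at pfn 5 in the same pageblock, an order-0 unmovable request in this
      model receives pfn 5 and leaves the order-2 page intact on the
      unmovable list, instead of splitting it.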
      
      [vbabka@suse.cz: fix endless loop in the modified __rmqueue()]
        Link: http://lkml.kernel.org/r/59d71b35-d556-4fc9-ee2e-1574259282fd@suse.cz
      Link: http://lkml.kernel.org/r/20170307131545.28577-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bc48f96
  8. 04 May, 2017 8 commits
    • T
      mm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc() · 0f7896f1
      Tetsuo Handa authored
      Commit c0a32fc5 ("mm: more intensive memory corruption debugging")
      changed to check debug_guardpage_minorder() > 0 when reporting
      allocation failures.  The reasoning was
      
        When we use guard page to debug memory corruption, it shrinks
        available pages to 1/2, 1/4, 1/8 and so on, depending on parameter
        value. In such case memory allocation failures can be common and
        printing errors can flood dmesg. If somebody debug corruption,
        allocation failures are not the things he/she is interested about.
      
      but this is misguided.
      
      Allocation requests with __GFP_NOWARN flag by definition do not cause
      flooding of allocation failure messages.  Allocation requests with
      __GFP_NORETRY flag likely also have __GFP_NOWARN flag.  Costly
      allocation requests likely also have __GFP_NOWARN flag.
      
      Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have
      __GFP_NOWARN flag or __GFP_HIGH flag.  Non-costly allocation requests
      with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too
      small to fail" memory-allocation rule.
      
      Therefore, as a whole, shrinking available pages with the
      debug_guardpage_minorder= kernel boot parameter might cause flooding of
      OOM killer messages but is unlikely to cause flooding of allocation
      failure messages.  Let's remove the debug_guardpage_minorder() > 0 check,
      which is likely pointless.
      
      Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f7896f1
    • V
      mm: enable page poisoning early at boot · bd33ef36
      Vinayak Menon authored
      On SPARSEMEM systems page poisoning is enabled after buddy is up,
      because of the dependency on page extension init.  This causes the pages
      released by free_all_bootmem not to be poisoned.  This either delays or
      misses the identification of some issues because the pages have to
      undergo another cycle of alloc-free-alloc for any corruption to be
      detected.
      
      Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
      flag.  Since all the free pages will now be poisoned, the flag need not
      be verified before checking the poison during an alloc.
      
      [vinmenon@codeaurora.org: fix Kconfig]
        Link: http://lkml.kernel.org/r/1490878002-14423-1-git-send-email-vinmenon@codeaurora.org
      Link: http://lkml.kernel.org/r/1490358246-11001-1-git-send-email-vinmenon@codeaurora.org
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd33ef36
    • J
      mm: page_alloc: __GFP_NOWARN shouldn't suppress stall warnings · 82251963
      Johannes Weiner authored
      __GFP_NOWARN, which is usually added to avoid warnings from callsites
      that expect to fail and have fallbacks, currently also suppresses
      allocation stall warnings.  These trigger when an allocation is stuck
      inside the allocator for 10 seconds or longer.
      
      But there is no class of allocations that can get legitimately stuck in
      the allocator for this long.  This always indicates a problem.
      
      Always emit stall warnings.  Restrict __GFP_NOWARN to alloc failures.
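
      The resulting policy can be summarised in a few lines of user-space
      Python. This is only a sketch of the rule stated above; the function
      name, the event names and the flag value are hypothetical and do not
      correspond to kernel identifiers.

```python
GFP_NOWARN = 0x1             # illustrative bit, not the kernel's value
STALL_TIMEOUT_SECONDS = 10.0  # stall threshold quoted in the text


def should_warn(event, gfp_flags, stalled_for=0.0):
    """Return True if a warning should be printed for this event."""
    if event == "stall":
        # Stall warnings are always emitted once the threshold is reached,
        # regardless of the no-warn flag.
        return stalled_for >= STALL_TIMEOUT_SECONDS
    if event == "failure":
        # The no-warn flag only silences allocation failure reports.
        return not (gfp_flags & GFP_NOWARN)
    raise ValueError("unknown event: %r" % (event,))
```

      A caller that passes the no-warn flag still gets a stall warning after
      10 seconds in this model, but no allocation failure report.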
      
      Link: http://lkml.kernel.org/r/20170125181150.GA16398@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82251963
    • M
      mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Michal Hocko authored
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately overuse of this allocation context brings some problems to
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), OOM killer cannot be invoked because the MM layer
      doesn't have enough information about how much memory is freeable by the
      FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain because there are
      far fewer problematic contexts than specific allocation requests, this
      also helps code paths where FS layer interacts with other layers (e.g.
      crypto, security modules, MM etc...) and there is no easy way to convey
      the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS which has been xfs specific until recently.
      There are no PF_FSTRANS users anymore so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
      their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
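
      The scope semantics can be modelled in user space as follows. This is
      a sketch, not the kernel API: the bit values are illustrative, the
      Task class stands in for the current task, the functions take the task
      explicitly, and the assumption that the NOIO scope clears both the IO
      and FS bits is mine rather than something stated in this message.

```python
# Illustrative bit values; the real kernel constants differ.
GFP_IO, GFP_FS = 0x1, 0x2
GFP_KERNEL = GFP_IO | GFP_FS
PF_MEMALLOC_NOFS, PF_MEMALLOC_NOIO = 0x1, 0x2


class Task:
    """Stand-in for the current task; only the per-task flags matter here."""

    def __init__(self):
        self.flags = 0


def memalloc_nofs_save(task):
    """Enter a NOFS scope and return the previous state of the flag."""
    old = task.flags & PF_MEMALLOC_NOFS
    task.flags |= PF_MEMALLOC_NOFS
    return old


def memalloc_nofs_restore(task, old):
    """Leave a NOFS scope, putting back whatever the matching save returned."""
    task.flags = (task.flags & ~PF_MEMALLOC_NOFS) | old


def current_gfp_context(task, gfp):
    """Apply the task's scope flags to an allocation's gfp mask."""
    if task.flags & PF_MEMALLOC_NOIO:
        gfp &= ~(GFP_IO | GFP_FS)  # assumed: NOIO implies NOFS as well
    elif task.flags & PF_MEMALLOC_NOFS:
        gfp &= ~GFP_FS
    return gfp
```

      Because save returns the previous state, scopes nest: restoring an
      inner scope leaves the outer one in force, and only the outermost
      restore lets allocations use the FS bit again.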
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use a properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
    • X
      mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Xishi Qiu authored
      Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().
      
      Simplify the code, no functional changes.
      
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6ffdc07
    • J
      mm: remove unnecessary back-off function when retrying page reclaim · 491d79ae
      Johannes Weiner authored
      The backoff mechanism is not needed.  If we have MAX_RECLAIM_RETRIES
      loops without progress, we'll OOM anyway; backing off might cut one or
      two iterations off that in the rare OOM case.  If we have intermittent
      success reclaiming a few pages, the backoff function gets reset also,
      and so is of little help in these scenarios.
      
      We might want a backoff function for when there IS progress, but not
      enough to be satisfactory.  But this isn't that.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      491d79ae
    • J
      mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Johannes Weiner authored
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c822f622
    • J
      mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Johannes Weiner authored
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
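
      The back-off rule lends itself to a small user-space Python model.
      This is a sketch under assumptions rather than the kernel code: the
      Node class, its failure counter and the function names are
      hypothetical, and the model resets the counter on any successful
      reclaim so that only consecutive failed runs count.

```python
MAX_RECLAIM_RETRIES = 16  # value quoted in the text


class Node:
    """Toy stand-in for a memory node tracked by kswapd."""

    def __init__(self):
        self.kswapd_failures = 0

    def hopeless(self):
        """True once kswapd should stop trying and go to sleep."""
        return self.kswapd_failures >= MAX_RECLAIM_RETRIES


def kswapd_run(node, nr_reclaimed):
    """Account one kswapd balancing run on the node."""
    if nr_reclaimed:
        node.kswapd_failures = 0
    else:
        node.kswapd_failures += 1


def direct_reclaim(node, nr_reclaimed):
    """A direct reclaimer that frees pages proves the node reclaimable."""
    if nr_reclaimed:
        node.kswapd_failures = 0
```

      After 16 runs without reclaiming a page the node is considered
      hopeless and kswapd sleeps; a direct reclaimer that frees pages resets
      the counter and re-enables kswapd.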
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  9. 21 Apr, 2017 1 commit
  10. 08 Apr, 2017 2 commits
  11. 04 Apr, 2017 1 commit
    • S
      ftrace: Have init/main.c call ftrace directly to free init memory · b80f0f6c
      Steven Rostedt (VMware) authored
      Relying on free_reserved_area() to call ftrace to free init memory proved to
      not be sufficient. The issue is that on x86, when debug_pagealloc is
      enabled, the init memory is not freed, but simply set as not present. Since
      ftrace was uninformed of this, starting function tracing still tries to
      update pages that are not present according to the page tables, causing
      ftrace to bug, as well as killing the kernel itself.
      
      Instead of relying on free_reserved_area(), have init/main.c call ftrace
      directly just before it frees the init memory. Then it needs to use
      __init_begin and __init_end to know where the init memory location is.
      Looking at all archs (and testing what I can), it appears that this should
      work for each of them.
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      b80f0f6c
  12. 03 Apr, 2017 1 commit
    • M
      kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      mchehab@s-opensource.com authored
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      0e056eb5
  13. 25 Mar, 2017 1 commit
  14. 09 Mar, 2017 1 commit
    • T
      mm, page_alloc: Add missing check for memory holes · b4fb8f66
      Tony Luck authored
      Commit 13ad59df ("mm, page_alloc: avoid page_to_pfn() when merging
      buddies") moved the check for memory holes out of page_is_buddy() and
      had the callers do the check.
      
      But this wasn't done correctly in one place which caused ia64 to crash
      very early in boot.
      
      Update to fix that and make ia64 boot again.
      
      [ v2: Vlastimil pointed out we don't need to call page_to_pfn()
            since we already have the result of that in "buddy_pfn" ]
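
      The idea of validating the buddy's pfn before looking at the buddy
      page can be illustrated with a toy user-space Python model. It is a
      sketch, not the kernel code: the set of valid pfns, the dictionary of
      free pages and the function names are assumptions made for this
      example; only the XOR used to locate the buddy reflects the standard
      buddy-allocator arithmetic.

```python
def find_buddy_pfn(pfn, order):
    """The buddy of a 2**order block differs only in bit 'order' of its pfn."""
    return pfn ^ (1 << order)


def can_merge(pfn, order, valid_pfns, free_pages):
    """Check whether the block at pfn may merge with its buddy.

    valid_pfns: set of pfns backed by memory (anything else is a hole).
    free_pages: dict mapping the first pfn of a free block to its order.
    """
    buddy_pfn = find_buddy_pfn(pfn, order)
    # Check for a memory hole first, reusing the already computed
    # buddy_pfn, and only then inspect the buddy's state.
    if buddy_pfn not in valid_pfns:
        return False
    return free_pages.get(buddy_pfn) == order
```

      In this model a block at pfn 8 of order 2 has its buddy at pfn 12; if
      pfn 12 falls in a hole, the merge is refused before any buddy state is
      examined.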
      
      Fixes: 13ad59df ("avoid page_to_pfn() when merging buddies")
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4fb8f66
  15. 02 Mar, 2017 1 commit
  16. 28 Feb, 2017 1 commit