1. 25 Feb 2017, 10 commits
    • mm: alloc_contig_range: allow to specify GFP mask · ca96b625
      Authored by Lucas Stach
      Currently alloc_contig_range assumes that the compaction should be done
      with the default GFP_KERNEL flags.  This is probably right for all
      current uses of this interface, but may change as CMA is used in more
      use-cases (including being the default DMA memory allocator on some
      platforms).
      
      Change the function prototype, to allow for passing through the GFP mask
      set by upper layers.
      
      Also respect global restrictions by applying memalloc_noio_flags to the
      passed in flags.
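
      As a caller-side illustration (the wrapper below is invented for this
      note, not part of the patch), the new gfp_mask argument is simply
      forwarded, and memalloc_noio_flags() is applied inside
      alloc_contig_range():

        /* Hypothetical CMA-style caller after the prototype change. */
        static int alloc_dma_range_example(unsigned long start,
                                           unsigned long count, gfp_t gfp)
        {
                return alloc_contig_range(start, start + count,
                                          MIGRATE_CMA, gfp);
        }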
      
      Link: http://lkml.kernel.org/r/20170127172328.18574-1-l.stach@pengutronix.de
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alexander Graf <agraf@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca96b625
    • mm/hotplug: enable memory hotplug for non-lru movable pages · 0efadf48
      Authored by Yisheng Xie
      We had considered all of the non-lru pages as unmovable before commit
      bda807d4 ("mm: migrate: support non-lru movable page migration").
      But now some non-lru pages, such as zsmalloc and virtio-balloon pages,
      have also become movable.  So we can offline such blocks by using
      non-lru page migration.
      
      This patch straightforwardly adds non-lru migration code, i.e. non-lru
      handling in the functions which scan over pfns, collect pages to be
      migrated, and isolate them before migration.
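
      The shape of the added handling is roughly the following (a hedged
      sketch of the isolation step, not the exact hunk):

        /* In the pfn scan (sketch): also isolate movable non-LRU pages. */
        if (PageLRU(page))
                ret = isolate_lru_page(page);
        else if (__PageMovable(page))
                ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
        if (!ret)
                list_add_tail(&page->lru, &source); /* queue for migration */
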
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0efadf48
    • mm, page_alloc: use static global work_struct for draining per-cpu pages · bd233f53
      Authored by Mel Gorman
      As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static
      work_struct to co-ordinate the draining of per-cpu pages on the
      workqueue.  Only one task can drain at a time but this is better than
      the previous scheme that allowed multiple tasks to send IPIs at a time.
      
      One consideration is whether parallel requests should synchronise
      against each other.  This patch does not synchronise for a global drain
      as the common case for such callers is expected to be multiple parallel
      direct reclaimers competing for pages when the watermark is close to
      min.  Draining the per-cpu list is unlikely to make much progress and
      serialising the drain is of dubious merit.  Drains are synchronised for
      callers such as memory hotplug and CMA that care about the drain being
      complete when the function returns.
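
      The coordination looks roughly like this (a sketch; the identifier
      names are how this note refers to them and may not match
      mm/page_alloc.c exactly):

        static DEFINE_MUTEX(pcpu_drain_mutex);
        static DEFINE_PER_CPU(struct work_struct, pcpu_drain);

        /* drain_all_pages() callers that need completion serialise here;
         * the previous scheme let any number of tasks fire IPIs at once. */
        mutex_lock(&pcpu_drain_mutex);
        /* queue the per-cpu work items and flush them (omitted) */
        mutex_unlock(&pcpu_drain_mutex);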
      
      Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd233f53
    • mm, page_alloc: don't check cpuset allowed twice in fast-path · 51047820
      Authored by Vlastimil Babka
      Since commit 682a3385 ("mm, page_alloc: inline the fast path of the
      zonelist iterator") we replace a NULL nodemask with
      cpuset_current_mems_allowed in the fast path, so that
      get_page_from_freelist() filters nodes allowed by the cpuset via
      for_next_zone_zonelist_nodemask().
      
      In that case it's pointless to additionally check __cpuset_zone_allowed()
      in each iteration, which we can avoid by not adding ALLOC_CPUSET to
      alloc_flags in that scenario.
      
      This saves some cycles in the allocator fast path on systems with one or
      more non-root cpuset configured.  In the slow path, ALLOC_CPUSET is
      reset according to __alloc_pages_slowpath().  Without configured
      cpusets, this code is disabled by a static key.
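
      Conceptually the fast-path setup becomes (a sketch of the logic, not
      necessarily the literal hunk):

        if (cpusets_enabled()) {
                *alloc_mask |= __GFP_HARDWALL;
                if (!ac->nodemask)
                        /* Iterator already walks cpuset_current_mems_allowed. */
                        ac->nodemask = &cpuset_current_mems_allowed;
                else
                        /* Explicit nodemask: keep the per-zone cpuset check. */
                        *alloc_flags |= ALLOC_CPUSET;
        }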
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51047820
    • mm, page_alloc: remove redundant checks from alloc fastpath · df76cee6
      Authored by Vlastimil Babka
      The allocation fast path contains two similar checks for zoneref->zone
      being NULL, where zoneref points either to the first zone in the
      zonelist, or to the preferred zone.  These can be NULL either due to
      empty zonelist, or no zone being compatible with given nodemask or
      task's cpuset.
      
      These checks are unnecessary, because the zonelist walks in
      first_zones_zonelist() and get_page_from_freelist() handle a NULL
      starting zoneref->zone or preferred_zoneref->zone safely.  It's safe to
      fall back to __alloc_pages_slowpath() where we also have the check early
      enough.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df76cee6
    • mm, page_alloc: only use per-cpu allocator for irq-safe requests · 374ad05a
      Authored by Mel Gorman
      Many workloads that allocate pages are not handling an interrupt at a
      time.  As allocation requests may be from IRQ context, it's necessary to
      disable/enable IRQs for every page allocation.  This cost is the bulk of
      the free path but also a significant percentage of the allocation path.
      
      This patch alters the locking and checks such that only irq-safe
      allocation requests use the per-cpu allocator.  All others acquire the
      irq-safe zone->lock and allocate from the buddy allocator.  It relies on
      disabling preemption to safely access the per-cpu structures.  It could
      be slightly modified to avoid soft IRQs using it but it's not clear it's
      worthwhile.
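
      In other words, the per-cpu list access changes from an IRQ-save
      critical section to a preempt-disabled one for !irq callers; a rough
      sketch (pick_page_from_pcp_list() is a placeholder of this note, not
      a real helper):

        /* Before: every pcp access paid for local_irq_save/restore. */
        local_irq_save(flags);
        page = pick_page_from_pcp_list(pcp);
        local_irq_restore(flags);

        /* After: !irq callers only need to stay on this CPU. */
        preempt_disable();
        page = pick_page_from_pcp_list(pcp);
        preempt_enable();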
      
      This modification may slow allocations from IRQ context slightly but the
      main gain from the per-cpu allocator is that it scales better for
      allocations from multiple contexts.  There is an implicit assumption
      that intensive allocations from IRQ contexts on multiple CPUs from a
      single NUMA node are rare and that the vast majority of scaling issues
      are encountered in !IRQ contexts such as page faulting.  It's worth
      noting that this patch is not required for a bulk page allocator but it
      significantly reduces the overhead.
      
      The following is results from a page allocator micro-benchmark.  Only
      order-0 is interesting as higher orders do not use the per-cpu allocator
      
                                                4.10.0-rc2                 4.10.0-rc2
                                                   vanilla               irqsafe-v1r5
      Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
      Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
      Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
      Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
      Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
      Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
      Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
      Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
      Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
      Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
      Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
      Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
      Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
      Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
      Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
      Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
      Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
      Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
      Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
      Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
      Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
      Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
      Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
      Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
      Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
      Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
      Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
      Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
      Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
      Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
      Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
      Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
      Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
      Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
      Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
      Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
      Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
      Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
      Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
      Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
      Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
      Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
      Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
      Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
      Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
      
      This is the alloc, free and total overhead of allocating order-0 pages
      in batches of 1 page up to 16384 pages.  Avoiding the IRQ
      disable/enable overhead massively reduces the cost.  Alloc overhead is roughly reduced
      by 14-20% in most cases.  The free path is reduced by 26-46% and the
      total reduction is significant.
      
      Many users require zeroing of pages from the page allocator which is the
      vast cost of allocation.  Hence, the impact on a basic page faulting
      benchmark is not that significant
      
                                    4.10.0-rc2            4.10.0-rc2
                                       vanilla          irqsafe-v1r5
      Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
      Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
      Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
      Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
      CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
      CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
      Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
      Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
      
      This is from aim9 and the most notable outcome is that fault variability
      is reduced by the patch.  The headline improvement is small as the
      overall fault cost, zeroing, page table insertion etc dominate relative
      to disabling/enabling IRQs in the per-cpu allocator.
      
      Similarly, little benefit was seen on networking benchmarks both
      localhost and between physical server/clients where other costs
      dominate.  It's possible that this will only be noticeable on very high
      speed networks.
      
      Jesper Dangaard Brouer independently tested this with a separate
      microbenchmark from
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      Micro-benchmarked with [1] page_bench02:
       modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
        rmmod page_bench02 ; dmesg --notime | tail -n 4
      
      Compared to baseline: 213 cycles(tsc) 53.417 ns
       - against this     : 184 cycles(tsc) 46.056 ns
       - Saving           : -29 cycles
       - Very close to expected 27 cycles saving [see below [2]]
      
      Micro benchmarking via time_bench_sample[3], we get the cost of these
      operations:
      
       time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
       time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
       time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
       time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
       time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
       time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
       time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
       time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
       time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
       [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
       time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
       [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
       time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
       time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
       time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
      
      Thus, expected improvement is: 38-11 = 27 cycles.
      
      [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
        Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      374ad05a
    • mm, page_alloc: do not depend on cpu hotplug locks inside the allocator · a459eeb7
      Authored by Michal Hocko
      Dmitry has reported the following lockdep splat
        lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
        __mutex_lock_common kernel/locking/mutex.c:521 [inline]
        mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
        pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
        __alloc_percpu+0x24/0x30 mm/percpu.c:1075
        smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
        cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
        cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
        _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      pcpu_alloc
        pcpu_alloc_mutex
      
        get_online_cpus+0x62/0x90 kernel/cpu.c:248
        drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
        __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
        __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
        __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
        __alloc_pages include/linux/gfp.h:426 [inline]
        __alloc_pages_node include/linux/gfp.h:439 [inline]
        alloc_pages_node include/linux/gfp.h:453 [inline]
        pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
        pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
        pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
        __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
        bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
        array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
        find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
        map_create kernel/bpf/syscall.c:188 [inline]
        SYSC_bpf kernel/bpf/syscall.c:870 [inline]
        SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
        entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      pcpu_alloc
        pcpu_alloc_mutex
      drain_all_pages
        get_online_cpus
          cpu_hotplug.lock
      
        cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
        _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      
      Pulling cpu hotplug locks inside the page allocator is just too
      dangerous.  Let's remove the dependency by dropping get_online_cpus()
      from drain_all_pages.  This is not so simple though because now we do
      not have a protection against cpu hotplug which means 2 things:
      
        - the work item might be executed on a different cpu by a worker from
          the unbound pool, so it is not pinned to the cpu it is draining
      
        - we have to make sure that we do not race with page_alloc_cpu_dead
          calling drain_pages_zone
      
      Disabling preemption in drain_local_pages_wq solves the first problem:
      drain_local_pages will determine its local CPU from the WQ context,
      which is stable after that point, and page_alloc_cpu_dead is pinned to
      the CPU already.  The latter condition is achieved by disabling IRQs in
      drain_pages_zone.
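
      A sketch of the pinning part (close to, but not guaranteed to be, the
      final helper):

        static void drain_local_pages_wq(struct work_struct *work)
        {
                /*
                 * An unbound worker can migrate between CPUs, so pin
                 * ourselves while looking up "our" per-cpu lists; draining
                 * a different CPU is harmless, being migrated mid-drain
                 * would not be.
                 */
                preempt_disable();
                drain_local_pages(NULL);
                preempt_enable();
        }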
      
      Fixes: mm, page_alloc: drain per-cpu pages from workqueue context
      Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a459eeb7
    • mm, page_alloc: drain per-cpu pages from workqueue context · 0ccce3b9
      Authored by Mel Gorman
      The per-cpu page allocator can be drained immediately via
      drain_all_pages() which sends IPIs to every CPU.  In the next patch, the
      per-cpu allocator will only be used for interrupt-safe allocations which
      prevents draining it from IPI context.  This patch uses workqueues to
      drain the per-cpu lists instead.
      
      This is slower but no slowdown during intensive reclaim was measured and
      the paths that use drain_all_pages() are not that sensitive to
      performance.  This is particularly true as the path would only be
      triggered when reclaim is failing.  It also makes some sense to avoid
      storming a machine with IPIs when it's under memory pressure.  Arguably,
      it should be further adjusted so that only one caller at a time is
      draining pages but it's beyond the scope of the current patch.
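
      Schematically the change is from an IPI broadcast to queued work (a
      sketch; the "after" side glosses over how the per-cpu pcpu_drain work
      items are allocated, see the static work_struct entry above):

        /* Before: an IPI to every CPU that has pcp pages. */
        on_each_cpu_mask(&cpus_with_pcps,
                         (smp_call_func_t)drain_local_pages, zone, 1);

        /* After: queue a work item per CPU and wait for completion. */
        for_each_cpu(cpu, &cpus_with_pcps)
                schedule_work_on(cpu, per_cpu_ptr(&pcpu_drain, cpu));
        for_each_cpu(cpu, &cpus_with_pcps)
                flush_work(per_cpu_ptr(&pcpu_drain, cpu));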
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ccce3b9
    • mm, page_alloc: split alloc_pages_nodemask() · 9cd75558
      Authored by Mel Gorman
      alloc_pages_nodemask does a number of preparation steps that determine
      what zones can be used for the allocation depending on a variety of
      factors.  This is fine but a hypothetical caller that wanted multiple
      order-0 pages has to do the preparation steps multiple times.  This
      patch structures __alloc_pages_nodemask such that it's relatively easy
      to build a bulk order-0 page allocator.  There is no functional change.
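
      The resulting shape of __alloc_pages_nodemask() is roughly the
      following (helper names as introduced by the patch to the best of my
      knowledge, exact parameter lists approximated):

        if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask,
                                 &ac, &alloc_mask, &alloc_flags))
                return NULL;
        finalise_ac(gfp_mask, order, &ac);

        /* The attempt itself is now a separate, repeatable step. */
        page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
        if (unlikely(!page))
                page = __alloc_pages_slowpath(alloc_mask, order, &ac);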
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cd75558
    • mm, page_alloc: split buffered_rmqueue() · 066b2393
      Authored by Mel Gorman
      Patch series "Use per-cpu allocator for !irq requests and prepare for a
      bulk allocator", v5.
      
      This series is motivated by a conversation led by Jesper Dangaard Brouer
      at the last LSF/MM proposing a generic page pool for DMA-coherent pages.
      Part of his motivation was due to the overhead of allocating multiple
      order-0 pages that led some drivers to use high-order allocations and
      splitting them.  This is very slow in some cases.
      
      The first two patches in this series restructure the page allocator such
      that it is relatively easy to introduce an order-0 bulk page allocator.
      A patch exists to do that and has been handed over to Jesper until an
      in-kernel user is created.  The third patch prevents the per-cpu
      allocator being drained from IPI context as that can potentially corrupt
      the list after patch four is merged.  The final patch alters the per-cpu
      allocator to make it exclusive to !irq requests.  This cuts
      allocation/free overhead by roughly 30%.
      
      Performance tests from both Jesper and me are included in the patch.
      
      This patch (of 4):
      
      buffered_rmqueue removes a page from a given zone and uses the per-cpu
      list for order-0.  This is fine but a hypothetical caller that wanted
      multiple order-0 pages has to disable/reenable interrupts multiple
      times.  This patch structures buffered_rmqueue such that it's relatively
      easy to build a bulk order-0 page allocator.  There is no functional
      change.
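
      A simplified sketch of the split (parameter lists approximated; the
      real buddy branch also updates counters and rechecks the page):

        static inline struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
        {
                unsigned long flags;
                struct page *page;

                if (likely(order == 0))
                        /* Per-cpu list path, now a self-contained helper. */
                        return rmqueue_pcplist(preferred_zone, zone, order,
                                               gfp_flags, migratetype);

                /* Higher orders keep taking zone->lock and the buddy lists. */
                spin_lock_irqsave(&zone->lock, flags);
                page = __rmqueue(zone, order, migratetype);
                spin_unlock_irqrestore(&zone->lock, flags);
                return page;
        }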
      
      [mgorman@techsingularity.net: failed per-cpu refill may blow up]
        Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      066b2393
  2. 23 Feb 2017, 13 commits
    • mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled · 685dbf6f
      Authored by David Rientjes
      The patch "mm, page_alloc: warn_alloc print nodemask" implicitly sets
      the allocation nodemask to cpuset_current_mems_allowed when there is no
      effective mempolicy.  cpuset_current_mems_allowed is only effective when
      cpusets are enabled, which is also printed by warn_alloc(), so setting
      the nodemask to cpuset_current_mems_allowed is redundant and prevents
      debugging issues where ac->nodemask is not set properly in the page
      allocator.
      
      This provides better debugging output since
      cpuset_print_current_mems_allowed() is already provided.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701181347320.142399@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      685dbf6f
    • mm: help __GFP_NOFAIL allocations which do not trigger OOM killer · 6c18ba7a
      Authored by Michal Hocko
      Now that __GFP_NOFAIL doesn't override decisions to skip the oom killer
      we are left with requests which require to loop inside the allocator
      without invoking the oom killer (e.g.  GFP_NOFS|__GFP_NOFAIL used by fs
      code) and so they might, in very unlikely situations, loop for ever -
      e.g.  other parallel request could starve them.
      
      This patch tries to limit the likelihood of such a lockup by giving
      these __GFP_NOFAIL requests a chance to move on by consuming a small
      part of memory reserves.  We are using ALLOC_HARDER which should be
      enough to prevent from the starvation by regular allocation requests,
      yet it shouldn't consume enough from the reserves to disrupt high
      priority requests (ALLOC_HIGH).
      
      While we are at it, let's introduce a helper __alloc_pages_cpuset_fallback
      which enforces the cpusets but allows falling back to ignoring them if the
      first attempt fails.  __GFP_NOFAIL requests can be considered important
      enough to allow cpuset runaway in order for the system to move on.  It
      is highly unlikely that any of these will be GFP_USER anyway.
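
      A sketch of the helper's intent (the exact flag juggling in the real
      function may differ):

        static struct page *
        __alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
                                      unsigned int alloc_flags,
                                      const struct alloc_context *ac)
        {
                struct page *page;

                /* First try to respect the cpuset... */
                page = get_page_from_freelist(gfp_mask, order,
                                              alloc_flags | ALLOC_CPUSET, ac);
                /* ...then ignore it rather than fail a __GFP_NOFAIL request. */
                if (!page)
                        page = get_page_from_freelist(gfp_mask, order,
                                                      alloc_flags, ac);
                return page;
        }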
      
      Link: http://lkml.kernel.org/r/20161220134904.21023-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c18ba7a
    • mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically · 06ad276a
      Authored by Michal Hocko
      __alloc_pages_may_oom makes sure to skip the OOM killer depending on the
      allocation request.  This includes lowmem requests, costly high order
      requests and others.  For a long time __GFP_NOFAIL acted as an override
      for all those rules.  This is not documented and it can be quite
      surprising as well.  E.g.  GFP_NOFS requests do not invoke the OOM
      killer but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
      the existing open coded loops around the allocator to nofail requests
      (and we have done that in the past), such a change would have a
      non-trivial side effect which is far from obvious.  Note that the
      primary motivation for skipping the OOM killer is to prevent its
      premature invocation.
      
      The exception has been added by commit 82553a93 ("oom: invoke oom
      killer for __GFP_NOFAIL").  The changelog points out that the oom killer
      has to be invoked otherwise the request would be looping for ever.  But
      this argument is rather weak because the OOM killer doesn't really
      guarantee a forward progress for those exceptional cases:
      
      - it will hardly help to form costly order which in turn can result in
        the system panic because of no oom killable task in the end - I believe
        we certainly do not want to put the system down just because there is a
        nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
        the consequences.  It is much better this request would loop for ever
        than the massive system disruption
      
      - lowmem is also highly unlikely to be freed during OOM killer
      
      - GFP_NOFS request could trigger while there is still a lot of memory
        pinned by filesystems.
      
      This patch simply removes the __GFP_NOFAIL special case in order to have a
      more clear semantic without surprising side effects.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06ad276a
    • mm: consolidate GFP_NOFAIL checks in the allocator slowpath · 9a67f648
      Authored by Michal Hocko
      Tetsuo Handa has pointed out that commit 0a0337e0 ("mm, oom: rework
      oom detection") has subtly changed semantic for costly high order
      requests with __GFP_NOFAIL and without __GFP_REPEAT and those can fail
      right now.  My code inspection didn't reveal any such users in the tree
      but it is true that this might lead to unexpected allocation failures
      and subsequent OOPs.
      
      __alloc_pages_slowpath wrt.  GFP_NOFAIL is hard to follow currently.
      There are few special cases but we are lacking a catch all place to be
      sure we will not miss any case where the non failing allocation might
      fail.  This patch reorganizes the code a bit and puts all those special
      cases under nopage label which is the generic go-to-fail path.  Non
      failing allocations are retried or those that cannot retry like
      non-sleeping allocation go to the failure point directly.  This should
      make the code flow much easier to follow and make it less error prone
      for future changes.
      
      While we are there we have to move the stall check up to catch
      potentially looping non-failing allocations.
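
      A rough skeleton of the resulting control flow (statement details and
      the warn_alloc() signature at this point in the series are
      approximated):

        nopage:
                if (gfp_mask & __GFP_NOFAIL) {
                        /* Non-sleeping __GFP_NOFAIL cannot loop: warn, fail. */
                        if (WARN_ON_ONCE(!can_direct_reclaim))
                                goto fail;
                        /* All other non-failing requests are retried. */
                        WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
                        cond_resched();
                        goto retry;
                }
        fail:
                warn_alloc(gfp_mask, "page allocation failure: order:%u", order);
        got_pg:
                return page;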
      
      [akpm@linux-foundation.org: fix alloc_flags may-be-used-uninitalized]
      Link: http://lkml.kernel.org/r/20161220134904.21023-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a67f648
    • lib/show_mem.c: teach show_mem to work with the given nodemask · 9af744d7
      Authored by Michal Hocko
      show_mem() allows filtering out node-specific data which is irrelevant
      to the allocation request via SHOW_MEM_FILTER_NODES.  The filtering is
      done in skip_free_areas_node which skips all nodes which are not in the
      mems_allowed of the current process.  This works most of the time as
      expected because the nodemask shouldn't be outside of the allocating
      task but there are some exceptions.  E.g.  memory hotplug might want to
      request allocations from outside of the allowed nodes (see
      new_node_page).
      
      Get rid of this hardcoded behavior and push the allocation mask down the
      show_mem path and use it instead of cpuset_current_mems_allowed.  NULL
      nodemask is interpreted as cpuset_current_mems_allowed.
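
      The interface change, schematically (the warn_alloc()-style call site
      is illustrative):

        /* NULL nodemask falls back to cpuset_current_mems_allowed. */
        void show_mem(unsigned int filter, nodemask_t *nodemask);

        show_mem(SHOW_MEM_FILTER_NODES, ac->nodemask);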
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9af744d7
    • mm, page_alloc: warn_alloc print nodemask · a8e99259
      Authored by Michal Hocko
      warn_alloc is currently used to report an allocation failure or an
      allocation stall.  We print some details of the allocation request like
      the gfp mask and the request order.  We do not print the allocation
      nodemask which is important when debugging the reason for the allocation
      failure as well.  We already print the nodemask in the OOM report.
      
      Add nodemask to warn_alloc and print it in warn_alloc as well.
      
      Link: http://lkml.kernel.org/r/20170117091543.25850-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8e99259
    • mm, page_alloc: do not report all nodes in show_mem · c02e50bb
      Authored by Michal Hocko
      Patch series "show_mem updates", v2.
      
      This is a mixture of one bug fix (patch 1), an enhancement (patch 2) and
      cleanups (the rest of the series).  First two patches should be really
      straightforward.  Patch 3 removes some arch specific show_mem
      implementations because I think they are quite outdated and do not
      really serve any useful purpose anymore.  I think we should really
      strive to have a consistent show_mem output regardless of the
      architecture.  If some architecture is really special and wants to dump
      something additional we should do that via an arch specific hook.
      
      The last patch adds nodemask parameter so that we do not rely on the
      hardcoded mems_allowed of the current task when doing the node
      filtering.  I consider this more a cleanup than a fix because basically
      all users use a nodemask which is a subset of mems_allowed.  There is
      only one call path in the memory hotplug which doesn't comply with this
      but that is hardly something to worry about.
      
      This patch (of 4):
      
      Commit 599d0c95 ("mm, vmscan: move LRU lists to node") has added per
      numa node statistics to show_mem but it forgot to add
      skip_free_areas_node to filter out nodes which are outside of the
      allocating task numa policy.  Add this check to not pollute the output
      with the pointless information.
      
      Link: http://lkml.kernel.org/r/20170117091543.25850-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c02e50bb
    • mm: page_alloc: skip over regions of invalid pfns where possible · b92df1de
      Authored by Paul Burton
      When using a sparse memory model, memmap_init_zone() invoked with
      the MEMMAP_EARLY context will skip over pages which aren't valid - ie.
      which aren't in a populated region of the sparse memory map.  However if
      the memory map is extremely sparse then it can spend a long time
      linearly checking each PFN in a large non-populated region of the memory
      map & skipping it in turn.
      
      When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient
      information to quickly discover the next valid PFN given an invalid one
      by searching through the list of memory regions & skipping forwards to
      the first PFN covered by the memory region to the right of the
      non-populated region.  Implement this in order to speed up
      memmap_init_zone() for systems with extremely sparse memory maps.
      
      James said "I have tested this patch on a virtual model of a Samurai CPU
      with a sparse memory map.  The kernel boot time drops from 109 to
      62 seconds."
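
      The memmap_init_zone() fast-forward looks approximately like this
      (guarded by CONFIG_HAVE_MEMBLOCK_NODE_MAP in the real code):

        if (!early_pfn_valid(pfn)) {
                /*
                 * Jump to the first valid pfn of the next memblock region;
                 * -1 because the loop's pfn++ will land exactly on it.
                 */
                pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
                continue;
        }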
      
      Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.com
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Tested-by: James Hartley <james.hartley@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b92df1de
    • oom, trace: add compaction retry tracepoint · 65190cff
      Authored by Michal Hocko
      Debugging OOM situations for higher order requests is currently quite
      hard.  We do have some compaction trace points which can tell us how
      the compaction is operating, but there is no trace point to tell us
      about the compaction retry logic.  This patch adds one which will have
      the following format
      
                  bash-3126  [001] ....  1498.220001: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=0
      
      We can see that the order-9 request is not retried even though we are in
      the highest compaction priority mode because the last compaction attempt
      was withdrawn.  This means that compaction_zonelist_suitable must have
      returned false and there is no suitable zone to compact for this request
      and so no need to retry further.
      
      another example would be
                 <...>-3137  [001] ....    81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0
      
      in this case the order-9 compaction failed to find any suitable block.
      We do not retry anymore because this is a costly request and those do
      not go below COMPACT_PRIO_SYNC_LIGHT priority.
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65190cff
    • oom, trace: add oom detection tracepoints · d379f01d
      Authored by Michal Hocko
      should_reclaim_retry is the central decision point for declaring the
      OOM.  It might be really useful to expose data used for this decision
      making when debugging unexpected oom situations.
      
      Say we have an OOM report:
      [   52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
      [   52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G        W       4.8.0-oomtrace3-00006-gb21338b386d2 #1024
      
      Now we can check the tracepoint data to see how we have ended up in this
      situation:
             mem_eater-3148  [003] ....    52.432801: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11134 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433269: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11103 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433712: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11100 min_wmark=11084 no_progress_loops=2 wmark_check=1
             mem_eater-3148  [003] ....    52.434067: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11097 min_wmark=11084 no_progress_loops=3 wmark_check=1
             mem_eater-3148  [003] ....    52.434414: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11094 min_wmark=11084 no_progress_loops=4 wmark_check=1
             mem_eater-3148  [003] ....    52.434761: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11091 min_wmark=11084 no_progress_loops=5 wmark_check=1
             mem_eater-3148  [003] ....    52.435108: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11087 min_wmark=11084 no_progress_loops=6 wmark_check=1
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11084 min_wmark=11084 no_progress_loops=7 wmark_check=0
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA order=0 reclaimable=0 available=1126 min_wmark=179 no_progress_loops=7 wmark_check=0
      
      The above shows that we can quickly deduce that the reclaim stopped
      making any progress (see no_progress_loops increased in each round) and
      while there were still some 51 reclaimable pages they couldn't be
      dropped for some reason (vmscan trace points would tell us more about
      that part).  available will represent reclaimable + free_pages scaled
      down per no_progress_loops factor.  This is essentially an optimistic
      estimate of how much memory we would have when reclaiming everything.
      This can be compared to min_wmark to get a rough idea but the
      wmark_check tells the result of the watermark check which is more
      precise (includes lowmem reserves, considers the order etc.).  As we can
      see no zone is eligible in the end and that is why we have triggered the
      oom in this situation.
      
      Please note that higher order requests might fail on the wmark_check
      even when there is much more memory available than min_wmark - e.g.
      when the memory is fragmented.  A follow up tracepoint will help to
      debug those situations.
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d379f01d
    • mm, page_alloc: avoid page_to_pfn() when merging buddies · 13ad59df
      Authored by Vlastimil Babka
      On architectures that allow memory holes, page_is_buddy() has to perform
      page_to_pfn() to check for the memory hole.  After the previous patch,
      we have the pfn already available in __free_one_page(), which is the
      only caller of page_is_buddy(), so move the check there and avoid
      page_to_pfn().
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13ad59df
    • mm, page_alloc: don't convert pfn to idx when merging · 76741e77
      Authored by Vlastimil Babka
      In __free_one_page() we do the buddy merging arithmetics on "page/buddy
      index", which is just the lower MAX_ORDER bits of pfn.  The operations
      we do that affect the higher bits are bitwise AND and subtraction (in
      that order), where the final result will be the same with the higher
      bits left unmasked, as long as these bits are equal for both buddies -
      which must be true by the definition of a buddy.
      
      We can therefore use pfns directly instead of "index" and skip the
      zeroing of >MAX_ORDER bits.  This can help a bit by itself, although
      compiler might be smart enough already.  It also helps the next patch to
      avoid page_to_pfn() for memory hole checks.
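
      The merge arithmetic on raw pfns then looks roughly like this (a
      sketch of the idea rather than the verbatim diff):

        buddy_pfn = pfn ^ (1 << order);     /* flip the order bit           */
        combined_pfn = buddy_pfn & pfn;     /* pfn of the merged pair       */
        buddy = page + (buddy_pfn - pfn);
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;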
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76741e77
    • mm: throttle show_mem() from warn_alloc() · aa187507
      Authored by Michal Hocko
      Tetsuo has been stressing OOM killer path with many parallel allocation
      requests when he has noticed that it is not all that hard to swamp
      kernel logs with warn_alloc messages caused by allocation stalls.  Even
      though the allocation stall message is triggered only once in 10s there
      might be many different tasks hitting it roughly around the same time.
      
      A big part of the output is show_mem() which can generate a lot of
      output even on a small machine.  There is no reason to show the state
      of memory counter for each allocation stall, especially when multiple of
      them are reported in a short time period.  Chances are that not much has
      changed since the last report.  This patch simply rate limits show_mem
      called from warn_alloc to only dump something once per second.  This
      should be enough to give us a clue why an allocation might be stalling
      while burst of warnings will not swamp log with too much data.
      
      While we are at it, extract all the show_mem related handling (filters)
      into a separate function warn_alloc_show_mem.  This will make the code
      cleaner and as a bonus point we can distinguish which part of warn_alloc
      got throttled due to rate limiting as ___ratelimit dumps the caller.
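
      The throttle itself is the standard ratelimit pattern; a sketch,
      assuming the pre-nodemask show_mem() signature at this point in the
      series:

        static void warn_alloc_show_mem(gfp_t gfp_mask)
        {
                unsigned int filter = SHOW_MEM_FILTER_NODES;
                static DEFINE_RATELIMIT_STATE(show_mem_rs, HZ, 1);

                if (!__ratelimit(&show_mem_rs))
                        return;
                /* gfp-mask based filter tweaks omitted from this sketch */
                show_mem(filter);
        }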
      
      [akpm@linux-foundation.org: reduce scope of the ratelimit_states]
      Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa187507
  3. 25 Jan 2017, 5 commits
  4. 11 Jan 2017, 5 commits
  5. 15 Dec 2016, 1 commit
    • mm: add support for releasing multiple instances of a page · 44fdffd7
      Authored by Alexander Duyck
      Add a function that allows us to batch free a page that has multiple
      references outstanding.  Specifically this function can be used to drop
      a page being used in the page frag alloc cache.  With this drivers can
      make use of functionality similar to the page frag alloc cache without
      having to do any workarounds for the fact that there is no function that
      frees multiple references.
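
      Conceptually the new helper boils down to a batched reference drop (a
      sketch with its own name; the real function also special-cases
      order-0 pages so they go back through the per-cpu lists):

        void page_frag_drain_sketch(struct page *page, unsigned int order,
                                    unsigned int count)
        {
                /* Drop 'count' references at once; free on the last one. */
                if (page_ref_sub_and_test(page, count))
                        __free_pages_ok(page, order);
        }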
      
      Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Keguang Zhang <keguang.zhang@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Tobias Klauser <tklauser@distanz.ch>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44fdffd7
  6. 13 Dec 2016, 5 commits
    • mm, page_alloc: keep pcp count and list contents in sync if struct page is corrupted · a6de734b
      Authored by Mel Gorman
      Vlastimil Babka pointed out that commit 479f854a ("mm, page_alloc:
      defer debugging checks of pages allocated from the PCP") will allow the
      per-cpu list counter to be out of sync with the per-cpu list contents if
      a struct page is corrupted.
      
      The consequence is an infinite loop if the per-cpu lists get fully
      drained by free_pcppages_bulk because all the lists are empty but the
      count is positive.  The infinite loop occurs here
      
                      do {
                              batch_free++;
                              if (++migratetype == MIGRATE_PCPTYPES)
                                      migratetype = 0;
                              list = &pcp->lists[migratetype];
                      } while (list_empty(list));
      
      What the user sees is a bad page warning followed by a soft lockup with
      interrupts disabled in free_pcppages_bulk().
      
      This patch keeps the accounting in sync.
      
      Fixes: 479f854a ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
      Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6de734b
    • mm: make unreserve highatomic functions reliable · 29fac03b
      Authored by Minchan Kim
      Currently, unreserve_highatomic_pageblock bails out as soon as it finds
      a highatomic pageblock, regardless of whether it actually moved any
      free pages off of it, which can defeat the unreserve logic's goal of
      saving a process from OOM.
      
      This patch makes the unreserve function bail out only if it moves some
      pages out of the !highatomic free list, to avoid such false positives.
      
      Another potential problem is that, due to a race between page freeing
      and the reserve-highatomic function, pages could sit on the highatomic
      free list even though the pageblock is not of highatomic migratetype.
      In that case, unreserve_highatomic_pageblock can be a no-op if the
      count of highatomic reserves is less than pageblock_nr_pages.  We could
      solve it simply by draining all of the reserved pages before the OOM.
      That acts as a safeguard, exhausting the reserved pages before
      converging to OOM.
      
      Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29fac03b
    • mm: try to exhaust highatomic reserve before the OOM · 04c8716f
      Authored by Minchan Kim
      I got an OOM report from the production team with a v4.4 kernel.  It
      had enough free memory but failed to allocate a GFP_KERNEL order-0
      page and finally encountered an OOM kill.  It occurred during a QA
      process which launches several apps, switches between them and so on.
      It happened rarely.  IOW, in a normal situation it was not a problem,
      but if we are unlucky so that several apps use peak memory at the same
      time, it can happen.  If we manage to pass that phase, the system
      keeps working well.
      
      I could reproduce it easily with my test (a memory spike).  Look at below.
      
      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded because of the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      It is odd that the zone shows enough free memory above the min
      watermark yet OOMs on a 4K GFP_KERNEL allocation because of the
      reserved highatomic pages.  As a last resort, try to unreserve the
      highatomic pages once more, and if that moved pages onto a
      non-highatomic free list, retry reclaim one more time.
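
      The retry logic can be pictured with the following toy C model (a
      hedged sketch only; the function names and decision flow are
      simplified assumptions, not the allocator's real API):

        #include <stdbool.h>
        #include <stdio.h>

        static bool reclaim_made_progress(void)
        {
                return false;                 /* placeholder: reclaim found nothing */
        }

        static bool unreserve_highatomic_pages(bool force)
        {
                static long reserved = 19404 / 4;  /* ~19M of reserved 4K pages */

                if (!force || reserved == 0)
                        return false;
                reserved = 0;                 /* drained onto normal free lists */
                return true;
        }

        /* Before declaring OOM, drain the highatomic reserve and retry once. */
        static bool should_retry_before_oom(void)
        {
                if (reclaim_made_progress())
                        return true;          /* normal retry path */
                if (unreserve_highatomic_pages(true))
                        return true;          /* pages moved: give reclaim another go */
                return false;                 /* really out of options: OOM */
        }

        int main(void)
        {
                printf("attempt 1: %s\n",
                       should_retry_before_oom() ? "retry reclaim" : "OOM kill");
                printf("attempt 2: %s\n",
                       should_retry_before_oom() ? "retry reclaim" : "OOM kill");
                return 0;
        }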
      
      Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04c8716f
    • M
      mm: prevent double decrease of nr_reserved_highatomic · 4855e4a7
      Committed by Minchan Kim
      There is a race between page freeing and unreserving highatomic pageblocks:
      
       CPU 0				    CPU 1
      
          free_hot_cold_page
            mt = get_pfnblock_migratetype
            set_pcppage_migratetype(page, mt)
          				    unreserve_highatomic_pageblock
          				    spin_lock_irqsave(&zone->lock)
          				    move_freepages_block
          				    set_pageblock_migratetype(page)
          				    spin_unlock_irqrestore(&zone->lock)
            free_pcppages_bulk
              __free_one_page(mt) <- mt is stale
      
      Because of the race above, the page on CPU 0 can end up on a
      non-highorderatomic free list since the pageblock's type has been
      changed.  As a result, the highorderatomic unreserve logic can decrease
      the reserved count for the same pageblock several times, creating a
      mismatch between nr_reserved_highatomic and the actual number of
      reserved pageblocks.
      
      So this patch verifies whether the pageblock is highatomic and
      decreases the count only if it is.
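
      A minimal, self-contained C model of that check (field and type names
      are made up for illustration; only the accounting rule matches the
      description above):

        #include <stdio.h>

        enum toy_migratetype { TOY_MOVABLE, TOY_HIGHATOMIC };

        #define TOY_PAGEBLOCK_NR_PAGES 512

        struct toy_zone {
                long nr_reserved_highatomic;
        };

        /*
         * Only decrease the reserve accounting if the pageblock is still
         * marked highatomic; a racy free may hand us a pageblock whose type
         * has already been changed, and decrementing again would make
         * nr_reserved_highatomic disagree with the real number of pageblocks.
         */
        static void toy_unreserve_pageblock(struct toy_zone *z,
                                            enum toy_migratetype *mt)
        {
                if (*mt == TOY_HIGHATOMIC) {
                        z->nr_reserved_highatomic -= TOY_PAGEBLOCK_NR_PAGES;
                        *mt = TOY_MOVABLE;    /* pageblock converted exactly once */
                }
                /* !highatomic pageblock: move its pages, leave the counter alone */
        }

        int main(void)
        {
                struct toy_zone z = { .nr_reserved_highatomic = TOY_PAGEBLOCK_NR_PAGES };
                enum toy_migratetype mt = TOY_HIGHATOMIC;

                toy_unreserve_pageblock(&z, &mt);  /* first call: counter drops to 0 */
                toy_unreserve_pageblock(&z, &mt);  /* racy second call: no double decrease */
                printf("nr_reserved_highatomic = %ld\n", z.nr_reserved_highatomic);
                return 0;
        }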
      
      Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4855e4a7
    • M
      mm: don't steal highatomic pageblock · 88ed365e
      Committed by Minchan Kim
      Patch series "use up highorder free pages before OOM", v3.
      
      I got an OOM report from our production team on a v4.4 kernel.  The
      system had enough free memory but failed a GFP_KERNEL order-0
      allocation and finally hit an OOM kill.  It occurred during a QA
      process that launches several apps, switches between them, and so on,
      and it happened only rarely.  In other words, it was not a problem in
      normal situations, but if we are unlucky and several apps hit their
      peak memory use at the same time, it can happen.  If the system
      manages to get past that phase, it keeps working fine.
      
      I could reproduce it easily with my test (a memory spike).  See below.
      
      The reason is that the DMA32 zone's free pages (19M) are reserved for
      HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded because of the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      During the investigation I found several problems with highatomic, so
      this series aims to solve them; the final goal is to unreserve every
      highatomic free page before the OOM kill.
      
      This patch (of 4):
      
      In the page freeing path the migratetype is racy, so a highorderatomic
      page can be freed onto a non-highorderatomic free list.  If that page
      is then allocated, the VM can change the pageblock from highorderatomic
      to some other type.  In that case the highatomic pageblock accounting
      is broken and stops working (e.g., the VM can no longer reserve
      highorderatomic pageblocks even though it has not reached the 1% limit).
      
      So this patch prohibits changing a pageblock from highatomic to any
      other type.  That is fine because MIGRATE_HIGHATOMIC is not listed in
      the fallback array, so stealing can only happen through unexpected
      races, which are really rare.  Prohibiting it also keeps highatomic
      pageblocks around longer, which is better for highorderatomic page
      allocations.
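
      A toy C model of the rule (an assumption-laden sketch; the real check
      sits in the allocator's fallback/steal path, not in a helper like this):

        #include <stdbool.h>
        #include <stdio.h>

        enum toy_migratetype { TOY_UNMOVABLE, TOY_MOVABLE, TOY_HIGHATOMIC };

        /*
         * When a fallback allocation wants to steal a whole pageblock, never
         * convert a highatomic pageblock to another type: doing so would break
         * the nr_reserved_highatomic accounting and silently shrink the reserve.
         */
        static bool toy_can_steal_pageblock(enum toy_migratetype block_type)
        {
                if (block_type == TOY_HIGHATOMIC)
                        return false;         /* keep the reserve intact */
                return true;                  /* other types may be converted */
        }

        int main(void)
        {
                printf("steal MOVABLE pageblock:    %d\n",
                       toy_can_steal_pageblock(TOY_MOVABLE));
                printf("steal HIGHATOMIC pageblock: %d\n",
                       toy_can_steal_pageblock(TOY_HIGHATOMIC));
                return 0;
        }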
      
      Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88ed365e
  7. 12 Nov 2016, 1 commit