1. 21 May 2016, 40 commits
    • mm, kasan: don't call kasan_krealloc() from ksize(). · 4ebb31a4
      Authored by Alexander Potapenko
      Instead of calling kasan_krealloc(), which replaces the memory
      allocation stack ID (if stack depot is used), just unpoison the whole
      memory chunk.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ebb31a4
    • mm: kasan: initial memory quarantine implementation · 55834c59
      Authored by Alexander Potapenko
      Quarantine isolates freed objects in a separate queue.  The objects are
      returned to the allocator later, which helps to detect use-after-free
      errors.
      
      When the object is freed, its state changes from KASAN_STATE_ALLOC to
      KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
      instead of being returned to the allocator, therefore every subsequent
      access to that object triggers a KASAN error, and the error handler is
      able to say where the object has been allocated and deallocated.
      
      When it's time for the object to leave quarantine, its state becomes
      KASAN_STATE_FREE and it's returned to the allocator.  From now on the
      allocator may reuse it for another allocation.  Before that happens,
      it's still possible to detect a use-after-free on that object (it
      retains the allocation/deallocation stacks).
      
      When the allocator reuses this object, the shadow is unpoisoned and old
      allocation/deallocation stacks are wiped.  Therefore a use of this
      object, even an incorrect one, won't trigger an ASan warning.

      Without the quarantine, it's not guaranteed that the objects aren't
      reused immediately; that's why the probability of catching a
      use-after-free is lower than with quarantine in place.
      
      Freed objects are first added to per-cpu quarantine queues.  When a
      cache is destroyed or memory shrinking is requested, the objects are
      moved into the global quarantine queue.  Whenever a kmalloc call allows
      memory reclaiming, the oldest objects are popped out of the global queue
      until the total size of objects in quarantine is less than 3/4 of the
      maximum quarantine size (which is a fraction of installed physical
      memory).
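
      As an illustration only (names, sizes and the queue layout below are
      hypothetical, not the kernel's actual API), the trimming of the global
      queue described above boils down to popping the oldest entries until the
      accounted size drops below 3/4 of the configured maximum:

        /* Standalone sketch of the "shrink to 3/4 of the cap" policy. */
        #include <stdlib.h>

        struct qobj { struct qobj *next; size_t size; };

        static struct qobj *q_head, *q_tail;    /* FIFO: oldest at the head */
        static size_t q_size, q_max = 1 << 20;  /* the real cap is a fraction of RAM */

        void quarantine_put_sketch(struct qobj *obj)
        {
                obj->next = NULL;
                if (q_tail)
                        q_tail->next = obj;
                else
                        q_head = obj;
                q_tail = obj;
                q_size += obj->size;
        }

        void quarantine_reduce_sketch(void)
        {
                while (q_size > q_max / 4 * 3 && q_head) {
                        struct qobj *oldest = q_head;

                        q_head = oldest->next;
                        if (!q_head)
                                q_tail = NULL;
                        q_size -= oldest->size;
                        free(oldest);  /* stands in for handing it back to the allocator */
                }
        }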
      
      As long as an object remains in the quarantine, KASAN is able to report
      accesses to it, so the chance of reporting a use-after-free is
      increased.  Once the object leaves quarantine, the allocator may reuse
      it, in which case the object is unpoisoned and KASAN can't detect
      incorrect accesses to it.
      
      Right now quarantine support is only enabled in the SLAB allocator.
      Unification of KASAN features in SLAB and SLUB will be done later.
      
      This patch is based on the "mm: kasan: quarantine" patch originally
      prepared by Dmitry Chernenkov.  A number of improvements have been
      suggested by Andrey Ryabinin.
      
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55834c59
    • mm, migrate: increment fail count on ENOMEM · dfef2ef4
      Authored by David Rientjes
      If page migration fails due to -ENOMEM, nr_failed should still be
      incremented for proper statistics.
      
      This was encountered recently when all page migration vmstats showed 0,
      leading to the inference that migrate_pages() was never called, although
      in reality the first page migration failed because compaction_alloc()
      failed to find a migration target.
      
      This patch increments nr_failed so the vmstat is properly accounted on
      ENOMEM.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfef2ef4
    • mm/compaction.c: fix zoneindex in kcompactd() · 6cd9dc3e
      Authored by Chen Feng
      While testing kcompactd on my platform (3G of memory, DMA zone only), I
      found that kcompactd never woke up.  The zone index has already had 1
      subtracted earlier, so the traversal here should use <=.

      This fixes a regression where kswapd could previously compact but
      kcompactd could not.  It is not a crash fix, though.
      
      [akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh]
      Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com
      Fixes: accf6242 ("mm, kswapd: replace kswapd compaction with waking up kcompactd")
      Signed-off-by: Chen Feng <puck.chen@hisilicon.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zhuangluan Su <suzhuangluan@hisilicon.com>
      Cc: Yiping Xu <xuyiping@hisilicon.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cd9dc3e
    • mm, thp: khugepaged should scan when sleep value is written · f0508977
      Authored by David Rientjes
      If a large value is written to scan_sleep_millisecs, for example, that
      period must lapse before khugepaged will wake up for periodic
      collapsing.
      
      If this value is tuned to 1 day, for example, and then re-tuned to its
      default 10s, khugepaged will still wait for a day before scanning again.
      
      This patch causes khugepaged to wakeup immediately when the value is
      changed and then sleep until that value is rewritten or the new value
      lapses.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0508977
    • MM: increase safety margin provided by PF_LESS_THROTTLE · a53eaff8
      Authored by NeilBrown
      When nfsd is exporting a filesystem over NFS which is then NFS-mounted
      on the local machine there is a risk of deadlock.  This happens when
      there are lots of dirty pages in the NFS filesystem and they cause NFSD
      to be throttled, either in throttle_vm_writeout() or in
      balance_dirty_pages().
      
      To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
      and it provides a 25% increase to the limits that affect NFSD.  Any
      process writing to an NFS filesystem will be throttled well before the
      number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
      will not deadlock on pages that it needs to write out.  At least it
      shouldn't.
      
      All processes are allowed a small excess margin to avoid performing too
      many calculations: ratelimit_pages.
      
      ratelimit_pages is set so that if a thread on every CPU uses the entire
      margin, the total will only go 3% over the limit, and this is much less
      than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
      shouldn't be a problem.  But it is.
      
      The "total memory" that these 3% and 25% are calculated against are not
      really total memory but are "global_dirtyable_memory()" which doesn't
      include anonymous memory, just free memory and page-cache memory.
      
      The "ratelimit_pages" number is based on whatever the
      global_dirtyable_memory was on the last CPU hot-plug, which might not be
      what you expect, but is probably close to the total freeable memory.
      
      The throttle threshold uses the global_dirtyable_memory() at the moment
      when the throttling happens, which could be much less than at the last
      CPU hotplug.  So if lots of anonymous memory has been allocated, thus
      pushing out lots of page-cache pages, then NFSD might end up being
      throttled due to dirty NFS pages because the "25%" bonus it gets is
      calculated against a rather small amount of dirtyable memory, while the
      "3%" margin that other processes are allowed to dirty without penalty is
      calculated against a much larger number.
      
      To remove this possibility of deadlock we need to make sure that the
      margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
      Simply adding ratelimit_pages isn't enough as that should be multiplied
      by the number of cpus.
      
      So add "global_wb_domain.dirty_limit / 32" as that more accurately
      reflects the current total over-shoot margin.  This ensures that the
      number of dirty NFS pages never gets so high that nfsd will be throttled
      waiting for them to be written.
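
      A minimal sketch of the resulting headroom (illustrative only; the names
      mirror the description above rather than the exact kernel code):

        /* Dirty threshold granted to a PF_LESS_THROTTLE task: the usual 25%
         * bonus plus 1/32 of the global dirty limit, so the bonus always
         * exceeds the ratelimit_pages * nr_cpus slack that ordinary dirtiers
         * may accumulate. */
        unsigned long less_throttle_thresh(unsigned long thresh,
                                           unsigned long domain_dirty_limit)
        {
                return thresh + thresh / 4 + domain_dirty_limit / 32;
        }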
      
      Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a53eaff8
    • mm: check_new_page_bad() directly returns in __PG_HWPOISON case · e570f56c
      Authored by Naoya Horiguchi
      Currently we check page->flags twice for the "HWPoisoned" case in
      check_new_page_bad(), which can cause a race with unpoisoning.

      This race unnecessarily taints the kernel with "BUG: Bad page state".
      check_new_page_bad() is the only caller of bad_page() which is
      interested in __PG_HWPOISON, so let's move the hwpoison-related code in
      bad_page() to it.
      
      Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e570f56c
    • mm, kasan: fix to call kasan_free_pages() after poisoning page · 29b52de1
      Authored by seokhoon.yoon
      When CONFIG_PAGE_POISONING and CONFIG_KASAN are enabled,
      free_pages_prepare()'s code flow is as follows:

        1) kmemcheck_free_shadow()
        2) kasan_free_pages()
           - marks the page's shadow bytes as freed
        3) kernel_poison_pages()
        3.1) checks, via KASAN, whether access to the page is valid
           ---> an error occurs here: KASAN considers the access invalid
        3.2) poisons the page
        4) kernel_map_pages()

      So kasan_free_pages() should be called after poisoning the page.
      
      Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
      Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29b52de1
    • mm: disable fault around on emulated access bit architecture · d0834a6c
      Authored by Minchan Kim
      fault_around aims to reduce minor faults of file-backed pages via
      speculative ahead-of-time pte mapping, relying on readahead logic.
      However, on architectures without a hardware access bit the benefit is
      highly limited, because they have to emulate the young bit with minor
      faults for reclaim's page aging algorithm.  IOW, we cannot reduce minor
      faults on those architectures.

      I did a quick test on my ARM machine.
      
      512M file mmap sequential every word read on eSATA drive 4 times.
      stddev is stable.
      
        = fault_around 4096 =
        elapsed time(usec): 6747645
      
        = fault_around 65536 =
        elapsed time(usec): 6709263
      
        0.5% gain.
      
      Even when I tested it with eMMC there was no gain, because I guess with
      slow storage the major fault is the dominant factor.

      Also, fault_around has the side effect of shrinking slab more
      aggressively and causing higher vmpressure, so if such speculation fails,
      it can evict more slab, which can result in page I/O (e.g., inode cache).
      In the end, it can void any benefit of fault_around.
      
      So let's make the default "disabled" on those architectures.
      
      Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0834a6c
    • mm: make faultaround produce old ptes · 5c0a85fa
      Authored by Kirill A. Shutemov
      Currently, the faultaround code produces young ptes.  This can screw up
      vmscan behaviour[1], as it makes vmscan think that these pages are hot
      and not push them out on the first round.

      During sparse file access, faultaround gets more pages mapped, and all
      of them are young.  Under memory pressure, this makes vmscan swap out
      anon pages instead, or drop other page cache pages which would otherwise
      stay resident.
      
      Modify faultaround to produce old ptes, so they can easily be reclaimed
      under memory pressure.
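
      Conceptually (the helper below is illustrative; pte_mkold()/pte_mkyoung()
      are the real primitives, the wrapper and its arguments are not), only
      the pte for the address that actually faulted is made young, while the
      speculatively mapped neighbours start out old:

        /* Sketch: choose the accessed-bit state for a faultaround pte. */
        static pte_t faultaround_pte_sketch(pte_t entry, bool is_faulting_addr)
        {
                if (is_faulting_addr)
                        return pte_mkyoung(entry);  /* real fault: mark accessed */
                return pte_mkold(entry);            /* speculative: cheap to reclaim */
        }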
      
      This can to some extent defeat the purpose of faultaround on machines
      without a hardware accessed bit, as it will not help us with reducing
      the number of minor page faults.
      
      We may want to disable faultaround on such machines altogether, but
      that's a subject for a separate patchset.
      
      Minchan:
       "I tested 512M mmap sequential word read test on non-HW access bit
        system (i.e., ARM) and confirmed it doesn't increase minor fault any
        more.
      
        old: 4096 fault_around
        minor fault: 131291
        elapsed time: 6747645 usec
      
        new: 65536 fault_around
        minor fault: 131291
        elapsed time: 6709263 usec
      
        0.56% benefit"
      
      [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
      
      Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c0a85fa
    • mm: use phys_addr_t for reserve_bootmem_region() arguments · 4b50bcc7
      Authored by Stefan Bader
      Since commit 92923ca3 ("mm: meminit: only set page reserved in the
      memblock region") the reserved bit is set on reserved memblock regions.
      However, the start and end addresses are passed as unsigned long, which
      is only 32 bits wide on i386, so this can end up marking the wrong pages
      reserved for ranges at 4GB and above.
      
      This was observed on a 32-bit Xen dom0 which was booted with initial
      memory set to a value below 4G but allowed to balloon in more memory
      (dom0_mem=1024M for example).  This would define a reserved bootmem
      region for the additional memory (for example, on an 8GB system there
      was a reserved region covering the 4GB-8GB range).  But since the
      addresses were passed on as unsigned long, this actually marked all
      pages from 0 to 4GB as reserved.
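
      A small standalone illustration of the truncation (this assumes a build
      where unsigned long is 32 bits wide, as on i386; it is not the kernel
      code itself):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t phys_start = 0x100000000ULL;       /* the 4GB mark */
                uint32_t truncated = (uint32_t)phys_start;  /* what a 32-bit
                                                               unsigned long keeps */

                /* prints: full: 0x100000000, truncated: 0 */
                printf("full: %#llx, truncated: %#x\n",
                       (unsigned long long)phys_start, truncated);
                return 0;
        }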
      
      Fixes: 92923ca3 ("mm: meminit: only set page reserved in the memblock region")
      Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b50bcc7
    • mm/memblock.c: remove unnecessary always-true comparison · cd33a76b
      Authored by Richard Leitner
      Comparing a u64 variable against >= 0 is always true, so the comparison
      can be removed.  This issue was detected using the -Wtype-limits gcc flag.
      
      This patch fixes following type-limits warning:
      
        mm/memblock.c: In function `__next_reserved_mem_region':
        mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
          if (*idx >= 0 && *idx < type->cnt) {
      
      Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net
      Signed-off-by: Richard Leitner <dev@g0hl1n.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd33a76b
    • z3fold: the 3-fold allocator for compressed pages · 9a001fc1
      Authored by Vitaly Wool
      This patch introduces z3fold, a special-purpose allocator for storing
      compressed pages.  It is designed to store up to three compressed pages
      per physical page.  It is a ZBUD derivative which allows for a higher
      compression ratio while keeping the simplicity and determinism of its
      predecessor.
      
      This patch comes as a follow-up to the discussions at the Embedded Linux
      Conference in San Diego related to the talk [1].  The outcome of these
      discussions was that it would be good to have a compressed page
      allocator as stable and deterministic as zbud with a higher compression
      ratio.

      To keep the determinism and simplicity, z3fold, just like zbud, always
      stores an integral number of compressed pages per page, but it can store
      up to 3 pages, unlike zbud which can store at most 2.  Therefore the
      compression ratio goes up to around 2.6x while zbud's is around 1.7x.
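
      Conceptually, the per-page bookkeeping only has to remember how much of
      the page each of the three slots occupies (a sketch with hypothetical
      field names, not the actual z3fold structures):

        /* One physical page hosts up to three compressed objects plus a small
         * header; object sizes are tracked in fixed-size chunks. */
        struct z3fold_page_sketch {
                unsigned short first_chunks;   /* object at the start of the page */
                unsigned short middle_chunks;  /* object in the middle, if any    */
                unsigned short last_chunks;    /* object packed at the end        */
        };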
      
      The patch is based on the latest linux.git tree.
      
      This version has been updated after testing on various simulators (e.g.
      ARM Versatile Express, MIPS Malta, x86_64/Haswell) and based on comments
      from Dan Streetman [3].
      
      [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
      [2] https://lkml.org/lkml/2016/4/21/799
      [3] https://lkml.org/lkml/2016/5/4/852
      
      Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a001fc1
    • mm: thp: split_huge_pmd_address() comment improvement · d5ee7c3b
      Authored by Andrea Arcangeli
      The comment is partly wrong; this improves it by covering the case where
      split_huge_pmd_address() is called by try_to_unmap_one() when
      TTU_SPLIT_HUGE_PMD is set.
      
      Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5ee7c3b
    • vmstat: get rid of the ugly cpu_stat_off variable · 7b8da4c7
      Authored by Christoph Lameter
      The cpu_stat_off variable is unnecessary since we can check whether a
      workqueue request is pending instead.  Removal of cpu_stat_off makes
      it pretty easy for the vmstat shepherd to ensure that the proper things
      happen.

      Removing the state also removes all races related to it.  Should a
      workqueue not be scheduled as needed for vmstat_update then the shepherd
      will notice and schedule it as needed.  Should a workqueue be
      unnecessarily scheduled then the vmstat updater will disable it.
      
      [akpm@linux-foundation.org: fix indentation, per Michal]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b8da4c7
    • memcg: fix stale mem_cgroup_force_empty() comment · 51038171
      Authored by Greg Thelen
      Commit f61c42a7 ("memcg: remove tasks/children test from
      mem_cgroup_force_empty()") removed memory reparenting from the function.
      
      Fix the function's comment.
      
      Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51038171
    • mm: use existing helper to convert "on"/"off" to boolean · 2a138dc7
      Authored by Minfei Huang
      It's more convenient to use the existing helper function to convert the
      strings "on"/"off" to a boolean.
      
      Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a138dc7
    • mm/swap.c: put activate_page_pvecs and other pagevecs together · a4a921aa
      Authored by Ming Li
      Put the activate_page_pvecs definition next to those of the other
      pagevecs, for clarity.
      Signed-off-by: Ming Li <mingli199x@qq.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4a921aa
    • mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size · 297880f4
      Authored by David Rientjes
      The page_counter rounds limits down to page size values.  This makes
      sense, except in the case of hugetlb_cgroup where it's not possible to
      charge partial hugepages.  If the hugetlb_cgroup margin is less than the
      hugepage size being charged, it will fail as expected.
      
      Round the hugetlb_cgroup limit down to hugepage size, since it is the
      effective limit of the cgroup.
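
      The rounding itself is plain integer arithmetic (an illustrative sketch,
      names hypothetical):

        /* Clamp a byte limit down to a whole number of hugepages. */
        unsigned long long round_to_hugepages(unsigned long long limit_bytes,
                                              unsigned long long hpage_bytes)
        {
                return (limit_bytes / hpage_bytes) * hpage_bytes;
        }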
      
      For consistency, round down PAGE_COUNTER_MAX as well when a
      hugetlb_cgroup is created: this prevents error reports when a user
      cannot restore the value to the kernel default.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      297880f4
    • mm: enable RLIMIT_DATA by default with workaround for valgrind · f4fcd558
      Authored by Konstantin Khlebnikov
      Since commit 84638335 ("mm: rework virtual memory accounting")
      RLIMIT_DATA limits both brk() and private mmap(), but this is disabled
      by default because of incompatibility with older versions of valgrind.

      Valgrind always sets the limit to zero and fails if RLIMIT_DATA is
      enabled.  Fortunately it changes only rlim_cur and keeps rlim_max for
      reverting the limit back when needed.

      This patch checks current usage also against rlim_max if rlim_cur is
      zero.  This is safe because the task can raise rlim_cur up to rlim_max
      anyway.  The size of brk is still checked against rlim_cur, so this part
      is completely compatible - zero rlim_cur forbids brk() but allows
      private mmap().
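
      A minimal sketch of the fallback check for the mmap() accounting
      (illustrative only; the real check lives in the mm accounting code and
      uses different names):

        #include <stdbool.h>

        /* Allow growth if it fits rlim_cur, or rlim_max when rlim_cur is 0
         * (the value old valgrind installs). */
        bool data_limit_ok(unsigned long long new_data_bytes,
                           unsigned long long rlim_cur,
                           unsigned long long rlim_max)
        {
                unsigned long long limit = rlim_cur ? rlim_cur : rlim_max;

                return new_data_bytes <= limit;
        }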
      
      Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4fcd558
    • mm: fix incorrect pfn passed to untrack_pfn() in remap_pfn_range() · d5957d2f
      Authored by Yongji Xie
      We use generic hooks in remap_pfn_range() to help architectures track
      pfnmap regions.  The code is something like:
      
        int remap_pfn_range()
        {
      	...
      	track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
      	...
      	pfn -= addr >> PAGE_SHIFT;
      	...
      	untrack_pfn(vma, pfn, PAGE_ALIGN(size));
      	...
        }
      
      Here we can easily see that pfn is modified but not restored before
      untrack_pfn() is called, which is incorrect.
      
      There are no known runtime effects - this is from inspection.
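
      One straightforward shape of the fix, sketched in the same style as the
      snippet above (illustrative, not necessarily the exact committed diff),
      is to remember the original pfn for the error path:

        int remap_pfn_range()
        {
      	unsigned long remap_pfn = pfn;	/* keep the untouched value */
      	...
      	track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
      	...
      	pfn -= addr >> PAGE_SHIFT;
      	...
      	untrack_pfn(vma, remap_pfn, PAGE_ALIGN(size));	/* not the shifted pfn */
      	...
        }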
      Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5957d2f
    • mm/vmalloc: keep a separate lazy-free list · 80c4bd7a
      Authored by Chris Wilson
      When mixing lots of vmallocs and set_memory_*() (which calls
      vm_unmap_aliases()) I encountered situations where the performance
      degraded severely due to the walking of the entire vmap_area list each
      invocation.
      
      One simple improvement is to add the lazily freed vmap_area to a
      separate lockless free list, such that we then avoid having to walk the
      full list on each purge.
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Pen <r.peniaev@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Shawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80c4bd7a
    • mm/memblock.c: move memblock_{add,reserve}_region into memblock_{add,reserve} · f705ac4b
      Authored by Alexander Kuleshov
      memblock_add_region() and memblock_reserve_region() do nothing specific
      before calling memblock_add_range(); they only print debug output.
      
      We can do the same in memblock_add() and memblock_reserve() since both
      memblock_add_region() and memblock_reserve_region() are not used by
      anybody outside of memblock.c and memblock_{add,reserve}() have the same
      set of flags and nids.
      
      Since memblock_add_region() and memblock_reserve_region() will be
      inlined, there are no functional changes, but code readability improves
      a little.
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f705ac4b
    • mm/memory-failure.c: replace "MCE" with "Memory failure" · 495367c0
      Authored by Chen Yucong
      HWPoison was originally specific to some particular x86 platforms, and
      it was often seen as a high-level machine check handler.  Therefore,
      'MCE' was used as the prefix for printk() output.  However, 'PowerNV'
      has also used HWPoison for handling memory errors[1], so 'MCE' is no
      longer a suitable prefix for memory_failure.c.

      Additionally, 'MCE' and 'Memory failure' have different contexts.  The
      former belongs to exception context and the latter to process context.
      Furthermore, HWPoison can also be used for offlining sub-healthy pages
      that do not trigger any machine check exception.

      This patch replaces 'MCE' with a more appropriate prefix.
      
      [1] commit 75eb3d9b ("powerpc/powernv: Get FSP memory errors
      and plumb into memory poison infrastructure.")
      Signed-off-by: Chen Yucong <slaoub@gmail.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      495367c0
    • mm: thp: simplify the implementation of mk_huge_pmd() · 340a43be
      Authored by Yang Shi
      The implementation of mk_huge_pmd() looks verbose; it can be simplified
      to a single line of code.
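
      For reference, the kind of one-liner this refers to (a sketch based on
      the existing mk_pmd()/pmd_mkhuge() helpers, not necessarily the exact
      committed code):

        static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
        {
                return pmd_mkhuge(mk_pmd(page, prot));
        }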
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      340a43be
    • mm,oom: speed up select_bad_process() loop · f44666b0
      Authored by Tetsuo Handa
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct, it was quite inefficient because
      each thread group was scanned num_threads times, which can be a lot
      especially for processes with many threads.  Even though the OOM path is
      an extremely cold path, it is always good to be as efficient as possible
      when we are inside rcu_read_lock() - i.e. a non-preemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and check it at
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f44666b0
    • mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Authored by Michal Hocko
      Tetsuo has properly noted that the mmput slow path might get blocked
      waiting for another party (e.g.  exit_aio waits for an IO).  If that
      happens the oom_reaper would be put out of the way and will not be able
      to process the next oom victim.  We should strive to make this context
      as reliable and as independent of other subsystems as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has, in most cases, already reclaimed the
      victim's address space as much as possible, and the remaining context
      shouldn't bind too much memory anymore.  The only exception is when the
      mmap_sem trylock has failed, which shouldn't happen too often.
      
      The issue is only theoretical but not impossible.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec8d7c14
    • mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully · bb8a4b7f
      Authored by Michal Hocko
      Commit 36324a99 ("oom: clear TIF_MEMDIE after oom_reaper managed to
      unmap the address space") not only clears TIF_MEMDIE for the oom reaped
      task but also sets OOM_SCORE_ADJ_MIN for the target task to hide it from
      the oom killer.  This works in simple cases but it is not sufficient for
      (unlikely) cases where the mm is shared between independent processes
      (as they do not share the signal struct).  If the mm had only a small
      amount of memory which could be reaped then another task sharing the mm
      could be selected and that wouldn't help to move out from the oom
      situation.

      Introduce the MMF_OOM_REAPED mm flag which is checked in oom_badness()
      (same as OOM_SCORE_ADJ_MIN) and the task is skipped if the flag is set.
      Set the flag after __oom_reap_task is done with a task.  This forces
      select_bad_process() to ignore all already oom reaped tasks and ensures
      that no such task is sacrificed on behalf of its parent.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb8a4b7f
    • mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION · 31e49bfd
      Authored by Michal Hocko
      Joonsoo has reported that he is able to trigger OOM for !costly high
      order requests (a heavy fork() workload close to OOM) with the new oom
      detection rework.  This is because we rely only on should_reclaim_retry
      when compaction is disabled, and it only checks watermarks for the
      requested order, so we might trigger OOM when there is a lot of free
      memory.
      
      It is not very clear what the usual workloads are when compaction is
      disabled.  Relying heavily on high order allocations without any
      mechanism to create those orders except for an unbounded amount of
      reclaim is certainly not a good idea.
      
      To prevent potential regressions, let's help this configuration some.
      We have to sacrifice determinism though, because none is possible here.
      The should_compact_retry implementation for !CONFIG_COMPACTION, which
      was empty so far, will do a watermark check for order-0 on all eligible
      zones.  This will cause retrying until either the reclaim cannot make
      any further progress or all the zones are depleted even for order-0
      pages.  This means that the number of retries is basically unbounded for
      !costly orders, but that was the case before the rework as well, so this
      shouldn't regress.
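
      Roughly (an illustrative sketch with simplified types, not the exact
      kernel implementation), the !CONFIG_COMPACTION variant amounts to:

        #include <stdbool.h>

        struct zone_sketch { unsigned long free_pages, watermark; };

        /* Retry as long as at least one eligible zone still passes its
         * order-0 watermark; give up only once every zone is depleted. */
        bool should_compact_retry_sketch(const struct zone_sketch *zones,
                                         int nr_zones)
        {
                for (int i = 0; i < nr_zones; i++)
                        if (zones[i].free_pages > zones[i].watermark)
                                return true;
                return false;
        }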
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.org
      Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31e49bfd
    • mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Authored by Michal Hocko
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable
      the patch did miss a mis interaction between reclaim, compaction and the
      retry logic.  The direct reclaim tries to get zones over min watermark
      while compaction backs off and returns COMPACT_SKIPPED when all zones
      are below low watermark + 1<<order gap.  If we are getting really close
      to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
      high order request (e.g.  hugetlb order-9) while the reclaim is not able
      to release enough pages to get us over low watermark.  The reclaim is
      still able to make some progress (usually trashing over few remaining
      pages) so we are not able to break out from the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready" but it shows how things might go wrong when we
      approach the oom event horizon.
      
      The reason why compaction requires being over the low rather than the
      min watermark is not clear to me.  This check has been there essentially
      since 56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail though, and
      we shouldn't pull it into the generic retry logic; we should be able to
      cope with such an eventuality.  The only place in should_compact_retry
      where we retry without any upper bound is the compaction_withdrawn()
      case.
      
      Introduce the compaction_zonelist_suitable function which checks the
      given zonelist and returns true only if there is at least one zone which
      would unblock __compaction_suitable if more memory got reclaimed.  In
      this implementation it checks __compaction_suitable with NR_FREE_PAGES
      plus part of the reclaimable memory as the target for the watermark
      check.  The reclaimable memory is reduced linearly by the allocation
      order.  The idea is that we do not want to reclaim all the remaining
      memory for a single allocation request just to unblock
      __compaction_suitable, which doesn't guarantee we will make further
      progress.
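
      As a sketch of the watermark target only (simplified; per-zone snapshots,
      watermark gaps and similar details are omitted, and the exact discount
      may differ from the kernel's):

        /* Pages that could plausibly become available for this zone if reclaim
         * kept going: current free pages plus a share of the reclaimable pages
         * that shrinks as the requested order grows. */
        unsigned long compaction_target_sketch(unsigned long free_pages,
                                               unsigned long reclaimable_pages,
                                               unsigned int order)
        {
                return free_pages + reclaimable_pages / (order ? order : 1);
        }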
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided so we do not retry if there is no outlook for a further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require to have at least 64kB on the reclaimable LRUs while
      order-9 would need at least 32M which should be enough to not lock up.
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86a294a8
    • mm: consider compaction feedback also for costly allocation · 7854ea6c
      Authored by Michal Hocko
      PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
      should_reclaim_retry currently where we decide to not retry after at
      least order worth of pages were reclaimed or the watermark check for at
      least one zone would succeed after reclaiming all pages if the reclaim
      hasn't made any progress.  Compaction feedback is mostly ignored and we
      just try to make sure that the compaction did at least something before
      giving up.
      
      The first condition was added by a41f24ea ("page allocator: smarter
      retry of costly-order allocations") and it assumed that lumpy reclaim
      could have created a page of the sufficient order.  Lumpy reclaim has
      been removed quite some time ago so the assumption doesn't hold anymore.
      Remove the check for the number of reclaimed pages and rely on the
      compaction feedback solely.  should_reclaim_retry now only makes sure
      that we keep retrying reclaim for high order pages if they are hidden by
      the watermarks, so that order-0 reclaim really makes sense.

      should_compact_retry now keeps retrying even for the costly allocations.
      The number of retries is reduced w.r.t. !costly requests because they
      are less important and harder to grant and so their pressure shouldn't
      cause contention for other requests or cause over-reclaim.  We also do
      not reset no_progress_loops for costly requests to make sure we do not
      keep reclaiming too aggressively.
      
      This has been tested by running a process which fragments memory:
      	- compact memory
      	- mmap large portion of the memory (1920M on 2GRAM machine with 2G
      	  of swapspace)
      	- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
      	  steps until certain amount of memory is freed (250M in my test)
      	  and reduce the step to (step / 2) + 1 after reaching the end of
      	  the mapping
      	- then run a script which populates the page cache 2G (MemTotal)
      	  from /dev/zero to a new file
      And then tries to allocate
      nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
      huge pages.
      
      root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
      Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
      Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437
      
      * This is the /proc/buddyinfo after the compaction
      
      Done fragmenting. size=2013265920 freed=262144000
      Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0
      
      * /proc/buddyinfo after memory got fragmented
      
      Executing "/root/alloc_hugepages.sh"
      Eating some pagecache
      508623+0 records in
      508623+0 records out
      2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0
      
      * /proc/buddyinfo after page cache got eaten
      
      Trying to allocate 129
      129
      
      * 129 hugepages requested and all of them granted.
      
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0
      
      * /proc/buddyinfo after hugetlb allocation.
      
      10 runs will behave as follows:
      Trying to allocate 130
      130
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 132
      132
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      
      So basically 100% success for all 10 attempts.
      Without the patch numbers looked much worse:
      Trying to allocate 128
      12
      --
      Trying to allocate 129
      14
      --
      Trying to allocate 129
      7
      --
      Trying to allocate 129
      16
      --
      Trying to allocate 129
      30
      --
      Trying to allocate 129
      38
      --
      Trying to allocate 129
      19
      --
      Trying to allocate 129
      37
      --
      Trying to allocate 129
      28
      --
      Trying to allocate 129
      37
      
      Just for completeness, the base kernel without the oom detection rework
      looks as follows:
      Trying to allocate 127
      30
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      52
      --
      Trying to allocate 128
      32
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      10
      --
      Trying to allocate 129
      32
      --
      Trying to allocate 128
      14
      --
      Trying to allocate 128
      16
      --
      Trying to allocate 129
      8
      
      As we can see, the success rate is much more volatile and smaller
      without this patch.  So the patch not only makes the retry logic for
      costly requests more sensible, the success rate is also higher.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7854ea6c
    • mm, oom: protect !costly allocations some more · 33c2d214
      Authored by Michal Hocko
      should_reclaim_retry will give up retries for higher order allocations
      if none of the eligible zones has any requested or higher order pages
      available, even if we pass the watermark check for order-0.  This is
      done because there is no guarantee that the reclaimable and currently
      free pages will form the required order.
      
      This can, however, lead to situations where the high-order request (e.g.
      order-2 required for the stack allocation during fork) will trigger OOM
      too early - e.g.  after the first reclaim/compaction round.  Such a
      system would have to be highly fragmented and there is no guarantee that
      further reclaim/compaction attempts would help, but at least make sure
      that the compaction was active before we go OOM, and keep retrying even
      if should_reclaim_retry tells us to oom, if
      
      	- the last compaction round backed off or
      	- we haven't completed at least MAX_COMPACT_RETRIES active
      	  compaction rounds.
      
      The first rule ensures that the very last attempt for compaction was not
      ignored while the second guarantees that the compaction has done some
      work.  Multiple retries might be needed to prevent occasional
      piggybacking of other contexts stealing the compacted pages before the
      current context manages to retry to allocate them.
      
      compaction_failed() is taken as a final word from the compaction that
      the retry doesn't make much sense.  We have to be careful though because
      the first compaction round is MIGRATE_ASYNC which is rather weak as it
      ignores pages under writeback and gives up too easily in other
      situations.  We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
      has been used before we give up.  With this logic in place we do not
      have to increase the migration mode unconditionally and rather do it
      only if the compaction failed for the weaker mode.  A nice side effect
      is that the stronger migration mode is used only when really needed so
      this has a potential of smaller latencies in some cases.
      
      Please note that the compaction doesn't tell us much about how
      successful it was when returning compaction_made_progress so we just
      have to blindly trust that another retry is worthwhile and cap the
      number to something reasonable to guarantee a convergence.
      
      If the given number of successful retries is not sufficient for
      reasonable workloads we should focus on the collected compaction
      tracepoint data and try to address the issue in the compaction code.
      If this is not feasible we can increase the retries limit.
      
      [mhocko@suse.com: fix warning]
        Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33c2d214
    • mm: throttle on IO only when there are too many dirty and writeback pages · ede37713
      Authored by Michal Hocko
      wait_iff_congested has been used to throttle the allocator before it
      retried another round of direct reclaim, to allow the writeback to make
      some progress and to prevent reclaim from looping over dirty/writeback
      pages without making any progress.
      
      We used to do congestion_wait before commit 0e093d99 ("writeback: do
      not sleep on the congestion queue if there are no congested BDIs or if
      significant congestion is not being encountered in the current zone")
      but that led to undesirable stalls and sleeping for the full timeout
      even when the BDI wasn't congested.  Hence wait_iff_congested was used
      instead.
      
      But it seems that even wait_iff_congested doesn't work as expected.  We
      might have a small file LRU list with all pages dirty/writeback and yet
      the bdi is not congested, so this is just a cond_resched in the end and
      can end up triggering a premature OOM.

      This patch replaces the unconditional wait_iff_congested by
      congestion_wait, which is executed only if we _know_ that the last round
      of direct reclaim didn't make any progress and dirty+writeback pages are
      more than a half of the reclaimable pages on the zone which might be
      usable for our target allocation.  This shouldn't reintroduce the stalls
      fixed by 0e093d99 because congestion_wait is called only when we are
      getting hopeless and sleeping is a better choice than OOM with many
      pages under IO.
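
      A minimal sketch of that throttle condition (illustrative, with
      simplified types; not the exact kernel code):

        #include <stdbool.h>

        /* Sleep-throttle only when reclaim made no progress and more than half
         * of the reclaimable pages are dirty or under writeback. */
        bool should_throttle_on_io(bool made_progress,
                                   unsigned long dirty_plus_writeback,
                                   unsigned long reclaimable)
        {
                return !made_progress && 2 * dirty_plus_writeback > reclaimable;
        }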
      
      We have to preserve logic introduced by commit 373ccbe5 ("mm,
      vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
      progress") into the __alloc_pages_slowpath now that wait_iff_congested
      is not used anymore.  As the only remaining user of wait_iff_congested
      is shrink_inactive_list we can remove the WQ specific short sleep from
      wait_iff_congested because the sleep is needed to be done only once in
      the allocation retry cycle.
      
      [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
       Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ede37713
    • mm, oom: rework oom detection · 0a0337e0
      Authored by Michal Hocko
      __alloc_pages_slowpath has traditionally relied on the direct reclaim
      and did_some_progress as an indicator that it makes sense to retry
      allocation rather than declaring OOM.  shrink_zones had to rely on
      zone_reclaimable if shrink_zone didn't make any progress, to prevent a
      premature OOM killer invocation - the LRU might be full of dirty or
      writeback pages and direct reclaim cannot clean those up.
      
      zone_reclaimable allows rescanning the reclaimable lists several times
      and restarting if a page is freed.  This is really subtle behavior and
      it might lead to a livelock when a single freed page keeps the allocator
      looping but the current task will not be able to allocate that single
      page.  The OOM killer would be more appropriate than looping without any
      progress for an unbounded amount of time.
      
      This patch changes the OOM detection logic and pulls it out from
      shrink_zone, which is at too low a level to be appropriate for any high
      level decisions such as OOM, which is a per-zonelist property.  It is
      __alloc_pages_slowpath which knows how many attempts have been done and
      what the progress was so far, therefore it is more appropriate to
      implement this logic there.
      
      The new heuristic is implemented in should_reclaim_retry helper called
      from __alloc_pages_slowpath.  It tries to be more deterministic and
      easier to follow.  It builds on an assumption that retrying makes sense
      only if the currently reclaimable memory + free pages would allow the
      current allocation request to succeed (as per __zone_watermark_ok) at
      least for one zone in the usable zonelist.
      
      This alone wouldn't be sufficient, though, because the writeback might
      get stuck and reclaimable pages might be pinned for a really long time
      or even depend on the current allocation context.  Therefore there is a
      backoff mechanism implemented which reduces the reclaim target after
      each reclaim round without any progress.  This means that we should
      eventually converge to only NR_FREE_PAGES as the target and fail on the
      wmark check and proceed to OOM.  The backoff is simple and linear: 1/16
      of the reclaimable pages is subtracted for each round without any
      progress.  We are optimistic and reset the counter after successful
      reclaim rounds.
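
      A minimal sketch of that linear backoff (illustrative standalone C under
      simplified assumptions; the real should_reclaim_retry also deals with
      costly orders, memory reserves and per-zone watermark details):

        #include <stdbool.h>

        #define MAX_RECLAIM_RETRIES 16

        /* Each no-progress round removes another 1/16 of the reclaimable
         * pages from the optimistic target, so after 16 rounds only the
         * free pages remain and a failing watermark check leads to OOM. */
        static bool worth_retrying(unsigned long free_pages,
                                   unsigned long reclaimable_pages,
                                   unsigned long watermark,
                                   int no_progress_loops)
        {
                unsigned long target;

                if (no_progress_loops > MAX_RECLAIM_RETRIES)
                        return false;

                target = free_pages + reclaimable_pages -
                         reclaimable_pages * no_progress_loops /
                         MAX_RECLAIM_RETRIES;

                return target > watermark;
        }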
      
      Costly high order requests mostly preserve their semantics: those
      without __GFP_REPEAT fail right away while those which have the flag
      set will back off once the amount of reclaimable pages reaches the
      equivalent of the requested order.  The only difference is that if
      there was no progress during reclaim we rely on the zone watermark
      check.  This is a more logical thing to do than the previous 1<<order
      attempts which were a result of zone_reclaimable faking the progress.
      
      [vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
      [hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
      [rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
      [rientjes@google.com: shrink_zones doesn't need to return anything]
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a0337e0
    • M
      mm, compaction: simplify __alloc_pages_direct_compact feedback interface · c5d01d0d
      Michal Hocko 提交于
      __alloc_pages_direct_compact communicates potential back off by two
      variables:
      	- deferred_compaction tells that the compaction returned
      	  COMPACT_DEFERRED
      	- contended_compaction is set when there is a contention on
      	  zone->lock resp. zone->lru_lock locks
      
      __alloc_pages_slowpath then backs off for THP allocation requests to
      prevent long stalls.  This is rather messy and it would be much cleaner
      to return a single compact result value and hide all the nasty details
      in __alloc_pages_direct_compact.
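
      In sketch form (illustrative declarations only, not the real function
      signatures), the interface moves from two output flags to one result
      value that the caller interprets:

        #include <stdbool.h>

        enum compact_feedback_sketch {
                COMPACT_FEEDBACK_SKIPPED,
                COMPACT_FEEDBACK_DEFERRED,
                COMPACT_FEEDBACK_CONTENDED,
                COMPACT_FEEDBACK_PARTIAL,
        };

        struct page;    /* opaque here */

        /* before: the back off reason leaked out via two flags */
        struct page *direct_compact_old(unsigned int order,
                                        bool *deferred_compaction,
                                        bool *contended_compaction);

        /* after: a single value carries the same information and the ugly
         * details stay inside __alloc_pages_direct_compact */
        struct page *direct_compact_new(unsigned int order,
                                        enum compact_feedback_sketch *result);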
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5d01d0d
    • M
      mm, compaction: distinguish between full and partial COMPACT_COMPLETE · c8f7de0b
      Michal Hocko 提交于
      COMPACT_COMPLETE now means that the migration and free scanners met.
      This is not very useful information if somebody just wants to use this
      feedback and make decisions based on it.  The current caller might be a
      poor guy who just happened to scan a tiny portion of the zone and that
      could be the reason no suitable pages were compacted.  Make sure we
      distinguish between full and partial zone walks.
      
      Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
      and be optimistic in retrying.
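
      A rough sketch of the distinction (simplified; the "started at zone
      start" tracking shown here is an assumption for illustration, not the
      literal kernel code):

        #include <stdbool.h>

        enum walk_result_sketch {
                WALK_PARTIAL_SKIPPED,   /* scanners met after a partial walk */
                WALK_COMPLETE,          /* scanners met after walking the whole zone */
        };

        /* Report a real COMPACT_COMPLETE only when the scan covered the
         * zone from its beginning; a partial walk stays a potential success
         * that callers may optimistically retry. */
        static enum walk_result_sketch
        classify_finished_walk(bool started_at_zone_start)
        {
                return started_at_zone_start ? WALK_COMPLETE
                                             : WALK_PARTIAL_SKIPPED;
        }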
      
      The existing users of COMPACT_COMPLETE are conservatively changed to use
      COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be
      reconsidered to defer the compaction only for COMPACT_COMPLETE with the
      new semantics.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8f7de0b
    • M
      mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED · 1d4746d3
      Michal Hocko 提交于
      try_to_compact_pages() can currently return COMPACT_SKIPPED even when
      the compaction is deferred for some zone, just because zone DMA is
      skipped in 99% of cases due to watermark checks.  This makes
      COMPACT_DEFERRED basically unusable for the page allocator as a
      feedback mechanism.
      
      Make sure we distinguish those two states properly and switch their
      ordering in the enum.  This means that COMPACT_SKIPPED will be returned
      only when all eligible zones are skipped.
      
      As a result, COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
      becomes more precise and we bail out rather than reclaim.
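
      A sketch of why the ordering matters (under the assumption, illustrative
      here, that the zonelist iteration keeps the highest per-zone status):

        enum compact_result_sketch {
                COMPACT_SKIPPED,        /* lowest: reported only if every zone was skipped */
                COMPACT_DEFERRED,       /* now dominates a merely skipped zone (e.g. DMA) */
                COMPACT_CONTINUE,
                COMPACT_PARTIAL,
        };

        /* Aggregation across the zonelist keeps the "strongest" status, so
         * with the new ordering a deferred zone can no longer be hidden by
         * one that was only skipped on watermarks. */
        static enum compact_result_sketch
        aggregate(enum compact_result_sketch acc, enum compact_result_sketch status)
        {
                return status > acc ? status : acc;
        }
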
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d4746d3
    • M
      mm, compaction: cover all compaction mode in compact_zone · c46649de
      Michal Hocko 提交于
      The compiler is complaining after "mm, compaction: change COMPACT_
      constants into enum"
      
        mm/compaction.c: In function `compact_zone':
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
          switch (ret) {
          ^
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]
      
      compaction_suitable is allowed to return only COMPACT_PARTIAL,
      COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
      impossible.  Put a VM_BUG_ON to catch an impossible return value.
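
      A minimal sketch of the resulting check (standalone, with a stand-in
      macro for VM_BUG_ON; the actual hunk in compact_zone may differ in
      detail):

        enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED,
                              COMPACT_CONTINUE, COMPACT_PARTIAL };

        /* stand-in for the kernel's VM_BUG_ON() in this sketch */
        #define VM_BUG_ON_SKETCH(cond) do { if (cond) __builtin_trap(); } while (0)

        /* Returns nonzero when compaction should bail out right away. */
        static int suitability_says_bail(enum compact_result ret)
        {
                if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
                        return 1;       /* likely to fail or nothing to do */

                /* compaction_suitable() may not return anything else */
                VM_BUG_ON_SKETCH(ret != COMPACT_CONTINUE);
                return 0;               /* go ahead with the zone scan */
        }
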
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c46649de
    • M
      mm, compaction: change COMPACT_ constants into enum · ea7ab982
      Michal Hocko 提交于
      The compaction code does weird dances between COMPACT_FOO -> int ->
      unsigned long, but there doesn't seem to be any reason for that.  All
      functions which return/use one of those constants expect no other
      value, so it really makes sense to define an enum for them and make it
      clear that no other values are expected.
      
      This is a pure cleanup and shouldn't introduce any functional changes.
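
      The shape of the cleanup, as an abridged sketch (the member list and
      the comments are illustrative; the ordering of SKIPPED/DEFERRED is
      adjusted by the follow-up patch above):

        /* before: the COMPACT_* values were plain constants passed around
         * as int/unsigned long with no type to document the value space */

        enum compact_result {
                COMPACT_DEFERRED,       /* compaction deferred for this zone */
                COMPACT_SKIPPED,        /* compaction not attempted, e.g. watermarks too low */
                COMPACT_CONTINUE,       /* keep compacting */
                COMPACT_PARTIAL,        /* a page of the requested order is available */
                COMPACT_COMPLETE,       /* scanners met, nothing more to do */
        };

        /* callers and callees now share one well defined value space,
         * e.g. try_to_compact_pages() returns enum compact_result */
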
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea7ab982
    • M
      vmscan: consider classzone_idx in compaction_ready · b6459cc1
      Michal Hocko 提交于
      Motivation:
      As pointed out by Linus [2][3], relying on zone_reclaimable as a way to
      communicate the reclaim progress is rather dubious.  I tend to agree;
      not only is it really obscure, it is not hard to imagine cases where a
      single page freed in the loop keeps all the reclaimers looping without
      making any progress because their gfp_mask wouldn't allow them to get
      that page anyway (e.g. a single GFP_ATOMIC alloc and free loop).  This
      is rather rare, so it doesn't happen in practice, but the current logic
      is rather obscure, hard to follow and also non-deterministic.
      
      This is an attempt to make the OOM detection more deterministic and
      easier to follow because each reclaimer basically tracks its own
      progress, which is implemented at the page allocator layer rather than
      spread out between the allocator and the reclaim code.  More on the
      implementation is described in the first patch.
      
      I have tested several different scenarios, though it should be clear
      that testing the OOM killer in a representative way is quite hard.
      There is usually a tiny gap between almost OOM and full blown OOM which
      is often time sensitive.  Anyway, I have tested the following 2
      scenarios and I would appreciate suggestions for more to test.
      
      Testing environment: a virtual machine with 2G of RAM and 2CPUs without
      any swap to make the OOM more deterministic.
      
      1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
         file size, removing the files and starting over again) running in
         parallel for 10s to build up a lot of dirty pages, and then 100
         parallel mem_eaters (anon private populated mmap which waits until
         it gets a signal), 80M each, are started.
      
         This causes an OOM flood of course, and I have compared both patched
         and unpatched kernels.  The test is considered finished once no more
         OOM conditions are detected.  This should tell us whether there are
         any excessive kills or whether some of them are premature (e.g. due
         to dirty pages):
      
      I have performed two runs this time each after a fresh boot.
      
      * base kernel
      $ grep "Out of memory:" base-oom-run1.log | wc -l
      78
      $ grep "Out of memory:" base-oom-run2.log | wc -l
      78
      
      $ grep "Kill process" base-oom-run1.log | tail -n1
      [   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
      $ grep "Kill process" base-oom-run2.log | tail -n1
      [   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child
      
      $ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
      $ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52
      
      $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
      1
      $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
      3
      
      * patched kernel
      $ grep "Out of memory:" patched-oom-run1.log | wc -l
      78
      $ grep "Out of memory:" patched-oom-run2.log | wc -l
      77
      
      $ grep "Kill process" patched-oom-run1.log | tail -n1
      [  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
      $ grep "Kill process" patched-oom-run2.log | tail -n1
      [  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child
      
      $ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
      $ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77
      
      $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
      2
      $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
      3
      
      The patched kernel ran noticeably longer while invoking the OOM killer
      the same number of times.  This means that the original implementation
      is much more aggressive and triggers the OOM killer sooner.  The free
      pages stats show that neither kernel went OOM too early most of the
      time, though.  I guess the difference is in the backoff: retries
      without any progress sleep for a while if there is memory under
      writeback or dirty, which is highly likely considering the parallel IO.
      Both kernels have seen races where a zone wasn't marked unreclaimable
      and we still hit the OOM killer.  This is most likely a race where a
      task managed to exit between the last allocation attempt and the OOM
      killer invocation.
      
      2) 2 writers again with a 10s run and then 10 mem_eaters to consume as
         much memory as possible without triggering the OOM killer.  This
         required a lot of tuning, but I've considered 3 consecutive runs in
         three different boots without OOM as a success.
      
      * base kernel
      size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
      
      * patched kernel
      size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)
      
      That means 40M more memory was usable without triggering the OOM
      killer.  The base kernel sometimes managed to handle the same amount as
      the patched one, but it wasn't consistent and failed in at least one of
      the 3 runs.  This seems like a minor improvement.
      
      I was also testing __GFP_REPEAT costly requests (hugetlb) with
      fragmented memory and under memory pressure.  The results are in patch
      11 where the logic is implemented.  In short, I can see a huge
      improvement there.
      
      I am certainly interested in other use cases as well as any feedback,
      especially those which require higher order requests.
      
      This patch (of 14):
      
      While playing with the oom detection rework [1] I have noticed that my
      heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
      where the reclaim hasn't made any progress but did_some_progress didn't
      reflect that and compaction_suitable was backing off because no zone is
      above low wmark + 1 << order.
      
      It turned out that this is in fact a long-standing bug in
      compaction_ready which ignores requested_highidx and does the
      watermark check for classzone_idx 0.  This succeeds for zone DMA most
      of the time as the zone is mostly unused because of lowmem protection.
      As a result costly high order allocations always report successful
      progress even when there was none.  This wasn't a problem so far
      because these allocations usually fail quite early or retry only a few
      times with __GFP_REPEAT, but this will change after a later patch in
      this series, so make sure not to lie about the progress: propagate
      requested_highidx down to compaction_ready and use it for both the
      watermark check and compaction_suitable to fix this issue.
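
      A simplified before/after of the check, as standalone C with a stub
      standing in for the real watermark helper (both the stub's reserve
      handling and the watermark arithmetic are made up for illustration):

        #include <stdbool.h>

        /* Stand-in for the kernel watermark check; only the classzone index
         * argument matters for this illustration. */
        static bool wmark_ok_stub(unsigned long free, unsigned long mark,
                                  int classzone_idx)
        {
                /* a higher classzone index subtracts a larger lowmem
                 * reserve, making the check stricter (size is invented) */
                unsigned long reserve = (unsigned long)classzone_idx * 1024;

                return free >= mark + reserve;
        }

        /* before: always checked with classzone_idx 0, which zone DMA
         * passed easily, faking progress for costly allocations */
        static bool compaction_ready_old(unsigned long free,
                                         unsigned long wmark,
                                         unsigned int order)
        {
                return wmark_ok_stub(free, wmark + (2UL << order), 0);
        }

        /* after: honour the classzone index the allocation asked for */
        static bool compaction_ready_new(unsigned long free,
                                         unsigned long wmark,
                                         unsigned int order,
                                         int requested_highidx)
        {
                return wmark_ok_stub(free, wmark + (2UL << order),
                                     requested_highidx);
        }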
      
      [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
      [2] https://lkml.org/lkml/2015/10/12/808
      [3] https://lkml.org/lkml/2015/10/13/597
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6459cc1