1. 14 4月, 2018 2 次提交
    • M
      get_user_pages_fast(): return -EFAULT on access_ok failure · c61611f7
      Michael S. Tsirkin 提交于
      get_user_pages_fast is supposed to be a faster drop-in equivalent of
      get_user_pages.  As such, callers expect it to return a negative return
      code when passed an invalid address, and never expect it to return 0
      when passed a positive number of pages, since its documentation says:
      
       * Returns number of pages pinned. This may be fewer than the number
       * requested. If nr_pages is 0 or negative, returns 0. If no pages
       * were pinned, returns -errno.
      
      When get_user_pages_fast fall back on get_user_pages this is exactly
      what happens.  Unfortunately the implementation is inconsistent: it
      returns 0 if passed a kernel address, confusing callers: for example,
      the following is pretty common but does not appear to do the right thing
      with a kernel address:
      
              ret = get_user_pages_fast(addr, 1, writeable, &page);
              if (ret < 0)
                      return ret;
      
      Change get_user_pages_fast to return -EFAULT when supplied a kernel
      address to make it match expectations.
      
      All callers have been audited for consistency with the documented
      semantics.
      
      Link: http://lkml.kernel.org/r/1522962072-182137-4-git-send-email-mst@redhat.com
      Fixes: 5b65c467 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()")
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c61611f7
    • M
      mm/gup_benchmark: handle gup failures · 09e35a4a
      Michael S. Tsirkin 提交于
      Patch series "mm/get_user_pages_fast fixes, cleanups", v2.
      
      Turns out get_user_pages_fast and __get_user_pages_fast return different
      values on error when given a single page: __get_user_pages_fast returns
      0.  get_user_pages_fast returns either 0 or an error.
      
      Callers of get_user_pages_fast expect an error so fix it up to return an
      error consistently.
      
      Stress the difference between get_user_pages_fast and
      __get_user_pages_fast to make sure callers aren't confused.
      
      This patch (of 3):
      
      __gup_benchmark_ioctl does not handle the case where get_user_pages_fast
      fails:
      
       - a negative return code will cause a buffer overrun
      
       - returning with partial success will cause use of uninitialized
         memory.
      
      [akpm@linux-foundation.org: simplification]
      Link: http://lkml.kernel.org/r/1522962072-182137-3-git-send-email-mst@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09e35a4a
  2. 12 4月, 2018 38 次提交
    • M
      page cache: use xa_lock · b93b0163
      Matthew Wilcox 提交于
      Remove the address_space ->tree_lock and use the xa_lock newly added to
      the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
      since we don't really care that it's a tree.
      
      [willy@infradead.org: fix nds32, fs/dax.c]
        Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b93b0163
    • P
      xen, mm: allow deferred page initialization for xen pv domains · 6f84f8d1
      Pavel Tatashin 提交于
      Juergen Gross noticed that commit f7f99100 ("mm: stop zeroing memory
      during allocation in vmemmap") broke XEN PV domains when deferred struct
      page initialization is enabled.
      
      This is because the xen's PagePinned() flag is getting erased from
      struct pages when they are initialized later in boot.
      
      Juergen fixed this problem by disabling deferred pages on xen pv
      domains.  It is desirable, however, to have this feature available as it
      reduces boot time.  This fix re-enables the feature for pv-dmains, and
      fixes the problem the following way:
      
      The fix is to delay setting PagePinned flag until struct pages for all
      allocated memory are initialized, i.e.  until after free_all_bootmem().
      
      A new x86_init.hyper op init_after_bootmem() is called to let xen know
      that boot allocator is done, and hence struct pages for all the
      allocated memory are now initialized.  If deferred page initialization
      is enabled, the rest of struct pages are going to be initialized later
      in boot once page_alloc_init_late() is called.
      
      xen_after_bootmem() walks page table's pages and marks them pinned.
      
      Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Tested-by: NJuergen Gross <jgross@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f84f8d1
    • M
      mm: introduce MAP_FIXED_NOREPLACE · a4ff8e86
      Michal Hocko 提交于
      Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
      
      This has started as a follow up discussion [3][4] resulting in the
      runtime failure caused by hardening patch [5] which removes MAP_FIXED
      from the elf loader because MAP_FIXED is inherently dangerous as it
      might silently clobber an existing underlying mapping (e.g.  stack).
      The reason for the failure is that some architectures enforce an
      alignment for the given address hint without MAP_FIXED used (e.g.  for
      shared or file backed mappings).
      
      One way around this would be excluding those archs which do alignment
      tricks from the hardening [6].  The patch is really trivial but it has
      been objected, rightfully so, that this screams for a more generic
      solution.  We basically want a non-destructive MAP_FIXED.
      
      The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
      address but unlike MAP_FIXED it fails with EEXIST if the given range
      conflicts with an existing one.  The flag is introduced as a completely
      new one rather than a MAP_FIXED extension because of the backward
      compatibility.  We really want a never-clobber semantic even on older
      kernels which do not recognize the flag.  Unfortunately mmap sucks
      wrt flags evaluation because we do not EINVAL on unknown flags.  On
      those kernels we would simply use the traditional hint based semantic so
      the caller can still get a different address (which sucks) but at least
      not silently corrupt an existing mapping.  I do not see a good way
      around that.  Except we won't export expose the new semantic to the
      userspace at all.
      
      It seems there are users who would like to have something like that.
      Jemalloc has been mentioned by Michael Ellerman [7]
      
      Florian Weimer has mentioned the following:
      : glibc ld.so currently maps DSOs without hints.  This means that the kernel
      : will map right next to each other, and the offsets between them a completely
      : predictable.  We would like to change that and supply a random address in a
      : window of the address space.  If there is a conflict, we do not want the
      : kernel to pick a non-random address. Instead, we would try again with a
      : random address.
      
      John Hubbard has mentioned CUDA example
      : a) Searches /proc/<pid>/maps for a "suitable" region of available
      : VA space.  "Suitable" generally means it has to have a base address
      : within a certain limited range (a particular device model might
      : have odd limitations, for example), it has to be large enough, and
      : alignment has to be large enough (again, various devices may have
      : constraints that lead us to do this).
      :
      : This is of course subject to races with other threads in the process.
      :
      : Let's say it finds a region starting at va.
      :
      : b) Next it does:
      :     p = mmap(va, ...)
      :
      : *without* setting MAP_FIXED, of course (so va is just a hint), to
      : attempt to safely reserve that region. If p != va, then in most cases,
      : this is a failure (almost certainly due to another thread getting a
      : mapping from that region before we did), and so this layer now has to
      : call munmap(), before returning a "failure: retry" to upper layers.
      :
      :     IMPROVEMENT: --> if instead, we could call this:
      :
      :             p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
      :
      :         , then we could skip the munmap() call upon failure. This
      :         is a small thing, but it is useful here. (Thanks to Piotr
      :         Jaroszynski and Mark Hairgrove for helping me get that detail
      :         exactly right, btw.)
      :
      : c) After that, CUDA suballocates from p, via:
      :
      :      q = mmap(sub_region_start, ... MAP_FIXED ...)
      :
      : Interestingly enough, "freeing" is also done via MAP_FIXED, and
      : setting PROT_NONE to the subregion. Anyway, I just included (c) for
      : general interest.
      
      Atomic address range probing in the multithreaded programs in general
      sounds like an interesting thing to me.
      
      The second patch simply replaces MAP_FIXED use in elf loader by
      MAP_FIXED_NOREPLACE.  I believe other places which rely on MAP_FIXED
      should follow.  Actually real MAP_FIXED usages should be docummented
      properly and they should be more of an exception.
      
      [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
      [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
      [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
      [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
      [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
      [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
      
      This patch (of 2):
      
      MAP_FIXED is used quite often to enforce mapping at the particular range.
      The main problem of this flag is, however, that it is inherently dangerous
      because it unmaps existing mappings covered by the requested range.  This
      can cause silent memory corruptions.  Some of them even with serious
      security implications.  While the current semantic might be really
      desiderable in many cases there are others which would want to enforce the
      given range but rather see a failure than a silent memory corruption on a
      clashing range.  Please note that there is no guarantee that a given range
      is obeyed by the mmap even when it is free - e.g.  arch specific code is
      allowed to apply an alignment.
      
      Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
      behavior.  It has the same semantic as MAP_FIXED wrt.  the given address
      request with a single exception that it fails with EEXIST if the requested
      address is already covered by an existing mapping.  We still do rely on
      get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
      check for a conflicting vma after it returns.
      
      The flag is introduced as a completely new one rather than a MAP_FIXED
      extension because of the backward compatibility.  We really want a
      never-clobber semantic even on older kernels which do not recognize the
      flag.  Unfortunately mmap sucks wrt.  flags evaluation because we do not
      EINVAL on unknown flags.  On those kernels we would simply use the
      traditional hint based semantic so the caller can still get a different
      address (which sucks) but at least not silently corrupt an existing
      mapping.  I do not see a good way around that.
      
      [mpe@ellerman.id.au: fix whitespace]
      [fail on clashing range with EEXIST as per Florian Weimer]
      [set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
      Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.orgReviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Evans <jasone@google.com>
      Cc: David Goldblatt <davidtgoldblatt@gmail.com>
      Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4ff8e86
    • K
      exec: pass stack rlimit into mm layout functions · 8f2af155
      Kees Cook 提交于
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally during
      exec (either via another thread calling setrlimit() or another process
      calling prlimit()), provide a way to pass the rlimit down into the
      per-architecture mm layout functions so that the rlimit can stay in the
      bprm structure instead of sitting in the signal structure until exec is
      finalized.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f2af155
    • A
      kasan, slub: fix handling of kasan_slab_free hook · c3895391
      Andrey Konovalov 提交于
      The kasan_slab_free hook's return value denotes whether the reuse of a
      slab object must be delayed (e.g.  when the object is put into memory
      qurantine).
      
      The current way SLUB handles this hook is by ignoring its return value
      and hardcoding checks similar (but not exactly the same) to the ones
      performed in kasan_slab_free, which is prone to making mistakes.
      
      The main difference between the hardcoded checks and the ones in
      kasan_slab_free is whether we want to perform a free in case when an
      invalid-free or a double-free was detected (we don't).
      
      This patch changes the way SLUB handles this by:
      1. taking into account the return value of kasan_slab_free for each of
         the objects, that are being freed;
      2. reconstructing the freelist of objects to exclude the ones, whose
         reuse must be delayed.
      
      [andreyknvl@google.com: eliminate unnecessary branch in slab_free]
        Link: http://lkml.kernel.org/r/a62759a2545fddf69b0c034547212ca1eb1b3ce2.1520359686.git.andreyknvl@google.com
      Link: http://lkml.kernel.org/r/083f58501e54731203801d899632d76175868e97.1519400992.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3895391
    • J
      mm/thp: don't count ZONE_MOVABLE as the target for freepage reserving · b7d349c7
      Joonsoo Kim 提交于
      There was a regression report for "mm/cma: manage the memory of the CMA
      area by using the ZONE_MOVABLE" [1] and I think that it is related to
      this problem.  CMA patchset makes the system use one more zone
      (ZONE_MOVABLE) and then increases min_free_kbytes.  It reduces usable
      memory and it could cause regression.
      
      ZONE_MOVABLE only has movable pages so we don't need to keep enough
      freepages to avoid or deal with fragmentation.  So, don't count it.
      
      This changes min_free_kbytes and thus min_watermark greatly if
      ZONE_MOVABLE is used.  It will make the user uses more memory.
      
      System:
        22GB ram, fakenuma, 2 nodes. 5 zones are used.
      
      Before:
        min_free_kbytes: 112640
      
        zone_info (min_watermark):
        Node 0, zone      DMA
                min      19
        Node 0, zone    DMA32
                min      3778
        Node 0, zone   Normal
                min      10191
        Node 0, zone  Movable
                min      0
        Node 0, zone   Device
                min      0
        Node 1, zone      DMA
                min      0
        Node 1, zone    DMA32
                min      0
        Node 1, zone   Normal
                min      14043
        Node 1, zone  Movable
                min      127
        Node 1, zone   Device
                min      0
      
      After:
        min_free_kbytes: 90112
      
        zone_info (min_watermark):
        Node 0, zone      DMA
                min      15
        Node 0, zone    DMA32
                min      3022
        Node 0, zone   Normal
                min      8152
        Node 0, zone  Movable
                min      0
        Node 0, zone   Device
                min      0
        Node 1, zone      DMA
                min      0
        Node 1, zone    DMA32
                min      0
        Node 1, zone   Normal
                min      11234
        Node 1, zone  Movable
                min      102
        Node 1, zone   Device
                min      0
      
      [1] (lkml.kernel.org/r/20180102063528.GG30397%20()%20yexl-desktop)
      
      Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7d349c7
    • J
      mm/cma: remove ALLOC_CMA · 1d47a3ec
      Joonsoo Kim 提交于
      Now, all reserved pages for CMA region are belong to the ZONE_MOVABLE
      and it only serves for a request with GFP_HIGHMEM && GFP_MOVABLE.
      
      Therefore, we don't need to maintain ALLOC_CMA at all.
      
      Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NTony Lindgren <tony@atomide.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d47a3ec
    • J
      mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE · bad8c6c0
      Joonsoo Kim 提交于
      Patch series "mm/cma: manage the memory of the CMA area by using the
      ZONE_MOVABLE", v2.
      
      0. History
      
      This patchset is the follow-up of the discussion about the "Introduce
      ZONE_CMA (v7)" [1].  Please reference it if more information is needed.
      
      1. What does this patch do?
      
      This patch changes the management way for the memory of the CMA area in
      the MM subsystem.  Currently the memory of the CMA area is managed by
      the zone where their pfn is belong to.  However, this approach has some
      problems since MM subsystem doesn't have enough logic to handle the
      situation that different characteristic memories are in a single zone.
      To solve this issue, this patch try to manage all the memory of the CMA
      area by using the MOVABLE zone.  In MM subsystem's point of view,
      characteristic of the memory on the MOVABLE zone and the memory of the
      CMA area are the same.  So, managing the memory of the CMA area by using
      the MOVABLE zone will not have any problem.
      
      2. Motivation
      
      There are some problems with current approach.  See following.  Although
      these problem would not be inherent and it could be fixed without this
      conception change, it requires many hooks addition in various code path
      and it would be intrusive to core MM and would be really error-prone.
      Therefore, I try to solve them with this new approach.  Anyway,
      following is the problems of the current implementation.
      
      o CMA memory utilization
      
      First, following is the freepage calculation logic in MM.
      
       - For movable allocation: freepage = total freepage
       - For unmovable allocation: freepage = total freepage - CMA freepage
      
      Freepages on the CMA area is used after the normal freepages in the zone
      where the memory of the CMA area is belong to are exhausted.  At that
      moment that the number of the normal freepages is zero, so
      
       - For movable allocation: freepage = total freepage = CMA freepage
       - For unmovable allocation: freepage = 0
      
      If unmovable allocation comes at this moment, allocation request would
      fail to pass the watermark check and reclaim is started.  After reclaim,
      there would exist the normal freepages so freepages on the CMA areas
      would not be used.
      
      FYI, there is another attempt [2] trying to solve this problem in lkml.
      And, as far as I know, Qualcomm also has out-of-tree solution for this
      problem.
      
      Useless reclaim:
      
      There is no logic to distinguish CMA pages in the reclaim path.  Hence,
      CMA page is reclaimed even if the system just needs the page that can be
      usable for the kernel allocation.
      
      Atomic allocation failure:
      
      This is also related to the fallback allocation policy for the memory of
      the CMA area.  Consider the situation that the number of the normal
      freepages is *zero* since the bunch of the movable allocation requests
      come.  Kswapd would not be woken up due to following freepage
      calculation logic.
      
      - For movable allocation: freepage = total freepage = CMA freepage
      
      If atomic unmovable allocation request comes at this moment, it would
      fails due to following logic.
      
      - For unmovable allocation: freepage = total freepage - CMA freepage = 0
      
      It was reported by Aneesh [3].
      
      Useless compaction:
      
      Usual high-order allocation request is unmovable allocation request and
      it cannot be served from the memory of the CMA area.  In compaction,
      migration scanner try to migrate the page in the CMA area and make
      high-order page there.  As mentioned above, it cannot be usable for the
      unmovable allocation request so it's just waste.
      
      3. Current approach and new approach
      
      Current approach is that the memory of the CMA area is managed by the
      zone where their pfn is belong to.  However, these memory should be
      distinguishable since they have a strong limitation.  So, they are
      marked as MIGRATE_CMA in pageblock flag and handled specially.  However,
      as mentioned in section 2, the MM subsystem doesn't have enough logic to
      deal with this special pageblock so many problems raised.
      
      New approach is that the memory of the CMA area is managed by the
      MOVABLE zone.  MM already have enough logic to deal with special zone
      like as HIGHMEM and MOVABLE zone.  So, managing the memory of the CMA
      area by the MOVABLE zone just naturally work well because constraints
      for the memory of the CMA area that the memory should always be
      migratable is the same with the constraint for the MOVABLE zone.
      
      There is one side-effect for the usability of the memory of the CMA
      area.  The use of MOVABLE zone is only allowed for a request with
      GFP_HIGHMEM && GFP_MOVABLE so now the memory of the CMA area is also
      only allowed for this gfp flag.  Before this patchset, a request with
      GFP_MOVABLE can use them.  IMO, It would not be a big issue since most
      of GFP_MOVABLE request also has GFP_HIGHMEM flag.  For example, file
      cache page and anonymous page.  However, file cache page for blockdev
      file is an exception.  Request for it has no GFP_HIGHMEM flag.  There is
      pros and cons on this exception.  In my experience, blockdev file cache
      pages are one of the top reason that causes cma_alloc() to fail
      temporarily.  So, we can get more guarantee of cma_alloc() success by
      discarding this case.
      
      Note that there is no change in admin POV since this patchset is just
      for internal implementation change in MM subsystem.  Just one minor
      difference for admin is that the memory stat for CMA area will be
      printed in the MOVABLE zone.  That's all.
      
      4. Result
      
      Following is the experimental result related to utilization problem.
      
      8 CPUs, 1024 MB, VIRTUAL MACHINE
      make -j16
      
      <Before>
        CMA area:               0 MB            512 MB
        Elapsed-time:           92.4		186.5
        pswpin:                 82		18647
        pswpout:                160		69839
      
      <After>
        CMA        :            0 MB            512 MB
        Elapsed-time:           93.1		93.4
        pswpin:                 84		46
        pswpout:                183		92
      
      akpm: "kernel test robot" reported a 26% improvement in
      vm-scalability.throughput:
      http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop
      
      [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
      [2]: https://lkml.org/lkml/2014/10/15/623
      [3]: http://www.spinics.net/lists/linux-mm/msg100562.html
      
      Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NTony Lindgren <tony@atomide.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bad8c6c0
    • J
      mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request · d3cda233
      Joonsoo Kim 提交于
      Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
      important to reserve.  When ZONE_MOVABLE is used, this problem would
      theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE
      allocation request which is mainly used for page cache and anon page
      allocation.  So, fix it by setting 0 to
      sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM].
      
      And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size
      makes code complex.  For example, if there is highmem system, following
      reserve ratio is activated for *NORMAL ZONE* which would be easyily
      misleading people.
      
       #ifdef CONFIG_HIGHMEM
       32
       #endif
      
      This patch also fixes this situation by defining
      sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES and place "#ifdef" to
      right place.
      
      Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NTony Lindgren <tony@atomide.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3cda233
    • M
      mm: unclutter THP migration · 94723aaf
      Michal Hocko 提交于
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by spliting the THP into small pages while moving the head
      page to the newly allocated order-0 page.  Remaning pages are moved to
      the LRU list by split_huge_page.  The same happens if the THP allocation
      fails.  This is really ugly and error prone [1].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      This patch tries to unclutter the situation by moving the special THP
      handling up to the migrate_pages layer where it actually belongs.  We
      simply split the THP page into the existing list if unmap_and_move fails
      with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
      specific migrate_pages users do not have to deal with this case in a
      special way.
      
      [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94723aaf
    • M
      mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko 提交于
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey node_id resp.  migration error up
      to move_pages code (do_move_page_to_node_array).  The error status never
      made it into the final status field and we have a better way to
      communicate node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can drop it finally.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      666feb21
    • M
      mm, numa: rework do_pages_move · a49bd4d7
      Michal Hocko 提交于
      Patch series "unclutter thp migration"
      
      Motivation:
      
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  Remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [2].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      The first patch reworks do_pages_move which relies on a very ugly
      calling semantic when the return status is pushed to the migration path
      via private pointer.  It uses pre allocated fixed size batching to
      achieve that.  We simply cannot do the same if a THP is to be split
      during the migration path which is done in the patch 3.  Patch 2 is
      follow up cleanup which removes the mentioned return status calling
      convention ugliness.
      
      On a side note:
      
      There are some semantic issues I have encountered on the way when
      working on patch 1 but I am not addressing them here.  E.g. trying to
      move THP tail pages will result in either success or EBUSY (the later
      one more likely once we isolate head from the LRU list).  Hugetlb
      reports EACCESS on tail pages.  Some errors are reported via status
      parameter but migration failures are not even though the original
      `reason' argument suggests there was an intention to do so.  From a
      quick look into git history this never worked.  I have tried to keep the
      semantic unchanged.
      
      Then there is a relatively minor thing that the page isolation might
      fail because of pages not being on the LRU - e.g. because they are
      sitting on the per-cpu LRU caches.  Easily fixable.
      
      This patch (of 3):
      
      do_pages_move is supposed to move user defined memory (an array of
      addresses) to the user defined numa nodes (an array of nodes one for
      each address).  The user provided status array then contains resulting
      numa node for each address or an error.  The semantic of this function
      is little bit confusing because only some errors are reported back.
      Notably migrate_pages error is only reported via the return value.  This
      patch doesn't try to address these semantic nuances but rather change
      the underlying implementation.
      
      Currently we are processing user input (which can be really large) in
      batches which are stored to a temporarily allocated page.  Each address
      is resolved to its struct page and stored to page_to_node structure
      along with the requested target numa node.  The array of these
      structures is then conveyed down the page migration path via private
      argument.  new_page_node then finds the corresponding structure and
      allocates the proper target page.
      
      What is the problem with the current implementation and why to change
      it? Apart from being quite ugly it also doesn't cope with unexpected
      pages showing up on the migration list inside migrate_pages path.  That
      doesn't happen currently but the follow up patch would like to make the
      thp migration code more clear and that would need to split a THP into
      the list for some cases.
      
      How does the new implementation work? Well, instead of batching into a
      fixed size array we simply batch all pages that should be migrated to
      the same node and isolate all of them into a linked list which doesn't
      require any additional storage.  This should work reasonably well
      because page migration usually migrates larger ranges of memory to a
      specific node.  So the common case should work equally well as the
      current implementation.  Even if somebody constructs an input where the
      target numa nodes would be interleaved we shouldn't see a large
      performance impact because page migration alone doesn't really benefit
      from batching.  mmap_sem batching for the lookup is quite questionable
      and isolate_lru_page which would benefit from batching is not using it
      even in the current implementation.
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a49bd4d7
    • C
      mm/swapfile.c: make pointer swap_avail_heads static · bfc6b1ca
      Colin Ian King 提交于
      The pointer swap_avail_heads is local to the source and does not need to
      be in global scope, so make it static.
      
      Cleans up sparse warning:
      
        mm/swapfile.c:88:19: warning: symbol 'swap_avail_heads' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20180206215836.12366-1-colin.king@canonical.comSigned-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfc6b1ca
    • M
      memcg: fix per_node_info cleanup · 4eaf431f
      Michal Hocko 提交于
      syzbot has triggered a NULL ptr dereference when allocation fault
      injection enforces a failure and alloc_mem_cgroup_per_node_info
      initializes memcg->nodeinfo only half way through.
      
      But __mem_cgroup_free still tries to free all per-node data and
      dereferences pn->lruvec_stat_cpu unconditioanlly even if the specific
      per-node data hasn't been initialized.
      
      The bug is quite unlikely to hit because small allocations do not fail
      and we would need quite some numa nodes to make struct
      mem_cgroup_per_node large enough to cross the costly order.
      
      Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
      Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
      Fixes: 00f3ca2c ("mm: memcontrol: per-lruvec stats infrastructure")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eaf431f
    • T
      swap: divide-by-zero when zero length swap file on ssd · a06ad633
      Tom Abraham 提交于
      Calling swapon() on a zero length swap file on SSD can lead to a
      divide-by-zero.
      
      Although creating such files isn't possible with mkswap and they woud be
      considered invalid, it would be better for the swapon code to be more
      robust and handle this condition gracefully (return -EINVAL).
      Especially since the fix is small and straightforward.
      
      To help with wear leveling on SSD, the swapon syscall calculates a
      random position in the swap file using modulo p->highest_bit, which is
      set to maxpages - 1 in read_swap_header.
      
      If the swap file is zero length, read_swap_header sets maxpages=1 and
      last_page=0, resulting in p->highest_bit=0 and we divide-by-zero when we
      modulo p->highest_bit in swapon syscall.
      
      This can be prevented by having read_swap_header return zero if
      last_page is zero.
      
      Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.comSigned-off-by: NThomas Abraham <tabraham@suse.com>
      Reported-by: <Mark.Landis@Teradata.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a06ad633
    • J
      mm: memcg: make sure memory.events is uptodate when waking pollers · e27be240
      Johannes Weiner 提交于
      Commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") added per-cpu drift to all memory cgroup stats
      and events shown in memory.stat and memory.events.
      
      For memory.stat this is acceptable.  But memory.events issues file
      notifications, and somebody polling the file for changes will be
      confused when the counters in it are unchanged after a wakeup.
      
      Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
      MEMCG_OOM - are sufficiently rare and high-level that we don't need
      per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
      frequent, but they're counting invocations of reclaim, which is a
      complex operation that touches many shared cachelines.
      
      This splits memory.events from the generic VM events and tracks them in
      their own, unbuffered atomic counters.  That's also cleaner, as it
      eliminates the ugly enum nesting of VM and cgroup events.
      
      [hannes@cmpxchg.org: "array subscript is above array bounds"]
        Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
      Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NTejun Heo <tj@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e27be240
    • C
      mm/ksm.c: fix inconsistent accounting of zero pages · a38c015f
      Claudio Imbrenda 提交于
      When using KSM with use_zero_pages, we replace anonymous pages
      containing only zeroes with actual zero pages, which are not anonymous.
      We need to do proper accounting of the mm counters, otherwise we will
      get wrong values in /proc and a BUG message in dmesg when tearing down
      the mm.
      
      Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
      Fixes: e86c59b1 ("mm/ksm: improve deduplication of zero pages with colouring")
      Signed-off-by: NClaudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a38c015f
    • M
      mm/z3fold.c: use gfpflags_allow_blocking · 8a97ea54
      Matthew Wilcox 提交于
      We have a perfectly good macro to determine whether the gfp flags allow
      you to sleep or not; use it instead of trying to infer it.
      
      Link: http://lkml.kernel.org/r/20180408062206.GC16007@bombadil.infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a97ea54
    • X
      z3fold: fix memory leak · 1ec6995d
      Xidong Wang 提交于
      In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
      released on the error path that pool->compact_wq , which holds the
      return value of create_singlethread_workqueue(), is NULL.  This will
      result in a memory leak bug.
      
      [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
      Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.comSigned-off-by: NXidong Wang <wangxidong_97@163.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ec6995d
    • M
      memcg, thp: do not invoke oom killer on thp charges · 2a70f6a7
      Michal Hocko 提交于
      A THP memcg charge can trigger the oom killer since 25160354 ("mm,
      thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
      We have used an explicit __GFP_NORETRY previously which ruled the OOM
      killer automagically.
      
      Memcg charge path should be semantically compliant with the allocation
      path and that means that if we do not trigger the OOM killer for costly
      orders which should do the same in the memcg charge path as well.
      Otherwise we are forcing callers to distinguish the two and use
      different gfp masks which is both non-intuitive and bug prone.  As soon
      as we get a costly high order kmalloc user we even do not have any means
      to tell the memcg specific gfp mask to prevent from OOM because the
      charging is deep within guts of the slab allocator.
      
      The unexpected memcg OOM on THP has already been fixed upstream by
      9d3c3354 ("mm, thp: do not cause memcg oom for thp") but this is a
      one-off fix rather than a generic solution.  Teach mem_cgroup_oom to
      bail out on costly order requests to fix the THP issue as well as any
      other costly OOM eligible allocations to be added in future.
      
      Also revert 9d3c3354 because special gfp for THP is no longer
      needed.
      
      Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a70f6a7
    • R
      mm/migrate: properly preserve write attribute in special migrate entry · 07707125
      Ralph Campbell 提交于
      Use of pte_write(pte) is only valid for present pte, the common code
      which set the migration entry can be reach for both valid present pte
      and special swap entry (for device memory).  Fix the code to use the
      mpfn value which properly handle both cases.
      
      On x86 this did not have any bad side effect because pte write bit is
      below PAGE_BIT_GLOBAL and thus special swap entry have it set to 0 which
      in turn means we were always creating read only special migration entry.
      
      So once migration did finish we always write protected the CPU page
      table entry (moreover this is only an issue when migrating from device
      memory to system memory).  End effect is that CPU write access would
      fault again and restore write permission.
      
      This behaviour isn't too bad; it just burns CPU cycles by forcing CPU to
      take a second fault on write access. ie, double faulting the same
      address.  There is no corruption or incorrect states (it behaves as a
      COWed page from a fork with a mapcount of 1).
      
      Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.comSigned-off-by: NRalph Campbell <rcampbell@nvidia.com>
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07707125
    • M
      sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages · 09a913a7
      Mel Gorman 提交于
      change_pte_range is called from task work context to mark PTEs for
      receiving NUMA faulting hints.  If the marked pages are dirty then
      migration may fail.  Some filesystems cannot migrate dirty pages without
      blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
      Even when they can, it can be a waste of cycles when the pages are
      shared forcing higher scan rates.  This patch avoids marking shared
      dirty pages for hinting faults but also will skip a migration if the
      page was dirtied after the scanner updated a clean page.
      
      This is most noticeable running the NASA Parallel Benchmark when backed
      by btrfs, the default root filesystem for some distributions, but also
      noticeable when using XFS.
      
      The following are results from a 4-socket machine running a 4.16-rc4
      kernel with some scheduler patches that are pending for the next merge
      window.
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1
        Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
        Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
        Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
        Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
        Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)
      
      is.D regresses slightly in terms of absolute time but note that that
      particular load varies quite a bit from run to run.  The more relevant
      observation is the total system CPU usage.
      
                  4.16.0-rc4  4.16.0-rc4
                schedtip-20180309 nodirty-v1
        User        71471.91    70627.04
        System      11078.96     8256.13
        Elapsed       661.66      632.74
      
      That is a substantial drop in system CPU usage and overall the workload
      completes faster.  The NUMA balancing statistics are also interesting
      
        NUMA base PTE updates        111407972   139848884
        NUMA huge PMD updates           206506      264869
        NUMA page range updates      217139044   275461812
        NUMA hint faults               4300924     3719784
        NUMA hint local faults         3012539     3416618
        NUMA hint local percent             70          91
        NUMA pages migrated            1517487     1358420
      
      While more PTEs are scanned due to changes in what faults are gathered,
      it's clear that a far higher percentage of faults are local as the bulk
      of the remote hits were dirty pages that, in this case with btrfs, had
      no chance of migrating.
      
      The following is a comparison when using XFS as that is a more realistic
      filesystem choice for a data partition
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1r47
        Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
        Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
        Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
        Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
        Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)
      
      That is a reasonable gain on two relatively long-lived workloads.  While
      not presented, there is also a substantial drop in system CPu usage and
      the NUMA balancing stats show similar improvements in locality as btrfs
      did.
      
      Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09a913a7
    • T
      mm/hmm.c: remove superfluous RCU protection around radix tree lookup · 18be460e
      Tejun Heo 提交于
      hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
      actually uses the RCU protection.  The only caller is
      hmm_devmem_pages_create() which already grabs the mutex and does
      superfluous rcu_read_lock/unlock() around the function.
      
      This doesn't add anything and just adds to confusion.  Remove the RCU
      protection and open-code the radix tree lookup.  If this needs to become
      more sophisticated in the future, let's add them back when necessary.
      
      Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18be460e
    • J
      mm/hmm: use device driver encoding for HMM pfn · f88a1e90
      Jérôme Glisse 提交于
      Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
      pfn shift value allowing them to define their own encoding for HMM pfn
      that are fill inside the pfns array of the hmm_range struct.  With this
      device driver can get pfn that match their own private encoding out of HMM
      without having to do any conversion.
      
      [rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
        Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
      [rcampbell@nvidia.com: clarify fault logic for device private memory]
        Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NRalph Campbell <rcampbell@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f88a1e90
    • J
      mm/hmm: change hmm_vma_fault() to allow write fault on page basis · 2aee09d8
      Jérôme Glisse 提交于
      This changes hmm_vma_fault() to not take a global write fault flag for a
      range but instead rely on caller to populate HMM pfns array with proper
      fault flag ie HMM_PFN_VALID if driver want read fault for that address or
      HMM_PFN_VALID and HMM_PFN_WRITE for write.
      
      Moreover by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
      device private memory to be migrated back to system memory through page
      fault.
      
      This is more flexible API and it better reflects how device handles and
      reports fault.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2aee09d8
    • J
      mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd() · 53f5c3f4
      Jérôme Glisse 提交于
      No functional change, just create one function to handle pmd and one to
      handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f5c3f4
    • J
      mm/hmm: move hmm_pfns_clear() closer to where it is used · 33cd47dc
      Jérôme Glisse 提交于
      Move hmm_pfns_clear() closer to where it is used to make it clear it is
      not use by page table walkers.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33cd47dc
    • J
      mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE · b2744118
      Jérôme Glisse 提交于
      Make naming consistent across code, DEVICE_PRIVATE is the name use outside
      HMM code so use that one.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2744118
    • J
      mm/hmm: do not differentiate between empty entry or missing directory · 5504ed29
      Jérôme Glisse 提交于
      There is no point in differentiating between a range for which there is
      not even a directory (and thus entries) and empty entry (pte_none() or
      pmd_none() returns true).
      
      Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
      duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5504ed29
    • J
      mm/hmm: cleanup special vma handling (VM_SPECIAL) · 855ce7d2
      Jérôme Glisse 提交于
      Special vma (one with any of the VM_SPECIAL flags) can not be access by
      device because there is no consistent model across device drivers on those
      vma and their backing memory.
      
      This patch directly use hmm_range struct for hmm_pfns_special() argument
      as it is always affecting the whole vma and thus the whole range.
      
      It also make behavior consistent after this patch both hmm_vma_fault() and
      hmm_vma_get_pfns() returns -EINVAL when facing such vma.  Previously
      hmm_vma_fault() returned 0 and hmm_vma_get_pfns() return -EINVAL but both
      were filling the HMM pfn array with special entry.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      855ce7d2
    • J
      mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong · ff05c0c6
      Jérôme Glisse 提交于
      All device driver we care about are using 64bits page table entry.  In
      order to match this and to avoid useless define convert all HMM pfn to
      directly use uint64_t.  It is a first step on the road to allow driver to
      directly use pfn value return by HMM (saving memory and CPU cycles use for
      conversion between the two).
      
      Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff05c0c6
    • J
      mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture · 86586a41
      Jérôme Glisse 提交于
      Only peculiar architecture allow write without read thus assume that any
      valid pfn do allow for read.  Note we do not care for write only because
      it does make sense with thing like atomic compare and exchange or any
      other operations that allow you to get the memory value through them.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86586a41
    • J
      mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters · 08232a45
      Jérôme Glisse 提交于
      Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range struct
      as parameter and were initializing that struct with others of their
      parameters.  Have caller of those function do this as they are likely to
      already do and only pass this struct to both function this shorten
      function signature and make it easier in the future to add new parameters
      by simply adding them to the structure.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08232a45
    • J
      mm/hmm: hmm_pfns_bad() was accessing wrong struct · c719547f
      Jérôme Glisse 提交于
      The private field of mm_walk struct point to an hmm_vma_walk struct and
      not to the hmm_range struct desired.  Fix to get proper struct pointer.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c719547f
    • J
      mm/hmm: unregister mmu_notifier when last HMM client quit · c01cbba2
      Jérôme Glisse 提交于
      This code was lost in translation at one point.  This properly call
      mmu_notifier_unregister_no_release() once last user is gone.  This fix the
      zombie mm_struct as without this patch we do not drop the refcount we have
      on it.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c01cbba2
    • R
      mm/hmm: HMM should have a callback before MM is destroyed · e1401513
      Ralph Campbell 提交于
      hmm_mirror_register() registers a callback for when the CPU pagetable is
      modified.  Normally, the device driver will call hmm_mirror_unregister()
      when the process using the device is finished.  However, if the process
      exits uncleanly, the struct_mm can be destroyed with no warning to the
      device driver.
      
      Link: http://lkml.kernel.org/r/20180323005527.758-4-jglisse@redhat.comSigned-off-by: NRalph Campbell <rcampbell@nvidia.com>
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1401513
    • S
      mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event · d51d1e64
      Steven Rostedt 提交于
      The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
      parameters! Seven of them are from the reclaim_stat structure.  This
      structure is currently local to mm/vmscan.c.  By moving it to the global
      vmstat.h header, we can also reference it from the vmscan tracepoints.
      In moving it, it brings down the overhead of passing so many arguments
      to the trace event.  In the future, we may limit the number of arguments
      that a trace event may pass (ideally just 6, but more realistically it
      may be 8).
      
      Before this patch, the code to call the trace event is this:
      
       0f 83 aa fe ff ff       jae    ffffffff811e6261 <shrink_inactive_list+0x1e1>
       48 8b 45 a0             mov    -0x60(%rbp),%rax
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       44 8b 6d d4             mov    -0x2c(%rbp),%r13d
       8b 4d d0                mov    -0x30(%rbp),%ecx
       44 8b 75 cc             mov    -0x34(%rbp),%r14d
       44 8b 7d c8             mov    -0x38(%rbp),%r15d
       48 89 45 90             mov    %rax,-0x70(%rbp)
       8b 83 b8 fe ff ff       mov    -0x148(%rbx),%eax
       8b 55 c0                mov    -0x40(%rbp),%edx
       8b 7d c4                mov    -0x3c(%rbp),%edi
       8b 75 b8                mov    -0x48(%rbp),%esi
       89 45 80                mov    %eax,-0x80(%rbp)
       65 ff 05 e4 f7 e2 7e    incl   %gs:0x7ee2f7e4(%rip)        # 15bd0 <__preempt_count>
       48 8b 05 75 5b 13 01    mov    0x1135b75(%rip),%rax        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       48 85 c0                test   %rax,%rax
       74 72                   je     ffffffff811e646a <shrink_inactive_list+0x3ea>
       48 89 c3                mov    %rax,%rbx
       4c 8b 10                mov    (%rax),%r10
       89 f8                   mov    %edi,%eax
       48 89 85 68 ff ff ff    mov    %rax,-0x98(%rbp)
       89 f0                   mov    %esi,%eax
       48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
       89 c8                   mov    %ecx,%eax
       48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
       89 d0                   mov    %edx,%eax
       48 89 85 70 ff ff ff    mov    %rax,-0x90(%rbp)
       8b 45 8c                mov    -0x74(%rbp),%eax
       48 8b 7b 08             mov    0x8(%rbx),%rdi
       48 83 c3 18             add    $0x18,%rbx
       50                      push   %rax
       41 54                   push   %r12
       41 55                   push   %r13
       ff b5 78 ff ff ff       pushq  -0x88(%rbp)
       41 56                   push   %r14
       41 57                   push   %r15
       ff b5 70 ff ff ff       pushq  -0x90(%rbp)
       4c 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%r9
       4c 8b 85 60 ff ff ff    mov    -0xa0(%rbp),%r8
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       48 8b 55 90             mov    -0x70(%rbp),%rdx
       8b 75 80                mov    -0x80(%rbp),%esi
       41 ff d2                callq  *%r10
      
      After the patch:
      
       0f 83 a8 fe ff ff       jae    ffffffff811e626d <shrink_inactive_list+0x1cd>
       8b 9b b8 fe ff ff       mov    -0x148(%rbx),%ebx
       45 8b 64 24 20          mov    0x20(%r12),%r12d
       4c 8b 6d a0             mov    -0x60(%rbp),%r13
       65 ff 05 f5 f7 e2 7e    incl   %gs:0x7ee2f7f5(%rip)        # 15bd0 <__preempt_count>
       4c 8b 35 86 5b 13 01    mov    0x1135b86(%rip),%r14        # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
       4d 85 f6                test   %r14,%r14
       74 2a                   je     ffffffff811e6411 <shrink_inactive_list+0x371>
       49 8b 06                mov    (%r14),%rax
       8b 4d 8c                mov    -0x74(%rbp),%ecx
       49 8b 7e 08             mov    0x8(%r14),%rdi
       49 83 c6 18             add    $0x18,%r14
       4c 89 ea                mov    %r13,%rdx
       45 89 e1                mov    %r12d,%r9d
       4c 8d 45 b8             lea    -0x48(%rbp),%r8
       89 de                   mov    %ebx,%esi
       51                      push   %rcx
       48 8b 4d 98             mov    -0x68(%rbp),%rcx
       ff d0                   callq  *%rax
      
      Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
      Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.homeSigned-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reported-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d51d1e64
    • A
      mm/vmscan: don't mess with pgdat->flags in memcg reclaim · e3c1ac58
      Andrey Ryabinin 提交于
      memcg reclaim may alter pgdat->flags based on the state of LRU lists in
      cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep
      congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
      pages.  But the worst here is PGDAT_CONGESTED, since it may force all
      direct reclaims to stall in wait_iff_congested().  Note that only kswapd
      have powers to clear any of these bits.  This might just never happen if
      cgroup limits configured that way.  So all direct reclaims will stall as
      long as we have some congested bdi in the system.
      
      Leave all pgdat->flags manipulations to kswapd.  kswapd scans the whole
      pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
      it's reasonable to leave all decisions about node state to kswapd.
      
      Why only kswapd? Why not allow to global direct reclaim change these
      flags? It is because currently only kswapd can clear these flags.  I'm
      less worried about the case when PGDAT_CONGESTED falsely not set, and
      more worried about the case when it falsely set.  If direct reclaimer
      sets PGDAT_CONGESTED, do we have guarantee that after the congestion
      problem is sorted out, kswapd will be woken up and clear the flag? It
      seems like there is no such guarantee.  E.g.  direct reclaimers may
      eventually balance pgdat and kswapd simply won't wake up (see
      wakeup_kswapd()).
      
      Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim
      now loses its congestion throttling mechanism.  Add per-cgroup
      congestion state and throttle cgroup2 reclaimers if memcg is in
      congestion state.
      
      Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
      bits since they alter only kswapd behavior.
      
      The problem could be easily demonstrated by creating heavy congestion in
      one cgroup:
      
          echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
          mkdir -p /sys/fs/cgroup/congester
          echo 512M > /sys/fs/cgroup/congester/memory.max
          echo $$ > /sys/fs/cgroup/congester/cgroup.procs
          /* generate a lot of diry data on slow HDD */
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
          ....
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
      
      and some job in another cgroup:
      
          mkdir /sys/fs/cgroup/victim
          echo 128M > /sys/fs/cgroup/victim/memory.max
      
          # time cat /dev/sda > /dev/null
          real    10m15.054s
          user    0m0.487s
          sys     1m8.505s
      
      According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
      of the time sleeping there.
      
      With the patch, cat don't waste time anymore:
      
          # time cat /dev/sda > /dev/null
          real    5m32.911s
          user    0m0.411s
          sys     0m56.664s
      
      [aryabinin@virtuozzo.com: congestion state should be per-node]
        Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
      [ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup[
        Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.comSigned-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3c1ac58