1. 04 June 2020, 34 commits
    • mm: memcontrol: switch to native NR_ANON_THPS counter · 468c3982
      Johannes Weiner committed
      With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a
      small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
      native NR_ANON_THPS accounting sites.
      
      [hannes@cmpxchg.org: fixes]
        Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>	[build-tested]
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      468c3982
    • mm: memcontrol: switch to native NR_ANON_MAPPED counter · be5d0a74
      Johannes Weiner committed
      Memcg maintains a private MEMCG_RSS counter.  This divergence from the
      generic VM accounting means unnecessary code overhead, and creates a
      dependency for memcg that page->mapping is set up at the time of charging,
      so that page types can be told apart.
      
      Convert the generic accounting sites to mod_lruvec_page_state and friends
      to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED.  We use
      lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
      same way we do for NR_FILE_MAPPED.
      
      With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
      counter, this patch finally eliminates the need to have page->mapping set
      up at charge time.  However, we need to have page->mem_cgroup set up by
      the time rmap runs and does the accounting, so switch the commit and the
      rmap callbacks around.
      
      v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)
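      
      For illustration only, a minimal sketch of the accounting pattern this
      commit describes (the helper name is hypothetical, not the literal
      patch):
      
          /* Charge state is stabilized with lock_page_memcg() so the
           * per-cgroup NR_ANON_MAPPED counter can be updated from the
           * rmap side, mirroring what is done for NR_FILE_MAPPED. */
          static void account_anon_mapped(struct page *page, int nr_pages)
          {
                  lock_page_memcg(page);          /* pins page->mem_cgroup */
                  __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages);
                  unlock_page_memcg(page);
          }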
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be5d0a74
    • mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters · 0d1c2072
      Johannes Weiner committed
      Memcg maintains private MEMCG_CACHE and NR_SHMEM counters.  This
      divergence from the generic VM accounting means unnecessary code overhead,
      and creates a dependency for memcg that page->mapping is set up at the
      time of charging, so that page types can be told apart.
      
      Convert the generic accounting sites to mod_lruvec_page_state and friends
      to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
      The page is already locked in these places, so page->mem_cgroup is stable;
      we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
      it's set up in time.
      
      Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
      NR_SHMEM accounting sites.
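      
      For illustration, a hedged sketch of the counter updates at a page cache
      insertion site after this change (surrounding code elided; the page is
      assumed locked so page->mem_cgroup is stable):
      
          __mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
          if (PageSwapBacked(page))
                  __mod_lruvec_page_state(page, NR_SHMEM, nr);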
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d1c2072
    • mm: memcontrol: prepare cgroup vmstat infrastructure for native anon counters · 9da7b521
      Johannes Weiner committed
      Anonymous compound pages can be mapped by ptes, which means that if we
      want to track NR_ANON_MAPPED and NR_ANON_THPS on a per-cgroup basis, we
      have to be prepared to see tail pages in our accounting functions.
      
      Make mod_lruvec_page_state() and lock_page_memcg() deal with tail pages
      correctly, namely by redirecting to the head page which has the
      page->mem_cgroup set up.
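      
      A minimal sketch of the head-page redirect described above (close to,
      but not necessarily identical to, the actual helper):
      
          static inline void __mod_lruvec_page_state(struct page *page,
                                                     enum node_stat_item idx, int val)
          {
                  struct page *head = compound_head(page); /* rmap on tail pages */
                  pg_data_t *pgdat = page_pgdat(page);
                  struct lruvec *lruvec;
      
                  /* Untracked pages have no memcg, no lruvec: update the node only */
                  if (!head->mem_cgroup) {
                          __mod_node_page_state(pgdat, idx, val);
                          return;
                  }
      
                  lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
                  __mod_lruvec_state(lruvec, idx, val);
          }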
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-9-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9da7b521
    • mm: memcontrol: convert page cache to a new mem_cgroup_charge() API · 3fea5a49
      Johannes Weiner committed
      The try/commit/cancel protocol that memcg uses dates back to when pages
      used to be uncharged upon removal from the page cache, and thus couldn't
      be committed before the insertion had succeeded.  Nowadays, pages are
      uncharged when they are physically freed; it doesn't matter whether the
      insertion was successful or not.  For the page cache, the transaction
      dance has become unnecessary.
      
      Introduce a mem_cgroup_charge() function that simply charges a newly
      allocated page to a cgroup and sets up page->mem_cgroup in one single
      step.  If the insertion fails, the caller doesn't have to do anything but
      free/put the page.
      
      Then switch the page cache over to this new API.
      
      Subsequent patches will also convert anon pages, but it needs a bit more
      prep work.  Right now, memcg depends on page->mapping being already set up
      at the time of charging, so that it can maintain its own MEMCG_CACHE and
      MEMCG_RSS counters.  For anon, page->mapping is set under the same pte
      lock under which the page is published, so a single charge point that can
      block doesn't work there just yet.
      
      The following prep patches will replace the private memcg counters with
      the generic vmstat counters, thus removing the page->mapping dependency,
      then complete the transition to the new single-point charge API and delete
      the old transactional scheme.
      
      v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
      v3: rebase on preceding shmem simplification patch
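      
      For illustration, a simplified sketch of the page cache charge path
      after the conversion (arguments shown as they end up once the whole
      series lands; error handling reduced to the essential point):
      
          /* Somewhere in the page cache insertion path: charge and set
           * page->mem_cgroup in one step. */
          error = mem_cgroup_charge(page, current->mm, gfp_mask);
          if (error)
                  return error;   /* caller just frees/puts the page */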
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fea5a49
    • mm: memcontrol: move out cgroup swaprate throttling · 6caa6a07
      Johannes Weiner committed
      The cgroup swaprate throttling is about matching new anon allocations to
      the rate of available IO when that is being throttled.  It's the io
      controller hooking into the VM, rather than a memory controller thing.
      
      Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and
      drop the @memcg argument which is only used to check whether the preceding
      page charge has succeeded and the fault is proceeding.
      
      We could decouple the call from mem_cgroup_try_charge() here as well, but
      that would cause unnecessary churn: the following patches convert all
      callsites to a new charge API and we'll decouple as we go along.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6caa6a07
    • mm: memcontrol: drop @compound parameter from memcg charging API · 3fba69a5
      Johannes Weiner committed
      The memcg charging API carries a boolean @compound parameter that tells
      whether the page we're dealing with is a hugepage.
      mem_cgroup_commit_charge() has another boolean @lrucare that indicates
      whether the page needs LRU locking or not while charging.  The majority of
      callsites know those parameters at compile time, which results in a lot of
      naked "false, false" argument lists.  This makes for cryptic code and is a
      breeding ground for subtle mistakes.
      
      Thankfully, the huge page state can be inferred from the page itself and
      doesn't need to be passed along.  This is safe because charging completes
      before the page is published and somebody may split it.
      
      Simplify the callsites by removing @compound, and let memcg infer the
      state by using hpage_nr_pages() unconditionally.  That function does
      PageTransHuge() to identify huge pages, which also helpfully asserts that
      nobody passes in tail pages by accident.
      
      The following patches will introduce a new charging API, best not to carry
      over unnecessary weight.
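      
      For illustration, a representative call-site change, shown as a hedged
      sketch rather than the literal patch:
      
          /* before: callers had to say whether the page was huge */
          mem_cgroup_commit_charge(page, memcg, false, false);
      
          /* after: memcg derives the size itself via hpage_nr_pages(page),
           * which uses PageTransHuge() and rejects tail pages */
          mem_cgroup_commit_charge(page, memcg, false);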
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fba69a5
    • mm/vmscan: count lazyfree pages and fix nr_isolated_* mismatch · 1f318a9b
      Jaewon Kim committed
      Fix an nr_isolate_* mismatch problem between cma and dirty lazyfree pages.
      
      If try_to_unmap_one is used for reclaim and it detects a dirty lazyfree
      page, then the lazyfree page is changed to a normal anon page having
      SwapBacked by commit 802a3a92 ("mm: reclaim MADV_FREE pages").  Even
      with that change, the reclaim context still counts isolated files
      correctly because it uses is_file_lru to distinguish file pages.  The
      change to anon does not happen when try_to_unmap_one is used for
      migration, so migration contexts like compaction also count isolated
      files correctly even though they use page_is_file_lru instead of
      is_file_lru.  (page_is_file_cache was recently renamed to
      page_is_file_lru by commit 9de4f22a ("mm: code cleanup for MADV_FREE").)
      
      But the nr_isolate_* mismatch problem happens on cma alloc.  There is
      reclaim_clean_pages_from_list which is being used only by cma.  It was
      introduced by commit 02c6de8d ("mm: cma: discard clean pages during
      contiguous allocation instead of migration") to reclaim clean file pages
      without migration.  The cma alloc uses both reclaim_clean_pages_from_list
      and migrate_pages, and it uses page_is_file_lru to count isolated files.
      If there are dirty lazyfree pages allocated from the cma memory region,
      the pages are counted as isolated file at the beginning but as isolated
      anon once reclaim finishes.
      
      Mem-Info:
      Node 0 active_anon:3045904kB inactive_anon:611448kB active_file:14892kB inactive_file:205636kB unevictable:10416kB isolated(anon):0kB isolated(file):37664kB mapped:630216kB dirty:384kB writeback:0kB shmem:42576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
      
      As the log above shows, there were too many isolated files (37664kB),
      which triggers too_many_isolated in reclaim even though no file pages
      are actually isolated system-wide.  It can be reproduced by running two
      programs, one writing to MADV_FREE pages and the other doing cma alloc.
      Although isolated anon reads 0, the internal value of isolated anon was
      the negative of the isolated file count.
      
      Fix this by compensating the isolated count for both LRU lists.  Count
      non-discarded lazyfree pages in shrink_page_list, then compensate the
      counted number in reclaim_clean_pages_from_list.
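      
      For illustration, a hedged sketch of the compensation in
      reclaim_clean_pages_from_list (the stat field name is assumed for the
      example, not quoted from the patch):
      
          nr_reclaimed = shrink_page_list(&clean_pages, pgdat, &sc,
                                          TTU_IGNORE_ACCESS, &stat, true);
          list_splice(&clean_pages, page_list);
          mod_node_page_state(pgdat, NR_ISOLATED_FILE, -nr_reclaimed);
          /* Lazyfree pages that stayed dirty became anon: move their
           * isolated accounting from the file counter to the anon counter. */
          mod_node_page_state(pgdat, NR_ISOLATED_ANON, stat.nr_lazyfree_fail);
          mod_node_page_state(pgdat, NR_ISOLATED_FILE, -stat.nr_lazyfree_fail);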
      Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200426011718.30246-1-jaewon31.kim@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f318a9b
    • mm: simplify calling a compound page destructor · ff45fc3c
      Matthew Wilcox (Oracle) committed
      None of the three callers of get_compound_page_dtor() want to know the
      value; they just want to call the function.  Replace it with
      destroy_compound_page() which calls the dtor for them.
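      
      A minimal sketch of the new helper (sanity checks elided; treat the
      exact body as an approximation):
      
          static inline void destroy_compound_page(struct page *page)
          {
                  /* look up and call the destructor in one step */
                  compound_page_dtors[page[1].compound_dtor](page);
          }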
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff45fc3c
    • mm/hugetlb: define a generic fallback for arch_clear_hugepage_flags() · 5be99343
      Anshuman Khandual committed
      There are multiple similar definitions for arch_clear_hugepage_flags()
      on various platforms.  Let's just add a generic fallback definition for
      platforms that do not override it.  This helps reduce code duplication.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/1588907271-11920-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5be99343
    • mm/hugetlb: define a generic fallback for is_hugepage_only_range() · b0eae98c
      Anshuman Khandual committed
      There are multiple similar definitions for is_hugepage_only_range() on
      various platforms.  Let's just add a generic fallback definition for
      platforms that do not override it.  This helps reduce code duplication.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/1588907271-11920-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0eae98c
    • arm64/mm: drop __HAVE_ARCH_HUGE_PTEP_GET · be51e3fd
      Anshuman Khandual committed
      Patch series "mm/hugetlb: Add some new generic fallbacks", v3.
      
      This series adds the following new generic fallbacks.  Before that it
      drops __HAVE_ARCH_HUGE_PTEP_GET from arm64 platform.
      
      1. is_hugepage_only_range()
      2. arch_clear_hugepage_flags()
      
      After this, arm (32 bit) remains the sole platform defining its own
      huge_ptep_get() via __HAVE_ARCH_HUGE_PTEP_GET.
      
      This patch (of 3):
      
      Platform specific huge_ptep_get() is required only when fetching the huge
      PTE involves more than just dereferencing the page table pointer.  This is
      not the case on the arm64 platform, hence its huge_ptep_get() can be
      dropped along with its __HAVE_ARCH_HUGE_PTEP_GET subscription.  Before
      doing that, update the generic huge_ptep_get() to use READ_ONCE(), which
      prevents known page table issues with THP on arm64.
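      
      For illustration, the generic fallback after this change, sketched from
      the description above:
      
          #ifndef __HAVE_ARCH_HUGE_PTEP_GET
          static inline pte_t huge_ptep_get(pte_t *ptep)
          {
                  return READ_ONCE(*ptep);        /* was a plain *ptep load */
          }
          #endif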
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/1588907271-11920-1-git-send-email-anshuman.khandual@arm.com
      Link: http://lkml.kernel.org/r//1506527369-19535-1-git-send-email-will.deacon@arm.com/
      Link: http://lkml.kernel.org/r/1588907271-11920-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be51e3fd
    • hugetlbfs: move hugepagesz= parsing to arch independent code · 359f2544
      Mike Kravetz committed
      Now that architectures provide arch_hugetlb_valid_size(), parsing of
      "hugepagesz=" can be done in architecture independent code.  Create a
      single routine to handle hugepagesz= parsing and remove all arch specific
      routines.  We can also remove the interface hugetlb_bad_size() as this is
      no longer used outside arch independent code.
      
      This also provides consistent behavior of hugetlbfs command line options.
      The hugepagesz= option should only be specified once for a specific size,
      but some architectures allow multiple instances.  This appears to be more
      of an oversight from when code was added by some architectures to set up
      all huge page sizes.
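      
      For illustration, a simplified sketch of what an arch-independent
      hugepagesz= handler looks like (duplicate-size handling and error paths
      elided, details approximate):
      
          static int __init hugepagesz_setup(char *s)
          {
                  unsigned long size = memparse(s, NULL);
      
                  if (!arch_hugetlb_valid_size(size)) {
                          pr_err("HugeTLB: unsupported hugepagesz=%s\n", s);
                          return 0;
                  }
                  hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
                  return 1;
          }
          __setup("hugepagesz=", hugepagesz_setup);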
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Mina Almasry <almasrymina@google.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200417185049.275845-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-3-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      359f2544
    • hugetlbfs: add arch_hugetlb_valid_size · ae94da89
      Mike Kravetz committed
      Patch series "Clean up hugetlb boot command line processing", v4.
      
      Longpeng(Mike) reported a weird message from hugetlb command line
      processing and proposed a solution [1].  While the proposed patch does
      address the specific issue, there are other related issues in command line
      processing.  As hugetlbfs evolved, updates to command line processing have
      been made to meet immediate needs and not necessarily in a coordinated
      manner.  The result is that some processing is done in arch specific code,
      some is done in arch independent code and coordination is problematic.
      Semantics can vary between architectures.
      
      The patch series does the following:
      - Define arch specific arch_hugetlb_valid_size routine used to validate
        passed huge page sizes.
      - Move hugepagesz= command line parsing out of arch specific code and into
        an arch independent routine.
      - Clean up command line processing to follow desired semantics and
        document those semantics.
      
      [1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpeng2@huawei.com
      
      This patch (of 3):
      
      The architecture independent routine hugetlb_default_setup sets up the
      default huge pages size.  It has no way to verify if the passed value is
      valid, so it accepts it and attempts to validate at a later time.  This
      requires undocumented cooperation between the arch specific and arch
      independent code.
      
      For architectures that support more than one huge page size, provide a
      routine arch_hugetlb_valid_size to validate a huge page size.
      hugetlb_default_setup can use this to validate passed values.
      
      arch_hugetlb_valid_size will also be used in a subsequent patch to move
      processing of the "hugepagesz=" in arch specific code to a common routine
      in arch independent code.
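      
      For illustration, what an architecture's validator could look like (an
      x86-style example, not quoted from the patch):
      
          bool __init arch_hugetlb_valid_size(unsigned long size)
          {
                  if (size == PMD_SIZE)
                          return true;
                  if (size == PUD_SIZE && boot_cpu_has(X86_FEATURE_GBPAGES))
                          return true;
                  return false;
          }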
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200428205614.246260-1-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-2-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200417185049.275845-1-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200417185049.275845-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae94da89
    • khugepaged: introduce 'max_ptes_shared' tunable · 71a2c112
      Kirill A. Shutemov committed
      'max_ptes_shared' specifies how many pages can be shared across multiple
      processes.  Exceeding the number would block the collapse::
      
      	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
      
      A higher value may increase memory footprint for some workloads.
      
      By default, at least half of the pages must not be shared.
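      
      For illustration, a hedged sketch of the kind of check khugepaged makes
      while scanning a PMD range (surrounding code and labels simplified):
      
          if (page_mapcount(page) > 1 &&
              ++shared > khugepaged_max_ptes_shared) {
                  result = SCAN_EXCEED_SHARED_PTE;
                  goto out;
          }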
      
      [colin.king@canonical.com: fix several spelling mistakes]
        Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71a2c112
    • mm: make deferred init's max threads arch-specific · ecd09650
      Daniel Jordan committed
      Using padata during deferred init has only been tested on x86, so for now
      limit it to this architecture.
      
      If another arch wants this, it can find the max thread limit that's best
      for it and override deferred_page_init_max_threads().
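      
      For illustration, a sketch of the hook this describes (shapes
      approximate): a weak default that keeps deferred init single-threaded,
      with an x86 override based on the node's CPUs:
      
          int __weak __init
          deferred_page_init_max_threads(const struct cpumask *node_cpumask)
          {
                  return 1;
          }
      
          /* arch/x86 override (illustrative) */
          int __init
          deferred_page_init_max_threads(const struct cpumask *node_cpumask)
          {
                  return max_t(int, cpumask_weight(node_cpumask), 1);
          }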
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-8-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecd09650
    • padata: add basic support for multithreaded jobs · 004ed426
      Daniel Jordan committed
      Sometimes the kernel doesn't take full advantage of system memory
      bandwidth, leading to a single CPU spending excessive time in
      initialization paths where the data scales with memory size.
      
      Multithreading naturally addresses this problem.
      
      Extend padata, a framework that handles many parallel yet singlethreaded
      jobs, to also handle multithreaded jobs by adding support for splitting up
      the work evenly, specifying a minimum amount of work that's appropriate
      for one helper thread to do, load balancing between helpers, and
      coordinating them.
      
      This is inspired by work from Pavel Tatashin and Steve Sistare.
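      
      For illustration, how a caller hands padata a multithreaded job under
      this extension.  The callback name and the start/size/max_threads
      values are placeholders; treat field details as approximate:
      
          struct padata_mt_job job = {
                  .thread_fn   = init_chunk_fn,        /* worker callback */
                  .fn_arg      = arg,
                  .start       = start,
                  .size        = end - start,
                  .align       = PAGES_PER_SECTION,    /* keep chunks aligned */
                  .min_chunk   = PAGES_PER_SECTION,    /* smallest useful unit */
                  .max_threads = max_threads,
          };
      
          padata_do_multithreaded(&job);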
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-5-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      004ed426
    • padata: allocate work structures for parallel jobs from a pool · 4611ce22
      Daniel Jordan committed
      padata allocates per-CPU, per-instance work structs for parallel jobs.  A
      do_parallel call assigns a job to a sequence number and hashes the number
      to a CPU, where the job will eventually run using the corresponding work.
      
      This approach fit with how padata used to bind a job to each CPU
      round-robin, but it makes less sense after commit bfde23ce ("padata:
      unbind parallel jobs from specific CPUs") because a work isn't bound to a
      particular CPU anymore, and isn't needed at all for multithreaded jobs
      because they don't have sequence numbers.
      
      Replace the per-CPU works with a preallocated pool, which allows sharing
      them between existing padata users and the upcoming multithreaded user.
      The pool will also facilitate setting NUMA-aware concurrency limits with
      later users.
      
      The pool is sized according to the number of possible CPUs.  With this
      limit, MAX_OBJ_NUM no longer makes sense, so remove it.
      
      If the global pool is exhausted, a parallel job is run in the current task
      instead to throttle a system trying to do too much in parallel.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-4-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4611ce22
    • padata: initialize earlier · f1b192b1
      Daniel Jordan committed
      padata will soon initialize the system's struct pages in parallel, so it
      needs to be ready by page_alloc_init_late().
      
      The error return from padata_driver_init() triggers an initcall warning,
      so add a warning to padata_init() to avoid silent failure.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Robert Elliott <elliott@hpe.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Link: http://lkml.kernel.org/r/20200527173608.2885243-3-daniel.m.jordan@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1b192b1
    • mm: initialize deferred pages with interrupts enabled · 3d060856
      Pavel Tatashin committed
      Initializing struct pages is a long task and keeping interrupts disabled
      for the duration of this operation introduces a number of problems.
      
      1. jiffies are not updated for a long period of time, and thus incorrect
         time is reported. See the proposed solution and discussion here:
         lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      2. It prevents further improving deferred page initialization by allowing
         intra-node multi-threading.
      
      We are keeping interrupts disabled to solve a rather theoretical problem
      that was never observed in the real world (see 3a2d7fa8).
      
      Let's keep interrupts enabled.  In case we ever encounter a scenario
      where an interrupt thread wants to allocate a large amount of memory this
      early in boot, we can deal with that by growing the zone (see
      deferred_grow_zone()) by the needed amount before starting the
      deferred_init_memmap() threads.
      
      Before:
      [    1.232459] node 0 initialised, 12058412 pages in 1ms
      
      After:
      [    1.632580] node 0 initialised, 12051227 pages in 436ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d060856
    • mm/page_alloc: restrict and formalize compound_page_dtors[] · ae70eddd
      Anshuman Khandual committed
      Restrict elements in compound_page_dtors[] array per NR_COMPOUND_DTORS and
      explicitly position them according to enum compound_dtor_id.  This
      improves protection against possible misalignment between
      compound_page_dtors[] and enum compound_dtor_id later on.
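      
      For illustration, the kind of layout this describes, sketched with
      designated initializers (not quoted from the patch):
      
          static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
                  [NULL_COMPOUND_DTOR]    = NULL,
                  [COMPOUND_PAGE_DTOR]    = free_compound_page,
          #ifdef CONFIG_HUGETLB_PAGE
                  [HUGETLB_PAGE_DTOR]     = free_huge_page,
          #endif
          #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                  [TRANSHUGE_PAGE_DTOR]   = free_transhuge_page,
          #endif
          };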
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/1589795958-19317-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae70eddd
    • mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention · 01c0bfe0
      Wei Yang committed
      Pageblock migrate type is encoded in GFP flags, just as zone_type and
      zonelist.
      
      Currently we use gfp_zone() and gfp_zonelist() to extract related
      information; it would be proper to use the same naming convention for
      the migrate type.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200329080823.7735-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c0bfe0
    • mm/page_alloc: integrate classzone_idx and high_zoneidx · 97a225e6
      Joonsoo Kim committed
      classzone_idx is just different name for high_zoneidx now.  So, integrate
      them and add some comment to struct alloc_context in order to reduce
      future confusion about the meaning of this variable.
      
      The accessor, ac_classzone_idx() is also removed since it isn't needed
      after integration.
      
      In addition to integration, this patch also renames high_zoneidx to
      highest_zoneidx since it represents more precise meaning.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ye Xiaolong <xiaolong.ye@intel.com>
      Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a225e6
    • mm/page_alloc.c: remove unused free_bootmem_with_active_regions · 4ca7be24
      Baoquan He committed
      Since commit 397dc00e ("mips: sgi-ip27: switch from DISCONTIGMEM
      to SPARSEMEM"), the last caller of free_bootmem_with_active_regions() was
      gone.  Now no user calls it any more.
      
      Let's remove it.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200402143455.5145-1-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ca7be24
    • mm: rename free_area_init_node() to free_area_init_memoryless_node() · bc9331a1
      Mike Rapoport committed
      free_area_init_node() is only used by x86 to initialize memory-less
      nodes.  Make its name reflect this and drop all the function parameters
      except the node ID, as they are anyway zero.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-19-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc9331a1
    • mm: free_area_init: allow defining max_zone_pfn in descending order · 51930df5
      Mike Rapoport committed
      Some architectures (e.g.  ARC) have the ZONE_HIGHMEM zone below
      ZONE_NORMAL.  Allowing free_area_init() to parse the max_zone_pfn array
      even when it is sorted in descending order allows using free_area_init()
      on such architectures.
      
      Add top -> down traversal of the max_zone_pfn array in free_area_init()
      and use the latter in ARC node/zone initialization.
      
      [rppt@kernel.org: ARC fix]
        Link: http://lkml.kernel.org/r/20200504153901.GM14260@kernel.org
      [rppt@linux.ibm.com: arc: free_area_init(): take into account PAE40 mode]
        Link: http://lkml.kernel.org/r/20200507205900.GH683243@linux.ibm.com
      [akpm@linux-foundation.org: declare arch_has_descending_max_zone_pfns()]
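      
      For illustration, a hedged sketch of the predicate the note above refers
      to and how the traversal might use it (the exact condition is
      simplified):
      
          static bool arch_has_descending_max_zone_pfns(void)
          {
                  return IS_ENABLED(CONFIG_ARC);
          }
      
          /* in free_area_init(): walk the zones in reverse when the arch says
           * max_zone_pfn[] is sorted high -> low */
          descending = arch_has_descending_max_zone_pfns();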
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Link: http://lkml.kernel.org/r/20200412194859.12663-18-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51930df5
    • mm: use free_area_init() instead of free_area_init_nodes() · 9691a071
      Mike Rapoport committed
      free_area_init() has effectively become a wrapper for
      free_area_init_nodes() and there is no point in keeping it.  Still, the
      free_area_init() name is shorter and more general, as it does not imply
      the necessity to initialize multiple nodes.
      
      Rename free_area_init_nodes() to free_area_init(), update the callers and
      drop the old version of free_area_init().
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9691a071
    • mm: free_area_init: use maximal zone PFNs rather than zone sizes · fa3354e4
      Mike Rapoport committed
      Currently, architectures that use free_area_init() to initialize memory
      map and node and zone structures need to calculate zone and hole sizes.
      We can use free_area_init_nodes() instead and let it detect the zone
      boundaries while the architectures will only have to supply the possible
      limits for the zones.
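      
      For illustration, the calling convention this implies for an
      architecture (the values shown are typical placeholders, not taken from
      the patch):
      
          unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };
      
          max_zone_pfn[ZONE_NORMAL] = max_low_pfn;    /* upper limit, not a size */
          #ifdef CONFIG_HIGHMEM
          max_zone_pfn[ZONE_HIGHMEM] = max_pfn;
          #endif
          free_area_init(max_zone_pfn);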
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-5-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa3354e4
    • mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option · 3f08a302
      Mike Rapoport committed
      CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
      nodes and zones structures between the systems that have region to node
      mapping in memblock and those that don't.
      
      Currently all the NUMA architectures enable this option and for the
      non-NUMA systems we can presume that all the memory belongs to node 0 and
      therefore the compile time configuration option is not required.
      
      The remaining few architectures that use DISCONTIGMEM without NUMA are
      easily updated to use memblock_add_node() instead of memblock_add() and
      thus have proper correspondence of memblock regions to NUMA nodes.
      
      Still, free_area_init_node() must have a backward compatible version
      because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP
      differ.  Once all the architectures use the new semantics, the entire
      compatibility layer can be dropped.
      
      To avoid addition of extra run time memory to store node id for
      architectures that keep memblock but have only a single node, the node id
      field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
      the corresponding accessors presume that in those cases it is always 0.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f08a302
    • mm: make early_pfn_to_nid() and related definitions close to each other · 6f24fbd3
      Mike Rapoport committed
      early_pfn_to_nid() and its helper __early_pfn_to_nid() are spread around
      include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c.
      
      Drop unused stub for __early_pfn_to_nid() and move its actual generic
      implementation close to its users.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-3-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f24fbd3
    • M
      mm: clarify __GFP_MEMALLOC usage · 574c1ae6
      Committed by Michal Hocko
      It seems that the existing documentation is not explicit enough about the
      expected usage and the potential risks.  While it does call out that
      users have to free memory when using this flag, it is not really apparent
      that users have to be careful not to deplete memory reserves and that they
      should implement some sort of throttling with respect to the freeing process.
      
      This is partly based on Neil's explanation [1].
      
      Let's also call out that a pre-allocated pool allocator should be
      considered.
      
      [1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
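      
      As a hypothetical illustration of that guidance (the pool name and sizes
      below are invented, not taken from any real driver): prefer a pool that
      was filled up front, and touch the emergency reserves only as a last
      resort, freeing the buffer again as soon as the write-out completes.
      
       #include <linux/gfp.h>
       #include <linux/mempool.h>
       #include <linux/slab.h>
      
       static mempool_t *reclaim_pool;	/* pre-allocated at init time */
      
       static void *grab_reclaim_buffer(void)
       {
       	void *buf = mempool_alloc(reclaim_pool, GFP_NOWAIT);
      
       	if (buf)
       		return buf;
      
       	/*
       	 * Dips into the emergency reserves: the caller must guarantee
       	 * that this allocation leads to memory being freed shortly,
       	 * and must throttle itself against the freeing process.
       	 */
       	return kmalloc(PAGE_SIZE, GFP_NOWAIT | __GFP_MEMALLOC);
       }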
      
      [akpm@linux-foundation.org: coding style fixes]
      [mhocko@kernel.org: update]
        Link: http://lkml.kernel.org/r/20200406070137.GC19426@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200403083543.11552-2-mhocko@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      574c1ae6
    • D
      string.h: fix incompatibility between FORTIFY_SOURCE and KASAN · 47227d27
      Committed by Daniel Axtens
      The memcmp KASAN self-test fails on a kernel with both KASAN and
      FORTIFY_SOURCE.
      
      When FORTIFY_SOURCE is on, a number of functions are replaced with
      fortified versions, which attempt to check the sizes of the operands.
      However, these functions often directly invoke __builtin_foo() once they
      have performed the fortify check.  Using __builtins may bypass KASAN
      checks if the compiler decides to inline its own implementation as a
      sequence of instructions, rather than emitting a function call that goes
      out to a KASAN-instrumented implementation.
      
      Why is only memcmp affected?
      ============================
      
      Of the string and string-like functions that kasan_test tests, only memcmp
      is replaced by an inline sequence of instructions in my testing on x86
      with gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2).
      
      I believe this is due to compiler heuristics.  For example, if I annotate
      kmalloc calls with the alloc_size annotation (and disable some fortify
      compile-time checking!), the compiler will replace every memset except the
      one in kmalloc_uaf_memset with inline instructions.  (I have some WIP
      patches to add this annotation.)
      
      Does this affect other functions in string.h?
      =============================================
      
      Yes. Anything that uses __builtin_* rather than __real_* could be
      affected. This looks like:
      
       - strncpy
       - strcat
       - strlen
       - strlcpy maybe, under some circumstances?
       - strncat under some circumstances
       - memset
       - memcpy
       - memmove
       - memcmp (as noted)
       - memchr
       - strcpy
      
      Whether a function call is emitted always depends on the compiler.  Most
      bugs should get caught by FORTIFY_SOURCE, but the missed memcmp test shows
      that this is not always the case.
      
      Isn't FORTIFY_SOURCE disabled with KASAN?
      =========================================
      
      The string headers on all arches supporting KASAN disable fortify with
      kasan, but only when address sanitisation is _also_ disabled.  For example
      from x86:
      
       #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
       /*
        * For files that are not instrumented (e.g. mm/slub.c) we
        * should use not instrumented version of mem* functions.
        */
       #define memcpy(dst, src, len) __memcpy(dst, src, len)
       #define memmove(dst, src, len) __memmove(dst, src, len)
       #define memset(s, c, n) __memset(s, c, n)
      
       #ifndef __NO_FORTIFY
       #define __NO_FORTIFY /* FORTIFY_SOURCE uses __builtin_memcpy, etc. */
       #endif
      
       #endif
      
      This comes from commit 6974f0c4 ("include/linux/string.h: add the
      option of fortified string.h functions"), and doesn't work when KASAN is
      enabled and the file is supposed to be sanitised - as with test_kasan.c
      
      I'm pretty sure this is not wrong, but not as expansive as it should be:
      
       * we shouldn't use __builtin_memcpy etc in files where we don't have
         instrumentation - it could devolve into a function call to memcpy,
         which will be instrumented. Rather, we should use __memcpy which
         by convention is not instrumented.
      
       * we also shouldn't be using __builtin_memcpy when we have a KASAN
         instrumented file, because it could be replaced with inline asm
         that will not be instrumented.
      
      What is correct behaviour?
      ==========================
      
      Firstly, there is some overlap between fortification and KASAN: both
      provide some level of _runtime_ checking. Only fortify provides
      compile-time checking.
      
      KASAN and fortify can pick up different things at runtime:
      
       - Some fortify functions, notably the string functions, could easily be
         modified to consider sub-object sizes (e.g. members within a struct),
         and I have some WIP patches to do this. KASAN cannot detect these
         because it cannot insert poison between members of a struct.
      
       - KASAN can detect many over-reads/over-writes when the sizes of both
         operands are unknown, which fortify cannot.
      
      So there are a couple of options:
      
       1) Flip the test: disable fortify in sanitised files and enable it in
          unsanitised files. This at least stops us missing KASAN checking, but
          we lose the fortify checking.
      
       2) Make the fortify code always call out to real versions. Do this only
          for KASAN, for fear of losing the inlining opportunities we get from
          __builtin_*.
      
      (We can't use kasan_check_{read,write}: because the fortify functions are
      _extern inline_, you can't include _static_ inline functions without a
      compiler warning. kasan_check_{read,write} are static inline so we can't
      use them even when they would otherwise be suitable.)
      
      Take approach 2 and call out to real versions when KASAN is enabled.
      
      Use __underlying_foo to distinguish from __real_foo: __real_foo always
      refers to the kernel's implementation of foo, __underlying_foo could be
      either the kernel implementation or the __builtin_foo implementation.
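      
      A sketch of the shape this takes (illustrative only, not the literal
      hunk; the exact #if guard in the patch may differ):
      
       #ifdef CONFIG_KASAN
       /* Sanitised build: route the fortified wrappers to the kernel's real,
        * KASAN-instrumented implementations via __RENAME(). */
       extern void *__underlying_memcpy(void *p, const void *q,
       				__kernel_size_t size) __RENAME(memcpy);
       extern int __underlying_memcmp(const void *p, const void *q,
       				__kernel_size_t size) __RENAME(memcmp);
       #else
       /* Otherwise keep the builtins and their inlining opportunities. */
       #define __underlying_memcpy	__builtin_memcpy
       #define __underlying_memcmp	__builtin_memcmp
       #endif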
      
      This is sometimes enough to make the memcmp test succeed with
      FORTIFY_SOURCE enabled. It is at least enough to get the function call
      into the module. One more fix is needed to make it reliable: see the next
      patch.
      
      Fixes: 6974f0c4 ("include/linux/string.h: add the option of fortified string.h functions")
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: David Gow <davidgow@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Link: http://lkml.kernel.org/r/20200423154503.5103-3-dja@axtens.net
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47227d27
    • J
      mm/gup: introduce pin_user_pages_fast_only() · 104acc32
      Committed by John Hubbard
      This is the FOLL_PIN equivalent of __get_user_pages_fast(), except with a
      more descriptive name, and gup_flags instead of a boolean "write" in the
      argument list.
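      
      A hypothetical caller sketch (variable names invented, FOLL_WRITE chosen
      arbitrarily) of how the fast-only variant is meant to be used:
      
       struct page *page;
       int pinned = pin_user_pages_fast_only(addr, 1, FOLL_WRITE, &page);
      
       if (pinned != 1)
       	return -EBUSY;	/* could not pin without the slow path; back off */
      
       /* ... access the pinned page ... */
       unpin_user_page(page);	/* FOLL_PIN pages are released with unpin_*() */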
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: "Joonas Lahtinen" <joonas.lahtinen@linux.intel.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://lkml.kernel.org/r/20200519002124.2025955-4-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      104acc32
    • J
      mm/gup: refactor and de-duplicate gup_fast() code · 376a34ef
      Committed by John Hubbard
      There were two nearly identical sets of code for the gup_fast() style of
      walking the page tables with interrupts disabled.  This has led to the
      usual maintenance problems that arise from having duplicated code.
      
      There is already a core internal routine in gup.c for gup_fast(), so just
      enhance it very slightly: allow skipping the fall-back to "slow" (regular)
      get_user_pages(), via the new FOLL_FAST_ONLY flag.  Then, just call
      internal_get_user_pages_fast() from __get_user_pages_fast(), and adjust
      the API to match pre-existing API behavior.
      
      There is a change in behavior from this refactoring: the nested form of
      interrupt disabling is used in all gup_fast() variants now.  That's
      because there is only one place that interrupt disabling for page walking
      is done, and so the safer form is required.  This should, if anything,
      eliminate possible (rare) bugs, because the non-nested form of enabling
      interrupts was fragile at best.
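      
      A simplified sketch of the resulting shape (not the literal diff): the
      legacy entry point just forwards to the shared internal routine, with
      FOLL_FAST_ONLY suppressing the fallback to regular get_user_pages().
      
       int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
       			  struct page **pages)
       {
       	unsigned int gup_flags = FOLL_GET | FOLL_FAST_ONLY;
      
       	if (write)
       		gup_flags |= FOLL_WRITE;
      
       	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
       }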
      
      [jhubbard@nvidia.com: fixup]
        Link: http://lkml.kernel.org/r/20200521233841.1279742-1-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: "Joonas Lahtinen" <joonas.lahtinen@linux.intel.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://lkml.kernel.org/r/20200519002124.2025955-3-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      376a34ef
  2. 03 Jun, 2020 6 commits