1. 23 Mar, 2022 (3 commits)
    • A
      mm/migrate: fix race between lock page and clear PG_Isolated · 356ea386
      Committed by andrew.yang
      When memory is tight, the system may start compacting memory to satisfy
      demands for large contiguous regions.  If a process tries to lock a
      memory page that is locked and isolated for compaction, it may wait a
      long time or even forever.  This is because compaction performs a
      non-atomic PG_Isolated clear while holding the page lock, which may
      overwrite the PG_waiters bit set by a process that failed to obtain the
      page lock and added itself to the wait queue to wait for an unlock.
      
        CPU1                            CPU2
        lock_page(page); (successful)
                                        lock_page(); (failed)
        __ClearPageIsolated(page);      SetPageWaiters(page) (may be overwritten)
        unlock_page(page);
      
      The solution is not to perform non-atomic operations on page flags
      while holding the page lock.
      
      Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
      Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Vlastimil Babka" <vbabka@suse.cz>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "William Kucharski" <william.kucharski@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Nicholas Tang <nicholas.tang@mediatek.com>
      Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key · a6b40850
      Committed by Muchun Song
      The page_fixed_fake_head() helper is used throughout memory management,
      and its conditional check requires reading a global variable.  Although
      the overhead of this check is small, it grows when the memory cache
      comes under pressure.  Also, the global variable is not modified after
      system boot, so it is a good fit for the static key mechanism.
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page · e7d32485
      Committed by Muchun Song
      Patch series "Free the 2nd vmemmap page associated with each HugeTLB
      page", v7.
      
      This series can minimize the overhead of struct page for 2MB HugeTLB
      pages significantly.  It further reduces the overhead of struct page by
      12.5% for a 2MB HugeTLB compared to the previous approach, which means
      2GB per 1TB HugeTLB.  It is a nice gain.  Comments and reviews are
      welcome.  Thanks.
      
      The main implementation and details can refer to the commit log of patch
      1.  In this series, I have changed the following four helpers, the
      following table shows the impact of the overhead of those helpers.
      
      	+------------------+-----------------------+
      	|       APIs       | head page | tail page |
      	+------------------+-----------+-----------+
      	|    PageHead()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|    PageTail()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|  PageCompound()  |     N     |     N     |
      	+------------------+-----------+-----------+
      	|  compound_head() |     Y     |     N     |
      	+------------------+-----------+-----------+
      
      	Y: Overhead is increased.
      	N: Overhead is _NOT_ increased.
      
      It shows that the overhead of these helpers on a tail page does not
      change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off",
      but the overhead on a head page increases when "hugetlb_free_vmemmap=on"
      (except for PageCompound()).  So I believe that Matthew Wilcox's folio
      series will help with this.
      
      There are far fewer users of PageHead() and PageTail() than of
      compound_head(), and most users of PageTail() are VM_BUG_ON()s, so I
      have run some tests on the overhead of compound_head() on head pages.
      
      I have measured the overhead of calling compound_head() on a head page
      at 2.11ns (the call time averaged over 10 million invocations of
      compound_head()).
      
      For a head page whose address is not aligned with PAGE_SIZE, or for a
      non-compound page, the overhead of compound_head() is 2.54ns, an
      increase of 20%.  For a head page whose address is aligned with
      PAGE_SIZE, the overhead of compound_head() is 2.97ns, an increase of
      40%.  Most pages are the former.  I do not think the overhead is
      significant since compound_head() itself is cheap.
      
      This patch (of 5):
      
      This patch minimizes the overhead of struct page for 2MB HugeTLB pages
      significantly.  It further reduces the overhead of struct page by 12.5%
      for a 2MB HugeTLB compared to the previous approach, which means 2GB per
      1TB HugeTLB (2MB type).
      
      After the feature "Free some vmemmap pages of HugeTLB page" is
      enabled, the mapping of the vmemmap addresses associated with a 2MB
      HugeTLB page becomes the figure below.
      
           HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
      remapped.  However, the 2nd vmemmap page frame can also be freed to
      the buddy allocator, so we can change the mapping from the figure
      above to the figure below.
      
          HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
       |           |                     +-----------+                  | | | | | |
       |           |                     |     2     | -----------------+ | | | | |
       |           |                     +-----------+                    | | | | |
       |           |                     |     3     | -------------------+ | | | |
       |           |                     +-----------+                      | | | |
       |           |                     |     4     | ---------------------+ | | |
       |    2MB    |                     +-----------+                        | | |
       |           |                     |     5     | -----------------------+ | |
       |           |                     +-----------+                          | |
       |           |                     |     6     | -------------------------+ |
       |           |                     +-----------+                            |
       |           |                     |     7     | ---------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      After we do this, all tail vmemmap pages (1-7) are mapped to the head
      vmemmap page frame (0).  In other words, there is more than one page
      struct with PG_head associated with each HugeTLB page.  We __know__
      that there is only one real head page struct; the tail page structs
      with PG_head are fake head page structs.  We need a way to distinguish
      between these two types of page structs so that compound_head(),
      PageHead() and PageTail() work properly when passed a tail page struct
      that has PG_head set.
      
      The following code snippet shows how to distinguish between a real and
      a fake head page struct.
      
      	if (test_bit(PG_head, &page->flags)) {
      		unsigned long head = READ_ONCE(page[1].compound_head);
      
      		if (head & 1) {
      			if (head == (unsigned long)page + 1)
      				==> head page struct
      			else
      				==> tail page struct
      		} else
      			==> head page struct
      	}
      
      We can safely access page[1] of a page with PG_head set because @page
      is a compound page composed of at least two contiguous pages.
      
      [songmuchun@bytedance.com: restore lost comment changes]
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 15 Jan, 2022 (2 commits)
  3. 07 Jan, 2022 (1 commit)
  4. 03 Jan, 2022 (1 commit)
  5. 17 Nov, 2021 (2 commits)
  6. 07 Nov, 2021 (1 commit)
  7. 02 Nov, 2021 (1 commit)
  8. 29 Oct, 2021 (1 commit)
    • Y
      mm: filemap: check if THP has hwpoisoned subpage for PMD page fault · eac96c3e
      Committed by Yang Shi
      When handling a shmem page fault, a THP with a corrupted subpage could
      be PMD-mapped if certain conditions are satisfied.  But the kernel is
      supposed to send SIGBUS when trying to map a hwpoisoned page.
      
      There are two paths which may do PMD map: fault around and regular
      fault.
      
      Before commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") things were even worse in the fault-around path: the THP
      could be PMD-mapped as long as the VMA fit, regardless of which
      subpage was accessed and corrupted.  After that commit, the THP can
      still be PMD-mapped as long as the head page is not corrupted.
      
      In the regular fault path the THP could be PMD mapped as long as the
      corrupted page is not accessed and the VMA fits.
      
      This loophole could be fixed by iterating every subpage to check if any
      of them is hwpoisoned or not, but it is somewhat costly in page fault
      path.
      
      So introduce a new page flag, HasHWPoisoned, on the first tail page.
      It indicates that the THP has a hwpoisoned subpage.  It is set when
      any subpage of the THP is found hwpoisoned by memory failure and the
      refcount is bumped successfully, and cleared when the THP is freed or
      split.
      
      The soft offline path doesn't need this, since the soft offline
      handler marks a subpage hwpoisoned only after the subpage has been
      migrated successfully.  But a shmem THP doesn't get split and then
      migrated at all.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 18 Oct, 2021 (2 commits)
  10. 27 Sep, 2021 (2 commits)
  11. 09 Sep, 2021 (2 commits)
  12. 04 Sep, 2021 (1 commit)
    • V
      mm, slub: do initial checks in ___slab_alloc() with irqs enabled · 0b303fb4
      Committed by Vlastimil Babka
      As another step in shortening irq-disabled sections in ___slab_alloc(),
      delay disabling irqs until after the initial checks of whether there is
      a cached percpu slab and whether it's suitable for our allocation.
      
      Now we have to recheck c->page after actually disabling irqs as an allocation
      in irq handler might have replaced it.
      
      Because we call pfmemalloc_match() as one of the checks, we might hit
      VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
      interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe()
      variant that lacks the PageSlab check.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
  13. 02 Aug, 2021 (1 commit)
  14. 01 Jul, 2021 (2 commits)
    • D
      mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() · 82840451
      Committed by David Hildenbrand
      A driver might set a page logically offline -- PageOffline() -- and
      turn the page inaccessible in the hypervisor; after that, access to the
      page content can be fatal.  One example is virtio-mem: while unplugged
      memory -- marked as PageOffline() -- can currently be read in the
      hypervisor, this will no longer be the case in the future, for example
      when a virtio-mem device is backed by huge pages in the hypervisor.
      
      Some special PFN walkers -- i.e., /proc/kcore -- read content of random
      pages after checking PageOffline(); however, these PFN walkers can race
      with drivers that set PageOffline().
      
      Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing.
      
      page_offline_freeze()/page_offline_thaw() allows for a subsystem to
      synchronize with such drivers, achieving that a page cannot be set
      PageOffline() while frozen.
      
      page_offline_begin()/page_offline_end() is used by drivers that care about
      such races when setting a page PageOffline().
      
      For simplicity, use a rwsem for now; neither drivers nor users are
      performance sensitive.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages · 0daa322b
      Committed by David Hildenbrand
      Let's avoid reading:
      
      1) Offline memory sections: the content of offline memory sections is
         stale as the memory is effectively unused by the kernel.  On s390x with
         standby memory, offline memory sections (belonging to offline storage
         increments) are not accessible.  With virtio-mem and the hyper-v
         balloon, we can have unavailable memory chunks that should not be
         accessed inside offline memory sections.  Last but not least, offline
         memory sections might contain hwpoisoned pages which we can no longer
         identify because the memmap is stale.
      
      2) PG_offline pages: logically offline pages that are documented as
         "The content of these pages is effectively stale.  Such pages should
         not be touched (read/write/dump/save) except by their owner.".
         Examples include pages inflated in a balloon or unavailable memory
         ranges inside hotplugged memory sections with virtio-mem or the hyper-v
         balloon.
      
      3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
         As documented: "Accessing is not safe since it may cause another
         machine check.  Don't touch!"
      
      Introduce is_page_hwpoison(), adding a comment that it is inherently racy
      but best we can really do.
      
      Reading /proc/kcore now performs checks similar to those done when
      reading /proc/vmcore for kdump via makedumpfile: problematic pages are
      excluded.  It's also similar to the hibernation code; however, we
      don't yet skip hwpoisoned pages when processing pages in
      kernel/power/snapshot.c:saveable_page().
      
      Note 1: we can race against memory offlining code, especially memory going
      offline and getting unplugged: however, we will properly tear down the
      identity mapping and handle faults gracefully when accessing this memory
      from kcore code.
      
      Note 2: we can race against drivers setting PageOffline() and turning
      memory inaccessible in the hypervisor.  We'll handle this in a follow-up
      patch.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 30 Jun, 2021 (1 commit)
  16. 05 Jun, 2021 (1 commit)
  17. 27 Feb, 2021 (1 commit)
  18. 25 Feb, 2021 (1 commit)
  19. 16 Dec, 2020 (3 commits)
  20. 03 Dec, 2020 (1 commit)
  21. 17 Oct, 2020 (2 commits)
    • O
      mm,hwpoison: rework soft offline for in-use pages · 79f5f8fa
      Committed by Oscar Salvador
      This patch changes the way we set and handle in-use poisoned pages.  Until
      now, poisoned pages were released to the buddy allocator, trusting that
      the checks that take place at allocation time would act as a safe net and
      would skip that page.
      
      This has proved to be wrong, as there are pfn walkers out there, like
      compaction, that only care whether the page is on a buddy freelist.
      
      Although this might not be the only user, having poisoned pages in the
      buddy allocator seems a bad idea as we should only have free pages that
      are ready and meant to be used as such.
      
      Before explaining the taken approach, let us break down the kind of pages
      we can soft offline.
      
      - Anonymous THP (after the split, they end up being 4K pages)
      - Hugetlb
      - Order-0 pages (that can be either migrated or invalidated)
      
      * Normal pages (order-0 and anon-THP)
      
        - If they are clean, unmapped page cache pages, we invalidate them
          by means of invalidate_inode_page().
        - If they are mapped/dirty, we do the isolate-and-migrate dance.
      
      Either way, we do not call put_page directly from those paths.
      Instead, we keep the page and send it to page_handle_poison to perform
      the right handling.
      
      page_handle_poison sets the HWPoison flag and does the last put_page.
      
      Down the chain, we placed a check for HWPoison page in
      free_pages_prepare, that just skips any poisoned page, so those pages
      do not end up in any pcplist/freelist.
      
      After that, we set the refcount on the page to 1 and we increment
      the poisoned pages counter.
      
      If we see that the check in free_pages_prepare creates trouble, we can
      always do what we do for free pages:
      
        - wait until the page hits buddy's freelists
        - take it off, and flag it
      
      The downside of the above approach is that we could race with an
      allocation, so by the time we want to take the page off the buddy, it
      may already have been allocated, and we cannot soft offline it.
      But the user can always retry.
      
      * Hugetlb pages
      
        - We isolate-and-migrate them
      
      After the migration has been successful, we call dissolve_free_huge_page,
      and we set HWPoison on the page if we succeed.
      Hugetlb has a slightly different handling though.
      
      While for non-hugetlb pages we cared about closing the race with an
      allocation, doing so for hugetlb pages requires quite some additional
      and intrusive code (we would need to hook in free_huge_page and some other
      places).
      So I decided not to make the code overly complicated, and to just fail
      normally if the page was allocated in the meantime.
      
      We can always build on top of this.
      
      As a bonus, because of the way we handle now in-use pages, we no longer
      need the put-as-isolation-migratetype dance, that was guarding for poisoned
      pages to end up in pcplists.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      mm,hwpoison: rework soft offline for free pages · 06be6ff3
      Committed by Oscar Salvador
      When trying to soft-offline a free page, we need to first take it off
      the buddy allocator.  Once we know it is out of reach, we can safely
      flag it as poisoned.
      
      take_page_off_buddy will be used to take a page meant to be poisoned off
      the buddy allocator.  take_page_off_buddy calls break_down_buddy_pages,
      which splits a higher-order page in case our page belongs to one.
      
      Once the page is under our control, we call page_handle_poison to set it
      as poisoned and grab a refcount on it.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 14 Oct, 2020 (2 commits)
  23. 04 Sep, 2020 (1 commit)
    • S
      mm: Add PG_arch_2 page flag · 4beba948
      Committed by Steven Price
      For arm64 MTE support it is necessary to be able to mark pages that
      contain user space visible tags that will need to be saved/restored e.g.
      when swapped out.
      
      To support this add a new arch specific flag (PG_arch_2). This flag is
      only available on 64-bit architectures due to the limited number of
      spare page flags on the 32-bit ones.
      Signed-off-by: Steven Price <steven.price@arm.com>
      [catalin.marinas@arm.com: use CONFIG_64BIT for guarding this new flag]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
  24. 05 Jun, 2020 (1 commit)
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      Committed by David Hildenbrand
      virtio-mem wants to allow offlining memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to
      later offline and remove completely unplugged memory blocks.  The
      important part is that PageOffline() has to remain set until the
      section is offline, so these pages will never get accessed (e.g., when
      dumping).  The pages should not be handed back to the buddy (which
      would require clearing PageOffline() and would cause issues if
      offlining fails and the pages are suddenly back in the buddy).
      
      Let's allow this by permitting any PageOffline() page to be isolated
      when offlining.  This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that it is fine with
      offlining the page by dropping its reference count.  PageOffline()
      pages with a reference count of 0 can then be skipped when offlining
      the pages (as if they were free, though they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  25. 08 Apr, 2020 (2 commits)
    • A
      mm: introduce Reported pages · 36e66c55
      Committed by Alexander Duyck
      In order to pave the way for free page reporting in virtualized
      environments we will need a way to get pages out of the free lists and
      identify those pages after they have been returned.  To accomplish this,
      this patch adds the concept of a Reported Buddy, which is essentially
      meant to just be the Uptodate flag used in conjunction with the Buddy page
      type.
      
      To prevent the reported pages from leaking outside of the buddy lists I
      added a check to clear the PageReported bit in the del_page_from_free_list
      function.  As a result any reported page that is split, merged, or
      allocated will have the flag cleared prior to the PageBuddy value being
      cleared.
      
      The process for reporting pages is fairly simple.  Once we free a page
      that meets the minimum order for page reporting we will schedule a worker
      thread to start 2s or more in the future.  That worker thread will begin
      working from the lowest supported page reporting order up to MAX_ORDER - 1,
      pulling unreported pages from the free list and storing them in the
      scatterlist.
      
      When processing each individual free list it is necessary for the worker
      thread to release the zone lock when it needs to stop and report the full
      scatterlist of pages.  To reduce the work of the next iteration the worker
      thread will rotate the free list so that the first unreported page in the
      free list becomes the first entry in the list.
      
      It will then call a reporting function providing information on how many
      entries are in the scatterlist.  Once the function completes it will
      return the pages to the free area from which they were allocated and start
      over pulling more pages from the free areas until there are no longer
      enough pages to report on to keep the worker busy, or we have processed as
      many pages as were contained in the free area when we started processing
      the list.
      
      The worker thread will work in a round-robin fashion, making its way
      through each zone requesting reporting, and through each reportable free
      list within that zone.  Once all free areas within the zone have been
      processed it will check to see if there have been any requests for
      reporting while it was processing.  If so, it will reschedule the worker
      thread to start up again in roughly 2s and exit.
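      The fill-and-drain loop described above can be modeled with a toy sketch.
      The BATCH constant, the array-of-counts representation, and the function
      name are all assumptions for illustration; the real kernel walks actual
      free lists under the zone lock and hands batches to a scatterlist:

      ```c
      #include <assert.h>

      #define BATCH 4   /* hypothetical scatterlist capacity */

      /* Toy model of one reporting pass over a zone: unreported[order]
       * counts unreported free pages on each reportable free list.  For
       * every list we keep draining full batches until fewer than a batch
       * of unreported pages remain -- the "not enough pages to keep the
       * worker busy" cutoff from the description.  Returns pages reported. */
      static int report_zone(int *unreported, int norders)
      {
          int total = 0;
          for (int order = 0; order < norders; order++) {
              while (unreported[order] >= BATCH) {
                  /* (here the zone lock would be released, the batch
                   * reported, then the pages returned to the free area) */
                  unreported[order] -= BATCH;
                  total += BATCH;
              }
          }
          return total;
      }

      int main(void)
      {
          int lists[3] = { 10, 3, 8 };
          int reported = report_zone(lists, 3);
          assert(reported == 16);   /* 8 from order 0, 8 from order 2 */
          assert(lists[0] == 2 && lists[1] == 3 && lists[2] == 0);
          return 0;
      }
      ```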
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36e66c55
    • H
      mm: code cleanup for MADV_FREE · 9de4f22a
      Submitted by Huang Ying
      Some comments for MADV_FREE are revised and added to help people
      understand the MADV_FREE code, especially the page flag PG_swapbacked.
      This makes page_is_file_cache() inconsistent with its comments, so the
      function is renamed to page_is_file_lru() to make them consistent again.
      All of this is put in one patch as one logical change.
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9de4f22a
  26. 22 March 2020 (1 commit)
  27. 07 November 2019 (1 commit)
    • Y
      mm: thp: handle page cache THP correctly in PageTransCompoundMap · 169226f7
      Submitted by Yang Shi
      We have a use case for tmpfs as a QEMU memory backend, and we would like
      to take advantage of THP as well.  But our test shows the EPT is not
      PMD mapped even though the underlying THP is PMD mapped on the host.
      The number shown by /sys/kernel/debug/kvm/largepages is much less than
      the number of PMD mapped shmem pages, as below:
      
        7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   579584 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        12
      
      And some benchmarks do worse than with anonymous THPs.
      
      By digging into the code we figured out that commit 127393fb ("mm:
      thp: kvm: fix memory corruption in KVM with THP enabled") checks if
      there is a single PTE mapping on the page for anonymous THP when setting
      up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
      since every subpage of page cache THP would get _mapcount inc'ed once it
      is PMD mapped, so PageTransCompoundMap() always returns false for page
      cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.
      
      So we need to handle page cache THP correctly.  However, when a page
      cache THP's PMD gets split, the kernel just removes the map instead of
      setting up a PTE map the way anonymous THP does.  Before KVM calls
      get_user_pages() the subpages may get PTE mapped, even though the page
      is still a THP, since the page cache THP may be mapped by other
      processes in the meantime.
      
      The fix checks both its _mapcount and whether the THP has any PTE
      mapping.  Although this may report some false negatives (pages PTE
      mapped by other processes), it looks non-trivial to make the check
      fully accurate.
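      The anonymous-versus-page-cache distinction in the check can be sketched
      with toy structs.  The field names and values here are simplified
      stand-ins, not the kernel's actual struct page layout (the real code
      compares atomic _mapcount with the head page's compound_mapcount):

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Toy stand-in for the struct page fields the check looks at. */
      struct page {
          bool anon;               /* anonymous vs. page cache THP subpage */
          int  mapcount;           /* this subpage's _mapcount             */
          int  compound_mapcount;  /* head page's PMD (compound) map count */
      };

      /* Sketch of the fixed logic: anonymous THP keeps the old
       * _mapcount < 0 test, while a page cache THP subpage is treated as
       * PMD-only mapped when its _mapcount equals the head's
       * compound_mapcount, i.e. no process holds an extra PTE map on it. */
      static bool thp_pmd_only_mapped(const struct page *sub)
      {
          if (sub->anon)
              return sub->mapcount < 0;
          return sub->mapcount == sub->compound_mapcount;
      }

      int main(void)
      {
          struct page shmem = { .anon = false, .mapcount = 1, .compound_mapcount = 1 };
          struct page split = { .anon = false, .mapcount = 2, .compound_mapcount = 1 };
          assert(thp_pmd_only_mapped(&shmem));   /* PMD mapped only */
          assert(!thp_pmd_only_mapped(&split));  /* extra PTE map elsewhere */
          return 0;
      }
      ```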
      
      With this fix, /sys/kernel/debug/kvm/largepages shows that a reasonable
      number of pages are PMD mapped by EPT, as below:
      
        7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   557056 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        271
      
      And the benchmarks perform the same as with anonymous THPs.
      
      [yang.shi@linux.alibaba.com: v4]
        Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Tested-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      169226f7