1. 07 Nov, 2021 1 commit
  2. 29 Oct, 2021 1 commit
    • mm: filemap: check if THP has hwpoisoned subpage for PMD page fault · eac96c3e
      Yang Shi committed
      When handling a shmem page fault, a THP with a corrupted subpage could be
      PMD mapped if certain conditions are satisfied.  But the kernel is supposed
      to send SIGBUS when trying to map a hwpoisoned page.
      
      There are two paths which may do PMD map: fault around and regular
      fault.
      
      Before commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") things were even worse in the fault around path.  The THP
      could be PMD mapped as long as the VMA fit, regardless of which subpage
      was accessed or corrupted.  After that commit, the THP could be PMD
      mapped as long as the head page was not corrupted.
      
      In the regular fault path the THP could be PMD mapped as long as the
      corrupted page is not accessed and the VMA fits.
      
      This loophole could be fixed by iterating every subpage to check if any
      of them is hwpoisoned or not, but it is somewhat costly in page fault
      path.
      
      So introduce a new page flag called HasHWPoisoned on the first tail
      page.  It indicates the THP has hwpoisoned subpage(s).  It is set if any
      subpage of THP is found hwpoisoned by memory failure and after the
      refcount is bumped successfully, then cleared when the THP is freed or
      split.
      
      The soft offline path doesn't need this, since the soft offline handler only
      marks a subpage hwpoisoned once the subpage has been migrated successfully,
      and a shmem THP is not split and then migrated at all.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
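The trade-off described in this commit message can be illustrated with a small userspace model. This is only a sketch: the class names, the `HPAGE_NR` geometry, and the method names are assumptions made for illustration and are not kernel code. It contrasts the constant-time summary flag kept on the first tail page with the full subpage scan that the message calls costly.

```python
from dataclasses import dataclass, field

HPAGE_NR = 512  # assumed: 2 MiB THP made of 4 KiB base pages


@dataclass
class Subpage:
    hwpoison: bool = False
    has_hwpoisoned: bool = False  # only meaningful on the first tail page


@dataclass
class Thp:
    subpages: list = field(
        default_factory=lambda: [Subpage() for _ in range(HPAGE_NR)])

    def memory_failure(self, index):
        """Poison one subpage and record that fact on the first tail page."""
        self.subpages[index].hwpoison = True
        self.subpages[1].has_hwpoisoned = True

    def split_or_free(self):
        """The summary flag is cleared when the THP is split or freed."""
        self.subpages[1].has_hwpoisoned = False

    def can_pmd_map(self):
        """Constant-time check suitable for the page fault path."""
        return not self.subpages[1].has_hwpoisoned


def can_pmd_map_slow(thp):
    """The rejected alternative: walk every subpage on each fault."""
    return not any(p.hwpoison for p in thp.subpages)
```

Both checks agree in this model; the flag simply moves the cost from every fault to the rare memory failure event.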
  3. 09 Sep, 2021 2 commits
  4. 04 Sep, 2021 1 commit
    • mm, slub: do initial checks in ___slab_alloc() with irqs enabled · 0b303fb4
      Vlastimil Babka committed
      As another step of shortening irq disabled sections in ___slab_alloc(), delay
      disabling irqs until we pass the initial checks if there is a cached percpu
      slab and it's suitable for our allocation.
      
      Now we have to recheck c->page after actually disabling irqs as an allocation
      in irq handler might have replaced it.
      
      Because we call pfmemalloc_match() as one of the checks, we might hit
      VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
      interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe()
      variant that lacks the PageSlab check.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
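The "check without protection, then recheck under protection" pattern from this commit can be sketched as follows. This is a hedged userspace analogy, not the SLUB code: `CpuSlab`, `suitable()` and `pick_cached_slab()` are hypothetical names, and a `threading.Lock` stands in for the irq-disabled section.

```python
import threading


class CpuSlab:
    """Stand-in for the per-CPU slab state; `page` plays the role of c->page."""

    def __init__(self, page):
        self.page = page
        self.lock = threading.Lock()  # stands in for disabling irqs


def suitable(page, node):
    """Cheap check that must tolerate a page that has since been replaced."""
    return page is not None and page["node"] == node


def pick_cached_slab(c, node):
    while True:
        page = c.page                  # unprotected read, may go stale
        if not suitable(page, node):
            return None                # caller would fall back to a new slab
        with c.lock:                   # protected section starts here
            if c.page is not page:     # replaced in the meantime: start over
                continue
            return page
```

The unprotected checks only need to be safe on stale data, which is why the message introduces a variant of the pfmemalloc check that does not assert on the page type.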
  5. 02 Aug, 2021 1 commit
  6. 01 Jul, 2021 2 commits
    • mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() · 82840451
      David Hildenbrand committed
      A driver might set a page logically offline -- PageOffline() -- and turn
      the page inaccessible in the hypervisor; after that, access to page
      content can be fatal.  One example is virtio-mem; while unplugged memory
      -- marked as PageOffline() -- can currently be read in the hypervisor, this
      will no longer be the case in the future; for example, when having a
      virtio-mem device backed by huge pages in the hypervisor.
      
      Some special PFN walkers -- i.e., /proc/kcore -- read content of random
      pages after checking PageOffline(); however, these PFN walkers can race
      with drivers that set PageOffline().
      
      Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing.
      
      page_offline_freeze()/page_offline_thaw() allows for a subsystem to
      synchronize with such drivers, achieving that a page cannot be set
      PageOffline() while frozen.
      
      page_offline_begin()/page_offline_end() is used by drivers that care about
      such races when setting a page PageOffline().
      
      For simplicity, use a rwsem for now; neither drivers nor users are
      performance sensitive.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
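The exclusion rule behind the four helpers can be modelled in a few lines. The sketch below assumes that freeze/thaw behave like the shared (read) side of the rwsem and begin/end like the exclusive (write) side; the commit message only states that a rwsem is used, so this mapping is an assumption. The class is a single-threaded model with `try_` methods that return `False` where a real rwsem would block.

```python
class PageOfflineSync:
    """Single-threaded model of the reader/writer rule, not a real lock."""

    def __init__(self):
        self.freezers = 0        # walkers between freeze() and thaw()
        self.offlining = False   # a driver is between begin() and end()

    def try_freeze(self):
        if self.offlining:
            return False         # a real rwsem would block here
        self.freezers += 1
        return True

    def thaw(self):
        assert self.freezers > 0
        self.freezers -= 1

    def try_begin(self):
        if self.offlining or self.freezers:
            return False         # cannot set PageOffline() while frozen
        self.offlining = True
        return True

    def end(self):
        assert self.offlining
        self.offlining = False
```

Several PFN walkers may hold the freeze at once, while a driver setting PageOffline() needs exclusive access.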
    • fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages · 0daa322b
      David Hildenbrand committed
      Let's avoid reading:
      
      1) Offline memory sections: the content of offline memory sections is
         stale as the memory is effectively unused by the kernel.  On s390x with
         standby memory, offline memory sections (belonging to offline storage
         increments) are not accessible.  With virtio-mem and the hyper-v
         balloon, we can have unavailable memory chunks that should not be
         accessed inside offline memory sections.  Last but not least, offline
         memory sections might contain hwpoisoned pages which we can no longer
         identify because the memmap is stale.
      
      2) PG_offline pages: logically offline pages that are documented as
         "The content of these pages is effectively stale.  Such pages should
         not be touched (read/write/dump/save) except by their owner.".
         Examples include pages inflated in a balloon or unavailable memory
         ranges inside hotplugged memory sections with virtio-mem or the hyper-v
         balloon.
      
      3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
         As documented: "Accessing is not safe since it may cause another
         machine check.  Don't touch!"
      
      Introduce is_page_hwpoison(), adding a comment that it is inherently racy
      but the best we can really do.
      
      Reading /proc/kcore now performs similar checks as when reading
      /proc/vmcore for kdump via makedumpfile: problematic pages are excluded.
      It's also similar to hibernation code, however, we don't skip hwpoisoned
      pages when processing pages in kernel/power/snapshot.c:saveable_page()
      yet.
      
      Note 1: we can race against memory offlining code, especially memory going
      offline and getting unplugged: however, we will properly tear down the
      identity mapping and handle faults gracefully when accessing this memory
      from kcore code.
      
      Note 2: we can race against drivers setting PageOffline() and turning
      memory inaccessible in the hypervisor.  We'll handle this in a follow-up
      patch.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
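The three exclusion rules listed in this message amount to a filter applied before each page is read. A minimal sketch, assuming a hypothetical `PageInfo` record and assuming that skipped pages are reported as zeros (the message does not say what the reader sees instead):

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed base page size


@dataclass
class PageInfo:
    section_online: bool = True
    pg_offline: bool = False
    pg_hwpoison: bool = False


def safe_to_read(p):
    """True only if none of the three exclusion rules applies."""
    if not p.section_online:   # rule 1: offline memory section
        return False
    if p.pg_offline:           # rule 2: logically offline page
        return False
    if p.pg_hwpoison:          # rule 3: hwpoisoned page (inherently racy check)
        return False
    return True


def read_range(pages, read_page):
    """Read each page, substituting zeros for pages that must not be touched."""
    return [read_page(i) if safe_to_read(p) else b"\0" * PAGE_SIZE
            for i, p in enumerate(pages)]
```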
  7. 30 Jun, 2021 1 commit
  8. 05 Jun, 2021 1 commit
  9. 27 Feb, 2021 1 commit
  10. 25 Feb, 2021 1 commit
  11. 16 Dec, 2020 3 commits
  12. 03 Dec, 2020 1 commit
  13. 17 Oct, 2020 2 commits
    • mm,hwpoison: rework soft offline for in-use pages · 79f5f8fa
      Oscar Salvador committed
      This patch changes the way we set and handle in-use poisoned pages.  Until
      now, poisoned pages were released to the buddy allocator, trusting that
      the checks that take place at allocation time would act as a safety net and
      would skip that page.
      
      This has proved to be wrong, as there are some pfn walkers out there, like
      compaction, that only care whether the page is in a buddy freelist.
      
      Although this might not be the only user, having poisoned pages in the
      buddy allocator seems a bad idea as we should only have free pages that
      are ready and meant to be used as such.
      
      Before explaining the taken approach, let us break down the kind of pages
      we can soft offline.
      
      - Anonymous THP (after the split, they end up being 4K pages)
      - Hugetlb
      - Order-0 pages (that can be either migrated or invalidated)
      
      * Normal pages (order-0 and anon-THP)
      
        - If they are clean and unmapped page cache pages, we invalidate
          them by means of invalidate_inode_page().
        - If they are mapped/dirty, we do the isolate-and-migrate dance.
      
      Either way, do not call put_page directly from those paths.  Instead, we
      keep the page and send it to page_handle_poison to perform the right
      handling.
      
      page_handle_poison sets the HWPoison flag and does the last put_page.
      
      Down the chain, we placed a check for HWPoison page in
      free_pages_prepare, that just skips any poisoned page, so those pages
      do not end up in any pcplist/freelist.
      
      After that, we set the refcount on the page to 1 and we increment
      the poisoned pages counter.
      
      If we see that the check in free_pages_prepare creates trouble, we can
      always do what we do for free pages:
      
        - wait until the page hits buddy's freelists
        - take it off, and flag it
      
      The downside of the above approach is that we could race with an
      allocation, so by the time we want to take the page off the buddy, the
      page has already been allocated and we cannot soft offline it.
      But the user could always retry.
      
      * Hugetlb pages
      
        - We isolate-and-migrate them
      
      After the migration has been successful, we call dissolve_free_huge_page,
      and we set HWPoison on the page if we succeed.
      Hugetlb has a slightly different handling though.
      
      While for non-hugetlb pages we cared about closing the race with an
      allocation, doing so for hugetlb pages requires quite some additional
      and intrusive code (we would need to hook in free_huge_page and some other
      places).
      So I decided not to make the code overly complicated and just fail
      normally if the page was allocated in the meantime.
      
      We can always build on top of this.
      
      As a bonus, because of the way we now handle in-use pages, we no longer
      need the put-as-isolation-migratetype dance, which guarded against poisoned
      pages ending up in pcplists.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
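The sequence described for normal in-use pages (set the flag, do the last put, let the free path skip the page, then pin it again and count it) can be sketched as a toy allocator. The function names mirror those in the message, but the bodies are illustrative assumptions rather than the kernel implementation.

```python
class Page:
    def __init__(self):
        self.refcount = 1
        self.hwpoison = False


class Allocator:
    def __init__(self):
        self.freelist = []
        self.num_poisoned = 0

    def free_pages_prepare(self, page):
        """Return False for poisoned pages so they never reach a freelist."""
        return not page.hwpoison

    def put_page(self, page):
        page.refcount -= 1
        if page.refcount == 0 and self.free_pages_prepare(page):
            self.freelist.append(page)

    def page_handle_poison(self, page):
        """Flag the page, drop the last reference, then pin it again."""
        page.hwpoison = True
        self.put_page(page)      # the free path skips the poisoned page
        page.refcount = 1        # keep the page out of circulation
        self.num_poisoned += 1
```

Because the poisoned page never enters a freelist in this model, a walker that only looks at freelist membership cannot pick it up.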
    • mm,hwpoison: rework soft offline for free pages · 06be6ff3
      Oscar Salvador committed
      When trying to soft-offline a free page, we need to first take it off the
      buddy allocator.  Once we know it is out of reach, we can safely flag it as
      poisoned.
      
      take_page_off_buddy will be used to take a page meant to be poisoned off
      the buddy allocator.  take_page_off_buddy calls break_down_buddy_pages,
      which splits a higher-order page in case our page belongs to one.
      
      Once the page is under our control, we call page_handle_poison to set it
      as poisoned and grab a refcount on it.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
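Splitting a higher-order free block so that one page can be isolated works like a binary search over the block. The following is a sketch under simplifying assumptions: free blocks are kept as a dict from order to a set of start pfns, and the function names reuse those from the message for readability only.

```python
def break_down_buddy_pages(free, start, order, target):
    """Halve the block repeatedly; the half without target goes back to free."""
    while order > 0:
        order -= 1
        half = 1 << order
        if target >= start + half:
            free.setdefault(order, set()).add(start)         # lower half stays free
            start += half
        else:
            free.setdefault(order, set()).add(start + half)  # upper half stays free


def take_page_off_buddy(free, target):
    """Remove the page at pfn `target` from the free blocks, if it is free."""
    for order, blocks in list(free.items()):
        size = 1 << order
        start = target & ~(size - 1)    # aligned block that would contain target
        if start in blocks:
            blocks.remove(start)
            break_down_buddy_pages(free, start, order, target)
            return True
    return False                        # the page is not free
```

For example, taking pfn 5 out of a single free order-3 block at pfn 0 leaves an order-2 block at 0, an order-1 block at 6 and an order-0 block at 4.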
  14. 14 Oct, 2020 2 commits
  15. 04 Sep, 2020 1 commit
    • mm: Add PG_arch_2 page flag · 4beba948
      Steven Price committed
      For arm64 MTE support it is necessary to be able to mark pages that
      contain user space visible tags that will need to be saved/restored e.g.
      when swapped out.
      
      To support this add a new arch specific flag (PG_arch_2). This flag is
      only available on 64-bit architectures due to the limited number of
      spare page flags on the 32-bit ones.
      Signed-off-by: Steven Price <steven.price@arm.com>
      [catalin.marinas@arm.com: use CONFIG_64BIT for guarding this new flag]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
  16. 05 Jun, 2020 1 commit
    • mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      David Hildenbrand committed
      virtio-mem wants to allow to offline memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to later
      offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Let's make that possible by allowing any PageOffline() page to be isolated
      when offlining.  This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that it is fine with
      offlining this page by dropping its reference count.  PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (as if they were free, although they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
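The notifier handshake described above, including Note 1 about MEM_CANCEL_OFFLINE, can be sketched as follows. `CooperativeDriver` and `offline_block()` are hypothetical names, and the `other_failure` parameter is an assumed stand-in for any unrelated reason offlining might fail.

```python
MEM_GOING_OFFLINE = "going_offline"
MEM_CANCEL_OFFLINE = "cancel_offline"


class OfflinePage:
    def __init__(self):
        self.page_offline = True
        self.refcount = 1        # reference held by the owning driver


class CooperativeDriver:
    """Hypothetical driver that agrees to offlining its PageOffline() pages."""

    def __init__(self, pages):
        self.pages = pages

    def notify(self, event):
        for p in self.pages:
            if event == MEM_GOING_OFFLINE:
                p.refcount -= 1   # signal: fine with offlining this page
            elif event == MEM_CANCEL_OFFLINE:
                p.refcount += 1   # Note 1: take the reference back


def offline_block(pages, driver, other_failure=False):
    driver.notify(MEM_GOING_OFFLINE)
    ok = not other_failure and all(
        p.refcount == 0 for p in pages if p.page_offline)
    if not ok:
        driver.notify(MEM_CANCEL_OFFLINE)
    return ok
```

A driver that does not cooperate would simply leave the reference count untouched, so offlining fails as before.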
  17. 08 Apr, 2020 2 commits
    • mm: introduce Reported pages · 36e66c55
      Alexander Duyck committed
      In order to pave the way for free page reporting in virtualized
      environments we will need a way to get pages out of the free lists and
      identify those pages after they have been returned.  To accomplish this,
      this patch adds the concept of a Reported Buddy, which is essentially
      meant to just be the Uptodate flag used in conjunction with the Buddy page
      type.
      
      To prevent the reported pages from leaking outside of the buddy lists I
      added a check to clear the PageReported bit in the del_page_from_free_list
      function.  As a result any reported page that is split, merged, or
      allocated will have the flag cleared prior to the PageBuddy value being
      cleared.
      
      The process for reporting pages is fairly simple.  Once we free a page
      that meets the minimum order for page reporting we will schedule a worker
      thread to start 2s or more in the future.  That worker thread will begin
      working from the lowest supported page reporting order up to MAX_ORDER - 1
      pulling unreported pages from the free list and storing them in the
      scatterlist.
      
      When processing each individual free list it is necessary for the worker
      thread to release the zone lock when it needs to stop and report the full
      scatterlist of pages.  To reduce the work of the next iteration the worker
      thread will rotate the free list so that the first unreported page in the
      free list becomes the first entry in the list.
      
      It will then call a reporting function providing information on how many
      entries are in the scatterlist.  Once the function completes it will
      return the pages to the free area from which they were allocated and start
      over pulling more pages from the free areas until there are no longer
      enough pages to report on to keep the worker busy, or we have processed as
      many pages as were contained in the free area when we started processing
      the list.
      
      The worker thread will work in a round-robin fashion making its way through
      each zone requesting reporting, and through each reportable free list
      within that zone.  Once all free areas within the zone have been processed
      it will check to see if there have been any requests for reporting while
      it was processing.  If so it will reschedule the worker thread to start up
      again in roughly 2s and exit.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
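The batching loop of the reporting worker can be approximated with a short model. This is a deliberately simplified sketch: the free list is a Python list of dicts, `capacity` stands in for the scatterlist size, `report` is a hypothetical callback, and locking, the 2s delay, and list rotation are left out.

```python
def report_free_list(free_list, capacity, report, budget=None):
    """Report unreported pages in batches of at most `capacity` entries."""
    # Process at most as many pages as the list held when we started.
    budget = len(free_list) if budget is None else budget
    while budget > 0:
        unreported = [p for p in free_list if not p["reported"]]
        batch = unreported[:min(capacity, budget)]
        if not batch:
            break
        for p in batch:
            free_list.remove(p)    # pull the pages off the free list
        report(batch)              # hypothetical call into the hypervisor
        for p in batch:
            p["reported"] = True
            free_list.append(p)    # return them, now flagged as reported
        budget -= len(batch)


def del_page_from_free_list(free_list, page):
    """Any page leaving the free list loses its reported flag first."""
    page["reported"] = False
    free_list.remove(page)
```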
    • mm: code cleanup for MADV_FREE · 9de4f22a
      Huang Ying committed
      Some comments for MADV_FREE are revised and added to help people understand
      the MADV_FREE code, especially the page flag PG_swapbacked.  This leaves
      page_is_file_cache() inconsistent with its comments, so the function
      is renamed to page_is_file_lru() to make them consistent again.  All these
      changes are put in one patch as one logical change.
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
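The reason for the rename becomes clearer with a tiny model of how PG_swapbacked decides LRU placement. The sketch assumes, as my reading of the MADV_FREE design rather than something stated in this message, that lazily freed anonymous pages have the swap-backed flag cleared and are therefore treated like file pages by reclaim.

```python
from dataclasses import dataclass


@dataclass
class Page:
    anon: bool
    swapbacked: bool


def page_is_file_lru(page):
    """True if the page belongs on the file LRU, i.e. it is not swap backed."""
    return not page.swapbacked


def madvise_free(page):
    """Model of lazy free: an anonymous page stops being swap backed."""
    if page.anon:
        page.swapbacked = False
```

Under that assumption an anonymous page can sit on the "file" LRU, which is why a name referring to the page cache no longer fit.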
  18. 22 Mar, 2020 1 commit
  19. 07 Nov, 2019 1 commit
    • mm: thp: handle page cache THP correctly in PageTransCompoundMap · 169226f7
      Yang Shi committed
      We have a use case that uses tmpfs as the QEMU memory backend, and we would
      like to take advantage of THP as well.  But our tests show that the EPT is
      not PMD mapped even though the underlying THPs are PMD mapped on the host.
      The number shown by /sys/kernel/debug/kvm/largepages is much smaller than
      the number of PMD mapped shmem pages, as shown below:
      
        7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   579584 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        12
      
      And some benchmarks do worse than with anonymous THPs.
      
      By digging into the code we figured out that commit 127393fb ("mm:
      thp: kvm: fix memory corruption in KVM with THP enabled") checks if
      there is a single PTE mapping on the page for anonymous THP when setting
      up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
      since every subpage of page cache THP would get _mapcount inc'ed once it
      is PMD mapped, so PageTransCompoundMap() always returns false for page
      cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.
      
      So we need to handle page cache THP correctly.  However, when a page cache
      THP's PMD gets split, the kernel just removes the mapping instead of setting
      up a PTE map as it does for anonymous THP.  Before KVM calls get_user_pages()
      the subpages may get PTE mapped even though it is still a THP, since the
      page cache THP may be mapped by other processes in the meantime.
      
      So check its _mapcount and whether the THP is PTE mapped or not.
      Although this may report some false negative cases (PTE mapped by other
      processes), it does not look trivial to make this accurate.
      
      With this fix /sys/kernel/debug/kvm/largepages shows that a reasonable
      number of pages are PMD mapped by EPT, as shown below:
      
        7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   557056 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        271
      
      And the benchmarks perform the same as with anonymous THPs.
      
      [yang.shi@linux.alibaba.com: v4]
        Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Tested-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
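Why the `_mapcount < 0` test fails for page cache THP, and what the alternative comparison achieves, can be seen in a small model. The sketch assumes the usual convention that map counts start at -1, and it expresses the new check as "the subpage map count equals the compound map count", which is my reading of the fix rather than a quote from the patch.

```python
class Thp:
    def __init__(self, anon, nr=512):
        self.anon = anon
        self.compound_mapcount = -1      # PMD mappings of the whole THP
        self.mapcount = [-1] * nr        # per-subpage _mapcount

    def map_pmd(self):
        self.compound_mapcount += 1
        if not self.anon:                # page cache THP also bumps every subpage
            self.mapcount = [m + 1 for m in self.mapcount]

    def map_pte(self, i):
        self.mapcount[i] += 1


def page_trans_compound_map(thp, i):
    if thp.anon:
        return thp.mapcount[i] < 0       # no PTE mapping on this subpage
    # Page cache THP: PMD mapped and this subpage is not PTE mapped as well.
    return thp.mapcount[i] == thp.compound_mapcount
```

A subpage that another process has additionally PTE mapped fails the comparison, which corresponds to the false negatives the message accepts.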
  20. 01 Aug, 2019 1 commit
  21. 13 Jul, 2019 1 commit
  22. 06 Mar, 2019 2 commits
    • mm: better document PG_reserved · 6e2e07cd
      David Hildenbrand committed
      The usage of PG_reserved and how PG_reserved pages are to be treated is
      buried deep down in different parts of the kernel.  Let's shine some
      light onto these details by documenting current users and expected
      behavior.
      
      Especially, clarify on the "Some of them might not even exist" case.
      These are physical memory gaps that will never be dumped as they are not
      marked as IORESOURCE_SYSRAM.  PG_reserved does in general not hinder
      anybody from dumping or swapping.  In some cases, these pages will not
      be stored in the hibernation image.
      
      Link: http://lkml.kernel.org/r/20190114125903.24845-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: <yi.z.zhang@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert PG_balloon to PG_offline · ca215086
      David Hildenbrand committed
      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.
      
      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page is
      logically offline, the content stale and that it should not be touched
      (e.g.  a hypervisor would have to allocate backing storage in order for
      the guest to dump an unused page).  We can then e.g.  exclude such pages
      from dumps.
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 29 Dec, 2018 1 commit
  24. 27 Oct, 2018 3 commits
    • A
      mm: create non-atomic version of SetPageReserved for init use · d483da5b
      Alexander Duyck committed
      It doesn't make much sense to use the atomic SetPageReserved at init time
      when we are using memset to clear the memory and manipulating the page
      flags via simple "&=" and "|=" operations in __init_single_page.
      
      This patch adds a non-atomic version __SetPageReserved that can be used
      during page init and shows about a 10% improvement in initialization times
      on the systems I have available for testing.  On those systems I saw
      initialization times drop from around 35 seconds to around 32 seconds to
      initialize a 3TB block of persistent memory.  I believe the main advantage
      of this is that it allows for more compiler optimization as the __set_bit
      operation can be reordered whereas the atomic version cannot.
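
      The difference between the two variants can be sketched in userspace
      C as follows.  This is an illustration under stated assumptions, not
      the kernel implementation: struct demo_page, DEMO_PG_RESERVED and the
      function names are hypothetical, and the atomic variant uses the
      GCC/Clang __atomic_fetch_or builtin in place of the kernel's set_bit.

      ```c
      #include <string.h>

      /* Hypothetical stand-in for struct page; only the flags word matters here. */
      struct demo_page {
      	unsigned long flags;
      };

      /* Hypothetical bit position, chosen only for the demo. */
      #define DEMO_PG_RESERVED 10

      /* Atomic variant: a full read-modify-write the compiler cannot reorder. */
      static void demo_set_reserved_atomic(struct demo_page *p)
      {
      	__atomic_fetch_or(&p->flags, 1UL << DEMO_PG_RESERVED, __ATOMIC_SEQ_CST);
      }

      /* Non-atomic variant: a plain "|=" the compiler may merge with nearby stores. */
      static void demo_set_reserved_nonatomic(struct demo_page *p)
      {
      	p->flags |= 1UL << DEMO_PG_RESERVED;
      }

      /* Init-time pattern: nobody else can see the page yet, so plain stores suffice. */
      static void demo_init_page(struct demo_page *p)
      {
      	memset(p, 0, sizeof(*p));
      	demo_set_reserved_nonatomic(p);
      }
      ```

      Both variants leave the same value in flags; the non-atomic one is
      only safe while no other CPU can touch the page, as is the case
      during init.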
      
      I tried adding a bit of documentation based on f1dd2cd1 ("mm,
      memory_hotplug: do not associate hotadded memory to zones until online").
      
      Ideally the reserved flag should be set earlier since there is a brief
      window where the page is initialized via __init_single_page and we have
      not yet set the PG_reserved flag.  I'm leaving that for a future patch
      set as that will require a more significant refactor.
      
      Link: http://lkml.kernel.org/r/20180925202018.3576.11607.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: provide kernel parameter to allow disabling page init poisoning · f682a97a
      Alexander Duyck committed
      Patch series "Address issues slowing persistent memory initialization", v5.
      
      The main thing this patch set achieves is that it allows us to initialize
      each node worth of persistent memory independently.  As a result we reduce
      page init time by about 2 minutes because instead of taking 30 to 40
      seconds per node and going through each node one at a time, we process all
      4 nodes in parallel in the case of a 12TB persistent memory setup spread
      evenly over 4 nodes.
      
      This patch (of 3):
      
      On systems with a large amount of memory it can take a significant amount
      of time to initialize all of the page structs with the PAGE_POISON_PATTERN
      value.  I have seen it take over 2 minutes to initialize a system with
      over 12TB of RAM.
      
      In order to work around the issue I had to disable CONFIG_DEBUG_VM and
      then the boot time returned to something much more reasonable as the
      arch_add_memory call completed in milliseconds versus seconds.  However in
      doing that I had to disable all of the other VM debugging on the system.
      
      In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
      on a system that has a large amount of memory I have added a new kernel
      parameter named "vm_debug" that can be set to "-" in order to disable it.
      
      Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: workingset: tell cache transitions from workingset thrashing · 1899ad18
      Johannes Weiner committed
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to balance pressure more smartly
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bona fide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
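
      A simplified sketch of the idea, assuming a shadow entry that holds
      only an eviction counter plus the new bit (the real shadow entry
      carries more fields, and the classification here follows only the
      description above).  All names below are hypothetical.

      ```c
      #include <stdbool.h>

      enum demo_refault { DEMO_TRANSITION, DEMO_THRASHING };

      /*
       * Pack the eviction counter and the "was active at some point" bit
       * into one word.  The counter's top bit is lost by the shift, which
       * the demo accepts.
       */
      static unsigned long demo_pack_shadow(unsigned long eviction, bool workingset)
      {
      	return (eviction << 1) | (workingset ? 1UL : 0UL);
      }

      /* Recover both fields from a shadow word produced by demo_pack_shadow(). */
      static void demo_unpack_shadow(unsigned long shadow,
      			       unsigned long *eviction, bool *workingset)
      {
      	*workingset = shadow & 1UL;
      	*eviction = shadow >> 1;
      }

      /* A refault of a page that had been active counts as thrashing. */
      static enum demo_refault demo_classify(unsigned long shadow)
      {
      	unsigned long eviction;
      	bool workingset;

      	demo_unpack_shadow(shadow, &eviction, &workingset);
      	return workingset ? DEMO_THRASHING : DEMO_TRANSITION;
      }
      ```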
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
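
      The arithmetic above can be checked with a small sketch.  The
      per-feature bit counts are read off the running totals in the list
      (20, +1, +2, +6, +1, +1); the struct and function names are
      hypothetical.

      ```c
      #include <assert.h>

      /* Bits of page->flags consumed by each feature on 32-bit. */
      struct demo_flag_budget {
      	int base;	/* always-present page flags */
      	int mmu;	/* extra flag with an MMU */
      	int zone;	/* DMA, Normal, HighMem, Movable */
      	int section;	/* sparsemem section bits */
      	int pae;	/* extra section bit with PAE */
      	int workingset;	/* the flag added by this patch */
      };

      /* Total bits taken out of the flags word. */
      static int demo_bits_used(const struct demo_flag_budget *b)
      {
      	return b->base + b->mmu + b->zone + b->section + b->pae + b->workingset;
      }

      /* Bits left over for the NUMA node id. */
      static int demo_node_bits(const struct demo_flag_budget *b, int word_bits)
      {
      	int left = word_bits - demo_bits_used(b);

      	assert(left >= 0 && left < 31);	/* keeps the shift below well defined */
      	return left;
      }

      /* Number of NUMA nodes that can be told apart with the leftover bits. */
      static int demo_max_nodes(const struct demo_flag_budget *b, int word_bits)
      {
      	return 1 << demo_node_bits(b, word_bits);
      }
      ```

      With the budget { 20, 1, 2, 6, 1, 1 } and a 32-bit word this gives
      31 bits used and 1 bit left, i.e. two distinguishable NUMA nodes.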
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 24 Aug, 2018 1 commit
  26. 08 Jun, 2018 2 commits
  27. 06 Apr, 2018 1 commit
  28. 01 Feb, 2018 1 commit
  29. 16 Nov, 2017 1 commit