1. 09 Aug 2022 (4 commits)
  2. 03 Aug 2022 (1 commit)
  3. 18 Jul 2022 (1 commit)
    • fsdax: set a CoW flag when associate reflink mappings · 6061b69b
      Authored by Shiyang Ruan
      Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file
      mappings.  In this case, since the dax-rmap has already taken on the
      responsibility of looking up shared files for a given dax page, the
      page->mapping is no longer used for rmap but for marking that this dax
      page is shared.  And to make sure disassociation works fine, we use
      page->index as a refcount, and clear page->mapping back to the initial
      state when page->index drops to 0.
      
      With the help of this new flag, we are able to distinguish the normal case
      from the CoW case, and keep the warning in the normal case.
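
      To make the scheme concrete, here is a minimal sketch of how the association
      and disassociation described above could look.  The helper names
      dax_page_share_get()/dax_page_share_put() and the exact bookkeeping are
      illustrative assumptions, not necessarily the merged code:

      	/* Sketch only: mark a dax page as shared and count its sharers. */
      	#define PAGE_MAPPING_DAX_COW	((struct address_space *)0x1)

      	static void dax_page_share_get(struct page *page)	/* hypothetical */
      	{
      		if (page->mapping != PAGE_MAPPING_DAX_COW) {
      			/*
      			 * First sharer: repurpose ->mapping as the "shared"
      			 * marker and start counting sharers in ->index.
      			 */
      			page->mapping = PAGE_MAPPING_DAX_COW;
      			page->index = 0;
      		}
      		page->index++;
      	}

      	static void dax_page_share_put(struct page *page)	/* hypothetical */
      	{
      		/* Last sharer gone: restore the initial, unassociated state. */
      		if (--page->index == 0)
      			page->mapping = NULL;
      	}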
      
      Link: https://lkml.kernel.org/r/20220603053738.1218681-8-ruansy.fnst@fujitsu.com
      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6061b69b
  4. 04 Jul 2022 (2 commits)
  5. 13 May 2022 (1 commit)
  6. 10 May 2022 (2 commits)
    • fs: Remove last vestiges of releasepage · 704ead2b
      Authored by Matthew Wilcox (Oracle)
      All users are now converted to release_folio.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      704ead2b
    • mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages · 78fbe906
      Authored by David Hildenbrand
      The basic question we would like to have a reliable and efficient answer
      to is: is this anonymous page exclusive to a single process or might it be
      shared?  We need that information for ordinary/single pages, hugetlb
      pages, and possibly each subpage of a THP.
      
      Introduce a way to mark an anonymous page as exclusive, with the ultimate
      goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
      lose consistency with the pages mapped into the page table, resulting in
      reported memory corruptions.
      
      Most pageflags already have semantics for anonymous pages, however,
      PG_mappedtodisk should never apply to pages in the swapcache, so let's
      reuse that flag.
      
      As PG_has_hwpoisoned also uses that flag on the second tail page of a
      compound page, convert it to PG_error instead; PG_error is marked as
      PF_NO_TAIL and is therefore never used for tail pages.
      
      Use custom page flag modification functions such that we can do additional
      sanity checks.  The semantics we'll put into some kernel doc in the future
      are:
      
      "
        PG_anon_exclusive is *usually* only expressive in combination with a
        page table entry. Depending on the page table entry type it might
        store the following information:
      
             Is what's mapped via this page table entry exclusive to the
             single process and can be mapped writable without further
             checks? If not, it might be shared and we might have to COW.
      
        For now, we only expect PTE-mapped THPs to make use of
        PG_anon_exclusive in subpages. For other anonymous compound
        folios (i.e., hugetlb), only the head page is logically mapped and
        holds this information.
      
        For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
        set on the head page. When replacing the PMD by a page table full
        of PTEs, PG_anon_exclusive, if set on the head page, will be set on
        all tail pages accordingly. Note that converting from a PTE-mapping
        to a PMD mapping using the same compound page is currently not
        possible and consequently doesn't require care.
      
        If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
        it should only pin if the relevant PG_anon_exclusive is set. In that
        case, the pin will be fully reliable and stay consistent with the pages
        mapped into the page table, as the bit cannot get cleared (e.g., by
        fork(), KSM) while the page is pinned. For anonymous pages that
        are mapped R/W, PG_anon_exclusive can be assumed to always be set
        because such pages cannot possibly be shared.
      
        The page table lock protecting the page table entry is the primary
        synchronization mechanism for PG_anon_exclusive; GUP-fast that does
        not take the PT lock needs special care when trying to clear the
        flag.
      
        Page table entry types and PG_anon_exclusive:
        * Present: PG_anon_exclusive applies.
        * Swap: the information is lost. PG_anon_exclusive was cleared.
        * Migration: the entry holds this information instead.
                     PG_anon_exclusive was cleared.
        * Device private: PG_anon_exclusive applies.
        * Device exclusive: PG_anon_exclusive applies.
        * HW Poison: PG_anon_exclusive is stale and not changed.
      
        If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
        not allowed and the flag will stick around until the page is freed
        and folio->mapping is cleared.
      "
      
      We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
      zapping) of page table entries; the page freeing code will handle that when
      it also invalidates page->mapping so the page no longer reads as PageAnon().
      Letting information about exclusivity stick around will be an important
      property when adding sanity checks to unpinning code.
      
      Note that we properly clear the flag in free_pages_prepare() via
      PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
      so there is no need to manually clear the flag.
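
      The "custom page flag modification functions" are essentially thin wrappers
      around the usual bit operations plus extra asserts.  A minimal sketch of
      what such a setter can look like, assuming the constraints described above
      (the asserts in the merged code may be stricter or differ in detail):

      	/* Sketch only: set PG_anon_exclusive with additional sanity checks. */
      	static __always_inline void SetPageAnonExclusive(struct page *page)
      	{
      		/*
      		 * The bit aliases PG_mappedtodisk, which must never be used on
      		 * swapcache pages, so only anonymous, non-KSM pages qualify.
      		 */
      		VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
      		/* hugetlb tracks exclusivity on the head page only. */
      		VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
      		set_bit(PG_anon_exclusive, &page->flags);
      	}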
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      78fbe906
  7. 29 Apr 2022 (2 commits)
  8. 25 Mar 2022 (1 commit)
  9. 23 Mar 2022 (3 commits)
    • mm/migrate: fix race between lock page and clear PG_Isolated · 356ea386
      Authored by andrew.yang
      When memory is tight, the system may start compacting memory to satisfy
      demands for large contiguous memory.  If a process tries to lock a memory
      page that is being locked and isolated for compaction, it may wait a long
      time or even forever.  This is because compaction performs a non-atomic
      PG_Isolated clear while holding the page lock, which may overwrite
      PG_waiters set by the process that failed to obtain the page lock and
      added itself to the waiting queue to wait for the lock to be released.
      
        CPU1                            CPU2
        lock_page(page); (successful)
                                        lock_page(); (failed)
        __ClearPageIsolated(page);      SetPageWaiters(page) (may be overwritten)
        unlock_page(page);
      
      The solution is to not perform non-atomic operations on page flags while
      holding the page lock.
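
      Concretely, the change amounts to using the atomic flag helper while the
      page lock is held; a minimal before/after sketch (the helper names follow
      the usual page-flag naming convention, and the actual call site is in the
      migration code):

      	lock_page(page);
      	/*
      	 * before: __ClearPageIsolated(page);
      	 * A non-atomic read-modify-write of page->flags can drop a concurrent
      	 * PG_waiters set by a lock_page() waiter.
      	 */
      	ClearPageIsolated(page);	/* atomic clear_bit(), other bits preserved */
      	unlock_page(page);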
      
      Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
      Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Vlastimil Babka" <vbabka@suse.cz>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "William Kucharski" <william.kucharski@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Nicholas Tang <nicholas.tang@mediatek.com>
      Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      356ea386
    • mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key · a6b40850
      Authored by Muchun Song
      The page_fixed_fake_head() helper is used throughout memory management, and
      its conditional check requires reading a global variable.  Although the
      overhead of this check may be small, it grows when the memory cache comes
      under pressure.  Also, the global variable is not modified after system
      boot, so it is very appropriate to use the static key mechanism.
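
      A minimal sketch of the bool-to-static-key conversion described above; the
      key name is taken from the commit context, but treat the details as
      illustrative rather than the exact merged code:

      	/* Sketch: replace a global bool with a jump-label based static key. */
      	DEFINE_STATIC_KEY_FALSE(hugetlb_free_vmemmap_enabled_key);

      	static __always_inline bool hugetlb_free_vmemmap_enabled(void)
      	{
      		/*
      		 * Patched into a direct jump at runtime, so the common path
      		 * costs no memory load on a global flag.
      		 */
      		return static_branch_unlikely(&hugetlb_free_vmemmap_enabled_key);
      	}

      	/* Early-boot setup, when the command line enables the feature: */
      	static_branch_enable(&hugetlb_free_vmemmap_enabled_key);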
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6b40850
    • mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page · e7d32485
      Authored by Muchun Song
      Patch series "Free the 2nd vmemmap page associated with each HugeTLB
      page", v7.
      
      This series can minimize the overhead of struct page for 2MB HugeTLB
      pages significantly.  It further reduces the overhead of struct page by
      12.5% for a 2MB HugeTLB compared to the previous approach, which means
      2GB per 1TB HugeTLB.  It is a nice gain.  Comments and reviews are
      welcome.  Thanks.
      
      For the main implementation and details, refer to the commit log of patch
      1.  In this series, I have changed the following four helpers; the
      following table shows the impact on the overhead of those helpers.
      
      	+------------------+-----------------------+
      	|       APIs       | head page | tail page |
      	+------------------+-----------+-----------+
      	|    PageHead()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|    PageTail()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|  PageCompound()  |     N     |     N     |
      	+------------------+-----------+-----------+
      	|  compound_head() |     Y     |     N     |
      	+------------------+-----------+-----------+
      
      	Y: Overhead is increased.
      	N: Overhead is _NOT_ increased.
      
      It shows that the overhead of those helpers on a tail page doesn't change
      between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off".  But the
      overhead on a head page is increased when "hugetlb_free_vmemmap=on"
      (except for PageCompound()).  So I believe that Matthew Wilcox's folio
      series will help with this.
      
      PageHead() and PageTail() have far fewer users than compound_head(), and
      most users of PageTail() are VM_BUG_ON()s, so I have done some tests on
      the overhead of compound_head() on head pages.
      
      I have measured the overhead of calling compound_head() on a head page,
      which is 2.11ns (call compound_head() 10 million times and average the
      call time).
      
      For a head page whose address is not aligned with PAGE_SIZE, or for a
      non-compound page, the overhead of compound_head() is 2.54ns, an increase
      of 20%.  For a head page whose address is aligned with PAGE_SIZE, the
      overhead of compound_head() is 2.97ns, an increase of 40%.  Most pages are
      the former.  I do not think the overhead is significant since the overhead
      of compound_head() itself is low.
      
      This patch (of 5):
      
      This patch minimizes the overhead of struct page for 2MB HugeTLB pages
      significantly.  It further reduces the overhead of struct page by 12.5%
      for a 2MB HugeTLB compared to the previous approach, which means 2GB per
      1TB HugeTLB (2MB type).
      
      After the feature "Free some vmemmap pages of HugeTLB page" is
      enabled, the mapping of the vmemmap addresses associated with a 2MB
      HugeTLB page becomes the figure below.
      
           HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
      remapped.  However, the 2nd vmemmap page frame can also be freed to
      the buddy allocator, then we can change the mapping from the figure
      above to the figure below.
      
          HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
       |           |                     +-----------+                  | | | | | |
       |           |                     |     2     | -----------------+ | | | | |
       |           |                     +-----------+                    | | | | |
       |           |                     |     3     | -------------------+ | | | |
       |           |                     +-----------+                      | | | |
       |           |                     |     4     | ---------------------+ | | |
       |    2MB    |                     +-----------+                        | | |
       |           |                     |     5     | -----------------------+ | |
       |           |                     +-----------+                          | |
       |           |                     |     6     | -------------------------+ |
       |           |                     +-----------+                            |
       |           |                     |     7     | ---------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      After we do this, all tail vmemmap pages (1-7) are mapped to the head
      vmemmap page frame (0).  In other words, there is more than one page
      struct with PG_head associated with each HugeTLB page.  We __know__ that
      there is only one head page struct; the tail page structs with PG_head are
      fake head page structs.  We need an approach to distinguish between those
      two different types of page structs so that compound_head(), PageHead()
      and PageTail() can work properly if the parameter is a tail page struct
      that nevertheless has PG_head set.
      
      The following code snippet describes how to distinguish between real and
      fake head page struct.
      
      	if (test_bit(PG_head, &page->flags)) {
      		unsigned long head = READ_ONCE(page[1].compound_head);
      
      		if (head & 1) {
      			if (head == (unsigned long)page + 1)
      				==> head page struct
      			else
      				==> tail page struct
      		} else
      			==> head page struct
      	}
      
      We can safely access the fields of @page[1] with PG_head because the
      @page is a compound page composed of at least two contiguous pages.
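
      Expressed as a helper, the check above might look roughly like the sketch
      below.  The upstream helper that grew out of this logic is
      page_fixed_fake_head(); the name fixup_fake_head() and the body here are an
      illustration of the decision tree, not the exact merged code:

      	/* Sketch: return the real head page struct if @page is a fake head. */
      	static __always_inline const struct page *fixup_fake_head(const struct page *page)
      	{
      		if (test_bit(PG_head, &page->flags)) {
      			unsigned long head = READ_ONCE(page[1].compound_head);

      			/* Bit 0 set means page[1] is a tail pointing at its head. */
      			if (head & 1) {
      				if (head == (unsigned long)page + 1)
      					return page;			/* real head */
      				return (const struct page *)(head - 1);	/* fake head */
      			}
      		}
      		return page;	/* ordinary head page or non-compound page */
      	}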
      
      [songmuchun@bytedance.com: restore lost comment changes]
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7d32485
  10. 15 Jan 2022 (2 commits)
  11. 07 Jan 2022 (1 commit)
  12. 03 Jan 2022 (1 commit)
  13. 17 Nov 2021 (2 commits)
  14. 07 Nov 2021 (1 commit)
  15. 02 Nov 2021 (1 commit)
  16. 29 Oct 2021 (1 commit)
    • mm: filemap: check if THP has hwpoisoned subpage for PMD page fault · eac96c3e
      Authored by Yang Shi
      When handling a shmem page fault, a THP with a corrupted subpage could be
      PMD-mapped if certain conditions are satisfied.  But the kernel is supposed
      to send SIGBUS when trying to map a hwpoisoned page.
      
      There are two paths which may do the PMD mapping: fault-around and regular
      fault.
      
      Before commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") things were even worse in the fault-around path: the THP
      could be PMD-mapped as long as the VMA fit, regardless of which subpage
      was accessed and corrupted.  After that commit, the THP can be PMD-mapped
      as long as the head page is not corrupted.
      
      In the regular fault path the THP could be PMD mapped as long as the
      corrupted page is not accessed and the VMA fits.
      
      This loophole could be fixed by iterating every subpage to check if any
      of them is hwpoisoned or not, but it is somewhat costly in page fault
      path.
      
      So introduce a new page flag called HasHWPoisoned on the first tail
      page.  It indicates that the THP has a hwpoisoned subpage.  It is set if
      any subpage of the THP is found hwpoisoned by memory failure after the
      refcount has been bumped successfully, and it is cleared when the THP is
      freed or split.
      
      The soft offline path doesn't need this since the soft offline handler
      just marks a subpage hwpoisoned when the subpage is migrated successfully.
      But shmem THPs don't get split and then migrated at all.
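
      For illustration, the check in the PMD fault path boils down to bailing out
      to per-PTE mapping when the new flag is set.  A minimal sketch, where the
      surrounding function is hypothetical and only the flag accessor follows the
      naming described above:

      	/* Sketch: refuse to PMD-map a THP that carries a poisoned subpage. */
      	static vm_fault_t thp_map_pmd(struct vm_fault *vmf, struct page *page)
      	{
      		if (unlikely(PageHasHWPoisoned(page)))
      			/*
      			 * Fall back to PTE mapping; the poisoned subpage will
      			 * then raise SIGBUS when it is actually touched.
      			 */
      			return VM_FAULT_FALLBACK;

      		/* ... install the PMD mapping as before ... */
      		return 0;
      	}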
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eac96c3e
  17. 18 Oct 2021 (2 commits)
  18. 27 Sep 2021 (2 commits)
  19. 09 Sep 2021 (2 commits)
  20. 04 Sep 2021 (1 commit)
    • mm, slub: do initial checks in ___slab_alloc() with irqs enabled · 0b303fb4
      Authored by Vlastimil Babka
      As another step of shortening irq-disabled sections in ___slab_alloc(), delay
      disabling irqs until we pass the initial checks of whether there is a cached
      percpu slab and whether it's suitable for our allocation.

      Now we have to recheck c->page after actually disabling irqs, as an allocation
      in an irq handler might have replaced it.
      
      Because we call pfmemalloc_match() as one of the checks, we might hit
      VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
      interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe()
      variant that lacks the PageSlab check.
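
      A minimal sketch of what such an "unsafe" variant can look like, assuming a
      raw __PageSlabPfmemalloc() flag test without the PageSlab assertion (the
      merged helper may differ in name and detail):

      	/*
      	 * Lockless-path variant: skips the VM_BUG_ON_PAGE(!PageSlab(page))
      	 * check because, with irqs enabled, the page can be freed (and stop
      	 * being a slab page) under us.  A stale answer is harmless here; the
      	 * result is rechecked after irqs are disabled.
      	 */
      	static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
      	{
      		if (unlikely(__PageSlabPfmemalloc(page)))
      			return gfp_pfmemalloc_allowed(gfpflags);

      		return true;
      	}
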
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      0b303fb4
  21. 02 Aug 2021 (1 commit)
  22. 01 Jul 2021 (2 commits)
    • mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() · 82840451
      Authored by David Hildenbrand
      A driver might set a page logically offline -- PageOffline() -- and turn
      the page inaccessible in the hypervisor; after that, access to page
      content can be fatal.  One example is virtio-mem; while unplugged memory
      -- marked as PageOffline() -- can currently be read in the hypervisor, this
      will no longer be the case in the future; for example, when having a
      virtio-mem device backed by huge pages in the hypervisor.
      
      Some special PFN walkers -- i.e., /proc/kcore -- read content of random
      pages after checking PageOffline(); however, these PFN walkers can race
      with drivers that set PageOffline().
      
      Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing.
      
      page_offline_freeze()/page_offline_thaw() allows for a subsystem to
      synchronize with such drivers, achieving that a page cannot be set
      PageOffline() while frozen.
      
      page_offline_begin()/page_offline_end() is used by drivers that care about
      such races when setting a page PageOffline().
      
      For simplicity, use a rwsem for now; neither drivers nor users are
      performance sensitive.
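
      A minimal sketch of the rwsem-based scheme, assuming PFN walkers take the
      write side while drivers setting PageOffline() take the read side (the
      polarity is an illustrative assumption):

      	static DECLARE_RWSEM(page_offline_rwsem);

      	/*
      	 * PFN walkers (e.g. /proc/kcore): no page can become PageOffline()
      	 * between freeze and thaw.
      	 */
      	void page_offline_freeze(void)
      	{
      		down_write(&page_offline_rwsem);
      	}

      	void page_offline_thaw(void)
      	{
      		up_write(&page_offline_rwsem);
      	}

      	/*
      	 * Drivers bracket the code that sets pages PageOffline(); multiple
      	 * drivers may do so concurrently.
      	 */
      	void page_offline_begin(void)
      	{
      		down_read(&page_offline_rwsem);
      	}

      	void page_offline_end(void)
      	{
      		up_read(&page_offline_rwsem);
      	}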
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82840451
    • fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages · 0daa322b
      Authored by David Hildenbrand
      Let's avoid reading:
      
      1) Offline memory sections: the content of offline memory sections is
         stale as the memory is effectively unused by the kernel.  On s390x with
         standby memory, offline memory sections (belonging to offline storage
         increments) are not accessible.  With virtio-mem and the hyper-v
         balloon, we can have unavailable memory chunks that should not be
         accessed inside offline memory sections.  Last but not least, offline
         memory sections might contain hwpoisoned pages which we can no longer
         identify because the memmap is stale.
      
      2) PG_offline pages: logically offline pages that are documented as
         "The content of these pages is effectively stale.  Such pages should
         not be touched (read/write/dump/save) except by their owner.".
         Examples include pages inflated in a balloon or unavailable memory
         ranges inside hotplugged memory sections with virtio-mem or the hyper-v
         balloon.
      
      3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
         As documented: "Accessing is not safe since it may cause another
         machine check.  Don't touch!"
      
      Introduce is_page_hwpoison(), adding a comment that it is inherently racy
      but best we can really do.
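
      A minimal sketch of such a best-effort check, assuming hugetlb keeps the
      poison flag on the head page so both the page itself and its potential
      compound head are consulted (the merged helper may differ in detail):

      	/*
      	 * Racy by design: the page is not pinned and its state can change
      	 * under us, but this is the best a PFN walker can do.
      	 */
      	static inline bool is_page_hwpoison(struct page *page)
      	{
      		if (PageHWPoison(page))
      			return true;

      		/* hugetlb marks poison on the head page only. */
      		return PageHuge(page) && PageHWPoison(compound_head(page));
      	}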
      
      Reading /proc/kcore now performs similar checks as when reading
      /proc/vmcore for kdump via makedumpfile: problematic pages are excluded.
      It's also similar to hibernation code; however, we don't skip hwpoisoned
      pages when processing pages in kernel/power/snapshot.c:saveable_page()
      yet.
      
      Note 1: we can race against memory offlining code, especially memory going
      offline and getting unplugged: however, we will properly tear down the
      identity mapping and handle faults gracefully when accessing this memory
      from kcore code.
      
      Note 2: we can race against drivers setting PageOffline() and turning
      memory inaccessible in the hypervisor.  We'll handle this in a follow-up
      patch.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0daa322b
  23. 30 Jun 2021 (1 commit)
  24. 05 Jun 2021 (1 commit)
  25. 27 Feb 2021 (1 commit)
  26. 25 Feb 2021 (1 commit)