- 23 3月, 2022 3 次提交
-
-
由 andrew.yang 提交于
When memory is tight, system may start to compact memory for large continuous memory demands. If one process tries to lock a memory page that is being locked and isolated for compaction, it may wait a long time or even forever. This is because compaction will perform non-atomic PG_Isolated clear while holding page lock, this may overwrite PG_waiters set by the process that can't obtain the page lock and add itself to the waiting queue to wait for the lock to be unlocked. CPU1 CPU2 lock_page(page); (successful) lock_page(); (failed) __ClearPageIsolated(page); SetPageWaiters(page) (may be overwritten) unlock_page(page); The solution is to not perform non-atomic operation on page flags while holding page lock. Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.comSigned-off-by: Nandrew.yang <andrew.yang@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Vlastimil Babka" <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: "William Kucharski" <william.kucharski@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
The page_fixed_fake_head() is used throughout memory management and the conditional check requires checking a global variable, although the overhead of this check may be small, it increases when the memory cache comes under pressure. Also, the global variable will not be modified after system boot, so it is very appropriate to use static key machanism. Link: https://lkml.kernel.org/r/20211101031651.75851-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NBarry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
Patch series "Free the 2nd vmemmap page associated with each HugeTLB page", v7. This series can minimize the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks. The main implementation and details can refer to the commit log of patch 1. In this series, I have changed the following four helpers, the following table shows the impact of the overhead of those helpers. +------------------+-----------------------+ | APIs | head page | tail page | +------------------+-----------+-----------+ | PageHead() | Y | N | +------------------+-----------+-----------+ | PageTail() | Y | N | +------------------+-----------+-----------+ | PageCompound() | N | N | +------------------+-----------+-----------+ | compound_head() | Y | N | +------------------+-----------+-----------+ Y: Overhead is increased. N: Overhead is _NOT_ increased. It shows that the overhead of those helpers on a tail page don't change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off". But the overhead on a head page will be increased when "hugetlb_free_vmemmap=on" (except PageCompound()). So I believe that Matthew Wilcox's folio series will help with this. The users of PageHead() and PageTail() are much less than compound_head() and most users of PageTail() are VM_BUG_ON(), so I have done some tests about the overhead of compound_head() on head pages. I have tested the overhead of calling compound_head() on a head page, which is 2.11ns (Measure the call time of 10 million times compound_head(), and then average). For a head page whose address is not aligned with PAGE_SIZE or a non-compound page, the overhead of compound_head() is 2.54ns which is increased by 20%. For a head page whose address is aligned with PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by 40%. Most pages are the former. I do not think the overhead is significant since the overhead of compound_head() itself is low. This patch (of 5): This patch minimizes the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB (2MB type). After the feature of "Free sonme vmemmap pages of HugeTLB page" is enabled, the mapping of the vmemmap addresses associated with a 2MB HugeTLB page becomes the figure below. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | ----------------^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | 3 | ------------------+ | | | | | | +-----------+ | | | | | | | 4 | --------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | ----------------------+ | | | | +-----------+ | | | | | 6 | ------------------------+ | | | +-----------+ | | | | 7 | --------------------------+ | | +-----------+ | | | | | | +-----------+ As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and remaped. However, the 2nd vmemmap page frame is also can be freed to the buddy allocator, then we can change the mapping from the figure above to the figure below. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | | 2 | -----------------+ | | | | | | | +-----------+ | | | | | | | | 3 | -------------------+ | | | | | | +-----------+ | | | | | | | 4 | ---------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | -----------------------+ | | | | +-----------+ | | | | | 6 | -------------------------+ | | | +-----------+ | | | | 7 | ---------------------------+ | | +-----------+ | | | | | | +-----------+ After we do this, all tail vmemmap pages (1-7) are mapped to the head vmemmap page frame (0). In other words, there are more than one page struct with PG_head associated with each HugeTLB page. We __know__ that there is only one head page struct, the tail page structs with PG_head are fake head page structs. We need an approach to distinguish between those two different types of page structs so that compound_head(), PageHead() and PageTail() can work properly if the parameter is the tail page struct but with PG_head. The following code snippet describes how to distinguish between real and fake head page struct. if (test_bit(PG_head, &page->flags)) { unsigned long head = READ_ONCE(page[1].compound_head); if (head & 1) { if (head == (unsigned long)page + 1) ==> head page struct else ==> tail page struct } else ==> head page struct } We can safely access the field of the @page[1] with PG_head because the @page is a compound page composed with at least two contiguous pages. [songmuchun@bytedance.com: restore lost comment changes] Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NBarry Song <song.bao.hua@hisilicon.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 1月, 2022 2 次提交
-
-
由 Naoya Horiguchi 提交于
After recent soft-offline rework, error pages can be taken off from buddy allocator, but the existing unpoison_memory() does not properly undo the operation. Moreover, due to the recent change on __get_hwpoison_page(), get_page_unless_zero() is hardly called for hwpoisoned pages. So __get_hwpoison_page() highly likely returns -EBUSY (meaning to fail to grab page refcount) and unpoison just clears PG_hwpoison without releasing a refcount. That does not lead to a critical issue like kernel panic, but unpoisoned pages never get back to buddy (leaked permanently), which is not good. To (partially) fix this, we need to identify "taken off" pages from other types of hwpoisoned pages. We can't use refcount or page flags for this purpose, so a pseudo flag is defined by hacking ->private field. Someone might think that put_page() is enough to cancel taken-off pages, but the normal free path contains some operations not suitable for the current purpose, and can fire VM_BUG_ON(). Note that unpoison_memory() is now supposed to be cancel hwpoison events injected only by madvise() or /sys/devices/system/memory/{hard,soft}_offline_page, not by MCE injection, so please don't try to use unpoison when testing with MCE injection. [lkp@intel.com: report build failure for ARCH=i386] Link: https://lkml.kernel.org/r/20211115084006.3728254-4-naoya.horiguchi@linux.devSigned-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: NYang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Ding Hui <dinghui@sangfor.com.cn> Cc: Tony Luck <tony.luck@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Changcheng Deng 提交于
Return statements in functions returning bool should use true/false instead of 1/0. Link: https://lkml.kernel.org/r/20211126073327.74815-1-deng.changcheng@zte.com.cnSigned-off-by: NChangcheng Deng <deng.changcheng@zte.com.cn> Reported-by: NZeal Robot <zealci@zte.com.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 1月, 2022 1 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
All members of struct slab can now be removed from struct page. This shrinks the definition of struct page by 30 LOC, making it easier to understand. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
-
- 03 1月, 2022 1 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
Move the PG_uptodate documentation to be documentation for folio_test_uptodate() and expand on it a little. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
-
- 17 11月, 2021 2 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
There's no need for this predicate; callers can just use !folio_test_large(). Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
-
由 Matthew Wilcox (Oracle) 提交于
This is a better name. Also add kernel-doc. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
-
- 07 11月, 2021 1 次提交
-
-
由 Marco Elver 提交于
PagePoisoned() accesses page->flags which can be updated concurrently: | BUG: KCSAN: data-race in next_uptodate_page / unlock_page | | write (marked) to 0xffffea00050f37c0 of 8 bytes by task 1872 on cpu 1: | instrument_atomic_write include/linux/instrumented.h:87 [inline] | clear_bit_unlock_is_negative_byte include/asm-generic/bitops/instrumented-lock.h:74 [inline] | unlock_page+0x102/0x1b0 mm/filemap.c:1465 | filemap_map_pages+0x6c6/0x890 mm/filemap.c:3057 | ... | read to 0xffffea00050f37c0 of 8 bytes by task 1873 on cpu 0: | PagePoisoned include/linux/page-flags.h:204 [inline] | PageReadahead include/linux/page-flags.h:382 [inline] | next_uptodate_page+0x456/0x830 mm/filemap.c:2975 | ... | CPU: 0 PID: 1873 Comm: systemd-udevd Not tainted 5.11.0-rc4-00001-gf9ce0be7 #1 To avoid the compiler tearing or otherwise optimizing the access, use READ_ONCE() to access flags. Link: https://lore.kernel.org/all/20210826144157.GA26950@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/20210913113542.2658064-1-elver@google.comReported-by: Nkernel test robot <oliver.sang@intel.com> Signed-off-by: NMarco Elver <elver@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NWill Deacon <will@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 11月, 2021 1 次提交
-
-
由 Linus Torvalds 提交于
I had missed a semantic conflict between commit d389a4a8 ("mm: Add folio flag manipulation functions") from the folio tree, and commit eac96c3e ("mm: filemap: check if THP has hwpoisoned subpage for PMD page fault") that added a new set of page flags. My build tests had too many options enabled, which hid this issue. But if you didn't have MEMORY_FAILURE or TRANSPARENT_HUGEPAGE enabled, you'd end up with build errors like this: include/linux/page-flags.h:806:29: error: macro "PAGEFLAG_FALSE" requires 2 arguments, but only 1 given 806 | PAGEFLAG_FALSE(HasHWPoisoned) | ^ due to the missing lowercase name used for folio function naming. Fixes: 49f8275c ("Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache") Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org> Reported-by: NYang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 10月, 2021 1 次提交
-
-
由 Yang Shi 提交于
When handling shmem page fault the THP with corrupted subpage could be PMD mapped if certain conditions are satisfied. But kernel is supposed to send SIGBUS when trying to map hwpoisoned page. There are two paths which may do PMD map: fault around and regular fault. Before commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths") the thing was even worse in fault around path. The THP could be PMD mapped as long as the VMA fits regardless what subpage is accessed and corrupted. After this commit as long as head page is not corrupted the THP could be PMD mapped. In the regular fault path the THP could be PMD mapped as long as the corrupted page is not accessed and the VMA fits. This loophole could be fixed by iterating every subpage to check if any of them is hwpoisoned or not, but it is somewhat costly in page fault path. So introduce a new page flag called HasHWPoisoned on the first tail page. It indicates the THP has hwpoisoned subpage(s). It is set if any subpage of THP is found hwpoisoned by memory failure and after the refcount is bumped successfully, then cleared when the THP is freed or split. The soft offline path doesn't need this since soft offline handler just marks a subpage hwpoisoned when the subpage is migrated successfully. But shmem THP didn't get split then migrated at all. Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com Fixes: 800d8c63 ("shmem: add huge pages support") Signed-off-by: NYang Shi <shy828301@gmail.com> Reviewed-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 10月, 2021 2 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
Rename set_page_writeback() to folio_start_writeback() to match folio_end_writeback(). Do not bother with wrappers that return void; callers are perfectly capable of ignoring return values. Add wrappers for set_page_writeback(), set_page_writeback_keepwrite() and test_set_page_writeback() for compatibililty with existing filesystems. The main advantage of this patch is getting the statistics right, although it does eliminate a couple of calls to compound_head(). Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Howells <dhowells@redhat.com> Acked-by: NVlastimil Babka <vbabka@suse.cz>
-
由 Matthew Wilcox (Oracle) 提交于
test_clear_page_writeback() is actually an mm-internal function, although it's named as if it's a pagecache function. Move it to mm/internal.h, rename it to __folio_end_writeback() and change the return type to bool. The conversion from page to folio is mostly about accounting the number of pages being written back, although it does eliminate a couple of calls to compound_head(). Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Howells <dhowells@redhat.com> Acked-by: NVlastimil Babka <vbabka@suse.cz>
-
- 27 9月, 2021 2 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
These new functions are the folio analogues of the various PageFlags functions. If CONFIG_DEBUG_VM_PGFLAGS is enabled, we check the folio is not a tail page at every invocation. This will also catch the PagePoisoned case as a poisoned page has every bit set, which would include PageTail. This saves 1684 bytes of text with the distro-derived config that I'm testing due to removing a double call to compound_head() in PageSwapCache(). Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NJeff Layton <jlayton@kernel.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com> Reviewed-by: NDavid Howells <dhowells@redhat.com> Acked-by: NMike Rapoport <rppt@linux.ibm.com>
-
由 Matthew Wilcox (Oracle) 提交于
A struct folio is a new abstraction to replace the venerable struct page. A function which takes a struct folio argument declares that it will operate on the entire (possibly compound) page, not just PAGE_SIZE bytes. In return, the caller guarantees that the pointer it is passing does not point to a tail page. No change to generated code. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Acked-by: NJeff Layton <jlayton@kernel.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Howells <dhowells@redhat.com> Acked-by: NMike Rapoport <rppt@linux.ibm.com>
-
- 09 9月, 2021 2 次提交
-
-
由 SeongJae Park 提交于
PG_idle and PG_young allow the two PTE Accessed bit users, Idle Page Tracking and the reclaim logic concurrently work while not interfering with each other. That is, when they need to clear the Accessed bit, they set PG_young to represent the previous state of the bit, respectively. And when they need to read the bit, if the bit is cleared, they further read the PG_young to know whether the other has cleared the bit meanwhile or not. For yet another user of the PTE Accessed bit, we could add another page flag, or extend the mechanism to use the flags. For the DAMON usecase, however, we don't need to do that just yet. IDLE_PAGE_TRACKING and DAMON are mutually exclusive, so there's only ever going to be one user of the current set of flags. In this commit, we split out the CONFIG options to allow for the use of PG_young and PG_idle outside of idle page tracking. In the next commit, DAMON's reference implementation of the virtual memory address space monitoring primitives will use it. [sjpark@amazon.de: set PAGE_EXTENSION for non-64BIT] Link: https://lkml.kernel.org/r/20210806095153.6444-1-sj38.park@gmail.com [akpm@linux-foundation.org: tweak Kconfig text] [sjpark@amazon.de: hide PAGE_IDLE_FLAG from users] Link: https://lkml.kernel.org/r/20210813081238.34705-1-sj38.park@gmail.com Link: https://lkml.kernel.org/r/20210716081449.22187-5-sj38.park@gmail.comSigned-off-by: NSeongJae Park <sjpark@amazon.de> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NFernand Sieber <sieberf@amazon.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
Instead of hard-coding ((1UL << NR_PAGEFLAGS) - 1) everywhere, introducing PAGEFLAGS_MASK to make the code clear to get the page flags. Link: https://lkml.kernel.org/r/20210819150712.59948-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 9月, 2021 1 次提交
-
-
由 Vlastimil Babka 提交于
As another step of shortening irq disabled sections in ___slab_alloc(), delay disabling irqs until we pass the initial checks if there is a cached percpu slab and it's suitable for our allocation. Now we have to recheck c->page after actually disabling irqs as an allocation in irq handler might have replaced it. Because we call pfmemalloc_match() as one of the checks, we might hit VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe() variant that lacks the PageSlab check. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMel Gorman <mgorman@techsingularity.net>
-
- 02 8月, 2021 1 次提交
-
-
由 Marc Zyngier 提交于
Now that arm64 has stopped using kvm_is_transparent_hugepage(), we can remove it, as well as PageTransCompoundMap() which was only used by the former. Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NMarc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210726153552.1535838-5-maz@kernel.org
-
- 01 7月, 2021 2 次提交
-
-
由 David Hildenbrand 提交于
A driver might set a page logically offline -- PageOffline() -- and turn the page inaccessible in the hypervisor; after that, access to page content can be fatal. One example is virtio-mem; while unplugged memory -- marked as PageOffline() can currently be read in the hypervisor, this will no longer be the case in the future; for example, when having a virtio-mem device backed by huge pages in the hypervisor. Some special PFN walkers -- i.e., /proc/kcore -- read content of random pages after checking PageOffline(); however, these PFN walkers can race with drivers that set PageOffline(). Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing. page_offline_freeze()/page_offline_thaw() allows for a subsystem to synchronize with such drivers, achieving that a page cannot be set PageOffline() while frozen. page_offline_begin()/page_offline_end() is used by drivers that care about such races when setting a page PageOffline(). For simplicity, use a rwsem for now; neither drivers nor users are performance sensitive. Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMike Rapoport <rppt@linux.ibm.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Roman Gushchin <guro@fb.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Steven Price <steven.price@arm.com> Cc: Wei Liu <wei.liu@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Hildenbrand 提交于
Let's avoid reading: 1) Offline memory sections: the content of offline memory sections is stale as the memory is effectively unused by the kernel. On s390x with standby memory, offline memory sections (belonging to offline storage increments) are not accessible. With virtio-mem and the hyper-v balloon, we can have unavailable memory chunks that should not be accessed inside offline memory sections. Last but not least, offline memory sections might contain hwpoisoned pages which we can no longer identify because the memmap is stale. 2) PG_offline pages: logically offline pages that are documented as "The content of these pages is effectively stale. Such pages should not be touched (read/write/dump/save) except by their owner.". Examples include pages inflated in a balloon or unavailble memory ranges inside hotplugged memory sections with virtio-mem or the hyper-v balloon. 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal. As documented: "Accessing is not safe since it may cause another machine check. Don't touch!" Introduce is_page_hwpoison(), adding a comment that it is inherently racy but best we can really do. Reading /proc/kcore now performs similar checks as when reading /proc/vmcore for kdump via makedumpfile: problematic pages are exclude. It's also similar to hibernation code, however, we don't skip hwpoisoned pages when processing pages in kernel/power/snapshot.c:saveable_page() yet. Note 1: we can race against memory offlining code, especially memory going offline and getting unplugged: however, we will properly tear down the identity mapping and handle faults gracefully when accessing this memory from kcore code. Note 2: we can race against drivers setting PageOffline() and turning memory inaccessible in the hypervisor. We'll handle this in a follow-up patch. Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NMike Rapoport <rppt@linux.ibm.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Roman Gushchin <guro@fb.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Steven Price <steven.price@arm.com> Cc: Wei Liu <wei.liu@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 6月, 2021 1 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
If you pass a const pointer to compound_head(), you get a const pointer back; if you pass a mutable pointer, you get a mutable pointer back. Also remove an unnecessary forward definition of struct page; we're about to dereference page->compound_head, so it must already have been defined. Link: https://lkml.kernel.org/r/20210416231531.2521383-5-willy@infradead.orgSigned-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 6月, 2021 1 次提交
-
-
由 Peter Collingbourne 提交于
Poisoning freed pages protects against kernel use-after-free. The likelihood of such a bug involving kernel pages is significantly higher than that for user pages. At the same time, poisoning freed pages can impose a significant performance cost, which cannot always be justified for user pages given the lower probability of finding a bug. Therefore, disable freed user page poisoning when using HW tags. We identify "user" pages via the flag set GFP_HIGHUSER_MOVABLE, which indicates a strong likelihood of not being directly accessible to the kernel. Signed-off-by: NPeter Collingbourne <pcc@google.com> Reviewed-by: NAndrey Konovalov <andreyknvl@gmail.com> Link: https://linux-review.googlesource.com/id/I716846e2de8ef179f44e835770df7e6307be96c9 Link: https://lore.kernel.org/r/20210602235230.3928842-5-pcc@google.comSigned-off-by: NWill Deacon <will@kernel.org>
-
- 27 2月, 2021 1 次提交
-
-
由 Guo Ren 提交于
The "If" was wrongly spelled as "It". Link: https://lkml.kernel.org/r/1608959036-91409-1-git-send-email-guoren@kernel.orgSigned-off-by: NGuo Ren <guoren@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Steven Price <steven.price@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 2月, 2021 1 次提交
-
-
由 Mike Kravetz 提交于
Use the new hugetlb page specific flag HPageMigratable to replace the page_huge_active interfaces. By it's name, page_huge_active implied that a huge page was on the active list. However, that is not really what code checking the flag wanted to know. It really wanted to determine if the huge page could be migrated. This happens when the page is actually added to the page cache and/or task page table. This is the reasoning behind the name change. The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not really necessary as we KNOW the page is a hugetlb page. Therefore, they are removed. The routine page_huge_active checked for PageHeadHuge before testing the active bit. This is unnecessary in the case where we hold a reference or lock and know it is a hugetlb head page. page_huge_active is also called without holding a reference or lock (scan_movable_pages), and can race with code freeing the page. The extra check in page_huge_active shortened the race window, but did not prevent the race. Offline code calling scan_movable_pages already deals with these races, so removing the check is acceptable. Add comment to racy code. [songmuchun@bytedance.com: remove set_page_huge_active() declaration from include/linux/hugetlb.h] Link: https://lkml.kernel.org/r/CAMZfGtUda+KoAZscU0718TN61cSFwp4zy=y2oZ=+6Z2TAZZwng@mail.gmail.com Link: https://lkml.kernel.org/r/20210122195231.324857-3-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 12月, 2020 3 次提交
-
-
由 Alex Shi 提交于
Currently lru_lock still guards both lru list and page's lru bit, that's ok. but if we want to use specific lruvec lock on the page, we need to pin down the page's lruvec/memcg during locking. Just taking lruvec lock first may be undermined by the page's memcg charge/migration. To fix this problem, we will clear the lru bit out of locking and use it as pin down action to block the page isolation in memcg changing. So now a standard steps of page isolation is following: 1, get_page(); #pin the page avoid to be free 2, TestClearPageLRU(); #block other isolation like memcg change 3, spin_lock on lru_lock; #serialize lru list access 4, delete page from lru list; This patch start with the first part: TestClearPageLRU, which combines PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This function will be used as page isolation precondition to prevent other isolations some where else. Then there are may !PageLRU page on lru list, need to remove BUG() checking accordingly. There 2 rules for lru bit now: 1, the lru bit still indicate if a page on lru list, just in some temporary moment(isolating), the page may have no lru bit when it's on lru list. but the page still must be on lru list when the lru bit set. 2, have to remove lru bit before delete it from lru list. As Andrew Morton mentioned this change would dirty cacheline for a page which isn't on the LRU. But the loss would be acceptable in Rong Chen <rong.a.chen@intel.com> report: https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/ Link: https://lkml.kernel.org/r/1604566549-62481-15-git-send-email-alex.shi@linux.alibaba.comSuggested-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
We haven't had 'dontuse' flags since 2002. Replace this obsolete warning with a hopefully more useful one. Link: https://lkml.kernel.org/r/20201027025823.3704-1-willy@infradead.orgSigned-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miaohe Lin 提交于
They are not used anymore. Link: https://lkml.kernel.org/r/20201009135914.64826-1-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 12月, 2020 1 次提交
-
-
由 Roman Gushchin 提交于
PageKmemcg flag is currently defined as a page type (like buddy, offline, table and guard). Semantically it means that the page was accounted as a kernel memory by the page allocator and has to be uncharged on the release. As a side effect of defining the flag as a page type, the accounted page can't be mapped to userspace (look at page_has_type() and comments above). In particular, this blocks the accounting of vmalloc-backed memory used by some bpf maps, because these maps do map the memory to userspace. One option is to fix it by complicating the access to page->mapcount, which provides some free bits for page->page_type. But it's way better to move this flag into page->memcg_data flags. Indeed, the flag makes no sense without enabled memory cgroups and memory cgroup pointer set in particular. This commit replaces PageKmemcg() and __SetPageKmemcg() with PageMemcgKmem() and an open-coded OR operation setting the memcg pointer with the MEMCG_DATA_KMEM bit. __ClearPageKmemcg() can be simple deleted, as the whole memcg_data is zeroed at once. As a bonus, on !CONFIG_MEMCG build the PageMemcgKmem() check will be compiled out. Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-5-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-5-guro@fb.com
-
- 17 10月, 2020 2 次提交
-
-
由 Oscar Salvador 提交于
This patch changes the way we set and handle in-use poisoned pages. Until now, poisoned pages were released to the buddy allocator, trusting that the checks that take place at allocation time would act as a safe net and would skip that page. This has proved to be wrong, as we got some pfn walkers out there, like compaction, that all they care is the page to be in a buddy freelist. Although this might not be the only user, having poisoned pages in the buddy allocator seems a bad idea as we should only have free pages that are ready and meant to be used as such. Before explaining the taken approach, let us break down the kind of pages we can soft offline. - Anonymous THP (after the split, they end up being 4K pages) - Hugetlb - Order-0 pages (that can be either migrated or invalited) * Normal pages (order-0 and anon-THP) - If they are clean and unmapped page cache pages, we invalidate then by means of invalidate_inode_page(). - If they are mapped/dirty, we do the isolate-and-migrate dance. Either way, do not call put_page directly from those paths. Instead, we keep the page and send it to page_handle_poison to perform the right handling. page_handle_poison sets the HWPoison flag and does the last put_page. Down the chain, we placed a check for HWPoison page in free_pages_prepare, that just skips any poisoned page, so those pages do not end up in any pcplist/freelist. After that, we set the refcount on the page to 1 and we increment the poisoned pages counter. If we see that the check in free_pages_prepare creates trouble, we can always do what we do for free pages: - wait until the page hits buddy's freelists - take it off, and flag it The downside of the above approach is that we could race with an allocation, so by the time we want to take the page off the buddy, the page has been already allocated so we cannot soft offline it. But the user could always retry it. * Hugetlb pages - We isolate-and-migrate them After the migration has been successful, we call dissolve_free_huge_page, and we set HWPoison on the page if we succeed. Hugetlb has a slightly different handling though. While for non-hugetlb pages we cared about closing the race with an allocation, doing so for hugetlb pages requires quite some additional and intrusive code (we would need to hook in free_huge_page and some other places). So I decided to not make the code overly complicated and just fail normally if the page we allocated in the meantime. We can always build on top of this. As a bonus, because of the way we handle now in-use pages, we no longer need the put-as-isolation-migratetype dance, that was guarding for poisoned pages to end up in pcplists. Signed-off-by: NOscar Salvador <osalvador@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.deSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oscar Salvador 提交于
When trying to soft-offline a free page, we need to first take it off the buddy allocator. Once we know is out of reach, we can safely flag it as poisoned. take_page_off_buddy will be used to take a page meant to be poisoned off the buddy allocator. take_page_off_buddy calls break_down_buddy_pages, which splits a higher-order page in case our page belongs to one. Once the page is under our control, we call page_handle_poison to set it as poisoned and grab a refcount on it. Signed-off-by: NOscar Salvador <osalvador@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.deSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 10月, 2020 2 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
Introduce the new page policy of PF_SECOND which lets us use the normal pageflags generation machinery to create the various DoubleMap manipulation functions. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NZi Yan <ziy@nvidia.com> Link: https://lkml.kernel.org/r/20200629151933.15671-3-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
Patch series "Fix PageDoubleMap". This is a purely theoretical problem for now as none of the filesystems which use PG_private_2 (ie PG_fscache) are being converted at this time, but it's confusing to leave it like this. This patch (of 2): PG_private_2 is defined as being PF_ANY (applicable to tail pages as well as regular & head pages). That means that the first tail page of a double-map page will appear to have Private2 set. Use the Workingset bit instead which is defined as PF_HEAD so any attempt to access the Workingset bit on a tail page will redirect to the head page's Workingset bit. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NZi Yan <ziy@nvidia.com> Link: https://lkml.kernel.org/r/20200629151933.15671-1-willy@infradead.org Link: https://lkml.kernel.org/r/20200629151933.15671-2-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 9月, 2020 1 次提交
-
-
由 Steven Price 提交于
For arm64 MTE support it is necessary to be able to mark pages that contain user space visible tags that will need to be saved/restored e.g. when swapped out. To support this add a new arch specific flag (PG_arch_2). This flag is only available on 64-bit architectures due to the limited number of spare page flags on the 32-bit ones. Signed-off-by: NSteven Price <steven.price@arm.com> [catalin.marinas@arm.com: use CONFIG_64BIT for guarding this new flag] Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org>
-
- 05 6月, 2020 1 次提交
-
-
由 David Hildenbrand 提交于
virtio-mem wants to allow to offline memory blocks of which some parts were unplugged (allocated via alloc_contig_range()), especially, to later offline and remove completely unplugged memory blocks. The important part is that PageOffline() has to remain set until the section is offline, so these pages will never get accessed (e.g., when dumping). The pages should not be handed back to the buddy (which would require clearing PageOffline() and result in issues if offlining fails and the pages are suddenly in the buddy). Let's allow to do that by allowing to isolate any PageOffline() page when offlining. This way, we can reach the memory hotplug notifier MEM_GOING_OFFLINE, where the driver can signal that he is fine with offlining this page by dropping its reference count. PageOffline() pages with a reference count of 0 can then be skipped when offlining the pages (like if they were free, however they are not in the buddy). Anybody who uses PageOffline() pages and does not agree to offline them (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not decrement the reference count and make offlining fail when trying to migrate such an unmovable page. So there should be no observable change. Same applies to balloon compaction users (movable PageOffline() pages), the pages will simply be migrated. Note 1: If offlining fails, a driver has to increment the reference count again in MEM_CANCEL_OFFLINE. Note 2: A driver that makes use of this has to be aware that re-onlining the memory block has to be handled by hooking into onlining code (online_page_callback_t), resetting the page PageOffline() and not giving them to the buddy. Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Qian Cai <cai@lca.pw> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: NDavid Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
- 08 4月, 2020 2 次提交
-
-
由 Alexander Duyck 提交于
In order to pave the way for free page reporting in virtualized environments we will need a way to get pages out of the free lists and identify those pages after they have been returned. To accomplish this, this patch adds the concept of a Reported Buddy, which is essentially meant to just be the Uptodate flag used in conjunction with the Buddy page type. To prevent the reported pages from leaking outside of the buddy lists I added a check to clear the PageReported bit in the del_page_from_free_list function. As a result any reported page that is split, merged, or allocated will have the flag cleared prior to the PageBuddy value being cleared. The process for reporting pages is fairly simple. Once we free a page that meets the minimum order for page reporting we will schedule a worker thread to start 2s or more in the future. That worker thread will begin working from the lowest supported page reporting order up to MAX_ORDER - 1 pulling unreported pages from the free list and storing them in the scatterlist. When processing each individual free list it is necessary for the worker thread to release the zone lock when it needs to stop and report the full scatterlist of pages. To reduce the work of the next iteration the worker thread will rotate the free list so that the first unreported page in the free list becomes the first entry in the list. It will then call a reporting function providing information on how many entries are in the scatterlist. Once the function completes it will return the pages to the free area from which they were allocated and start over pulling more pages from the free areas until there are no longer enough pages to report on to keep the worker busy, or we have processed as many pages as were contained in the free area when we started processing the list. The worker thread will work in a round-robin fashion making its way though each zone requesting reporting, and through each reportable free list within that zone. Once all free areas within the zone have been processed it will check to see if there have been any requests for reporting while it was processing. If so it will reschedule the worker thread to start up again in roughly 2s and exit. Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
Some comments for MADV_FREE is revised and added to help people understand the MADV_FREE code, especially the page flag, PG_swapbacked. This makes page_is_file_cache() isn't consistent with its comments. So the function is renamed to page_is_file_lru() to make them consistent again. All these are put in one patch as one logical change. Suggested-by: NDavid Hildenbrand <david@redhat.com> Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Suggested-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: N"Huang, Ying" <ying.huang@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Hocko <mhocko@kernel.org> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 3月, 2020 1 次提交
-
-
由 Qian Cai 提交于
Commit bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped out") supported writing THP to a swap device but forgot to upgrade an older commit df8c94d1 ("page-flags: define behavior of FS/IO-related flags on compound pages") which could trigger a crash during THP swapping out with DEBUG_VM_PGFLAGS=y, kernel BUG at include/linux/page-flags.h:317! page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) end_swap_bio_write() SetPageError(page) VM_BUG_ON_PAGE(1 && PageCompound(page)) <IRQ> bio_endio+0x297/0x560 dec_pending+0x218/0x430 [dm_mod] clone_endio+0xe4/0x2c0 [dm_mod] bio_endio+0x297/0x560 blk_update_request+0x201/0x920 scsi_end_request+0x6b/0x4b0 scsi_io_completion+0x509/0x7e0 scsi_finish_command+0x1ed/0x2a0 scsi_softirq_done+0x1c9/0x1d0 __blk_mqnterrupt+0xf/0x20 </IRQ> Fix by checking PF_NO_TAIL in those places instead. Fixes: bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped out") Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: N"Huang, Ying" <ying.huang@intel.com> Acked-by: NRafael Aquini <aquini@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200310235846.1319-1-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 11月, 2019 1 次提交
-
-
由 Yang Shi 提交于
We have a usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below: 7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB cat /sys/kernel/debug/kvm/largepages 12 And some benchmarks do worse than with anonymous THPs. By digging into the code we figured out that commit 127393fb ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But the _mapcount < 0 check doesn't work for page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry. So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time. Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate. With this fix /sys/kernel/debug/kvm/largepage would show reasonable pages are PMD mapped by EPT as the below: 7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 557056 kB [snip] Locked: 0 kB cat /sys/kernel/debug/kvm/largepages 271 And the benchmarks are as same as anonymous THPs. [yang.shi@linux.alibaba.com: v4] Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com Fixes: dd78fedd ("rmap: support file thp") Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reported-by: NGang Deng <gavin.dg@linux.alibaba.com> Tested-by: NGang Deng <gavin.dg@linux.alibaba.com> Suggested-by: NHugh Dickins <hughd@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-