- 11 4月, 2020 2 次提交
-
-
由 Arjun Roy 提交于
Add the ability to insert multiple pages at once to a user VM with lower PTE spinlock operations. The intention of this patch-set is to reduce atomic ops for tcp zerocopy receives, which normally hits the same spinlock multiple times consecutively. [akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument] [arjunroy@google.com: add missing page_count() check to vm_insert_pages()] Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com [arjunroy@google.com: vm_insert_pages() checks if pte_index defined] Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.comSigned-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arjun Roy 提交于
Add helper methods for vm_insert_page()/insert_page() to prepare for vm_insert_pages(), which batch-inserts pages to reduce spinlock operations when inserting multiple consecutive pages into the user page table. The intention of this patch-set is to reduce atomic ops for tcp zerocopy receives, which normally hits the same spinlock multiple times consecutively. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200128025958.43490-1-arjunroy.kdev@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 4月, 2020 7 次提交
-
-
由 chenqiwu 提交于
The parameter of remap_pfn_range() @pfn passed from the caller is actually a page-frame number converted by corresponding physical address of kernel memory, the original comment is ambiguous that may mislead the users. Meanwhile, there is an ambiguous typo "VMM" in the comment of vm_area_struct. So fixing them will make the code more readable. Signed-off-by: Nchenqiwu <chenqiwu@xiaomi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
For either swap and page migration, we all use the bit 2 of the entry to identify whether this entry is uffd write-protected. It plays a similar role as the existing soft dirty bit in swap entries but only for keeping the uffd-wp tracking for a specific PTE/PMD. Something special here is that when we want to recover the uffd-wp bit from a swap/migration entry to the PTE bit we'll also need to take care of the _PAGE_RW bit and make sure it's cleared, otherwise even with the _PAGE_UFFD_WP bit we can't trap it at all. In change_pte_range() we do nothing for uffd if the PTE is a swap entry. That can lead to data mismatch if the page that we are going to write protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch also applies/removes the uffd-wp bit even for the swap entries. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
UFFD_EVENT_FORK support for uffd-wp should be already there, except that we should clean the uffd-wp bit if uffd fork event is not enabled. Detect that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked by VM_UFFD_WP. Do this for both small PTEs and huge PMDs. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJerome Glisse <jglisse@redhat.com> Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for change_protection() when used with uffd-wp and make sure the two new flags are exclusively used. Then, - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW when a range of memory is write protected by uffd - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover _PAGE_RW when write protection is resolved from userspace And use this new interface in mwriteprotect_range() to replace the old MM_CP_DIRTY_ACCT. Do this change for both PTEs and huge PMDs. Then we can start to identify which PTE/PMD is write protected by general (e.g., COW or soft dirty tracking), and which is for userfaultfd-wp. Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we can be even more strict when detecting uffd-wp page faults in either do_wp_page() or wp_huge_pmd(). After we're with _PAGE_UFFD_WP, a special case is when a page is both protected by the general COW logic and also userfault-wp. Here the userfault-wp will have higher priority and will be handled first. Only after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle the general COW. These are the steps on what will happen with such a page: 1. CPU accesses write protected shared page (so both protected by general COW and uffd-wp), blocked by uffd-wp first because in do_wp_page we'll handle uffd-wp first, so it has higher priority than general COW. 2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT to remove the uffd-wp bit upon the PTE/PMD. However here we still keep the write bit cleared. Notify the blocked CPU. 3. The blocked CPU resumes the page fault process with a fault retry, during retry it'll notice it was not with the uffd-wp bit this time but it is still write protected by general COW, then it'll go though the COW path in the fault handler, copy the page, apply write bit where necessary, and retry again. 4. The CPU will be able to access this page with write bit set. Suggested-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Martin Cracauer <cracauer@cons.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
There are several cases write protection fault happens. It could be a write to zero page, swaped page or userfault write protected page. When the fault happens, there is no way to know if userfault write protect the page before. Here we just blindly issue a userfault notification for vma with VM_UFFD_WP regardless if app write protects it yet. Application should be ready to handle such wp fault. In the swapin case, always swapin as readonly. This will cause false positive userfaults. We need to decide later if to eliminate them with a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY). hugetlbfs wouldn't need to worry about swapouts but and tmpfs would be handled by a swap entry bit like anonymous memory. The main problem with no easy solution to eliminate the false positives, will be if/when userfaultfd is extended to real filesystem pagecache. When the pagecache is freed by reclaim we can't leave the radix tree pinned if the inode and in turn the radix tree is reclaimed as well. The estimation is that full accuracy and lack of false positives could be easily provided only to anonymous memory (as long as there's no fork or as long as MADV_DONTFORK is used on the userfaultfd anonymous range) tmpfs and hugetlbfs, it's most certainly worth to achieve it but in a later incremental patch. [peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page] Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: NJerome Glisse <jglisse@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
Commit e496cf3d ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE") notes that it should be reverted when the PowerPC problem was fixed. The commit fixing the PowerPC problem (953c66c2) did not revert the commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an oversight, so remove the Kconfig symbol and undo the work of commit e496cf3d. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Anshuman Khandual 提交于
Lets move vma_is_accessible() helper to include/linux/mm.h which makes it available for general use. While here, this replaces all remaining open encodings for VMA access check with vma_is_accessible(). Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Acked-by: NGuo Ren <guoren@kernel.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paulburton@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 4月, 2020 2 次提交
-
-
由 Wang Wenhu 提交于
The param "start" actually referes to the physical memory start, which is to be mapped into virtual area vma. And it is the field vma->vm_start which stands for the start of the area. Most of the time, we do not read through whole implementation of a function but only the definition and essential comments. Accurate comments are definitely the base stone. Signed-off-by: NWang Wenhu <wenhu.wang@vivo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200318052206.105104-1-wenhu.wang@vivo.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 WANG Wenhu 提交于
It really made me scratch my head. Replace the comment with an accurate and consistent description. The parameter pfn actually refers to the page frame number which is right-shifted by PAGE_SHIFT from the physical address. Signed-off-by: NWANG Wenhu <wenhu.wang@vivo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200310073955.43415-1-wenhu.wang@vivo.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 3月, 2020 1 次提交
-
-
The functions wp_huge_pmd() and wp_huge_pud() currently relies on the huge_fault() callback to split huge page table entries if needed. However for module users that requires export of the split_huge_xxx() functionality which may be undesired. Instead split pre-existing huge page-table entries on VM_FAULT_FALLBACK return. We currently only do COW and write-notify on the PTE level, so if the huge_fault() handler returns VM_FAULT_FALLBACK on wp faults, split the huge pages and page-table entries. Also do this for huge PUDs if there is no huge_fault() handler and the vma is not anonymous, similar to how it's done for PMDs. Note that fs/dax.c still does the splitting in the huge_fault() handler, but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd() if needed. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NThomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: NChristian König <christian.koenig@amd.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 06 3月, 2020 1 次提交
-
-
由 Kirill A. Shutemov 提交于
Jeff Moyer has reported that one of xfstests triggers a warning when run on DAX-enabled filesystem: WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50 ... wp_page_copy+0x98c/0xd50 (unreliable) do_wp_page+0xd8/0xad0 __handle_mm_fault+0x748/0x1b90 handle_mm_fault+0x120/0x1f0 __do_page_fault+0x240/0xd70 do_page_fault+0x38/0xd0 handle_page_fault+0x10/0x30 The warning happens on failed __copy_from_user_inatomic() which tries to copy data into a CoW page. This happens because of race between MADV_DONTNEED and CoW page fault: CPU0 CPU1 handle_mm_fault() do_wp_page() wp_page_copy() do_wp_page() madvise(MADV_DONTNEED) zap_page_range() zap_pte_range() ptep_get_and_clear_full() <TLB flush> __copy_from_user_inatomic() sees empty PTE and fails WARN_ON_ONCE(1) clear_page() The solution is to re-try __copy_from_user_inatomic() under PTL after checking that PTE is matches the orig_pte. The second copy attempt can still fail, like due to non-readable PTE, but there's nothing reasonable we can do about, except clearing the CoW page. Reported-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NJeff Moyer <jmoyer@redhat.com> Cc: <stable@vger.kernel.org> Cc: Justin He <Justin.He@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 1月, 2020 2 次提交
-
-
由 Thomas Hellstrom 提交于
TTM graphics buffer objects may, transparently to user-space, move between IO and system memory. When that happens, all PTEs pointing to the old location are zapped before the move and then faulted in again if needed. When that happens, the page protection caching mode- and encryption bits may change and be different from those of struct vm_area_struct::vm_page_prot. We were using an ugly hack to set the page protection correctly. Fix that and instead export and use vmf_insert_mixed_prot() or use vmf_insert_pfn_prot(). Also get the default page protection from struct vm_area_struct::vm_page_prot rather than using vm_get_page_prot(). This way we catch modifications done by the vm system for drivers that want write-notification. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Signed-off-by: NThomas Hellstrom <thellstrom@vmware.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Thomas Hellstrom 提交于
The TTM module today uses a hack to be able to set a different page protection than struct vm_area_struct::vm_page_prot. To be able to do this properly, add the needed vm functionality as vmf_insert_mixed_prot(). Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Signed-off-by: NThomas Hellstrom <thellstrom@vmware.com> Acked-by: NChristian König <christian.koenig@amd.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 18 12月, 2019 1 次提交
-
-
由 Daniel Axtens 提交于
apply_to_page_range() takes an address range, and if any parts of it are not covered by the existing page table hierarchy, it allocates memory to fill them in. In some use cases, this is not what we want - we want to be able to operate exclusively on PTEs that are already in the tables. Add apply_to_existing_page_range() for this. Adjust the walker functions for apply_to_page_range to take 'create', which switches them between the old and new modes. This will be used in KASAN vmalloc. [akpm@linux-foundation.org: reduce code duplication] [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/] [akpm@linux-foundation.org: initialize __apply_to_page_range::err] Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.netSigned-off-by: NDaniel Axtens <dja@axtens.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 12月, 2019 1 次提交
-
-
由 Thomas Gleixner 提交于
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same functionality which today depends on CONFIG_PREEMPT. Switch the pte_unmap_same() and SLUB code over to use CONFIG_PREEMPTION. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NChistoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20191015191821.11479-26-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 12月, 2019 2 次提交
-
-
由 Mike Rapoport 提交于
There are no architectures that use include/asm-generic/4level-fixup.h therefore it can be removed along with __ARCH_HAS_4LEVEL_HACK define. Link: http://lkml.kernel.org/r/1572938135-31886-14-git-send-email-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com> Cc: Anatoly Pugachev <matorola@gmail.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Peter Rosin <peda@axentia.se> Cc: Richard Weinberger <richard@nod.at> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Sam Creasey <sammy@sammy.net> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yu Zhao 提交于
For hugely mapped thp, we use is_huge_zero_pmd() to check if it's zero page or not. We do fill ptes with my_zero_pfn() when we split zero thp pmd, but this is not what we have in vm_normal_page_pmd() -- pmd_trans_huge_lock() makes sure of it. This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody complains about it. Gerald Schaefer asked: : Maybe the description could also mention the symptom of this bug? : I would assume that it affects anon/dirty accounting in gather_pte_stats(), : for huge mappings, if zero page mappings are not correctly recognized. I came across this while I was looking at the code, so I'm not aware of any symptom. Link: http://lkml.kernel.org/r/20191108192629.201556-1-yuzhao@google.comSigned-off-by: NYu Zhao <yuzhao@google.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 12月, 2019 1 次提交
-
-
由 Wei Yang 提交于
There are several places emphasise the effect of __SetPageUptodate(), while the comment seems to have a typo in two places. Link: http://lkml.kernel.org/r/20190926023705.7226-1-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 12月, 2019 4 次提交
-
-
由 Thomas Hellstrom 提交于
A huge pud page can theoretically be faulted in racing with pmd_alloc() in __handle_mm_fault(). That will lead to pmd_alloc() returning an invalid pmd pointer. Fix this by adding a pud_trans_unstable() function similar to pmd_trans_unstable() and check whether the pud is really stable before using the pmd pointer. Race: Thread 1: Thread 2: Comment create_huge_pud() Fallback - not taken. create_huge_pud() Taken. pmd_alloc() Returns an invalid pointer. This will result in user-visible huge page data corruption. Note that this was caught during a code audit rather than a real experienced problem. It looks to me like the only implementation that currently creates huge pud pagetable entries is dev_dax_huge_fault() which doesn't appear to care much about private (COW) mappings or write-tracking which is, I believe, a prerequisite for create_huge_pud() falling back on thread 1, but not in thread 2. Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org Fixes: a00cc7d9 ("mm, x86: add support for PUD-sized transparent hugepages") Signed-off-by: NThomas Hellstrom <thellstrom@vmware.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joel Fernandes (Google) 提交于
When a process updates the RSS of a different process, the rss_stat tracepoint appears in the context of the process doing the update. This can confuse userspace that the RSS of process doing the update is updated, while in reality a different process's RSS was updated. This issue happens in reclaim paths such as with direct reclaim or background reclaim. This patch adds more information to the tracepoint about whether the mm being updated belongs to the current process's context (curr field). We also include a hash of the mm pointer so that the process who the mm belongs to can be uniquely identified (mm_id field). Also vsprintf.c is refactored a bit to allow reuse of hashing code. [akpm@linux-foundation.org: remove unused local `str'] [joelaf@google.com: inline call to ptr_to_hashval] Link: http://lore.kernel.org/r/20191113153816.14b95acd@gandalf.local.home Link: http://lkml.kernel.org/r/20191114164622.GC233237@google.com Link: http://lkml.kernel.org/r/20191106024452.81923-1-joel@joelfernandes.orgSigned-off-by: NJoel Fernandes (Google) <joel@joelfernandes.org> Reported-by: NIoannis Ilkos <ilkos@google.com> Acked-by: Petr Mladek <pmladek@suse.com> [lib/vsprintf.c] Cc: Tim Murray <timmurray@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Carmen Jackson <carmenjackson@google.com> Cc: Mayank Gupta <mayankgupta@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Minchan Kim <minchan@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joel Fernandes (Google) 提交于
Useful to track how RSS is changing per TGID to detect spikes in RSS and memory hogs. Several Android teams have been using this patch in various kernel trees for half a year now. Many reported to me it is really useful so I'm posting it upstream. Initial patch developed by Tim Murray. Changes I made from original patch: o Prevent any additional space consumed by mm_struct. Regarding the fact that the RSS may change too often thus flooding the traces - note that, there is some "hysterisis" with this already. That is - We update the counter only if we receive 64 page faults due to SPLIT_RSS_ACCOUNTING. However, during zapping or copying of pte range, the RSS is updated immediately which can become noisy/flooding. In a previous discussion, we agreed that BPF or ftrace can be used to rate limit the signal if this becomes an issue. Also note that I added wrappers to trace_rss_stat to prevent compiler errors where linux/mm.h is included from tracing code, causing errors such as: CC kernel/trace/power-traces.o In file included from ./include/trace/define_trace.h:102, from ./include/trace/events/kmem.h:342, from ./include/linux/mm.h:31, from ./include/linux/ring_buffer.h:5, from ./include/linux/trace_events.h:6, from ./include/trace/events/power.h:12, from kernel/trace/power-traces.c:15: ./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type struct trace_entry ent; \ Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.orgCo-developed-by: NTim Murray <timmurray@google.com> Signed-off-by: NTim Murray <timmurray@google.com> Signed-off-by: NJoel Fernandes (Google) <joel@joelfernandes.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Carmen Jackson <carmenjackson@google.com> Cc: Mayank Gupta <mayankgupta@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Minchan Kim <minchan@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
One of our services is observing hanging ps/top/etc under heavy write IO, and the task states show this is an mmap_sem priority inversion: A write fault is holding the mmap_sem in read-mode and waiting for (heavily cgroup-limited) IO in balance_dirty_pages(): balance_dirty_pages+0x724/0x905 balance_dirty_pages_ratelimited+0x254/0x390 fault_dirty_shared_page.isra.96+0x4a/0x90 do_wp_page+0x33e/0x400 __handle_mm_fault+0x6f0/0xfa0 handle_mm_fault+0xe4/0x200 __do_page_fault+0x22b/0x4a0 page_fault+0x45/0x50 Somebody tries to change the address space, contending for the mmap_sem in write-mode: call_rwsem_down_write_failed_killable+0x13/0x20 do_mprotect_pkey+0xa8/0x330 SyS_mprotect+0xf/0x20 do_syscall_64+0x5b/0x100 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 The waiting writer locks out all subsequent readers to avoid lock starvation, and several threads can be seen hanging like this: call_rwsem_down_read_failed+0x14/0x30 proc_pid_cmdline_read+0xa0/0x480 __vfs_read+0x23/0x140 vfs_read+0x87/0x130 SyS_read+0x42/0x90 do_syscall_64+0x5b/0x100 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 To fix this, do what we do for cache read faults already: drop the mmap_sem before calling into anything IO bound, in this case the balance_dirty_pages() function, and return VM_FAULT_RETRY. Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 10月, 2019 1 次提交
-
-
由 Jia He 提交于
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page. To reproduce the bug, the cmd is as follows after you deployed everything: make -C src/test/vmmalloc_fork/ TEST_TIME=60m check Below call trace is from arm64 do_page_fault for debugging purpose: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003, pmd=000000023d4b3003, pte=360000298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung. Also, the parameter is changed because vmf should be passed to cow_user_page() Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error in case there can be some obscure use-case (by Kirill). [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_forkSigned-off-by: NJia He <justin.he@arm.com> Reported-by: NYibo Cai <Yibo.Cai@arm.com> Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
-
- 25 9月, 2019 3 次提交
-
-
由 Kefeng Wang 提交于
Using %px to show the actual address in print_bad_pte() to help us to debug issue. Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.comSigned-off-by: NKefeng Wang <wangkefeng.wang@huawei.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
In our testing (camera recording), Miguel and Wei found unmap_page_range() takes above 6ms with preemption disabled easily. When I see that, the reason is it holds page table spinlock during entire 512 page operation in a PMD. 6.2ms is never trivial for user experince if RT task couldn't run in the time because it could make frame drop or glitch audio problem. I had a time to benchmark it via adding some trace_printk hooks between pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing device is 2018 premium mobile device. I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the task runs on little core even though it doesn't have any IPI and LRU lock contention. It's already too heavy. If I remove activate_page, 35-40% overhead of zap_pte_range is gone so most of overhead(about 0.7ms) comes from activate_page via mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could accumulate up to several ms. So this patch adds a check for need_resched() in the loop, and a preemption point if necessary. Link: http://lkml.kernel.org/r/20190731061440.GC155569@google.comSigned-off-by: NMinchan Kim <minchan@kernel.org> Reported-by: NMiguel de Dios <migueldedios@google.com> Reported-by: NWei Wang <wvw@google.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
Since ptent will not be changed after previous assignment of entry, it is not necessary to do the assignment again. Link: http://lkml.kernel.org/r/20190708082740.21111-1-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Acked-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 8月, 2019 1 次提交
-
-
由 Darrick J. Wong 提交于
Don't let userspace write to an active swap file because the kernel effectively has a long term lease on the storage and things could get seriously corrupted if we let this happen. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
- 19 7月, 2019 1 次提交
-
-
由 Yang Shi 提交于
transhuge_vma_suitable() was only available for shmem THP, but anonymous THP has the same check except pgoff check. And, it will be used for THP eligible check in the later patch, so make it available for all kind of THPs. This also helps reduce code duplication slightly. Since anonymous THP doesn't have to check pgoff, so make pgoff check shmem vma only. And regroup some functions in include/linux/mm.h to solve compile issue since transhuge_vma_suitable() needs call vma_is_anonymous() which was defined after huge_mm.h is included. [akpm@linux-foundation.org: fix typo] [yang.shi@linux.alibaba.com: v4] Link: http://lkml.kernel.org/r/1563400758-124759-2-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1560401041-32207-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 7月, 2019 2 次提交
-
-
git://people.freedesktop.org/~thomash/linux由 Dave Airlie 提交于
This reverts commit 031e610a, reversing changes made to 52d2d44e. The mm changes in there we premature and not fully ack or reviewed by core mm folks, I dropped the ball by merging them via this tree, so lets take em all back out. Signed-off-by: NDave Airlie <airlied@redhat.com>
-
由 Dave Airlie 提交于
This reverts commit 6dfc43d3. Going to revert the whole vmwwgfx pull. Signed-off-by: NDave Airlie <airlied@redhat.com>
-
- 15 7月, 2019 1 次提交
-
-
由 Dave Airlie 提交于
mm/pgtable: drop pgtable_t variable from pte_fn_t functions drops the token came in via the hmm tree, this caused lots of conflicts, but applying this cleanup patch should reduce it to something easier to handle. Just accept the token is unused at this point. Signed-off-by: NDave Airlie <airlied@redhat.com>
-
- 13 7月, 2019 5 次提交
-
-
由 Konstantin Khlebnikov 提交于
This function is used by ptrace and proc files like /proc/pid/cmdline and /proc/pid/environ. Access_remote_vm never returns error codes, all errors are ignored and only size of successfully read data is returned. So, if current task was killed we'll simply return 0 (bytes read). Mmap_sem could be locked for a long time or forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzzSigned-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Acked-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miguel Ojeda 提交于
If the caller asks us for offset == num, we should already fail in the first check, i.e. the one testing for offsets beyond the object. At the moment, we are failing on the second test anyway, since count cannot be 0. Still, to agree with the comment of the first test, we should first test it there. Link: http://lkml.kernel.org/r/20190528193004.GA7744@gmail.comSigned-off-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Anshuman Khandual 提交于
Drop the pgtable_t variable from all implementation for pte_fn_t as none of them use it. apply_to_pte_range() should stop computing it as well. Should help us save some cycles. Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: NAnshuman Khandual <anshuman.khandual@arm.com> Acked-by: NMatthew Wilcox <willy@infradead.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
When swapin is performed, after getting the swap entry information from the page table, system will swap in the swap entry, without any lock held to prevent the swap device from being swapoff. This may cause the race like below, CPU 1 CPU 2 ----- ----- do_swap_page swapin_readahead __read_swap_cache_async swapoff swapcache_prepare p->swap_map = NULL __swap_duplicate p->swap_map[?] /* !!! NULL pointer access */ Because swapoff is usually done when system shutdown only, the race may not hit many people in practice. But it is still a race need to be fixed. To fix the race, get_swap_device() is added to check whether the specified swap entry is valid in its swap device. If so, it will keep the swap entry valid via preventing the swap device from being swapoff, until put_swap_device() is called. Because swapoff() is very rare code path, to make the normal path runs as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of reference count is used to implement get/put_swap_device(). >From get_swap_device() to put_swap_device(), RCU reader side is locked, so synchronize_rcu() in swapoff() will wait until put_swap_device() is called. In addition to swap_map, cluster_info, etc. data structure in the struct swap_info_struct, the swap cache radix tree will be freed after swapoff, so this patch fixes the race between swap cache looking up and swapoff too. Races between some other swap cache usages and swapoff are fixed too via calling synchronize_rcu() between clearing PageSwapCache() and freeing swap cache data structure. Another possible method to fix this is to use preempt_off() + stop_machine() to prevent the swap device from being swapoff when its data structure is being accessed. The overhead in hot-path of both methods is similar. The advantages of RCU based method are, 1. stop_machine() may disturb the normal execution code path on other CPUs. 2. File cache uses RCU to protect its radix tree. If the similar mechanism is used for swap cache too, it is easier to share code between them. 3. RCU is used to protect swap cache in total_swapcache_pages() and exit_swap_address_space() already. The two mechanisms can be merged to simplify the logic. Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com Fixes: 235b6217 ("mm/swap: add cluster lock") Signed-off-by: N"Huang, Ying" <ying.huang@intel.com> Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com> Not-nacked-by: NHugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Make the success case use the same cleanup path as the failure case. Link: http://lkml.kernel.org/r/20190523134024.GC24093@localhost.localdomainSigned-off-by: NMiklos Szeredi <mszeredi@redhat.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 7月, 2019 2 次提交
-
-
由 Christoph Hellwig 提交于
This replaces the hacky ->fault callback, which is currently directly called from common code through a hmm specific data structure as an exercise in layering violations. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NRalph Campbell <rcampbell@nvidia.com> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Reviewed-by: NDan Williams <dan.j.williams@intel.com> Tested-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
由 Christoph Hellwig 提交于
The code hasn't been used since it was added to the tree, and doesn't appear to actually be usable. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NDan Williams <dan.j.williams@intel.com> Tested-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-