1. 23 Mar 2022 (3 commits)
    • P
      mm: rename zap_skip_check_mapping() to should_zap_page() · 254ab940
      Authored by Peter Xu
      The previous name runs against the way people naturally think about the
      check.  Invert the meaning and the return value accordingly.  No
      functional change intended.
      
      Link: https://lkml.kernel.org/r/20220216094810.60572-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      254ab940
    • P
      mm: don't skip swap entry even if zap_details specified · 5abfd71d
      Authored by Peter Xu
      Patch series "mm: Rework zap ptes on swap entries", v5.
      
      Patch 1 should fix a long-standing bug in zap_pte_range()'s use of
      zap_details.  The risk is that some swap entries could be skipped when
      they should have been zapped.
      
      Migration entries are not the major concern because file-backed memory is
      always zapped in the pattern "first without the page lock, then re-zapped
      with the page lock", hence the second zap will always make sure all
      migration entries are already recovered.
      
      However, real swap entries can be skipped erroneously.  A reproducer is
      provided in the commit message of patch 1.
      
      Patches 2-4 are cleanups based on patch 1.  After the whole patchset is
      applied, we should have a very clean view of zap_pte_range().
      
      Only patch 1 needs to be backported to stable if necessary.
      
      This patch (of 4):
      
      The "details" pointer shouldn't be the token used to decide whether we
      should skip swap entries.
      
      For example, when the caller specifies details->zap_mapping==NULL, it
      means the user wants to zap all the pages (including COWed pages), so
      we need to look into swap entries because there can be private COWed
      pages that were swapped out.
      
      Skipping swap entries when details is non-NULL may wrongly leave behind
      swap entries that should have been zapped.
      
      A reproducer of the problem:
      
      ===8<===
              #define _GNU_SOURCE         /* See feature_test_macros(7) */
              #include <stdio.h>
              #include <assert.h>
              #include <unistd.h>
              #include <sys/mman.h>
              #include <sys/types.h>
      
              int page_size;
              int shmem_fd;
              char *buffer;
      
                int main(void)
              {
                      int ret;
                      char val;
      
                      page_size = getpagesize();
                      shmem_fd = memfd_create("test", 0);
                      assert(shmem_fd >= 0);
      
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE, shmem_fd, 0);
                      assert(buffer != MAP_FAILED);
      
                      /* Write private page, swap it out */
                      buffer[page_size] = 1;
                      madvise(buffer, page_size * 2, MADV_PAGEOUT);
      
                      /* This should drop private buffer[page_size] already */
                      ret = ftruncate(shmem_fd, page_size);
                      assert(ret == 0);
                      /* Recover the size */
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      /* Re-read the data, it should be all zero */
                      val = buffer[page_size];
                      if (val == 0)
                              printf("Good\n");
                      else
                              printf("BUG\n");
              }
      ===8<===
      
      We don't need to touch the pmd path, because pmd never had an issue with
      swap entries.  For example, shmem pmd migration will always be split to
      pte level, and the same applies to swapping on anonymous memory.
      
      Add another helper, should_zap_cows(), so that we can also check whether
      we should zap private mappings when there's no page pointer specified.
      
      This patch drops that trick, so we handle swap ptes coherently.  Meanwhile
      we should do the same check upon migration entries, hwpoison entries and
      genuine swap entries too.
      
      To be explicit, we should still remember to keep the private entries if
      even_cows==false, and always zap them when even_cows==true.
      
      The issue seems to date back to the initial git commit of the kernel.
      
      [peterx@redhat.com: comment tweaks]
        Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5abfd71d
    • M
      mm: hugetlb: fix missing cache flush in copy_huge_page_from_user() · e763243c
      Authored by Muchun Song
      userfaultfd calls copy_huge_page_from_user(), which does not do any cache
      flushing for the target page.  The target page will then be mapped into
      user space at a different (user) address, which may alias with the kernel
      address that was used as the destination of the copy.
      
      Fix this issue by flushing dcache in copy_huge_page_from_user().
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com
      Fixes: fa4d75c1 ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e763243c
  2. 20 Jan 2022 (1 commit)
  3. 15 Jan 2022 (3 commits)
  4. 08 Jan 2022 (1 commit)
  5. 07 Nov 2021 (6 commits)
  6. 29 Oct 2021 (1 commit)
    • Y
      mm: filemap: check if THP has hwpoisoned subpage for PMD page fault · eac96c3e
      Authored by Yang Shi
      When handling a shmem page fault, a THP with a corrupted subpage could be
      PMD-mapped if certain conditions are satisfied.  But the kernel is
      supposed to send SIGBUS when trying to map a hwpoisoned page.
      
      There are two paths which may do PMD map: fault around and regular
      fault.
      
      Before commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") the situation was even worse in the fault-around path: the
      THP could be PMD-mapped as long as the VMA fit, regardless of which
      subpage was accessed and corrupted.  After that commit, the THP can still
      be PMD-mapped as long as the head page is not corrupted.
      
      In the regular fault path, the THP can be PMD-mapped as long as the
      corrupted subpage is not the one accessed and the VMA fits.
      
      This loophole could be closed by iterating over every subpage to check
      whether any of them is hwpoisoned, but that is costly in the page fault
      path.
      
      So introduce a new page flag, HasHWPoisoned, on the first tail page.  It
      indicates that the THP has hwpoisoned subpage(s).  It is set if any
      subpage of the THP is found hwpoisoned by memory failure and after the
      refcount is bumped successfully, then cleared when the THP is freed or
      split.
      
      The soft-offline path doesn't need this, since the soft-offline handler
      only marks a subpage hwpoisoned after the subpage has been migrated
      successfully.  But shmem THPs are not split and then migrated at all.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eac96c3e
  7. 18 Oct 2021 (1 commit)
  8. 01 Oct 2021 (1 commit)
  9. 27 Sep 2021 (2 commits)
  10. 13 Sep 2021 (1 commit)
    • D
      afs: Fix mmap coherency vs 3rd-party changes · 6e0e99d5
      Authored by David Howells
      Fix the coherency management of mmap'd data such that 3rd-party changes
      become visible as soon as possible after the callback notification is
      delivered by the fileserver.  This is done by the following means:
      
       (1) When we break a callback on a vnode specified by the CB.CallBack call
           from the server, we queue a work item (vnode->cb_work) to go and
           clobber all the PTEs mapping to that inode.
      
           This causes the CPU to trip through the ->map_pages() and
           ->page_mkwrite() handlers if userspace attempts to access the page(s)
           again.
      
           (Ideally, this would be done in the service handler for CB.CallBack,
           but the server is waiting for our reply before proceeding, and we
           have a list of vnodes, all of which need breaking - and the process
           of getting the mmap_lock and stripping the PTEs on all CPUs could be
           quite slow.)
      
       (2) Call afs_validate() from the ->map_pages() handler to check to see if
           the file has changed and to get a new callback promise from the
           server.
      
      Also handle the fileserver telling us that it's dropping all callbacks,
      possibly after it's been restarted by sending us a CB.InitCallBackState*
      call by the following means:
      
       (3) Maintain a per-cell list of afs files that are currently mmap'd
           (cell->fs_open_mmaps).
      
       (4) Add a work item to each server that is invoked if there are any open
           mmaps when CB.InitCallBackState happens.  This work item goes through
           the aforementioned list and invokes the vnode->cb_work work item for
           each one that is currently using this server.
      
           This causes the PTEs to be cleared, causing ->map_pages() or
           ->page_mkwrite() to be called again, thereby calling afs_validate()
           again.
      
      I've chosen to simply strip the PTEs at the point of notification reception
      rather than invalidate all the pages as well because (a) it's faster, (b)
      we may get a notification for other reasons than the data being altered (in
      which case we don't want to clobber the pagecache) and (c) we need to ask
      the server to find out - and I don't want to wait for the reply before
      holding up userspace.
      
      This was tested using the attached test program:
      
      	#include <stdbool.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      	#include <sys/mman.h>
      	int main(int argc, char *argv[])
      	{
      		size_t size = getpagesize();
      		unsigned char *p;
      		bool mod = (argc == 3);
      		int fd;
      		if (argc != 2 && argc != 3) {
      			fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
      			exit(2);
      		}
      		fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
      		if (fd < 0) {
      			perror(argv[1]);
      			exit(1);
      		}
      
      		p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
      			 MAP_SHARED, fd, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      		for (;;) {
      			if (mod) {
      				p[0]++;
      				msync(p, size, MS_ASYNC);
      				fsync(fd);
      			}
      			printf("%02x", p[0]);
      			fflush(stdout);
      			sleep(1);
      		}
      	}
      
      It runs in two modes: in one mode, it mmaps a file, then sits in a loop
      reading the first byte, printing it and sleeping for a second; in the
      second mode it mmaps a file, then sits in a loop incrementing the first
      byte and flushing, then printing and sleeping.
      
      Two instances of this program can be run on different machines, one doing
      the reading and one doing the writing.  The reader should see the changes
      made by the writer, but without this patch, they aren't because validity
      checking is being done lazily - only on entry to the filesystem.
      
      Testing the InitCallBackState change is more complicated.  The server has
      to be taken offline, the saved callback state file removed and then the
      server restarted whilst the reading-mode program continues to run.  The
      client machine then has to poke the server to trigger the InitCallBackState
      call.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
      6e0e99d5
  11. 24 Jul 2021 (1 commit)
  12. 02 Jul 2021 (4 commits)
    • A
      mm: device exclusive memory access · b756a3b5
      Authored by Alistair Popple
      Some devices require exclusive write access to shared virtual memory (SVM)
      ranges to perform atomic operations on that memory.  This requires CPU
      page tables to be updated to deny access whilst atomic operations are
      occurring.
      
      In order to do this introduce a new swap entry type
      (SWP_DEVICE_EXCLUSIVE).  When a SVM range needs to be marked for exclusive
      access by a device all page table mappings for the particular range are
      replaced with device exclusive swap entries.  This causes any CPU access
      to the page to result in a fault.
      
      Faults are resolved by replacing the faulting entry with the original
      mapping.  This results in MMU notifiers being called, which a driver uses
      to update access permissions, such as revoking atomic access.  After the
      notifiers have been called, the device will no longer have exclusive
      access to the region.
      
      Walking the page tables to find the target pages is handled by
      get_user_pages() rather than a direct page table walk.  A direct page
      table walk similar to what migrate_vma_collect()/unmap() does could also
      have been used.  However, that resulted in more code duplicating the
      functionality get_user_pages() already provides, since page faulting is
      required to make the PTEs present and to break COW anyway.
      
      [dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
        Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b756a3b5
    • A
      mm/memory.c: allow different return codes for copy_nonpresent_pte() · 9a5cc85c
      Authored by Alistair Popple
      Currently, if copy_nonpresent_pte() returns a non-zero value, it is
      assumed to be a swap entry that requires further processing outside the
      loop in copy_pte_range() after dropping locks.  This prevents other
      values from being returned to signal conditions such as failure, which a
      subsequent change requires.
      
      Instead make copy_nonpresent_pte() return an error code if further
      processing is required and read the value for the swap entry in the main
      loop under the ptl.
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-7-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a5cc85c
    • A
      mm/swapops: rework swap entry manipulation code · 4dd845b5
      Authored by Alistair Popple
      Both migration and device private pages use special swap entries that are
      manipulated by a range of inline functions.  The arguments to these are
      somewhat inconsistent, so rework them to remove flag-type arguments and
      to make the arguments similar for both read and write entry creation.
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4dd845b5
    • A
      mm: remove special swap entry functions · af5cdaf8
      Authored by Alistair Popple
      Patch series "Add support for SVM atomics in Nouveau", v11.
      
      Introduction
      ============
      
      Some devices have features such as atomic PTE bits that can be used to
      implement atomic access to system memory.  To support atomic operations to
      a shared virtual memory page such a device needs access to that page which
      is exclusive of the CPU.  This series introduces a mechanism to
      temporarily unmap pages granting exclusive access to a device.
      
      These changes are required to support OpenCL atomic operations in Nouveau
      to shared virtual memory (SVM) regions allocated with the
      CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
      OpenCL SVM feature is available at
      https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory.
      
      Implementation
      ==============
      
      Exclusive device access is implemented by adding a new swap entry type
      (SWP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
      difference is that on fault the original entry is immediately restored by
      the fault handler instead of waiting.
      
      Restoring the entry triggers calls to MMU notifiers, which allows a
      device driver to revoke the atomic access permission from the GPU prior
      to the CPU finalising the entry.
      
      Patches
      =======
      
      Patches 1 & 2 refactor existing migration and device private entry
      functions.
      
      Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
      functionality into separate functions - try_to_migrate_one() and
      try_to_munlock_one().
      
      Patch 5 renames some existing code but does not introduce functionality.
      
      Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
      
      Patch 7 contains the bulk of the implementation for device exclusive
      memory.
      
      Patch 8 contains some additions to the HMM selftests to ensure everything
      works as expected.
      
      Patch 9 is a cleanup for the Nouveau SVM implementation.
      
      Patch 10 contains the implementation of atomic access for the Nouveau
      driver.
      
      Testing
      =======
      
      This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
      which checks that GPU atomic accesses to system memory are atomic.
      Without this series the test fails as there is no way of write-protecting
      the page mapping which results in the device clobbering CPU writes.  For
      reference the test is available at
      https://ozlabs.org/~apopple/opencl_svm_atomics/
      
      Further testing has been performed by adding support for testing exclusive
      access to the hmm-tests kselftests.
      
      This patch (of 10):
      
      Remove multiple similar inline functions for dealing with different types
      of special swap entries.
      
      Both migration and device private swap entries use the swap offset to
      store a pfn.  Instead of multiple inline functions to obtain a struct page
      for each swap entry type use a common function pfn_swap_entry_to_page().
      Also open-code the various entry_to_pfn() functions, as this results in
      shorter code that is easier to understand.
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
      Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af5cdaf8
  13. 01 Jul 2021 (4 commits)
    • Y
      mm: memory: make numa_migrate_prep() non-static · f4c0d836
      Authored by Yang Shi
      numa_migrate_prep() will be used by the huge NUMA fault path as well in a
      following patch, so make it non-static.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4c0d836
    • Y
      mm: memory: add orig_pmd to struct vm_fault · 5db4f15c
      Authored by Yang Shi
      Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.
      
      When THP NUMA fault support was added, THP migration was not yet
      supported, so ad hoc THP migration was implemented in the NUMA fault
      handling.  Since v4.14 THP migration has been supported, so it doesn't
      make much sense to keep a separate THP migration implementation rather
      than using the generic migration code.  It is definitely a maintenance
      burden to keep two THP migration implementations for different code
      paths, and it is more error-prone.  Using the generic THP migration
      implementation allows us to remove the duplicate code and some hacks
      needed by the old ad hoc implementation.
      
      A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support
      both THP and NUMA balancing.  Most of them support THP migration except
      for S390.  Zi Yan tried to add THP migration support for S390 before,
      but it was not accepted due to the design of the S390 PMD.  For the
      discussion, please see: https://lkml.org/lkml/2018/4/27/953.
      
      Per the discussion with Gerald Schaefer on v1, it is acceptable to skip
      huge PMDs for S390 for now.
      
      I saw there were some hacks about gup in the git history, but I couldn't
      figure out whether they have been removed, since I still found the
      FOLL_NUMA code in the current gup implementation and it seems useful.
      
      Patch #1 ~ #2 are preparation patches.
      Patch #3 is the real meat.
      Patch #4 ~ #6 keep consistent counters and behaviors with before.
      Patch #7 skips changing huge PMDs to prot_none if THP migration is not supported.
      
      Test
      ----
      Did some tests to measure the latency of do_huge_pmd_numa_page.  The test
      VM has 80 vcpus and 64G memory.  The test creates 2 processes that
      together consume 128G of memory, which incurs memory pressure and causes
      THP splits.  It also creates 80 processes to hog CPU, and the memory
      consumer processes are bound to different nodes periodically in order to
      increase NUMA faults.
      
      The below test script is used:
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Run stress-ng for 24 hours
      ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
      PID=$!
      
      ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &
      
      # Wait for vm stressors forked
      sleep 5
      
      PID_1=`pgrep -P $PID | awk 'NR == 1'`
      PID_2=`pgrep -P $PID | awk 'NR == 2'`
      
      JOB1=`pgrep -P $PID_1`
      JOB2=`pgrep -P $PID_2`
      
      # Bind load jobs to different nodes periodically to force generate
      # cross node memory access
      while [ -d "/proc/$PID" ]
      do
              taskset -apc 8 $JOB1
              taskset -apc 8 $JOB2
              sleep 300
              taskset -apc 58 $JOB1
              taskset -apc 58 $JOB2
              sleep 300
      done
      
      With the above test, the histogram of the latency of
      do_huge_pmd_numa_page is as shown below.  Since the number of
      do_huge_pmd_numa_page calls varies drastically for each run (probably
      due to the scheduler), I converted the raw numbers to percentages.
      
                                   patched               base
      @us[stress-ng]:
      [0]                          3.57%                 0.16%
      [1]                          55.68%                18.36%
      [2, 4)                       10.46%                40.44%
      [4, 8)                       7.26%                 17.82%
      [8, 16)                      21.12%                13.41%
      [16, 32)                     1.06%                 4.27%
      [32, 64)                     0.56%                 4.07%
      [64, 128)                    0.16%                 0.35%
      [128, 256)                   < 0.1%                < 0.1%
      [256, 512)                   < 0.1%                < 0.1%
      [512, 1K)                    < 0.1%                < 0.1%
      [1K, 2K)                     < 0.1%                < 0.1%
      [2K, 4K)                     < 0.1%                < 0.1%
      [4K, 8K)                     < 0.1%                < 0.1%
      [8K, 16K)                    < 0.1%                < 0.1%
      [16K, 32K)                   < 0.1%                < 0.1%
      [32K, 64K)                   < 0.1%                < 0.1%
      
      Per the result, the patched kernel is even slightly better than the base
      kernel.  I think this is because lock contention against THP splits is
      lower than in the base kernel due to the refactoring.
      
      To exclude the effect of THP splits, I also tested without memory
      pressure.  No obvious regression was spotted.  Below is the test result
      *without* memory pressure.
      
                                 patched                  base
      @us[stress-ng]:
      [0]                        7.97%                   18.4%
      [1]                        69.63%                  58.24%
      [2, 4)                     4.18%                   2.63%
      [4, 8)                     0.22%                   0.17%
      [8, 16)                    1.03%                   0.92%
      [16, 32)                   0.14%                   < 0.1%
      [32, 64)                   < 0.1%                  < 0.1%
      [64, 128)                  < 0.1%                  < 0.1%
      [128, 256)                 < 0.1%                  < 0.1%
      [256, 512)                 0.45%                   1.19%
      [512, 1K)                  15.45%                  17.27%
      [1K, 2K)                   < 0.1%                  < 0.1%
      [2K, 4K)                   < 0.1%                  < 0.1%
      [4K, 8K)                   < 0.1%                  < 0.1%
      [8K, 16K)                  0.86%                   0.88%
      [16K, 32K)                 < 0.1%                  0.15%
      [32K, 64K)                 < 0.1%                  < 0.1%
      [64K, 128K)                < 0.1%                  < 0.1%
      [128K, 256K)               < 0.1%                  < 0.1%
      
      The series also survived a series of tests that exercise NUMA balancing
      migrations by Mel.
      
      This patch (of 7):
      
      Add orig_pmd to struct vm_fault so that the "orig_pmd" parameter used by
      the huge page fault handlers can be removed, just like its PTE
      counterpart.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5db4f15c
    • A
      userfaultfd/shmem: support minor fault registration for shmem · c949b097
      Authored by Axel Rasmussen
      This patch allows shmem-backed VMAs to be registered for minor faults.
      Minor faults are appropriately relayed to userspace in the fault path, for
      VMAs with the relevant flag.
      
      This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
      minor faults, though, so userspace doesn't yet have a way to resolve such
      faults.
      
      Because of this, we also don't yet advertise this as a supported feature.
      That will be done in a separate commit when the feature is fully
      implemented.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c949b097
    • P
      mm/userfaultfd: fix uffd-wp special cases for fork() · 8f34f1ea
      Authored by Peter Xu
      We tried to do something similar in b569a176 ("userfaultfd: wp: drop
      _PAGE_UFFD_WP properly when fork") previously, but it didn't get
      everything right.  A few fixes around the code path:
      
      1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
         than the new vma.  That's overlooked in b569a176, so it won't work
         as expected.  Thanks to the recent rework on fork code
         (7a4830c3), we can easily get the new vma now, so switch the
         checks to that.
      
      2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
         huge pmd is a migration huge pmd.  When it happens, instead of using
         pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
         handle them separately.
      
      3. We forgot to carry over the uffd-wp bit for a write migration huge
         pmd entry.  This also happens in copy_huge_pmd(), where we converted
         a write huge migration entry into a read one.
      
      4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.
      
      5. In copy_present_page(), when COW is enforced during fork(), we also
         need to carry over the uffd-wp bit if VM_UFFD_WP is armed on the new
         vma and the pte being copied has the uffd-wp bit set.
      
      Remove the comment in copy_present_pte() about this.  Commenting only
      there wouldn't help much, and commenting everywhere would be overkill.
      Let's assume the commit messages will help.
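Fixes 1 and 2 can be illustrated with a toy model.  The bit layout below is invented purely for illustration (the real encodings are architecture-specific); the point is that a present pmd and a swap/migration pmd keep the uffd-wp marker in different places, so each needs its own accessor, and the keep-or-drop decision must look at the *new* vma's flags:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, NOT kernel code: bit positions are invented.  A present
 * pmd and a swap/migration pmd encode the uffd-wp marker differently,
 * which is why pmd_uffd_wp() and pmd_swp_uffd_wp() both exist. */
#define PMD_PRESENT      (1u << 0)
#define PMD_UFFD_WP      (1u << 1)   /* valid only when PMD_PRESENT */
#define PMD_SWP_UFFD_WP  (1u << 2)   /* valid only when !PMD_PRESENT */

static bool pmd_uffd_wp_model(uint32_t pmd)     { return pmd & PMD_UFFD_WP; }
static bool pmd_swp_uffd_wp_model(uint32_t pmd) { return pmd & PMD_SWP_UFFD_WP; }

/* Copy a pmd into the child at fork: drop the wp marker when the NEW
 * vma is not registered for uffd-wp (fix #1), picking the accessor to
 * clear by whether the entry is present (fix #2). */
static uint32_t copy_pmd_model(uint32_t pmd, bool new_vma_uffd_wp)
{
    if (new_vma_uffd_wp)
        return pmd;                      /* carry the marker over as-is */
    if (pmd & PMD_PRESENT)
        return pmd & ~PMD_UFFD_WP;
    return pmd & ~PMD_SWP_UFFD_WP;       /* swap/migration entry */
}
```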
      
      [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
        Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
      Fixes: b569a176 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f34f1ea
  14. 30 June 2021, 5 commits
  15. 17 June 2021, 1 commit
    • H
      mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 22061a1f
      Authored by Hugh Dickins
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but here apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
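The reordering described above can be sketched as a toy model (all names with a `_model` suffix are invented; the booleans stand in for the real page flags and locking): the bulk unmap runs without the page lock, and the single-page unmap runs under the lock only if a racing fault re-mapped the page in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the invalidate_inode_pages2_range() ordering change;
 * NOT kernel code. */
struct tpage { bool mapped, locked; };

static void unmap_mapping_range_model(struct tpage *p) { p->mapped = false; }

static void unmap_mapping_page_model(struct tpage *p)
{
    assert(p->locked);          /* mirrors the VM_BUG_ON(!PageLocked) */
    p->mapped = false;
}

/* Returns the final mapped state.  `race_remap` simulates a fault
 * re-mapping the page between the unlocked bulk unmap and the locked
 * re-check. */
static bool invalidate_model(bool race_remap)
{
    struct tpage p = { .mapped = true, .locked = false };

    unmap_mapping_range_model(&p);      /* outside the page lock */
    if (race_remap)
        p.mapped = true;                /* concurrent fault wins the race */
    p.locked = true;
    if (p.mapped)
        unmap_mapping_page_model(&p);   /* under the page lock */
    p.locked = false;
    return p.mapped;
}
```

Either way, the page ends up unmapped before deletion, which is the invariant the false "Bad page cache" reports were violating.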
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22061a1f
  16. 05 June 2021, 1 commit
  17. 07 May 2021, 2 commits
  18. 01 May 2021, 2 commits
    • N
      mm: apply_to_pte_range warn and fail if a large pte is encountered · 0c95cba4
      Authored by Nicholas Piggin
      apply_to_pte_range might mistake a large pte for bad, or treat it as a
      page table, resulting in a crash or corruption.  Add a test to warn and
      return error if large entries are found.
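The warn-and-fail pattern can be sketched as follows; the struct and helper names are invented for illustration (the real code checks pmd_leaf()/pmd_bad() and friends at each level of the walk):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of the guard added to the apply_to_*_range() walk;
 * NOT kernel code. */
struct entry { bool none, leaf, bad; };

/* Return 0 if it is safe to descend into the lower-level table (or to
 * populate an empty slot), -EINVAL for a large (leaf) or bad entry that
 * must not be mistaken for a table of ptes. */
static int check_entry_model(const struct entry *e)
{
    if (e->none)
        return 0;            /* empty: caller may populate it */
    if (e->leaf || e->bad)
        return -EINVAL;      /* the real code does WARN_ON_ONCE() here */
    return 0;                /* regular page table: walk the ptes */
}
```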
      
      Link: https://lkml.kernel.org/r/20210317062402.533919-4-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ding Tianhong <dingtianhong@huawei.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c95cba4
    • H
      NUMA balancing: reduce TLB flush via delaying mapping on hint page fault · b99a342d
      Authored by Huang Ying
      With NUMA balancing, the hint page fault handler migrates the faulting
      page to the accessing node if necessary.  During the migration, the TLB
      will be shot down on all CPUs that the process has run on recently,
      because the hint page fault handler makes the PTE accessible before the
      migration is attempted.  The overhead of the TLB shootdown can be high,
      so it is better avoided if possible.  In fact, it can be avoided if we
      delay mapping the page until after the migration.  This is what this
      patch does.
      
      For the multiple threads applications, it's possible that a page is
      accessed by multiple threads almost at the same time.  In the original
      implementation, because the first thread will install the accessible PTE
      before migrating the page, the other threads may access the page directly
      before the page is made inaccessible again during migration.  With the
      patch, the second thread goes through the page fault handler too.
      And because of the PageLRU() checking in the following code path,
      
        migrate_misplaced_page()
          numamigrate_isolate_page()
            isolate_lru_page()
      
      the migrate_misplaced_page() will return 0, and the PTE will be made
      accessible in the second thread.
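The two-thread case can be sketched as a toy model (names with a `_model` suffix are invented): the LRU flag acts as the gate, so only the thread that isolates the page from the LRU proceeds with migration, and any concurrent faulting thread just maps the page where it is.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the concurrent-fault behaviour described above;
 * NOT kernel code. */
struct page_model { bool on_lru; };

static bool isolate_lru_page_model(struct page_model *p)
{
    if (!p->on_lru)
        return false;        /* someone else already isolated it */
    p->on_lru = false;
    return true;
}

/* Returns 1 if this thread wins and migrates the page, 0 if it falls
 * back to making the PTE accessible in place. */
static int numa_hint_fault_model(struct page_model *p)
{
    if (isolate_lru_page_model(p))
        return 1;            /* winner: migrate, then map on the new node */
    return 0;                /* loser: map the page where it is */
}

/* Two threads fault on the same page back to back; encode the two
 * outcomes as a single value (winner * 10 + loser). */
static int race_two_faults_model(void)
{
    struct page_model pg = { .on_lru = true };
    return numa_hint_fault_model(&pg) * 10 + numa_hint_fault_model(&pg);
}
```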
      
      This introduces a little more overhead.  But we think the probability
      of a page being accessed by multiple threads at the same time is low,
      and the overhead difference isn't too large.  If this becomes a problem
      in some workloads, we need to consider how to reduce the overhead.
      
      To test the patch, we run a test case as follows on a 2-socket Intel
      server (1 NUMA node per socket) with 128GB DRAM (64GB per socket).
      
      1. Run a memory eater on NUMA node 1 to use 40GB memory before running
         pmbench.
      
      2. Run pmbench (normal accessing pattern) with 8 processes, and 8
         threads per process, so there are 64 threads in total.  The
         working-set size of each process is 8960MB, so the total working-set
         size is 8 * 8960MB = 70GB.  The CPU of all pmbench processes is bound
         to node 1.  The pmbench processes will access some DRAM on node 0.
      
      3. After the pmbench processes run for 10 seconds, kill the memory
         eater.  Now, some pages will be migrated from node 0 to node 1 via
         NUMA balancing.
      
      Test results show that, with the patch, the pmbench throughput (page
      accesses/s) increases by 5.5%.  The number of TLB shootdown interrupts
      drops by 98% (from ~4.7e7 to ~9.7e5), with about 9.2e6 pages (35.8GB)
      migrated.  The perf profile shows that the CPU cycles spent by
      try_to_unmap() and its callees drop from 6.02% to 0.47%; that is, the
      CPU cycles spent on TLB shootdown decrease greatly.
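The reported reduction can be checked directly from the raw counts:

```c
#include <assert.h>

/* Percentage reduction from `before` to `after`:
 * 4.7e7 down to 9.7e5 works out to roughly 98%. */
static double reduction_pct(double before, double after)
{
    return 100.0 * (1.0 - after / before);
}
```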
      
      Link: https://lkml.kernel.org/r/20210408132236.1175607-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Matthew Wilcox" <willy@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b99a342d