1. 23 March 2022: 9 commits
  2. 25 February 2022: 1 commit
    • uaccess: remove CONFIG_SET_FS · 967747bb
      Committed by Arnd Bergmann
      There are no remaining callers of set_fs(), so CONFIG_SET_FS
      can be removed globally, along with the thread_info field and
      any references to it.
      
      This turns access_ok() into a cheaper check against TASK_SIZE_MAX.
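
      As a rough illustration (a standalone model, not the kernel's implementation;
      the TASK_SIZE_MAX value below is an assumed x86-64 constant), the check
      reduces to an overflow-safe range comparison against a fixed limit:

      	#include <stdbool.h>
      	#include <stddef.h>
      	#include <stdint.h>

      	/* Assumed x86-64 user address limit, for illustration only. */
      	#define TASK_SIZE_MAX ((uintptr_t)0x00007ffffffff000ULL)

      	/* Model of the simplified check: no per-thread limit to look up,
      	 * just an overflow-safe comparison against a constant. */
      	static bool access_ok_model(const void *addr, size_t size)
      	{
      		return size <= TASK_SIZE_MAX &&
      		       (uintptr_t)addr <= TASK_SIZE_MAX - size;
      	}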
      
      As CONFIG_SET_FS is now gone, drop all remaining references to
      set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().
      
      Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
      Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
      Acked-by: Dinh Nguyen <dinguyen@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  3. 18 February 2022: 1 commit
    • mm/munlock: rmap call mlock_vma_page() munlock_vma_page() · cea86fe2
      Committed by Hugh Dickins
      Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
      inline functions which check (vma->vm_flags & VM_LOCKED) before calling
      mlock_page() and munlock_page() in mm/mlock.c.
      
      Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
      because we have understandable difficulty in accounting pte maps of THPs,
      and if passed a PageHead page, mlock_page() and munlock_page() cannot
      tell whether it's a pmd map to be counted or a pte map to be ignored.
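
      A standalone sketch of that wrapper shape (the flag value, struct fields and
      helper names below are illustrative models, not the kernel's definitions):

      	#include <stdbool.h>

      	#define VM_LOCKED_MODEL 0x00002000UL    /* illustrative flag value */

      	struct vma_model  { unsigned long vm_flags; };
      	struct page_model { bool part_of_thp; };

      	/* The real work lives in mm/mlock.c; the model just stands in. */
      	static void mlock_page_model(struct page_model *page) { (void)page; }

      	/* Inline wrapper: the common case costs one flag test, and the
      	 * caller says whether this is a pmd (compound) map, because a THP
      	 * subpage alone cannot tell a pmd map (counted) from a pte map
      	 * (ignored). */
      	static inline void mlock_vma_page_model(struct page_model *page,
      						struct vma_model *vma,
      						bool compound)
      	{
      		if ((vma->vm_flags & VM_LOCKED_MODEL) &&
      		    (compound || !page->part_of_thp))
      			mlock_page_model(page);
      	}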
      
      Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
      others, and use that to call mlock_vma_page() at the end of the page
      adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
      beginning? unimportant, but end was easier for assertions in testing).
      
      No page lock is required (although almost all adds happen to hold it):
      delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
      Certainly page lock did serialize with page migration, but I'm having
      difficulty explaining why that was ever important.
      
      Mlock accounting on THPs has been hard to define, differed between anon
      and file, involved PageDoubleMap in some places and not others, required
      clear_page_mlock() at some points.  Keep it simple now: just count the
      pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
      
      page_add_new_anon_rmap() callers unchanged: they have long been calling
      lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
      handling (it also checks for not VM_SPECIAL: I think that's overcautious
      and inconsistent with other checks, since mmap_region() already prevents
      VM_LOCKED on VM_SPECIAL; but I haven't quite convinced myself to change it).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  4. 20 January 2022: 1 commit
  5. 15 January 2022: 3 commits
  6. 08 January 2022: 1 commit
  7. 07 November 2021: 6 commits
  8. 29 October 2021: 1 commit
    • mm: filemap: check if THP has hwpoisoned subpage for PMD page fault · eac96c3e
      Committed by Yang Shi
      When handling a shmem page fault, a THP with a corrupted subpage could be
      PMD mapped if certain conditions are satisfied.  But the kernel is supposed
      to send SIGBUS when trying to map a hwpoisoned page.
      
      There are two paths which may do PMD map: fault around and regular
      fault.
      
      Before commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") the situation was even worse in the fault-around path.  The THP
      could be PMD mapped as long as the VMA fit, regardless of which subpage was
      accessed and corrupted.  After this commit, the THP can be PMD mapped as
      long as the head page is not corrupted.
      
      In the regular fault path the THP could be PMD mapped as long as the
      corrupted page is not accessed and the VMA fits.
      
      This loophole could be fixed by iterating every subpage to check if any
      of them is hwpoisoned or not, but it is somewhat costly in page fault
      path.
      
      So introduce a new page flag called HasHWPoisoned on the first tail
      page.  It indicates the THP has hwpoisoned subpage(s).  It is set if any
      subpage of the THP is found hwpoisoned by memory failure after the
      refcount is bumped successfully, and cleared when the THP is freed or
      split.
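
      A standalone model of the resulting fault-time decision (names are
      illustrative; in the kernel the flag lives in the first tail page's flags):

      	#include <stdbool.h>

      	/* Model of a THP: the flag set by memory failure sits on the
      	 * compound page, so one test covers every subpage. */
      	struct thp_model {
      		bool has_hwpoisoned;    /* any subpage is hwpoisoned */
      	};

      	/* PMD-mapping decision in the fault path: refuse the cheap
      	 * whole-THP map if any subpage is known bad, so the poisoned
      	 * subpage can raise SIGBUS when it is actually touched. */
      	static bool may_map_pmd_model(const struct thp_model *thp)
      	{
      		return !thp->has_hwpoisoned;
      	}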
      
      The soft offline path doesn't need this since the soft offline handler just
      marks a subpage hwpoisoned when the subpage is migrated successfully.  But a
      shmem THP is not split and then migrated at all.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 18 October 2021: 1 commit
  10. 01 October 2021: 1 commit
  11. 27 September 2021: 2 commits
  12. 13 September 2021: 1 commit
    • afs: Fix mmap coherency vs 3rd-party changes · 6e0e99d5
      Committed by David Howells
      Fix the coherency management of mmap'd data such that 3rd-party changes
      become visible as soon as possible after the callback notification is
      delivered by the fileserver.  This is done by the following means:
      
       (1) When we break a callback on a vnode specified by the CB.CallBack call
           from the server, we queue a work item (vnode->cb_work) to go and
           clobber all the PTEs mapping to that inode.
      
           This causes the CPU to trip through the ->map_pages() and
           ->page_mkwrite() handlers if userspace attempts to access the page(s)
           again.
      
           (Ideally, this would be done in the service handler for CB.CallBack,
           but the server is waiting for our reply before considering, and we
           have a list of vnodes, all of which need breaking - and the process of
           getting the mmap_lock and stripping the PTEs on all CPUs could be
           quite slow.)
      
       (2) Call afs_validate() from the ->map_pages() handler to check to see if
           the file has changed and to get a new callback promise from the
           server.
      
      Also handle the fileserver telling us that it's dropping all callbacks
      (possibly after it's been restarted) by sending us a CB.InitCallBackState*
      call, by the following means:
      
       (3) Maintain a per-cell list of afs files that are currently mmap'd
           (cell->fs_open_mmaps).
      
       (4) Add a work item to each server that is invoked if there are any open
           mmaps when CB.InitCallBackState happens.  This work item goes through
           the aforementioned list and invokes the vnode->cb_work work item for
           each one that is currently using this server.
      
           This causes the PTEs to be cleared, causing ->map_pages() or
           ->page_mkwrite() to be called again, thereby calling afs_validate()
           again.
      
      I've chosen to simply strip the PTEs at the point of notification reception
      rather than invalidate all the pages as well because (a) it's faster, (b)
      we may get a notification for other reasons than the data being altered (in
      which case we don't want to clobber the pagecache) and (c) we need to ask
      the server to find out - and I don't want to wait for the reply before
      holding up userspace.
      
      This was tested using the attached test program:
      
      	#include <stdbool.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      	#include <sys/mman.h>
      	int main(int argc, char *argv[])
      	{
      		size_t size = getpagesize();
      		unsigned char *p;
      		bool mod = (argc == 3);
      		int fd;
      		if (argc != 2 && argc != 3) {
      			fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
      			exit(2);
      		}
      		fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
      		if (fd < 0) {
      			perror(argv[1]);
      			exit(1);
      		}
      
      		p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
      			 MAP_SHARED, fd, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      		for (;;) {
      			if (mod) {
      				p[0]++;
      				msync(p, size, MS_ASYNC);
      				fsync(fd);
      			}
      			printf("%02x", p[0]);
      			fflush(stdout);
      			sleep(1);
      		}
      	}
      
      It runs in two modes: in one mode, it mmaps a file, then sits in a loop
      reading the first byte, printing it and sleeping for a second; in the
      second mode it mmaps a file, then sits in a loop incrementing the first
      byte and flushing, then printing and sleeping.
      
      Two instances of this program can be run on different machines, one doing
      the reading and one doing the writing.  The reader should see the changes
      made by the writer, but without this patch they aren't seen, because validity
      checking is done lazily - only on entry to the filesystem.
      
      Testing the InitCallBackState change is more complicated.  The server has
      to be taken offline, the saved callback state file removed and then the
      server restarted whilst the reading-mode program continues to run.  The
      client machine then has to poke the server to trigger the InitCallBackState
      call.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
  13. 24 July 2021: 1 commit
  14. 02 July 2021: 4 commits
    • mm: device exclusive memory access · b756a3b5
      Committed by Alistair Popple
      Some devices require exclusive write access to shared virtual memory (SVM)
      ranges to perform atomic operations on that memory.  This requires CPU
      page tables to be updated to deny access whilst atomic operations are
      occurring.
      
      In order to do this introduce a new swap entry type
      (SWP_DEVICE_EXCLUSIVE).  When a SVM range needs to be marked for exclusive
      access by a device all page table mappings for the particular range are
      replaced with device exclusive swap entries.  This causes any CPU access
      to the page to result in a fault.
      
      Faults are resolved by replacing the faulting entry with the original
      mapping.  This results in MMU notifiers being called which a driver uses
      to update access permissions such as revoking atomic access.  After
      notifiers have been called the device will no longer have exclusive access
      to the region.
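
      A standalone model of that lifecycle (not kernel code; the MMU notifier is
      reduced to a plain callback and the entry to an enum):

      	/* A pte is either present or carries a device-exclusive entry. */
      	enum pte_state_model { PTE_PRESENT, PTE_DEVICE_EXCLUSIVE };

      	struct pte_model {
      		enum pte_state_model state;
      		unsigned long pfn;      /* the swap offset carries the pfn */
      	};

      	/* Marking a range for the device: present ptes become exclusive
      	 * entries, so any later CPU access faults. */
      	static void make_exclusive_model(struct pte_model *pte)
      	{
      		if (pte->state == PTE_PRESENT)
      			pte->state = PTE_DEVICE_EXCLUSIVE;
      	}

      	/* CPU fault on an exclusive entry: restore the original mapping
      	 * and fire the notifier callback so the driver revokes the
      	 * device's atomic access before the CPU proceeds. */
      	static void fault_restore_model(struct pte_model *pte,
      					void (*notify)(void))
      	{
      		if (pte->state == PTE_DEVICE_EXCLUSIVE) {
      			pte->state = PTE_PRESENT;
      			notify();
      		}
      	}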
      
      Walking of the page tables to find the target pages is handled by
      get_user_pages() rather than a direct page table walk.  A direct page
      table walk similar to what migrate_vma_collect()/unmap() does could also
      have been utilised.  However this resulted in more code similar in
      functionality to what get_user_pages() provides as page faulting is
      required to make the PTEs present and to break COW.
      
      [dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
        Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c: allow different return codes for copy_nonpresent_pte() · 9a5cc85c
      Committed by Alistair Popple
      Currently if copy_nonpresent_pte() returns a non-zero value it is assumed
      to be a swap entry which requires further processing outside the loop in
      copy_pte_range() after dropping locks.  This prevents other values being
      returned to signal conditions such as failure which a subsequent change
      requires.
      
      Instead make copy_nonpresent_pte() return an error code if further
      processing is required and read the value for the swap entry in the main
      loop under the ptl.
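
      A standalone sketch of the new contract (the error code is illustrative,
      not taken from the patch):

      	#include <errno.h>

      	/* 0 means the non-present pte was handled entirely under the ptl;
      	 * a negative error code tells the caller to read the entry itself
      	 * under the ptl, drop locks, do the extra work and retry, instead
      	 * of overloading every non-zero return to mean "swap entry". */
      	static int copy_nonpresent_pte_model(int needs_followup)
      	{
      		return needs_followup ? -EAGAIN : 0;
      	}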
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-7-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapops: rework swap entry manipulation code · 4dd845b5
      Committed by Alistair Popple
      Both migration and device private pages use special swap entries that are
      manipulated by a range of inline functions.  The arguments to these are
      somewhat inconsistent so rework them to remove flag type arguments and to
      make the arguments similar for both read and write entry creation.
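
      A standalone model of the reshaped constructors (types and names here are
      illustrative; the point is separate read/write helpers with identical
      argument lists instead of a flag parameter):

      	/* One representation for every special entry: a kind plus a pfn. */
      	enum swp_kind_model {
      		MIGRATION_READ, MIGRATION_WRITE,
      		DEVICE_PRIVATE_READ, DEVICE_PRIVATE_WRITE,
      	};

      	struct swp_entry_model {
      		enum swp_kind_model kind;
      		unsigned long pfn;
      	};

      	/* Read and write constructors take the same single argument, so
      	 * callers no longer pass a "writable" flag. */
      	static struct swp_entry_model
      	make_readable_migration_entry_model(unsigned long pfn)
      	{
      		struct swp_entry_model e = { MIGRATION_READ, pfn };
      		return e;
      	}

      	static struct swp_entry_model
      	make_writable_migration_entry_model(unsigned long pfn)
      	{
      		struct swp_entry_model e = { MIGRATION_WRITE, pfn };
      		return e;
      	}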
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove special swap entry functions · af5cdaf8
      Committed by Alistair Popple
      Patch series "Add support for SVM atomics in Nouveau", v11.
      
      Introduction
      ============
      
      Some devices have features such as atomic PTE bits that can be used to
      implement atomic access to system memory.  To support atomic operations to
      a shared virtual memory page such a device needs access to that page which
      is exclusive of the CPU.  This series introduces a mechanism to
      temporarily unmap pages granting exclusive access to a device.
      
      These changes are required to support OpenCL atomic operations in Nouveau
      to shared virtual memory (SVM) regions allocated with the
      CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
      OpenCL SVM feature is available at
      https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
      OpenCL_API.html#_shared_virtual_memory .
      
      Implementation
      ==============
      
      Exclusive device access is implemented by adding a new swap entry type
      (SWP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
      difference is that on fault the original entry is immediately restored by
      the fault handler instead of waiting.
      
      Restoring the entry triggers calls to MMU notifiers which allow a device
      driver to revoke the atomic access permission from the GPU prior to the
      CPU finalising the entry.
      
      Patches
      =======
      
      Patches 1 & 2 refactor existing migration and device private entry
      functions.
      
      Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
      functionality into separate functions - try_to_migrate_one() and
      try_to_munlock_one().
      
      Patch 5 renames some existing code but does not introduce functionality.
      
      Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
      
      Patch 7 contains the bulk of the implementation for device exclusive
      memory.
      
      Patch 8 contains some additions to the HMM selftests to ensure everything
      works as expected.
      
      Patch 9 is a cleanup for the Nouveau SVM implementation.
      
      Patch 10 contains the implementation of atomic access for the Nouveau
      driver.
      
      Testing
      =======
      
      This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
      which checks that GPU atomic accesses to system memory are atomic.
      Without this series the test fails as there is no way of write-protecting
      the page mapping which results in the device clobbering CPU writes.  For
      reference the test is available at
      https://ozlabs.org/~apopple/opencl_svm_atomics/
      
      Further testing has been performed by adding support for testing exclusive
      access to the hmm-tests kselftests.
      
      This patch (of 10):
      
      Remove multiple similar inline functions for dealing with different types
      of special swap entries.
      
      Both migration and device private swap entries use the swap offset to
      store a pfn.  Instead of multiple inline functions to obtain a struct page
      for each swap entry type use a common function pfn_swap_entry_to_page().
      Also open-code the various entry_to_pfn() functions as this results in
      shorter code that is easier to understand.
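
      A standalone model of the consolidation (types and helpers are illustrative,
      not the kernel's):

      	#include <stddef.h>

      	struct page_model { unsigned long flags; };

      	enum swp_kind_model2 { MIGRATION, DEVICE_PRIVATE, OTHER_SWAP };

      	struct swp_entry_model2 {
      		enum swp_kind_model2 kind;
      		unsigned long offset;   /* for these kinds, this is a pfn */
      	};

      	/* Both migration and device-private entries keep a pfn in the
      	 * offset, so one helper replaces the per-type *_entry_to_page()
      	 * functions. */
      	static struct page_model *
      	pfn_swap_entry_to_page_model(struct swp_entry_model2 entry,
      				     struct page_model *mem_map)
      	{
      		if (entry.kind == MIGRATION || entry.kind == DEVICE_PRIVATE)
      			return &mem_map[entry.offset];
      		return NULL;
      	}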
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
      Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 01 July 2021: 4 commits
    • mm: memory: make numa_migrate_prep() non-static · f4c0d836
      Committed by Yang Shi
      numa_migrate_prep() will be used by the huge NUMA fault path as well in the
      following patch, so make it non-static.
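
      In general terms (a sketch with hypothetical file and symbol names, not the
      actual patch), this means dropping the static qualifier and exposing a
      declaration so another translation unit can call the function:

      	/* mm/internal_model.h (hypothetical): declaration visible to
      	 * other translation units, e.g. the huge NUMA fault path. */
      	int numa_migrate_prep_model(void *page, void *vmf,
      				    unsigned long addr, int page_nid,
      				    int *flags);

      	/* mm/memory_model.c (hypothetical): the definition loses
      	 * "static"; the body itself is unchanged. */
      	int numa_migrate_prep_model(void *page, void *vmf,
      				    unsigned long addr, int page_nid,
      				    int *flags)
      	{
      		(void)page; (void)vmf; (void)addr;
      		(void)page_nid; (void)flags;
      		return 0;       /* placeholder body */
      	}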
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memory: add orig_pmd to struct vm_fault · 5db4f15c
      Committed by Yang Shi
      Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.
      
      When the THP NUMA fault support was added THP migration was not supported
      yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
      Since v4.14 THP migration has been supported so it doesn't make too much
      sense to still keep another THP migration implementation rather than using
      the generic migration code.  It is definitely a maintenance burden to keep
      two THP migration implementations for different code paths and it is more
      error prone.  Using the generic THP migration implementation allows us to
      remove the duplicate code and some hacks needed by the old ad hoc
      implementation.
      
      A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both
      THP and NUMA balancing.  Most of them support THP migration except for
      S390.  Zi Yan tried to add THP migration support for S390 before but it
      was not accepted due to the design of S390 PMD.  For the discussion,
      please see: https://lkml.org/lkml/2018/4/27/953.
      
      Per the discussion with Gerald Schaefer in v1 it is acceptable to skip
      huge PMD for S390 for now.
      
      I saw there were some hacks about gup from git history, but I didn't
      figure out if they have been removed or not since I just found FOLL_NUMA
      code in the current gup implementation and they seem useful.
      
      Patch #1 ~ #2 are preparation patches.
      Patch #3 is the real meat.
      Patch #4 ~ #6 keep consistent counters and behaviors with before.
      Patch #7 skips changing huge PMD to prot_none if THP migration is not supported.
      
      Test
      ----
      Did some tests to measure the latency of do_huge_pmd_numa_page.  The test
      VM has 80 vcpus and 64G of memory.  The test creates 2 processes that
      together consume 128G of memory, which incurs memory pressure and causes
      THP splits.  It also creates 80 processes to hog the CPUs, and the memory
      consumer processes are bound to different nodes periodically in order to
      increase NUMA faults.
      
      The below test script is used:
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Run stress-ng for 24 hours
      ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
      PID=$!
      
      ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &
      
      # Wait for vm stressors forked
      sleep 5
      
      PID_1=`pgrep -P $PID | awk 'NR == 1'`
      PID_2=`pgrep -P $PID | awk 'NR == 2'`
      
      JOB1=`pgrep -P $PID_1`
      JOB2=`pgrep -P $PID_2`
      
      # Bind load jobs to different nodes periodically to force generate
      # cross node memory access
      while [ -d "/proc/$PID" ]
      do
              taskset -apc 8 $JOB1
              taskset -apc 8 $JOB2
              sleep 300
              taskset -apc 58 $JOB1
              taskset -apc 58 $JOB2
              sleep 300
      done
      
      With the above test the histogram of latency of do_huge_pmd_numa_page is
      as shown below.  Since the number of do_huge_pmd_numa_page calls varies
      drastically between runs (probably due to the scheduler), I converted the
      raw numbers to percentages.
      
                                   patched               base
      @us[stress-ng]:
      [0]                          3.57%                 0.16%
      [1]                          55.68%                18.36%
      [2, 4)                       10.46%                40.44%
      [4, 8)                       7.26%                 17.82%
      [8, 16)                      21.12%                13.41%
      [16, 32)                     1.06%                 4.27%
      [32, 64)                     0.56%                 4.07%
      [64, 128)                    0.16%                 0.35%
      [128, 256)                   < 0.1%                < 0.1%
      [256, 512)                   < 0.1%                < 0.1%
      [512, 1K)                    < 0.1%                < 0.1%
      [1K, 2K)                     < 0.1%                < 0.1%
      [2K, 4K)                     < 0.1%                < 0.1%
      [4K, 8K)                     < 0.1%                < 0.1%
      [8K, 16K)                    < 0.1%                < 0.1%
      [16K, 32K)                   < 0.1%                < 0.1%
      [32K, 64K)                   < 0.1%                < 0.1%
      
      Per the result, the patched kernel is even slightly better than the base
      kernel.  I think this is because the lock contention against THP split is
      lower than in the base kernel due to the refactor.
      
      To exclude the effect of THP splits, I also tested w/o memory pressure.
      No obvious regression was spotted.  The below is the test result *w/o*
      memory pressure.
      
                                 patched                  base
      @us[stress-ng]:
      [0]                        7.97%                   18.4%
      [1]                        69.63%                  58.24%
      [2, 4)                     4.18%                   2.63%
      [4, 8)                     0.22%                   0.17%
      [8, 16)                    1.03%                   0.92%
      [16, 32)                   0.14%                   < 0.1%
      [32, 64)                   < 0.1%                  < 0.1%
      [64, 128)                  < 0.1%                  < 0.1%
      [128, 256)                 < 0.1%                  < 0.1%
      [256, 512)                 0.45%                   1.19%
      [512, 1K)                  15.45%                  17.27%
      [1K, 2K)                   < 0.1%                  < 0.1%
      [2K, 4K)                   < 0.1%                  < 0.1%
      [4K, 8K)                   < 0.1%                  < 0.1%
      [8K, 16K)                  0.86%                   0.88%
      [16K, 32K)                 < 0.1%                  0.15%
      [32K, 64K)                 < 0.1%                  < 0.1%
      [64K, 128K)                < 0.1%                  < 0.1%
      [128K, 256K)               < 0.1%                  < 0.1%
      
      The series also survived a series of tests that exercise NUMA balancing
      migrations by Mel.
      
      This patch (of 7):
      
      Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge
      page fault could be removed, just like its PTE counterpart does.
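
      A standalone model of the change (this is not the kernel's struct vm_fault;
      field names follow the description above):

      	/* Before: huge-page fault handlers took orig_pmd as an extra
      	 * parameter.  After: it travels in the fault descriptor, like the
      	 * pte counterpart (orig_pte) already does. */
      	typedef unsigned long pmd_model_t;

      	struct vm_fault_model {
      		unsigned long address;
      		pmd_model_t orig_pmd;   /* pmd value sampled at fault time */
      	};

      	static int do_huge_pmd_numa_page_model(struct vm_fault_model *vmf)
      	{
      		/* handlers read vmf->orig_pmd instead of a parameter */
      		return vmf->orig_pmd != 0 ? 0 : -1;
      	}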
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/shmem: support minor fault registration for shmem · c949b097
      Committed by Axel Rasmussen
      This patch allows shmem-backed VMAs to be registered for minor faults.
      Minor faults are appropriately relayed to userspace in the fault path, for
      VMAs with the relevant flag.
      
      This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
      minor faults, though, so userspace doesn't yet have a way to resolve such
      faults.
      
      Because of this, we also don't yet advertise this as a supported feature.
      That will be done in a separate commit when the feature is fully
      implemented.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/userfaultfd: fix uffd-wp special cases for fork() · 8f34f1ea
      Committed by Peter Xu
      We tried to do something similar in b569a176 ("userfaultfd: wp: drop
      _PAGE_UFFD_WP properly when fork") previously, but it's not doing it all
      right.  A few fixes around the code path:
      
      1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
         than the new vma.  That's overlooked in b569a176, so it won't work
         as expected.  Thanks to the recent rework on fork code
         (7a4830c3), we can easily get the new vma now, so switch the
         checks to that.
      
      2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
         huge pmd is a migration huge pmd.  When it happens, instead of using
         pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
         handle them separately (see the sketch after this list).
      
      3. Forget to carry over uffd-wp bit for a write migration huge pmd
         entry.  This also happens in copy_huge_pmd(), where we converted a
         write huge migration entry into a read one.
      
      4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.
      
      5. In copy_present_page(), when COW is enforced during fork(), we also
         need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new
         vma and the pte to be copied has the uffd-wp bit set.
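
      A standalone model of the distinction in fix 2 (not kernel code; the two
      fields stand in for what pmd_uffd_wp() and pmd_swp_uffd_wp() read):

      	#include <stdbool.h>

      	/* A present pmd and a migration (swap) pmd store their uffd-wp
      	 * bit in different formats, so each needs its own accessor; using
      	 * the present-format accessor on a migration entry reads the
      	 * wrong bit. */
      	struct pmd_model {
      		bool present;
      		bool uffd_wp;           /* present-format wp bit */
      		bool swp_uffd_wp;       /* swap/migration-format wp bit */
      	};

      	static bool pmd_carries_uffd_wp_model(const struct pmd_model *pmd)
      	{
      		return pmd->present ? pmd->uffd_wp : pmd->swp_uffd_wp;
      	}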
      
      Remove the comment in copy_present_pte() about this.  It won't help much
      to comment only there, but commenting everywhere would be overkill.
      Let's assume the commit messages will help.
      
      [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
        Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
      Fixes: b569a176 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 30 June 2021: 3 commits