1. 27 Sep 2021: 2 commits
  2. 04 Sep 2021: 2 commits
    • mm,do_huge_pmd_numa_page: remove unnecessary TLB flushing code · f00230ff
      Committed by Huang Ying
      Before commit c5b5a3dd ("mm: thp: refactor NUMA fault handling"), the
      TLB flushing is done in do_huge_pmd_numa_page() itself via
      flush_tlb_range().
      
      But after commit c5b5a3dd ("mm: thp: refactor NUMA fault handling"),
      the TLB flushing is done in migrate_pages() as in the following code path
      anyway.
      
      do_huge_pmd_numa_page
        migrate_misplaced_page
          migrate_pages
      
      So now the TLB flushing code in do_huge_pmd_numa_page() has become
      unnecessary, and this patch deletes it to simplify the code.
      This is only a code cleanup; there is no visible performance difference.
      
      The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is
      deleted too, because migrate_pages() takes care of that as well when the
      CPU TLB is flushed.
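
      For reference, the removed work was roughly of the following shape.  This
      is only an illustrative sketch (haddr, vma and HPAGE_PMD_SIZE are assumed
      local context; the two functions are the ones named above), not the
      literal deleted hunk.  The same flushing is now done inside
      migrate_pages():

          flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
          mmu_notifier_invalidate_range(vma->vm_mm, haddr,
                                        haddr + HPAGE_PMD_SIZE);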
      
      Link: https://lkml.kernel.org/r/20210720065529.716031-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE · d144bf62
      Committed by Hugh Dickins
      A successful shmem_fallocate() guarantees that the extent has been
      reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
      But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
      split huge pages and free their excess beyond i_size; and by other uses of
      split_huge_page() near i_size.
      
      It's sad to add a shmem inode field just for this, but I did not find a
      better way to keep the guarantee.  A flag to say KEEP_SIZE has been used
      would be cheaper, but I'm averse to unclearable flags.  The fallocend
      field is not perfect either (many disjoint ranges might be fallocated),
      but good enough; and gains another use later on.
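
      As a reminder of the user-visible contract being protected, here is a
      minimal userspace sketch of FALLOC_FL_KEEP_SIZE on a tmpfs file (the
      /dev/shm path and the 8MB length are arbitrary illustration, not part of
      the patch):

          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
                  int fd = open("/dev/shm/keep_size_demo", O_RDWR | O_CREAT, 0600);

                  if (fd < 0)
                          return 1;
                  /* Reserve 8MB beyond i_size without changing i_size. */
                  if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 8 << 20))
                          perror("fallocate");
                  /* i_size stays 0, but the reserved extent must survive any
                   * later split_huge_page() of pages in this file. */
                  printf("i_size is now %lld\n", (long long)lseek(fd, 0, SEEK_END));
                  close(fd);
                  return 0;
          }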
      
      Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 12 Jul 2021: 1 commit
  4. 02 Jul 2021: 3 commits
    • mm/rmap: split migration into its own function · a98a2f0c
      Committed by Alistair Popple
      Migration is currently implemented as a mode of operation for
      try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
      or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.
      
      However it does not have much in common with the rest of the unmap
      functionality of try_to_unmap_one() and thus splitting it into a separate
      function reduces the complexity of try_to_unmap_one() making it more
      readable.
      
      Several simplifications can also be made in try_to_migrate_one() based on
      the following observations:
      
       - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
       - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
       - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.
      
      TTU_SPLIT_FREEZE is a special case of migration used when splitting an
      anonymous page.  This is most easily dealt with by calling the correct
      function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
      for PageAnon or try_to_unmap().
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapops: rework swap entry manipulation code · 4dd845b5
      Committed by Alistair Popple
      Both migration and device private pages use special swap entries that are
      manipulated by a range of inline functions.  The arguments to these are
      somewhat inconsistent so rework them to remove flag type arguments and to
      make the arguments similar for both read and write entry creation.
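
      The flavour of the resulting call-site change is sketched below.  This is
      an illustration of the idea only, not the full diff; the old
      make_migration_entry(page, write) form and the new read/write-specific
      constructors taking a pfn are assumed from the series description:

          /* before: a flag argument selects read vs. write */
          entry = make_migration_entry(page, pte_write(pteval));

          /* after: the entry type itself encodes the permission */
          if (pte_write(pteval))
                  entry = make_writable_migration_entry(page_to_pfn(page));
          else
                  entry = make_readable_migration_entry(page_to_pfn(page));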
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove special swap entry functions · af5cdaf8
      Committed by Alistair Popple
      Patch series "Add support for SVM atomics in Nouveau", v11.
      
      Introduction
      ============
      
      Some devices have features such as atomic PTE bits that can be used to
      implement atomic access to system memory.  To support atomic operations to
      a shared virtual memory page such a device needs access to that page which
      is exclusive of the CPU.  This series introduces a mechanism to
      temporarily unmap pages granting exclusive access to a device.
      
      These changes are required to support OpenCL atomic operations in Nouveau
      to shared virtual memory (SVM) regions allocated with the
      CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
      OpenCL SVM feature is available at
      https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
      OpenCL_API.html#_shared_virtual_memory .
      
      Implementation
      ==============
      
      Exclusive device access is implemented by adding a new swap entry type
      (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
      difference is that on fault the original entry is immediately restored by
      the fault handler instead of waiting.
      
      Restoring the entry triggers calls to MMU notifiers, which allows a device
      driver to revoke the atomic access permission from the GPU prior to the
      CPU finalising the entry.
      
      Patches
      =======
      
      Patches 1 & 2 refactor existing migration and device private entry
      functions.
      
      Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
      functionality into separate functions - try_to_migrate_one() and
      try_to_munlock_one().
      
      Patch 5 renames some existing code but does not introduce functionality.
      
      Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
      
      Patch 7 contains the bulk of the implementation for device exclusive
      memory.
      
      Patch 8 contains some additions to the HMM selftests to ensure everything
      works as expected.
      
      Patch 9 is a cleanup for the Nouveau SVM implementation.
      
      Patch 10 contains the implementation of atomic access for the Nouveau
      driver.
      
      Testing
      =======
      
      This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
      which checks that GPU atomic accesses to system memory are atomic.
      Without this series the test fails as there is no way of write-protecting
      the page mapping which results in the device clobbering CPU writes.  For
      reference the test is available at
      https://ozlabs.org/~apopple/opencl_svm_atomics/
      
      Further testing has been performed by adding support for testing exclusive
      access to the hmm-tests kselftests.
      
      This patch (of 10):
      
      Remove multiple similar inline functions for dealing with different types
      of special swap entries.
      
      Both migration and device private swap entries use the swap offset to
      store a pfn.  Instead of multiple inline functions to obtain a struct page
      for each swap entry type use a common function pfn_swap_entry_to_page().
      Also open-code the various entry_to_pfn() functions, as this results in
      shorter code that is easier to understand.
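
      A minimal sketch of the common helper's idea (assumed shape; the real
      kernel version carries extra sanity checks):

          static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
          {
                  /* Migration and device private entries both keep a pfn in
                   * the swap offset, so one accessor covers every type. */
                  return pfn_to_page(swp_offset(entry));
          }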
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
      Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 01 Jul 2021: 11 commits
    • mm/thp: fix strncpy warning · 1212e00c
      Committed by Matthew Wilcox (Oracle)
      Using MAX_INPUT_BUF_SZ as the maximum length of the string makes fortify
      complain as it thinks the string might be longer than the buffer, and if
      it is, we will end up with a "string" that is missing a NUL terminator.
      It's trivial to show that 'tok' points to a NUL-terminated string which is
      less than MAX_INPUT_BUF_SZ in length, so we may as well just use strcpy()
      and avoid the warning.
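
      The pattern at issue, as a hedged userspace illustration (the token
      parsing and the 255 value below are stand-ins for the sysfs handler's
      real buffer handling; only the MAX_INPUT_BUF_SZ name comes from the
      original code):

          #define _GNU_SOURCE
          #include <stdio.h>
          #include <string.h>

          #define MAX_INPUT_BUF_SZ 255

          int main(void)
          {
                  char input[MAX_INPUT_BUF_SZ] = "0x100000,0x200000,64";
                  char dest[MAX_INPUT_BUF_SZ];
                  char *buf = input;
                  char *tok = strsep(&buf, ",");  /* NUL-terminated ... */

                  if (!tok || strlen(tok) >= sizeof(dest))
                          return 1;               /* ... and shorter than dest */

                  /* Plain strcpy() is provably safe here, and it does not trip
                   * the fortify warning that strncpy(dest, tok, sizeof(dest))
                   * produced. */
                  strcpy(dest, tok);
                  printf("%s\n", dest);
                  return 0;
          }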
      
      Link: https://lkml.kernel.org/r/20210615200242.1716568-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: remap_page() is only needed on anonymous THP · ab02c252
      Committed by Hugh Dickins
      THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon, and
      migration entries are only inserted when TTU_MIGRATION (unused here) or
      TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page() to
      search for migration entries to remove when !PageAnon.
      
      Link: https://lkml.kernel.org/r/f987bc44-f28e-688d-2424-b4722153ed8@google.com
      Fixes: baa355fd ("thp: file pages support for split_huge_page()")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: skip make PMD PROT_NONE if THP migration is not supported · e346e668
      Committed by Yang Shi
      A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both
      NUMA balancing and THP.  But S390 doesn't support THP migration so NUMA
      balancing actually can't migrate any misplaced pages.
      
      Skip making the PMD PROT_NONE in that case, otherwise CPU cycles may be
      wasted by pointless NUMA hinting faults on S390.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: refactor NUMA fault handling · c5b5a3dd
      Committed by Yang Shi
      When the THP NUMA fault support was added THP migration was not supported
      yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
      Since v4.14 THP migration has been supported so it doesn't make too much
      sense to still keep another THP migration implementation rather than using
      the generic migration code.
      
      This patch reworks the NUMA fault handling to use generic migration
      implementation to migrate misplaced page.  There is no functional change.
      
      After the refactor the flow of NUMA fault handling looks just like its
      PTE counterpart:
        Acquire ptl
        Prepare for migration (elevate page refcount)
        Release ptl
        Isolate page from lru and elevate page refcount
        Migrate the misplaced THP
      
      If migration fails just restore the old normal PMD.
      
      In the old code the anon_vma lock was needed to serialize THP migration
      against THP split, but the THP code has been reworked a lot since then,
      and it seems the anon_vma lock is no longer required to avoid the race.
      
      The page refcount elevation while holding the ptl should prevent THP
      split.
      
      Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
      and remove all the dead and duplicate code.
      
      [dan.carpenter@oracle.com: fix a double unlock bug]
        Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memory: add orig_pmd to struct vm_fault · 5db4f15c
      Committed by Yang Shi
      Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.
      
      When the THP NUMA fault support was added THP migration was not supported
      yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
      Since v4.14 THP migration has been supported so it doesn't make too much
      sense to still keep another THP migration implementation rather than using
      the generic migration code.  It is definitely a maintenance burden to keep
      two THP migration implementations for different code paths, and it is more
      error prone.  Using the generic THP migration implementation allows us to
      remove the duplicate code and some hacks needed by the old ad hoc
      implementation.
      
      A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both
      THP and NUMA balancing.  Most of them support THP migration except for
      S390.  Zi Yan tried to add THP migration support for S390 before but it
      was not accepted due to the design of S390 PMD.  For the discussion,
      please see: https://lkml.org/lkml/2018/4/27/953.
      
      Per the discussion with Gerald Schaefer in v1 it is acceptable to skip
      huge PMD for S390 for now.
      
      I saw there were some hacks about gup from git history, but I didn't
      figure out if they have been removed or not since I just found FOLL_NUMA
      code in the current gup implementation and they seem useful.
      
      Patch #1 ~ #2 are preparation patches.
      Patch #3 is the real meat.
      Patch #4 ~ #6 keep consistent counters and behaviors with before.
      Patch #7 skips changing the huge PMD to prot_none if THP migration is not supported.
      
      Test
      ----
      Did some tests to measure the latency of do_huge_pmd_numa_page.  The test
      VM has 80 vcpus and 64G memory.  The test would create 2 processes to
      consume 128G memory together which would incur memory pressure to cause
      THP splits.  And it also creates 80 processes to hog cpu, and the memory
      consumer processes are bound to different nodes periodically in order to
      increase NUMA faults.
      
      The below test script is used:
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Run stress-ng for 24 hours
      ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
      PID=$!
      
      ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &
      
      # Wait for vm stressors forked
      sleep 5
      
      PID_1=`pgrep -P $PID | awk 'NR == 1'`
      PID_2=`pgrep -P $PID | awk 'NR == 2'`
      
      JOB1=`pgrep -P $PID_1`
      JOB2=`pgrep -P $PID_2`
      
      # Bind load jobs to different nodes periodically to force generate
      # cross node memory access
      while [ -d "/proc/$PID" ]
      do
              taskset -apc 8 $JOB1
              taskset -apc 8 $JOB2
              sleep 300
              taskset -apc 58 $JOB1
              taskset -apc 58 $JOB2
              sleep 300
      done
      
      With the above test the histogram of latency of do_huge_pmd_numa_page is
      as shown below.  Since the number of do_huge_pmd_numa_page calls varies
      drastically for each run (probably due to the scheduler), I converted the
      raw numbers to percentages.
      
                                   patched               base
      @us[stress-ng]:
      [0]                          3.57%                 0.16%
      [1]                          55.68%                18.36%
      [2, 4)                       10.46%                40.44%
      [4, 8)                       7.26%                 17.82%
      [8, 16)                      21.12%                13.41%
      [16, 32)                     1.06%                 4.27%
      [32, 64)                     0.56%                 4.07%
      [64, 128)                    0.16%                 0.35%
      [128, 256)                   < 0.1%                < 0.1%
      [256, 512)                   < 0.1%                < 0.1%
      [512, 1K)                    < 0.1%                < 0.1%
      [1K, 2K)                     < 0.1%                < 0.1%
      [2K, 4K)                     < 0.1%                < 0.1%
      [4K, 8K)                     < 0.1%                < 0.1%
      [8K, 16K)                    < 0.1%                < 0.1%
      [16K, 32K)                   < 0.1%                < 0.1%
      [32K, 64K)                   < 0.1%                < 0.1%
      
      Per the result, patched kernel is even slightly better than the base
      kernel.  I think this is because the lock contention against THP split is
      less than base kernel due to the refactor.
      
      To exclude the effect of THP splits, I also did the test w/o memory pressure.
      No obvious regression is spotted.  The below is the test result *w/o*
      memory pressure.
      
                                 patched                  base
      @us[stress-ng]:
      [0]                        7.97%                   18.4%
      [1]                        69.63%                  58.24%
      [2, 4)                     4.18%                   2.63%
      [4, 8)                     0.22%                   0.17%
      [8, 16)                    1.03%                   0.92%
      [16, 32)                   0.14%                   < 0.1%
      [32, 64)                   < 0.1%                  < 0.1%
      [64, 128)                  < 0.1%                  < 0.1%
      [128, 256)                 < 0.1%                  < 0.1%
      [256, 512)                 0.45%                   1.19%
      [512, 1K)                  15.45%                  17.27%
      [1K, 2K)                   < 0.1%                  < 0.1%
      [2K, 4K)                   < 0.1%                  < 0.1%
      [4K, 8K)                   < 0.1%                  < 0.1%
      [8K, 16K)                  0.86%                   0.88%
      [16K, 32K)                 < 0.1%                  0.15%
      [32K, 64K)                 < 0.1%                  < 0.1%
      [64K, 128K)                < 0.1%                  < 0.1%
      [128K, 256K)               < 0.1%                  < 0.1%
      
      The series also survived a series of tests that exercise NUMA balancing
      migrations by Mel.
      
      This patch (of 7):
      
      Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge
      page fault could be removed, just like its PTE counterpart does.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/userfaultfd: fix uffd-wp special cases for fork() · 8f34f1ea
      Committed by Peter Xu
      We tried to do something similar in b569a176 ("userfaultfd: wp: drop
      _PAGE_UFFD_WP properly when fork") previously, but it didn't get it all
      right.  A few fixes around the code path:
      
      1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
         than the new vma.  That's overlooked in b569a176, so it won't work
         as expected.  Thanks to the recent rework on fork code
         (7a4830c3), we can easily get the new vma now, so switch the
         checks to that.
      
      2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
         huge pmd is a migration huge pmd.  When it happens, instead of using
         pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
         handle them separately.
      
      3. We forgot to carry over the uffd-wp bit for a write migration huge pmd
         entry.  This also happens in copy_huge_pmd(), where we converted a
         write huge migration entry into a read one.
      
      4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.
      
      5. In copy_present_page() when COW is enforced when fork(), we also
         need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new
         vma, and when the pte to be copied has uffd-wp bit set.
      
      Remove the comment in copy_present_pte() about this.  It won't help much
      to comment only there, but commenting everywhere would be overkill.
      Let's assume the commit messages would help.
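
      Point 2 above hinges on a present huge pmd and a huge pmd migration entry
      carrying the uffd-wp bit in different places.  Roughly, as an illustrative
      sketch with real helper names rather than the actual copy_huge_pmd() diff:

          if (unlikely(is_swap_pmd(pmd))) {
                  /* migration entry: use the swap-side accessors */
                  if (pmd_swp_uffd_wp(pmd) && !(dst_vma->vm_flags & VM_UFFD_WP))
                          pmd = pmd_swp_clear_uffd_wp(pmd);
          } else {
                  /* present pmd: use the normal accessors */
                  if (pmd_uffd_wp(pmd) && !(dst_vma->vm_flags & VM_UFFD_WP))
                          pmd = pmd_clear_uffd_wp(pmd);
          }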
      
      [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
        Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
      Fixes: b569a176 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: simplify copying of huge zero page pmd when fork · 5fc7a5f6
      Committed by Peter Xu
      Patch series "mm/uffd: Misc fix for uffd-wp and one more test".
      
      This series tries to fix some corner case bugs for uffd-wp on either thp
      or fork().  Then it introduced a new test with pagemap/pageout.
      
      Patch layout:
      
      Patch 1:    cleanup for THP, it'll slightly simplify the follow up patches
      Patch 2-4:  misc fixes for uffd-wp here and there; please refer to each patch
      Patch 5:    add pagemap support for uffd-wp
      Patch 6:    add pagemap/pageout test for uffd-wp
      
      The last test introduced can also verify some of the fixes in previous
      patches, as the test will fail without the fixes.  However it's not easy
      to verify all the changes in patch 2-4, but hopefully they can still be
      properly reviewed.
      
      Note that if considering the ongoing uffd-wp shmem & hugetlbfs work, patch
      5 will be incomplete as it's missing e.g.  hugetlbfs part or the special
      swap pte detection.  However that's not needed in this series, and since
      that series is still during review, this series does not depend on that
      one (the last test only runs with anonymous memory, not file-backed).  So
      this series can be merged even before that series.
      
      This patch (of 6):
      
      The huge zero page is handled in a special path in copy_huge_pmd(), however
      it should share most code with a normal THP page.  Try to share more code
      with it by removing the special path.  The only leftover so far is the
      huge zero page refcounting (mm_get_huge_zero_page()), because that's
      separately done with a global counter.
      
      This prepares for a future patch to modify the huge pmd to be installed,
      so that we don't need to duplicate it explicitly into huge zero page case
      too.
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory.c: don't discard hugepage if other processes are mapping it · babbbdd0
      Committed by Miaohe Lin
      If other processes are mapping any other subpages of the hugepage, i.e.
      in pte-mapped thp case, page_mapcount() will return 1 incorrectly.  Then
      we would discard the page while other processes are still mapping it.  Fix
      it by using total_mapcount() which can tell whether other processes are
      still mapping it.
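
      The gist of the fix, as a sketch rather than the exact hunk from the
      MADV_FREE huge-page path:

          /* page_mapcount() can report 1 for a pte-mapped THP even while
           * another process still maps one of its subpages; total_mapcount()
           * counts those subpage mappings too. */
          if (total_mapcount(page) != 1)
                  goto out;       /* someone else still maps it: don't discard */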
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com
      Fixes: b8d3c4c3 ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd · 9132a468
      Committed by Miaohe Lin
      Commit aa88b68c ("thp: keep huge zero page pinned until tlb flush")
      introduced tlb_remove_page() for huge zero page to keep it pinned until
      flush is complete and prevents the page from being split under us.  But
      huge zero page is kept pinned until all relevant mm_users reach zero since
      the commit 6fcb52a5 ("thp: reduce usage of huge zero page's atomic
      counter").  So tlb_remove_page_size() for huge zero pmd is unnecessary
      now.
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-5-linmiaohe@huawei.com
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() · e6be37b2
      Committed by Miaohe Lin
      Since commit 99cb0dbd ("mm,thp: add read-only THP support for
      (non-shmem) FS"), read-only THP file mapping is supported.  But it forgot
      to add checking for it in transparent_hugepage_enabled().  To fix it, we
      add checking for read-only THP file mapping and also introduce helper
      transhuge_vma_enabled() to check whether thp is enabled for specified vma
      to reduce duplicated code.  We rename transparent_hugepage_enabled to
      transparent_hugepage_active to make the code easier to follow as suggested
      by David Hildenbrand.
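
      The new helper is essentially a predicate of this shape (a sketch; the
      exact checks and their placement in the real header may differ):

          static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
                                                   unsigned long vm_flags)
          {
                  /* THP explicitly disabled for this vma (madvise) or mm (prctl)? */
                  if ((vm_flags & VM_NOHUGEPAGE) ||
                      test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
                          return false;
                  return true;
          }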
      
      [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
        Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory.c: use page->deferred_list · dfe5c51c
      Committed by Miaohe Lin
      Now that we can represent the location of ->deferred_list instead of
      ->mapping + ->index, make use of it to improve readability.
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 17 Jun 2021: 4 commits
    • mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split · 504e070d
      Committed by Yang Shi
      When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
      fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount(); however,
      it may miss the failure when the head page is unmapped but another subpage
      is mapped.  Then the second DEBUG_VM BUG() that checks total mapcount would
      catch it.  This may cause some confusion.
      
      As this is not a fatal issue, consolidate the two DEBUG_VM checks
      into one VM_WARN_ON_ONCE_PAGE().
      
      [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      
      Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: try_to_unmap() use TTU_SYNC for safe splitting · 732ed558
      Committed by Hugh Dickins
      Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
      (!unmap_success): with dump_page() showing mapcount:1, but then its raw
      struct page output showing _mapcount ffffffff i.e.  mapcount 0.
      
      And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
      it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
      and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
      all indicative of some mapcount difficulty in development here perhaps.
      But the !CONFIG_DEBUG_VM path handles the failures correctly and
      silently.
      
      I believe the problem is that once a racing unmap has cleared pte or
      pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
      from try_to_unmap() before the racing task has reached decrementing
      mapcount.
      
      Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
      follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
      TTU_SYNC to the options, and passing that from unmap_page().
      
      When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
      for both: the slight overhead added should rarely matter, except perhaps
      if splitting sparsely-populated multiply-mapped shmem.  Once confident
      that bugs are fixed, TTU_SYNC here can be removed, and the race
      tolerated.
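
      In code terms the change is small; sketched here with the real flag names,
      though not as the literal diff:

          /* unmap_page() now always asks for a synchronous walk ... */
          ttu_flags |= TTU_SYNC;

          /* ... and try_to_unmap_one() honours it: */
          if (flags & TTU_SYNC)
                  pvmw.flags = PVMW_SYNC;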
      
      Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
      Fixes: fec89c10 ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: make is_huge_zero_pmd() safe and quicker · 3b77e8c8
      Committed by Hugh Dickins
      Most callers of is_huge_zero_pmd() supply a pmd already verified
      present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
      migration entry, in which the pfn is encoded differently from a present
      pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
      since L1TF forced us to protect against that); or perhaps even crash in
      pmd_page() applied to a swap-like entry.
      
      Make it safe by adding pmd_present() check into is_huge_zero_pmd()
      itself; and make it quicker by saving huge_zero_pfn, so that
      is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
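
      The resulting helper is likely of this shape (a sketch; huge_zero_pfn is
      the cached value this patch introduces):

          static inline bool is_huge_zero_pmd(pmd_t pmd)
          {
                  /* Check present first: a migration entry encodes its pfn
                   * differently and must not reach pmd_pfn()/pmd_page(). */
                  return pmd_present(pmd) &&
                         READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
          }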
      
      __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
      but is unnecessary now that is_huge_zero_pmd() checks present.
      
      Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
      Fixes: e71769ae ("mm: enable thp migration for shmem thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: fix __split_huge_pmd_locked() on shmem migration entry · 99fa8a48
      Committed by Hugh Dickins
      Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.
      
      Here is v2 batch of long-standing THP bug fixes that I had not got
      around to sending before, but prompted now by Wang Yugui's report
      https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      
      Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
      they have done no harm, but have *not* fixed that issue: something more
      is needed and I have no idea of what.
      
      This patch (of 7):
      
      Stressing huge tmpfs page migration racing hole punch often crashed on
      the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
      kernel; or shortly afterwards, on a bad dereference in
      __split_huge_pmd_locked() when DEBUG_VM=n.  They forgot to allow for pmd
      migration entries in the non-anonymous case.
      
      Full disclosure: those particular experiments were on a kernel with more
      relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
      vanilla kernel: it is conceivable that stricter locking happens to avoid
      those cases, or makes them less likely; but __split_huge_pmd_locked()
      already allowed for pmd migration entries when handling anonymous THPs,
      so this commit brings the shmem and file THP handling into line.
      
      And while there: use old_pmd rather than _pmd, as in the following
      blocks; and make it clearer to the eye that the !vma_is_anonymous()
      block is self-contained, making an early return after accounting for
      unmapping.
      
      Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
      Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
      Fixes: e71769ae ("mm: enable thp migration for shmem thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jue Wang <juew@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 07 May 2021: 1 commit
  8. 06 May 2021: 9 commits
  9. 14 Mar 2021: 2 commits
  10. 27 Feb 2021: 1 commit
    • mm,thp,shmem: limit shmem THP alloc gfp_mask · 164cc4fe
      Committed by Rik van Riel
      Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.
      
      The allocation flags of anonymous transparent huge pages can be controlled
      through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
      keep the system from getting bogged down in the page reclaim and
      compaction code when many THPs are getting allocated simultaneously.
      
      However, the gfp_mask for shmem THP allocations was not limited by those
      configuration settings, and some workloads ended up with all CPUs stuck on
      the LRU lock in the page reclaim code, trying to allocate dozens of THPs
      simultaneously.
      
      This patch applies the same configured limitation of THPs to shmem
      hugepage allocations, to prevent that from happening.
      
      This way a THP defrag setting of "never" or "defer+madvise" will result in
      quick allocation failures without direct reclaim when no 2MB free pages
      are available.
      
      With this patch applied, THP allocations for tmpfs will be a little more
      aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
      less aggressive for files that are not mmapped or mapped without that
      flag.
      
      This patch (of 4):
      
      The allocation flags of anonymous transparent huge pages can be controlled
      through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
      keep the system from getting bogged down in the page reclaim and
      compaction code when many THPs are getting allocated simultaneously.
      
      However, the gfp_mask for shmem THP allocations was not limited by those
      configuration settings, and some workloads ended up with all CPUs stuck on
      the LRU lock in the page reclaim code, trying to allocate dozens of THPs
      simultaneously.
      
      This patch applies the same configured limitation of THPs to shmem
      hugepage allocations, to prevent that from happening.
      
      Controlling the gfp_mask of THP allocations through the knobs in sysfs
      allows users to determine the balance between how aggressively the system
      tries to allocate THPs at fault time, and how much the application may end
      up stalling attempting those allocations.
      
      This way a THP defrag setting of "never" or "defer+madvise" will result in
      quick allocation failures without direct reclaim when no 2MB free pages
      are available.
      
      With this patch applied, THP allocations for tmpfs will be a little more
      aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
      less aggressive for files that are not mmapped or mapped without that
      flag.
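
      For context, the MADV_HUGEPAGE case referred to above is simply a tmpfs
      file mapped as below; a minimal userspace sketch, with an arbitrary
      /dev/shm path and length:

          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <string.h>
          #include <sys/mman.h>
          #include <unistd.h>

          int main(void)
          {
                  size_t len = 4UL << 20;         /* two PMD-sized pages */
                  int fd = open("/dev/shm/thp_demo", O_RDWR | O_CREAT, 0600);

                  if (fd < 0 || ftruncate(fd, len))
                          return 1;
                  char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
                  if (p == MAP_FAILED)
                          return 1;
                  /* Request THP for this mapping: with this series the shmem
                   * allocation follows the sysfs "defrag" policy instead of
                   * always entering direct reclaim/compaction. */
                  madvise(p, len, MADV_HUGEPAGE);
                  memset(p, 0, len);              /* fault the pages in */
                  munmap(p, len);
                  close(fd);
                  return 0;
          }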
      
      Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
      Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 25 Feb 2021: 4 commits