1. 09 Sep 2021, 5 commits
  2. 04 Sep 2021, 4 commits
    • mm/migrate: correct kernel-doc notation · c9bd7d18
      By Randy Dunlap
      Use the expected "Return:" format to prevent a kernel-doc warning.
      
      mm/migrate.c:1157: warning: Excess function parameter 'returns' description in 'next_demotion_node'
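
      For reference, a minimal kernel-doc sketch in the expected style (illustrative
      only, not the exact next_demotion_node() comment):

              /**
               * next_demotion_node() - return the next node in the demotion path
               * @node: the starting node to look up
               *
               * Return: node id of the next memory node in the demotion path
               * hierarchy from @node; NUMA_NO_NODE if @node is a terminal node.
               */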
      
      Link: https://lkml.kernel.org/r/20210808203151.10632-1-rdunlap@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: enable returning precise migrate_pages() success count · 5ac95884
      By Yang Shi
      Under normal circumstances, migrate_pages() returns the number of pages
      migrated.  In error conditions, it returns an error code.  When returning
      an error code, there is no way to know how many pages were migrated or not
      migrated.
      
      Make migrate_pages() return how many pages are demoted successfully for
      all cases, including when encountering errors.  Page reclaim behavior will
      depend on this in subsequent patches.
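
      A hedged sketch of how a caller might consume the precise count; the
      nr_succeeded output parameter, the alloc_demote_page callback and the mtc
      argument are illustrative assumptions, not confirmed by this log:

              LIST_HEAD(demote_pages);
              unsigned int nr_succeeded = 0;
              int err;

              /* Try to demote; collect the success count even when err != 0. */
              err = migrate_pages(&demote_pages, alloc_demote_page, NULL,
                                  (unsigned long)&mtc, MIGRATE_ASYNC,
                                  MR_DEMOTION, &nr_succeeded);

              /* Reclaim accounting can rely on nr_succeeded regardless of err. */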
      
      Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: update node demotion order on hotplug events · 884a6e5d
      By Dave Hansen
      Reclaim-based migration is attempting to optimize data placement in memory
      based on the system topology.  If the system changes, so must the
      migration ordering.
      
      The implementation is conceptually simple and entirely unoptimized.  On
      any memory or CPU hotplug events, assume that a node was added or removed
      and recalculate all migration targets.  This ensures that the
      node_demotion[] array is always ready to be used in case the new reclaim
      mode is enabled.
      
      This recalculation is far from optimal, most glaringly in that it does not
      even attempt to figure out whether the hotplug event would have any *actual*
      effect on the demotion order.  But, given the expected paucity of hotplug
      events, this should be fine.
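
      A minimal sketch of the idea, assuming a standard memory hotplug notifier;
      the callback and helper names below are illustrative, not necessarily the
      ones in this patch:

              /* Rebuild all demotion targets on any memory online/offline event. */
              static int migrate_on_reclaim_callback(struct notifier_block *self,
                                                     unsigned long action, void *_arg)
              {
                      switch (action) {
                      case MEM_ONLINE:
                      case MEM_OFFLINE:
                              set_migration_target_nodes();   /* recompute node_demotion[] */
                              break;
                      }
                      return notifier_from_errno(0);
              }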
      
      Link: https://lkml.kernel.org/r/20210721063926.3024591-2-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20210715055145.195411-3-ying.huang@intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/numa: automatically generate node migration order · 79c28a41
      By Dave Hansen
      Patch series "Migrate Pages in lieu of discard", v11.
      
      We're starting to see systems with more and more kinds of memory such as
      Intel's implementation of persistent memory.
      
      Let's say you have a system with some DRAM and some persistent memory.
      Today, once DRAM fills up, reclaim will start and some of the DRAM
      contents will be thrown out.  Allocations will, at some point, start
      falling over to the slower persistent memory.
      
      That has two nasty properties.  First, the newer allocations can end up in
      the slower persistent memory.  Second, reclaimed data in DRAM are just
      discarded even if there are gobs of space in persistent memory that could
      be used.
      
      This patchset implements a solution to these problems.  At the end of the
      reclaim process in shrink_page_list() just before the last page refcount
      is dropped, the page is migrated to persistent memory instead of being
      dropped.
      
      While I've talked about a DRAM/PMEM pairing, this approach would function
      in any environment where memory tiers exist.
      
      This is not perfect.  It "strands" pages in slower memory and never brings
      them back to fast DRAM.  Huang Ying has follow-on work which repurposes
      NUMA balancing to promote hot pages back to DRAM.
      
      This is also all based on an upstream mechanism that allows persistent
      memory to be onlined and used as if it were volatile:
      
      	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
      
      With that, the DRAM and PMEM in each socket will be represented as 2
      separate NUMA nodes, with the CPUs sitting in the DRAM node.  So the
      general inter-NUMA demotion mechanism introduced in the patchset can
      migrate the cold DRAM pages to the PMEM node.
      
      We have tested the patchset with postgresql and pgbench.  On a
      2-socket server machine with DRAM and PMEM, the kernel with the patchset
      can improve the pgbench score by up to 22.1% compared with that of the
      DRAM only + disk case.  This comes from the reduced disk read throughput
      (which drops by up to 70.8%).
      
      == Open Issues ==
      
       * Pages covered by memory policies and cpusets that, for instance,
         restrict allocations to DRAM can still be demoted to PMEM whenever
         they opt in to this new mechanism.  A cgroup-level API to opt in or
         opt out of these migrations will likely be required as a follow-on.
       * Could be more aggressive about where anon LRU scanning occurs
         since it no longer necessarily involves I/O.  get_scan_count()
         for instance says: "If we have no swap space, do not bother
         scanning anon pages"
      
      This patch (of 9):
      
      Prepare for the kernel to auto-migrate pages to other memory nodes with a
      node migration table.  This allows creating a single migration target for
      each NUMA node to enable the kernel to do NUMA page migrations instead of
      simply discarding colder pages.  A node with no target is a "terminal
      node", so reclaim acts normally there.  The migration target does not
      fundamentally _need_ to be a single node, but this implementation starts
      there to limit complexity.
      
      When memory fills up on a node, memory contents can be automatically
      migrated to another node.  The biggest problems are knowing when to
      migrate and to where the migration should be targeted.
      
      The most straightforward way to generate the "to where" list would be to
      follow the page allocator fallback lists.  Those lists already tell us,
      if memory is full, where to look next.  It would also be logical to move
      memory in that order.
      
      But, the allocator fallback lists have a fatal flaw: most nodes appear in
      all the lists.  This would potentially lead to migration cycles (A->B,
      B->A, A->B, ...).
      
      Instead of using the allocator fallback lists directly, keep a separate
      node migration ordering.  But, reuse the same data used to generate page
      allocator fallback in the first place: find_next_best_node().
      
      This means that the firmware data used to populate node distances
      essentially dictates the ordering for now.  It should also be
      architecture-neutral since all NUMA architectures have a working
      find_next_best_node().
      
      RCU is used to allow lock-less reads of node_demotion[] and to prevent
      demotion cycles from being observed.  If multiple reads of node_demotion[] are
      performed, a single rcu_read_lock() must be held over all reads to ensure
      no cycles are observed.  Details are as follows.
      
      === What does RCU provide? ===
      
      Imagine a simple loop which walks down the demotion path looking
      for the last node:
      
              terminal_node = start_node;
              while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                      terminal_node = node_demotion[terminal_node];
              }
      
      The initial values are:
      
              node_demotion[0] = 1;
              node_demotion[1] = NUMA_NO_NODE;
      
      and are updated to:
      
              node_demotion[0] = NUMA_NO_NODE;
              node_demotion[1] = 0;
      
      What guarantees that a reader never observes the transient cycle:

              node_demotion[0] = 1;
              node_demotion[1] = 0;

      which would make the loop above run forever?
      
      With RCU, a rcu_read_lock/unlock() can be placed around the loop.  Since
      the write side does a synchronize_rcu(), the loop that observed the old
      contents is known to be complete before the synchronize_rcu() has
      completed.
      
      RCU, combined with disable_all_migrate_targets(), ensures that the old
      migration state is not visible by the time __set_migration_target_nodes()
      is called.
      
      === What does READ_ONCE() provide? ===
      
      READ_ONCE() forbids the compiler from merging or reordering successive
      reads of node_demotion[].  This ensures that any updates are *eventually*
      observed.
      
      Consider the above loop again.  The compiler could theoretically read the
      entirety of node_demotion[] into local storage (registers) and never go
      back to memory, and *permanently* observe bad values for node_demotion[].
      
      Note: RCU does not provide any universal compiler-ordering
      guarantees:
      
      	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
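
      Putting the two together, a hedged sketch of a safe walk over
      node_demotion[] (illustrative; not the exact helper in the patch):

              int node = start_node, target;

              rcu_read_lock();        /* hold across *all* reads of node_demotion[] */
              for (;;) {
                      target = READ_ONCE(node_demotion[node]);  /* no merged/cached reads */
                      if (target == NUMA_NO_NODE)
                              break;
                      node = target;
              }
              rcu_read_unlock();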
      
      This code is unused for now.  It will be called later in the
      series.
      
      Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 31 Jul 2021, 1 commit
  4. 13 Jul 2021, 1 commit
  5. 02 Jul 2021, 4 commits
    • mm: rename migrate_pgmap_owner · 6b49bf6d
      By Alistair Popple
      MMU notifier ranges have a migrate_pgmap_owner field which is used by
      drivers to store a pointer.  This is subsequently used by the driver
      callback to filter MMU_NOTIFY_MIGRATE events.  Other notifier event types
      can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
      field to 'owner' and create a new notifier initialisation function to
      initialise this field.
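
      A hedged sketch of what driver-side initialisation might look like with
      such a function; the exact name mmu_notifier_range_init_owner() and its
      argument order are assumptions based on this description:

              struct mmu_notifier_range range;

              /* Stash a driver-owned pointer so the callback can filter events. */
              mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0, vma,
                                            vma->vm_mm, start, end, owner);
              mmu_notifier_invalidate_range_start(&range);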
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Suggested-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/rmap: split migration into its own function · a98a2f0c
      By Alistair Popple
      Migration is currently implemented as a mode of operation for
      try_to_unmap_one(), generally specified by passing the TTU_MIGRATION flag
      or, in the case of splitting a huge anonymous page, TTU_SPLIT_FREEZE.

      However, it does not have much in common with the rest of the unmap
      functionality of try_to_unmap_one(), so splitting it into a separate
      function reduces the complexity of try_to_unmap_one() and makes it more
      readable.
      
      Several simplifications can also be made in try_to_migrate_one() based on
      the following observations:
      
       - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
       - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
       - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.
      
      TTU_SPLIT_FREEZE is a special case of migration used when splitting an
      anonymous page.  This is most easily dealt with by calling the correct
      function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
      for PageAnon or try_to_unmap().
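
      A hedged sketch of the dispatch described above, as it might appear in
      unmap_page() (flags abbreviated; treat the details as assumptions):

              /* Anon pages need migration entries to be frozen; file pages just unmap. */
              if (PageAnon(page))
                      try_to_migrate(page, ttu_flags);
              else
                      try_to_unmap(page, ttu_flags);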
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapops: rework swap entry manipulation code · 4dd845b5
      By Alistair Popple
      Both migration and device private pages use special swap entries that are
      manipulated by a range of inline functions.  The arguments to these are
      somewhat inconsistent, so rework them to remove flag type arguments and to
      make the arguments similar for both read and write entry creation.
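
      A hedged sketch of the direction-specific creation helpers implied by this
      description (the helper names are my reading of the rework and should be
      treated as assumptions):

              swp_entry_t entry;

              /* Separate read/write helpers replace a "writable" flag argument. */
              if (pte_write(pteval))
                      entry = make_writable_migration_entry(page_to_pfn(page));
              else
                      entry = make_readable_migration_entry(page_to_pfn(page));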
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove special swap entry functions · af5cdaf8
      By Alistair Popple
      Patch series "Add support for SVM atomics in Nouveau", v11.
      
      Introduction
      ============
      
      Some devices have features such as atomic PTE bits that can be used to
      implement atomic access to system memory.  To support atomic operations to
      a shared virtual memory page such a device needs access to that page which
      is exclusive of the CPU.  This series introduces a mechanism to
      temporarily unmap pages granting exclusive access to a device.
      
      These changes are required to support OpenCL atomic operations in Nouveau
      to shared virtual memory (SVM) regions allocated with the
      CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
      OpenCL SVM feature is available at
      https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory
      
      Implementation
      ==============
      
      Exclusive device access is implemented by adding a new swap entry type
      (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
      difference is that on fault the original entry is immediately restored by
      the fault handler instead of waiting.
      
      Restoring the entry triggers calls to MMU notifiers, which allows a device
      driver to revoke the atomic access permission from the GPU prior to the
      CPU finalising the entry.
      
      Patches
      =======
      
      Patches 1 & 2 refactor existing migration and device private entry
      functions.
      
      Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
      functionality into separate functions - try_to_migrate_one() and
      try_to_munlock_one().
      
      Patch 5 renames some existing code but does not introduce functionality.
      
      Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
      
      Patch 7 contains the bulk of the implementation for device exclusive
      memory.
      
      Patch 8 contains some additions to the HMM selftests to ensure everything
      works as expected.
      
      Patch 9 is a cleanup for the Nouveau SVM implementation.
      
      Patch 10 contains the implementation of atomic access for the Nouveau
      driver.
      
      Testing
      =======
      
      This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
      which checks that GPU atomic accesses to system memory are atomic.
      Without this series the test fails as there is no way of write-protecting
      the page mapping which results in the device clobbering CPU writes.  For
      reference the test is available at
      https://ozlabs.org/~apopple/opencl_svm_atomics/
      
      Further testing has been performed by adding support for testing exclusive
      access to the hmm-tests kselftests.
      
      This patch (of 10):
      
      Remove multiple similar inline functions for dealing with different types
      of special swap entries.
      
      Both migration and device private swap entries use the swap offset to
      store a pfn.  Instead of multiple inline functions to obtain a struct page
      for each swap entry type use a common function pfn_swap_entry_to_page().
      Also open-code the various entry_to_pfn() functions as this results in
      shorter code that is easier to understand.
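
      A hedged sketch of the consolidated helper in use (simplified):

              swp_entry_t entry = pte_to_swp_entry(pte);

              if (is_migration_entry(entry) || is_device_private_entry(entry)) {
                      /* One helper now covers both pfn-backed special entry types. */
                      struct page *page = pfn_swap_entry_to_page(entry);
                      /* ... operate on the page ... */
              }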
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
      Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 01 Jul 2021, 8 commits
    • mm: migrate: check mapcount for THP instead of refcount · 662aeea7
      By Yang Shi
      The generic migration path will check the refcount, so there is no need to
      check it here.  But the old code actually prevented migrating a shared THP
      (mapped by multiple processes), so bail out early if mapcount is > 1 to
      keep that behavior.
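
      A hedged sketch of the early bail-out this describes (whether the exact
      mapcount helper is total_mapcount() is an assumption):

              /* Don't migrate a THP that is mapped by more than one process. */
              if (PageTransHuge(page) && total_mapcount(page) > 1)
                      return 0;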
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: don't split THP for misplaced NUMA page · b0b515bf
      By Yang Shi
      The old behavior didn't split the THP if migration failed due to lack of
      memory on the target node.  But the generic THP migration path does split
      the THP, so keep the old behavior for misplaced NUMA page migration.
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: account THP NUMA migration counters correctly · c5fc5c3a
      By Yang Shi
      Now that both base page and THP NUMA migration are done via
      migrate_misplaced_page(), keep the counters correct for THP as well.
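
      A hedged sketch of what "correct for THP" means in practice, i.e.
      thp_nr_pages() based accounting (the specific counter shown is illustrative):

              int nr_pages = thp_nr_pages(page);      /* e.g. 512 for a 2M THP, 1 otherwise */

              count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages);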
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-5-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: refactor NUMA fault handling · c5b5a3dd
      By Yang Shi
      When THP NUMA fault support was added, THP migration was not supported
      yet, so an ad hoc THP migration was implemented in the NUMA fault handling
      path.  THP migration has been supported since v4.14, so it doesn't make
      much sense to keep another THP migration implementation rather than using
      the generic migration code.
      
      This patch reworks the NUMA fault handling to use the generic migration
      implementation to migrate misplaced pages.  There is no functional change.
      
      After the refactor the flow of NUMA fault handling looks just like its
      PTE counterpart:
        Acquire ptl
        Prepare for migration (elevate page refcount)
        Release ptl
        Isolate page from lru and elevate page refcount
        Migrate the misplaced THP
      
      If migration fails just restore the old normal PMD.
      
      In the old code the anon_vma lock was needed to serialize THP migration
      against THP split, but the THP code has been reworked a lot since then, and
      it seems the anon_vma lock is not required anymore to avoid the race.

      The page refcount elevation while holding ptl should prevent the THP from
      being split.
      
      Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
      and remove all the dead and duplicate code.
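
      A hedged, heavily simplified sketch of the resulting THP fault path
      (variable names illustrative):

              spin_unlock(vmf->ptl);                  /* ptl released before migration */
              migrated = migrate_misplaced_page(page, vma, target_nid);
              if (migrated)
                      page_nid = target_nid;
              else
                      flags |= TNF_MIGRATE_FAIL;      /* the old PMD was restored */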
      
      [dan.carpenter@oracle.com: fix a double unlock bug]
        Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: fix missing update page_private to hugetlb_page_subpool · 6acfb5ba
      By Muchun Song
      Commit d6995da3 ("hugetlb: use page.private for hugetlb specific
      page flags") converted page.private to hold hugetlb specific page flags,
      so we should use hugetlb_page_subpool() to get the subpool pointer instead
      of page_private().
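
      A hedged sketch of the kind of change described (not necessarily the exact
      hunk):

              struct hugepage_subpool *spool;

              /* page_private() now carries hugetlb flags, so fetch the subpool explicitly. */
              spool = hugetlb_page_subpool(hpage);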
      
      This 'could' prevent the migration of hugetlb pages.  page_private(hpage)
      is now used for hugetlb page specific flags.  At migration time, the only
      flag which could be set is HPageVmemmapOptimized.  This flag will only be
      set if the new vmemmap reduction feature is enabled.  In addition,
      !page_mapping() implies an anonymous mapping.  So, this will prevent
      migration of hugetlb pages in anonymous mappings if the vmemmap reduction
      feature is enabled.
      
      In addition, that if statement checked for the rare race condition of a
      page being migrated while in the process of being freed.  Since that check
      is now wrong, we could leak hugetlb subpool usage counts.
      
      The commit forgot to update it in the page migration routine.  So fix it.
      
      [songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
        Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com
      
      Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
      Fixes: d6995da3 ("hugetlb: use page.private for hugetlb specific page flags")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>	[arm64]
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY · 8cc5fcbb
      By Mina Almasry
      On UFFDIO_COPY, if we fail to copy the page contents while holding the
      hugetlb_fault_mutex, we will drop the mutex and return to the caller after
      allocating a page that consumed a reservation.  In this case there may be
      a fault that double consumes the reservation.  To handle this, we free the
      allocated page, fix the reservations, and allocate a temporary hugetlb
      page and return that to the caller.  When the caller does the copy outside
      of the lock, we again check the cache, allocate a page consuming the
      reservation, and copy over the contents.
      
      Test:
      Hacked the code locally such that resv_huge_pages underflows produce
      a warning and the copy_huge_page_from_user() always fails, then:
      
      ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
              2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      ./tools/testing/selftests/vm/userfaultfd hugetlb 10
      	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      
      Both tests succeed and produce no warnings.  After the test runs, the
      number of free/resv hugepages is correct.
      
      [yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
        Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
      [almasrymina@google.com: fix allocation error check and copy func name]
        Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
      
      Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: change parameters of arch_make_huge_pte() · 79c1c594
      By Christophe Leroy
      Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2.
      
      This series implements huge VMAP and VMALLOC on powerpc 8xx.
      
      Powerpc 8xx has 4 page sizes:
      - 4k
      - 16k
      - 512k
      - 8M
      
      At the time being, vmalloc and vmap only support huge pages which are
      leaf at PMD level.
      
      Here the PMD level is 4M; it doesn't correspond to any supported
      page size.
      
      For now, implement use of 16k and 512k pages which is done
      at PTE level.
      
      Support for 8M pages will be implemented later; it requires the use of
      hugepd tables.
      
      To allow this, the architecture provides two functions:
      - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
      page size to use. A stub returning PAGE_SIZE is provided when the
      architecture doesn't provide this function.
      - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
      what page shift to use for a given area size. A stub returning
      PAGE_SHIFT is provided when the architecture doesn't provide this
      function.
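
      A hedged sketch of the generic stubs described above; the exact parameter
      lists are assumptions based on this description:

              #ifndef arch_vmap_pte_range_map_size
              static inline unsigned long
              arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end,
                                           u64 pfn, unsigned int max_page_shift)
              {
                      return PAGE_SIZE;       /* default: map at base page size */
              }
              #endif

              #ifndef arch_vmap_pte_supported_shift
              static inline int arch_vmap_pte_supported_shift(unsigned long size)
              {
                      return PAGE_SHIFT;      /* default: no huge PTE mappings */
              }
              #endif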
      
      This patch (of 5):
      
      At the time being, arch_make_huge_pte() has the following prototype:
      
        pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
      			   struct page *page, int writable);
      
      vma is used to get the page shift or size.
      vma is also used on Sparc to get vm_flags.
      page is not used.
      writable is not used.

      In order to use this function without a vma, replace vma by shift and
      flags.  Also remove the unused parameters.
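
      For reference, a hedged sketch of the resulting prototype implied by that
      description (exact types are an assumption):

              /* Before: */
              pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
                                       struct page *page, int writable);

              /* After: shift and flags replace vma; the unused parameters are gone. */
              pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);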
      
      Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
      Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page · ad2fa371
      By Muchun Song
      When we free a HugeTLB page to the buddy allocator, we need to allocate
      the vmemmap pages associated with it.  However, we may not be able to
      allocate the vmemmap pages when the system is under memory pressure.  In
      this case, we just refuse to free the HugeTLB page.  This changes behavior
      in some corner cases as listed below:
      
       1) Failing to free a huge page triggered by the user (decrease nr_pages).
      
          User needs to try again later.
      
       2) Failing to free a surplus huge page when freed by the application.
      
          Try again later when freeing a huge page next time.
      
       3) Failing to dissolve a free huge page on ZONE_MOVABLE via
          offline_pages().
      
          This can happen when we have plenty of ZONE_MOVABLE memory, but
          not enough kernel memory to allocate vmemmap pages.  We may even
          be able to migrate huge page contents, but will not be able to
          dissolve the source huge page.  This will prevent an offline
          operation and is unfortunate as memory offlining is expected to
          succeed on movable zones.  Users that depend on memory hotplug
          to succeed for movable zones should carefully consider whether the
          memory savings gained from this feature are worth the risk of
          possibly not being able to offline memory in certain situations.
      
       4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
          alloc_contig_range() - once we have that handling in place. Mainly
          affects CMA and virtio-mem.
      
          Similar to 3), virtio-mem will handle migration errors gracefully.
          CMA might be able to fallback on other free areas within the CMA
          region.
      
      Vmemmap pages are allocated from the page freeing context.  In order for
      those allocations not to be disruptive (e.g. trigger the oom killer),
      __GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
      a non-sleeping allocation would be too fragile and it could fail too
      easily under memory pressure.  GFP_ATOMIC or other modes to access memory
      reserves are not used because we want to prevent consuming reserves under
      heavy hugetlb freeing.
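
      A hedged sketch of the allocation pattern described above (the exact gfp
      mask and locking in the patch may differ):

              struct page *page;

              spin_unlock_irq(&hugetlb_lock);         /* sleepable context for the allocation */
              /* Best-effort attempt: may fail under pressure, never triggers the OOM killer. */
              page = alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY, 0);
              spin_lock_irq(&hugetlb_lock);
              if (!page)
                      return -ENOMEM;                 /* refuse to free the HugeTLB page for now */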
      
      [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
        Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
      [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 30 Jun 2021, 1 commit
  8. 17 Jun 2021, 1 commit
  9. 07 May 2021, 1 commit
  10. 06 May 2021, 8 commits
  11. 01 May 2021, 1 commit
  12. 25 Feb 2021, 2 commits
  13. 06 Feb 2021, 1 commit
  14. 25 Jan 2021, 2 commits