1. 01 July 2021: 32 commits
    • Y
      mm: memory: add orig_pmd to struct vm_fault · 5db4f15c
Committed by Yang Shi
Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.
      
      When the THP NUMA fault support was added THP migration was not supported
      yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
THP migration has been supported since v4.14, so it doesn't make much
sense to keep another THP migration implementation rather than using
the generic migration code.  It is definitely a maintenance burden to keep
two THP migration implementations for different code paths, and it is more
error prone.  Using the generic THP migration implementation allows us to
remove the duplicate code and some hacks needed by the old ad hoc
implementation.
      
A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both
THP and NUMA balancing.  All of them support THP migration except for
      S390.  Zi Yan tried to add THP migration support for S390 before but it
      was not accepted due to the design of S390 PMD.  For the discussion,
      please see: https://lkml.org/lkml/2018/4/27/953.
      
Per the discussion with Gerald Schaefer in v1 it is acceptable to skip
      huge PMD for S390 for now.
      
I saw there were some gup hacks in the git history, but I didn't
figure out whether they have been removed or not, since I just found the
FOLL_NUMA code in the current gup implementation and it seems useful.
      
      Patch #1 ~ #2 are preparation patches.
      Patch #3 is the real meat.
      Patch #4 ~ #6 keep consistent counters and behaviors with before.
Patch #7 skips changing huge PMDs to prot_none if THP migration is not supported.
      
      Test
      ----
Did some tests to measure the latency of do_huge_pmd_numa_page.  The test
VM has 80 vcpus and 64G memory.  The test creates 2 processes that together
consume 128G of memory, which incurs memory pressure and causes THP splits.
It also creates 80 processes to hog CPU, and the memory consumer
processes are bound to different nodes periodically in order to
increase NUMA faults.
      
      The below test script is used:
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Run stress-ng for 24 hours
      ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
      PID=$!
      
      ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &
      
      # Wait for vm stressors forked
      sleep 5
      
      PID_1=`pgrep -P $PID | awk 'NR == 1'`
      PID_2=`pgrep -P $PID | awk 'NR == 2'`
      
      JOB1=`pgrep -P $PID_1`
      JOB2=`pgrep -P $PID_2`
      
      # Bind load jobs to different nodes periodically to force generate
      # cross node memory access
      while [ -d "/proc/$PID" ]
      do
              taskset -apc 8 $JOB1
              taskset -apc 8 $JOB2
              sleep 300
              taskset -apc 58 $JOB1
              taskset -apc 58 $JOB2
              sleep 300
      done
      
      With the above test the histogram of latency of do_huge_pmd_numa_page is
as shown below.  Since the number of do_huge_pmd_numa_page calls varies
drastically between runs (likely due to the scheduler), I converted the
raw numbers to percentages.
      
                                   patched               base
      @us[stress-ng]:
      [0]                          3.57%                 0.16%
      [1]                          55.68%                18.36%
      [2, 4)                       10.46%                40.44%
      [4, 8)                       7.26%                 17.82%
      [8, 16)                      21.12%                13.41%
      [16, 32)                     1.06%                 4.27%
      [32, 64)                     0.56%                 4.07%
      [64, 128)                    0.16%                 0.35%
      [128, 256)                   < 0.1%                < 0.1%
      [256, 512)                   < 0.1%                < 0.1%
      [512, 1K)                    < 0.1%                < 0.1%
      [1K, 2K)                     < 0.1%                < 0.1%
      [2K, 4K)                     < 0.1%                < 0.1%
      [4K, 8K)                     < 0.1%                < 0.1%
      [8K, 16K)                    < 0.1%                < 0.1%
      [16K, 32K)                   < 0.1%                < 0.1%
      [32K, 64K)                   < 0.1%                < 0.1%
      
Per the result, the patched kernel is even slightly better than the base
kernel.  I think this is because the lock contention against THP split is
lower than in the base kernel due to the refactoring.
      
To exclude the effect of THP splits, I also tested without memory pressure.
No obvious regression is spotted.  Below is the test result *w/o*
memory pressure.
      
                                 patched                  base
      @us[stress-ng]:
      [0]                        7.97%                   18.4%
      [1]                        69.63%                  58.24%
      [2, 4)                     4.18%                   2.63%
      [4, 8)                     0.22%                   0.17%
      [8, 16)                    1.03%                   0.92%
      [16, 32)                   0.14%                   < 0.1%
      [32, 64)                   < 0.1%                  < 0.1%
      [64, 128)                  < 0.1%                  < 0.1%
      [128, 256)                 < 0.1%                  < 0.1%
      [256, 512)                 0.45%                   1.19%
      [512, 1K)                  15.45%                  17.27%
      [1K, 2K)                   < 0.1%                  < 0.1%
      [2K, 4K)                   < 0.1%                  < 0.1%
      [4K, 8K)                   < 0.1%                  < 0.1%
      [8K, 16K)                  0.86%                   0.88%
      [16K, 32K)                 < 0.1%                  0.15%
      [32K, 64K)                 < 0.1%                  < 0.1%
      [64K, 128K)                < 0.1%                  < 0.1%
      [128K, 256K)               < 0.1%                  < 0.1%
      
      The series also survived a series of tests that exercise NUMA balancing
      migrations by Mel.
      
      This patch (of 7):
      
Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by the huge
page fault path can be removed, just like its PTE counterpart.
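
A rough sketch of the structural change (field placement and comments are
illustrative, not the exact upstream diff):

	struct vm_fault {
		/* ... existing fields ... */
		pmd_t *pmd;		/* pointer to the faulting pmd entry */
		pte_t orig_pte;		/* value of the PTE at fault time */
		pmd_t orig_pmd;		/* value of the PMD at fault time, so huge
					 * page fault handlers can read it from
					 * vmf instead of taking an extra
					 * "orig_pmd" parameter */
	};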
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5db4f15c
    • M
      mm: migrate: fix missing update page_private to hugetlb_page_subpool · 6acfb5ba
Committed by Muchun Song
Commit d6995da3 ("hugetlb: use page.private for hugetlb specific
page flags") converted page.private to hold hugetlb-specific page flags, so
we should use hugetlb_page_subpool() to get the subpool pointer instead of
page_private().
      
      This 'could' prevent the migration of hugetlb pages.  page_private(hpage)
      is now used for hugetlb page specific flags.  At migration time, the only
      flag which could be set is HPageVmemmapOptimized.  This flag will only be
      set if the new vmemmap reduction feature is enabled.  In addition,
      !page_mapping() implies an anonymous mapping.  So, this will prevent
migration of hugetlb pages in anonymous mappings if the vmemmap reduction
      feature is enabled.
      
      In addition, that if statement checked for the rare race condition of a
      page being migrated while in the process of being freed.  Since that check
      is now wrong, we could leak hugetlb subpool usage counts.
      
      The commit forgot to update it in the page migration routine.  So fix it.
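
A minimal sketch of the kind of change needed in the migration path
(abbreviated, assuming the check lives in the hugetlb migration routine):

	/* Before: page_private() now holds hugetlb page flags, not the subpool. */
	if (page_private(hpage) && !page_mapping(hpage)) {
		rc = -EBUSY;
		goto out;
	}

	/* After: query the subpool through the accessor instead. */
	if (hugetlb_page_subpool(hpage) && !page_mapping(hpage)) {
		rc = -EBUSY;
		goto out;
	}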
      
      [songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
        Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com
      
      Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
      Fixes: d6995da3 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>	[arm64]
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6acfb5ba
    • A
      arm64/mm: drop HAVE_ARCH_PFN_VALID · 16c9afc7
Committed by Anshuman Khandual
      CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64
      platforms and free_unused_memmap() would just return without creating any
      holes in the memmap mapping.  There is no need for any special handling in
      pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped.  This also moves
      the pfn upper bits sanity check into generic pfn_valid().
      
Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16c9afc7
    • M
      memblock: update initialization of reserved pages · 9092d4f7
Committed by Mike Rapoport
The struct pages representing a reserved memory region are initialized
using the reserve_bootmem_range() function.  This function is called for each
      reserved region just before the memory is freed from memblock to the buddy
      page allocator.
      
      The struct pages for MEMBLOCK_NOMAP regions are kept with the default
      values set by the memory map initialization which makes it necessary to
      have a special treatment for such pages in pfn_valid() and
      pfn_valid_within().
      
      Split out initialization of the reserved pages to a function with a
      meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
      reserved regions and mark struct pages for the NOMAP regions as
      PageReserved.
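
Schematically, the idea looks like the sketch below (function and iterator
names follow memblock conventions but are simplified here, not the exact
upstream code):

	/* Mark every struct page in [start, end) as reserved. */
	static void __init reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
	{
		unsigned long pfn;

		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
			__SetPageReserved(pfn_to_page(pfn));
	}

	static void __init memmap_init_reserved_pages(void)
	{
		struct memblock_region *region;
		phys_addr_t start, end;
		u64 i;

		/* Initialize struct pages for ordinary reserved regions... */
		for_each_reserved_mem_range(i, &start, &end)
			reserve_bootmem_region(start, end);

		/* ...and treat MEMBLOCK_NOMAP regions the same way. */
		for_each_mem_region(region)
			if (memblock_is_nomap(region))
				reserve_bootmem_region(region->base,
						       region->base + region->size);
	}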
      
Link: https://lkml.kernel.org/r/20210511100550.28178-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9092d4f7
    • M
      include/linux/mmzone.h: add documentation for pfn_valid() · 51c656ae
Committed by Mike Rapoport
      Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4.
      
      These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
      pfn_valid_within() to 1.
      
      The idea is to mark NOMAP pages as reserved in the memory map and restore
      the intended semantics of pfn_valid() to designate availability of struct
      page for a pfn.
      
      With this the core mm will be able to cope with the fact that it cannot
      use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER
      blocks will be treated correctly even without the need for
      pfn_valid_within.
      
      This patch (of 4):
      
Add a comment describing the semantics of pfn_valid() that clarifies that
      pfn_valid() only checks for availability of a memory map entry (i.e.
      struct page) for a PFN rather than availability of usable memory backing
      that PFN.
      
      The most "generic" version of pfn_valid() used by the configurations with
      SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
      suitable place for documentation about semantics of pfn_valid().
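
The documented semantics boil down to a kernel-doc comment along these lines
(paraphrased, not the exact upstream wording):

	/**
	 * pfn_valid - check whether a memory map entry exists for a PFN
	 * @pfn: the page frame number to check
	 *
	 * Returns true only if there is a struct page backing @pfn.  It says
	 * nothing about whether the underlying memory is usable: the page may
	 * be reserved, NOMAP, or otherwise unavailable to the page allocator.
	 */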
      
      Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51c656ae
    • B
      mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies · 269fbe72
Committed by Ben Widawsky
      Current structure 'mempolicy' uses a union to store the node info for
bind/interleave/prefer policies.
      
      	union {
      		short 		 preferred_node; /* preferred */
      		nodemask_t	 nodes;		/* interleave/bind */
      		/* undefined for default */
      	} v;
      
Since a preferred node can also be represented by a nodemask_t with only one
bit set, unify these policies by using one nodemask_t 'nodes'.  This removes
the union, simplifies the code, and makes it easier to support the node info
of any future policy.
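
After the change, the node information collapses to a single field; roughly
(simplified sketch, not the full struct):

	struct mempolicy {
		atomic_t refcnt;
		unsigned short mode;	/* MPOL_DEFAULT/BIND/INTERLEAVE/PREFERRED/... */
		unsigned short flags;	/* MPOL_F_* */
		nodemask_t nodes;	/* bind/interleave/prefer; "preferred" is
					 * simply a mask with a single bit set */
		/* ... */
	};

	/* e.g. setting up a preferred-node policy: */
	nodes_clear(pol->nodes);
	node_set(preferred_nid, pol->nodes);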
      
      Link: https://lore.kernel.org/r/20200630212517.308045-7-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1623399825-75651-1-git-send-email-feng.tang@intel.com
Co-developed-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      269fbe72
    • F
      mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy · 7858d7bc
Committed by Feng Tang
The MPOL_LOCAL policy has been set up as a real policy, but it is still handled
like a fake MPOL_PREFERRED policy with the internal MPOL_F_LOCAL flag bit
set, and there are many places that have to distinguish the real 'prefer'
policy from the 'local' one, which is quite confusing.
      
      In current code, there are 4 cases that MPOL_LOCAL are used:
      
      1. user specifies 'local' policy
      
      2. user specifies 'prefer' policy, but with empty nodemask
      
      3. system 'default' policy is used
      
4. 'prefer' policy + valid 'preferred' node with the MPOL_F_STATIC_NODES
   flag set; when it is rebound to a nodemask which doesn't contain
   the 'preferred' node, it behaves as the 'local' policy
      
So make 'local' a real policy instead of a fake 'prefer' one, and kill the
MPOL_F_LOCAL bit, which greatly reduces confusion when reading the code.
      
      For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal
      Hocko pointed out:
      
      : I do believe that rebinding preferred policy is just bogus and it should
      : be dropped altogether on the ground that a preference is a mere hint from
      : userspace where to start the allocation.  Unless I am missing something
      : cpusets will be always authoritative for the final placement.  The
      : preferred node just acts as a starting point and it should be really
      : preserved when cpusets changes.  Otherwise we have a very subtle behavior
      : corner cases.
      
So drop all the tricky transformations between 'prefer' and 'local', and
just record the new nodemask on rebinding.
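
For readers of the policy, the net effect is that 'local' becomes its own
case rather than a flag check; schematically (simplified, not the exact diff):

	/* Before: "local" hidden behind a flag on MPOL_PREFERRED. */
	if (pol->mode == MPOL_PREFERRED) {
		if (pol->flags & MPOL_F_LOCAL)
			nid = numa_node_id();		/* really "local" */
		else
			nid = first_node(pol->nodes);
	}

	/* After: MPOL_LOCAL stands on its own and MPOL_F_LOCAL is gone. */
	switch (pol->mode) {
	case MPOL_PREFERRED:
		nid = first_node(pol->nodes);
		break;
	case MPOL_LOCAL:
		nid = numa_node_id();
		break;
	}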
      
      [feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko]
        Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com
      [feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal]
        Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com
      
Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7858d7bc
    • F
      mm/mempolicy: cleanup nodemask intersection check for oom · b26e517a
Committed by Feng Tang
      Patch series "mm/mempolicy: some fix and semantics cleanup", v4.
      
Current memory policy code has some confusing and ambiguous parts around the
MPOL_LOCAL policy, as it is handled as a fake MPOL_PREFERRED one, and
there are many places that have to distinguish them.  Also the nodemask
intersection check needs cleanup to be more explicit for OOM use, and to
handle MPOL_INTERLEAVE correctly.  This patchset cleans these up and
unifies the parameter sanity checks for mbind() and set_mempolicy().
      
      This patch (of 3):
      
mempolicy_nodemask_intersects() seems to be a general-purpose mempolicy
function.  In fact it is partially tailored for the OOM purpose
instead.  The OOM proper is the only existing user, so rename the
function to make that purpose explicit.

While at it, drop the MPOL_INTERLEAVE case, as those allocations never have a
nodemask defined (see alloc_page_interleave), so this is dead code and
a confusing one, because MPOL_INTERLEAVE is a hint rather than a hard
requirement and shouldn't be considered during OOM.
      
      The final code can be reduced to a check for MPOL_BIND which is the
      only memory policy that is a hard requirement and thus relevant to a
      constrained OOM logic.
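
The resulting check is essentially an MPOL_BIND-only test; a sketch (the
renamed helper is shown here as mempolicy_in_oom_domain() to match the
OOM-specific purpose described above; exact name and signature may differ):

	/* Does tsk's hard-bound policy intersect the OOM-constrained nodemask? */
	static bool mempolicy_in_oom_domain(struct task_struct *tsk,
					    const nodemask_t *mask)
	{
		struct mempolicy *mempolicy;
		bool ret = true;

		if (!mask)
			return ret;

		task_lock(tsk);
		mempolicy = tsk->mempolicy;
		/* MPOL_BIND is the only hard requirement worth checking. */
		if (mempolicy && mempolicy->mode == MPOL_BIND)
			ret = nodes_intersects(mempolicy->nodes, *mask);
		task_unlock(tsk);

		return ret;
	}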
      
      [mhocko@suse.com: changelog edits]
      
      Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com
      Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com
      Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b26e517a
    • M
      mm/zbud: don't export any zbud API · 2a03085c
Committed by Miaohe Lin
zbud doesn't need to export any API; it is meant to be used via the
zpool API since commit 12d79d64 ("mm/zpool: update zswap to use
zpool").  So we can remove the unneeded zbud.h and move the zpool API down to
avoid any forward declarations.
      
      [linmiaohe@huawei.com: fix unused function warnings when CONFIG_ZPOOL is disabled]
        Link: https://lkml.kernel.org/r/20210619025508.1239386-1-linmiaohe@huawei.com
      
Link: https://lkml.kernel.org/r/20210608114515.206992-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a03085c
    • D
      mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() · 82840451
Committed by David Hildenbrand
      A driver might set a page logically offline -- PageOffline() -- and turn
      the page inaccessible in the hypervisor; after that, access to page
      content can be fatal.  One example is virtio-mem; while unplugged memory
-- marked as PageOffline() -- can currently be read in the hypervisor, this
      will no longer be the case in the future; for example, when having a
      virtio-mem device backed by huge pages in the hypervisor.
      
      Some special PFN walkers -- i.e., /proc/kcore -- read content of random
      pages after checking PageOffline(); however, these PFN walkers can race
      with drivers that set PageOffline().
      
      Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing.
      
page_offline_freeze()/page_offline_thaw() allow a subsystem to
synchronize with such drivers, ensuring that a page cannot be set
PageOffline() while frozen.
      
      page_offline_begin()/page_offline_end() is used by drivers that care about
      such races when setting a page PageOffline().
      
      For simplicity, use a rwsem for now; neither drivers nor users are
      performance sensitive.
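
With a plain rwsem, the four helpers are essentially the following (sketch;
the lock name is an assumption):

	static DECLARE_RWSEM(page_offline_rwsem);

	/* Drivers wrap the code that sets PageOffline() in begin/end (readers,
	 * so several drivers can mark pages offline concurrently). */
	void page_offline_begin(void)  { down_read(&page_offline_rwsem); }
	void page_offline_end(void)    { up_read(&page_offline_rwsem); }

	/* PFN walkers such as /proc/kcore freeze/thaw around their PageOffline()
	 * checks, so no page can become PageOffline() while they look at it. */
	void page_offline_freeze(void) { down_write(&page_offline_rwsem); }
	void page_offline_thaw(void)   { up_write(&page_offline_rwsem); }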
      
Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82840451
    • D
      fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages · 0daa322b
Committed by David Hildenbrand
      Let's avoid reading:
      
      1) Offline memory sections: the content of offline memory sections is
         stale as the memory is effectively unused by the kernel.  On s390x with
         standby memory, offline memory sections (belonging to offline storage
         increments) are not accessible.  With virtio-mem and the hyper-v
         balloon, we can have unavailable memory chunks that should not be
         accessed inside offline memory sections.  Last but not least, offline
         memory sections might contain hwpoisoned pages which we can no longer
         identify because the memmap is stale.
      
      2) PG_offline pages: logically offline pages that are documented as
         "The content of these pages is effectively stale.  Such pages should
         not be touched (read/write/dump/save) except by their owner.".
Examples include pages inflated in a balloon or unavailable memory
         ranges inside hotplugged memory sections with virtio-mem or the hyper-v
         balloon.
      
      3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
         As documented: "Accessing is not safe since it may cause another
         machine check.  Don't touch!"
      
Introduce is_page_hwpoison(), adding a comment that it is inherently racy
but the best we can really do.
      
      Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are excluded.
      It's also similar to hibernation code, however, we don't skip hwpoisoned
      pages when processing pages in kernel/power/snapshot.c:saveable_page()
      yet.
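
A sketch of the per-page filter this implies in the kcore read path (the
helper name kcore_page_readable() is made up for illustration; the real code
is inlined in the read loop and also skips whole offline memory sections):

	static bool kcore_page_readable(struct page *page)
	{
		if (!page)
			return false;		/* no memmap entry for this pfn */
		if (PageOffline(page))
			return false;		/* stale content, owner only */
		if (is_page_hwpoison(page))
			return false;		/* touching it may machine-check */
		return true;
	}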
      
      Note 1: we can race against memory offlining code, especially memory going
      offline and getting unplugged: however, we will properly tear down the
      identity mapping and handle faults gracefully when accessing this memory
      from kcore code.
      
      Note 2: we can race against drivers setting PageOffline() and turning
      memory inaccessible in the hypervisor.  We'll handle this in a follow-up
      patch.
      
Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0daa322b
    • D
      fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER · 3c36b419
Committed by David Hildenbrand
      Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3.
      
      Looking for places where the kernel might unconditionally read
      PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
      needs some more love to not touch some other pages we really don't want to
      read -- i.e., hwpoisoned ones.
      
      Examples for PageOffline() pages are pages inflated in a balloon, memory
      unplugged via virtio-mem, and partially-present sections in memory added
      by the Hyper-V balloon.
      
      When reading pages inflated in a balloon, we essentially produce
      unnecessary load in the hypervisor; holes in partially present sections in
      case of Hyper-V are not accessible and already were a problem for
      /proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages.  In
      the future, virtio-mem might disallow reading unplugged memory -- marked
      as PageOffline() -- in some environments, resulting in undefined behavior
      when accessed; therefore, I'm trying to identify and rework all these
      (corner) cases.
      
      With this series, there is really only access via /dev/mem, /proc/vmcore
      and kdb left after I ripped out /dev/kmem.  kdb is an advanced corner-case
      use case -- we won't care for now if someone explicitly tries to do nasty
      things by reading from/writing to physical addresses we better not touch.
      /dev/mem is a use case we won't support for virtio-mem, at least for now,
      so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
      /proc/vmcore is really only a problem when dumping the old kernel via
      something that's not makedumpfile (read: basically never), however, we'll
      try sanitizing that as well in the second kernel in the future.
      
      Tested via kcore_dump:
      	https://github.com/schlafwandler/kcore_dump
      
      This patch (of 6):
      
      Commit db779ef6 ("proc/kcore: Remove unused kclist_add_remap()")
      removed the last user of KCORE_REMAP.
      
      Commit 595dd46e ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
      dumping vsyscall user page") removed the last user of KCORE_OTHER.
      
      Let's drop both types.  While at it, also drop vaddr in "struct
      kcore_list", used by KCORE_REMAP only.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c36b419
    • Y
      include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low · 764c04a9
Committed by Yu Zhao
      mm_vmscan_inactive_list_is_low has no users after commit b91ac374
      ("mm: vmscan: enforce inactive:active ratio at the reclaim root").
      
      Remove it.
      
Link: https://lkml.kernel.org/r/20210614194554.2683395-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      764c04a9
    • A
      userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() · 7d64ae3a
Committed by Axel Rasmussen
      In a previous commit, we added the mfill_atomic_install_pte() helper.
      This helper does the job of setting up PTEs for an existing page, to map
      it into a given VMA.  It deals with both the anon and shmem cases, as well
      as the shared and private cases.
      
      In other words, shmem_mfill_atomic_pte() duplicates a case it already
      handles.  So, expose it, and let shmem_mfill_atomic_pte() use it directly,
      to reduce code duplication.
      
      This requires that we refactor shmem_mfill_atomic_pte() a bit:
      
      Instead of doing accounting (shmem_recalc_inode() et al) part-way through
      the PTE setup, do it afterward.  This frees up mfill_atomic_install_pte()
      from having to care about this accounting, and means we don't need to e.g.
      shmem_uncharge() in the error path.
      
      A side effect is this switches shmem_mfill_atomic_pte() to use
      lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
      This wrapper does some extra accounting in an exceptional case, if
      appropriate, so it's actually the more correct thing to use.
      
Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d64ae3a
    • A
      userfaultfd/shmem: advertise shmem minor fault support · 964ab004
Committed by Axel Rasmussen
      Now that the feature is fully implemented (the faulting path hooks exist
      so userspace is notified, and the ioctl to resolve such faults is
      available), advertise this as a supported feature.
      
Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964ab004
    • A
      userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte · 3460f6e5
Committed by Axel Rasmussen
      Patch series "userfaultfd: add minor fault handling for shmem", v6.
      
      Overview
      ========
      
      See the series which added minor faults for hugetlbfs [3] for a detailed
      overview of minor fault handling in general.  This series adds the same
      support for shmem-backed areas.
      
      This series is structured as follows:
      
      - Commits 1 and 2 are cleanups.
      - Commits 3 and 4 implement the new feature (minor fault handling for shmem).
      - Commit 5 advertises that the feature is now available since at this point it's
        fully implemented.
      - Commit 6 is a final cleanup, modifying an existing code path to re-use a new
        helper we've introduced.
      - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.
      
      Use Case
      ========
      
      In some cases it is useful to have VM memory backed by tmpfs instead of
      hugetlbfs.  So, this feature will be used to support the same VM live
      migration use case described in my original series.
      
      Additionally, Android folks (Lokesh Gidra <lokeshgidra@google.com>) hope
      to optimize the Android Runtime garbage collector using this feature:
      
      "The plan is to use userfaultfd for concurrently compacting the heap.
      With this feature, the heap can be shared-mapped at another location where
      the GC-thread(s) could continue the compaction operation without the need
      to invoke userfault ioctl(UFFDIO_COPY) each time.  OTOH, if and when Java
      threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
      execution.  Furthermore, this feature enables updating references in the
      'non-moving' portion of the heap efficiently.  Without this feature,
unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."
      
      [1] https://lore.kernel.org/patchwork/cover/1388144/
      [2] https://lore.kernel.org/patchwork/patch/1408161/
      [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t
      
      This patch (of 9):
      
      Previously, we did a dance where we had one calling path in userfaultfd.c
      (mfill_atomic_pte), but then we split it into two in shmem_fs.h
      (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single
      shared function in shmem.c (shmem_mfill_atomic_pte).
      
      This is all a bit overly complex.  Just call the single combined shmem
      function directly, allowing us to clean up various branches, boilerplate,
      etc.
      
      While we're touching this function, two other small cleanup changes:
      - offset is equivalent to pgoff, so we can get rid of offset entirely.
      - Split two VM_BUG_ON cases into two statements. This means the line
        number reported when the BUG is hit specifies exactly which condition
        was true.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3460f6e5
    • P
      mm/userfaultfd: fix uffd-wp special cases for fork() · 8f34f1ea
Committed by Peter Xu
      We tried to do something similar in b569a176 ("userfaultfd: wp: drop
      _PAGE_UFFD_WP properly when fork") previously, but it's not doing it all
right.  A few fixes around the code path:
      
      1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
         than the new vma.  That's overlooked in b569a176, so it won't work
         as expected.  Thanks to the recent rework on fork code
         (7a4830c3), we can easily get the new vma now, so switch the
         checks to that.
      
      2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
         huge pmd is a migration huge pmd.  When it happens, instead of using
         pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
         handle them separately.
      
3. We forgot to carry over the uffd-wp bit for a write migration huge pmd
   entry.  This also happens in copy_huge_pmd(), where we convert a
   write huge migration entry into a read one.
      
      4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.
      
      5. In copy_present_page() when COW is enforced when fork(), we also
         need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new
         vma, and when the pte to be copied has uffd-wp bit set.
      
Remove the comment in copy_present_pte() about this.  It won't help much
to comment only there, but commenting everywhere would be overkill.
Let's assume the commit messages will help.
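
A schematic of the copy_huge_pmd() part of the fix (simplified; the
surrounding migration-entry handling is elided):

	if (unlikely(is_swap_pmd(pmd))) {
		/* (2)+(3): a migration huge pmd is not present, so the swap
		 * variants of the uffd-wp helpers must be used, and the bit
		 * must be carried over when a write migration entry is
		 * downgraded to a read one for the child. */
		if (pmd_swp_uffd_wp(pmd) && !(dst_vma->vm_flags & VM_UFFD_WP))
			pmd = pmd_swp_clear_uffd_wp(pmd);
	} else {
		/* (1): decide based on the *new* (child) vma, not the old one. */
		if (!(dst_vma->vm_flags & VM_UFFD_WP))
			pmd = pmd_clear_uffd_wp(pmd);
	}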
      
      [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
        Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
      Fixes: b569a176 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f34f1ea
    • M
      mm: sparsemem: use huge PMD mapping for vmemmap pages · 2d7a2171
Committed by Muchun Song
The preparation for splitting huge PMD mappings of vmemmap pages is ready,
so switch the vmemmap mapping from PTE to PMD.
      
Link: https://lkml.kernel.org/r/20210616094915.34432-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d7a2171
    • M
      mm: sparsemem: split the huge PMD mapping of vmemmap pages · 3bc2b6a7
Committed by Muchun Song
      Patch series "Split huge PMD mapping of vmemmap pages", v4.
      
In order to reduce the difficulty of code review, series [1] disabled
huge PMD mapping of vmemmap pages when that feature was enabled.  In this
series, we no longer disable huge PMD mapping of vmemmap pages; instead we
split the huge PMD mapping when needed.  When HugeTLB pages are freed
from the pool we do not attempt to coalesce and move back to a PMD mapping
because it is much more complex.
      
      [1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/
      
      This patch (of 3):
      
In [1], PMD mappings of vmemmap pages were disabled if the feature
hugetlb_free_vmemmap was enabled.  This was done to simplify the initial
implementation of vmemmap freeing for hugetlb pages.  Now, remove this
      simplification by allowing PMD mapping and switching to PTE mappings as
      needed for allocated hugetlb pages.
      
      When a hugetlb page is allocated, the vmemmap page tables are walked to
      free vmemmap pages.  During this walk, split huge PMD mappings to PTE
mappings as required.  In the unlikely case that PTE pages cannot be
allocated, return an error (ENOMEM) and do not optimize the vmemmap of the
hugetlb page.
      
      When HugeTLB pages are freed from the pool, we do not attempt to
      coalesce and move back to a PMD mapping because it is much more complex.
      
      [1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
      
      Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bc2b6a7
    • M
      mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY · 8cc5fcbb
Committed by Mina Almasry
      On UFFDIO_COPY, if we fail to copy the page contents while holding the
      hugetlb_fault_mutex, we will drop the mutex and return to the caller after
      allocating a page that consumed a reservation.  In this case there may be
      a fault that double consumes the reservation.  To handle this, we free the
      allocated page, fix the reservations, and allocate a temporary hugetlb
      page and return that to the caller.  When the caller does the copy outside
      of the lock, we again check the cache, and allocate a page consuming the
      reservation, and copy over the contents.
      
      Test:
      Hacked the code locally such that resv_huge_pages underflows produce
      a warning and the copy_huge_page_from_user() always fails, then:
      
      ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
              2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      ./tools/testing/selftests/vm/userfaultfd hugetlb 10
      	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      
Both tests succeed and produce no warnings.  After the
test runs, the number of free/resv hugepages is correct.
      
      [yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
        Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
      [almasrymina@google.com: fix allocation error check and copy func name]
        Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
      
Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cc5fcbb
    • C
      mm/vmalloc: enable mapping of huge pages at pte level in vmalloc · 3382bbee
Committed by Christophe Leroy
      On some architectures like powerpc, there are huge pages that are mapped
      at pte level.
      
      Enable it in vmalloc.
      
      For that, architectures can provide arch_vmap_pte_supported_shift() that
      returns the shift for pages to map at pte level.
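
A sketch of the hook as described (the default stub shown here is an
assumption about the generic fallback; architectures override it):

	#ifndef arch_vmap_pte_supported_shift
	static inline int arch_vmap_pte_supported_shift(unsigned long size)
	{
		return PAGE_SHIFT;	/* generic case: no PTE-level huge pages */
	}
	#endif

	/* __vmalloc_node_range() can then pick a shift larger than PAGE_SHIFT
	 * (e.g. 16k or 512k on powerpc 8xx) when the area size allows it. */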
      
Link: https://lkml.kernel.org/r/2c717e3b1fba1894d890feb7669f83025bfa314d.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3382bbee
    • C
      mm/vmalloc: enable mapping of huge pages at pte level in vmap · f7ee1f13
Committed by Christophe Leroy
      On some architectures like powerpc, there are huge pages that are mapped
      at pte level.
      
      Enable it in vmap.
      
      For that, architectures can provide arch_vmap_pte_range_map_size() that
      returns the size of pages to map at pte level.
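
Similarly, a sketch of the vmap-side hook (the stub below is an assumed
generic fallback; the exact signature may differ):

	#ifndef arch_vmap_pte_range_map_size
	static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
					unsigned long end, u64 pfn,
					unsigned int max_page_shift)
	{
		return PAGE_SIZE;	/* generic case: map one base page per PTE */
	}
	#endif

	/* vmap_pte_range() asks the architecture how much of [addr, end) it can
	 * cover with a single (possibly huge) PTE before writing each entry. */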
      
Link: https://lkml.kernel.org/r/fb3ccc73377832ac6708181ec419128a2f98ce36.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7ee1f13
    • C
      mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge · c742199a
Committed by Christophe Leroy
      For architectures with no PMD and/or no PUD, add stubs similar to what we
      have for architectures without P4D.
      
      [christophe.leroy@csgroup.eu: arm64: define only {pud/pmd}_{set/clear}_huge when useful]
        Link: https://lkml.kernel.org/r/73ec95f40cafbbb69bdfb43a7f53876fd845b0ce.1620990479.git.christophe.leroy@csgroup.eu
      [christophe.leroy@csgroup.eu: x86: define only {pud/pmd}_{set/clear}_huge when useful]
        Link: https://lkml.kernel.org/r/7fbf1b6bc3e15c07c24fa45278d57064f14c896b.1620930415.git.christophe.leroy@csgroup.eu
      
Link: https://lkml.kernel.org/r/5ac5976419350e8e048d463a64cae449eb3ba4b0.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c742199a
    • C
      mm/hugetlb: change parameters of arch_make_huge_pte() · 79c1c594
Committed by Christophe Leroy
      Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2.
      
      This series implements huge VMAP and VMALLOC on powerpc 8xx.
      
      Powerpc 8xx has 4 page sizes:
      - 4k
      - 16k
      - 512k
      - 8M
      
      At the time being, vmalloc and vmap only support huge pages which are
      leaf at PMD level.
      
Here the PMD level is 4M; it doesn't correspond to any supported
page size.
      
      For now, implement use of 16k and 512k pages which is done
      at PTE level.
      
      Support of 8M pages will be implemented later, it requires use of
      hugepd tables.
      
      To allow this, the architecture provides two functions:
      - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
      page size to use. A stub returning PAGE_SIZE is provided when the
      architecture doesn't provide this function.
      - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
      what page shift to use for a given area size. A stub returning
      PAGE_SHIFT is provided when the architecture doesn't provide this
      function.
      
      This patch (of 5):
      
      At the time being, arch_make_huge_pte() has the following prototype:
      
        pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
      			   struct page *page, int writable);
      
vma is used to get the page shift or size.
      vma is also used on Sparc to get vm_flags.
      page is not used.
      writable is not used.
      
In order to use this function without a vma, replace vma by shift and
flags.  Also remove the unused parameters.
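
The new prototype then looks roughly like this (sketch based on the
description above):

	pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);

	/* Callers that previously passed a vma now pass huge_page_shift(h)
	 * and vma->vm_flags (or equivalents) directly. */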
      
      Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79c1c594
    • M
      mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() · e6be37b2
Committed by Miaohe Lin
Since commit 99cb0dbd ("mm,thp: add read-only THP support for
(non-shmem) FS"), read-only THP file mapping is supported.  But that commit
forgot to add a check for it in transparent_hugepage_enabled().  To fix this,
add a check for read-only THP file mappings and also introduce the helper
transhuge_vma_enabled() to check whether THP is enabled for a specified vma,
which reduces duplicated code.  Rename transparent_hugepage_enabled() to
transparent_hugepage_active() to make the code easier to follow, as suggested
by David Hildenbrand.
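
A sketch of the helper described above (simplified; the exact checks may
differ):

	static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
						 unsigned long vm_flags)
	{
		/* THP explicitly disabled through madvise or prctl. */
		if ((vm_flags & VM_NOHUGEPAGE) ||
		    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
			return false;
		return true;
	}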
      
      [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
        Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6be37b2
    • M
      mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK · b2bd53f1
      Miaohe Lin committed
      Patch series "Cleanup and fixup for huge_memory", v3.
      
      This series contains cleanups to remove dedicated macro and remove
      unnecessary tlb_remove_page_size() for huge zero pmd.  Also this adds
      missing read-only THP checking for transparent_hugepage_enabled() and
      avoids discarding hugepage if other processes are mapping it.  More
      details can be found in the respective changelogs.
      
      This patch (of 5):
      
      Rewrite the pgoff checking logic to remove the macro
      HPAGE_CACHE_INDEX_MASK, which is only used here, in order to simplify
      the code.
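      
      A sketch of the kind of rewrite this describes; the exact upstream diff
      may differ, but an explicit alignment test makes the dedicated mask
      macro unnecessary (the two forms are equivalent because
      HPAGE_CACHE_INDEX_MASK is HPAGE_PMD_NR - 1):
      
        /* Before: compare the masked offsets. */
        if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
                        (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
                return false;
      
        /* After: the start/pgoff delta must be a multiple of HPAGE_PMD_NR. */
        if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
                        HPAGE_PMD_NR))
                return false;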
      
      Link: https://lkml.kernel.org/r/20210511134857.1581273-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210511134857.1581273-2-linmiaohe@huawei.com
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2bd53f1
    • M
      mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate · 77490587
      Muchun Song committed
      All the infrastructure is ready, so introduce the nr_free_vmemmap_pages
      field in the hstate to indicate how many vmemmap pages associated with
      a HugeTLB page can be freed to the buddy allocator, and initialize it
      in hugetlb_vmemmap_init().  This patch is the actual enablement of the
      feature.
      
      There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page
      structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is
      enabled, so add a BUILD_BUG_ON to catch invalid usage of the tail
      struct page.
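      
      A hedged sketch of the new field's initialization; RESERVE_VMEMMAP_NR
      stands for the reserved vmemmap pages that must stay mapped, and
      hugetlb_free_vmemmap_enabled for the feature switch (the exact upstream
      code may differ):
      
        /* New field in struct hstate: unsigned int nr_free_vmemmap_pages; */
      
        static void __init hugetlb_vmemmap_init(struct hstate *h)
        {
                unsigned int vmemmap_pages;
      
                if (!hugetlb_free_vmemmap_enabled)
                        return;
      
                vmemmap_pages = (pages_per_huge_page(h) * sizeof(struct page))
                                >> PAGE_SHIFT;
      
                /* The reserved vmemmap pages stay; the rest can be freed. */
                h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR;
        }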
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Tested-by: NChen Huang <chenhuang5@huawei.com>
      Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77490587
    • M
      mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap · e9fdff87
      Muchun Song committed
      Add a kernel parameter hugetlb_free_vmemmap to enable, at boot time,
      the feature of freeing unused vmemmap pages associated with each
      hugetlb page.
      
      PMD mapping of vmemmap pages is disabled for x86-64 when this feature
      is enabled, because vmemmap_remap_free() depends on the vmemmap being
      mapped with base pages.
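      
      A sketch of how such a boot parameter is typically parsed, assuming a
      hugetlb_free_vmemmap_enabled flag; the exact upstream parser may
      differ:
      
        static int __init early_hugetlb_free_vmemmap_param(char *buf)
        {
                if (!buf)
                        return -EINVAL;
      
                if (!strcmp(buf, "on"))
                        hugetlb_free_vmemmap_enabled = true;
                else if (!strcmp(buf, "off"))
                        hugetlb_free_vmemmap_enabled = false;
                else
                        return -EINVAL;
      
                return 0;
        }
        early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);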
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NBarry Song <song.bao.hua@hisilicon.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Tested-by: NChen Huang <chenhuang5@huawei.com>
      Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9fdff87
    • M
      mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page · ad2fa371
      Muchun Song committed
      When we free a HugeTLB page to the buddy allocator, we need to allocate
      the vmemmap pages associated with it.  However, we may not be able to
      allocate the vmemmap pages when the system is under memory pressure.  In
      this case, we just refuse to free the HugeTLB page.  This changes behavior
      in some corner cases as listed below:
      
       1) Failing to free a huge page triggered by the user (decrease nr_pages).
      
          User needs to try again later.
      
       2) Failing to free a surplus huge page when freed by the application.
      
          Try again later when freeing a huge page next time.
      
       3) Failing to dissolve a free huge page on ZONE_MOVABLE via
          offline_pages().
      
          This can happen when we have plenty of ZONE_MOVABLE memory, but
          not enough kernel memory to allocate vmemmap pages.  We may even
          be able to migrate huge page contents, but will not be able to
          dissolve the source huge page.  This will prevent an offline
          operation and is unfortunate as memory offlining is expected to
          succeed on movable zones.  Users that depend on memory hotplug
          to succeed for movable zones should carefully consider whether the
          memory savings gained from this feature are worth the risk of
          possibly not being able to offline memory in certain situations.
      
       4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
          alloc_contig_range() - once we have that handling in place. Mainly
          affects CMA and virtio-mem.
      
          Similar to 3). virtio-mem will handle migration errors gracefully.
          CMA might be able to fallback on other free areas within the CMA
          region.
      
      Vmemmap pages are allocated from the page freeing context.  In order
      for those allocations not to be disruptive (e.g. not to trigger the OOM
      killer), __GFP_NORETRY is used.  hugetlb_lock is dropped for the
      allocation because a non-sleeping allocation would be too fragile and
      could fail too easily under memory pressure.  GFP_ATOMIC or other modes
      that access memory reserves are not used because we want to prevent
      consuming reserves under heavy hugetlb freeing.
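      
      A sketch of the allocation policy described above; the helper name is
      illustrative, not necessarily the upstream one:
      
        static struct page *alloc_vmemmap_page(int nid)
        {
                /*
                 * GFP_KERNEL allows reclaim (hugetlb_lock is dropped around
                 * the allocation); __GFP_NORETRY avoids the OOM killer and
                 * aggressive retries; no GFP_ATOMIC, so memory reserves are
                 * not consumed.
                 */
                gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY;
      
                return alloc_pages_node(nid, gfp_mask, 0);
        }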
      
      [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
        Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
      [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2fa371
    • M
      mm: hugetlb: free the vmemmap pages associated with each HugeTLB page · f41f2ed4
      Muchun Song committed
      Every HugeTLB has more than one struct page structure.  We __know__ that
      we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
      store metadata associated with each HugeTLB.
      
      There are a lot of struct page structures associated with each HugeTLB
      page.  For tail pages, the value of compound_head is the same, so we
      can reuse the first page of the tail page structures.  We map the
      virtual addresses of the remaining tail page structures to that first
      tail page struct and then free the underlying page frames.  Therefore,
      we need to reserve two pages as vmemmap areas.
      
      When we allocate a HugeTLB page from the buddy allocator, we can free
      some of the vmemmap pages associated with it.  It is most appropriate
      to do this in prep_new_huge_page().
      
      The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
      associated with a HugeTLB page can be freed, returns zero for now, which
      means the feature is disabled.  We will enable it once all the
      infrastructure is there.
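      
      A sketch of the staged enablement described above; the remap step is
      only indicated by a comment, and the second helper name follows the
      changelog's wording rather than a confirmed upstream signature:
      
        static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
        {
                /* Zero for now: the feature stays off until a later patch. */
                return 0;
        }
      
        static void free_huge_page_vmemmap(struct hstate *h, struct page *head)
        {
                if (!free_vmemmap_pages_per_hpage(h))
                        return;
      
                /* Remap the tail vmemmap pages and free the spare frames. */
        }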
      
      [willy@infradead.org: fix documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Tested-by: NChen Huang <chenhuang5@huawei.com>
      Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f41f2ed4
    • M
      mm: hugetlb: gather discrete indexes of tail page · cd39d4e9
      Muchun Song committed
      For a HugeTLB page, there is more metadata to save than the head struct
      page can hold, so we have to use other tail struct pages to store the
      metadata.  In order to avoid conflicts caused by subsequent use of more
      tail struct pages, gather these discrete indexes of the tail struct
      pages in one place.  This makes it easier to add a new tail page index
      later.
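      
      An illustrative sketch of gathering the indexes in one enum.  The
      concrete index names are examples and not necessarily the upstream
      ones; only __NR_USED_SUBPAGE is taken from the changelogs above:
      
        enum {
                SUBPAGE_INDEX_SUBPOOL = 1,      /* e.g. subpool pointer */
                SUBPAGE_INDEX_CGROUP,           /* e.g. hugetlb cgroup pointer */
                SUBPAGE_INDEX_CGROUP_RSVD,      /* e.g. reserved cgroup pointer */
                __NR_USED_SUBPAGE,              /* number of used tail indexes */
        };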
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-4-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Tested-by: NChen Huang <chenhuang5@huawei.com>
      Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd39d4e9
    • M
      mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c · 426e5c42
      Muchun Song committed
      Patch series "Free some vmemmap pages of HugeTLB page", v23.
      
      This patch series frees some vmemmap pages (struct page structures)
      associated with each HugeTLB page when it is preallocated, in order to
      save memory.
      
      To reduce the difficulty of reviewing the first version of the code,
      this version disables PMD/huge page mapping of the vmemmap when the
      feature is enabled.  This eliminates a bunch of complex page table
      manipulation code.  Once this patch series is solid, we can add the
      vmemmap page table manipulation code in the future.
      
      The struct page structures (page structs) are used to describe a
      physical page frame.  By default, there is a one-to-one mapping from a
      page frame to its corresponding page struct.
      
      HugeTLB pages consist of multiple base page size pages and are
      supported by many architectures.  See hugetlbpage.rst in the Documentation
      directory for more details.  On the x86 architecture, HugeTLB pages of
      size 2MB and 1GB are currently supported.  Since the base page size on x86
      is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB
      page consists of 4096 base pages.  For each base page, there is a
      corresponding page struct.
      
      Within the HugeTLB subsystem, only the first 4 page structs are used to
      contain unique information about a HugeTLB page.  HUGETLB_CGROUP_MIN_ORDER
      provides this upper limit.  The only 'useful' information in the remaining
      page structs is the compound_head field, and this field is the same for
      all tail pages.
      
      By removing redundant page structs for HugeTLB pages, memory can be
      returned to the buddy allocator for other uses.
      
      When the system boots up, every 2MB HugeTLB page has 512 struct page
      structs, which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | -------------> |     2     |
       |           |                     +-----------+                +-----------+
       |           |                     |     3     | -------------> |     3     |
       |           |                     +-----------+                +-----------+
       |           |                     |     4     | -------------> |     4     |
       |    2MB    |                     +-----------+                +-----------+
       |           |                     |     5     | -------------> |     5     |
       |           |                     +-----------+                +-----------+
       |           |                     |     6     | -------------> |     6     |
       |           |                     +-----------+                +-----------+
       |           |                     |     7     | -------------> |     7     |
       |           |                     +-----------+                +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      The value of page->compound_head is the same for all tail pages.  The
      first page of page structs (page 0) associated with the HugeTLB page
      contains the 4 page structs necessary to describe the HugeTLB.  The only
      use of the remaining pages of page structs (page 1 to page 7) is to point
      to page->compound_head.  Therefore, we can remap pages 2 to 7 to page 1.
      Only 2 pages of page structs will be used for each HugeTLB page.  This
      will allow us to free the remaining 6 pages to the buddy allocator.
      
      Here is how things look after remapping.
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      When a HugeTLB is freed to the buddy system, we should allocate 6 pages
      for vmemmap pages and restore the previous mapping relationship.
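      
      A sketch of the two operations this implies; the function names and
      signatures follow the series' description and may not match the final
      upstream API exactly:
      
        /*
         * Remap the vmemmap virtual range [start, end) onto the page backing
         * the 'reuse' address and free the page frames no longer needed.
         */
        void vmemmap_remap_free(unsigned long start, unsigned long end,
                                unsigned long reuse);
      
        /*
         * Allocate fresh page frames and restore the one-to-one vmemmap
         * mapping for [start, end) before the HugeTLB page goes to the buddy.
         */
        int vmemmap_remap_alloc(unsigned long start, unsigned long end,
                                unsigned long reuse, gfp_t gfp_mask);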
      
      Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page.  It
      is handled similarly, and we can use the same approach to free its
      vmemmap pages.
      
      In this case, for a 1GB HugeTLB page, we can save 4094 pages (its
      262144 struct pages occupy 4096 vmemmap pages; 2 are kept, so 4094 can
      be freed, assuming a 64-byte struct page).  This is a very substantial
      gain.  On our servers, we run SPDK/QEMU applications that use 1024GB of
      HugeTLB pages.  With this feature enabled, we can save ~16GB (1GB
      hugepages) / ~12GB (2MB hugepages) of memory.
      
      Because the vmemmap page tables are reconstructed on the
      freeing/allocating path, this adds some overhead.  Here is some
      overhead analysis.
      
      1) Allocating 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.166s
         user     0m0.000s
         sys      0m0.166s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)           5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [16K, 32K)          4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
         [32K, 64K)             4 |                                                    |
      
         b) Without this patch series:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.067s
         user     0m0.000s
         sys      0m0.067s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)           10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)             93 |                                                    |
      
         Summary: allocation with this feature is about ~2x slower than before.
      
      2) Freeing 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.213s
         user     0m0.000s
         sys      0m0.213s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)              6 |                                                    |
         [16K, 32K)         10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [32K, 64K)             7 |                                                    |
      
         b) Without this patch series:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.081s
         user     0m0.000s
         sys      0m0.081s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)            6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)           3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
         [16K, 32K)             8 |                                                    |
      
         Summary: freeing (__free_hugepage) is about ~2-3x slower than before.
      
      Although the overhead has increased, it is not significant.
      Like Mike said, "However, remember that the majority of use cases create
      HugeTLB pages at or shortly after boot time and add them to the pool.  So,
      additional overhead is at pool creation time.  There is no change to
      'normal run time' operations of getting a page from or returning a page to
      the pool (think page fault/unmap)".
      
      Despite the overhead, and in addition to the memory gains from this
      series, the following data was obtained by Joao Martins.  Many thanks
      for his effort.
      
      There is an additional benefit: page (un)pinners will see an
      improvement, and Joao presumes this is because there are fewer memmap
      pages, so the tail/head pages stay in cache more often.
      
      Out of the box Joao saw (when comparing linux-next against linux-next +
      this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):
      
      	get_user_pages(): ~32k -> ~9k
      	unpin_user_pages(): ~75k -> ~70k
      
      Usually any tight loop fetching compound_head(), or reading tail page
      data (e.g.  compound_head), benefits a lot.  There are some unpinning
      inefficiencies Joao was fixing [2], but with those fixes added it shows
      even more:
      
      	unpin_user_pages(): ~27k -> ~3.8k
      
      [1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
      [2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/
      
      This patch (of 9):
      
      Move the common bootmem info registration API into a dedicated
      bootmem_info.c.  A later patch will use {get,put}_page_bootmem() to
      initialize pages for the vmemmap or free vmemmap pages to the buddy
      allocator, so move them out of CONFIG_MEMORY_HOTPLUG_SPARSE.  This is
      just code movement without any functional change.
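      
      A sketch of the moved interface as described above; header placement is
      an assumption of this sketch, not confirmed by the changelog:
      
        /* include/linux/bootmem_info.h (sketch) */
        void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
        void get_page_bootmem(unsigned long info, struct page *page,
                              unsigned long type);
        void put_page_bootmem(struct page *page);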
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Tested-by: NChen Huang <chenhuang5@huawei.com>
      Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: x86@kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      426e5c42
  2. 30 Jun, 2021 8 commits