1. 27 Sep 2022, 1 commit
    • mm: multi-gen LRU: exploit locality in rmap · 018ee47f
      Committed by Yu Zhao
      Searching the rmap for PTEs mapping each page on an LRU list (to test and
      clear the accessed bit) can be expensive because pages from different VMAs
      (PA space) are not cache friendly to the rmap (VA space).  For workloads
      mostly using mapped pages, searching the rmap can incur the highest CPU
      cost in the reclaim path.
      
      This patch exploits spatial locality to reduce the trips into the rmap. 
      When shrink_page_list() walks the rmap and finds a young PTE, a new
      function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
      PTEs.  On finding another young PTE, it clears the accessed bit and
      updates the gen counter of the page mapped by this PTE to
      (max_seq%MAX_NR_GENS)+1.
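
      As an illustration only, here is a minimal user-space model of that
      look-around idea; the arrays, window arithmetic and values below are
      stand-ins and this is not the kernel's lru_gen_look_around():

        #include <stdio.h>
        #include <limits.h>

        #define MAX_NR_GENS     4
        #define BITS_PER_LONG   (sizeof(long) * CHAR_BIT)

        /* Clear the "accessed" bits in a window of at most BITS_PER_LONG-1
         * entries around index i and promote those entries' generation. */
        static void look_around(int *young, int *gen, int n, int i,
                                unsigned long max_seq)
        {
            int start = i - (int)BITS_PER_LONG / 2 + 1;
            int end = i + (int)BITS_PER_LONG / 2;

            if (start < 0)
                start = 0;
            if (end > n)
                end = n;

            for (int j = start; j < end; j++) {
                if (!young[j])
                    continue;
                young[j] = 0;                       /* clear the accessed bit */
                gen[j] = max_seq % MAX_NR_GENS + 1; /* promote the page */
            }
        }

        int main(void)
        {
            int young[16] = { [2] = 1, [3] = 1, [7] = 1 };
            int gen[16] = { 0 };

            look_around(young, gen, 16, 3, 5);
            printf("gen[2]=%d gen[7]=%d\n", gen[2], gen[7]);
            return 0;
        }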
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[3, 5]%
                      Ops/sec      KB/sec
            patch1-6: 1106168.46   43025.04
            patch1-7: 1147696.57   44640.29
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
        Configurations:
          no change
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 12 Sep 2022, 3 commits
    • mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      Committed by David Hildenbrand
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, that the PTE is
      cleared/invalidated and that the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required.  On architectures that send an IPI broadcast on TLB flush,
      it works as expected.  To synchronize with RCU GUP-fast properly, we
      are conceptually fine; however, we have to enforce a certain memory
      order and are currently missing the required memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, in which GUP-fast takes an R/O pin while KSM wants
      to convert an exclusive anonymous page to a KSM page that is already
      mapped write-protected (-> no PTE change), would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
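
      As a rough user-space model of the ordering in the diagram above (an
      illustration only: C11 fences stand in for smp_mb()/smp_wmb()/smp_rmb(),
      and pte, pinned and exclusive are stand-in variables, not the real PTE,
      pin count or page flag):

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static atomic_int pte = 1;        /* 1 = mapped, 0 = temporarily cleared */
        static atomic_int pinned;         /* stand-in for the page pin count */
        static atomic_int exclusive = 1;  /* stand-in for PageAnonExclusive */

        static void *ksm_side(void *arg)                       /* Thread 0 */
        {
            (void)arg;
            atomic_store_explicit(&pte, 0, memory_order_relaxed);        /* A1 */
            atomic_thread_fence(memory_order_seq_cst);                   /* smp_mb() */
            if (!atomic_load_explicit(&pinned, memory_order_relaxed)) {  /* A2 */
                atomic_store_explicit(&exclusive, 0, memory_order_relaxed); /* A3 */
                atomic_thread_fence(memory_order_release);               /* smp_wmb() */
            }
            atomic_store_explicit(&pte, 1, memory_order_relaxed);        /* A4 */
            return NULL;
        }

        static void *gup_side(void *arg)                       /* Thread 1 */
        {
            (void)arg;
            int old = atomic_load_explicit(&pte, memory_order_relaxed);  /* B1 */
            if (!old)
                return NULL;           /* a cleared PTE would not be pinned */
            atomic_store_explicit(&pinned, 1, memory_order_relaxed);     /* B3 */
            atomic_thread_fence(memory_order_seq_cst);                   /* smp_mb() */
            int now = atomic_load_explicit(&pte, memory_order_relaxed);  /* B4 */
            atomic_thread_fence(memory_order_acquire);                   /* smp_rmb() */
            int excl = atomic_load_explicit(&exclusive, memory_order_relaxed); /* B5 */
            if (old != now || !excl)
                puts("GUP-fast: back off, page may no longer be exclusive");
            else
                puts("GUP-fast: pin kept, page still exclusive");
            return NULL;
        }

        int main(void)
        {
            pthread_t t0, t1;

            pthread_create(&t0, NULL, ksm_side, NULL);
            pthread_create(&t1, NULL, gup_side, NULL);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);
            return 0;
        }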
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: kill find_min_pfn_with_active_regions() · fb70c487
      Committed by Kefeng Wang
      find_min_pfn_with_active_regions() is only called from free_area_init(). 
      Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
      and kill find_min_pfn_with_active_regions().
      
      Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory tiering: hot page selection with hint page fault latency · 33024536
      Committed by Huang Ying
      Patch series "memory tiering: hot page selection", v4.
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory nodes need to be identified. 
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm for identifying hot pages, because pages with quite low access
      frequency may eventually be accessed anyway, given that the NUMA
      balancing page table scanning period can be quite long (e.g.  60
      seconds).  So in this patchset, we implement a new hot page
      identification algorithm based on the latency between the NUMA balancing
      page table scan and the hint page fault, which is a kind of most
      frequently accessed (MFU) algorithm.
      
      In NUMA balancing memory tiering mode, if there are hot pages in slow
      memory node and cold pages in fast memory node, we need to promote/demote
      hot/cold pages between the fast and cold memory nodes.
      
      One choice is to promote/demote as fast as possible.  But the CPU cycles
      and memory bandwidth consumed by a high promoting/demoting throughput
      will hurt the latency of some workloads, because of access latency
      inflation and slow memory bandwidth contention.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patchset
      as the page promotion rate limit mechanism.
      
      The promotion hot threshold is workload and system configuration
      dependent.  So in this patchset, a method to adjust the hot threshold
      automatically is implemented.  The basic idea is to control the number of
      the candidate promotion pages to match the promotion rate limit.
      
      We used the pmbench memory accessing benchmark to test the patchset on a
      2-socket server system with DRAM and PMEM installed.  The test results
      are as follows:
      
      		pmbench score		promote rate
      		 (accesses/s)			MB/s
      		-------------		------------
      base		  146887704.1		       725.6
      hot selection     165695601.2		       544.0
      rate limit	  162814569.8		       165.2
      auto adjustment	  170495294.0                  136.9
      
      From the results above,
      
      With hot page selection patch [1/3], the pmbench score increases about
      12.8%, and promote rate (overhead) decreases about 25.0%, compared with
      base kernel.
      
      With rate limit patch [2/3], pmbench score decreases about 1.7%, and
      promote rate decreases about 69.6%, compared with hot page selection
      patch.
      
      With threshold auto adjustment patch [3/3], pmbench score increases about
      4.7%, and promote rate decreases about 17.1%, compared with rate limit
      patch.
      
      Baolin helped to test the patchset with MySQL on a machine which contains
      1 DRAM node (30G) and 1 PMEM node (126G).
      
      sysbench /usr/share/sysbench/oltp_read_write.lua \
      ......
      --tables=200 \
      --table-size=1000000 \
      --report-interval=10 \
      --threads=16 \
      --time=120
      
      The tps can be improved about 5%.
      
      
      This patch (of 3):
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory node need to be identified.  Essentially,
      the original NUMA balancing implementation selects the most recently
      accessed (MRU) pages to promote.  But this isn't a perfect algorithm for
      identifying hot pages, because pages with quite low access frequency may
      eventually be accessed anyway, given that the NUMA balancing page table
      scanning period can be quite long (e.g.  60 seconds).  The most
      frequently accessed (MFU) algorithm is better.
      
      So, in this patch we implement a better hot page selection algorithm,
      based on NUMA balancing page table scanning and the hint page fault, as
      follows:
      
      - When the page tables of the processes are scanned to change PTE/PMD
        to be PROT_NONE, the current time is recorded in struct page as scan
        time.
      
      - When the page is accessed, a hint page fault will occur.  The scan
        time is read from the struct page, and the hint page fault
        latency is defined as
      
          hint page fault time - scan time
      
      The shorter the hint page fault latency of a page, the more likely it is
      to be accessed frequently.  So the hint page fault latency is a better
      estimate of whether a page is hot or cold.
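
      A minimal sketch of that hot/cold test, assuming a jiffies-like tick
      counter and the 1-second default threshold mentioned further down; this
      is an illustration, not the kernel's implementation:

        #include <stdbool.h>
        #include <stdio.h>

        #define TICKS_PER_SEC   1000                 /* stand-in for HZ */
        #define HOT_THRESHOLD   (1 * TICKS_PER_SEC)  /* default: 1 second */

        /* A page is treated as hot when the hint page fault latency
         * (fault time - scan time) is below the hot threshold. */
        static bool page_is_hot(unsigned long scan_time, unsigned long fault_time)
        {
            unsigned long latency = fault_time - scan_time;

            return latency < HOT_THRESHOLD;
        }

        int main(void)
        {
            /* scanned at tick 10000, fault at tick 10200 -> 0.2 s: hot */
            printf("hot: %d\n", page_is_hot(10000, 10200));
            /* scanned at tick 10000, fault at tick 60000 -> 50 s: cold */
            printf("hot: %d\n", page_is_hot(10000, 60000));
            return 0;
        }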
      
      It's hard to find some extra space in struct page to hold the scan time. 
      Fortunately, we can reuse some bits used by the original NUMA balancing.
      
      NUMA balancing uses some bits in struct page to store the accessing CPU
      and PID of the page (see page_cpupid_xchg_last()), which are used by the
      multi-stage node selection algorithm to avoid migrating pages shared
      among NUMA nodes back and forth.  But for pages in the slow memory node,
      even if they are shared among multiple NUMA nodes, as long as the pages
      are hot, they need to be promoted to the fast memory node.  So the
      accessing CPU and PID information is unnecessary for the slow memory
      pages.  We can reuse these bits in struct page to record the scan time.
      For the fast memory pages, these bits are used as before.
      
      For the hot threshold, the default value is 1 second, which works well in
      our performance test.  All pages with hint page fault latency < hot
      threshold will be considered hot.
      
      It's hard for users to determine the hot threshold.  So we don't provide a
      kernel ABI to set it, just provide a debugfs interface for advanced users
      to experiment.  We will continue to work on a hot threshold automatic
      adjustment mechanism.
      
      The downside of the above method is that the response time to the workload
      hot spot changing may be much longer.  For example,
      
      - A previous cold memory area becomes hot
      
      - The hint page fault will be triggered.  But the hint page fault
        latency isn't shorter than the hot threshold.  So the pages will
        not be promoted.
      
      - When the memory area is scanned again, maybe after a scan period,
        the hint page fault latency measured will be shorter than the hot
        threshold and the pages will be promoted.
      
      To mitigate this, if there is enough free space in the fast memory node,
      the hot threshold is not used: all pages are promoted upon the hint page
      fault for a fast response.
      
      Thanks to Zhong Jiang, who reported and tested the fix for a bug that
      occurred when disabling memory tiering mode dynamically.
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 29 Aug 2022, 1 commit
  4. 21 Aug 2022, 1 commit
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      Committed by David Hildenbrand
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
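
      A sketch of that decision, with stand-in types rather than the real
      gup.c code (softdirty and uffd-wp handling omitted):

        #include <stdbool.h>

        struct vma_model  { bool vm_write;  bool cow_mapping; };
        struct page_model { bool anon;      bool anon_exclusive; };

        /* FOLL_FORCE write access through a !VM_WRITE COW mapping is only
         * safe once an exclusive anonymous page is mapped; otherwise the
         * caller must trigger another (fake) write fault to break COW. */
        static bool can_follow_forced_write(const struct vma_model *vma,
                                            const struct page_model *page)
        {
            if (vma->vm_write)
                return true;               /* ordinary writable mapping */
            if (!vma->cow_mapping)
                return false;              /* forced write not possible here */
            return page->anon && page->anon_exclusive;
        }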
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. 09 Aug 2022, 3 commits
  6. 19 Jul 2022, 1 commit
  7. 18 Jul 2022, 8 commits
    • mm/mmap: drop ARCH_HAS_VM_GET_PAGE_PROT · 3d923c5f
      Committed by Anshuman Khandual
      Now all the platforms enable ARCH_HAS_VM_GET_PAGE_PROT.  They define and
      export their own vm_get_page_prot(), whether custom or via the standard
      DECLARE_VM_GET_PAGE_PROT.  Hence there is no need for the default
      generic fallback for vm_get_page_prot().  Just drop this fallback and
      also the ARCH_HAS_VM_GET_PAGE_PROT mechanism.
      
      Link: https://lkml.kernel.org/r/20220711070600.2378316-27-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mmap: build protect protection_map[] with ARCH_HAS_VM_GET_PAGE_PROT · 09095f74
      Committed by Anshuman Khandual
      Now that protection_map[] has been moved inside those platforms that
      enable ARCH_HAS_VM_GET_PAGE_PROT, the generic protection_map[] array can
      be build-protected with CONFIG_ARCH_HAS_VM_GET_PAGE_PROT instead of
      __P000.
      
      Link: https://lkml.kernel.org/r/20220711070600.2378316-8-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mmap: build protect protection_map[] with __P000 · 84053271
      Committed by Anshuman Khandual
      Patch series "mm/mmap: Drop __SXXX/__PXXX macros from across platforms",
      v7.
      
      The __SXXX/__PXXX macros are an unnecessary abstraction layer in
      creating the generic protection_map[] array, which is used for
      vm_get_page_prot().  This abstraction layer can be avoided if the
      platforms just define the protection_map[] array for all possible
      vm_flags access permission combinations and also export a
      vm_get_page_prot() implementation.
      
      This series drops __SXXX/__PXXX macros from across platforms in the tree. 
      First it build protects generic protection_map[] array with '#ifdef
      __P000' and moves it inside platforms which enable
      ARCH_HAS_VM_GET_PAGE_PROT.  Later this build protects same array with
      '#ifdef ARCH_HAS_VM_GET_PAGE_PROT' and moves inside remaining platforms
      while enabling ARCH_HAS_VM_GET_PAGE_PROT.  This adds a new macro
      DECLARE_VM_GET_PAGE_PROT defining the current generic vm_get_page_prot(),
      in order for it to be reused on platforms that do not require custom
      implementation.  Finally, ARCH_HAS_VM_GET_PAGE_PROT can just be dropped,
      as all platforms now define and export vm_get_page_prot() via a look-up
      of a private and static protection_map[] array.  The protection_map[]
      data type has been changed to 'static const' on all platforms that do
      not change it during boot.
      
      
      This patch (of 26):
      
      Build protect generic protection_map[] array with __P000, so that it can
      be moved inside all the platforms one after the other.  Otherwise there
      will be build failures during this process. 
      CONFIG_ARCH_HAS_VM_GET_PAGE_PROT cannot be used for this purpose as only
      certain platforms enable this config now.
      
      Link: https://lkml.kernel.org/r/20220711070600.2378316-1-anshuman.khandual@arm.com
      Link: https://lkml.kernel.org/r/20220711070600.2378316-2-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: introduce mf_dax_kill_procs() for fsdax case · c36e2024
      Committed by Shiyang Ruan
      This new function is a variant of mf_generic_kill_procs that accepts a
      file, offset pair instead of a struct to support multiple files sharing a
      DAX mapping.  It is intended to be called by the file systems as part of
      the memory_failure handler after the file system performed a reverse
      mapping from the storage address to the file and file offset.
      
      Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com
      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.wiliams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add zone device coherent type memory support · f25cbb7a
      Committed by Alex Sierra
      Add support for device memory that is cache coherent from both the
      device and the CPU point of view.  This is used on platforms that have
      an advanced system bus (like CAPI or CXL).  Any page of a process can be
      migrated to such memory.  However, no one should be allowed to pin such
      memory, so that it can always be evicted.
      
      [hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page]
      Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com
      Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move page zone helpers from mm.h to mmzone.h · 5bb88dc5
      Committed by Alex Sierra
      It makes more sense to have these helpers in the zone-specific header
      file (mmzone.h) rather than in the generic mm.h.
      
      Link: https://lkml.kernel.org/r/20220715150521.18165-3-alex.sierra@amd.com
      Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: rename is_pinnable_page() to is_longterm_pinnable_page() · 6077c943
      Committed by Alex Sierra
      Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory
      mapping", v9.
      
      This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
      owned by a device that can be mapped into CPU page tables like
      MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
      
      This patch series is mostly self-contained except for a few places where
      it needs to update other subsystems to handle the new memory type.
      
      System stability and performance are not affected according to our ongoing
      testing, including xfstests.
      
      How it works: The system BIOS advertises the GPU device memory (aka VRAM)
      as SPM (special purpose memory) in the UEFI system address map.
      
      The amdgpu driver registers the memory with devmap as
      MEMORY_DEVICE_COHERENT using devm_memremap_pages.  The initial user for
      this hardware page migration capability is the Frontier supercomputer
      project.  This functionality is not AMD-specific.  We expect other GPU
      vendors to find this functionality useful, and possibly other hardware
      types in the future.
      
      Our test nodes in the lab are similar to the Frontier configuration, with
      .5 TB of system memory plus 256 GB of device memory split across 4 GPUs,
      all in a single coherent address space.  Page migration is expected to
      improve application efficiency significantly.  We will report empirical
      results as they become available.
      
      Coherent device type pages at gup are now migrated back to system memory
      if they are being pinned long-term (FOLL_LONGTERM).  The reason is that
      long-term pinning would interfere with the device memory manager owning
      the device-coherent pages (e.g.  evictions in TTM).  This series
      incorporates Alistair Popple's patches to do this migration from
      pin_user_pages() calls.  hmm_gup_test has been added to hmm-test to
      exercise the different get_user_pages() call variants.
      
      This series includes handling of device-managed anonymous pages returned
      by vm_normal_pages.  Although they behave like normal pages for purposes
      of mapping in CPU page tables and for COW, they do not support LRU lists,
      NUMA migration or THP.
      
      We also introduced a FOLL_LRU flag that adds the same behaviour to
      follow_page and related APIs, to allow callers to specify that they expect
      to put pages on an LRU list.
      
      
      This patch (of 14):
      
      is_pinnable_page() and folio_is_pinnable() are renamed to
      is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively.
      These functions are used in the FOLL_LONGTERM flag context.
      
      Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com
      Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com
      Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: Add PAGE_ALIGN_DOWN macro · 335e52c2
      Committed by David Gow
      This is just the same as PAGE_ALIGN(), but rounds the address down, not
      up.
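
      A minimal user-space sketch of the pair of macros, with PAGE_SIZE
      hard-coded for illustration (the in-kernel definition may differ):

        #include <stdio.h>

        #define PAGE_SIZE               4096UL
        #define PAGE_MASK               (~(PAGE_SIZE - 1))
        #define PAGE_ALIGN(addr)        (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
        #define PAGE_ALIGN_DOWN(addr)   ((addr) & PAGE_MASK)

        int main(void)
        {
            unsigned long addr = 0x12345;

            /* rounds up to 0x13000, down to 0x12000 */
            printf("PAGE_ALIGN(%#lx)      = %#lx\n", addr, PAGE_ALIGN(addr));
            printf("PAGE_ALIGN_DOWN(%#lx) = %#lx\n", addr, PAGE_ALIGN_DOWN(addr));
            return 0;
        }
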
      Suggested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: David Gow <davidgow@google.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Richard Weinberger <richard@nod.at>
  8. 04 Jul 2022, 4 commits
  9. 17 Jun 2022, 2 commits
    • mm/memory-failure: disable unpoison once hw error happens · 67f22ba7
      Committed by zhenwei pi
      Currently unpoison_memory(unsigned long pfn) is designed for soft
      poison (hwpoison-inject) only.  Since commit 17fae129, the KPTE gets
      cleared on an x86 platform once hardware memory corruption occurs.
      
      Unpoisoning a hardware-corrupted page only puts the page back into the
      buddy allocator; the kernel then has a chance to access the page through
      a *NOT PRESENT* KPTE.  This leads to a BUG when the corrupted KPTE is
      accessed.
      
      Suggested by David&Naoya, disable unpoison mechanism when a real HW error
      happens to avoid BUG like this:
      
       Unpoison: Software-unpoisoned page 0x61234
       BUG: unable to handle page fault for address: ffff888061234000
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
       Oops: 0002 [#1] PREEMPT SMP NOPTI
       CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G   M       OE     5.18.0.bm.1-amd64 #7
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
       RIP: 0010:clear_page_erms+0x7/0x10
       Code: ...
       RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
       RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
       RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
       R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
       R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
       FS:  00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        <TASK>
        prep_new_page+0x151/0x170
        get_page_from_freelist+0xca0/0xe20
        ? sysvec_apic_timer_interrupt+0xab/0xc0
        ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
        __alloc_pages+0x17e/0x340
        __folio_alloc+0x17/0x40
        vma_alloc_folio+0x84/0x280
        __handle_mm_fault+0x8d4/0xeb0
        handle_mm_fault+0xd5/0x2a0
        do_user_addr_fault+0x1d0/0x680
        ? kvm_read_and_reset_apf_flags+0x3b/0x50
        exc_page_fault+0x78/0x170
        asm_exc_page_fault+0x27/0x30
      
      Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
      Fixes: 847ce401 ("HWPOISON: Add unpoisoning support")
      Fixes: 17fae129 ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
      Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[5.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: re-allow pinning of zero pfns · 034e5afa
      Committed by Alex Williamson
      The commit referenced below subtly and inadvertently changed the logic to
      disallow pinning of zero pfns.  This breaks device assignment with vfio
      and potentially various other users of gup.  Exclude the zero page test
      from the negation.
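
      A sketch of the shape of that fix, as a model with stand-in helpers
      rather than the real is_pinnable_page():

        #include <stdbool.h>

        /* The zero-page test must sit outside the negated movable/CMA test
         * so that zero pfns stay pinnable; the broken form negated it too. */
        static bool pinnable(bool is_zero_pfn, bool in_movable_zone, bool is_cma)
        {
            /* broken: return !(is_zero_pfn || in_movable_zone || is_cma); */
            return is_zero_pfn || !(in_movable_zone || is_cma);
        }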
      
      Link: https://lkml.kernel.org/r/165490039431.944052.12458624139225785964.stgit@omen
      Fixes: 1c563432 ("mm: fix is_pinnable_page against a cma page")
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reported-by: Yishai Hadas <yishaih@nvidia.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Yi Liu <yi.l.liu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  10. 28 May 2022, 1 commit
    • mm: fix is_pinnable_page against a cma page · 1c563432
      Committed by Minchan Kim
      Pages in the CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so
      the current is_pinnable_page() could miss CMA pages which have
      MIGRATE_ISOLATE.  It ends up pinning CMA pages as longterm for the
      pin_user_pages() API so CMA allocations keep failing until the pin is
      released.
      
           CPU 0                                   CPU 1 - Task B
      
      cma_alloc
      alloc_contig_range
                                              pin_user_pages_fast(FOLL_LONGTERM)
      change pageblock as MIGRATE_ISOLATE
                                              internal_get_user_pages_fast
                                              lockless_pages_from_mm
                                              gup_pte_range
                                              try_grab_folio
                                              is_pinnable_page
                                                return true;
                                              So, pinned the page successfully.
      page migration failure with pinned page
                                              ..
                                              .. After 30 sec
                                              unpin_user_page(page)
      
      CMA allocation succeeded after 30 sec.
      
      The CMA allocation path protects against the migration type change race
      using zone->lock, but all the GUP path needs to know is whether the page
      is in the CMA area, not its exact migration type.  Thus, we don't need
      zone->lock; just check whether the migration type is either
      MIGRATE_ISOLATE or MIGRATE_CMA.
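
      A sketch of that check, using a stand-in migratetype enum rather than
      the real page/pageblock helpers:

        #include <stdbool.h>

        /* stand-in for the kernel's migratetype values */
        enum migratetype { MIGRATE_MOVABLE, MIGRATE_CMA, MIGRATE_ISOLATE };

        /* Treat both CMA and isolated pageblocks as not long-term pinnable,
         * without needing zone->lock for an exact answer. */
        static bool pageblock_longterm_pinnable(enum migratetype type)
        {
            return type != MIGRATE_CMA && type != MIGRATE_ISOLATE;
        }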
      
      Adding the MIGRATE_ISOLATE check in is_pinnable_page() could cause
      pinning to be rejected for pages on MIGRATE_ISOLATE pageblocks even when
      the page is in neither the CMA area nor the movable zone but is only
      temporarily unmovable.  However, such a migration failure caused by an
      unexpected temporary refcount hold is a general issue, not one specific
      to MIGRATE_ISOLATE, and MIGRATE_ISOLATE is a transient state just like
      other temporarily elevated refcount problems.
      
      Link: https://lkml.kernel.org/r/20220524171525.976723-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  11. 19 May 2022, 1 commit
    • random: move randomize_page() into mm where it belongs · 5ad7dd88
      Committed by Jason A. Donenfeld
      randomize_page is an mm function. It is documented like one. It contains
      the history of one. It has the naming convention of one. It looks
      just like another very similar function in mm, randomize_stack_top().
      And it has always been maintained and updated by mm people. There is no
      need for it to be in random.c. In the "which shape does not look like
      the other ones" test, pointing to randomize_page() is correct.
      
      So move randomize_page() into mm/util.c, right next to the similar
      randomize_stack_top() function.
      
      This commit contains no actual code changes.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  12. 13 May 2022, 3 commits
    • mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0
      Committed by Peter Xu
      As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
      if unmapping an entire vma or synchronized such that faults can not race
      with the unmap operation.  This requires passing zap_flags all the way to
      the lowest level hugetlb unmap routine: __unmap_hugepage_range.
      
      In general, unmap calls originated in hugetlbfs code will pass the
      ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
      faults.  The exception is hole punch which will first unmap without any
      synchronization.  Later when hole punch actually removes the page from the
      file, it will check to see if there was a subsequent fault and if so take
      the hugetlb fault mutex while unmapping again.  This second unmap will
      pass in ZAP_FLAG_DROP_MARKER.
      
      The justification of "whether to apply ZAP_FLAG_DROP_MARKER flag when
      unmap a hugetlb range" is (IMHO): we should never reach a state when a
      page fault could errornously fault in a page-cache page that was
      wr-protected to be writable, even in an extremely short period.  That
      could happen if e.g.  we pass ZAP_FLAG_DROP_MARKER when
      hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
      faults after that call and before remove_inode_hugepages() is executed,
      the page cache can be mapped writable again in the small racy window, that
      can cause unexpected data overwritten.
      
      [peterx@redhat.com: fix sparse warning]
        Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
      [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
      Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: persist uffd-wp bit across zapping for file-backed · 999dad82
      Committed by Peter Xu
      File-backed memory is prone to being unmapped at any time.  It means all
      information in the pte will be dropped, including the uffd-wp flag.
      
      To persist the uffd-wp flag, we'll use the pte markers.  This patch
      teaches the zap code to understand uffd-wp and know when to keep or drop
      the uffd-wp bit.
      
      Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we
      don't want to persist such information, for example, when destroying
      the whole vma, or punching a hole in a shmem file.  For the remaining
      cases we should never drop the uffd-wp bit, or the wr-protect
      information will get lost.
      
      The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than
      memory.c because it'll be further referenced in hugetlb files later.
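
      A sketch of the intended behaviour, with a stand-in zap_flags_t and an
      illustrative bit value, not the real mm internals:

        #include <stdbool.h>

        typedef unsigned int zap_flags_t;                /* stand-in typedef */
        #define ZAP_FLAG_DROP_MARKER    ((zap_flags_t)1) /* bit chosen for illustration */

        /* Keep the uffd-wp marker across zapping unless the caller explicitly
         * asked to drop it (e.g. the whole VMA is going away). */
        static bool keep_uffd_wp_marker(bool pte_is_uffd_wp_marker,
                                        zap_flags_t zap_flags)
        {
            if (!pte_is_uffd_wp_marker)
                return false;
            return !(zap_flags & ZAP_FLAG_DROP_MARKER);
        }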
      
      Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: use mmu_gather · 4a18419f
      Committed by Nadav Amit
      Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
      
      This patchset is intended to remove unnecessary TLB flushes during
      mprotect() syscalls.  Once this patch-set makes it through, similar and
      further optimizations for MADV_COLD and userfaultfd would be possible.
      
      Basically, there are 3 optimizations in this patch-set:
      
      1. Use TLB batching infrastructure to batch flushes across VMAs and do
         better/fewer flushes.  This would also be handy for later userfaultfd
         enhancements.
      
      2. Avoid unnecessary TLB flushes.  This optimization is the one that
         provides most of the performance benefits.  Unlike previous versions,
         we now only avoid flushes that would not result in spurious
         page-faults.
      
      3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
         prevent the A/D bits from changing.
      
      Andrew asked for some benchmark numbers.  I do not have an easy
      determinate macrobenchmark in which it is easy to show benefit.  I
      therefore ran a microbenchmark: a loop that does the following on
      anonymous memory, just as a sanity check to see that time is saved by
      avoiding TLB flushes.  The loop goes:
      
      	mprotect(p, PAGE_SIZE, PROT_READ)
      	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
      	*p = 0; // make the page writable
      
      The test was run in KVM guest with 1 or 2 threads (the second thread was
      busy-looping).  I measured the time (cycles) of each operation:
      
      		1 thread		2 threads
      		mmots	+patch		mmots	+patch
      PROT_READ	3494	2725 (-22%)	8630	7788 (-10%)
      PROT_READ|WRITE	3952	2724 (-31%)	9075	2865 (-68%)
      
      [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
      
      The exact numbers are really meaningless, but the benefit is clear.  There
      are 2 interesting results though.  
      
      (1) PROT_READ is cheaper, even though one might expect it not to be
      affected.  This is presumably due to the TLB miss that is saved.
      
      (2) Without memory access (*p = 0), the speedup of the patch is even
      greater.  In that scenario mprotect(PROT_READ) also avoids the TLB flush. 
      As a result both operations on the patched kernel take roughly ~1500
      cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
      high as presented in the table.
      
      
      This patch (of 3):
      
      change_pXX_range() currently does not use mmu_gather, but instead
      implements its own deferred TLB flushes scheme.  This both complicates the
      code, as developers need to be aware of different invalidation schemes,
      and prevents opportunities to avoid TLB flushes or perform them in finer
      granularity.
      
      The use of mmu_gather for modified PTEs has benefits in various scenarios
      even if pages are not released.  For instance, if only a single page needs
      to be flushed out of a range of many pages, only that page would be
      flushed.  If a THP page is flushed, on x86 a single TLB invlpg
      instruction can be used instead of 512 instructions (or a full TLB
      flush, which Linux would actually use by default).  mprotect() over
      multiple VMAs requires only a single flush.
      
      Use mmu_gather in change_pXX_range().  As the pages are not released, only
      record the flushed range using tlb_flush_pXX_range().
      
      Handle THP similarly and get rid of flush_cache_range() which becomes
      redundant since tlb_start_vma() calls it when needed.
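
      A user-space model of the batching idea, an illustration only: record
      the range actually touched and issue one deferred flush instead of
      flushing per PTE; flush_range() and the struct are stand-ins, not the
      kernel's mmu_gather API.

        #include <stdio.h>

        struct gather { unsigned long start, end; };

        static void flush_range(unsigned long start, unsigned long end)
        {
            printf("flush [%#lx, %#lx)\n", start, end);
        }

        /* Extend the recorded range to cover a modified PTE. */
        static void record(struct gather *g, unsigned long addr, unsigned long size)
        {
            if (g->start == g->end) {            /* nothing recorded yet */
                g->start = addr;
                g->end = addr + size;
                return;
            }
            if (addr < g->start)
                g->start = addr;
            if (addr + size > g->end)
                g->end = addr + size;
        }

        int main(void)
        {
            struct gather g = { 0, 0 };

            /* pretend three PTEs in the range were actually modified */
            record(&g, 0x1000, 0x1000);
            record(&g, 0x5000, 0x1000);
            record(&g, 0x6000, 0x1000);

            if (g.start != g.end)                /* single deferred flush */
                flush_range(g.start, g.end);
            return 0;
        }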
      
      Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
      Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  13. 10 May 2022, 3 commits
    • mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page · a7f22660
      Committed by David Hildenbrand
      Whenever GUP currently ends up taking a R/O pin on an anonymous page that
      might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
      on the page table entry will end up replacing the mapped anonymous page
      due to COW, resulting in the GUP pin no longer being consistent with the
      page actually mapped into the page table.
      
      The possible ways to deal with this situation are:
       (1) Ignore and pin -- what we do right now.
       (2) Fail to pin -- which would be rather surprising to callers and
           could break user space.
       (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
           pins.
      
      Let's implement 3) because it provides the clearest semantics and allows
      for checking in unpin_user_pages() and friends for possible BUGs: when
      trying to unpin a page that's no longer exclusive, clearly something went
      very wrong and might result in memory corruptions that might be hard to
      debug.  So we better have a nice way to spot such issues.
      
      This change implies that whenever user space *wrote* to a private mapping
      (IOW, we have an anonymous page mapped), that GUP pins will always remain
      consistent: reliable R/O GUP pins of anonymous pages.
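
      A sketch of when unsharing is triggered, with stand-in types rather than
      the real gup code:

        #include <stdbool.h>

        struct gup_page { bool anon; bool anon_exclusive; };

        /* An R/O FOLL_PIN on an anonymous page that is not marked exclusive
         * must first trigger unsharing (FAULT_FLAG_UNSHARE) rather than pin
         * the possibly shared page directly. */
        static bool need_unshare_before_pin(const struct gup_page *page,
                                            bool foll_pin, bool foll_write)
        {
            if (!foll_pin || foll_write)
                return false;          /* write access already breaks COW */
            return page->anon && !page->anon_exclusive;
        }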
      
      As a side note, this commit fixes the COW security issue for hugetlb with
      FOLL_PIN as documented in:
        https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
      instead of FOLL_PIN.
      
      Note that follow_huge_pmd() doesn't apply because we cannot end up in
      there with FOLL_PIN.
      
      This commit is heavily based on prototype patches by Andrea.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      a7f22660
    • D
      mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() · fb3d824d
      Committed by David Hildenbrand
      ...  and move the special check for pinned pages into
      page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
      via a new pageflag, clearing it only after making sure that there are no
      GUP pins on the anonymous page.
      
      We really only care about pins on anonymous pages, because they are prone
      to getting replaced in the COW handler once mapped R/O.  For !anon pages
      in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really have to
      care about that; at least I could not come up with an example where it
      would matter.
      
      Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
      know we're dealing with anonymous pages.  Also, drop the handling of
      pinned pages from copy_huge_pud() and add a comment in case anonymous
      pages are ever supported on the PUD level.
      
      This is a preparation for tracking exclusivity of anonymous pages in the
      rmap code, and disallowing marking a page shared (-> failing to duplicate)
      if there are GUP pins on a page.
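
      A condensed sketch of the resulting split, assuming the helper shapes this
      patch introduces (VM_BUG_ON sanity checks and kerneldoc omitted):

      	static inline void page_dup_file_rmap(struct page *page, bool compound)
      	{
      		atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
      	}

      	/* May fail: if the anon page may be pinned, fork must copy it instead. */
      	static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
      						 struct vm_area_struct *vma)
      	{
      		if (unlikely(page_needs_cow_for_dma(vma, page)))
      			return -EBUSY;
      		atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
      		return 0;
      	}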
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      fb3d824d
    • D
      mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() · 623a1ddf
      Committed by David Hildenbrand
      Let's do it just like copy_page_range(), taking the seqlock and making
      sure the mmap_lock is held in write mode.
      
      This allows adding a VM_BUG_ON to page_needs_cow_for_dma() and properly
      synchronizes concurrent fork() with GUP-fast for hugetlb pages, which will
      be relevant for further changes.
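
      In outline -- a hedged sketch rather than the exact diff -- the pattern
      borrowed from copy_page_range() looks like this, with src being the source
      mm_struct and cow indicating a COW mapping:

      	if (cow) {
      		mmap_assert_write_locked(src);
      		raw_write_seqcount_begin(&src->write_protect_seq);
      	}
      	/* ... copy the hugetlb page table entries ... */
      	if (cow)
      		raw_write_seqcount_end(&src->write_protect_seq);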
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-3-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      623a1ddf
  14. 29 Apr 2022, 5 commits
    • J
      mm/sparse-vmemmap: improve memory savings for compound devmaps · 4917f55b
      Committed by Joao Martins
      A compound devmap is a dev_pagemap with @vmemmap_shift > 0, meaning that
      pages are mapped at a given huge page alignment and use compound pages,
      as opposed to order-0 pages.
      
      Take advantage of the fact that most tail pages look the same (except the
      first two) to minimize struct page overhead.  Allocate one page for the
      vmemmap area that contains the head page and a separate one for the next 64
      pages.  The rest of the subsections then reuse this tail vmemmap page to
      initialize the rest of the tail pages.
      
      Sections are arch-dependent (e.g. on x86 a section is 64M, 128M or 512M)
      and when initializing a compound devmap with a big enough @vmemmap_shift
      (e.g. a 1G PUD) it may cross multiple sections.  The vmemmap code needs to consult @pgmap
      so that multiple sections that all map the same tail data can refer back
      to the first copy of that data for a given gigantic page.
      
      On compound devmaps with 2M alignment, this mechanism saves 6 of the 8
      vmemmap pages needed to map a subsection's 512 struct pages.  On a 1G
      compound devmap it saves 4094 pages.
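
      For reference, a back-of-the-envelope version of those numbers, assuming
      4K base pages and a 64-byte struct page:

      	2M compound page:  512 struct pages * 64 B = 32 KiB  =    8 vmemmap pages
      	                   -> keep 1 head + 1 tail vmemmap page, save 6
      	1G compound page:  262144 struct pages * 64 B = 16 MiB = 4096 vmemmap pages
      	                   -> keep 2 and reuse the tail page for the rest, save 4094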
      
      Altmap isn't supported yet, given various restrictions in the altmap pfn
      allocator, so fall back to the already used vmemmap_populate().  It is
      worth noting that altmap for devmap mappings was there to relieve the
      pressure of inordinate amounts of memmap space to map terabytes of pmem. 
      With compound pages the motivation for altmaps for pmem gets reduced.
      
      Link: https://lkml.kernel.org/r/20220420155310.9712-5-joao.m.martins@oracle.com
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      4917f55b
    • J
      mm/sparse-vmemmap: add a pgmap argument to section activation · e3246d8f
      Committed by Joao Martins
      Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9.
      
      This series minimizes 'struct page' overhead by pursuing a similar
      approach to Muchun Song's series "Free some vmemmap pages of hugetlb page"
      (merged since v5.14), but applied to devmaps with @vmemmap_shift
      (device-dax).
      
      The original vmemmap deduplication idea (already used in HugeTLB) is to
      reuse/deduplicate tail-page vmemmap areas, in particular the area which only
      describes tail pages.  So a vmemmap page describes 64 struct pages, and
      the first page for a given ZONE_DEVICE vmemmap would contain the head page
      and 63 tail pages.  The second vmemmap page would contain only tail pages,
      and that's what gets reused across the rest of the subsection/section. 
      The bigger the page size, the bigger the savings (2M hpage -> save 6
      vmemmap pages; 1G hpage -> save 4094 vmemmap pages).  
      
      This is done for PMEM /specifically only/ on device-dax configured
      namespaces, not fsdax.  In other words, a devmap with a @vmemmap_shift.
      
      In terms of savings, per 1Tb of memory, the struct page cost would go down
      with compound devmap:
      
      * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of
        total memory)
      
      * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of
        total memory)
      
      The series is mostly summed up by patch 4, and to summarize what the
      series does:
      
      Patches 1 - 3: Minor cleanups in preparation for patch 4.  Move the very
      nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry.
      
      Patch 4: Patch 4 is the one that takes care of the struct page savings
      (also referred to here as tail-page/vmemmap deduplication).  Much like
      Muchun's series, we reuse the second PTE tail-page vmemmap area across a
      given @vmemmap_shift.  One important difference, though, is that contrary
      to the hugetlbfs series, there is no pre-existing vmemmap for the area
      because we are late-populating it, as opposed to remapping a system-RAM
      range.  IOW, there is no freeing of pages of already-initialized vmemmap
      as in the hugetlbfs case, which greatly simplifies the logic (besides not
      being arch-specific).  The altmap case is unchanged and still goes via
      vmemmap_populate().  Also adjust the newly added docs to the device-dax
      case.
      
      [Note that device-dax is still a little behind HugeTLB in terms of
      savings.  I have an additional simple patch that reuses the head vmemmap
      page too, as a follow-up.  That will double the savings and further
      improve namespace initialization.]
      
      Patch 5: Initialize fewer struct pages depending on the page size, for
      DRAM-backed struct pages -- because fewer pages are unique and most are
      tail pages (with a bigger vmemmap_shift).
      
          NVDIMM namespace bootstrap improves from ~268-358 ms to
          ~80-110 ms / <1 ms on 128G NVDIMMs with 2M and 1G respectively.  And
          the needed struct page capacity will be 3.8x / 1071x smaller for 2M
          and 1G respectively.  Tested on x86 with 1.5Tb of pmem (including
          pinning, and RDMA registration/deregistration scalability with 2M MRs)
      
      
      This patch (of 5):
      
      In support of using compound pages for devmap mappings, plumb the pgmap
      down to the vmemmap_populate implementation.  Note that while altmap is
      retrievable from pgmap, the memory hotplug code passes altmap without
      pgmap[*], so both need to be independently plumbed.
      
      So in addition to @altmap, pass @pgmap to sparse section populate
      functions namely:
      
      	sparse_add_section
      	  section_activate
      	    populate_section_memmap
         	      __populate_section_memmap
      
      Passing @pgmap allows __populate_section_memmap() both to fetch the
      vmemmap_shift for which memmap metadata is created and to let
      sparse-vmemmap fetch pgmap ranges to correlate with a given section and
      decide whether to just reuse tail pages from previously onlined sections.
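
      Sketched (and slightly simplified) declaration after this change, showing
      the new argument next to @altmap:

      	struct page *__populate_section_memmap(unsigned long pfn,
      			unsigned long nr_pages, int nid,
      			struct vmem_altmap *altmap, struct dev_pagemap *pgmap);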
      
      While at it, fix the kdoc for @altmap for sparse_add_section().
      
      [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
      
      Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com
      Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      e3246d8f
    • M
      mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP* · 47010c04
      Committed by Muchun Song
      The word "free" is not expressive enough to describe the feature of
      optimizing the vmemmap pages associated with each HugeTLB page, so rename
      this keyword to "optimize".  In this patch, clean up the configs to make
      the code more expressive.
      
      Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      47010c04
    • M
      mm: simplify follow_invalidate_pte() · 0e5e64c0
      Committed by Muchun Song
      The only user (DAX) of the range and pmdpp parameters of
      follow_invalidate_pte() is gone, so it is safe to remove them and make the
      function static to simplify the code.  This effectively reverts the
      following commits:
      
        09796395 ("mm: add follow_pte_pmd()")
        a4d1a885 ("dax: update to new mmu_notifier semantic")
      
      There is only one caller of follow_invalidate_pte(), so just fold it
      into follow_pte() and remove it.
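
      After the fold, the only helper left for callers is follow_pte(); a hedged
      sketch of its (simplified) declaration:

      	int follow_pte(struct mm_struct *mm, unsigned long address,
      		       pte_t **ptepp, spinlock_t **ptlp);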
      
      Link: https://lkml.kernel.org/r/20220403053957.10770-7-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      0e5e64c0
    • N
      Revert "mm/memory-failure.c: fix race with changing page compound again" · 2ba2b008
      Committed by Naoya Horiguchi
      Reverts commit 888af270 ("mm/memory-failure.c: fix race with changing
      page compound again") because now we fetch the page refcount under
      hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
      longer necessary.
      
      Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
      Signed-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Suggested-by: NMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      2ba2b008
  15. 22 Apr 2022, 1 commit
  16. 25 Mar 2022, 1 commit
  17. 23 Mar 2022, 1 commit
    • N
      userfaultfd: provide unmasked address on page-fault · 824ddc60
      Committed by Nadav Amit
      Userfaultfd is supposed to provide the full (i.e., unmasked) address of
      the faulting access back to userspace.  However, that has not been the
      case for quite some time.
      
      Even running "userfaultfd_demo" from the userfaultfd man page provides the
      wrong output (and contradicts the man page).  Notice that
      "UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
      not the first read address (0x7fc5e30b300f).
      
      	Address returned by mmap() = 0x7fc5e30b3000
      
      	fault_handler_thread():
      	    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
      	    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
      		(uffdio_copy.copy returned 4096)
      	Read address 0x7fc5e30b300f in main(): A
      	Read address 0x7fc5e30b340f in main(): A
      	Read address 0x7fc5e30b380f in main(): A
      	Read address 0x7fc5e30b3c0f in main(): A
      
      The exact address is useful for various reasons and specifically for
      prefetching decisions.  If it is known that the memory is populated by
      certain objects whose size is not page-aligned, then based on the faulting
      address, the uffd-monitor can decide whether to prefetch and prefault the
      adjacent page.
      
      This bug has been in the kernel for quite some time: since commit
      1a29d85e ("mm: use vmf->address instead of vmf->virtual_address"),
      which dates back to 2016.  A concern has been raised that existing
      userspace applications might rely on the old/wrong behavior in which the
      address is masked.  Therefore, it was suggested to provide the masked
      address unless the user explicitly asks for the exact address.
      
      Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
      userfaultfd to provide the exact address.  Add a new "real_address" field
      to vmf to hold the unmasked address.  Provide the address to userspace
      accordingly.
      
      Initialize real_address in various code-paths to be consistent with
      address, even when it is not used, to be on the safe side.
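
      A small, hedged userspace sketch (error handling mostly omitted) of opting
      in to the new feature and then relying on the unmasked address; names
      follow the uapi header <linux/userfaultfd.h>:

      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <sys/ioctl.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <linux/userfaultfd.h>

      	static int uffd_open_exact(void)
      	{
      		struct uffdio_api api = {
      			.api = UFFD_API,
      			.features = UFFD_FEATURE_EXACT_ADDRESS,	/* unmasked addresses */
      		};
      		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

      		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
      			return -1;
      		/* after UFFDIO_REGISTER, msg.arg.pagefault.address is exact */
      		return uffd;
      	}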
      
      [namit@vmware.com: initialize real_address on all code paths, per Jan]
        Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com
      [akpm@linux-foundation.org: fix typo in comment, per Jan]
      
      Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      824ddc60