1. 27 September 2022 (5 commits)
    • mm: remove the vma linked list · 763ecb03
      Committed by Liam R. Howlett
      Replace any vm_next use with vma_find().
      
      Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
      maple tree.
      
      Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
      the same time, alter the loop to be more compact.
      
      Now that free_pgtables() and unmap_vmas() take a maple tree as an
      argument, rearrange do_mas_align_munmap() to use the new tree to hold the
      vmas to remove.
      
      Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
      used to update the linked list.
      
      Drop linked list update from __insert_vm_struct().
      
      Rework validation of tree as it was depending on the linked list.
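
      For reference, a minimal sketch of the replacement pattern, assuming the
      VMA iterator API (VMA_ITERATOR()/for_each_vma()) added earlier in this
      series:

        /*
         * Before: walking the linked list.
         *
         *      for (vma = mm->mmap; vma; vma = vma->vm_next)
         *              ... process vma ...
         */

        /* After: walking the maple tree with the VMA iterator. */
        VMA_ITERATOR(vmi, mm, 0);
        struct vm_area_struct *vma;

        for_each_vma(vmi, vma) {
                /* ... process vma ... */
        }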
      
      [yang.lee@linux.alibaba.com: fix one kernel-doc comment]
        Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949
        Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.com
      Link: https://lkml.kernel.org/r/20220906194824.2110408-69-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      763ecb03
    • mm/demotion: update node_is_toptier to work with memory tiers · 467b171a
      Committed by Aneesh Kumar K.V
      With memory tier support we can have memory-only NUMA nodes in the top
      tier, from which we want to avoid promotion-tracking NUMA faults.  Update
      node_is_toptier to work with memory tiers.  All NUMA nodes are by default
      top tier nodes.  With lower (slower) memory tiers added, we consider all
      memory tiers above a memory tier having CPU NUMA nodes as a top memory
      tier.
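
      Conceptually, the check becomes a comparison against a top-tier cutoff
      instead of a plain "node has CPUs" test; a simplified sketch whose field
      and variable names only approximate the memory-tiers implementation:

        static inline bool node_is_toptier(int node)
        {
                struct memory_tier *memtier;
                bool toptier;

                rcu_read_lock();
                memtier = rcu_dereference(NODE_DATA(node)->memtier);
                if (!memtier)
                        /* No tier assigned yet: top tier by default. */
                        toptier = true;
                else
                        /* Top tier: at or above the tier holding CPU nodes. */
                        toptier = memtier->adistance_start <= top_tier_adistance;
                rcu_read_unlock();

                return toptier;
        }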
      
      [sj@kernel.org: include missed header file, memory-tiers.h]
        Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
      [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
      [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
        Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      467b171a
    • mm: multi-gen LRU: groundwork · ec1c86b2
      Committed by Yu Zhao
      Evictable pages are divided into multiple generations for each lruvec.
      The youngest generation number is stored in lrugen->max_seq for both
      anon and file types as they are aged on an equal footing. The oldest
      generation numbers are stored in lrugen->min_seq[] separately for anon
      and file types as clean file pages can be evicted regardless of swap
      constraints. These three variables are monotonically increasing.
      
      Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
      in order to fit into the gen counter in folio->flags. Each truncated
      generation number is an index to lrugen->lists[]. The sliding window
      technique is used to track at least MIN_NR_GENS and at most
      MAX_NR_GENS generations. The gen counter stores a value within [1,
      MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
      stores 0.
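
      Concretely, the truncation is just a modulo by the number of generations;
      a short sketch using the constants this series defines:

        /* Values as defined by this series. */
        #define MIN_NR_GENS     2U
        #define MAX_NR_GENS     4U

        /* Map a monotonically increasing seq to an index into lrugen->lists[]. */
        static inline int lru_gen_from_seq(unsigned long seq)
        {
                return seq % MAX_NR_GENS;
        }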
      
      There are two conceptually independent procedures: "the aging", which
      produces young generations, and "the eviction", which consumes old
      generations.  They form a closed-loop system, i.e., "the page reclaim". 
      Both procedures can be invoked from userspace for the purposes of working
      set estimation and proactive reclaim.  These techniques are commonly used
      to optimize job scheduling (bin packing) in data centers [1][2].
      
      To avoid confusion, the terms "hot" and "cold" will be applied to the
      multi-gen LRU, as a new convention; the terms "active" and "inactive" will
      be applied to the active/inactive LRU, as usual.
      
      The protection of hot pages and the selection of cold pages are based
      on page access channels and patterns. There are two access channels:
      one through page tables and the other through file descriptors. The
      protection of the former channel is by design stronger because:
      1. The uncertainty in determining the access patterns of the former
         channel is higher due to the approximation of the accessed bit.
      2. The cost of evicting the former channel is higher due to the TLB
         flushes required and the likelihood of encountering the dirty bit.
      3. The penalty of underprotecting the former channel is higher because
         applications usually do not prepare themselves for major page
         faults like they do for blocked I/O. E.g., GUI applications
         commonly use dedicated I/O threads to avoid blocking rendering
         threads.
      
      There are also two access patterns: one with temporal locality and the
      other without.  For the reasons listed above, the former channel is
      assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
      present; the latter channel is assumed to follow the latter pattern unless
      outlying refaults have been observed [3][4].
      
      The next patch will address the "outlying refaults".  Three macros, i.e.,
      LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
      this patch to make the entire patchset less diffy.
      
      A page is added to the youngest generation on faulting.  The aging needs
      to check the accessed bit at least twice before handing this page over to
      the eviction.  The first check takes care of the accessed bit set on the
      initial fault; the second check makes sure this page has not been used
      since then.  This protocol, AKA second chance, requires a minimum of two
      generations, hence MIN_NR_GENS.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      [3] https://lwn.net/Articles/495543/
      [4] https://lwn.net/Articles/815342/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec1c86b2
    • mm: x86, arm64: add arch_has_hw_pte_young() · e1fd09e3
      Committed by Yu Zhao
      Patch series "Multi-Gen LRU Framework", v14.
      
      What's new
      ==========
      1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
         Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
      2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
         machines. The old direct reclaim backoff, which tries to enforce a
         minimum fairness among all eligible memcgs, over-swapped by about
         (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
         pulls the plug on swapping once the target is met, trades some
         fairness for curtailed latency:
         https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
      3. Fixed minor build warnings and conflicts. More comments and nits.
      
      TLDR
      ====
      The current page reclaim is too expensive in terms of CPU usage and it
      often makes poor choices about what to evict. This patchset offers an
      alternative solution that is performant, versatile and
      straightforward.
      
      Patchset overview
      =================
      The design and implementation overview is in patch 14:
      https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
      
      01. mm: x86, arm64: add arch_has_hw_pte_young()
      02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
      Take advantage of hardware features when trying to clear the accessed
      bit in many PTEs.
      
      03. mm/vmscan.c: refactor shrink_node()
      04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
          its sole caller"
      Minor refactors to improve readability for the following patches.
      
      05. mm: multi-gen LRU: groundwork
      Adds the basic data structure and the functions that insert pages to
      and remove pages from the multi-gen LRU (MGLRU) lists.
      
      06. mm: multi-gen LRU: minimal implementation
      A minimal implementation without optimizations.
      
      07. mm: multi-gen LRU: exploit locality in rmap
      Exploits spatial locality to improve efficiency when using the rmap.
      
      08. mm: multi-gen LRU: support page table walks
      Further exploits spatial locality by optionally scanning page tables.
      
      09. mm: multi-gen LRU: optimize multiple memcgs
      Optimizes the overall performance for multiple memcgs running mixed
      types of workloads.
      
      10. mm: multi-gen LRU: kill switch
      Adds a kill switch to enable or disable MGLRU at runtime.
      
      11. mm: multi-gen LRU: thrashing prevention
      12. mm: multi-gen LRU: debugfs interface
      Provide userspace with features like thrashing prevention, working set
      estimation and proactive reclaim.
      
      13. mm: multi-gen LRU: admin guide
      14. mm: multi-gen LRU: design doc
      Add an admin guide and a design doc.
      
      Benchmark results
      =================
      Independent lab results
      -----------------------
      Based on the popularity of searches [01] and the memory usage in
      Google's public cloud, the most popular open-source memory-hungry
      applications, in alphabetical order, are:
            Apache Cassandra      Memcached
            Apache Hadoop         MongoDB
            Apache Spark          PostgreSQL
            MariaDB (MySQL)       Redis
      
      An independent lab evaluated MGLRU with the most widely used benchmark
      suites for the above applications. They posted 960 data points along
      with kernel metrics and perf profiles collected over more than 500
      hours of total benchmark time. Their final reports show that, with 95%
      confidence intervals (CIs), the above applications all performed
      significantly better for at least part of their benchmark matrices.
      
      On 5.14:
      1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
         less wall time to sort three billion random integers, respectively,
         under the medium- and the high-concurrency conditions, when
         overcommitting memory. There were no statistically significant
         changes in wall time for the rest of the benchmark matrix.
      2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
         more transactions per minute (TPM), respectively, under the medium-
         and the high-concurrency conditions, when overcommitting memory.
         There were no statistically significant changes in TPM for the rest
         of the benchmark matrix.
      3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
         and [21.59, 30.02]% more operations per second (OPS), respectively,
         for sequential access, random access and Gaussian (distribution)
         access, when THP=always; 95% CIs [13.85, 15.97]% and
         [23.94, 29.92]% more OPS, respectively, for random access and
         Gaussian access, when THP=never. There were no statistically
         significant changes in OPS for the rest of the benchmark matrix.
      4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
         [2.16, 3.55]% more operations per second (OPS), respectively, for
         exponential (distribution) access, random access and Zipfian
         (distribution) access, when underutilizing memory; 95% CIs
         [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
         respectively, for exponential access, random access and Zipfian
         access, when overcommitting memory.
      
      On 5.15:
      5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
         and [4.11, 7.50]% more operations per second (OPS), respectively,
         for exponential (distribution) access, random access and Zipfian
         (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
         [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
         exponential access, random access and Zipfian access, when swap was
         on.
      6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
         less average wall time to finish twelve parallel TeraSort jobs,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in average wall time for the rest of the
         benchmark matrix.
      7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
         minute (TPM) under the high-concurrency condition, when swap was
         off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in TPM for the rest of the benchmark matrix.
      8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
         [11.47, 19.36]% more total operations per second (OPS),
         respectively, for sequential access, random access and Gaussian
         (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
         [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
         for sequential access, random access and Gaussian access, when
         THP=never.
      
      Our lab results
      ---------------
      To supplement the above results, we ran the following benchmark suites
      on 5.16-rc7 and found no regressions [10].
            fs_fio_bench_hdd_mq      pft
            fs_lmbench               pgsql-hammerdb
            fs_parallelio            redis
            fs_postmark              stream
            hackbench                sysbenchthread
            kernbench                tpcc_spark
            memcached                unixbench
            multichase               vm-scalability
            mutilate                 will-it-scale
            nginx
      
      [01] https://trends.google.com
      [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
      [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
      [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
      [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
      [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
      [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
      [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
      [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
      [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
      
      Real-world applications
      =======================
      Third-party testimonials
      ------------------------
      Konstantin reported [11]:
         I have Archlinux with 8G RAM + zswap + swap. While developing, I
         have lots of apps opened such as multiple LSP-servers for different
         langs, chats, two browsers, etc... Usually, my system gets quickly
         to a point of SWAP-storms, where I have to kill LSP-servers,
         restart browsers to free memory, etc, otherwise the system lags
         heavily and is barely usable.
         
         1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
         patchset, and I started up by opening lots of apps to create memory
         pressure, and worked for a day like this. Till now I had not a
         single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
         getting to the point of 3G in SWAP before without a single
         SWAP-storm.
      
      Vaibhav from IBM reported [12]:
         In a synthetic MongoDB Benchmark, seeing an average of ~19%
         throughput improvement on POWER10(Radix MMU + 64K Page Size) with
         MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
         three different request distributions, namely, Exponential, Uniform
         and Zipfan.
      
      Shuang from U of Rochester reported [13]:
         With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
         and [9.26, 10.36]% higher throughput, respectively, for random
         access, Zipfian (distribution) access and Gaussian (distribution)
         access, when the average number of jobs per CPU is 1; 95% CIs
         [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
         throughput, respectively, for random access, Zipfian access and
         Gaussian access, when the average number of jobs per CPU is 2.
      
      Daniel from Michigan Tech reported [14]:
         With Memcached allocating ~100GB of byte-addressable Optane,
         performance improvement in terms of throughput (measured as queries
         per second) was about 10% for a series of workloads.
      
      Large-scale deployments
      -----------------------
      We've rolled out MGLRU to tens of millions of ChromeOS users and
      about a million Android users. Google's fleetwide profiling [15] shows
      an overall 40% decrease in kswapd CPU usage, in addition to
      improvements in other UX metrics, e.g., an 85% decrease in the number
      of low-memory kills at the 75th percentile and an 18% decrease in
      app launch time at the 50th percentile.
      
      The downstream kernels that have been using MGLRU include:
      1. Android [16]
      2. Arch Linux Zen [17]
      3. Armbian [18]
      4. ChromeOS [19]
      5. Liquorix [20]
      6. OpenWrt [21]
      7. post-factum [22]
      8. XanMod [23]
      
      [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
      [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
      [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
      [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
      [15] https://dl.acm.org/doi/10.1145/2749469.2750392
      [16] https://android.com
      [17] https://archlinux.org
      [18] https://armbian.com
      [19] https://chromium.org
      [20] https://liquorix.net
      [21] https://openwrt.org
      [22] https://codeberg.org/pf-kernel
      [23] https://xanmod.org
      
      Summary
      =======
      The facts are:
      1. The independent lab results and the real-world applications
         indicate substantial improvements; there are no known regressions.
      2. Thrashing prevention, working set estimation and proactive reclaim
         work out of the box; there are no equivalent solutions.
      3. There is a lot of new code; no smaller changes have demonstrated
         similar effects.
      
      Our options, accordingly, are:
      1. Given the amount of evidence, the reported improvements will likely
         materialize for a wide range of workloads.
      2. Gauging the interest from the past discussions, the new features
         will likely be put to use for both personal computers and data
         centers.
      3. Based on Google's track record, the new code will likely be well
         maintained in the long term. It'd be more difficult if not
         impossible to achieve similar effects with other approaches.
      
      
      This patch (of 14):
      
      Some architectures automatically set the accessed bit in PTEs, e.g., x86
      and arm64 v8.2.  On architectures that do not have this capability,
      clearing the accessed bit in a PTE usually triggers a page fault following
      the TLB miss of this PTE (to emulate the accessed bit).
      
      Being aware of this capability can help make better decisions, e.g.,
      whether to spread the work out over a period of time to reduce bursty page
      faults when trying to clear the accessed bit in many PTEs.
      
      Note that theoretically this capability can be unreliable, e.g.,
      hotplugged CPUs might be different from builtin ones.  Therefore it should
      not be used in architecture-independent code that involves correctness,
      e.g., to determine whether TLB flushes are required (in combination with
      the accessed bit).
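
      A sketch of the interface shape added here: a generic fallback that
      architectures with a hardware-managed accessed bit override (the actual
      arch overrides are slightly more involved, e.g. arm64 checks a CPU
      capability):

        /* Generic fallback, include/linux/pgtable.h. */
        #ifndef arch_has_hw_pte_young
        /*
         * Return whether the CPU sets the accessed bit in PTEs automatically.
         * This is only an optimization hint and must not be used for
         * correctness decisions such as whether to flush the TLB.
         */
        static inline bool arch_has_hw_pte_young(void)
        {
                return false;
        }
        #endif

        /* x86 override: the accessed bit is always set in hardware. */
        #define arch_has_hw_pte_young arch_has_hw_pte_young
        static inline bool arch_has_hw_pte_young(void)
        {
                return true;
        }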
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Acked-by: Will Deacon <will@kernel.org>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e1fd09e3
    • mm: bring back update_mmu_cache() to finish_fault() · 70427f6e
      Committed by Sergei Antonov
      Running this test program on ARMv4 a few times (sometimes just once)
      reproduces the bug.
      
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      
      /* SIZE is not specified in the original report; any multi-page size works. */
      #define SIZE (4 * 4096)
      
      int main(void)
      {
              unsigned i;
              char paragon[SIZE];
              void *ptr;
      
              memset(paragon, 0xAA, SIZE);
              ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_SHARED, -1, 0);
              if (ptr == MAP_FAILED)
                      return 1;
              printf("ptr = %p\n", ptr);
              for (i = 0; i < 10000; i++) {
                      memset(ptr, 0xAA, SIZE);
                      if (memcmp(ptr, paragon, SIZE)) {
                              printf("Unexpected bytes on iteration %u!!!\n", i);
                              break;
                      }
              }
              munmap(ptr, SIZE);
              return 0;
      }
      
      In the "ptr" buffer there appear runs of zero bytes which are aligned
      by 16 and their lengths are multiple of 16.
      
      Linux v5.11 does not have the bug, "git bisect" finds the first bad commit:
      f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      
      Before the commit update_mmu_cache() was called during a call to
      filemap_map_pages() as well as finish_fault(). After the commit
      finish_fault() lacks it.
      
      Bring back update_mmu_cache() to finish_fault() to fix the bug.
      Also call update_mmu_tlb() only when returning VM_FAULT_NOPAGE to more
      closely reproduce the code of alloc_set_pte() function that existed before
      the commit.
      
      On many platforms update_mmu_cache() is nop:
       x86, see arch/x86/include/asm/pgtable
       ARMv6+, see arch/arm/include/asm/tlbflush.h
      So, it seems, few users ran into this bug.
      
      Link: https://lkml.kernel.org/r/20220908204809.2012451-1-saproj@gmail.com
      Fixes: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      Signed-off-by: Sergei Antonov <saproj@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      70427f6e
  2. 12 September 2022 (2 commits)
    • memory tiering: hot page selection with hint page fault latency · 33024536
      Committed by Huang Ying
      Patch series "memory tiering: hot page selection", v4.
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory nodes need to be identified.
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm for identifying hot pages, because pages with quite low access
      frequency may still be accessed eventually, given that the NUMA balancing
      page table scanning period can be quite long (e.g.  60 seconds).  So in
      this patchset we implement a new hot page identification algorithm based
      on the latency between the NUMA balancing page table scan and the hint
      page fault, which is a kind of most frequently accessed (MFU) algorithm.
      
      In NUMA balancing memory tiering mode, if there are hot pages in slow
      memory node and cold pages in fast memory node, we need to promote/demote
      hot/cold pages between the fast and cold memory nodes.
      
      A choice is to promote/demote as fast as possible.  But the CPU cycles and
      memory bandwidth consumed by the high promoting/demoting throughput will
      hurt the latency of some workloads, because of inflated access latency
      and contention for the slow memory's bandwidth.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patchset
      as the page promotion rate limit mechanism.
      
      The promotion hot threshold is workload and system configuration
      dependent.  So in this patchset, a method to adjust the hot threshold
      automatically is implemented.  The basic idea is to control the number of
      the candidate promotion pages to match the promotion rate limit.
      
      We used the pmbench memory accessing benchmark to test the patchset on a
      2-socket server system with DRAM and PMEM installed.  The test results
      are as follows,
      
      		pmbench score		promote rate
      		 (accesses/s)			MB/s
      		-------------		------------
      base		  146887704.1		       725.6
      hot selection     165695601.2		       544.0
      rate limit	  162814569.8		       165.2
      auto adjustment	  170495294.0                  136.9
      
      From the results above,
      
      With hot page selection patch [1/3], the pmbench score increases about
      12.8%, and promote rate (overhead) decreases about 25.0%, compared with
      base kernel.
      
      With rate limit patch [2/3], pmbench score decreases about 1.7%, and
      promote rate decreases about 69.6%, compared with hot page selection
      patch.
      
      With the threshold auto adjustment patch [3/3], the pmbench score
      increases about 4.7%, and the promote rate decreases about 17.1%,
      compared with the rate limit patch.
      
      Baolin helped to test the patchset with MySQL on a machine which contains
      1 DRAM node (30G) and 1 PMEM node (126G).
      
      sysbench /usr/share/sysbench/oltp_read_write.lua \
      ......
      --tables=200 \
      --table-size=1000000 \
      --report-interval=10 \
      --threads=16 \
      --time=120
      
      The tps can be improved about 5%.
      
      
      This patch (of 3):
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory node need to be identified.  Essentially,
      the original NUMA balancing implementation selects the most recently
      accessed (MRU) pages to promote.  But this isn't a perfect algorithm for
      identifying hot pages, because pages with quite low access frequency may
      still be accessed eventually, given that the NUMA balancing page table
      scanning period can be quite long (e.g.  60 seconds).  The most
      frequently accessed (MFU) algorithm is better.
      
      So, in this patch we implement a better hot page selection algorithm,
      based on NUMA balancing page table scanning and the hint page fault, as
      follows:
      
      - When the page tables of the processes are scanned to change PTE/PMD
        to be PROT_NONE, the current time is recorded in struct page as scan
        time.
      
      - When the page is accessed, a hint page fault will occur.  The scan
        time is read from the struct page, and the hint page fault latency
        is defined as
      
          hint page fault time - scan time
      
      The shorter the hint page fault latency of a page is, the more likely it
      is that its access frequency is high.  So the hint page fault latency is
      a better estimate of whether a page is hot or cold.
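
      A simplified sketch of the resulting hotness test; the helper and
      threshold names below are illustrative rather than the exact identifiers
      introduced by the patch:

        /*
         * Hint page fault latency: time between the NUMA balancing page table
         * scan (recorded in struct page) and the hint page fault it triggers.
         * A short latency means the page was touched soon after the scan,
         * i.e. it is likely hot.
         */
        static bool numa_page_is_hot(struct page *page, unsigned int threshold_ms)
        {
                unsigned int latency;

                /* page_scan_time() is an illustrative helper name. */
                latency = jiffies_to_msecs(jiffies) - page_scan_time(page);

                return latency < threshold_ms;
        }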
      
      It's hard to find some extra space in struct page to hold the scan time. 
      Fortunately, we can reuse some bits used by the original NUMA balancing.
      
      NUMA balancing uses some bits in struct page to store the accessing CPU
      and PID (see page_cpupid_xchg_last()).  These are used by the multi-stage
      node selection algorithm to avoid migrating pages that are accessed by
      several NUMA nodes back and forth.  But for pages in the slow memory
      node, even if they are accessed by multiple NUMA nodes, as long as the
      pages are hot, they need to be promoted to the fast memory node.  So the
      accessing CPU and PID information is unnecessary for the slow memory
      pages.  We can reuse these bits in struct page to record the scan time.
      For the fast memory pages, these bits are used as before.
      
      For the hot threshold, the default value is 1 second, which works well in
      our performance test.  All pages with hint page fault latency < hot
      threshold will be considered hot.
      
      It's hard for users to determine the hot threshold.  So we don't provide a
      kernel ABI to set it, just provide a debugfs interface for advanced users
      to experiment.  We will continue to work on a hot threshold automatic
      adjustment mechanism.
      
      The downside of the above method is that the response time to the workload
      hot spot changing may be much longer.  For example,
      
      - A previous cold memory area becomes hot
      
      - The hint page fault will be triggered.  But the hint page fault
        latency isn't shorter than the hot threshold.  So the pages will
        not be promoted.
      
      - When the memory area is scanned again, maybe after a scan period,
        the hint page fault latency measured will be shorter than the hot
        threshold and the pages will be promoted.
      
      To mitigate this, if there is enough free space in the fast memory node,
      the hot threshold is not used and all pages are promoted upon the hint
      page fault for fast response.
      
      Thanks to Zhong Jiang, who reported and tested the fix for a bug when
      disabling memory tiering mode dynamically.
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      33024536
    • mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() · a7f4e6e4
      Committed by Zach O'Keefe
      MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
      
      hugepage_vma_check() is the authority on determining if a VMA is eligible
      for THP allocation/collapse, and currently enforces the sysfs THP
      settings.  Add a flag to disable these checks.  For now, only apply this
      arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled. 
      We can expand this to shmem, which uses
      /sys/kernel/transparent_hugepage/shmem_enabled, later.
      
      Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
      passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
      VM_HUGEPAGE check in "madvise" THP mode.  Prior to "mm: khugepaged: check
      THP flag in hugepage_vma_check()", this check also didn't check "never"
      THP mode.  As such, this restores the previous behavior of
      collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
      comment in code for justification why this is OK.
      
      [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
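
      A sketch of how callers pass the new argument, assuming the post-series
      hugepage_vma_check(vma, vm_flags, smaps, in_pf, enforce_sysfs) signature
      (treat the exact parameter list as approximate):

        /* MADV_COLLAPSE: collapse regardless of the sysfs THP settings. */
        if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
                return SCAN_VMA_CHECK;

        /* khugepaged: keep honouring the sysfs THP settings. */
        if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true))
                return SCAN_VMA_CHECK;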
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a7f4e6e4
  3. 30 July 2022 (2 commits)
  4. 27 July 2022 (1 commit)
    • mm: fix NULL pointer dereference in wp_page_reuse() · cdb281e6
      Committed by Qi Zheng
      vmf->page can be NULL when wp_page_reuse() is invoked by
      wp_pfn_shared(), which causes the following panic:
      
        BUG: kernel NULL pointer dereference, address: 000000000000008
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP PTI
        CPU: 18 PID: 923 Comm: Xorg Not tainted 5.19.0-rc8.bm.1-amd64 #263
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g14
        RIP: 0010:_compound_head+0x0/0x40
        [...]
        Call Trace:
          wp_page_reuse+0x1c/0xa0
          do_wp_page+0x1a5/0x3f0
          __handle_mm_fault+0x8cf/0xd20
          handle_mm_fault+0xd5/0x2a0
          do_user_addr_fault+0x1d0/0x680
          exc_page_fault+0x78/0x170
          asm_exc_page_fault+0x22/0x30
      
      To fix it, this patch performs a NULL pointer check before dereferencing
      the vmf->page.
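
      The shape of the fix is simply a NULL guard before the struct page is
      used; a minimal sketch rather than the exact hunk:

        static inline void wp_page_reuse(struct vm_fault *vmf)
        {
                struct page *page = vmf->page;

                /*
                 * vmf->page is NULL when we are called from wp_pfn_shared(),
                 * so only touch the struct page when one actually exists.
                 */
                if (page)
                        page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);

                /* ... mark the existing PTE writable and return ... */
        }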
      
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdb281e6
  5. 19 July 2022 (1 commit)
    • mm: fix page leak with multiple threads mapping the same page · 3fe2895c
      Committed by Josef Bacik
      We have an application with a lot of threads that use a shared mmap backed
      by tmpfs mounted with -o huge=within_size.  This application started
      leaking loads of huge pages when we upgraded to a recent kernel.
      
      Using the page ref tracepoints and a BPF program written by Tejun Heo we
      were able to determine that these pages would have multiple refcounts from
      the page fault path, but when it came to unmap time we wouldn't drop the
      number of refs we had added from the faults.
      
      I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
      huge=always, and then spawned 20 threads all looping faulting random
      offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge
      page aligned ranges.  This very quickly reproduced the problem.
      
      The problem here is that we check for the case that we have multiple
      threads faulting in a range that was previously unmapped.  One thread maps
      the PMD, the other thread loses the race and then returns 0.  However at
      this point we already have the page, and we are no longer putting this
      page into the processes address space, and so we leak the page.  We
      actually did the correct thing prior to f9ce0be7, however it looks
      like Kirill copied what we do in the anonymous page case.  In the
      anonymous page case we don't yet have a page, so we don't have to drop a
      reference on anything.  Previously we did the correct thing for file based
      faults by returning VM_FAULT_NOPAGE so we correctly drop the reference on
      the page we faulted in.
      
      Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
      case, this makes us drop the ref on the page properly, and now my
      reproducer no longer leaks the huge pages.
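
      The essence of the fix in finish_fault(), sketched with the surrounding
      code elided:

        /*
         * Another thread won the race and already installed the PMD.
         * Returning VM_FAULT_NOPAGE (instead of 0) makes the fault path drop
         * the reference on the page that was faulted in, so it is no longer
         * leaked.
         */
        if (pmd_devmap_trans_unstable(vmf->pmd))
                return VM_FAULT_NOPAGE;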
      
      [josef@toxicpanda.com: v2]
        Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com
      Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com
      Fixes: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3fe2895c
  6. 18 July 2022 (3 commits)
    • hugetlb: lazy page table copies in fork() · bcd51a3c
      Committed by Mike Kravetz
      Lazy page table copying at fork time was introduced with commit
      d992895b ("[PATCH] Lazy page table copies in fork()").  At the time,
      hugetlb was very new and did not support page faulting.  As a result, it
      was excluded.  When full page fault support was added for hugetlb, the
      exclusion was not removed.
      
      Simply remove the check that prevents lazy copying of hugetlb page tables
      at fork.  Of course, like other mappings this only applies to shared
      mappings.
      
      Lazy page table copying at fork will be less advantageous for hugetlb
      mappings because:
      - There are fewer page table entries with hugetlb
      - hugetlb pmds can be shared instead of copied
      
      In any case, completely eliminating the copy at fork time should speed
      things up.
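
      Conceptually the change just drops VM_HUGETLB from the "must copy page
      tables eagerly at fork" test; a sketch, assuming the decision is made in
      a helper along the lines of vma_needs_copy():

        /*
         * Before this patch hugetlb VMAs were always copied eagerly:
         *
         *      if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
         *              return true;
         *
         * Now shared hugetlb mappings, like other shared mappings, are left
         * to be filled in lazily by page faults after fork().
         */
        if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                return true;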
      
      Link: https://lkml.kernel.org/r/20220621235620.291305-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bcd51a3c
    • mm: thp: kill __transhuge_page_enabled() · 7da4e2cb
      Committed by Yang Shi
      The page fault path checks THP eligibility with __transhuge_page_enabled(),
      which does much the same thing as hugepage_vma_check(), so use
      hugepage_vma_check() instead.
      
      However, the page fault path allows DAX and !anon_vma cases, so add a new
      flag, in_pf, to hugepage_vma_check() to make page faults work correctly.
      
      The in_pf flag is also used to skip shmem and file THP for page fault
      since shmem handles THP in its own shmem_fault() and file THP allocation
      on fault is not supported yet.
      
      Also remove hugepage_vma_enabled() since hugepage_vma_check() is the only
      caller now, it is not necessary to have a helper function.
      
      Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7da4e2cb
    • mm: handling Non-LRU pages returned by vm_normal_pages · 3218f871
      Committed by Alex Sierra
      With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
      device-managed anonymous pages that are not LRU pages.  Although they
      behave like normal pages for purposes of mapping in CPU page tables and
      for COW, they do not support LRU lists, NUMA migration or THP.
      
      Callers of follow_page() currently don't expect ZONE_DEVICE pages;
      however, with DEVICE_COHERENT we might now return them.  Check for
      ZONE_DEVICE pages in the applicable users of follow_page() as well.
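
      A sketch of the pattern added to follow_page() callers; the exact skip
      action depends on the caller:

        page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
        if (IS_ERR_OR_NULL(page))
                goto out;

        /*
         * Device-coherent (ZONE_DEVICE) pages are not on the LRU and do not
         * support NUMA migration or THP, so skip them here.
         */
        if (is_zone_device_page(page)) {
                put_page(page);
                goto out;
        }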
      
      Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
      Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>	[v2]
      Reviewed-by: Alistair Popple <apopple@nvidia.com>	[v6]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3218f871
  7. 04 July 2022 (1 commit)
  8. 17 June 2022 (1 commit)
    • mm: avoid unnecessary page fault retires on shared memory types · d9272525
      Committed by Peter Xu
      I observed that for each of the shared file-backed page faults, we're very
      likely to retry one more time for the 1st write fault upon no page.  It's
      because we'll need to release the mmap lock for dirty rate limit purpose
      with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
      
      Then after that throttling we return VM_FAULT_RETRY.
      
      We did that probably because VM_FAULT_RETRY is the only way we can return
      to the fault handler at that time telling it we've released the mmap lock.
      
      However that's not ideal because it's very likely the fault does not need
      to be retried at all since the pgtable was well installed before the
      throttling, so the next continuous fault (including taking mmap read lock,
      walk the pgtable, etc.) could be in most cases unnecessary.
      
      It not only slows down page faults for shared file-backed mappings, but
      also adds more mmap lock contention which is in most cases not needed at
      all.
      
      To observe this, one could try to write to some shmem page and look at
      "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
      shmem write simply because we retried, and vm event "pgfault" will capture
      that.
      
      To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
      show that we've completed the whole fault and released the lock.  It's also
      a hint that we should very possibly not need another fault immediately on
      this page because we've just completed it.
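
      In each architecture's fault handler the change boils down to one extra
      check after handle_mm_fault(); a sketch:

        fault = handle_mm_fault(vma, address, flags, regs);

        /*
         * The fault is fully completed and the mmap lock has already been
         * released by the core mm, so there is nothing left to do here.
         */
        if (fault & VM_FAULT_COMPLETED)
                return;

        /* ... the usual VM_FAULT_RETRY / error handling follows ... */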
      
      This patch provides a ~12% perf boost on my aarch64 test VM with a simple
      program sequentially dirtying a 400MB mmap()ed shmem file; these are the
      times it needs:
      
        Before: 650.980 ms (+-1.94%)
        After:  569.396 ms (+-1.38%)
      
      I believe it could help more than that.
      
      We need some special care on GUP and the s390 pgfault handler (for gmap
      code before returning from pgfault), the rest changes in the page fault
      handlers should be relatively straightforward.
      
      Another thing to mention is that mm_account_fault() does take this new
      fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
      
      I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
      not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
      them as-is.
      
      Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vineet Gupta <vgupta@kernel.org>
      Acked-by: Guo Ren <guoren@kernel.org>
      Acked-by: Max Filippov <jcmvbkbc@gmail.com>
      Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Will Deacon <will@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d9272525
  9. 02 June 2022 (1 commit)
  10. 28 May 2022 (1 commit)
    • mm/swapfile: unuse_pte can map random data if swap read fails · 9f186f9e
      Committed by Miaohe Lin
      Patch series "A few fixup patches for mm", v4.
      
      This series contains a few patches to avoid mapping random data if swap
      read fails and fix lost swap bits in unuse_pte.  Also we free hwpoison and
      swapin error entry in madvise_free_pte_range and so on.  More details can
      be found in the respective changelogs.  
      
      
      This patch (of 5):
      
      There is a bug in unuse_pte(): when the swap page happens to be
      unreadable, a page filled with random data is mapped into the user
      address space.  In case of error, a special swap entry indicating that
      the swap read failed is set in the page table.  So the swapcache page can
      be freed and the user won't end up with a permanently mounted swap
      because a sector is bad.  And if the page is accessed later, the user
      process will be killed so that corrupted data is never consumed.  On the
      other hand, if the page is never accessed, the user won't even notice it.
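
      A sketch of the resulting unuse_pte() behaviour, assuming the
      swapin-error entry helper introduced by this series (the helper name is
      approximate):

        if (unlikely(!PageUptodate(page))) {
                pte_t pteval;

                /*
                 * The swap read failed: install a "swapin error" swap entry
                 * instead of mapping a page full of random data.  A later
                 * access kills the process; if the page is never accessed,
                 * nothing happens, and the bad swap slot can be freed.
                 */
                dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
                pteval = swp_entry_to_pte(make_swapin_error_entry(page));
                set_pte_at(vma->vm_mm, addr, pte, pteval);
                swap_free(entry);
                goto out;
        }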
      
      Link: https://lkml.kernel.org/r/20220519125030.21486-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220519125030.21486-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Howells <dhowells@redhat.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9f186f9e
  11. 20 May 2022 (1 commit)
  12. 13 May 2022 (8 commits)
    • mm/shmem: remove duplicate include in memory.c · 54943a1a
      Committed by Wan Jiabing
      Fix the following checkincludes.pl warning:
      mm/memory.c: linux/mm_inline.h is included more than once.
      
      The include is in line 44.  Remove the duplicated include here.
      
      Link: https://lkml.kernel.org/r/20220427064717.803019-1-wanjiabing@vivo.com
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      54943a1a
    • mm/hugetlb: handle uffd-wp during fork() · bc70fbf2
      Committed by Peter Xu
      Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(),
      because for uffd-wp it's the dst vma that matters when deciding how we
      should treat uffd-wp protected ptes.
      
      We should recognize pte markers during fork and do the pte copy if needed.
      
      [lkp@intel.com: vma_needs_copy can be static]
        Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae
      Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc70fbf2
    • P
      mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0
      Committed by Peter Xu
      As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
      if unmapping an entire vma or synchronized such that faults can not race
      with the unmap operation.  This requires passing zap_flags all the way to
      the lowest level hugetlb unmap routine: __unmap_hugepage_range.
      
      In general, unmap calls originated in hugetlbfs code will pass the
      ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
      faults.  The exception is hole punch which will first unmap without any
      synchronization.  Later when hole punch actually removes the page from the
      file, it will check to see if there was a subsequent fault and if so take
      the hugetlb fault mutex while unmapping again.  This second unmap will
      pass in ZAP_FLAG_DROP_MARKER.
      
      The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
      unmapping a hugetlb range" is (IMHO): we should never reach a state where a
      page fault could erroneously fault in a page-cache page that was
      wr-protected as writable, even for an extremely short period.  That
      could happen if e.g.  we passed ZAP_FLAG_DROP_MARKER when
      hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
      faults after that call and before remove_inode_hugepages() is executed,
      the page cache can be mapped writable again in that small racy window, which
      can cause unexpected data to be overwritten.
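
      The calling pattern can be sketched as follows (the helpers here are
      simplified stand-ins for the hugetlbfs/hugetlb routines named above; only
      the way zap_flags is threaded down matches the changelog):

      typedef unsigned long zap_flags_t;
      #define ZAP_FLAG_DROP_MARKER    ((zap_flags_t)1 << 0)

      struct hole { unsigned long start, end; };

      /* Stand-in for the lowest-level unmap routine, which now takes zap_flags
       * and only drops uffd-wp markers when ZAP_FLAG_DROP_MARKER is set. */
      static void unmap_hugepage_range(struct hole h, zap_flags_t zap_flags)
      {
              (void)h;
              (void)zap_flags;
      }

      static void punch_hole(struct hole h)
      {
              /* Pass 1: not synchronized against faults yet, keep the markers. */
              unmap_hugepage_range(h, 0);

              /* Pass 2: the page-removal step runs under the hugetlb fault
               * mutex, so faults cannot race and the markers may be dropped. */
              unmap_hugepage_range(h, ZAP_FLAG_DROP_MARKER);
      }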
      
      [peterx@redhat.com: fix sparse warning]
        Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
      [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
      Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      05e90bd0
    • P
      mm/shmem: handle uffd-wp during fork() · c56d1b62
      Committed by Peter Xu
      Normally we skip copying page tables at fork() for VM_SHARED shmem, but we
      can't skip it anymore if uffd-wp is enabled on the dst vma.  This should only
      happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on a uffd-wp
      shmem vma, so that VM_UFFD_WP will be propagated onto the dst vma too.  In
      that case we should copy the pgtables with the uffd-wp bit and pte markers,
      because this information would be lost otherwise.
      
      Since the condition checks will become even more complicated for deciding
      "whether a vma needs to copy the pgtable during fork()", introduce a
      helper vma_needs_copy() for it, so everything will be clearer.
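
      The resulting decision can be pictured roughly like this (a sketch of the
      logic described here, using simplified stand-in types rather than the
      kernel's struct vm_area_struct):

      #include <stdbool.h>

      #define VM_UFFD_WP      (1UL << 0)
      #define VM_PFNMAP       (1UL << 1)
      #define VM_MIXEDMAP     (1UL << 2)

      struct vma { unsigned long vm_flags; bool has_anon_pages; };

      /* Does fork() need to copy this vma's page tables right away? */
      static bool vma_needs_copy(const struct vma *dst_vma, const struct vma *src_vma)
      {
              /* uffd-wp enabled on the destination: the uffd-wp bits and pte
               * markers must be copied, so the page tables must be copied. */
              if (dst_vma->vm_flags & VM_UFFD_WP)
                      return true;

              /* Special mappings always copy their page tables. */
              if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                      return true;

              /* Anonymous memory must be copied as well. */
              if (src_vma->has_anon_pages)
                      return true;

              /* Plain file-backed mappings (e.g. VM_SHARED shmem) can skip the
               * copy: the ptes can be faulted back from the page cache later. */
              return false;
      }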
      
      Link: https://lkml.kernel.org/r/20220405014855.14468-1-peterx@redhat.com
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      c56d1b62
    • P
      mm/shmem: persist uffd-wp bit across zapping for file-backed · 999dad82
      Committed by Peter Xu
      File-backed memory is prone to being unmapped at any time.  It means all
      information in the pte will be dropped, including the uffd-wp flag.
      
      To persist the uffd-wp flag, we'll use the pte markers.  This patch
      teaches the zap code to understand uffd-wp and know when to keep or drop
      the uffd-wp bit.
      
      Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we
      don't want to persist such information, for example, when destroying
      the whole vma, or punching a hole in a shmem file.  In all other cases we
      should never drop the uffd-wp bit, or the wr-protect information will get
      lost.
      
      The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than
      memory.c because it'll be further referenced in hugetlb files later.
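
      Per pte, the keep-or-drop rule described above looks roughly like this
      (a simplified model; the real logic sits in the zap path in mm/memory.c
      and operates on real ptes and zap_details):

      #include <stdbool.h>

      typedef unsigned long zap_flags_t;
      #define ZAP_FLAG_DROP_MARKER    ((zap_flags_t)1 << 0)

      struct zap_details { zap_flags_t zap_flags; };

      enum pte_kind { PTE_NONE, PTE_PRESENT_UFFD_WP, PTE_UFFD_WP_MARKER };

      static void zap_one_pte(enum pte_kind *pte, const struct zap_details *details)
      {
              bool drop = details && (details->zap_flags & ZAP_FLAG_DROP_MARKER);

              switch (*pte) {
              case PTE_PRESENT_UFFD_WP:
                      /* Unmapping a uffd-wp protected page: keep the wr-protect
                       * information as a marker unless explicitly told to drop
                       * it (whole vma going away, hole punch, ...). */
                      *pte = drop ? PTE_NONE : PTE_UFFD_WP_MARKER;
                      break;
              case PTE_UFFD_WP_MARKER:
                      if (drop)
                              *pte = PTE_NONE;
                      break;
              case PTE_NONE:
                      break;
              }
      }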
      
      Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      999dad82
    • P
      mm/shmem: handle uffd-wp special pte in page fault handler · 9c28a205
      Committed by Peter Xu
      File-backed memory is prone to unmap/swap, so its ptes are always
      unstable, because they can be easily faulted back later using the page
      cache.  This could lead to uffd-wp getting lost when unmapping or swapping
      out such memory.  One example is shmem.  PTE markers are needed to store
      that information.

      This patch prepares for that by handling uffd-wp pte markers first, before
      they are applied elsewhere, so that the page fault handler can recognize
      uffd-wp pte markers.

      The handling of uffd-wp pte markers is similar to a missing fault: we handle
      this "missing fault" when we see the pte markers, while making sure the
      marker information is kept while processing the fault.
      
      This is a slow path of uffd-wp handling, because zapping of wr-protected
      shmem ptes should be rare.  So far it should only trigger in two
      conditions:
      
        (1) When trying to punch holes in shmem_fallocate(), there is an
            optimization to zap the pgtables before evicting the page.
      
        (2) When swapping out shmem pages.
      
      Because of this, the page fault handling is simplified too by not sending
      the wr-protect message on the 1st page fault; instead the page will be
      installed read-only, so the uffd-wp message will be generated on the next
      fault, which will trigger the do_wp_page() path of general uffd-wp
      handling.
      
      Disable fault-around for all uffd-wp registered ranges for extra safety
      just like uffd-minor fault, and clean the code up.
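
      Condensed, the fault-side behaviour described above amounts to the
      following (an illustrative model only, not mm/memory.c):

      enum pte_kind {
              PTE_NONE,               /* never populated / fully zapped */
              PTE_UFFD_WP_MARKER,     /* zapped, wr-protect info preserved */
              PTE_PRESENT_RO_WP,      /* mapped read-only with uffd-wp set */
              PTE_PRESENT_RW,         /* mapped read-write */
      };

      static void fault_in_from_page_cache(void) { /* shmem lookup/alloc stand-in */ }

      static void handle_shmem_read_fault(enum pte_kind *pte)
      {
              fault_in_from_page_cache();

              if (*pte == PTE_UFFD_WP_MARKER)
                      /* Keep the wr-protect information: the next write fault
                       * goes through do_wp_page() and raises the uffd-wp event. */
                      *pte = PTE_PRESENT_RO_WP;
              else
                      *pte = PTE_PRESENT_RW;  /* ordinary missing fault */
      }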
      
      Link: https://lkml.kernel.org/r/20220405014844.14239-1-peterx@redhat.com
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      9c28a205
    • P
      mm: check against orig_pte for finish_fault() · f46f2ade
      Committed by Peter Xu
      This patch allows do_fault() to trigger on !pte_none() cases too.  This
      prepares for the pte markers to be handled by do_fault() just like none
      pte.
      
      To achieve this, instead of unconditionally checking against pte_none() in
      finish_fault(), note that we may hit the case where orig_pte was some pte
      marker, and what we want to do then is to replace the pte marker with some
      valid pte entry.  So if orig_pte was set, we want to check the current *pte
      (under the pgtable lock) against orig_pte rather than against a none pte.

      Right now there's no solid way to safely reference orig_pte, because when
      the pmd is not allocated handle_pte_fault() will not initialize orig_pte, so
      it's not safe to reference it.

      Another solution proposed before this patch was to do pte_clear() on
      vmf->orig_pte for the pmd==NULL case; however, it turns out that breaks
      arm32, because arm32 may assume that a pte_t* pointer always resides
      on a real ram32 pgtable, not in any kernel stack variable.
      
      To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID, and it'll be
      set along with orig_pte when there is valid orig_pte, or it'll be cleared
      when orig_pte was not initialized.
      
      It'll be updated every time we call handle_pte_fault(), so e.g.  if a page
      fault retry happened it'll be properly updated along with orig_pte.
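
      The resulting check can be sketched like this (pte_t here is a plain
      integer stand-in and the struct is heavily reduced; the kernel's vm_fault
      and pte accessors are more involved):

      #include <stdbool.h>

      #define FAULT_FLAG_ORIG_PTE_VALID       (1UL << 0)

      typedef unsigned long pte_t;            /* simplified */

      struct vm_fault {
              unsigned long flags;
              pte_t orig_pte;         /* only meaningful if the flag is set */
              pte_t *pte;             /* current pte, under the pgtable lock */
      };

      static bool pte_none(pte_t pte)         { return pte == 0; }
      static bool pte_same(pte_t a, pte_t b)  { return a == b; }

      /* Has the pte changed under us since the fault started? */
      static bool vmf_pte_changed(const struct vm_fault *vmf)
      {
              if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)
                      return !pte_same(*vmf->pte, vmf->orig_pte);

              return !pte_none(*vmf->pte);
      }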
      
      [1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/
      
      [akpm@linux-foundation.org: coding-style cleanups]
      [peterx@redhat.com: fix crash reported by Marek]
        Link: https://lkml.kernel.org/r/Ylb9rXJyPm8/ao8f@xz-m1.local
      Link: https://lkml.kernel.org/r/20220405014836.14077-1-peterx@redhat.com
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NAlistair Popple <apopple@nvidia.com>
      Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      f46f2ade
    • P
      mm: teach core mm about pte markers · 5c041f5d
      Committed by Peter Xu
      This patch still does not use pte markers in any way; however, it teaches
      the core mm about the pte marker idea.
      
      For example, handle_pte_marker() is introduced that will parse and handle
      all the pte marker faults.
      
      Many of the changes are more about commenting things up - so that we know
      there's the possibility of a pte marker showing up, and why we don't need
      special code for those cases.
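
      The shape of the dispatch is roughly the following (a sketch with made-up
      return values; at this point in the series no marker type is installed
      yet, so every marker is treated as unexpected):

      typedef unsigned long pte_marker;

      #define PTE_MARKER_MASK         0UL     /* no marker bits defined yet */
      #define VM_FAULT_SIGBUS         1

      static int handle_pte_marker(pte_marker marker)
      {
              /* An empty or unknown marker should never have been installed;
               * kill the faulting access rather than guessing. */
              if (!marker || (marker & ~PTE_MARKER_MASK))
                      return VM_FAULT_SIGBUS;

              /* Known marker types (e.g. uffd-wp) get handled here once later
               * patches start installing them. */
              return VM_FAULT_SIGBUS;
      }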
      
      [peterx@redhat.com: userfaultfd.c needs swapops.h]
        Link: https://lkml.kernel.org/r/YmRlVj3cdizYJsr0@xz-m1.local
      Link: https://lkml.kernel.org/r/20220405014833.14015-1-peterx@redhat.com
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      5c041f5d
  13. 10 May 2022, 13 commits
    • N
      mm: submit multipage reads for SWP_FS_OPS swap-space · 5169b844
      Committed by NeilBrown
      swap_readpage() is given one page at a time, but may be called repeatedly
      in succession.
      
      For block-device swap-space, the blk_plug functionality allows
      multiple pages to be combined at lower layers.  That cannot be
      used for SWP_FS_OPS as blk_plug may not exist - it is only active when
      CONFIG_BLOCK=y.  Consequently all swap reads over NFS are single-page
      reads.

      With this patch we pass in a pointer-to-pointer where swap_readpage() can
      store state between calls - much like the effect of blk_plug.  After
      calling swap_readpage() some number of times, the state will be passed to
      swap_read_unplug(), which can submit the combined request.
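
      The caller-side pattern, as described, looks roughly like this (types and
      bodies are stand-ins; the real swap_readpage()/swap_read_unplug() live in
      mm/page_io.c and mm/swap.h):

      #include <stdbool.h>
      #include <stddef.h>

      struct page { int dummy; };
      struct swap_iocb { int nr_batched; };   /* opaque batching state (stand-in) */

      static struct swap_iocb plug_storage;

      static void swap_readpage(struct page *page, bool synchronous,
                                struct swap_iocb **plug)
      {
              (void)page; (void)synchronous;
              if (plug) {
                      if (!*plug)
                              *plug = &plug_storage;  /* start a new batch */
                      (*plug)->nr_batched++;          /* queue this page's read */
              }
      }

      static void swap_read_unplug(struct swap_iocb *plug)
      {
              plug->nr_batched = 0;   /* submit the combined request */
      }

      /* Caller: batch several reads, then submit them in one go. */
      static void read_cluster(struct page **pages, int nr)
      {
              struct swap_iocb *splug = NULL;

              for (int i = 0; i < nr; i++)
                      swap_readpage(pages[i], false, &splug);

              if (splug)
                      swap_read_unplug(splug);
      }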
      
      Link: https://lkml.kernel.org/r/164859778127.29473.14059420492644907783.stgit@noble.brown
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      5169b844
    • N
      mm: create new mm/swap.h header file · 014bb1de
      Committed by NeilBrown
      Patch series "MM changes to improve swap-over-NFS support".
      
      Assorted improvements for swap-via-filesystem.
      
      This is a resend of these patches, rebased on the current HEAD.  The only
      substantial change is that swap_dirty_folio has replaced
      swap_set_page_dirty.

      Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
      previously worked for NFS, but that broke a few releases back.  This
      series changes it to use a new ->swap_rw rather than ->readpage and
      ->direct_IO.  It also makes other improvements.
      
      There is a companion series already in linux-next which fixes various
      issues with NFS.  Once both series land, a final patch is needed which
      changes NFS over to use ->swap_rw.
      
      
      This patch (of 10):
      
      Many functions declared in include/linux/swap.h are only used within mm/.
      
      Create a new "mm/swap.h" and move some of these declarations there.
      Remove the redundant 'extern' from the function declarations.
      
      [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
      Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
      Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      014bb1de
    • D
      mm/swap: remember PG_anon_exclusive via a swp pte bit · 1493a191
      Committed by David Hildenbrand
      Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.
      
      This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE
      | FOLL_GET) was taken on an anonymous page and the COW logic fails to detect
      exclusivity of the page and then replaces the anonymous page by a copy in
      the page table: the GUP reference loses synchronicity with the pages mapped
      into the page tables.  This series focuses on x86, arm64, s390x and
      ppc64/book3s -- other architectures are fairly easy to support by
      implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
      
      This primarily fixes the O_DIRECT memory corruptions that can happen on
      concurrent swapout, whereby we lose DMA reads to a page (modifying the
      user page by writing to it).
      
      O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
      from/to a user page.  In the long run, we want to convert it to properly
      use FOLL_PIN, and John is working on it, but that might take a while and
      might not be easy to backport.  In the meantime, let's restore what used
      to work before we started modifying our COW logic: make R/W FOLL_GET
      references reliable as long as there is no fork() after GUP involved.
      
      This is just the natural follow-up of part 2, which will also further
      reduce "wrong COW" on the swapin path, for example, when we cannot remove
      a page from the swapcache due to concurrent writeback, or if we have two
      threads faulting on the same swapped-out page.  Fixing O_DIRECT is just a
      nice side-product.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 2):
      "
        2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)
      
        It was discovered that we can create a memory corruption by reading a
        file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
        concurrently writing to an unrelated part (e.g., last byte) of the same
        page, and concurrently write-protecting the page via clear_refs
        SOFTDIRTY tracking [6].
      
        For the reproducer, the issue is that O_DIRECT grabs a reference of the
        target page (via FOLL_GET) and clear_refs write-protects the relevant
        page table entry. On successive write access to the page from the
        process itself, we wrongly COW the page when resolving the write fault,
        resulting in a loss of synchronicity and consequently a memory corruption.
      
        While some people might think that using clear_refs in this combination
        is a corner case, it turns out to be a more generic problem, unfortunately.
      
        For example, it was just recently discovered that we can similarly
        create a memory corruption without clear_refs, simply by concurrently
        swapping out the buffer pages [7]. Note that we nowadays even use the
        swap infrastructure in Linux without an actual swap disk/partition: the
        prime example is zram which is enabled as default under Fedora [10].
      
        The root issue is that a write-fault on a page that has additional
        references results in a COW and thereby a loss of synchronicity
        and consequently a memory corruption if two parties believe they are
        referencing the same page.
      "
      
      We don't particularly care about R/O FOLL_GET references: they were never
      reliable and O_DIRECT doesn't expect to observe modifications from a page
      after DMA was started.
      
      Note that:
      * this only fixes the issue on x86, arm64, s390x and ppc64/book3s
        ("enterprise architectures"). Other architectures have to implement
        __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
      * this does *not* consider any kind of fork() after taking the reference:
        fork() after GUP never worked reliably with FOLL_GET.
      * Not losing PG_anon_exclusive during swapout was the last remaining
        piece. KSM already makes sure that there are no other references on
        a page before considering it for sharing. Page migration maintains
        PG_anon_exclusive and simply fails when there are additional references
        (freezing the refcount fails). Only swapout code dropped the
        PG_anon_exclusive flag because it requires more work to remember +
        restore it.
      
      With this series in place, most COW issues of [3] are fixed on said
      architectures. Other architectures can implement
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
      
      [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      
      This patch (of 8):
      
      Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
      it.  We do this, to keep fork() logic on swap entries easy and efficient:
      for example, if we wouldn't clear it when unmapping, we'd have to lookup
      the page in the swapcache for each and every swap entry during fork() and
      clear PG_anon_exclusive if set.
      
      Instead, we want to store that information directly in the swap pte,
      protected by the page table lock, similarly to how we handle
      SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
      swap entries, we don't want to mess with the swap type (e.g., still one
      bit) because it overcomplicates swap code.
      
      In try_to_unmap(), we already reject to unmap in case the page might be
      pinned, because we must not lose PG_anon_exclusive on pinned pages ever. 
      Checking reliably if there are other unexpected references *before*
      completely unmapping a page is unfortunately not really possible: THPs
      heavily overcomplicate the situation.  Once fully unmapped it's easier --
      we, for example, make sure that there are no unexpected references *after*
      unmapping a page before starting writeback on that page.
      
      So, we currently might end up unmapping a page and clearing
      PG_anon_exclusive if that page has additional references, for example, due
      to a FOLL_GET.
      
      do_swap_page() has to re-determine if a page is exclusive, which will
      easily fail if there are other references on a page, most prominently GUP
      references via FOLL_GET.  This can currently result in memory corruptions
      when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
      is never involved: try_to_unmap() will succeed, and when refaulting the
      page, it cannot be marked exclusive and will get replaced by a copy in the
      page tables on the next write access, resulting in writes via the GUP
      reference to the page being lost.
      
      In an ideal world, everybody that uses GUP and wants to modify page
      content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
      conversion will take a while.  It's easier to fix what used to work in the
      past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive.  In addition,
      by remembering PG_anon_exclusive we can further reduce unnecessary COW in
      some cases, so it's the natural thing to do.
      
      So let's transfer the PG_anon_exclusive information to the swap pte and
      store it via an architecture-dependent pte bit; use that information when
      restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
      we simply have to clear the pte bit and are done.
      
      Of course, there is one corner case to handle: swap backends that don't
      support concurrent page modifications while the page is under writeback. 
      Special case these, and drop the exclusive marker.  Add a comment why that
      is just fine (also, reuse_swap_page() would have done the same in the
      past).
      
      In the future, we'll hopefully have all architectures support
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
      and the define completely.  Then, we can also convert
      SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
      support: either simply use a yet unused pte bit that can be used for swap
      entries, steal one from the arch type bits if they exceed 5, or steal one
      from the offset bits.
      
      Note: R/O FOLL_GET references were never really reliable, especially when
      taking one on a shared page and then writing to the page (e.g., GUP after
      fork()).  FOLL_GET, including R/W references, were never really reliable
      once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
      steps back in case it stumbles over unexpected references and is,
      therefore, fine.
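
      At the pte level the scheme can be modelled like this (a userspace sketch;
      the real pte_swp_*() helpers are per-architecture and guarded by
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE):

      #include <stdbool.h>

      typedef unsigned long pte_t;                    /* simplified swap pte */
      #define _PAGE_SWP_EXCLUSIVE     (1UL << 1)      /* some free software bit */

      static pte_t pte_swp_mkexclusive(pte_t pte)     { return pte | _PAGE_SWP_EXCLUSIVE; }
      static pte_t pte_swp_clear_exclusive(pte_t pte) { return pte & ~_PAGE_SWP_EXCLUSIVE; }
      static bool  pte_swp_exclusive(pte_t pte)       { return pte & _PAGE_SWP_EXCLUSIVE; }

      /* Unmap for swapout: remember PG_anon_exclusive in the swap pte, except
       * for backends that can't take page modifications under writeback. */
      static pte_t make_swap_pte(pte_t swp, bool page_was_exclusive,
                                 bool backend_allows_writes_under_writeback)
      {
              if (page_was_exclusive && backend_allows_writes_under_writeback)
                      swp = pte_swp_mkexclusive(swp);
              return swp;
      }

      /* fork(): the swapped-out anonymous page becomes possibly shared. */
      static pte_t copy_swap_pte_for_fork(pte_t swp)
      {
              return pte_swp_clear_exclusive(swp);
      }

      /* do_swap_page()/unuse_pte(): restore exclusivity without re-guessing. */
      static bool swapin_page_exclusive(pte_t swp)
      {
              return pte_swp_exclusive(swp);
      }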
      
      [david@redhat.com: fix SWP_STABLE_WRITES test]
        Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      1493a191
    • D
      mm: support GUP-triggered unsharing of anonymous pages · c89357e2
      Committed by David Hildenbrand
      Whenever GUP currently ends up taking a R/O pin on an anonymous page that
      might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
      on the page table entry will end up replacing the mapped anonymous page
      due to COW, resulting in the GUP pin no longer being consistent with the
      page actually mapped into the page table.
      
      The possible ways to deal with this situation are:
       (1) Ignore and pin -- what we do right now.
       (2) Fail to pin -- which would be rather surprising to callers and
           could break user space.
       (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
           pins.
      
      We want to implement 3) because it provides the clearest semantics and
      allows for checking in unpin_user_pages() and friends for possible BUGs:
      when trying to unpin a page that's no longer exclusive, clearly something
      went very wrong and might result in memory corruptions that might be hard
      to debug.  So we better have a nice way to spot such issues.
      
      To implement 3), we need a way for GUP to trigger unsharing:
      FAULT_FLAG_UNSHARE.  FAULT_FLAG_UNSHARE is only applicable to R/O mapped
      anonymous pages and resembles COW logic during a write fault.  However, in
      contrast to a write fault, GUP-triggered unsharing will, for example,
      still maintain the write protection.
      
      Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write
      fault handlers for all applicable anonymous page types: ordinary pages,
      THP and hugetlb.
      
      * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
        marked exclusive in the meantime by someone else, there is nothing to do.
      * If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
        marked exclusive, it will try detecting if the process is the exclusive
        owner. If exclusive, it can be set exclusive similar to reuse logic
        during write faults via page_move_anon_rmap() and there is nothing
        else to do; otherwise, we either have to copy and map a fresh,
        anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
        THP.
      
      This commit is heavily based on patches by Andrea.
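
      A compressed model of the unshare logic described above (illustrative
      only; the real handling is spread over the ordinary, THP and hugetlb
      write-fault paths):

      #include <stdbool.h>

      #define FAULT_FLAG_UNSHARE      (1UL << 0)

      struct anon_page { int refcount; bool anon_exclusive; };

      /* "Am I the exclusive owner?" -- simplified check, page lock held. */
      static bool can_mark_exclusive(const struct anon_page *page)
      {
              return page->refcount == 1;
      }

      /* Returns true if the caller must copy (or split a THP) instead. */
      static bool wp_or_unshare(struct anon_page *page, unsigned long flags,
                                bool *make_writable)
      {
              bool unshare = flags & FAULT_FLAG_UNSHARE;

              if (page->anon_exclusive || can_mark_exclusive(page)) {
                      page->anon_exclusive = true;    /* page_move_anon_rmap()-like */
                      /* Unsharing keeps the write protection; a real write
                       * fault makes the pte writable. */
                      *make_writable = !unshare;
                      return false;
              }

              /* Possibly shared: map a fresh exclusive copy (R/O for unshare,
               * writable for a write fault), or split the THP. */
              *make_writable = !unshare;
              return true;
      }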
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-16-david@redhat.com
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Co-developed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      c89357e2
    • D
      mm: remember exclusively mapped anonymous pages with PG_anon_exclusive · 6c287605
      Committed by David Hildenbrand
      Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
      exclusive, and use that information to make GUP pins reliable and stay
      consistent with the page mapped into the page table even if the page table
      entry gets write-protected.
      
      With that information at hand, we can extend our COW logic to always reuse
      anonymous pages that are exclusive.  For anonymous pages that might be
      shared, the existing logic applies.
      
      As already documented, PG_anon_exclusive is usually only expressive in
      combination with a page table entry.  Especially PTE vs.  PMD-mapped
      anonymous pages require more thought, some examples: due to mremap() we
      can easily have a single compound page PTE-mapped into multiple page
      tables exclusively in a single process -- multiple page table locks apply.
      Further, due to MADV_WIPEONFORK we might not necessarily write-protect
      all PTEs, and only some subpages might be pinned.  Long story short: once
      PTE-mapped, we have to track information about exclusivity per sub-page,
      but until then, we can just track it for the compound page in the head
      page, without having to update a whole bunch of subpages all of the time
      for a simple PMD mapping of a THP.
      
      For simplicity, this commit mostly talks about "anonymous pages", while
      it's for THP actually "the part of an anonymous folio referenced via a
      page table entry".
      
      To not spill PG_anon_exclusive code all over the mm code-base, we let the
      anon rmap code handle all the PG_anon_exclusive logic it can easily handle.
      
      If a writable, present page table entry points at an anonymous (sub)page,
      that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliable
      pin (FOLL_PIN) on an anonymous page referenced via a present page table
      entry, it must only pin if PG_anon_exclusive is set for the mapped
      (sub)page.
      
      This commit doesn't adjust GUP, so this is only implicitly handled for
      FOLL_WRITE, follow-up commits will teach GUP to also respect it for
      FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
      reliable.
      
      Whenever an anonymous page is to be shared (fork(), KSM), or when
      temporarily unmapping an anonymous page (swap, migration), the relevant
      PG_anon_exclusive bit has to be cleared to mark the anonymous page
      possibly shared.  Clearing will fail if there are GUP pins on the page:
      
      * For fork(), this means having to copy the page and not being able to
        share it.  fork() protects against concurrent GUP using the PT lock and
        the src_mm->write_protect_seq.
      
      * For KSM, this means sharing will fail.  For swap, this means unmapping
        will fail.  For migration, this means migration will fail early.  All
        three cases protect against concurrent GUP using the PT lock and a
        proper clear/invalidate+flush of the relevant page table entry.
      
      This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
      pinned page gets mapped R/O and the successive write fault ends up
      replacing the page instead of reusing it.  It improves the situation for
      O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
      fork() is *not* involved, however swapout and fork() are still
      problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
      users will fix the issue for them.
      
      I. Details about basic handling
      
      I.1. Fresh anonymous pages
      
      page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
      given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
      the mechanism fresh anonymous pages come into life (besides migration code
      where we copy the page->mapping), all fresh anonymous pages will start out
      as exclusive.
      
      I.2. COW reuse handling of anonymous pages
      
      When a COW handler stumbles over a (sub)page that's marked exclusive, it
      simply reuses it.  Otherwise, the handler tries harder under page lock to
      detect if the (sub)page is exclusive and can be reused.  If exclusive,
      page_move_anon_rmap() will mark the given (sub)page exclusive.
      
      Note that hugetlb code does not yet check for PageAnonExclusive(), as it
      still uses the old COW logic that is prone to the COW security issue
      because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
      pages are a scarce resource.
      
      I.3. Migration handling
      
      try_to_migrate() has to try marking an exclusive anonymous page shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  migrate_vma_collect_pmd() and
      __split_huge_pmd_locked() are handled similarly.
      
      Writable migration entries implicitly point at shared anonymous pages. 
      For readable migration entries that information is stored via a new
      "readable-exclusive" migration entry, specific to anonymous pages.
      
      When restoring a migration entry in remove_migration_pte(), information
      about exclusivity is detected via the migration entry type, and
      RMAP_EXCLUSIVE is set accordingly for
      page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
      
      I.4. Swapout handling
      
      try_to_unmap() has to try marking the mapped page possibly shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  For now, information about exclusivity is lost.  In
      the future, we might want to remember that information in the swap entry
      in some cases, however, it requires more thought, care, and a way to store
      that information in swap entries.
      
      I.5. Swapin handling
      
      do_swap_page() will never stumble over exclusive anonymous pages in the
      swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
      to detect manually if an anonymous page is exclusive and has to set
      RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
      
      I.6. THP handling
      
      __split_huge_pmd_locked() has to move the information about exclusivity
      from the PMD to the PTEs.
      
      a) In case we have a readable-exclusive PMD migration entry, simply
         insert readable-exclusive PTE migration entries.
      
      b) In case we have a present PMD entry and we don't want to freeze
         ("convert to migration entries"), simply forward PG_anon_exclusive to
         all sub-pages, no need to temporarily clear the bit.
      
      c) In case we have a present PMD entry and want to freeze, handle it
         similar to try_to_migrate(): try marking the page shared first.  In
         case we fail, we ignore the "freeze" instruction and simply split
         ordinarily.  try_to_migrate() will properly fail because the THP is
         still mapped via PTEs.
      
      When splitting a compound anonymous folio (THP), the information about
      exclusivity is implicitly handled via the migration entries: no need to
      replicate PG_anon_exclusive manually.
      
      I.7. fork() handling

      fork() handling is relatively easy, because
      PG_anon_exclusive is only expressive for some page table entry types.
      
      a) Present anonymous pages
      
      page_try_dup_anon_rmap() will mark the given subpage shared -- which will
      fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
      PMD to handle it on the PTE level).
      
      Note that device exclusive entries are just a pointer at a PageAnon()
      page.  fork() will first convert a device exclusive entry to a present
      page table and handle it just like present anonymous pages.
      
      b) Device private entry
      
      Device private entries point at PageAnon() pages that cannot be mapped
      directly and, therefore, cannot get pinned.
      
      page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
      fail because they cannot get pinned.
      
      c) HW poison entries
      
      PG_anon_exclusive will remain untouched and is stale -- the page table
      entry is just a placeholder after all.
      
      d) Migration entries
      
      Writable and readable-exclusive entries are converted to readable entries:
      possibly shared.
      
      I.8. mprotect() handling
      
      mprotect() only has to properly handle the new readable-exclusive
      migration entry:
      
      When write-protecting a migration entry that points at an anonymous page,
      remember the information about exclusivity via the "readable-exclusive"
      migration entry type.
      
      II. Migration and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a migration entry, we have to mark the page possibly
      shared and synchronize against GUP-fast by a proper clear/invalidate+flush
      to make the following scenario impossible:
      
      1. try_to_migrate() places a migration entry after checking for GUP pins
         and marks the page possibly shared.
      
      2. GUP-fast pins the page due to lack of synchronization
      
      3. fork() converts the "writable/readable-exclusive" migration entry into a
         readable migration entry
      
      4. Migration fails due to the GUP pin (failing to freeze the refcount)
      
      5. Migration entries are restored. PG_anon_exclusive is lost
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      Note that we move information about exclusivity from the page to the
      migration entry as it otherwise highly overcomplicates fork() and
      PTE-mapping a THP.
      
      III. Swapout and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a swap entry, we have to mark the page possibly shared
      and synchronize against GUP-fast by a proper clear/invalidate+flush to
      make the following scenario impossible:
      
      1. try_to_unmap() places a swap entry after checking for GUP pins and
         clears exclusivity information on the page.
      
      2. GUP-fast pins the page due to lack of synchronization.
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      If we'd ever store information about exclusivity in the swap entry,
      similar to migration handling, the same considerations as in II would
      apply.  This is future work.
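
      The core rule from II and III can be condensed as follows (a model, not
      the rmap code; the real helper is page_try_share_anon_rmap() and the pin
      check is a page_maybe_dma_pinned()-style heuristic):

      #include <stdbool.h>

      struct anon_page {
              int  maybe_pinned;      /* stand-in for page_maybe_dma_pinned() */
              bool anon_exclusive;
      };

      /* Mark the page possibly shared; fails if it may be pinned.  The caller
       * already did a clear/invalidate+flush of the page table entry, so
       * GUP-fast cannot grab a new pin concurrently. */
      static int page_try_share_anon_rmap(struct anon_page *page)
      {
              if (page->maybe_pinned)
                      return -1;
              page->anon_exclusive = false;
              return 0;
      }

      /* try_to_unmap()/try_to_migrate()-like step for one mapped anon page. */
      static bool unmap_one(struct anon_page *page)
      {
              if (page->anon_exclusive && page_try_share_anon_rmap(page))
                      return false;   /* keep it mapped: swapout/migration fails */

              /* ... install the swap or migration entry ... */
              return true;
      }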
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      6c287605
    • D
      mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages · 78fbe906
      Committed by David Hildenbrand
      The basic question we would like to have a reliable and efficient answer
      to is: is this anonymous page exclusive to a single process or might it be
      shared?  We need that information for ordinary/single pages, hugetlb
      pages, and possibly each subpage of a THP.
      
      Introduce a way to mark an anonymous page as exclusive, with the ultimate
      goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
      lose consistency with the pages mapped into the page table, resulting in
      reported memory corruptions.
      
      Most pageflags already have semantics for anonymous pages, however,
      PG_mappedtodisk should never apply to pages in the swapcache, so let's
      reuse that flag.
      
      As PG_has_hwpoisoned also uses that flag on the second tail page of a
      compound page, convert it to PG_error instead, which is marked as
      PF_NO_TAIL, so never used for tail pages.
      
      Use custom page flag modification functions such that we can do additional
      sanity checks.  The semantics we'll put into some kernel doc in the future
      are:
      
      "
        PG_anon_exclusive is *usually* only expressive in combination with a
        page table entry. Depending on the page table entry type it might
        store the following information:
      
             Is what's mapped via this page table entry exclusive to the
             single process and can be mapped writable without further
             checks? If not, it might be shared and we might have to COW.
      
        For now, we only expect PTE-mapped THPs to make use of
        PG_anon_exclusive in subpages. For other anonymous compound
        folios (i.e., hugetlb), only the head page is logically mapped and
        holds this information.
      
        For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
        set on the head page. When replacing the PMD by a page table full
        of PTEs, PG_anon_exclusive, if set on the head page, will be set on
        all tail pages accordingly. Note that converting from a PTE-mapping
        to a PMD mapping using the same compound page is currently not
        possible and consequently doesn't require care.
      
        If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
        it should only pin if the relevant PG_anon_exclusive is set. In that
        case, the pin will be fully reliable and stay consistent with the pages
        mapped into the page table, as the bit cannot get cleared (e.g., by
        fork(), KSM) while the page is pinned. For anonymous pages that
        are mapped R/W, PG_anon_exclusive can be assumed to always be set
        because such pages cannot possibly be shared.
      
        The page table lock protecting the page table entry is the primary
        synchronization mechanism for PG_anon_exclusive; GUP-fast that does
        not take the PT lock needs special care when trying to clear the
        flag.
      
        Page table entry types and PG_anon_exclusive:
        * Present: PG_anon_exclusive applies.
        * Swap: the information is lost. PG_anon_exclusive was cleared.
        * Migration: the entry holds this information instead.
                     PG_anon_exclusive was cleared.
        * Device private: PG_anon_exclusive applies.
        * Device exclusive: PG_anon_exclusive applies.
        * HW Poison: PG_anon_exclusive is stale and not changed.
      
        If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
        not allowed and the flag will stick around until the page is freed
        and folio->mapping is cleared.
      "
      
      We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
      zapping) of page table entries; the page freeing code will handle that when
      it also invalidates page->mapping so the page no longer indicates PageAnon().
      Letting information about exclusivity stick around will be an important
      property when adding sanity checks to unpinning code.
      
      Note that we properly clear the flag in free_pages_prepare() via
      PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
      so there is no need to manually clear the flag.
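
      A reduced model of these semantics (not the kernel's page-flags machinery;
      in particular the pin handling is folded inline here, while in the kernel
      it lives in the rmap helpers):

      #include <assert.h>
      #include <stdbool.h>

      struct page {
              bool anon;              /* PageAnon() */
              bool anon_exclusive;    /* models PG_anon_exclusive (PG_mappedtodisk bit) */
              int  maybe_pinned;
      };

      static void SetPageAnonExclusive(struct page *page)
      {
              assert(page->anon);     /* only meaningful for PageAnon() pages */
              page->anon_exclusive = true;
      }

      /* Clearing is only allowed while the page cannot be pinned (FOLL_PIN). */
      static bool TryClearPageAnonExclusive(struct page *page)
      {
              if (page->maybe_pinned)
                      return false;
              page->anon_exclusive = false;
              return true;
      }

      /* GUP: a FOLL_PIN on an anon page is only reliable if the flag is set. */
      static bool may_pin_anon_page(const struct page *page)
      {
              return !page->anon || page->anon_exclusive;
      }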
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      78fbe906
    • D
      mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively · 6c54dc6c
      Committed by David Hildenbrand
      We want to mark anonymous pages exclusive, and when using
      page_move_anon_rmap() we know that we are the exclusive user, as properly
      documented.  This is a preparation for marking anonymous pages exclusive
      in page_move_anon_rmap().
      
      In both instances, we're holding page lock and are sure that we're the
      exclusive owner (page_count() == 1).  hugetlb already properly uses
      page_move_anon_rmap() in the write fault handler.
      
      Note that in case of a PTE-mapped THP, we'll only end up calling this
      function if the whole THP is only referenced by the single PTE mapping a
      single subpage (page_count() == 1); consequently, it's fine to modify the
      compound page mapping inside page_move_anon_rmap().
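
      Boiled down, the reuse path being prepared here looks like this (a
      simplified model of the COW-reuse check, not the real do_wp_page()/hugetlb
      code):

      #include <stdbool.h>

      struct vma { int id; };
      struct anon_page { int refcount; int anon_vma_id; bool anon_exclusive; };

      /* Rewrites page->mapping to this vma's anon_vma; after the follow-up
       * patch it will also mark the page exclusive. */
      static void page_move_anon_rmap(struct anon_page *page, const struct vma *vma)
      {
              page->anon_vma_id = vma->id;
              page->anon_exclusive = true;
      }

      /* Called with the page lock held. */
      static bool try_reuse_in_write_fault(struct anon_page *page, const struct vma *vma)
      {
              if (page->refcount != 1)        /* somebody else may still see it */
                      return false;

              page_move_anon_rmap(page, vma);
              return true;                    /* reuse: keep page, map it writable */
      }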
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-10-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      6c54dc6c
    • D
      mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() · 40f2bbf7
      Committed by David Hildenbrand
      New anonymous pages are always mapped natively: only THP/khugepaged code
      maps a new compound anonymous page and passes "true".  Otherwise, we're
      just dealing with simple, non-compound pages.
      
      Let's give the interface clearer semantics and document these.  Remove the
      PageTransCompound() sanity check from page_add_new_anon_rmap().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      40f2bbf7
    • D
      mm/rmap: remove do_page_add_anon_rmap() · f1e2db12
      Committed by David Hildenbrand
      ... and instead convert page_add_anon_rmap() to accept flags.
      
      Passing flags instead of bools is usually nicer either way, and we want to
      more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
      that an anonymous page is exclusive: for example, when restoring an
      anonymous page from a writable migration entry.
      
      This is a preparation for marking an anonymous page inside
      page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      f1e2db12
    • D
      mm/rmap: convert RMAP flags to a proper distinct rmap_t type · 14f9135d
      Committed by David Hildenbrand
      We want to pass the flags to more than one anon rmap function, getting rid
      of special "do_page_add_anon_rmap()".  So let's pass around a distinct
      __bitwise type and refine documentation.
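
      The shape of the new type (a sketch; in the kernel __bitwise is a
      sparse-only annotation and the real definitions live in
      include/linux/rmap.h):

      /* Stand-ins so the sketch builds outside the kernel tree. */
      #define __bitwise               /* checked by sparse, empty for the compiler */
      #define __force                 /* sparse cast marker */
      #define BIT(n)                  (1UL << (n))

      /* Distinct flags type accepted by the anon rmap functions. */
      typedef int __bitwise rmap_t;

      /* No special request: the page is (or becomes) possibly shared. */
      #define RMAP_NONE               ((__force rmap_t)0)

      /* The anonymous (sub)page is exclusive to a single process. */
      #define RMAP_EXCLUSIVE          ((__force rmap_t)BIT(0))

      /* The compound page is mapped via a PMD ("compound") mapping. */
      #define RMAP_COMPOUND           ((__force rmap_t)BIT(1))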
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-6-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      14f9135d
    • D
      mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() · fb3d824d
      Committed by David Hildenbrand
      ...  and move the special check for pinned pages into
      page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
      via a new pageflag, clearing it only after making sure that there are no
      GUP pins on the anonymous page.
      
      We really only care about pins on anonymous pages, because they are prone
      to getting replaced in the COW handler once mapped R/O.  For !anon pages
      in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about
      that; at least I could not come up with an example where it would matter.
      
      Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
      know we're dealing with anonymous pages.  Also, drop the handling of
      pinned pages from copy_huge_pud() and add a comment in case anonymous
      pages are ever supported at the PUD level.
      
      This is a preparation for tracking exclusivity of anonymous pages in the
      rmap code, and disallowing marking a page shared (-> failing to duplicate)
      if there are GUP pins on a page.
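      
      As a rough sketch of the split described above (simplified; treat the
      exact helpers and assertions as assumptions rather than the literal diff):
      
          static inline void __page_dup_rmap(struct page *page, bool compound)
          {
                  /* Duplicate the mapping by bumping the (compound) mapcount. */
                  atomic_inc(compound ? compound_mapcount_ptr(page) :
                                        &page->_mapcount);
          }
      
          /* File mappings: duplicating an rmap never needs to fail. */
          static inline void page_dup_file_rmap(struct page *page, bool compound)
          {
                  __page_dup_rmap(page, compound);
          }
      
          /* Anon mappings: refuse to share a page that may be pinned for DMA. */
          static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
                                                   struct vm_area_struct *vma)
          {
                  VM_BUG_ON_PAGE(!PageAnon(page), page);
      
                  /*
                   * A possibly pinned page must not be shared R/O with the
                   * child; the caller has to copy it immediately instead.
                   */
                  if (unlikely(page_needs_cow_for_dma(vma, page)))
                          return -EBUSY;
      
                  __page_dup_rmap(page, compound);        /* safe to share */
                  return 0;
          }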
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb3d824d
    • D
      mm/memory: slightly simplify copy_present_pte() · b51ad4f8
      Committed by David Hildenbrand
      Let's move the pinning check into the caller, to simplify the return-code
      logic and to prepare for further changes: relocating
      page_needs_cow_for_dma() into the rmap handling code.
      
      While at it, remove the unused pte parameter and simplify the comments a
      bit.
      
      No functional change intended.
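      
      A rough sketch of the resulting flow inside copy_present_pte() (a
      condensed excerpt under assumed surrounding variables, not the literal
      diff):
      
          page = vm_normal_page(src_vma, addr, pte);
          if (page && unlikely(page_needs_cow_for_dma(src_vma, page))) {
                  /*
                   * The page may be pinned by the parent: copy it for the
                   * child right away instead of sharing it.
                   */
                  return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
                                           addr, rss, prealloc, page);
          } else if (page) {
                  /* Plain sharing: take a reference and duplicate the rmap. */
                  get_page(page);
                  page_dup_rmap(page, false);
                  rss[mm_counter(page)]++;
          }
          /* ... then set up the child's PTE as before ... */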
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b51ad4f8
    • M
      mm,fs: Remove aops->readpage · 7e0a1265
      Committed by Matthew Wilcox (Oracle)
      With all implementations of aops->readpage converted to aops->read_folio,
      we can stop checking whether it's set and remove the member from aops.
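      
      For illustration, a sketch of what a filesystem's aops looks like after
      the conversion; the "myfs_*" names and the use of mpage_read_folio() are
      assumptions for the example, not part of this patch:
      
          static int myfs_read_folio(struct file *file, struct folio *folio)
          {
                  /* myfs_get_block is a hypothetical get_block_t helper. */
                  return mpage_read_folio(folio, myfs_get_block);
          }
      
          const struct address_space_operations myfs_aops = {
                  .read_folio     = myfs_read_folio, /* replaces ->readpage */
                  /* other operations unchanged */
          };
      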
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      7e0a1265