1. 27 9月, 2022 39 次提交
    • Y
      mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG · eed9a328
      Yu Zhao 提交于
      Some architectures support the accessed bit in non-leaf PMD entries, e.g.,
      x86 sets the accessed bit in a non-leaf PMD entry when using it as part of
      linear address translation [1].  Page table walkers that clear the
      accessed bit may use this capability to reduce their search space.
      
      Note that:
      1. Although an inline function is preferable, this capability is added
         as a configuration option for consistency with the existing macros.
      2. Due to the little interest in other varieties, this capability was
         only tested on Intel and AMD CPUs.
      
      Thanks to the following developers for their efforts [2][3].
        Randy Dunlap <rdunlap@infradead.org>
        Stephen Rothwell <sfr@canb.auug.org.au>
      
      [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
           Volume 3 (June 2021), section 4.8
      [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
      [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.comSigned-off-by: NYu Zhao <yuzhao@google.com>
      Reviewed-by: NBarry Song <baohua@kernel.org>
      Acked-by: NBrian Geffon <bgeffon@google.com>
      Acked-by: NJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: NSteven Barrett <steven@liquorix.net>
      Acked-by: NSuleiman Souhlal <suleiman@google.com>
      Tested-by: NDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: NDonald Carr <d@chaos-reins.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: NShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: NSofia Trinh <sofia.trinh@edi.works>
      Tested-by: NVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      eed9a328
    • Y
      mm: x86, arm64: add arch_has_hw_pte_young() · e1fd09e3
      Yu Zhao 提交于
      Patch series "Multi-Gen LRU Framework", v14.
      
      What's new
      ==========
      1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
         Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
      2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
         machines. The old direct reclaim backoff, which tries to enforce a
         minimum fairness among all eligible memcgs, over-swapped by about
         (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
         pulls the plug on swapping once the target is met, trades some
         fairness for curtailed latency:
         https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
      3. Fixed minior build warnings and conflicts. More comments and nits.
      
      TLDR
      ====
      The current page reclaim is too expensive in terms of CPU usage and it
      often makes poor choices about what to evict. This patchset offers an
      alternative solution that is performant, versatile and
      straightforward.
      
      Patchset overview
      =================
      The design and implementation overview is in patch 14:
      https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
      
      01. mm: x86, arm64: add arch_has_hw_pte_young()
      02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
      Take advantage of hardware features when trying to clear the accessed
      bit in many PTEs.
      
      03. mm/vmscan.c: refactor shrink_node()
      04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
          its sole caller"
      Minor refactors to improve readability for the following patches.
      
      05. mm: multi-gen LRU: groundwork
      Adds the basic data structure and the functions that insert pages to
      and remove pages from the multi-gen LRU (MGLRU) lists.
      
      06. mm: multi-gen LRU: minimal implementation
      A minimal implementation without optimizations.
      
      07. mm: multi-gen LRU: exploit locality in rmap
      Exploits spatial locality to improve efficiency when using the rmap.
      
      08. mm: multi-gen LRU: support page table walks
      Further exploits spatial locality by optionally scanning page tables.
      
      09. mm: multi-gen LRU: optimize multiple memcgs
      Optimizes the overall performance for multiple memcgs running mixed
      types of workloads.
      
      10. mm: multi-gen LRU: kill switch
      Adds a kill switch to enable or disable MGLRU at runtime.
      
      11. mm: multi-gen LRU: thrashing prevention
      12. mm: multi-gen LRU: debugfs interface
      Provide userspace with features like thrashing prevention, working set
      estimation and proactive reclaim.
      
      13. mm: multi-gen LRU: admin guide
      14. mm: multi-gen LRU: design doc
      Add an admin guide and a design doc.
      
      Benchmark results
      =================
      Independent lab results
      -----------------------
      Based on the popularity of searches [01] and the memory usage in
      Google's public cloud, the most popular open-source memory-hungry
      applications, in alphabetical order, are:
            Apache Cassandra      Memcached
            Apache Hadoop         MongoDB
            Apache Spark          PostgreSQL
            MariaDB (MySQL)       Redis
      
      An independent lab evaluated MGLRU with the most widely used benchmark
      suites for the above applications. They posted 960 data points along
      with kernel metrics and perf profiles collected over more than 500
      hours of total benchmark time. Their final reports show that, with 95%
      confidence intervals (CIs), the above applications all performed
      significantly better for at least part of their benchmark matrices.
      
      On 5.14:
      1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
         less wall time to sort three billion random integers, respectively,
         under the medium- and the high-concurrency conditions, when
         overcommitting memory. There were no statistically significant
         changes in wall time for the rest of the benchmark matrix.
      2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
         more transactions per minute (TPM), respectively, under the medium-
         and the high-concurrency conditions, when overcommitting memory.
         There were no statistically significant changes in TPM for the rest
         of the benchmark matrix.
      3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
         and [21.59, 30.02]% more operations per second (OPS), respectively,
         for sequential access, random access and Gaussian (distribution)
         access, when THP=always; 95% CIs [13.85, 15.97]% and
         [23.94, 29.92]% more OPS, respectively, for random access and
         Gaussian access, when THP=never. There were no statistically
         significant changes in OPS for the rest of the benchmark matrix.
      4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
         [2.16, 3.55]% more operations per second (OPS), respectively, for
         exponential (distribution) access, random access and Zipfian
         (distribution) access, when underutilizing memory; 95% CIs
         [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
         respectively, for exponential access, random access and Zipfian
         access, when overcommitting memory.
      
      On 5.15:
      5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
         and [4.11, 7.50]% more operations per second (OPS), respectively,
         for exponential (distribution) access, random access and Zipfian
         (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
         [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
         exponential access, random access and Zipfian access, when swap was
         on.
      6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
         less average wall time to finish twelve parallel TeraSort jobs,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in average wall time for the rest of the
         benchmark matrix.
      7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
         minute (TPM) under the high-concurrency condition, when swap was
         off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in TPM for the rest of the benchmark matrix.
      8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
         [11.47, 19.36]% more total operations per second (OPS),
         respectively, for sequential access, random access and Gaussian
         (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
         [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
         for sequential access, random access and Gaussian access, when
         THP=never.
      
      Our lab results
      ---------------
      To supplement the above results, we ran the following benchmark suites
      on 5.16-rc7 and found no regressions [10].
            fs_fio_bench_hdd_mq      pft
            fs_lmbench               pgsql-hammerdb
            fs_parallelio            redis
            fs_postmark              stream
            hackbench                sysbenchthread
            kernbench                tpcc_spark
            memcached                unixbench
            multichase               vm-scalability
            mutilate                 will-it-scale
            nginx
      
      [01] https://trends.google.com
      [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
      [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
      [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
      [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
      [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
      [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
      [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
      [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
      [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
      
      Read-world applications
      =======================
      Third-party testimonials
      ------------------------
      Konstantin reported [11]:
         I have Archlinux with 8G RAM + zswap + swap. While developing, I
         have lots of apps opened such as multiple LSP-servers for different
         langs, chats, two browsers, etc... Usually, my system gets quickly
         to a point of SWAP-storms, where I have to kill LSP-servers,
         restart browsers to free memory, etc, otherwise the system lags
         heavily and is barely usable.
         
         1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
         patchset, and I started up by opening lots of apps to create memory
         pressure, and worked for a day like this. Till now I had not a
         single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
         getting to the point of 3G in SWAP before without a single
         SWAP-storm.
      
      Vaibhav from IBM reported [12]:
         In a synthetic MongoDB Benchmark, seeing an average of ~19%
         throughput improvement on POWER10(Radix MMU + 64K Page Size) with
         MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
         three different request distributions, namely, Exponential, Uniform
         and Zipfan.
      
      Shuang from U of Rochester reported [13]:
         With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
         and [9.26, 10.36]% higher throughput, respectively, for random
         access, Zipfian (distribution) access and Gaussian (distribution)
         access, when the average number of jobs per CPU is 1; 95% CIs
         [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
         throughput, respectively, for random access, Zipfian access and
         Gaussian access, when the average number of jobs per CPU is 2.
      
      Daniel from Michigan Tech reported [14]:
         With Memcached allocating ~100GB of byte-addressable Optante,
         performance improvement in terms of throughput (measured as queries
         per second) was about 10% for a series of workloads.
      
      Large-scale deployments
      -----------------------
      We've rolled out MGLRU to tens of millions of ChromeOS users and
      about a million Android users. Google's fleetwide profiling [15] shows
      an overall 40% decrease in kswapd CPU usage, in addition to
      improvements in other UX metrics, e.g., an 85% decrease in the number
      of low-memory kills at the 75th percentile and an 18% decrease in
      app launch time at the 50th percentile.
      
      The downstream kernels that have been using MGLRU include:
      1. Android [16]
      2. Arch Linux Zen [17]
      3. Armbian [18]
      4. ChromeOS [19]
      5. Liquorix [20]
      6. OpenWrt [21]
      7. post-factum [22]
      8. XanMod [23]
      
      [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
      [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
      [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
      [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
      [15] https://dl.acm.org/doi/10.1145/2749469.2750392
      [16] https://android.com
      [17] https://archlinux.org
      [18] https://armbian.com
      [19] https://chromium.org
      [20] https://liquorix.net
      [21] https://openwrt.org
      [22] https://codeberg.org/pf-kernel
      [23] https://xanmod.org
      
      Summary
      =======
      The facts are:
      1. The independent lab results and the real-world applications
         indicate substantial improvements; there are no known regressions.
      2. Thrashing prevention, working set estimation and proactive reclaim
         work out of the box; there are no equivalent solutions.
      3. There is a lot of new code; no smaller changes have been
         demonstrated similar effects.
      
      Our options, accordingly, are:
      1. Given the amount of evidence, the reported improvements will likely
         materialize for a wide range of workloads.
      2. Gauging the interest from the past discussions, the new features
         will likely be put to use for both personal computers and data
         centers.
      3. Based on Google's track record, the new code will likely be well
         maintained in the long term. It'd be more difficult if not
         impossible to achieve similar effects with other approaches.
      
      
      This patch (of 14):
      
      Some architectures automatically set the accessed bit in PTEs, e.g., x86
      and arm64 v8.2.  On architectures that do not have this capability,
      clearing the accessed bit in a PTE usually triggers a page fault following
      the TLB miss of this PTE (to emulate the accessed bit).
      
      Being aware of this capability can help make better decisions, e.g.,
      whether to spread the work out over a period of time to reduce bursty page
      faults when trying to clear the accessed bit in many PTEs.
      
      Note that theoretically this capability can be unreliable, e.g.,
      hotplugged CPUs might be different from builtin ones.  Therefore it should
      not be used in architecture-independent code that involves correctness,
      e.g., to determine whether TLB flushes are required (in combination with
      the accessed bit).
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.comSigned-off-by: NYu Zhao <yuzhao@google.com>
      Reviewed-by: NBarry Song <baohua@kernel.org>
      Acked-by: NBrian Geffon <bgeffon@google.com>
      Acked-by: NJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: NSteven Barrett <steven@liquorix.net>
      Acked-by: NSuleiman Souhlal <suleiman@google.com>
      Acked-by: NWill Deacon <will@kernel.org>
      Tested-by: NDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: NDonald Carr <d@chaos-reins.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: NShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: NSofia Trinh <sofia.trinh@edi.works>
      Tested-by: NVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      e1fd09e3
    • Y
      mm/page_io: count submission time as thrashing delay for delayacct · 3a9bb7b1
      Yang Yang 提交于
      Once upon a time, we only support accounting thrashing of page cache. 
      Then Joonsoo introduced workingset detection for anonymous pages and we
      gained the ability to account thrashing of them[1].
      
      Likes PSI, we count submission time as thrashing delay because when the
      device is congested, or the submitting cgroup IO-throttled, submission can
      be a significant part of overall IO time.
      
      Without this patch, swap thrashing through frontswap or some block
      device supporting rw_page operation isn't measured correctly.
      
      This patch is based on "delayacct: support re-entrance detection of
      thrashing accounting".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220815072835.74876-1-yang.yang29@zte.com.cnSigned-off-by: NYang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: NCGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: NRan Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: Nwangyong <wang.yong12@zte.com.cn>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      3a9bb7b1
    • Y
      delayacct: support re-entrance detection of thrashing accounting · aa1cf99b
      Yang Yang 提交于
      Once upon a time, we only support accounting thrashing of page cache. 
      Then Joonsoo introduced workingset detection for anonymous pages and we
      gained the ability to account thrashing of them[1].
      
      For page cache thrashing accounting, there is no suitable place to do it
      in fs level likes swap_readpage().  So we have to do it in
      folio_wait_bit_common().
      
      Then for anonymous pages thrashing accounting, we have to do it in both
      swap_readpage() and folio_wait_bit_common().  This likes PSI, so we should
      let thrashing accounting supports re-entrance detection.
      
      This patch is to prepare complete thrashing accounting, and is based on
      patch "filemap: make the accounting of thrashing more consistent".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220815071134.74551-1-yang.yang29@zte.com.cnSigned-off-by: NYang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: NCGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: NRan Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: Nwangyong <wang.yong12@zte.com.cn>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      aa1cf99b
    • B
      mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP · 7047b5a4
      Baolin Wang 提交于
      If THP is failed to migrate due to -ENOSYS or -ENOMEM case, the THP will
      be split, and the subpages of fail-to-migrate THP will be tried to migrate
      again, so we should not account the retry counter in the second loop,
      since we already accounted 'nr_thp_failed' in the first loop.
      
      Moreover we also do not need retry 10 times for -EAGAIN case for the
      subpages of fail-to-migrate THP in the second loop, since we already
      regarded the THP as migration failure, and save some migration time (for
      the worst case, will try 512 * 10 times) according to previous discussion
      [1].
      
      [1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.comTested-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      7047b5a4
    • H
      migrate_pages(): fix failure counting for retry · 077309bc
      Huang Ying 提交于
      After 10 retries, we will give up and the remaining pages will be counted
      as failure in nr_failed and nr_thp_failed.  We should count the failure in
      nr_failed_pages too.  This is done in this patch.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      077309bc
    • H
      migrate_pages(): fix failure counting for THP splitting · e6fa8a79
      Huang Ying 提交于
      If THP is failed to be migrated, it may be split and retry.  But after
      splitting, the head page will be left in "from" list, although THP
      migration failure has been counted already.  If the head page is failed to
      be migrated too, the failure will be counted twice incorrectly.  So this
      is fixed in this patch via moving the head page of THP after splitting to
      "thp_split_pages" too.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      e6fa8a79
    • H
      migrate_pages(): fix failure counting for THP on -ENOSYS · 577be05c
      Huang Ying 提交于
      If THP or hugetlbfs page migration isn't supported, unmap_and_move() or
      unmap_and_move_huge_page() will return -ENOSYS.  For THP, splitting will
      be tried, but if splitting doesn't succeed, the THP will be left in "from"
      list wrongly.  If some other pages are retried, the THP migration failure
      will counted again.  This is fixed via moving the failure THP from "from"
      to "ret_pages".
      
      Another issue of the original code is that the unsupported failure
      processing isn't consistent between THP and hugetlbfs page.  Make them
      consistent in this patch to make the code easier to be understood too.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      577be05c
    • H
      migrate_pages(): fix failure counting for THP subpages retrying · 5fc30916
      Huang Ying 提交于
      If THP is failed to be migrated for -ENOSYS and -ENOMEM, the THP will be
      split into thp_split_pages, and after other pages are migrated, pages in
      thp_split_pages will be migrated with no_subpage_counting == true, because
      its failure have been counted already.  If some pages in thp_split_pages
      are retried during migration, we should not count their failure if
      no_subpage_counting == true too.  This is done this patch to fix the
      failure counting for THP subpages retrying.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      5fc30916
    • H
      migrate_pages(): fix THP failure counting for -ENOMEM · fbed53b4
      Huang Ying 提交于
      In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be
      returned, and migrate_pages() will try to split the THP unless "reason" is
      MR_NUMA_MISPLACED (that is, nosplit == true).  But when nosplit == true,
      the THP migration failure will not be counted.
      
      This is incorrect, so in this patch, the THP migration failure will be
      counted for -ENOMEM regardless of nosplit is true or false.  The nr_failed
      counting isn't fixed because it's not used.  Added some comments for it
      per Baolin's suggestion.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      fbed53b4
    • H
      migrate_pages(): remove unnecessary list_safe_reset_next() · 9c62ff00
      Huang Ying 提交于
      Before commit b5bade97 ("mm: migrate: fix the return value of
      migrate_pages()"), the tail pages of THP will be put in the "from"
      list directly.  So one of the loop cursors (page2) needs to be reset,
      as is done in try_split_thp() via list_safe_reset_next().  But after
      the commit, the tail pages of THP will be put in a dedicated
      list (thp_split_pages).  That is, the "from" list will not be changed
      during splitting.  So, it's unnecessary to call list_safe_reset_next()
      anymore.
      
      This is a code cleanup, no functionality changes are expected.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      9c62ff00
    • H
      migrate: fix syscall move_pages() return value for failure · a7504ed1
      Huang Ying 提交于
      Patch series "migrate_pages(): fix several bugs in error path", v3.
      
      During review the code of migrate_pages() and build a test program for
      it.  Several bugs in error path are identified and fixed in this
      series.
      
      Most patches are tested via
      
      - Apply error-inject.patch in Linux kernel
      - Compile test-migrate.c (with -lnuma)
      - Test with test-migrate.sh
      
      error-inject.patch, test-migrate.c, and test-migrate.sh are as below.
      It turns out that error injection is an important tool to fix bugs in
      error path.
      
      
      This patch (of 8):
      
      The return value of move_pages() syscall is incorrect when counting
      the remaining pages to be migrated.  For example, for the following
      test program,
      
      "
       #define _GNU_SOURCE
      
       #include <stdbool.h>
       #include <stdio.h>
       #include <string.h>
       #include <stdlib.h>
       #include <errno.h>
      
       #include <fcntl.h>
       #include <sys/uio.h>
       #include <sys/mman.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <numaif.h>
       #include <numa.h>
      
       #ifndef MADV_FREE
       #define MADV_FREE	8		/* free pages only if memory pressure */
       #endif
      
       #define ONE_MB		(1024 * 1024)
       #define MAP_SIZE	(16 * ONE_MB)
       #define THP_SIZE	(2 * ONE_MB)
       #define THP_MASK	(THP_SIZE - 1)
      
       #define ERR_EXIT_ON(cond, msg)					\
      	 do {							\
      		 int __cond_in_macro = (cond);			\
      		 if (__cond_in_macro)				\
      			 error_exit(__cond_in_macro, (msg));	\
      	 } while (0)
      
       void error_msg(int ret, int nr, int *status, const char *msg)
       {
      	 int i;
      
      	 fprintf(stderr, "Error: %s, ret : %d, error: %s\n",
      		 msg, ret, strerror(errno));
      
      	 if (!nr)
      		 return;
      	 fprintf(stderr, "status: ");
      	 for (i = 0; i < nr; i++)
      		 fprintf(stderr, "%d ", status[i]);
      	 fprintf(stderr, "\n");
       }
      
       void error_exit(int ret, const char *msg)
       {
      	 error_msg(ret, 0, NULL, msg);
      	 exit(1);
       }
      
       int page_size;
      
       bool do_vmsplice;
       bool do_thp;
      
       static int pipe_fds[2];
       void *addr;
       char *pn;
       char *pn1;
       void *pages[2];
       int status[2];
      
       void prepare()
       {
      	 int ret;
      	 struct iovec iov;
      
      	 if (addr) {
      		 munmap(addr, MAP_SIZE);
      		 close(pipe_fds[0]);
      		 close(pipe_fds[1]);
      	 }
      
      	 ret = pipe(pipe_fds);
      	 ERR_EXIT_ON(ret, "pipe");
      
      	 addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	 ERR_EXIT_ON(addr == MAP_FAILED, "mmap");
      	 if (do_thp) {
      		 ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE);
      		 ERR_EXIT_ON(ret, "advise hugepage");
      	 }
      
      	 pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK);
      	 pn1 = pn + THP_SIZE;
      	 pages[0] = pn;
      	 pages[1] = pn1;
      	 *pn = 1;
      
      	 if (do_vmsplice) {
      		 iov.iov_base = pn;
      		 iov.iov_len = page_size;
      		 ret = vmsplice(pipe_fds[1], &iov, 1, 0);
      		 ERR_EXIT_ON(ret < 0, "vmsplice");
      	 }
      
      	 status[0] = status[1] = 1024;
       }
      
       void test_migrate()
       {
      	 int ret;
      	 int nodes[2] = { 1, 1 };
      	 pid_t pid = getpid();
      
      	 prepare();
      	 ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 1, status, "move 1 page");
      
      	 prepare();
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 not mapped");
      
      	 prepare();
      	 *pn1 = 1;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages");
      
      	 prepare();
      	 *pn1 = 1;
      	 nodes[1] = 0;
      	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
      	 error_msg(ret, 2, status, "move 2 pages, page 1 to node 0");
       }
      
       int main(int argc, char *argv[])
       {
      	 numa_run_on_node(0);
      	 page_size = getpagesize();
      
      	 test_migrate();
      
      	 fprintf(stderr, "\nMake page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTest THP:\n");
      	 do_thp = true;
      	 do_vmsplice = false;
      	 test_migrate();
      
      	 fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n");
      	 do_vmsplice = true;
      	 test_migrate();
      
      	 return 0;
       }
      "
      
      The output of the current kernel is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 0, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 1, error: Success
      status: 1024 1024
      "
      
      While the expected output is,
      
      "
      Error: move 1 page, ret : 0, error: Success
      status: 1
      Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
      status: 1 -14
      Error: move 2 pages, ret : 0, error: Success
      status: 1 1
      Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
      status: 1 0
      
      Make page 0 cannot be migrated:
      Error: move 1 page, ret : 1, error: Success
      status: 1024
      Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
      status: 1024 -14
      Error: move 2 pages, ret : 1, error: Success
      status: 1024 1024
      Error: move 2 pages, page 1 to node 0, ret : 2, error: Success
      status: 1024 1024
      "
      
      Fix this via correcting the remaining pages counting.  With the fix,
      the output for the test program as above is expected.
      
      Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com
      Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      a7504ed1
    • Y
      filemap: make the accounting of thrashing more consistent · f347c9d2
      Yang Yang 提交于
      Once upon a time, we only support accounting thrashing of page cache. 
      Then Joonsoo introduced workingset detection for anonymous pages and we
      gained the ability to account thrashing of them[1].
      
      So let delayacct account both the thrashing of page cache and anonymous
      pages, this could make the codes more consistent and simpler.
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220805033838.1714674-1-yang.yang29@zte.com.cnSigned-off-by: NYang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: NCGEL ZTE <cgel.zte@gmail.com>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      f347c9d2
    • P
      mm/swap: cache swap migration A/D bits support · 5154e607
      Peter Xu 提交于
      Introduce a variable swap_migration_ad_supported to cache whether the arch
      supports swap migration A/D bits.
      
      Here one thing to mention is that SWP_MIG_TOTAL_BITS will internally
      reference the other macro MAX_PHYSMEM_BITS, which is a function call on
      x86 (constant on all the rest of archs).
      
      It's safe to reference it in swapfile_init() because when reaching here
      we're already during initcalls level 4 so we must have initialized 5-level
      pgtable for x86_64 (right after early_identify_cpu() finishes).
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
            - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      This should slightly speed up the migration swap entry handlings.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-8-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      5154e607
    • P
      mm/swap: cache maximum swapfile size when init swap · be45a490
      Peter Xu 提交于
      We used to have swapfile_maximum_size() fetching a maximum value of
      swapfile size per-arch.
      
      As the caller of max_swapfile_size() grows, this patch introduce a
      variable "swapfile_maximum_size" and cache the value of old
      max_swapfile_size(), so that we don't need to calculate the value every
      time.
      
      Caching the value in swapfile_init() is safe because when reaching the
      phase we should have initialized all the relevant information.  Here the
      major arch to take care of is x86, which defines the max swapfile size
      based on L1TF mitigation.
      
      Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
      when reaching swapfile_init().  As a reference, the code path looks like
      this for x86:
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - early_identify_cpu --> setup X86_BUG_L1TF
        - parse_early_param
          - l1tf_cmdline --> set l1tf_mitigation
        - check_bugs
          - l1tf_select_mitigation --> set l1tf_mitigation
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      The swapfile size only depends on swp pte format on non-x86 archs, so
      caching it is safe too.
      
      Since at it, rename max_swapfile_size() to arch_max_swapfile_size()
      because arch can define its own function, so it's more straightforward to
      have "arch_" as its prefix.  At the meantime, export swapfile_maximum_size
      to replace the old usages of max_swapfile_size().
      
      [peterx@redhat.com: declare arch_max_swapfile_size) in swapfile.h]
        Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local
      Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      be45a490
    • P
      mm: remember young/dirty bit for page migrations · 2e346877
      Peter Xu 提交于
      When page migration happens, we always ignore the young/dirty bit settings
      in the old pgtable, and marking the page as old in the new page table
      using either pte_mkold() or pmd_mkold(), and keeping the pte clean.
      
      That's fine from functional-wise, but that's not friendly to page reclaim
      because the moving page can be actively accessed within the procedure. 
      Not to mention hardware setting the young bit can bring quite some
      overhead on some systems, e.g.  x86_64 needs a few hundreds nanoseconds to
      set the bit.  The same slowdown problem to dirty bits when the memory is
      first written after page migration happened.
      
      Actually we can easily remember the A/D bit configuration and recover the
      information after the page is migrated.  To achieve it, define a new set
      of bits in the migration swap offset field to cache the A/D bits for old
      pte.  Then when removing/recovering the migration entry, we can recover
      the A/D bits even if the page changed.
      
      One thing to mention is that here we used max_swapfile_size() to detect
      how many swp offset bits we have, and we'll only enable this feature if we
      know the swp offset is big enough to store both the PFN value and the A/D
      bits.  Otherwise the A/D bits are dropped like before.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      2e346877
    • P
      mm/thp: carry over dirty bit when thp splits on pmd · 0ccf7f16
      Peter Xu 提交于
      Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
      shouldn't be a correctness issue since when pmd_dirty() we'll have the
      page marked dirty anyway, however having dirty bit carried over helps the
      next initial writes of split ptes on some archs like x86.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NHuang Ying <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      0ccf7f16
    • P
      mm/swap: add swp_offset_pfn() to fetch PFN from swap entry · 0d206b5d
      Peter Xu 提交于
      We've got a bunch of special swap entries that stores PFN inside the swap
      offset fields.  To fetch the PFN, normally the user just calls
      swp_offset() assuming that'll be the PFN.
      
      Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
      max possible length of a PFN on the host, meanwhile doing proper check
      with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the
      PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
      
      One reason to do so is we never tried to sanitize whether swap offset can
      really fit for storing PFN.  At the meantime, this patch also prepares us
      with the future possibility to store more information inside the swp
      offset field, so assuming "swp_offset(entry)" to be the PFN will not stand
      any more very soon.
      
      Replace many of the swp_offset() callers to use swp_offset_pfn() where
      proper.  Note that many of the existing users are not candidates for the
      replacement, e.g.:
      
        (1) When the swap entry is not a pfn swap entry at all, or,
        (2) when we wanna keep the whole swp_offset but only change the swp type.
      
      For the latter, it can happen when fork() triggered on a write-migration
      swap entry pte, we may want to only change the migration type from
      write->read but keep the rest, so it's not "fetching PFN" but "changing
      swap type only".  They're left aside so that when there're more
      information within the swp offset they'll be carried over naturally in
      those cases.
      
      Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
      the new swp_offset_pfn() is about.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      0d206b5d
    • P
      mm/swap: comment all the ifdef in swapops.h · eba4d770
      Peter Xu 提交于
      swapops.h contains quite a few layers of ifdef, some of the "else" and
      "endif" doesn't get proper comment on the macro so it's hard to follow on
      what are they referring to.  Add the comments.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-3-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Suggested-by: NNadav Amit <nadav.amit@gmail.com>
      Reviewed-by: NHuang Ying <ying.huang@intel.com>
      Reviewed-by: NAlistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      eba4d770
    • P
      mm/x86: use SWP_TYPE_BITS in 3-level swap macros · 9c61d532
      Peter Xu 提交于
      Patch series "mm: Remember a/d bits for migration entries", v4.
      
      
      Problem
      =======
      
      When migrating a page, right now we always mark the migrated page as old &
      clean.
      
      However that could lead to at least two problems:
      
        (1) We lost the real hot/cold information while we could have persisted.
            That information shouldn't change even if the backing page is changed
            after the migration,
      
        (2) There can be always extra overhead on the immediate next access to
            any migrated page, because hardware MMU needs cycles to set the young
            bit again for reads, and dirty bits for write, as long as the
            hardware MMU supports these bits.
      
      Many of the recent upstream works showed that (2) is not something trivial
      and actually very measurable.  In my test case, reading 1G chunk of memory
      - jumping in page size intervals - could take 99ms just because of the
      extra setting on the young bit on a generic x86_64 system, comparing to
      4ms if young set.
      
      This issue is originally reported by Andrea Arcangeli.
      
      Solution
      ========
      
      To solve this problem, this patchset tries to remember the young/dirty
      bits in the migration entries and carry them over when recovering the
      ptes.
      
      We have the chance to do so because in many systems the swap offset is not
      really fully used.  Migration entries use swp offset to store PFN only,
      while the PFN is normally not as large as swp offset and normally smaller.
      It means we do have some free bits in swp offset that we can use to store
      things like A/D bits, and that's how this series tried to approach this
      problem.
      
      max_swapfile_size() is used here to detect per-arch offset length in swp
      entries.  We'll automatically remember the A/D bits when we find that we
      have enough swp offset field to keep both the PFN and the extra bits.
      
      Since max_swapfile_size() can be slow, the last two patches cache the
      results for it and also swap_migration_ad_supported as a whole.
      
      Known Issues / TODOs
      ====================
      
      We still haven't taught madvise() to recognize the new A/D bits in
      migration entries, namely MADV_COLD/MADV_FREE.  E.g.  when MADV_COLD upon
      a migration entry.  It's not clear yet on whether we should clear the A
      bit, or we should just drop the entry directly.
      
      We didn't teach idle page tracking on the new migration entries, because
      it'll need larger rework on the tree on rmap pgtable walk.  However it
      should make it already better because before this patchset page will be
      old page after migration, so the series will fix potential false negative
      of idle page tracking when pages were migrated before observing.
      
      The other thing is migration A/D bits will not start to working for
      private device swap entries.  The code is there for completeness but since
      private device swap entries do not yet have fields to store A/D bits, even
      if we'll persistent A/D across present pte switching to migration entry,
      we'll lose it again when the migration entry converted to private device
      swap entry.
      
      Tests
      =====
      
      After the patchset applied, the immediate read access test [1] of above 1G
      chunk after migration can shrink from 99ms to 4ms.  The test is done by
      moving 1G pages from node 0->1->0 then read it in page size jumps.  The
      test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
      
      Similar effect can also be measured when writting the memory the 1st time
      after migration.
      
      After applying the patchset, both initial immediate read/write after page
      migrated will perform similarly like before migration happened.
      
      Patch Layout
      ============
      
      Patch 1-2:  Cleanups from either previous versions or on swapops.h macros.
      
      Patch 3-4:  Prepare for the introduction of migration A/D bits
      
      Patch 5:    The core patch to remember young/dirty bit in swap offsets.
      
      Patch 6-7:  Cache relevant fields to make migration_entry_supports_ad() fast.
      
      [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
      
      
      This patch (of 7):
      
      Replace all the magic "5" with the macro.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NHuang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      9c61d532
    • M
      mm, hwpoison: cleanup some obsolete comments · 9cf28191
      Miaohe Lin 提交于
      1.Remove meaningless comment in kill_proc(). That doesn't tell anything.
      2.Fix the wrong function name get_hwpoison_unless_zero(). It should be
      get_page_unless_zero().
      3.The gate keeper for free hwpoison page has moved to check_new_page().
      Update the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-7-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      9cf28191
    • M
      mm, hwpoison: check PageTable() explicitly in hwpoison_user_mappings() · b680dae9
      Miaohe Lin 提交于
      PageTable can't be handled by memory_failure(). Filter it out explicitly in
      hwpoison_user_mappings(). This will also make code more consistent with the
      relevant check in unpoison_memory().
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-6-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      b680dae9
    • M
      mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon() · 36537a67
      Miaohe Lin 提交于
      If vma->vm_mm != t->mm, there's no need to call page_mapped_in_vma() as
      add_to_kill() won't be called in this case. Move up the mm check to avoid
      possible unneeded calling to page_mapped_in_vma().
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-5-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      36537a67
    • M
      mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages · 21c9e90a
      Miaohe Lin 提交于
      Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
      num_poisoned_pages_dec() can be killed as there's no caller now.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      21c9e90a
    • M
      mm, hwpoison: use __PageMovable() to detect non-lru movable pages · da294991
      Miaohe Lin 提交于
      It's more recommended to use __PageMovable() to detect non-lru movable
      pages. We can avoid bumping page refcnt via isolate_movable_page() for
      the isolated lru pages. Also if pages become PageLRU just after they're
      checked but before trying to isolate them, isolate_lru_page() will be
      called to do the right work.
      
      [linmiaohe@huawei.com: fixes per Naoya Horiguchi]
        Link: https://lkml.kernel.org/r/1f7ee86e-7d28-0d8c-e0de-b7a5a94519e8@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-3-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      da294991
    • M
      mm, hwpoison: use ClearPageHWPoison() in memory_failure() · 2fe62e22
      Miaohe Lin 提交于
      Patch series "A few cleanup patches for memory-failure".
      
      his series contains a few cleanup patches to use __PageMovable() to detect
      non-lru movable pages, use num_poisoned_pages_sub() to reduce multiple
      atomic ops overheads and so on.  More details can be found in the
      respective changelogs.
      
      
      This patch (of 6):
      
      Use ClearPageHWPoison() instead of TestClearPageHWPoison() to clear page
      hwpoison flags to avoid unneeded full memory barrier overhead.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-2-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      2fe62e22
    • Y
      mm: MADV_COLLAPSE: refetch vm_end after reacquiring mmap_lock · 4d24de94
      Yang Shi 提交于
      The syzbot reported the below problem:
      
      BUG: Bad page map in process syz-executor198  pte:8000000071c00227 pmd:74b30067
      addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
      file:(null) fault:0x0 mmap:0x0 read_folio:0x0
      CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
       vm_normal_page+0x10c/0x2a0 mm/memory.c:636
       hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
       madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
       do_madvise mm/madvise.c:1428 [inline]
       __do_sys_madvise mm/madvise.c:1428 [inline]
       __se_sys_madvise mm/madvise.c:1426 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f770ba87929
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
      R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
      R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
      
      Basically the test program does the below conceptually:
      1. mmap 0x2000000 - 0x21000000 as anonymous region
      2. mmap io_uring SQ stuff at 0x20563000 with MAP_FIXED, io_uring_mmap()
         actually remaps the pages with special PTEs
      3. call MADV_COLLAPSE for 0x20000000 - 0x21000000
      
      It actually triggered the below race:
      
                   CPU A                                          CPU B
      mmap 0x20000000 - 0x21000000 as anon
                                                 madvise_collapse is called on this area
                                                   Retrieve start and end address from the vma (NEVER updated later!)
                                                   Collapsed the first 2M area and dropped mmap_lock
      Acquire mmap_lock
      mmap io_uring file at 0x20563000
      Release mmap_lock
                                                   Reacquire mmap_lock
                                                   revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000
                                                   scan the next 2M (0x20200000 - 0x20400000), but due to whatever reason it didn't release mmap_lock
                                                   scan the 3rd 2M area (start from 0x20400000)
                                                     get into the vma created by io_uring
      
      The hend should be updated after MADV_COLLAPSE reacquire mmap_lock since
      the vma may be shrunk.  We don't have to worry about shink from the other
      direction since it could be caught by hugepage_vma_revalidate().  Either
      no valid vma is found or the vma doesn't fit anymore.
      
      Link: https://lkml.kernel.org/r/20220914162220.787703-1-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Reported-by: syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Reviewed-by: NZach O'Keefe <zokeefe@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      4d24de94
    • A
      Merge branch 'mm-hotfixes-stable' into mm-stable · 6d751329
      Andrew Morton 提交于
      6d751329
    • K
      x86/uaccess: avoid check_object_size() in copy_from_user_nmi() · 59298997
      Kees Cook 提交于
      The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed
      to skip any checks where the length is known at compile time as a
      reasonable heuristic to avoid "likely known-good" cases.  However, it can
      only do this when the copy_*_user() helpers are, themselves, inline too.
      
      Using find_vmap_area() requires taking a spinlock.  The
      check_object_size() helper can call find_vmap_area() when the destination
      is in vmap memory.  If show_regs() is called in interrupt context, it will
      attempt a call to copy_from_user_nmi(), which may call check_object_size()
      and then find_vmap_area().  If something in normal context happens to be
      in the middle of calling find_vmap_area() (with the spinlock held), the
      interrupt handler will hang forever.
      
      The copy_from_user_nmi() call is actually being called with a fixed-size
      length, so check_object_size() should never have been called in the first
      place.  Given the narrow constraints, just replace the
      __copy_from_user_inatomic() call with an open-coded version that calls
      only into the sanitizers and not check_object_size(), followed by a call
      to raw_copy_from_user().
      
      [akpm@linux-foundation.org: no instrument_copy_from_user() in my tree...]
      Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org
      Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com
      Fixes: 0aef499f ("mm/usercopy: Detect vmalloc overruns")
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reported-by: NYu Zhao <yuzhao@google.com>
      Reported-by: NFlorian Lehner <dev@der-flo.net>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NFlorian Lehner <dev@der-flo.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      59298997
    • Z
      mm/page_isolation: fix isolate_single_pageblock() isolation behavior · 80e2b584
      Zi Yan 提交于
      set_migratetype_isolate() does not allow isolating MIGRATE_CMA pageblocks
      unless it is used for CMA allocation.  isolate_single_pageblock() did not
      have the same behavior when it is used together with
      set_migratetype_isolate() in start_isolate_page_range().  This allows
      alloc_contig_range() with migratetype other than MIGRATE_CMA, like
      MIGRATE_MOVABLE (used by alloc_contig_pages()), to isolate first and last
      pageblock but fail the rest.  The failure leads to changing migratetype of
      the first and last pageblock to MIGRATE_MOVABLE from MIGRATE_CMA,
      corrupting the CMA region.  This can happen during gigantic page
      allocations.
      
      Like Doug said here:
      https://lore.kernel.org/linux-mm/a3363a52-883b-dcd1-b77f-f2bb378d6f2d@gmail.com/T/#u,
      for gigantic page allocations, the user would notice no difference,
      since the allocation on CMA region will fail as well as it did before. 
      But it might hurt the performance of device drivers that use CMA, since
      CMA region size decreases.
      
      Fix it by passing migratetype into isolate_single_pageblock(), so that
      set_migratetype_isolate() used by isolate_single_pageblock() will prevent
      the isolation happening.
      
      Link: https://lkml.kernel.org/r/20220914023913.1855924-1-zi.yan@sent.com
      Fixes: b2c9e2fb ("mm: make alloc_contig_range work at pageblock granularity")
      Signed-off-by: NZi Yan <ziy@nvidia.com>
      Reported-by: NDoug Berger <opendmb@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Doug Berger <opendmb@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      80e2b584
    • S
      mm,hwpoison: check mm when killing accessing process · 77677cdb
      Shuai Xue 提交于
      The GHES code calls memory_failure_queue() from IRQ context to queue work
      into workqueue and schedule it on the current CPU.  Then the work is
      processed in memory_failure_work_func() by kworker and calls
      memory_failure().
      
      When a page is already poisoned, commit a3f5d80e ("mm,hwpoison: send
      SIGBUS with error virutal address") make memory_failure() call
      kill_accessing_process() that:
      
          - holds mmap locking of current->mm
          - does pagetable walk to find the error virtual address
          - and sends SIGBUS to the current process with error info.
      
      However, the mm of kworker is not valid, resulting in a null-pointer
      dereference.  So check mm when killing the accessing process.
      
      [akpm@linux-foundation.org: remove unrelated whitespace alteration]
      Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
      Fixes: a3f5d80e ("mm,hwpoison: send SIGBUS with error virutal address")
      Signed-off-by: NShuai Xue <xueshuai@linux.alibaba.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Bixuan Cui <cuibixuan@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      77677cdb
    • D
      mm/hugetlb: correct demote page offset logic · 31731452
      Doug Berger 提交于
      With gigantic pages it may not be true that struct page structures are
      contiguous across the entire gigantic page.  The nth_page macro is used
      here in place of direct pointer arithmetic to correct for this.
      
      Mike said:
      
      : This error could cause addressing exceptions.  However, this is only
      : possible in configurations where CONFIG_SPARSEMEM &&
      : !CONFIG_SPARSEMEM_VMEMMAP.  Such a configuration option is rare and
      : unknown to be the default anywhere.
      
      Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com
      Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support")
      Signed-off-by: NDoug Berger <opendmb@gmail.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      31731452
    • M
      mm: prevent page_frag_alloc() from corrupting the memory · dac22531
      Maurizio Lombardi 提交于
      A number of drivers call page_frag_alloc() with a fragment's size >
      PAGE_SIZE.
      
      In low memory conditions, __page_frag_cache_refill() may fail the order
      3 cache allocation and fall back to order 0; In this case, the cache
      will be smaller than the fragment, causing memory corruptions.
      
      Prevent this from happening by checking if the newly allocated cache is
      large enough for the fragment; if not, the allocation will fail and
      page_frag_alloc() will return NULL.
      
      Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
      Fixes: b63ae8ca ("mm/net: Rename and move page fragment handling from net/ to mm/")
      Signed-off-by: NMaurizio Lombardi <mlombard@redhat.com>
      Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
      Cc: Chen Lin <chen45464546@163.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      dac22531
    • S
      mm: bring back update_mmu_cache() to finish_fault() · 70427f6e
      Sergei Antonov 提交于
      Running this test program on ARMv4 a few times (sometimes just once)
      reproduces the bug.
      
      int main()
      {
              unsigned i;
              char paragon[SIZE];
              void* ptr;
      
              memset(paragon, 0xAA, SIZE);
              ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_SHARED, -1, 0);
              if (ptr == MAP_FAILED) return 1;
              printf("ptr = %p\n", ptr);
              for (i=0;i<10000;i++){
                      memset(ptr, 0xAA, SIZE);
                      if (memcmp(ptr, paragon, SIZE)) {
                              printf("Unexpected bytes on iteration %u!!!\n", i);
                              break;
                      }
              }
              munmap(ptr, SIZE);
      }
      
      In the "ptr" buffer there appear runs of zero bytes which are aligned
      by 16 and their lengths are multiple of 16.
      
      Linux v5.11 does not have the bug, "git bisect" finds the first bad commit:
      f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      
      Before the commit update_mmu_cache() was called during a call to
      filemap_map_pages() as well as finish_fault(). After the commit
      finish_fault() lacks it.
      
      Bring back update_mmu_cache() to finish_fault() to fix the bug.
      Also call update_mmu_tlb() only when returning VM_FAULT_NOPAGE to more
      closely reproduce the code of alloc_set_pte() function that existed before
      the commit.
      
      On many platforms update_mmu_cache() is nop:
       x86, see arch/x86/include/asm/pgtable
       ARMv6+, see arch/arm/include/asm/tlbflush.h
      So, it seems, few users ran into this bug.
      
      Link: https://lkml.kernel.org/r/20220908204809.2012451-1-saproj@gmail.com
      Fixes: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      Signed-off-by: NSergei Antonov <saproj@gmail.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      70427f6e
    • C
      frontswap: don't call ->init if no ops are registered · 37dcc673
      Christoph Hellwig 提交于
      If no frontswap module (i.e.  zswap) was registered, frontswap_ops will be
      NULL.  In such situation, swapon crashes with the following stack trace:
      
        Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000
        Mem abort info:
          ESR = 0x0000000096000004
          EC = 0x25: DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
          FSC = 0x04: level 0 translation fault
        Data abort info:
          ISV = 0, ISS = 0x00000004
          CM = 0, WnR = 0
        user pgtable: 4k pages, 48-bit VAs, pgdp=00000020a4fab000
        [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
        Internal error: Oops: 96000004 [#1] SMP
        Modules linked in: zram fsl_dpaa2_eth pcs_lynx phylink ahci_qoriq crct10dif_ce ghash_ce sbsa_gwdt fsl_mc_dpio nvme lm90 nvme_core at803x xhci_plat_hcd rtc_fsl_ftm_alarm xgmac_mdio ahci_platform i2c_imx ip6_tables ip_tables fuse
        Unloaded tainted modules: cppc_cpufreq():1
        CPU: 10 PID: 761 Comm: swapon Not tainted 6.0.0-rc2-00454-g22100432cf14 #1
        Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jun 21 2022
        pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
        pc : frontswap_init+0x38/0x60
        lr : __do_sys_swapon+0x8a8/0x9f4
        sp : ffff80000969bcf0
        x29: ffff80000969bcf0 x28: ffff37bee0d8fc00 x27: ffff80000a7f5000
        x26: fffffcdefb971e80 x25: ffffaba797453b90 x24: 0000000000000064
        x23: ffff37c1f209d1a8 x22: ffff37bee880e000 x21: ffffaba797748560
        x20: ffff37bee0d8fce4 x19: ffffaba797748488 x18: 0000000000000014
        x17: 0000000030ec029a x16: ffffaba795a479b0 x15: 0000000000000000
        x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000001
        x11: ffff37c63c0aba18 x10: 0000000000000000 x9 : ffffaba7956b8c88
        x8 : ffff80000969bcd0 x7 : 0000000000000000 x6 : 0000000000000000
        x5 : 0000000000000001 x4 : 0000000000000000 x3 : ffffaba79730f000
        x2 : ffff37bee0d8fc00 x1 : 0000000000000000 x0 : 0000000000000000
        Call trace:
        frontswap_init+0x38/0x60
        __do_sys_swapon+0x8a8/0x9f4
        __arm64_sys_swapon+0x28/0x3c
        invoke_syscall+0x78/0x100
        el0_svc_common.constprop.0+0xd4/0xf4
        do_el0_svc+0x38/0x4c
        el0_svc+0x34/0x10c
        el0t_64_sync_handler+0x11c/0x150
        el0t_64_sync+0x190/0x194
        Code: d000e283 910003fd f9006c41 f946d461 (f9400021)
        ---[ end trace 0000000000000000 ]---
      
      Link: https://lkml.kernel.org/r/20220909130829.3262926-1-hch@lst.de
      Fixes: 1da0d94a ("frontswap: remove support for multiple ops")
      Reported-by: NNathan Chancellor <nathan@kernel.org>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      37dcc673
    • N
      mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() · 2b7aa91b
      Naoya Horiguchi 提交于
      NULL pointer dereference is triggered when calling thp split via debugfs
      on the system with offlined memory blocks.  With debug option enabled, the
      following kernel messages are printed out:
      
        page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
        flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
        raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
        page dumped because: unmovable page
        page:000000007d7ab72e is uninitialized and poisoned
        page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
        ------------[ cut here ]------------
        kernel BUG at include/linux/mm.h:1248!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
        ...
        RIP: 0010:split_huge_pages_write+0xcf4/0xe30
      
      This shows that page_to_nid() in page_zone() is unexpectedly called for an
      offlined memmap.
      
      Use pfn_to_online_page() to get struct page in PFN walker.
      
      Link: https://lkml.kernel.org/r/20220908041150.3430269-1-naoya.horiguchi@linux.dev
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")      [visible after d0dc12e8]
      Signed-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Co-developed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      2b7aa91b
    • M
      mm: fix madivse_pageout mishandling on non-LRU page · 58d426a7
      Minchan Kim 提交于
      MADV_PAGEOUT tries to isolate non-LRU pages and gets a warning from
      isolate_lru_page below.
      
      Fix it by checking PageLRU in advance.
      
      ------------[ cut here ]------------
      trying to isolate tail page
      WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140
      Modules linked in:
      CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:isolate_lru_page+0x130/0x140
      
      Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantianshuo@iie.ac.cn/
      Link: https://lkml.kernel.org/r/20220908151204.762596-1-minchan@kernel.org
      Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: N韩天ç`• <hantianshuo@iie.ac.cn>
      Suggested-by: NYang Shi <shy828301@gmail.com>
      Acked-by: NYang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      58d426a7
    • Y
      powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush · bedf0341
      Yang Shi 提交于
      The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
      move to use RCU instead of disabling local interrupts in fast-GUP.  Using
      an IPI is the old-styled way of serializing against fast-GUP although it
      still works as expected now.
      
      And fast-GUP now fixed the potential race with THP collapse by checking
      whether PMD is changed or not.  So IPI broadcast in radix pmd collapse
      flush is not necessary anymore.  But it is still needed for hash TLB.
      
      Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.comSuggested-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      bedf0341
    • Y
      mm: gup: fix the fast GUP race against THP collapse · 70cbc3cc
      Yang Shi 提交于
      Since general RCU GUP fast was introduced in commit 2667f50e ("mm:
      introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
      sufficient to handle concurrent GUP-fast in all cases, it only handles
      traditional IPI-based GUP-fast correctly.  On architectures that send an
      IPI broadcast on TLB flush, it works as expected.  But on the
      architectures that do not use IPI to broadcast TLB flush, it may have the
      below race:
      
         CPU A                                          CPU B
      THP collapse                                     fast GUP
                                                    gup_pmd_range() <-- see valid pmd
                                                        gup_pte_range() <-- work on pte
      pmdp_collapse_flush() <-- clear pmd and flush
      __collapse_huge_page_isolate()
          check page pinned <-- before GUP bump refcount
                                                            pin the page
                                                            check PTE <-- no change
      __collapse_huge_page_copy()
          copy data to huge page
          ptep_clear()
      install huge pmd for the huge page
                                                            return the stale page
      discard the stale page
      
      The race can be fixed by checking whether PMD is changed or not after
      taking the page pin in fast GUP, just like what it does for PTE.  If the
      PMD is changed it means there may be parallel THP collapse, so GUP should
      back off.
      
      Also update the stale comment about serializing against fast GUP in
      khugepaged.
      
      Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
      Fixes: 2667f50e ("mm: introduce a general RCU get_user_pages_fast()")
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      70cbc3cc
  2. 12 9月, 2022 1 次提交
    • D
      mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      David Hildenbrand 提交于
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, that the PTE is
      cleared/invalidated and that the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine, however, we have to enforce a certain memory order and
      are missing memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
      convert an exclusive anonymous page to a KSM page, and that page is already
      mapped write-protected (-> no PTE change) would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      088b8aa5