1. 06 April 2023, 1 commit
    • mm/swap: fix swap_info_struct race between swapoff and get_swap_pages() · 6fe7d6b9
      Committed by Rongwei Wang
      The si->lock must be held when deleting the si from the available list. 
      Otherwise, another thread can re-add the si to the available list, which
      can lead to memory corruption.  The only place we have found where this
      happens is in the swapoff path.  This case can be described as below:
      
      core 0                       core 1
      swapoff
      
      del_from_avail_list(si)      waiting
      
      try lock si->lock            acquire swap_avail_lock
                                   and re-add si into
                                   swap_avail_head
      
      acquire si->lock, but miss that the si has already been re-added, and
      continue to clear SWP_WRITEOK, etc.
      
      A massive number of warning messages can easily be triggered inside
      get_swap_pages() in some special cases, for example, by calling
      madvise(MADV_PAGEOUT) concurrently on blocks of touched memory while
      running many swapon-swapoff operations (e.g.  stress-ng-swap).
      
      In the worst case, however, the scenario above can cause a panic.  In
      swapoff(), the memory used by the si is kept in swap_info[] after the
      swap is turned off, so the corruption is not triggered immediately; it
      only surfaces once that entry is allocated and reset for a new swap in
      the swapon path.  A resulting panic message (with CONFIG_PLIST_DEBUG
      enabled):
      
      ------------[ cut here ]------------
      top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
      prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
      next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
      WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
      Modules linked in: rfkill(E) crct10dif_ce(E)...
      CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
      Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
      pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
      pc : plist_check_prev_next_node+0x50/0x70
      lr : plist_check_prev_next_node+0x50/0x70
      sp : ffff0018009d3c30
      x29: ffff0018009d3c40 x28: ffff800011b32a98
      x27: 0000000000000000 x26: ffff001803908000
      x25: ffff8000128ea088 x24: ffff800011b32a48
      x23: 0000000000000028 x22: ffff001800875c00
      x21: ffff800010f9e520 x20: ffff001800875c00
      x19: ffff001800fdc6e0 x18: 0000000000000030
      x17: 0000000000000000 x16: 0000000000000000
      x15: 0736076307640766 x14: 0730073007380731
      x13: 0736076307640766 x12: 0730073007380731
      x11: 000000000004058d x10: 0000000085a85b76
      x9 : ffff8000101436e4 x8 : ffff800011c8ce08
      x7 : 0000000000000000 x6 : 0000000000000001
      x5 : ffff0017df9ed338 x4 : 0000000000000001
      x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
      x1 : 0000000000000000 x0 : 0000000000000000
      Call trace:
       plist_check_prev_next_node+0x50/0x70
       plist_check_head+0x80/0xf0
       plist_add+0x28/0x140
       add_to_avail_list+0x9c/0xf0
       _enable_swap_info+0x78/0xb4
       __do_sys_swapon+0x918/0xa10
       __arm64_sys_swapon+0x20/0x30
       el0_svc_common+0x8c/0x220
       do_el0_svc+0x2c/0x90
       el0_svc+0x1c/0x30
       el0_sync_handler+0xa8/0xb0
       el0_sync+0x148/0x180
      irq event stamp: 2082270
      
      Now si->lock is taken before calling del_from_avail_list(), to make sure
      other threads see that the si has been deleted and SWP_WRITEOK cleared
      together, and do not reinsert it.
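
      A simplified sketch of the ordering change in the swapoff path, following
      the description above (names are from mm/swapfile.c; this illustrates the
      idea rather than the literal diff):

          /* Before: window between removing si and taking its lock. */
          del_from_avail_list(p);        /* only swap_avail_lock held here   */
          spin_lock(&p->lock);           /* core 1 may have re-added p above */
          p->flags &= ~SWP_WRITEOK;

          /* After: removal and SWP_WRITEOK clearing happen under p->lock,
           * so a concurrent get_swap_pages() cannot observe one without
           * the other.
           */
          spin_lock(&p->lock);
          del_from_avail_list(p);        /* now called with p->lock held */
          p->flags &= ~SWP_WRITEOK;
          spin_unlock(&p->lock);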
      
      This problem exists in versions after stable 5.10.y.
      
      Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.com
      Fixes: a2468cc9 ("swap: choose swap device according to numa node") 
      Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6fe7d6b9
  2. 10 February 2023, 2 commits
  3. 03 February 2023, 2 commits
  4. 01 February 2023, 1 commit
  5. 19 January 2023, 1 commit
  6. 01 December 2022, 1 commit
  7. 23 November 2022, 1 commit
    • swapfile: fix soft lockup in scan_swap_map_slots · de1ccfb6
      Committed by Chen Wandun
      A soft lockup occurs while scanning for a free swap slot under huge
      memory pressure.  The test scenario is: 64 CPU cores, 64GB memory, and
      28 zram devices, each with a disksize of 50MB.
      
      LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(),
      but the real number of loop iterations can exceed LATENCY_LIMIT, because
      "goto checks" and "goto scan" are taken repeatedly without decreasing
      the latency limit.
      
      To fix it, decrease latency_ration in advance.
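
      A sketch of the pattern this change applies at the places in
      scan_swap_map_slots() that jump back via "goto checks"/"goto scan"
      (simplified; latency_ration and LATENCY_LIMIT are the existing local
      variable and constant):

          /* Charge the extra pass before jumping back, so the scan still
           * reaches cond_resched() at least every LATENCY_LIMIT iterations.
           */
          if (unlikely(--latency_ration < 0)) {
                  cond_resched();
                  latency_ration = LATENCY_LIMIT;
          }
          goto checks;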
      
      There is also a suspicious place in get_swap_pages() that may cause
      softlockups.  In this function, the "goto start_over" may result in
      continuous scanning of the swap partition.  If there is no
      cond_resched() in scan_swap_map_slots(), it would cause a softlockup
      (I am not sure about this).
      
      WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
      CPU: 11 PID: 466 Comm: kswapd0 Kdump: loaded Tainted: G
      dump_backtrace+0x0/0x1e4
      show_stack+0x20/0x2c
      dump_stack+0xd8/0x140
      watchdog_print_info+0x48/0x54
      watchdog_process_before_softlockup+0x98/0xa0
      watchdog_timer_fn+0x1ac/0x2d0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      scan_swap_map_slots+0x678/0x890
      get_swap_pages+0x29c/0x440
      get_swap_page+0x120/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      shrink_inactive_list+0x16c/0x500
      shrink_lruvec+0x270/0x304
      
      WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
      watchdog_timer_fn+0x1ac/0x2d0
      __run_hrtimer+0x98/0x2a0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      get_swap_pages+0x1e8/0x440
      get_swap_page+0x1c8/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      reclaim_pages+0x160/0x310
      madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
      walk_pmd_range.isra.0+0xac/0x22c
      walk_pud_range+0xfc/0x1c0
      walk_pgd_range+0x158/0x1b0
      __walk_page_range+0x64/0x100
      walk_page_range+0x104/0x150
      
      Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
      Fixes: 048c27fd ("[PATCH] swap: scan_swap_map latency breaks")
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: <xialonglong1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      de1ccfb6
  8. 18 November 2022, 2 commits
  9. 04 October 2022, 7 commits
  10. 27 September 2022, 4 commits
    • mm/swapfile: use vma iterator instead of vma linked list · 208c09db
      Committed by Liam R. Howlett
      unuse_mm() no longer needs to reference the linked list.
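
      A sketch of the converted loop (simplified from what unuse_mm() looks
      like after this change; error handling trimmed):

          static int unuse_mm(struct mm_struct *mm, unsigned int type)
          {
                  struct vm_area_struct *vma;
                  int ret = 0;
                  VMA_ITERATOR(vmi, mm, 0);

                  mmap_read_lock(mm);
                  for_each_vma(vmi, vma) {   /* replaces the vma->vm_next walk */
                          if (vma->anon_vma) {
                                  ret = unuse_vma(vma, type);
                                  if (ret)
                                          break;
                          }
                          cond_resched();
                  }
                  mmap_read_unlock(mm);
                  return ret;
          }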
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-64-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      208c09db
    • mm/swap: cache swap migration A/D bits support · 5154e607
      Committed by Peter Xu
      Introduce a variable swap_migration_ad_supported to cache whether the arch
      supports swap migration A/D bits.
      
      Here one thing to mention is that SWP_MIG_TOTAL_BITS will internally
      reference the other macro MAX_PHYSMEM_BITS, which is a function call on
      x86 (constant on all the rest of archs).
      
      It's safe to reference it in swapfile_init() because by the time we
      reach it we are already at initcall level 4, so the 5-level pgtable for
      x86_64 must have been initialized (right after early_identify_cpu()
      finishes).
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
            - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      This should slightly speed up the handling of migration swap entries.
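
      A condensed sketch of the caching (based on the description above; the
      exact condition lives in mm/swapfile.c and is evaluated only once, at
      initcall time):

          #ifdef CONFIG_MIGRATION
          bool swap_migration_ad_supported;
          #endif

          static int __init swapfile_init(void)
          {
                  /* ... existing init ... */
          #ifdef CONFIG_MIGRATION
                  if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
                          swap_migration_ad_supported = true;
          #endif
                  return 0;
          }
          subsys_initcall(swapfile_init);    /* initcall level 4 */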
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5154e607
    • mm/swap: cache maximum swapfile size when init swap · be45a490
      Committed by Peter Xu
      We used to have max_swapfile_size() fetching the per-arch maximum size
      of a swapfile.
      
      As the number of callers of max_swapfile_size() grows, this patch
      introduces a variable "swapfile_maximum_size" and caches the value of
      the old max_swapfile_size(), so that we don't need to calculate the
      value every time.
      
      Caching the value in swapfile_init() is safe because by the time we
      reach that phase all the relevant information has been initialized.
      The major arch to take care of here is x86, which defines the max
      swapfile size based on the L1TF mitigation.
      
      Both X86_BUG_L1TF and l1tf_mitigation should have been set up properly
      by the time swapfile_init() runs.  As a reference, the code path looks
      like this for x86:
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - early_identify_cpu --> setup X86_BUG_L1TF
        - parse_early_param
          - l1tf_cmdline --> set l1tf_mitigation
        - check_bugs
          - l1tf_select_mitigation --> set l1tf_mitigation
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      The swapfile size only depends on swp pte format on non-x86 archs, so
      caching it is safe too.
      
      While at it, rename max_swapfile_size() to arch_max_swapfile_size(),
      because an arch can define its own version, so it's more
      straightforward to have "arch_" as its prefix.  In the meantime, export
      swapfile_maximum_size to replace the old usages of max_swapfile_size().
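
      A condensed sketch of the result (the weak generic helper mirrors the
      old generic max_swapfile_size(); x86 overrides it based on the L1TF
      mitigation):

          unsigned long swapfile_maximum_size;

          unsigned long __weak arch_max_swapfile_size(void)
          {
                  return swp_offset(pte_to_swp_entry(
                                  swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
          }

          static int __init swapfile_init(void)
          {
                  swapfile_maximum_size = arch_max_swapfile_size();
                  /* ... */
                  return 0;
          }
          subsys_initcall(swapfile_init);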
      
      [peterx@redhat.com: declare arch_max_swapfile_size() in swapfile.h]
        Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local
      Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be45a490
    • blk-cgroup: pass a gendisk to blkcg_schedule_throttle · de185b56
      Committed by Christoph Hellwig
      Pass the gendisk to blkcg_schedule_throttle as part of moving the
      blk-cgroup infrastructure to be gendisk based.  Remove the unused
      !BLK_CGROUP stub while we're at it.
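
      For reference, the interface change looks roughly like this (a sketch;
      see block/blk-cgroup.h for the real prototype):

          /* before */ void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay);
          /* after  */ void blkcg_schedule_throttle(struct gendisk *disk, bool use_memdelay);

          /* callers now pass the disk rather than the queue, e.g.: */
          blkcg_schedule_throttle(bdev->bd_disk, true);
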
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20220921180501.1539876-17-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      de185b56
  11. 04 July 2022, 3 commits
  12. 28 May 2022, 2 commits
  13. 20 May 2022, 7 commits
  14. 13 May 2022, 1 commit
  15. 10 May 2022, 5 commits
    • mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space · e1209d3a
      Committed by NeilBrown
      Swap currently uses ->readpage to read swap pages.  This can only
      request one page at a time from the filesystem, which is not the most
      efficient.
      
      Swap uses ->direct_IO for writes; while this is adequate, it is an
      inappropriate overloading.  ->direct_IO may need to handle allocating
      space for holes or other details that are not relevant for swap.
      
      So this patch introduces a new address_space operation: ->swap_rw.  In
      this patch it is used for reads, and a subsequent patch will switch writes
      to use it.
      
      No filesystem yet supports ->swap_rw, but that is not a problem because
      no filesystem actually works with filesystem-based swap.
      Only two filesystems set SWP_FS_OPS:
      - cifs sets the flag, but ->direct_IO always fails so swap cannot work.
      - nfs sets the flag, but ->direct_IO calls generic_write_checks()
        which has failed on swap files for several releases.
      
      To ensure that a NULL ->swap_rw isn't called, ->swap_activate() for both
      NFS and cifs is changed to fail if ->swap_rw is not set.  This check can
      be removed if/when the function is added.
      
      Future patches will restore swap-over-NFS functionality.
      
      To submit an async read with ->swap_rw() we need to allocate a structure
      to hold the kiocb and other details.  swap_readpage() cannot handle
      transient failure, so we create a mempool to provide the structures.
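
      A sketch of the new op and of how a read can be driven through it (the
      address_space_operations member is the one added by this patch; the
      surrounding setup is simplified):

          /* New address_space op (include/linux/fs.h): */
          int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);

          /* Simplified read submission: point a bvec iterator at the page
           * and hand a kiocb to the filesystem instead of using ->readpage.
           */
          struct bio_vec bv = { .bv_page = page, .bv_len = PAGE_SIZE, .bv_offset = 0 };
          struct iov_iter from;
          struct kiocb kiocb;

          init_sync_kiocb(&kiocb, swap_file);
          kiocb.ki_pos = page_file_offset(page);
          iov_iter_bvec(&from, READ, &bv, 1, PAGE_SIZE);
          swap_file->f_mapping->a_ops->swap_rw(&kiocb, &from);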
      
      Link: https://lkml.kernel.org/r/164859778125.29473.13430559328221330589.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e1209d3a
    • mm: move responsibility for setting SWP_FS_OPS to ->swap_activate · 4b60c0ff
      Committed by NeilBrown
      If a filesystem wishes to handle all swap IO itself (via ->direct_IO and
      ->readpage), rather than just providing devices addresses for
      submit_bio(), SWP_FS_OPS must be set.
      
      Currently the protocol for setting this is to have ->swap_activate
      return zero.  In that case SWP_FS_OPS is set, and add_swap_extent() is
      called for the entire file.
      
      This is a little clumsy as different return values for ->swap_activate
      have quite different meanings, and it makes it hard to search for which
      filesystems require SWP_FS_OPS to be set.
      
      So remove the special meaning of a zero return, and require the filesystem
      to set SWP_FS_OPS if it so desires, and to always call add_swap_extent()
      as required.
      
      Currently only NFS and CIFS return zero from ->swap_activate().
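
      A sketch of what a filesystem's ->swap_activate is expected to do under
      the new protocol (hypothetical "myfs"; real implementations also
      validate that the file has no holes, etc.):

          static int myfs_swap_activate(struct swap_info_struct *sis,
                                        struct file *file, sector_t *span)
          {
                  /* Opt in to routing all swap IO through ->swap_rw. */
                  sis->flags |= SWP_FS_OPS;

                  /* Add the extent(s) explicitly; no longer implied by
                   * returning zero.
                   */
                  *span = sis->pages;
                  return add_swap_extent(sis, 0, sis->max, 0);
          }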
      
      Link: https://lkml.kernel.org/r/164859778123.29473.17908205846599043598.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b60c0ff
    • mm: create new mm/swap.h header file · 014bb1de
      Committed by NeilBrown
      Patch series "MM changes to improve swap-over-NFS support".
      
      Assorted improvements for swap-via-filesystem.
      
      This is a resend of these patches, rebased on the current HEAD.  The
      only substantial change is that swap_dirty_folio has replaced
      swap_set_page_dirty.
      
      Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
      has previously worked for NFS but that broke a few releases back.  This
      series changes to use a new ->swap_rw rather than ->readpage and
      ->direct_IO.  It also makes other improvements.
      
      There is a companion series already in linux-next which fixes various
      issues with NFS.  Once both series land, a final patch is needed which
      changes NFS over to use ->swap_rw.
      
      
      This patch (of 10):
      
      Many functions declared in include/linux/swap.h are only used within mm/.
      
      Create a new "mm/swap.h" and move some of these declarations there.
      Remove the redundant 'extern' from the function declarations.
      
      [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
      Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
      Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      014bb1de
    • mm/swap: remember PG_anon_exclusive via a swp pte bit · 1493a191
      Committed by David Hildenbrand
      Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.
      
      This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE
      | FOLL_GET) was taken on an anonymous page and COW logic failed to detect
      exclusivity of the page, replacing the anonymous page by a copy in the
      page table: the GUP reference lost synchronicity with the pages mapped
      into the page tables.  This series focuses on x86, arm64, s390x and
      ppc64/book3s -- other architectures are fairly easy to support by
      implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
      
      This primarily fixes the O_DIRECT memory corruptions that can happen on
      concurrent swapout, whereby we lose DMA reads to a page (modifying the
      user page by writing to it).
      
      O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
      from/to a user page.  In the long run, we want to convert it to properly
      use FOLL_PIN, and John is working on it, but that might take a while and
      might not be easy to backport.  In the meantime, let's restore what used
      to work before we started modifying our COW logic: make R/W FOLL_GET
      references reliable as long as there is no fork() after GUP involved.
      
      This is just the natural follow-up of part 2, that will also further
      reduce "wrong COW" on the swapin path, for example, when we cannot remove
      a page from the swapcache due to concurrent writeback, or if we have two
      threads faulting on the same swapped-out page.  Fixing O_DIRECT is just a
      nice side-product.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 2):
      "
        2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)
      
        It was discovered that we can create a memory corruption by reading a
        file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
        concurrently writing to an unrelated part (e.g., last byte) of the same
        page, and concurrently write-protecting the page via clear_refs
        SOFTDIRTY tracking [6].
      
        For the reproducer, the issue is that O_DIRECT grabs a reference of the
        target page (via FOLL_GET) and clear_refs write-protects the relevant
        page table entry. On successive write access to the page from the
        process itself, we wrongly COW the page when resolving the write fault,
        resulting in a loss of synchronicity and consequently a memory corruption.
      
        While some people might think that using clear_refs in this combination
        is a corner cases, it turns out to be a more generic problem unfortunately.
      
        For example, it was just recently discovered that we can similarly
        create a memory corruption without clear_refs, simply by concurrently
        swapping out the buffer pages [7]. Note that we nowadays even use the
        swap infrastructure in Linux without an actual swap disk/partition: the
        prime example is zram which is enabled as default under Fedora [10].
      
        The root issue is that a write-fault on a page that has additional
        references results in a COW and thereby a loss of synchronicity
        and consequently a memory corruption if two parties believe they are
        referencing the same page.
      "
      
      We don't particularly care about R/O FOLL_GET references: they were never
      reliable and O_DIRECT doesn't expect to observe modifications from a page
      after DMA was started.
      
      Note that:
      * this only fixes the issue on x86, arm64, s390x and ppc64/book3s
        ("enterprise architectures"). Other architectures have to implement
        __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
      * this does *not* consider any kind of fork() after taking the reference:
        fork() after GUP never worked reliably with FOLL_GET.
      * Not losing PG_anon_exclusive during swapout was the last remaining
        piece. KSM already makes sure that there are no other references on
        a page before considering it for sharing. Page migration maintains
        PG_anon_exclusive and simply fails when there are additional references
        (freezing the refcount fails). Only swapout code dropped the
        PG_anon_exclusive flag because it requires more work to remember +
        restore it.
      
      With this series in place, most COW issues of [3] are fixed on said
      architectures. Other architectures can implement
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
      
      [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      
      This patch (of 8):
      
      Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
      it.  We do this, to keep fork() logic on swap entries easy and efficient:
      for example, if we wouldn't clear it when unmapping, we'd have to lookup
      the page in the swapcache for each and every swap entry during fork() and
      clear PG_anon_exclusive if set.
      
      Instead, we want to store that information directly in the swap pte,
      protected by the page table lock, similarly to how we handle
      SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
      swap entries, we don't want to mess with the swap type (e.g., still one
      bit) because it overcomplicates swap code.
      
      In try_to_unmap(), we already refuse to unmap if the page might be
      pinned, because we must never lose PG_anon_exclusive on pinned pages.
      Reliably checking for other unexpected references *before* completely
      unmapping a page is unfortunately not really possible: THPs heavily
      overcomplicate the situation.  Once fully unmapped it's easier -- we,
      for example, make sure that there are no unexpected references *after*
      unmapping a page before starting writeback on that page.
      
      So, we currently might end up unmapping a page and clearing
      PG_anon_exclusive if that page has additional references, for example, due
      to a FOLL_GET.
      
      do_swap_page() has to re-determine if a page is exclusive, which will
      easily fail if there are other references on a page, most prominently GUP
      references via FOLL_GET.  This can currently result in memory corruptions
      when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
      is never involved: try_to_unmap() will succeed, and when refaulting the
      page, it cannot be marked exclusive and will get replaced by a copy in the
      page tables on the next write access, resulting in writes via the GUP
      reference to the page being lost.
      
      In an ideal world, everybody that uses GUP and wants to modify page
      content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
      conversion will take a while.  It's easier to fix what used to work in the
      past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive.  In addition,
      by remembering PG_anon_exclusive we can further reduce unnecessary COW in
      some cases, so it's the natural thing to do.
      
      So let's transfer the PG_anon_exclusive information to the swap pte and
      store it via an architecture-dependent pte bit; use that information when
      restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
      we simply have to clear the pte bit and are done.
      
      Of course, there is one corner case to handle: swap backends that don't
      support concurrent page modifications while the page is under writeback. 
      Special case these, and drop the exclusive marker.  Add a comment why that
      is just fine (also, reuse_swap_page() would have done the same in the
      past).
      
      In the future, we'll hopefully have all architectures support
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
      and the define completely.  Then, we can also convert
      SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
      support: either simply use a yet unused pte bit that can be used for swap
      entries, steal one from the arch type bits if they exceed 5, or steal one
      from the offset bits.
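
      A sketch of the per-architecture side (modelled on the x86 pattern; the
      bit chosen and the helpers pte_set_flags()/pte_clear_flags() are
      x86-specific):

          #define __HAVE_ARCH_PTE_SWP_EXCLUSIVE

          static inline pte_t pte_swp_mkexclusive(pte_t pte)
          {
                  return pte_set_flags(pte, _PAGE_SWP_EXCLUSIVE);
          }

          static inline int pte_swp_exclusive(pte_t pte)
          {
                  return pte_flags(pte) & _PAGE_SWP_EXCLUSIVE;
          }

          static inline pte_t pte_swp_clear_exclusive(pte_t pte)
          {
                  return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
          }

      Generic code then marks the swap pte with pte_swp_mkexclusive() when
      unmapping an exclusive page, and checks pte_swp_exclusive() in
      do_swap_page()/unuse_pte() to restore PG_anon_exclusive.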
      
      Note: R/O FOLL_GET references were never really reliable, especially when
      taking one on a shared page and then writing to the page (e.g., GUP after
      fork()).  FOLL_GET, including R/W references, were never really reliable
      once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
      steps back in case it stumbles over unexpected references and is,
      therefore, fine.
      
      [david@redhat.com: fix SWP_STABLE_WRITES test]
        Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1493a191
    • mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages · 78fbe906
      Committed by David Hildenbrand
      The basic question we would like to have a reliable and efficient answer
      to is: is this anonymous page exclusive to a single process or might it be
      shared?  We need that information for ordinary/single pages, hugetlb
      pages, and possibly each subpage of a THP.
      
      Introduce a way to mark an anonymous page as exclusive, with the ultimate
      goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
      lose consistency with the pages mapped into the page table, resulting in
      reported memory corruptions.
      
      Most pageflags already have semantics for anonymous pages, however,
      PG_mappedtodisk should never apply to pages in the swapcache, so let's
      reuse that flag.
      
      As PG_has_hwpoisoned also uses that flag on the second tail page of a
      compound page, convert it to PG_error instead, which is marked as
      PF_NO_TAIL, so never used for tail pages.
      
      Use custom page flag modification functions such that we can do additional
      sanity checks.  The semantics we'll put into some kernel doc in the future
      are:
      
      "
        PG_anon_exclusive is *usually* only expressive in combination with a
        page table entry. Depending on the page table entry type it might
        store the following information:
      
             Is what's mapped via this page table entry exclusive to the
             single process and can be mapped writable without further
             checks? If not, it might be shared and we might have to COW.
      
        For now, we only expect PTE-mapped THPs to make use of
        PG_anon_exclusive in subpages. For other anonymous compound
        folios (i.e., hugetlb), only the head page is logically mapped and
        holds this information.
      
        For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
        set on the head page. When replacing the PMD by a page table full
        of PTEs, PG_anon_exclusive, if set on the head page, will be set on
        all tail pages accordingly. Note that converting from a PTE-mapping
        to a PMD mapping using the same compound page is currently not
        possible and consequently doesn't require care.
      
        If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
        it should only pin if the relevant PG_anon_exclusive is set. In that
        case, the pin will be fully reliable and stay consistent with the pages
        mapped into the page table, as the bit cannot get cleared (e.g., by
        fork(), KSM) while the page is pinned. For anonymous pages that
        are mapped R/W, PG_anon_exclusive can be assumed to always be set
        because such pages cannot possibly be shared.
      
        The page table lock protecting the page table entry is the primary
        synchronization mechanism for PG_anon_exclusive; GUP-fast that does
        not take the PT lock needs special care when trying to clear the
        flag.
      
        Page table entry types and PG_anon_exclusive:
        * Present: PG_anon_exclusive applies.
        * Swap: the information is lost. PG_anon_exclusive was cleared.
        * Migration: the entry holds this information instead.
                     PG_anon_exclusive was cleared.
        * Device private: PG_anon_exclusive applies.
        * Device exclusive: PG_anon_exclusive applies.
        * HW Poison: PG_anon_exclusive is stale and not changed.
      
        If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
        not allowed and the flag will stick around until the page is freed
        and folio->mapping is cleared.
      "
      
      We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
      zapping) of page table entries; page freeing code will handle that when
      it also invalidates page->mapping so the page no longer indicates
      PageAnon().  Letting the information about exclusivity stick around will
      be an important property when adding sanity checks to unpinning code.
      
      Note that we properly clear the flag in free_pages_prepare() via
      PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
      so there is no need to manually clear the flag.
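
      A condensed sketch of the custom accessors (the real versions in
      include/linux/page-flags.h carry a few more sanity checks and the page
      flag policy macros):

          #define PG_anon_exclusive  PG_mappedtodisk   /* reused; PageAnon() only */

          static __always_inline void SetPageAnonExclusive(struct page *page)
          {
                  VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
                  set_bit(PG_anon_exclusive, &page->flags);
          }

          static __always_inline void ClearPageAnonExclusive(struct page *page)
          {
                  VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
                  clear_bit(PG_anon_exclusive, &page->flags);
          }

          static __always_inline int PageAnonExclusive(struct page *page)
          {
                  VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
                  return test_bit(PG_anon_exclusive, &page->flags);
          }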
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      78fbe906