1. 13 May 2022, 1 commit
    • mm: introduce PTE_MARKER swap entry · 679d1033
      By Peter Xu
      Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8.
      
      
      Overview
      ========
      
      Userfaultfd-wp anonymous support was merged two years ago.  Quite a few
      applications have started to leverage this capability, either to take
      snapshots of user-app memory or to implement fully user-controlled
      swapping.
      
      This series completes the uffd-wp feature so that it covers all the
      RAM-based memory types.  So far uffd-wp is the only mode still missing
      that coverage; the other features (uffd-missing & uffd-minor mode)
      already have it.
      
      One major reason to do so is that anonymous pages sometimes do not
      satisfy the needs of applications, and there is a growing number of
      users of shmem and hugetlbfs, either for sharing purposes (e.g., sharing
      guest memory between a hypervisor process and a device emulation
      process, or shmem-based local live migration for upgrades) or for better
      performance on TLB hits.
      
      All this means that if a uffd-wp app switches to any of these memory
      types, it stops working.  I think it's worthwhile to have the kernel
      cover all these cases.
      
      This series chooses to protect pages at the pte level, not the page level.
      
      One major reason is safety.  I see no way to make it safe if any
      uffd-privileged app could wr-protect a page that any other application
      can use: such an app could block any process, potentially for as long as
      it wants.
      
      The other reason is that it aligns very well not only with the anonymous
      uffd-wp solution, but with uffd as a whole.  For example, userfaultfd is
      fundamentally implemented on top of VMAs: we set flags on VMAs to record
      the status of uffd tracking.  A per-page protection solution would cross
      that VMA-based foundation line, and could simply end up too far from
      what is called userfaultfd.
      
      PTE markers
      ===========
      
      The patchset is based on an idea called PTE markers.  It was discussed
      in one of the mm alignment sessions, proposed starting from v6, and this
      is the 2nd version using the PTE marker idea.
      
      A PTE marker is a new type of swap entry that is only applicable to
      file-backed memory like shmem and hugetlbfs.  It's used to persist some
      pte-level information even after the original present ptes in the
      pgtable have been zapped.
      
      Logically pte markers can store more than uffd-wp information, but so
      far only one bit is used, for the uffd-wp purpose.  When a pte marker is
      installed with the uffd-wp bit set, it means this pte is wr-protected by
      uffd.
      
      It solves the problem that when file-backed memory's present ptes get
      zapped for whatever reason (e.g. thp split, or being swapped out), we
      can still keep the wr-protect information in the ptes.  Then when the
      page fault triggers again, we'll know this pte is wr-protected and can
      treat it the same as a normal uffd wr-protected pte.
      
      The extra information is encoded into the swap entry, or swp_offset to
      be explicit, with the swp_type being PTE_MARKER.  So far uffd-wp only
      uses one bit out of the swap entry; the remaining bits of swp_offset are
      reserved for other purposes.
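      As a rough model of that encoding (this is NOT the kernel's actual
      definitions: the struct layout, the type number, and the helper bodies
      below are all illustrative; only the names PTE_MARKER, swp_offset, the
      uffd-wp bit, and the 58-bit width come from the text), one could sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative swap entry: a type field plus an offset field.  The real
 * kernel packs these arch-dependently; x86_64 currently gives
 * swp_offset() 58 bits, as noted above. */
#define SWP_OFFSET_BITS 58
#define SWP_TYPE_PTE_MARKER 31 /* hypothetical type number */

typedef struct { uint64_t type; uint64_t offset; } swp_entry_t;

/* uffd-wp uses a single marker bit; the rest stay reserved. */
#define PTE_MARKER_UFFD_WP (1ULL << 0)

static swp_entry_t make_pte_marker_entry(uint64_t marker)
{
    /* the marker must fit in the swp_offset() field */
    assert(marker < (1ULL << SWP_OFFSET_BITS));
    return (swp_entry_t){ .type = SWP_TYPE_PTE_MARKER, .offset = marker };
}

static bool is_pte_marker_entry(swp_entry_t e)
{
    return e.type == SWP_TYPE_PTE_MARKER;
}

static uint64_t pte_marker_get(swp_entry_t e)
{
    return e.offset;
}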
      
      There are two config options to enable/disable PTE markers:
      
        CONFIG_PTE_MARKER
        CONFIG_PTE_MARKER_UFFD_WP
      
      We can set !PTE_MARKER to completely disable all PTE markers, along with
      uffd-wp support.  I made two config options so we can also enable PTE
      markers but disable file-backed uffd-wp, for other purposes.  At the end
      of the series I enable CONFIG_PTE_MARKER by default, but that patch is
      standalone; if anyone worries about having it on by default, we can
      consider turning it off by dropping that one-liner patch.  So far I
      don't see a huge risk in doing so, so I kept that patch.
      
      In most cases, PTE markers should be treated as none ptes.  That is
      because, unlike most other swap entry types, no PFN or block offset
      information is encoded into a PTE marker, only some extra well-defined
      bits describing the status of the pte.  These bits should only be used
      as extra data when servicing an upcoming page fault; after that, we
      behave as if it's a none pte.
      
      I did spend a lot of time auditing all the pte_none() users this time.
      It is indeed a challenge because there are a lot of them, and I hope I
      didn't miss a single one that should take care of pte markers.  Luckily,
      I don't think they need to be considered in many cases, for example:
      boot code, arch code (especially non-x86), kernel-only page handling
      (e.g.  CPA), or device driver code dealing with pure PFN mappings.
      
      I introduced pte_none_mostly() in this series for when we need to handle
      pte markers the same as none ptes; the "mostly" is another way to write
      "either a none pte or a pte marker".
      
      I didn't replace pte_none() to cover pte markers, for the reasons below:
      
        - Only in very rare cases will pte_none() callers handle pte markers.
          E.g., all the kernel pages require no knowledge of pte markers, so
          we don't pollute the major use cases.
      
        - Unconditionally changing pte_none() semantics could confuse people,
          because pte_none() has existed for such a long time.
      
        - Unconditionally changing pte_none() semantics could make pte_none()
          slower, even in the many cases where pte markers do not exist.
      
        - There are cases where we'd like to handle pte markers differently
          from pte_none(), so a full replacement is impossible anyway.  E.g.,
          khugepaged should still treat pte markers as normal swap ptes rather
          than none ptes, because pte markers always need a fault-in to merge
          the marker with a valid pte.  And the smaps code needs to parse PTE
          markers, not none ptes.
      
      Patch Layout
      ============
      
      Introducing PTE marker and uffd-wp bit in PTE marker:
      
        mm: Introduce PTE_MARKER swap entry
        mm: Teach core mm about pte markers
        mm: Check against orig_pte for finish_fault()
        mm/uffd: PTE_MARKER_UFFD_WP
      
      Adding support for shmem uffd-wp:
      
        mm/shmem: Take care of UFFDIO_COPY_MODE_WP
        mm/shmem: Handle uffd-wp special pte in page fault handler
        mm/shmem: Persist uffd-wp bit across zapping for file-backed
        mm/shmem: Allow uffd wr-protect none pte for file-backed mem
        mm/shmem: Allows file-back mem to be uffd wr-protected on thps
        mm/shmem: Handle uffd-wp during fork()
      
      Adding support for hugetlbfs uffd-wp:
      
        mm/hugetlb: Introduce huge pte version of uffd-wp helpers
        mm/hugetlb: Hook page faults for uffd write protection
        mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
        mm/hugetlb: Handle UFFDIO_WRITEPROTECT
        mm/hugetlb: Handle pte markers in page faults
        mm/hugetlb: Allow uffd wr-protect none ptes
        mm/hugetlb: Only drop uffd-wp special pte if required
        mm/hugetlb: Handle uffd-wp during fork()
      
      Misc handling in the rest of mm for uffd-wp file-backed:
      
        mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
        mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
      
      Enabling of uffd-wp on file-backed memory:
      
        mm/uffd: Enable write protection for shmem & hugetlbfs
        mm: Enable PTE markers by default
        selftests/uffd: Enable uffd-wp for shmem/hugetlbfs
      
      Tests
      =====
      
      - Compile test on x86_64 and aarch64 on different configs
      - Kernel selftests
      - uffd-test [0]
      - Umapsort [1,2] test for shmem/hugetlb, with swap on/off
      
      [0] https://github.com/xzpeter/clibs/tree/master/uffd-test
      [1] https://github.com/xzpeter/umap-apps/tree/peter
      [2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs
      
      
      This patch (of 23):
      
      Introduces a new swap entry type called PTE_MARKER.  It can be installed
      for any pte that maps file-backed memory when the pte is temporarily
      zapped, so as to maintain per-pte information.
      
      The information kept in the pte is called a "marker".  Here we define
      the marker as "unsigned long" just to match pgoff_t; however, it only
      works as long as it fits in swp_offset(), which is currently e.g. 58
      bits on x86_64.
      
      A new config option, CONFIG_PTE_MARKER, is introduced too; it's off by
      default.  A bunch of helpers are defined along with it to service the
      rest of the pte marker code.
      
      [peterx@redhat.com: fixup]
        Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local
      Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 10 May 2022, 4 commits
    • mm: move responsibility for setting SWP_FS_OPS to ->swap_activate · 4b60c0ff
      By NeilBrown
      If a filesystem wishes to handle all swap IO itself (via ->direct_IO and
      ->readpage), rather than just providing device addresses for
      submit_bio(), SWP_FS_OPS must be set.
      
      Currently the protocol for setting this is to have ->swap_activate
      return zero.  In that case SWP_FS_OPS is set, and add_swap_extent() is
      called for the entire file.
      
      This is a little clumsy as different return values for ->swap_activate
      have quite different meanings, and it makes it hard to search for which
      filesystems require SWP_FS_OPS to be set.
      
      So remove the special meaning of a zero return, and require the filesystem
      to set SWP_FS_OPS if it so desires, and to always call add_swap_extent()
      as required.
      
      Currently only NFS and CIFS return zero from ->swap_activate.
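      The new contract can be sketched with a toy model (the struct layout,
      the flag value, and the function bodies below are all illustrative;
      only the names SWP_FS_OPS, ->swap_activate and add_swap_extent() come
      from the text above):

```c
#include <assert.h>

#define SWP_FS_OPS (1 << 0) /* illustrative flag value */

struct swap_info { unsigned flags; unsigned long nr_extents; };

/* Stand-in for add_swap_extent(): record one extent covering the file. */
static void add_swap_extent(struct swap_info *sis, unsigned long pages)
{
    (void)pages;
    sis->nr_extents += 1;
}

/* Old protocol: returning 0 made the core set SWP_FS_OPS and add the
 * extent itself.  New protocol: the filesystem does both explicitly. */
static int fs_swap_activate(struct swap_info *sis, unsigned long pages)
{
    sis->flags |= SWP_FS_OPS;    /* "we handle swap IO ourselves" */
    add_swap_extent(sis, pages); /* map the whole file */
    return 0;
}
```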
      
      Link: https://lkml.kernel.org/r/164859778123.29473.17908205846599043598.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: drop swap_dirty_folio · 4c4a7634
      By NeilBrown
      Folios that are written to swap are owned by the MM subsystem, not any
      filesystem.
      
      When such a folio is passed to a filesystem to be written out to a
      swap-file, the filesystem handles the data, but the folio itself does
      not belong to the filesystem.  So calling the filesystem's
      ->dirty_folio() address_space operation makes no sense: that operation
      is for folios in the given address space, and a folio to be written to
      swap does not exist in the given address space.
      
      So drop swap_dirty_folio() which calls the address-space's
      ->dirty_folio(), and always use noop_dirty_folio(), which is appropriate
      for folios being swapped out.
      
      Link: https://lkml.kernel.org/r/164859778123.29473.6900942583784889976.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: create new mm/swap.h header file · 014bb1de
      By NeilBrown
      Patch series "MM changes to improve swap-over-NFS support".
      
      Assorted improvements for swap-via-filesystem.
      
      This is a resend of these patches, rebased on current HEAD.  The only
      substantial change is that swap_dirty_folio has replaced
      swap_set_page_dirty.
      
      Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem.  It
      has previously worked for NFS but that broke a few releases back.  This
      series changes to use a new ->swap_rw rather than ->readpage and
      ->direct_IO.  It also makes other improvements.
      
      There is a companion series already in linux-next which fixes various
      issues with NFS.  Once both series land, a final patch is needed which
      changes NFS over to use ->swap_rw.
      
      
      This patch (of 10):
      
      Many functions declared in include/linux/swap.h are only used within
      mm/.  Create a new "mm/swap.h" and move some of these declarations
      there.  Remove the redundant 'extern' from the function declarations.
      
      [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
      Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
      Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remember exclusively mapped anonymous pages with PG_anon_exclusive · 6c287605
      By David Hildenbrand
      Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
      exclusive, and use that information to make GUP pins reliable and stay
      consistent with the page mapped into the page table even if the page table
      entry gets write-protected.
      
      With that information at hand, we can extend our COW logic to always reuse
      anonymous pages that are exclusive.  For anonymous pages that might be
      shared, the existing logic applies.
      
      As already documented, PG_anon_exclusive is usually only expressive in
      combination with a page table entry.  Especially PTE- vs. PMD-mapped
      anonymous pages require more thought; some examples: due to mremap() we
      can easily have a single compound page PTE-mapped into multiple page
      tables exclusively in a single process -- multiple page table locks
      apply.  Further, due to MADV_WIPEONFORK we might not necessarily
      write-protect all PTEs, and only some subpages might be pinned.  Long
      story short: once PTE-mapped, we have to track the exclusivity
      information per subpage, but until then we can just track it for the
      compound page in the head page, without having to update a whole bunch
      of subpages all the time for a simple PMD mapping of a THP.
      
      For simplicity, this commit mostly talks about "anonymous pages", while
      it's for THP actually "the part of an anonymous folio referenced via a
      page table entry".
      
      To avoid spilling PG_anon_exclusive code all over the mm code-base, we
      let the anon rmap code handle all the PG_anon_exclusive logic it can
      easily handle.
      
      If a writable, present page table entry points at an anonymous
      (sub)page, that (sub)page must be PG_anon_exclusive.  If GUP wants to
      take a reliable pin (FOLL_PIN) on an anonymous page referenced via a
      present page table entry, it must only pin if PG_anon_exclusive is set
      for the mapped (sub)page.
      
      This commit doesn't adjust GUP, so this is only implicitly handled for
      FOLL_WRITE, follow-up commits will teach GUP to also respect it for
      FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
      reliable.
      
      Whenever an anonymous page is to be shared (fork(), KSM), or when
      temporarily unmapping an anonymous page (swap, migration), the relevant
      PG_anon_exclusive bit has to be cleared to mark the anonymous page
      possibly shared.  Clearing will fail if there are GUP pins on the page:
      
      * For fork(), this means having to copy the page and not being able to
        share it.  fork() protects against concurrent GUP using the PT lock and
        the src_mm->write_protect_seq.
      
      * For KSM, this means sharing will fail.  For swap, this means
        unmapping will fail.  For migration, this means migration will fail
        early.  All three cases protect against concurrent GUP using the PT
        lock and a proper clear/invalidate+flush of the relevant page table
        entry.
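      The clearing rule above can be modeled roughly as follows (a simplified
      sketch: the real page_try_share_anon_rmap() detects pins via the page
      refcount and is race-aware under the PT lock, none of which this toy
      reproduces, and try_pin() is a hypothetical helper standing in for the
      GUP FOLL_PIN path):

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
    bool anon_exclusive; /* models PG_anon_exclusive */
    int  gup_pins;       /* models outstanding FOLL_PIN references */
};

/* Try to mark the page possibly shared.  Fails if GUP pins exist, in
 * which case fork() must copy and KSM/swap/migration must back off. */
static bool page_try_share_anon_rmap(struct page_model *page)
{
    if (page->gup_pins > 0)
        return false;             /* pinned: caller bails out */
    page->anon_exclusive = false; /* now possibly shared */
    return true;
}

/* GUP must only take a reliable pin on an exclusive anonymous page. */
static bool try_pin(struct page_model *page)
{
    if (!page->anon_exclusive)
        return false;
    page->gup_pins++;
    return true;
}
```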
      
      This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
      pinned page gets mapped R/O and the successive write fault ends up
      replacing the page instead of reusing it.  It improves the situation for
      O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
      fork() is *not* involved, however swapout and fork() are still
      problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
      users will fix the issue for them.
      
      I. Details about basic handling
      
      I.1. Fresh anonymous pages
      
      page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
      given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
      the mechanism fresh anonymous pages come into life (besides migration code
      where we copy the page->mapping), all fresh anonymous pages will start out
      as exclusive.
      
      I.2. COW reuse handling of anonymous pages
      
      When a COW handler stumbles over a (sub)page that's marked exclusive, it
      simply reuses it.  Otherwise, the handler tries harder under page lock to
      detect if the (sub)page is exclusive and can be reused.  If exclusive,
      page_move_anon_rmap() will mark the given (sub)page exclusive.
      
      Note that hugetlb code does not yet check for PageAnonExclusive(), as it
      still uses the old COW logic that is prone to the COW security issue
      because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
      pages are a scarce resource.
      
      I.3. Migration handling
      
      try_to_migrate() has to try marking an exclusive anonymous page shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  migrate_vma_collect_pmd() and
      __split_huge_pmd_locked() are handled similarly.
      
      Writable migration entries implicitly point at shared anonymous pages. 
      For readable migration entries that information is stored via a new
      "readable-exclusive" migration entry, specific to anonymous pages.
      
      When restoring a migration entry in remove_migration_pte(), information
      about exclusivity is detected via the migration entry type, and
      RMAP_EXCLUSIVE is set accordingly for
      page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
      information.
      
      I.4. Swapout handling
      
      try_to_unmap() has to try marking the mapped page possibly shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  For now, information about exclusivity is lost.  In
      the future, we might want to remember that information in the swap entry
      in some cases, however, it requires more thought, care, and a way to store
      that information in swap entries.
      
      I.5. Swapin handling
      
      do_swap_page() will never stumble over exclusive anonymous pages in the
      swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
      to detect manually if an anonymous page is exclusive and has to set
      RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
      
      I.6. THP handling
      
      __split_huge_pmd_locked() has to move the information about exclusivity
      from the PMD to the PTEs.
      
      a) In case we have a readable-exclusive PMD migration entry, simply
         insert readable-exclusive PTE migration entries.
      
      b) In case we have a present PMD entry and we don't want to freeze
         ("convert to migration entries"), simply forward PG_anon_exclusive to
         all sub-pages, no need to temporarily clear the bit.
      
      c) In case we have a present PMD entry and want to freeze, handle it
         similar to try_to_migrate(): try marking the page shared first.  In
         case we fail, we ignore the "freeze" instruction and simply split
         ordinarily.  try_to_migrate() will properly fail because the THP is
         still mapped via PTEs.
      
      When splitting a compound anonymous folio (THP), the information about
      exclusivity is implicitly handled via the migration entries: no need to
      replicate PG_anon_exclusive manually.
      
      I.7. fork() handling

      fork() handling is relatively easy, because PG_anon_exclusive is only
      expressive for some page table entry types.
      
      a) Present anonymous pages
      
      page_try_dup_anon_rmap() will mark the given subpage shared -- which will
      fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
      PMD to handle it on the PTE level).
      
      Note that device exclusive entries are just a pointer at a PageAnon()
      page.  fork() will first convert a device exclusive entry to a present
      page table and handle it just like present anonymous pages.
      
      b) Device private entry
      
      Device private entries point at PageAnon() pages that cannot be mapped
      directly and, therefore, cannot get pinned.
      
      page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
      fail because they cannot get pinned.
      
      c) HW poison entries
      
      PG_anon_exclusive will remain untouched and is stale -- the page table
      entry is just a placeholder after all.
      
      d) Migration entries
      
      Writable and readable-exclusive entries are converted to readable entries:
      possibly shared.
      
      I.8. mprotect() handling
      
      mprotect() only has to properly handle the new readable-exclusive
      migration entry:
      
      When write-protecting a migration entry that points at an anonymous page,
      remember the information about exclusivity via the "readable-exclusive"
      migration entry type.
      
      II. Migration and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a migration entry, we have to mark the page possibly
      shared and synchronize against GUP-fast by a proper clear/invalidate+flush
      to make the following scenario impossible:
      
      1. try_to_migrate() places a migration entry after checking for GUP pins
         and marks the page possibly shared.
      
      2. GUP-fast pins the page due to lack of synchronization
      
      3. fork() converts the "writable/readable-exclusive" migration entry into a
         readable migration entry
      
      4. Migration fails due to the GUP pin (failing to freeze the refcount)
      
      5. Migration entries are restored. PG_anon_exclusive is lost
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      Note that we move information about exclusivity from the page to the
      migration entry as it otherwise highly overcomplicates fork() and
      PTE-mapping a THP.
      
      III. Swapout and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a swap entry, we have to mark the page possibly shared
      and synchronize against GUP-fast by a proper clear/invalidate+flush to
      make the following scenario impossible:
      
      1. try_to_unmap() places a swap entry after checking for GUP pins and
         clears exclusivity information on the page.
      
      2. GUP-fast pins the page due to lack of synchronization.
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      If we'd ever store information about exclusivity in the swap entry,
      similar to migration handling, the same considerations as in II would
      apply.  This is future work.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 25 March 2022, 1 commit
    • mm/swapfile: remove stale reuse_swap_page() · 03104c2c
      By David Hildenbrand
      All users are gone, let's remove it.  We'll let SWP_STABLE_WRITES stick
      around for now, as it might come in handy in the near future.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 23 March 2022, 2 commits
  5. 22 March 2022, 4 commits
  6. 15 March 2022, 1 commit
  7. 15 January 2022, 2 commits
  8. 07 November 2021, 1 commit
    • include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h · a1554c00
      By Mianhan Liu
      nr_free_buffer_pages could be exposed through mm.h instead of swap.h.
      The advantage of this change is that it reduces obsolete includes.  For
      example, net/ipv4/tcp.c wouldn't need swap.h any more since it has
      already included mm.h.  Similarly, after checking all the other files,
      it turns out that tcp.c, udp.c, meter.c, ...  follow the same rule, so
      swap.h can be removed from these files too.
      
      Moreover, after preprocessing all the files that use
      nr_free_buffer_pages, it turns out that those files have already
      included mm.h.  Thus, we can move nr_free_buffer_pages from swap.h to
      mm.h safely.  This change will not affect the compilation of other
      files.
      
      Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
      Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: "David S . Miller" <davem@davemloft.net>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 18 October 2021, 3 commits
  10. 27 September 2021, 3 commits
  11. 04 September 2021, 2 commits
  12. 02 July 2021, 3 commits
    • mm: device exclusive memory access · b756a3b5
      By Alistair Popple
      Some devices require exclusive write access to shared virtual memory (SVM)
      ranges to perform atomic operations on that memory.  This requires CPU
      page tables to be updated to deny access whilst atomic operations are
      occurring.
      
      In order to do this introduce a new swap entry type
      (SWP_DEVICE_EXCLUSIVE).  When a SVM range needs to be marked for exclusive
      access by a device all page table mappings for the particular range are
      replaced with device exclusive swap entries.  This causes any CPU access
      to the page to result in a fault.
      
      Faults are resolved by replacing the faulting entry with the original
      mapping.  This results in MMU notifiers being called, which a driver
      uses to update access permissions, such as revoking atomic access.
      After the notifiers have been called the device will no longer have
      exclusive access to the region.
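      The lifecycle described above can be sketched with a toy model (entries
      as tagged values; make_device_exclusive() and the fault path are heavily
      simplified, and the MMU notifier is reduced to a flag -- all names and
      structures below are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum entry_kind { ENTRY_PRESENT, ENTRY_DEVICE_EXCLUSIVE };

struct pte_model {
    enum entry_kind kind;
    unsigned long pfn; /* saved mapping, restored on fault */
};

static bool notifier_called; /* stands in for the MMU notifier */

/* Mark a range for exclusive device access: replace present entries
 * with device exclusive swap entries so any CPU access faults. */
static void make_device_exclusive(struct pte_model *ptes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ptes[i].kind == ENTRY_PRESENT)
            ptes[i].kind = ENTRY_DEVICE_EXCLUSIVE;
}

/* CPU fault handler: restore the original mapping and fire the
 * notifier, after which the device loses exclusive access. */
static unsigned long cpu_access(struct pte_model *pte)
{
    if (pte->kind == ENTRY_DEVICE_EXCLUSIVE) {
        pte->kind = ENTRY_PRESENT;
        notifier_called = true; /* driver revokes device atomics here */
    }
    return pte->pfn;
}
```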
      
      Walking the page tables to find the target pages is handled by
      get_user_pages() rather than a direct page table walk.  A direct page
      table walk similar to what migrate_vma_collect()/unmap() does could
      also have been utilised; however, that resulted in more code
      duplicating functionality that get_user_pages() already provides, as
      page faulting is required to make the PTEs present and to break COW.
      
      [dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
        Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b756a3b5
    • A
      mm: remove special swap entry functions · af5cdaf8
      Committed by Alistair Popple
      Patch series "Add support for SVM atomics in Nouveau", v11.
      
      Introduction
      ============
      
      Some devices have features such as atomic PTE bits that can be used to
      implement atomic access to system memory.  To support atomic operations to
      a shared virtual memory page such a device needs access to that page which
      is exclusive of the CPU.  This series introduces a mechanism to
      temporarily unmap pages granting exclusive access to a device.
      
      These changes are required to support OpenCL atomic operations in Nouveau
      to shared virtual memory (SVM) regions allocated with the
      CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
      OpenCL SVM feature is available at
      https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory
      
      Implementation
      ==============
      
      Exclusive device access is implemented by adding a new swap entry type
      (SWP_DEVICE_EXCLUSIVE), which is similar to a migration entry.  The main
      difference is that on fault the original entry is immediately restored by
      the fault handler instead of waiting.
      
      Restoring the entry triggers calls to MMU notifiers which allows a device
      driver to revoke the atomic access permission from the GPU prior to the
      CPU finalising the entry.
      
      Patches
      =======
      
      Patches 1 & 2 refactor existing migration and device private entry
      functions.
      
      Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
      functionality into separate functions - try_to_migrate_one() and
      try_to_munlock_one().
      
      Patch 5 renames some existing code but does not introduce functionality.
      
      Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
      
      Patch 7 contains the bulk of the implementation for device exclusive
      memory.
      
      Patch 8 contains some additions to the HMM selftests to ensure everything
      works as expected.
      
      Patch 9 is a cleanup for the Nouveau SVM implementation.
      
      Patch 10 contains the implementation of atomic access for the Nouveau
      driver.
      
      Testing
      =======
      
      This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
      which checks that GPU atomic accesses to system memory are atomic.
      Without this series the test fails as there is no way of write-protecting
      the page mapping which results in the device clobbering CPU writes.  For
      reference the test is available at
      https://ozlabs.org/~apopple/opencl_svm_atomics/
      
      Further testing has been performed by adding support for testing exclusive
      access to the hmm-tests kselftests.
      
      This patch (of 10):
      
      Remove multiple similar inline functions for dealing with different types
      of special swap entries.
      
      Both migration and device private swap entries use the swap offset to
      store a pfn.  Instead of multiple inline functions to obtain a struct page
      for each swap entry type use a common function pfn_swap_entry_to_page().
      Also open-code the various entry_to_pfn() functions, as this results in
      shorter code that is easier to understand.
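A minimal user-space sketch of the shared decoder, assuming a simplified entry layout (type in the high bits, pfn in the offset bits). The bit positions and helper names are assumptions for illustration, not the kernel's exact definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: top bits hold the swap type, the rest hold the offset. */
#define SWP_TYPE_SHIFT 58

typedef struct { uint64_t val; } swp_entry_t;

static swp_entry_t make_swap_entry(unsigned type, uint64_t offset)
{
    swp_entry_t e = { ((uint64_t)type << SWP_TYPE_SHIFT) | offset };
    return e;
}

static unsigned swp_type(swp_entry_t e)
{
    return (unsigned)(e.val >> SWP_TYPE_SHIFT);
}

static uint64_t swp_offset(swp_entry_t e)
{
    return e.val & ((1ULL << SWP_TYPE_SHIFT) - 1);
}

/* One common decoder replaces the per-type migration/device-private
 * helpers: both kinds of entry store a pfn in the offset field, so a
 * single function suffices. */
static uint64_t pfn_swap_entry_to_pfn(swp_entry_t e)
{
    return swp_offset(e);
}
```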
      
      Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
      Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af5cdaf8
    • M
      mm/swap: make swap_address_space an inline function · 2bb6a033
      Committed by Mel Gorman
      make W=1 generates the following warning in page_mapping() for allnoconfig
      
        mm/util.c:700:15: warning: variable `entry' set but not used [-Wunused-but-set-variable]
           swp_entry_t entry;
                       ^~~~~
      
      swap_address_space() is a #define on !CONFIG_SWAP configurations.  Make
      the helper an inline function to suppress the warning, add type checking,
      and apply any side-effects in the parameter list.
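Why the macro-to-inline conversion matters can be shown with a stand-alone sketch; space_macro and space_inline below are hypothetical stand-ins for the !CONFIG_SWAP stub, not the kernel's definitions:

```c
#include <assert.h>

static int g_calls;

/* Stub as a macro: the argument vanishes entirely at expansion time, so
 * a caller's local used only here is "set but not used", and any side
 * effects in the argument are never evaluated. */
#define space_macro(entry) ((void *)0)

/* The same stub as an inline function: the argument is type-checked and
 * its side effects in the parameter list still happen. */
static inline void *space_inline(int entry)
{
    (void)entry;
    return (void *)0;
}

static int bump(void) { g_calls++; return 0; }
```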
      
      Link: https://lkml.kernel.org/r/20210520084809.8576-12-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bb6a033
  13. 30 Jun 2021, 3 commits
    • H
      mm: free idle swap cache page after COW · f4c4a3f4
      Committed by Huang Ying
      With commit 09854ba9 ("mm: do_wp_page() simplification"), after COW,
      the idle swap cache page (neither the page nor the corresponding swap
      entry is mapped by any process) will be left in the LRU list, even if it's
      in the active list or the head of the inactive list.  So, the page
      reclaimer may take quite some overhead to reclaim these actually unused
      pages.
      
      To help page reclaim, with this patch we try to free the idle swap cache
      page after COW.  To avoid introducing much overhead to the hot COW code
      path,
      
      a) there's almost zero overhead for non-swap case via checking
         PageSwapCache() firstly.
      
      b) the page lock is acquired via trylock only.
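The two constraints (a) and (b) can be sketched as a toy user-space model; struct page and the helpers below are illustrative, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the post-COW cleanup: only try to free when the page is
 * actually a swap-cache page, and only with a trylock, so the hot COW
 * path never sleeps on the page lock. */
struct page { bool swapcache; bool locked; bool freed; };

static bool trylock_page(struct page *p)
{
    if (p->locked)
        return false;
    p->locked = true;
    return true;
}

static void free_idle_swapcache(struct page *p)
{
    if (!p->swapcache)      /* (a) near-zero cost for the non-swap case */
        return;
    if (!trylock_page(p))   /* (b) never block in the COW fault path */
        return;
    p->freed = true;        /* would drop the page from the swap cache */
    p->locked = false;
}
```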
      
      To test the patch, we used pmbench memory accessing benchmark with
      working-set larger than available memory on a 2-socket Intel server with a
      NVMe SSD as swap device.  Test results show that the pmbench score
      increases up to 23.8% with the decreased size of swap cache and swapin
      throughput.
      
      Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [use free_swap_cache()]
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4c4a3f4
    • M
      swap: fix do_swap_page() race with swapoff · 2799e775
      Committed by Miaohe Lin
      When I was investigating the swap code, I found the below possible race
      window:
      
      CPU 1                                           CPU 2
      -----                                           -----
      do_swap_page
        if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
        swap_readpage
          if (data_race(sis->flags & SWP_FS_OPS)) {
                                                      swapoff
                                                        ..
                                                        p->swap_file = NULL;
                                                        ..
          struct file *swap_file = sis->swap_file;
          struct address_space *mapping = swap_file->f_mapping; [oops!]
      
      Note that for pages that are swapped in through the swap cache, this
      isn't an issue, because the page is locked and the swap entry is marked
      with SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
      unlocked.
      
      Fix this race by using get/put_swap_device() to guard against concurrent
      swapoff.
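A simplified, single-threaded model of the get/put_swap_device() guard; the names and fields are illustrative (the real kernel version serializes with RCU and, later in this series, percpu_ref):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: readers take a reference that fails once swapoff has
 * started, so swap_file can never be observed as NULL inside the
 * guarded region. */
struct swap_info {
    bool offline;           /* set when swapoff begins */
    int  refs;              /* outstanding reader references */
    const char *swap_file;  /* illustrative path, would be a struct file */
};

static struct swap_info *get_swap_device(struct swap_info *si)
{
    if (si->offline)
        return NULL;        /* swapoff in progress: caller must bail out */
    si->refs++;
    return si;
}

static void put_swap_device(struct swap_info *si)
{
    si->refs--;
}

static bool swapoff_may_proceed(const struct swap_info *si)
{
    return si->refs == 0;   /* swapoff waits until all readers are done */
}
```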
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com
      Fixes: 0bcac06f ("mm,swap: skip swapcache for swapin of synchronous device")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2799e775
    • M
      mm/swapfile: use percpu_ref to serialize against concurrent swapoff · 63d8620e
      Committed by Miaohe Lin
      Patch series "close various race windows for swap", v6.
      
      When I was investigating the swap code, I found some possible race
      windows.  This series aims to fix all these races.  But using the current
      get/put_swap_device() to guard against concurrent swapoff for
      swap_readpage() looks terrible, because swap_readpage() may take a really
      long time.  To reduce the performance overhead on the hot path as much as
      possible, we can use a percpu_ref to close this race window (as suggested
      by Huang, Ying).  Patch 1 adds percpu_ref support for swap, and most of
      the remaining patches use it to close various race windows.  More details
      can be found in the respective changelogs.
      
      This patch (of 4):
      
      Using the current get/put_swap_device() to guard against concurrent
      swapoff for some swap ops, e.g.  swap_readpage(), looks terrible because
      they might take a really long time.  This patch adds percpu_ref support
      to serialize against concurrent swapoff (as suggested by Huang, Ying).
      Also remove the SWP_VALID flag, because it was only used together with
      the RCU solution.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63d8620e
  14. 07 May 2021, 1 commit
  15. 06 May 2021, 2 commits
    • M
      mm: disable LRU pagevec during the migration temporarily · d479960e
      Committed by Minchan Kim
      An LRU pagevec holds a refcount on its pages until the pagevec is
      drained.  It can prevent migration, since the refcount of the page is
      then greater than what the migration logic expects.  To mitigate the
      issue, callers of migrate_pages() drain the LRU pagevecs via
      migrate_prep() or lru_add_drain_all() before calling migrate_pages().

      However, that is not enough: pages that enter a pagevec after the
      draining call can still sit in the pagevec and keep preventing page
      migration.  Since some callers of migrate_pages() have retry logic with
      LRU draining, the page would migrate on the next trial, but this is
      still fragile in that it does not close the fundamental race between
      pages entering a pagevec and migration, so a migration failure could
      ultimately cause a contiguous memory allocation failure.
      
      To close the race, this patch disables the LRU caches (i.e., pagevecs)
      during an ongoing migration until the migration is done.
      
      Since it is really hard to reproduce, I measured how many times
      migrate_pages() retried with force mode (i.e., falling back to
      synchronous migration) with the debug code below.
      
      int migrate_pages(struct list_head *from, new_page_t get_new_page,
      			..
      			..

        if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
             printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
             dump_page(page, "fail to migrate");
        }
      
      The test was repeating android apps launching with cma allocation in
      background every five seconds.  Total cma allocation count was about 500
      during the testing.  With this patch, the dump_page count was reduced
      from 400 to 30.
      
      The new interface is also useful for memory hotplug which currently
      drains lru pcp caches after each migration failure.  This is rather
      suboptimal as it has to disrupt others running during the operation.
      With the new interface the operation happens only once.  This is also in
      line with pcp allocator cache which are disabled for the offlining as
      well.
      
      Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oliver Sang <oliver.sang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d479960e
    • D
      mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks · 202e35db
      Committed by Dave Hansen
      RECLAIM_ZONE was assumed to be unused because it was never explicitly
      used in the kernel.  However, there were a number of places where it was
      checked implicitly by checking 'node_reclaim_mode' for a zero value.
      
      These zero checks are not great because it is not obvious what a zero
      mode *means* in the code.  Replace them with a helper which makes it
      more obvious: node_reclaim_enabled().
      
      This helper also provides a handy place to explicitly check the
      RECLAIM_ZONE bit itself.  Check it explicitly there to make it more
      obvious where the bit can affect behavior.
      
      This should have no functional impact.
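A sketch of the helper described above. The bit values and the exact set of RECLAIM_* flags are assumptions based on this description, not verified kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Reclaim mode bits as toggled via the node_reclaim_mode sysctl
 * (values illustrative). */
#define RECLAIM_ZONE   (1u << 0)  /* run reclaim on the local node */
#define RECLAIM_WRITE  (1u << 1)  /* write out pages during reclaim */
#define RECLAIM_UNMAP  (1u << 2)  /* unmap pages during reclaim */

static unsigned int node_reclaim_mode;

/* The helper makes the implicit "mode != 0" checks explicit, including
 * the RECLAIM_ZONE bit itself. */
static bool node_reclaim_enabled(void)
{
    return node_reclaim_mode & (RECLAIM_ZONE | RECLAIM_WRITE | RECLAIM_UNMAP);
}
```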
      
      Link: https://lkml.kernel.org/r/20210219172559.BF589C44@viggo.jf.intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: "Tobin C. Harding" <tobin@kernel.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      202e35db
  16. 03 Mar 2021, 1 commit
  17. 25 Feb 2021, 2 commits
    • A
      mm/vmscan: __isolate_lru_page_prepare() cleanup · c2135f7c
      Committed by Alex Shi
      The function returns only two results, so using a 'switch' to handle its
      result is unnecessary.  Also simplify it into a bool function, as
      Vlastimil suggested.

      Also remove the 'goto' by reusing list_move(), and take Matthew Wilcox's
      suggestion to update the comments in the function.
      
      Link: https://lkml.kernel.org/r/728874d7-2d93-4049-68c1-dcc3b2d52ccd@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2135f7c
    • S
      mm: memcg: add swapcache stat for memcg v2 · b6038942
      Committed by Shakeel Butt
      This patch adds a swapcache stat for cgroup v2.  The swapcache
      represents the memory that is accounted against both the memory and the
      swap limit of the cgroup.  The main motivation behind exposing the
      swapcache stat is for enabling users to gracefully migrate from cgroup
      v1's memsw counter to cgroup v2's memory and swap counters.
      
      Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
      workload but without control on the exact proportion of memory and swap.
      Cgroup v2 provides separate limits for memory and swap which enables more
      control on the exact usage of memory and swap individually for the
      workload.
      
      With some minor subtleties, v1's memsw limit can be replaced with the sum
      of v2's memory and swap limits.  However, an alternative for memsw usage
      is not yet available in cgroup v2.  Exposing a per-cgroup swapcache stat
      enables that alternative: adding the memory usage and swap usage and
      subtracting the swapcache approximates the memsw usage.  This will help
      in transparently migrating workloads that depend on memsw usage and
      limits to v2's memory and swap counters.
      
      The reasons these applications are still interested in this approximate
      memsw usage are: (1) these applications are not really interested in two
      separate memory and swap usage metrics.  A single usage metric is
      simpler for them to use and reason about.
      
      (2) The memsw usage metric hides the underlying system's swap setup from
      the applications.  Applications with multiple instances running in a
      datacenter with heterogeneous systems (some have swap and some don't) will
      keep seeing a consistent view of their usage.
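The approximation described above reduces to one line of arithmetic; a trivial sketch (the helper name is illustrative, and the inputs would come from memory.current, memory.swap.current, and the new swapcache stat):

```c
#include <assert.h>
#include <stdint.h>

/* memsw usage on cgroup v1 can be approximated from v2 counters:
 * memory + swap - swapcache, since swap-cache pages are accounted
 * against both the memory and the swap counter. */
static uint64_t approx_memsw(uint64_t memory, uint64_t swap,
                             uint64_t swapcache)
{
    return memory + swap - swapcache;
}
```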
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      
      Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6038942
  18. 28 Jan 2021, 1 commit
  19. 16 Dec 2020, 2 commits
    • A
      mm/compaction: do page isolation first in compaction · 9df41314
      Committed by Alex Shi
      Currently, compaction takes the lru_lock and then does page isolation,
      which works fine with pgdat->lru_lock, since any page isolation would
      compete for the lru_lock.  If we want to change to a memcg lru_lock, we
      have to isolate the page before taking the lru_lock; isolation then
      blocks the page's memcg change, which relies on page isolation too.
      Then we can safely use a per-memcg lru_lock later.
      
      The new page isolation uses the previously introduced TestClearPageLRU()
      + pgdat lru locking, which will be changed to the memcg lru lock later.
      
      Hugh Dickins <hughd@google.com> fixed the following bugs in this patch's
      version:
      
      Fix lots of crashes under compaction load: isolate_migratepages_block()
      must clean up appropriately when rejecting a page, setting PageLRU again
      if it had been cleared; and a put_page() after get_page_unless_zero()
      cannot safely be done while holding locked_lruvec - it may turn out to be
      the final put_page(), which will take an lruvec lock when PageLRU.
      
      And move __isolate_lru_page_prepare back after get_page_unless_zero to
      make trylock_page() safe: trylock_page() is not safe to use at this time:
      its setting PG_locked can race with the page being freed or allocated
      ("Bad page"), and can also erase flags being set by one of those "sole
      owners" of a freshly allocated page who use non-atomic __SetPageFlag().
      
      Link: https://lkml.kernel.org/r/1604566549-62481-16-git-send-email-alex.shi@linux.alibaba.com
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9df41314
    • A
      mm/thp: move lru_add_page_tail() to huge_memory.c · 88dcb9a3
      Committed by Alex Shi
      Patch series "per memcg lru lock", v21.
      
      This patchset includes 3 parts:
      
       1) some code cleanup and minimum optimization as preparation
      
       2) use TestCleanPageLRU as page isolation's precondition
      
       3) replace per node lru_lock with per memcg per node lru_lock
      
      Currently there is one lru_lock per node, pgdat->lru_lock, which guards
      the lru lists, but the lru lists moved into memcg a long time ago.
      Still using a per-node lru_lock is clearly unscalable: pages in
      different memcgs have to compete with each other for one whole lru_lock.
      This patchset uses a per-lruvec/memcg lru_lock to replace the per-node
      lru lock guarding the lru lists, making it scalable for memcgs and
      gaining performance.
      
      Currently lru_lock still guards both the lru list and the page's lru
      bit, which is fine.  But if we want to use a specific lruvec lock for
      the page, we need to pin down the page's lruvec/memcg while locking.
      Just taking the lruvec lock first may be undermined by the page's memcg
      charge/migration.  To fix this problem, we can clear the page's lru bit
      and use that as the pin-down action that blocks memcg changes.  That is
      the reason for the new atomic function TestClearPageLRU.  So isolating a
      page now needs both actions: TestClearPageLRU and holding the lru_lock.
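The pin-down step can be modelled with a C11 atomic fetch-and. The names mirror the description above, but this is a user-space sketch, not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of TestClearPageLRU(): atomically clear the LRU bit and
 * report whether we were the one to clear it.  Only the winner may go
 * on to take the lru_lock and isolate the page; a concurrent memcg
 * charge/migration seeing the bit already cleared must back off. */
#define PG_LRU (1u << 0)

static bool test_clear_page_lru(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_and(flags, ~PG_LRU);
    return old & PG_LRU;   /* true only for the caller that cleared it */
}
```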
      
      The typical usage of this is isolate_migratepages_block() in
      compaction.c: we have to take the lru bit before the lru lock, which
      serializes page isolation against memcg page charge/migration, which
      can change the page's lruvec and thus the new lru_lock in it.
      
      The above solution was suggested by Johannes Weiner and, based on his
      new memcg charge path, led to this patchset.  (Hugh Dickins tested and
      contributed much code, from the compaction fix to general code polish;
      thanks a lot!)
      
      Daniel Jordan's testing shows a 62% improvement on a modified readtwice
      case on his 2P * 10 core * 2 HT Broadwell box on v18, which is not much
      different from this v20.
      
       https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
      
      Thanks to Hugh Dickins and Konstantin Khlebnikov; they both brought up
      this idea 8 years ago.  Thanks also to the others who gave comments:
      Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander
      Duyck, etc.

      Thanks for the testing support from Intel 0day and Rong Chen, Fengguang
      Wu, and Yun Wang.  Hugh Dickins also shared his kbuild-swap case.
      
      This patch (of 19):
      
      lru_add_page_tail() is only used in huge_memory.c; defining it in
      another file under a CONFIG_TRANSPARENT_HUGEPAGE restriction just looks
      weird.

      Let's move it to huge_memory.c, and make it static as Hugh Dickins
      suggested.
      
      Link: https://lkml.kernel.org/r/1604566549-62481-1-git-send-email-alex.shi@linux.alibaba.com
      Link: https://lkml.kernel.org/r/1604566549-62481-2-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88dcb9a3
  20. 14 Oct 2020, 1 commit