1. 29 Jun 2022, 1 commit
  2. 22 Mar 2022, 8 commits
  3. 15 Mar 2022, 5 commits
  4. 22 Jan 2022, 1 commit
  5. 15 Jan 2022, 1 commit
  6. 08 Jan 2022, 10 commits
  7. 05 Jan 2022, 1 commit
  8. 10 Nov 2021, 1 commit
    • vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe
      Committed by Johannes Weiner
      Historically (pre-2.5), the inode shrinker used to reclaim only empty
      inodes and skip over those that still contained page cache.  This caused
      problems on highmem hosts: struct inode could fill up the lowmem zones
      before the cache was getting reclaimed in the highmem zones.
      
      To address this, the inode shrinker started to strip page cache to
      facilitate reclaiming lowmem.  However, this comes with its own set of
      problems: the shrinkers may drop actively used page cache just because
      the inodes are not currently open or dirty - think working with a large
      git tree.  It further doesn't respect cgroup memory protection settings
      and can cause priority inversions between containers.
      
      Nowadays, the page cache also holds non-resident info for evicted cache
      pages in order to detect refaults.  We've come to rely heavily on this
      data inside reclaim for protecting the cache workingset and driving swap
      behavior.  We also use it to quantify and report workload health through
      psi.  The latter in turn is used for fleet health monitoring, as well as
      driving automated memory sizing of workloads and containers, proactive
      reclaim and memory offloading schemes.
      
      The consequence of dropping page cache prematurely is that we're seeing
      subtle and not-so-subtle failures in all of the above-mentioned
      scenarios, with the workload generally entering unexpected thrashing
      states while losing the ability to reliably detect it.
      
      To fix this on non-highmem systems at least, going back to rotating
      inodes on the LRU isn't feasible.  We've tried (commit a76cf1a4
      ("mm: don't reclaim inodes with many attached pages")) and failed
      (commit 69056ee6 ("Revert "mm: don't reclaim inodes with many
      attached pages"")).
      
      The issue is mostly that shrinker pools attract pressure based on their
      size, and when objects get skipped the shrinkers remember this as
      deferred reclaim work.  This accumulates excessive pressure on the
      remaining inodes, and we can quickly eat into heavily used ones, or
      dirty ones that require IO to reclaim, when there potentially is plenty
      of cold, clean cache around still.
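
      For illustration, the deferral mechanism can be sketched in C, modeled
      loosely on do_shrink_slab() in mm/vmscan.c (simplified and renamed; the
      _sketch name and SCAN_BATCH constant are illustrative, not the kernel's):

          #include <linux/shrinker.h>

          #define SCAN_BATCH 128	/* like SHRINK_BATCH in mm/vmscan.c */

          /*
           * Pressure is proportional to pool size, and whatever a scan
           * skips is carried over as deferred work for the next call.
           */
          static unsigned long shrink_slab_sketch(struct shrinker *shrinker,
                                                  struct shrink_control *sc,
                                                  int priority,
                                                  unsigned long *nr_deferred)
          {
                  unsigned long freeable = shrinker->count_objects(shrinker, sc);
                  unsigned long total_scan = (freeable >> priority) + *nr_deferred;
                  unsigned long freed = 0;

                  while (total_scan >= SCAN_BATCH) {
                          unsigned long ret;

                          sc->nr_to_scan = SCAN_BATCH;
                          ret = shrinker->scan_objects(shrinker, sc);
                          if (ret == SHRINK_STOP)
                                  break;
                          freed += ret;
                          total_scan -= SCAN_BATCH;
                  }

                  /*
                   * Skipped objects (busy or dirty inodes, say) accumulate
                   * here and concentrate pressure on whatever remains on
                   * the LRU next time around.
                   */
                  *nr_deferred = total_scan;
                  return freed;
          }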
      
      Instead, this patch keeps populated inodes off the inode LRU in the
      first place - just like an open file or dirty state would.  An otherwise
      clean and unused inode then gets queued when the last cache entry
      disappears.  This solves the problem without reintroducing the reclaim
      issues, and generally is a bit more scalable than having to wade through
      potentially hundreds of thousands of busy inodes.
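
      The gating can be sketched like this, modeled on the mapping_shrinkable()
      helper this patch introduces (condensed; the _sketch suffix marks it as
      an illustration, not the verbatim code):

          #include <linux/fs.h>
          #include <linux/xarray.h>

          /*
           * May the inode go on the shrinker LRU? Only if the page cache
           * holds nothing that reclaim still cares about.
           */
          static bool mapping_shrinkable_sketch(struct address_space *mapping)
          {
                  void *head;

                  /*
                   * On highmem hosts, lowmem pressure from struct inode can
                   * build before highmem pressure on the cache; keep inodes
                   * shrinkable there regardless of cache state.
                   */
                  if (IS_ENABLED(CONFIG_HIGHMEM))
                          return true;

                  /* Cache completely empty? Shrink away. */
                  head = rcu_access_pointer(mapping->i_pages.xa_head);
                  if (!head)
                          return true;

                  /*
                   * The xarray stores single offset-0 entries directly in the
                   * head pointer; a lone shadow entry there would escape the
                   * shadow shrinker, so let the inode shrinker pick it up.
                   */
                  if (!xa_is_node(head) && xa_is_value(head))
                          return true;

                  /* Populated inode: stays off the LRU, like an open file. */
                  return false;
          }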
      
      Locking is a bit tricky because the locks protecting the inode state
      (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
      irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
      serialized through i_lock, taken before the i_pages lock, to make sure
      depopulated inodes are queued reliably.  Additions may race with
      deletions, but we'll check again in the shrinker.  If additions race
      with the shrinker itself, we're protected by the i_lock: if find_inode()
      or iput() win, the shrinker will bail on the elevated i_count or
      I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
      will set I_FREEING and inhibit further igets(), which will cause the
      other side to create a new instance of the inode instead.
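
      A condensed sketch of the deletion-side ordering (modeled on the
      delete_from_page_cache() change in this commit; simplified, with
      mapping_shrinkable_sketch() being the illustrative helper from above):

          /*
           * i_lock nests outside the irq-safe i_pages lock, so an inode
           * that just lost its last cache entry is queued reliably.
           */
          static void delete_from_page_cache_sketch(struct page *page)
          {
                  struct address_space *mapping = page->mapping;
                  struct inode *inode = mapping->host;

                  spin_lock(&inode->i_lock);	/* serializes vs. the shrinker */
                  xa_lock_irq(&mapping->i_pages);
                  __delete_from_page_cache(page, NULL);
                  xa_unlock_irq(&mapping->i_pages);

                  /* Last cache entry gone: the inode may enter the LRU now. */
                  if (mapping_shrinkable_sketch(mapping))
                          inode_add_lru(inode);	/* made visible to mm by this patch */
                  spin_unlock(&inode->i_lock);
          }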
      
      Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 04 Sep 2021, 2 commits
  10. 13 Jul 2021, 2 commits
  11. 17 Jun 2021, 1 commit
    • mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 22061a1f
      Committed by Hugh Dickins
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but here apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte, page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that would likely reduce the number of incidents or eliminate them, it
      would give less assurance that we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
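
      Condensed from the commit's mm/memory.c changes (simplified, not the
      full diff), the two halves look roughly like this:

          /*
           * unmap_mapping_page(): unmap one locked page, flagging it in
           * zap_details so zap_pmd_range() can synchronize against a
           * racing zap_huge_pmd().
           */
          void unmap_mapping_page(struct page *page)
          {
                  struct address_space *mapping = page->mapping;
                  struct zap_details details = { };

                  VM_BUG_ON(!PageLocked(page));
                  VM_BUG_ON(PageTail(page));

                  details.check_mapping = mapping;
                  details.first_index = page->index;
                  details.last_index = page->index + thp_nr_pages(page) - 1;
                  details.single_page = page;	/* new field, watched below */

                  i_mmap_lock_write(mapping);
                  if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                          unmap_mapping_range_tree(&mapping->i_mmap, &details);
                  i_mmap_unlock_write(mapping);
          }

          /*
           * ...and in zap_pmd_range(): on seeing pmd_none() for the single
           * page being truncated, take and drop the pmd lock so we cannot
           * return while zap_huge_pmd() has cleared *pmd but not yet
           * decremented compound_mapcount.
           */
          else if (details && details->single_page &&
                   PageTransCompound(details->single_page) &&
                   next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
                  spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
                  spin_unlock(ptl);
          }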
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
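
      The reordering in invalidate_inode_pages2_range() then looks roughly
      like this (condensed fragment, not the full function):

          if (!did_range_unmap && page_mapped(page)) {
                  /*
                   * Page is mapped: before taking its lock, zap the rest
                   * of the file range in one hit.
                   */
                  unmap_mapping_pages(mapping, index, (1 + end - index), false);
                  did_range_unmap = 1;
          }

          lock_page(page);
          /* ... mapping and writeback checks elided ... */
          if (page_mapped(page))
                  unmap_mapping_page(page);	/* single page, under page lock */
          BUG_ON(page_mapped(page));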
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 06 May 2021, 2 commits
  13. 27 Feb 2021, 4 commits
  14. 16 Dec 2020, 1 commit