1. 18 Jul 2022, 1 commit
  2. 28 May 2022, 1 commit
  3. 20 May 2022, 1 commit
  4. 13 May 2022, 2 commits
    • mm/memory-failure.c: move clear_hwpoisoned_pages · 60f272f6
      Committed by zhenwei pi
      Patch series "memory-failure: fix hwpoison_filter", v2.
      
      As is well known, the memory failure mechanism handles memory corruption
      events and tries to send SIGBUS to the user process that uses the
      corrupted page.
      
      In the virtualization case, QEMU catches the SIGBUS and tries to inject
      an MCE into the guest, and the guest handles the memory failure again.
      Thus the guest suffers only minimal impact from the hardware memory
      corruption.
      
      The further steps I'm working on:
      
      1, try to modify code to decrease poisoned pages in a single place
         (mm/memory-failure.c: simplify num_poisoned_pages_dec in this series).
      
      2, try to use page_handle_poison() to handle SetPageHWPoison() and
         num_poisoned_pages_inc() together.  It would be best to call
         num_poisoned_pages_inc() in a single place too.
      
      3, introduce a memory failure notifier list in memory-failure.c: notify
         the corrupted PFN to whoever registers on this list.  If I can
         complete parts [1] and [2], [3] will be quite easy (just call the
         notifier list after increasing the poisoned page count); a minimal
         sketch follows this list.
      
      4, introduce a memory recover VQ for the memory balloon device, and
         register with the memory failure notifier list.  While the guest
         kernel handles a memory failure, the balloon device gets notified
         through the notifier list, and tells the host to recover the
         corrupted PFN (GPA) via the new VQ.
      
      5, the host side remaps the corrupted page (HVA) and tells the guest
         side to unpoison the PFN (GPA).  Then the guest fixes the corrupted
         page (GPA) dynamically.
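      
      A minimal sketch of step 3, assuming the standard kernel notifier API;
      the names mf_notifier_list, register_memory_failure_notifier() and
      memory_failure_notify() are illustrative, not existing symbols:
      
        #include <linux/notifier.h>
        
        /* hypothetical chain living in mm/memory-failure.c */
        static BLOCKING_NOTIFIER_HEAD(mf_notifier_list);
        
        int register_memory_failure_notifier(struct notifier_block *nb)
        {
                return blocking_notifier_chain_register(&mf_notifier_list, nb);
        }
        
        /* called once per corrupted pfn, after num_poisoned_pages_inc() */
        static void memory_failure_notify(unsigned long pfn)
        {
                blocking_notifier_call_chain(&mf_notifier_list, pfn, NULL);
        }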
      
      
      This patch (of 5):
      
      clear_hwpoisoned_pages() clears the HWPoison flag and decreases the
      number of poisoned pages; this really works as part of memory failure
      handling.
      
      Move this function from sparse.c to memory-failure.c, so that finally
      there is no CONFIG_MEMORY_FAILURE code left in sparse.c.
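      
      For reference, a sketch of the function being moved, roughly as it lives
      in sparse.c today (the exact body in the tree may differ slightly):
      
        void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
        {
                int i;
        
                if (!memmap)
                        return;
        
                for (i = 0; i < nr_pages; i++) {
                        if (PageHWPoison(&memmap[i])) {
                                num_poisoned_pages_dec();
                                ClearPageHWPoison(&memmap[i]);
                        }
                }
        }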
      
      Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com
      Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com
      Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: make alloc_contig_range work at pageblock granularity · b2c9e2fb
      Committed by Zi Yan
      alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
      merging pageblocks with different migratetypes.  It might unnecessarily
      convert extra pageblocks at the beginning and at the end of the range. 
      Change alloc_contig_range() to work at pageblock granularity.
      
      Special handling is needed for free pages and in-use pages across the
      boundaries of the range specified by alloc_contig_range(), because these
      partially isolated pages cause free page accounting issues.  The free
      pages will be split and freed into separate migratetype lists; the
      in-use pages will be migrated, and then the freed pages will be handled
      in the aforementioned way.
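      
      Conceptually (an illustrative sketch, not the literal diff; the helper
      name contig_range_bounds() is hypothetical), the isolation range is now
      rounded to pageblock boundaries rather than MAX_ORDER boundaries:
      
        static void contig_range_bounds(unsigned long start_pfn, unsigned long end_pfn,
                                        unsigned long *isolate_start, unsigned long *isolate_end)
        {
                /* old behaviour: widen to whole MAX_ORDER blocks around the request */
                /* *isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES); */
                /* *isolate_end   = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);        */
        
                /* new behaviour: only the pageblocks covering the request */
                *isolate_start = ALIGN_DOWN(start_pfn, pageblock_nr_pages);
                *isolate_end   = ALIGN(end_pfn, pageblock_nr_pages);
        }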
      
      [ziy@nvidia.com: fix deadlock/crash]
        Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
      Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. 29 Apr 2022, 3 commits
  6. 02 Apr 2022, 1 commit
  7. 23 Mar 2022, 3 commits
  8. 22 Mar 2022, 10 commits
  9. 04 Mar 2022, 2 commits
  10. 18 Feb 2022, 6 commits
    • mm/munlock: mlock_page() munlock_page() batch by pagevec · 2fbb0c10
      Committed by Hugh Dickins
      A weakness of the page->mlock_count approach is the need for lruvec lock
      while holding page table lock.  That is not an overhead we would allow on
      normal pages, but I think acceptable just for pages in an mlocked area.
      But let's try to amortize the extra cost by gathering on per-cpu pagevec
      before acquiring the lruvec lock.
      
      I have an unverified conjecture that the mlock pagevec might work out
      well for delaying the mlock processing of new file pages until they have
      got off lru_cache_add()'s pagevec and on to LRU.
      
      The initialization of page->mlock_count is subject to races and awkward:
      0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
      this commit, which just widens the window?  I haven't gone back to think
      it through.  Maybe someone can point out a better way to initialize it.
      
      Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
      into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
      rather than lru_cache_add()'s pagevec.
      
      Experimented with various orderings: the right thing seems to be for
      mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
      pagevec, but munlock_page() to leave TestClearPageMlocked to the later
      pagevec processing.
      
      Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
      their point, and the thp_nr_pages() calls already contain a VM_BUG_ON_PGFLAGS()
      for that.
      
      This still leaves acquiring lruvec locks under page table lock each time
      the pagevec fills (or a THP is added): which I suppose is rather silly,
      since they sit on pagevec waiting to be processed long after page table
      lock has been dropped; but I'm disinclined to uglify the calling sequence
      until some load shows an actual problem with it (nothing wrong with
      taking lruvec lock under page table lock, just "nicer" to do it less).
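      
      A sketch of the batching pattern described above (mlock_pvec and
      mlock_pagevec_drain() are illustrative names, not the patch itself):
      
        static DEFINE_PER_CPU(struct pagevec, mlock_pvec);
        
        void mlock_page(struct page *page)
        {
                struct pagevec *pvec;
        
                if (!TestSetPageMlocked(page))
                        mod_zone_page_state(page_zone(page), NR_MLOCK,
                                            thp_nr_pages(page));
        
                pvec = &get_cpu_var(mlock_pvec);
                get_page(page);
                /* take the lruvec lock only when the per-cpu batch fills */
                if (!pagevec_add(pvec, page) || PageHead(page))
                        mlock_pagevec_drain(pvec);
                put_cpu_var(mlock_pvec);
        }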
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    • mm/munlock: mlock_pte_range() when mlocking or munlocking · 34b67923
      Committed by Hugh Dickins
      Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
      required to lower the mlock_counts when munlocking without munmapping;
      and its complement, implementation of mlock_vma_pages_range(), required
      to raise the mlock_counts on pages already there when a range is mlocked.
      
      Combine them into just the one function mlock_vma_pages_range(), using
      walk_page_range() to run mlock_pte_range().  This approach fixes the
      "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
      https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/
      
      Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
      a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
      pte first, it will not try to munlock the page: leaving release_pages()
      to correct it when the last reference to the page is gone - that's okay,
      a page is not evictable anyway while it is held by an extra reference.
      
      Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
      a racing remove_migration_pte() or try_to_unmap_one() (depending on
      i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
      then mlock_pte_range() will mlock it a second time.  This is harder to
      reproduce, but a more serious race because it could leave the page
      unevictable indefinitely though the area is munlocked afterwards.
      Guard against it by setting the (inappropriate) VM_IO flag,
      and modifying mlock_vma_page() to decline such vmas.
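      
      A simplified sketch of that shape (close to, but not necessarily
      identical with, the patch):
      
        static const struct mm_walk_ops mlock_walk_ops = {
                .pmd_entry = mlock_pte_range,
        };
        
        static void mlock_vma_pages_range(struct vm_area_struct *vma,
                        unsigned long start, unsigned long end, vm_flags_t newflags)
        {
                if (newflags & VM_LOCKED)
                        newflags |= VM_IO;      /* decline racing rmap-side mlock */
                WRITE_ONCE(vma->vm_flags, newflags);
        
                walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
        
                if (newflags & VM_IO)
                        WRITE_ONCE(vma->vm_flags, newflags & ~VM_IO);
        }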
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    • mm/munlock: replace clear_page_mlock() by final clearance · b109b870
      Committed by Hugh Dickins
      Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
      of the munlocking to clear_page_mlock(), since PageMlocked is typically
      still set when mapcount has fallen to 0.  That is not what we want: we
      want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
      on the integrity of the mlock/munlock protocol - small numbers are
      not surprising, but big numbers mean the protocol is not working.
      
      That could be easily fixed by placing munlock_vma_page() at the start of
      page_remove_rmap(); but later in the series we shall want to batch the
      munlocking, and that too would tend to leave PageMlocked still set at
      the point when it is checked.
      
      So delete clear_page_mlock() now: leave it instead to release_pages()
      (and __page_cache_release()) to do this backstop clearing of Mlocked,
      when page refcount has fallen to 0.  If a pinned page occasionally gets
      counted as Mlocked and Unevictable until it is unpinned, that's okay.
      
      A slightly regrettable side-effect of this change is that, since
      release_pages() and __page_cache_release() may be called at interrupt
      time, those places which update NR_MLOCK with interrupts enabled
      had better use mod_zone_page_state() than __mod_zone_page_state()
      (but holding the lruvec lock always has interrupts disabled).
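      
      For illustration only (the helper names here are hypothetical), the two
      variants in question:
      
        /* safe to call with interrupts enabled, e.g. from release_pages() */
        static void mlock_stat_sub(struct page *page, long nr)
        {
                mod_zone_page_state(page_zone(page), NR_MLOCK, -nr);
        }
        
        /* cheaper, but only correct where interrupts are already disabled */
        static void mlock_stat_sub_irqs_off(struct page *page, long nr)
        {
                __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr);
        }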
      
      This change, forcing Mlocked off when refcount 0 instead of earlier
      when mapcount 0, is not fundamental: it can be reversed if performance
      or something else is found to suffer; but this is the easiest way to
      separate the stats - let's not complicate that without good reason.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    • mm/munlock: rmap call mlock_vma_page() munlock_vma_page() · cea86fe2
      Committed by Hugh Dickins
      Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
      inline functions which check (vma->vm_flags & VM_LOCKED) before calling
      mlock_page() and munlock_page() in mm/mlock.c.
      
      Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
      because we have understandable difficulty in accounting pte maps of THPs,
      and if passed a PageHead page, mlock_page() and munlock_page() cannot
      tell whether it's a pmd map to be counted or a pte map to be ignored.
      
      Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
      others, and use that to call mlock_vma_page() at the end of the page
      adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
      beginning? unimportant, but end was easier for assertions in testing).
      
      No page lock is required (although almost all adds happen to hold it):
      delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
      Certainly page lock did serialize with page migration, but I'm having
      difficulty explaining why that was ever important.
      
      Mlock accounting on THPs has been hard to define, differed between anon
      and file, involved PageDoubleMap in some places and not others, required
      clear_page_mlock() at some points.  Keep it simple now: just count the
      pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
      
      page_add_new_anon_rmap() callers unchanged: they have long been calling
      lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
      handling (it also checks for not VM_SPECIAL: I think that's overcautious,
      and inconsistent with other checks, given that mmap_region() already
      prevents VM_LOCKED on VM_SPECIAL; but I haven't quite convinced myself to
      change it).
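      
      A sketch of the resulting inline wrappers (close to, but not necessarily
      identical with, the patch):
      
        static inline void mlock_vma_page(struct page *page,
                        struct vm_area_struct *vma, bool compound)
        {
                /* count a pmd map of a THP, but not its pte maps */
                if (unlikely(vma->vm_flags & VM_LOCKED) &&
                    (compound || !PageTransCompound(page)))
                        mlock_page(page);
        }
        
        static inline void munlock_vma_page(struct page *page,
                        struct vm_area_struct *vma, bool compound)
        {
                if (unlikely(vma->vm_flags & VM_LOCKED) &&
                    (compound || !PageTransCompound(page)))
                        munlock_page(page);
        }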
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    • mm/munlock: delete munlock_vma_pages_all(), allow oomreap · a213e5cf
      Committed by Hugh Dickins
      munlock_vma_pages_range() will still be required, when munlocking but
      not munmapping a set of pages; but when unmapping a pte, the mlock count
      will be maintained in much the same way as it will be maintained when
      mapping in the pte.  Which removes the need for munlock_vma_pages_all()
      on mlocked vmas when munmapping or exiting: eliminating the catastrophic
      contention on i_mmap_rwsem, and the need for page lock on the pages.
      
      There is still a need to update locked_vm accounting according to the
      munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
      exit_mmap() does not need locked_vm updates, so delete unlock_range().
      
      And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
      because of the uncertainty in blocking on all those page locks?
      No fear of that now, so permit the OOM reaper on mlocked vmas.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    • mm/munlock: delete page_mlock() and all its works · ebcbc6ea
      Committed by Hugh Dickins
      We have recommended some applications to mlock their userspace, but that
      turns out to be counter-productive: when many processes mlock the same
      file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
      is needed for write, to remove any vma mapping that file from rmap's tree;
      but hogged for read by those with mlocks calling page_mlock() (formerly
      known as try_to_munlock()) on *each* page mapped from the file (the
      purpose being to find out whether another process has the page mlocked,
      so therefore it should not be unmlocked yet).
      
      Several optimizations have been made in the past: one is to skip
      page_mlock() when mapcount tells that nothing else has this page
      mapped; but that doesn't help at all when others do have it mapped.
      This time around, I initially intended to add a preliminary search
      of the rmap tree for overlapping VM_LOCKED ranges; but that gets
      messy with locking order, when in doubt whether a page is actually
      present; and risks adding even more contention on the i_mmap_rwsem.
      
      A solution would be much easier, if only there were space in struct page
      for an mlock_count... but actually, most of the time, there is space for
      it - an mlocked page spends most of its life on an unevictable LRU, but
      since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
      been redundant.  Let's try to reuse its page->lru.
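      
      Loosely, the idea reserved for that later patch is an overlay in struct
      page, sketched here purely for illustration (this commit does not change
      the layout, and the real field arrangement may differ):
      
        struct page {
                union {
                        struct list_head lru;             /* while on an LRU list */
                        struct {
                                void *__filler;           /* keep compound_head bit clear */
                                unsigned int mlock_count; /* while unevictable and mlocked */
                        };
                        /* other existing overlays elided */
                };
        };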
      
      But leave that until a later patch: in this patch, clear the ground by
      removing page_mlock(), and all the infrastructure that has gathered
      around it - which mostly hinders understanding, and will make reviewing
      new additions harder.  Don't mind those old comments about THPs, they
      date from before 4.5's refcounting rework: splitting is not a risk here.
      
      Just keep a minimal version of munlock_vma_page(), as reminder of what it
      should attend to (in particular, the odd way PGSTRANDED is counted out of
      PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range().  Move
      unchanged __mlock_posix_error_return() out of the way, down to above its
      caller: this series then makes no further change after mlock_fixup().
      
      After this and each following commit, the kernel builds, boots and runs;
      but with deficiencies which may show up in testing of mlock and munlock.
      The system calls succeed or fail as before, and mlock remains effective
      in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
      may be shown too low after mlock, grow, then stay too high after munlock:
      with previously mlocked pages remaining unevictable for too long, until
      finally unmapped and freed and counts corrected. Normal service will be
      resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  11. 15 Jan 2022, 2 commits
  12. 08 Jan 2022, 6 commits
  13. 07 Nov 2021, 2 commits
    • mm/vmscan: centralise timeout values for reclaim_throttle · c3f4a9a2
      Committed by Mel Gorman
      Neil Brown raised concerns about callers of reclaim_throttle specifying
      a timeout value.  The original timeout values to congestion_wait() were
      probably pulled out of thin air or copy&pasted from somewhere else.
      This patch centralises the timeout values and selects a timeout based on
      the reason for reclaim throttling.  These figures are also pulled out of
      the same thin air, but better values may be derived.
      
      Running a workload that is throttling for inappropriate periods and
      tracing mm_vmscan_throttled can be used to pick a more appropriate
      value.  Excessive throttling would pick a lower timeout, whereas
      excessive CPU usage in reclaim context would select a larger timeout.
      Ideally a large value would always be used and the wakeups would occur
      before a timeout but that requires careful testing.
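      
      A sketch of how the centralised selection might look (the reason names
      follow the existing VMSCAN_THROTTLE_* convention, but the helper name and
      exact timeouts here are illustrative, per the "thin air" caveat above):
      
        static long reclaim_throttle_timeout(enum vmscan_throttle_state reason)
        {
                switch (reason) {
                case VMSCAN_THROTTLE_WRITEBACK:
                        return HZ / 10;         /* writeback should complete soon */
                case VMSCAN_THROTTLE_ISOLATED:
                        return HZ / 50;         /* brief backoff for parallel reclaim */
                case VMSCAN_THROTTLE_NOPROGRESS:
                default:
                        return HZ / 2;          /* reclaim is struggling; wait longer */
                }
        }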
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim and compaction when too many pages are isolated · d818fca1
      Committed by Mel Gorman
      Page reclaim throttles on congestion if too many parallel reclaim
      instances have isolated too many pages.  This makes no sense: excessive
      parallelisation has nothing to do with writeback or congestion.
      
      This patch creates an additional waitqueue to sleep on when too many
      pages are isolated.  The throttled tasks are woken when the number of
      isolated pages is reduced or a timeout occurs.  There may be some false
      positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
      throttle again if necessary.
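      
      A simplified sketch of the mechanism (the waitqueue field and the wake-up
      helper name are illustrative; the real patch wires this through the
      reclaim_throttle() path):
      
        void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
        {
                wait_queue_head_t *wqh = &pgdat->reclaim_wait;
                DEFINE_WAIT(wait);
        
                prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
                schedule_timeout(HZ / 10);      /* timeout value illustrative */
                finish_wait(wqh, &wait);
        }
        
        /* hypothetical wake-up point, when isolated pages are put back */
        static void wake_throttled_reclaimers(pg_data_t *pgdat)
        {
                if (waitqueue_active(&pgdat->reclaim_wait))
                        wake_up(&pgdat->reclaim_wait);
        }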
      
      [shy828301@gmail.com: Wake up from compaction context]
      [vbabka@suse.cz: Account number of throttled tasks only for writeback]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>