1. 30 Jun 2021, 1 commit
  2. 07 May 2021, 1 commit
  3. 06 May 2021, 2 commits
    • mm: fs: invalidate BH LRU during page migration · 8cc621d2
      Committed by Minchan Kim
      Pages containing buffer_heads that are in one of the per-CPU buffer_head
      LRU caches will be pinned and thus cannot be migrated.  This can prevent
      CMA allocations from succeeding, which are often used on platforms with
      co-processors (such as a DSP) that can only use physically contiguous
      memory.  It can also prevent memory hot-unplugging from succeeding,
      which involves migrating at least MIN_MEMORY_BLOCK_SIZE bytes of memory,
      which ranges from 8 MiB to 1 GiB based on the architecture in use.
      
      Correspondingly, invalidate the BH LRU caches before a migration starts
      and stop any buffer_head from being cached in the LRU caches, until
      migration has finished.
      
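      As a rough sketch of the idea (the helper names below are illustrative
      assumptions, not necessarily the exact symbols the patch adds), the
      migration path flips a switch that both drains the per-CPU BH LRUs and
      keeps new buffer_heads out of them until migration finishes:

      	/* Sketch only; bh_lru_disabled and these helpers are assumed names. */
      	static atomic_t bh_lru_disabled = ATOMIC_INIT(0);

      	static bool bh_lru_may_cache(void)
      	{
      		/* checked before installing a bh into a per-CPU LRU slot */
      		return atomic_read(&bh_lru_disabled) == 0;
      	}

      	void bh_lru_disable_for_migration(void)
      	{
      		atomic_inc(&bh_lru_disabled);
      		invalidate_bh_lrus();	/* drain what the per-CPU LRUs already pin */
      	}

      	void bh_lru_enable_after_migration(void)
      	{
      		atomic_dec(&bh_lru_disabled);
      	}
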
      Link: https://lkml.kernel.org/r/20210319175127.886124-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
      Reported-by: Laura Abbott <labbott@kernel.org>
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: disable LRU pagevec during the migration temporarily · d479960e
      Committed by Minchan Kim
      An LRU pagevec holds a refcount on its pages until the pagevec is
      drained.  That can prevent migration, since the page's refcount is
      then greater than what the migration logic expects.  To mitigate the
      issue, callers of migrate_pages() drain the LRU pagevecs via
      migrate_prep() or lru_add_drain_all() before calling migrate_pages().
      
      However, that is not enough: pages entering a pagevec after the
      draining call can still sit there and keep preventing page migration.
      Since some callers of migrate_pages() have retry logic with LRU
      draining, such a page may migrate on the next attempt, but this is
      still fragile in that it doesn't close the fundamental race between
      pages entering a pagevec and migration, so a migration failure can
      ultimately cause a contiguous memory allocation failure.
      
      To close the race, this patch disables the LRU caches (i.e.,
      pagevecs) while a migration is ongoing, until the migration is done.
      
      Since the race is really hard to reproduce, I measured how many times
      migrate_pages() retried in force mode (roughly, the fallback to
      synchronous migration) with the debug code below.
      
      int migrate_pages(struct list_head *from, new_page_t get_new_page,
      			..
      			..

        if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
             printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
             dump_page(page, "fail to migrate");
        }
      
      The test repeatedly launched Android apps, with a CMA allocation
      running in the background every five seconds.  The total CMA
      allocation count was about 500 during the test.  With this patch, the
      dump_page() count was reduced from 400 to 30.
      
      The new interface is also useful for memory hotplug, which currently
      drains the lru pcp caches after each migration failure.  This is
      rather suboptimal, as it has to disrupt other tasks running during
      the operation.  With the new interface the operation happens only
      once.  This is also in line with the pcp allocator caches, which are
      disabled for the offlining as well.
      
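      A minimal sketch of a migrate_pages() caller using the new interface,
      assuming the disable/enable pair is named lru_cache_disable() /
      lru_cache_enable() as in this patch (pagelist and mtc stand for the
      caller's migration list and target control; error handling elided):

      	/* Sketch: keep pagevecs empty and disabled across the migration. */
      	lru_cache_disable();	/* drains all LRU pagevecs, then keeps them off */
      	ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
      			    (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE);
      	lru_cache_enable();	/* pagevecs may batch pages again */
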
      Link: https://lkml.kernel.org/r/20210319175127.886124-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oliver Sang <oliver.sang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 27 Feb 2021, 5 commits
  5. 25 Feb 2021, 5 commits
  6. 16 Dec 2020, 8 commits
    • mm/lru: introduce relock_page_lruvec() · 2a5e4e34
      Committed by Alexander Duyck
      Add relock_page_lruvec() to replace repeated instances of the same
      code; no functional change.
      
      When testing for relock we can avoid the need for RCU locking if we
      simply compare the page's pgdat and memcg pointers against those that
      the lruvec is holding.  By doing this we can avoid the extra pointer
      walks and accesses of the memory cgroup.
      
      In addition we can avoid the checks entirely if lruvec is currently NULL.
      
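      The shape of the helper, sketched (simplified from the irq variant;
      the series also adds an irqsave flavor):

      	/* Sketch: relock only when the held lruvec doesn't match the page. */
      	static struct lruvec *relock_page_lruvec_irq(struct page *page,
      						     struct lruvec *locked_lruvec)
      	{
      		if (locked_lruvec) {
      			/* Same node and memcg: the lock we hold is the right one. */
      			if (lruvec_pgdat(locked_lruvec) == page_pgdat(page) &&
      			    lruvec_memcg(locked_lruvec) == page_memcg(page))
      				return locked_lruvec;

      			unlock_page_lruvec_irq(locked_lruvec);
      		}

      		return lock_page_lruvec_irq(page);
      	}
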
      [alex.shi@linux.alibaba.com: use page_memcg()]
        Link: https://lkml.kernel.org/r/66d8e79d-7ec6-bfbc-1c82-bf32db3ae5b7@linux.alibaba.com
      
      Link: https://lkml.kernel.org/r/1604566549-62481-19-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/lru: replace pgdat lru_lock with lruvec lock · 6168d0da
      Committed by Alex Shi
      This patch moves the per-node lru_lock into the lruvec, thus
      providing one lru_lock for each memcg on each node.  So on a large
      machine, memcgs no longer have to suffer from pgdat->lru_lock
      contention; they can go fast with their own lru_lock.
      
      After moving the memcg charge before LRU insertion, page isolation
      can serialize the page's memcg, so the per-memcg lruvec lock is
      stable and can replace the per-node lru lock.
      
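      Schematically, the lock moves from the node into the lruvec (both
      structures abbreviated to the relevant fields):

      	/* Before: one lock per node, in struct pglist_data. */
      	typedef struct pglist_data {
      		...
      		spinlock_t lru_lock;	/* guarded every lru list on the node */
      	} pg_data_t;

      	/* After: one lock per memcg per node, in struct lruvec. */
      	struct lruvec {
      		struct list_head lists[NR_LRU_LISTS];
      		spinlock_t lru_lock;	/* memcgs contend only within themselves */
      		...
      	};
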
      In isolate_migratepages_block(), compact_unlock_should_abort() and
      lock_page_lruvec_irqsave() are open coded to work with
      compact_control.  Also add a debug function to the locking code that
      may give some clues if something gets out of hand.
      
      Daniel Jordan's testing shows a 62% improvement on a modified
      readtwice case on his 2P * 10 core * 2 HT Broadwell box.
      https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
      
      Hugh Dickins helped on the patch polish, thanks!
      
      [alex.shi@linux.alibaba.com: fix comment typo]
        Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com
      [alex.shi@linux.alibaba.com: use page_memcg()]
        Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com
      
      Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rong Chen <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap.c: serialize memcg changes in pagevec_lru_move_fn · fc574c23
      Committed by Alex Shi
      Hugh Dickins found a memcg-change bug in the original version: if we
      want to change pgdat->lru_lock to the memcg's lruvec lock, we have to
      serialize mem_cgroup_move_account() against pagevec_lru_move_fn().
      The possible bad scenario looks like this:

      	cpu 0					cpu 1
      lruvec = mem_cgroup_page_lruvec()
      					if (!isolate_lru_page())
      						mem_cgroup_move_account

      spin_lock_irqsave(&lruvec->lru_lock) <== wrong lock.

      So we need TestClearPageLRU to block isolate_lru_page(); that
      serializes the memcg change.  We then remove the PageLRU check in the
      move_fn callee as a consequence.
      
      __pagevec_lru_add_fn() is different from the others, because the pages it
      deals with are, by definition, not yet on the lru.  TestClearPageLRU is
      not needed and would not work, so __pagevec_lru_add() goes its own way.
      
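      The resulting serialization, sketched (simplified from mm/swap.c
      after this change):

      	/* Sketch: only pages whose LRU bit we managed to clear are moved. */
      	static void pagevec_lru_move_fn(struct pagevec *pvec,
      			void (*move_fn)(struct page *page, struct lruvec *lruvec))
      	{
      		struct lruvec *lruvec = NULL;
      		unsigned long flags = 0;
      		int i;

      		for (i = 0; i < pagevec_count(pvec); i++) {
      			struct page *page = pvec->pages[i];

      			/* A racing isolation or memcg move is blocked or skipped. */
      			if (!TestClearPageLRU(page))
      				continue;

      			lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
      			(*move_fn)(page, lruvec);
      			SetPageLRU(page);
      		}
      		if (lruvec)
      			unlock_page_lruvec_irqrestore(lruvec, flags);
      		release_pages(pvec->pages, pvec->nr);
      		pagevec_reinit(pvec);
      	}
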
      Link: https://lkml.kernel.org/r/1604566549-62481-17-git-send-email-alex.shi@linux.alibaba.com
      Reported-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/lru: move lock into lru_note_cost · 75cc3c91
      Committed by Alex Shi
      We have to move the lru_lock into lru_note_cost(), since that
      function cycles up the memcg tree, in preparation for the coming
      per-lruvec lru_lock replacement.  It's a bit ugly and may cost a bit
      more locking, but the benefit of per-memcg locking should cover the
      loss.
      
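      Sketched, the lock is now taken and released once per level while
      walking up the hierarchy (accounting details elided):

      	/* Sketch of lru_note_cost() cycling up the memcg tree. */
      	void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
      	{
      		do {
      			/* per-lruvec once the rest of the series lands */
      			spin_lock_irq(&lruvec->lru_lock);
      			if (file)
      				lruvec->file_cost += nr_pages;
      			else
      				lruvec->anon_cost += nr_pages;
      			/* ... decay the cost values when they grow too large ... */
      			spin_unlock_irq(&lruvec->lru_lock);
      		} while ((lruvec = parent_lruvec(lruvec)));
      	}
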
      Link: https://lkml.kernel.org/r/1604566549-62481-11-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn · c7c7b80c
      Committed by Alex Shi
      Fold the PGROTATED event collection into the pagevec_move_tail_fn()
      callback, as the other callbacks of pagevec_lru_move_fn() do.  Thus
      we can drop the pagevec_move_tail() wrapper.  Now all users of
      pagevec_lru_move_fn() look the same, and its third parameter is no
      longer needed.

      This only simplifies the calling convention; no functional change.
      
      [lkp@intel.com: found a build issue in the original patch, thanks]
      
      Link: https://lkml.kernel.org/r/1604566549-62481-10-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: move lru_add_page_tail() to huge_memory.c · 88dcb9a3
      Committed by Alex Shi
      Patch series "per memcg lru lock", v21.
      
      This patchset includes 3 parts:
      
       1) some code cleanup and minimum optimization as preparation
      
       2) use TestClearPageLRU as page isolation's precondition
      
       3) replace per node lru_lock with per memcg per node lru_lock
      
      The current lru_lock is one per node, pgdat->lru_lock, guarding the
      lru lists; but the lru lists were moved into the memcg long ago.
      Still using a per-node lru_lock is clearly unscalable: the pages of
      all memcgs on a node have to compete with each other for a single
      lru_lock.  This patchset uses a per-lruvec (per-memcg, per-node)
      lru_lock to replace the per-node lru lock guarding the lru lists,
      making them scalable across memcgs and gaining performance.
      
      Currently lru_lock still guards both the lru list and the page's lru
      bit; that's fine.  But if we want to use a page-specific lruvec lock,
      we need to pin down the page's lruvec/memcg while locking.  Just
      taking the lruvec lock first can be undermined by the page's memcg
      charge/migration.  To fix this problem, we clear the page's lru bit
      and use that as the pin-down action that blocks memcg changes.
      That's the reason for the new atomic function TestClearPageLRU.  So
      isolating a page now needs both actions, as sketched below:
      TestClearPageLRU and holding the lru_lock.
      
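      A minimal sketch of the new isolation rule (simplified; the locking
      helpers shown are the ones this series introduces):

      	/* Sketch: isolate a page from its LRU list under the new rule. */
      	if (TestClearPageLRU(page)) {
      		/* Cleared lru bit pins the page's memcg: charge/migration waits. */
      		struct lruvec *lruvec = lock_page_lruvec_irq(page);

      		del_page_from_lru_list(page, lruvec, page_lru(page));
      		unlock_page_lruvec_irq(lruvec);
      	}
      	/* else: someone else owns the lru bit; skip this page. */
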
      The typical usage of this is isolate_migratepages_block() in
      compaction.c: we have to take the lru bit before the lru lock, which
      serializes page isolation against memcg page charge/migration, either
      of which would change the page's lruvec and hence the lru_lock in it.
      
      The above solution was suggested by Johannes Weiner and is built on
      his new memcg charge path; hence this patchset.  (Hugh Dickins tested
      and contributed much code, from the compaction fix to general code
      polish, thanks a lot!)
      
      Daniel Jordan's testing shows a 62% improvement on a modified
      readtwice case on his 2P * 10 core * 2 HT Broadwell box on v18, which
      is not much different from this v20.
      
       https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
      
      Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed
      this idea 8 years ago, and to the others who commented as well:
      Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander
      Duyck, etc.

      Thanks for the testing support from Intel 0day, Rong Chen, Fengguang
      Wu, and Yun Wang.  Hugh Dickins also shared his kbuild-swap case.
      
      This patch (of 19):
      
      lru_add_page_tail() is only used in huge_memory.c; defining it in
      another file under a CONFIG_TRANSPARENT_HUGEPAGE macro just looks
      weird.

      Let's move it to huge_memory.c, and make it static as Hugh Dickins
      suggested.
      
      Link: https://lkml.kernel.org/r/1604566549-62481-1-git-send-email-alex.shi@linux.alibaba.com
      Link: https://lkml.kernel.org/r/1604566549-62481-2-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove pagevec_lookup_range_nr_tag() · 46268094
      Committed by Jeff Layton
      With the merge of commit 2e169296 ("ceph: have ceph_writepages_start
      call pagevec_lookup_range_tag"), nothing calls this anymore.
      
      Link: https://lkml.kernel.org/r/20201021193926.101474-1-jlayton@kernel.org
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: handle zone device pages in release_pages() · 43fbdeb3
      Committed by Ralph Campbell
      release_pages() is an optimized, inlined version of __put_page(),
      except that zone device struct pages that are not
      page_is_devmap_managed() (i.e., memory_type MEMORY_DEVICE_GENERIC and
      MEMORY_DEVICE_PCI_P2PDMA) fall through to the code that could return
      the zone device page to the page allocator instead of adjusting the
      pgmap reference count.

      Clearly these types of pages are not having their reference count
      decremented to zero via release_pages(), or page allocation problems
      would be seen.  Just to be safe, handle the one-to-zero case in
      release_pages() like __put_page() does.
      
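      A sketch of the added branch (simplified; it mirrors the handling in
      __put_page()):

      	/* Sketch of the zone device case in release_pages(). */
      	if (is_zone_device_page(page)) {
      		if (page_is_devmap_managed(page)) {
      			/* Free/recycle via the pgmap's callback. */
      			put_devmap_managed_page(page);
      			continue;
      		}
      		/* MEMORY_DEVICE_GENERIC / PCI_P2PDMA: handle 1 -> 0 here. */
      		if (put_page_testzero(page))
      			put_dev_pagemap(page->pgmap);
      		continue;
      	}
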
      Link: https://lkml.kernel.org/r/20201021194733.11530-1-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 14 Oct 2020, 5 commits
  8. 20 Sep 2020, 1 commit
  9. 10 Sep 2020, 1 commit
    • mm/swap: Do not abuse the seqcount_t latching API · 6446a513
      Committed by Ahmed S. Darwish
      Commit eef1a429 ("mm/swap.c: piggyback lru_add_drain_all() calls")
      implemented an optimization mechanism to exit the to-be-started LRU
      drain operation (name it A) if another drain operation *started and
      finished* while (A) was blocked on the LRU draining mutex.
      
      This was done through a seqcount_t latch, which is an abuse of its
      semantics:
      
        1. seqcount_t latching should be used for the purpose of switching
           between two storage places with sequence protection to allow
           interruptible, preemptible, writer sections. The referenced
           optimization mechanism has absolutely nothing to do with that.
      
        2. The used raw_write_seqcount_latch() has two SMP write memory
           barriers to ensure one consistent storage place out of the two
           storage places available. A full memory barrier is required
           instead: to guarantee that the pagevec counter stores visible to
           the local CPU are visible to other CPUs -- before loading the
           current drain generation.
      
      Besides the seqcount_t API abuse, the semantics of a latch sequence
      counter were force-fitted into the referenced optimization. What was
      meant is to track "generations" of LRU draining operations, where
      "global lru draining generation = x" implies that all generations
      0 < n <= x are already *scheduled* for draining -- thus nothing needs
      to be done if the current generation number n <= x.
      
      Remove the conceptually-inappropriate seqcount_t latch usage. Manually
      implement the referenced optimization using a counter and SMP memory
      barriers.
      
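      The replacement, sketched (simplified from mm/swap.c; the real commit
      carries much longer barrier comments):

      	/* Sketch of generation-based piggybacking in lru_add_drain_all(). */
      	void lru_add_drain_all(void)
      	{
      		static unsigned int lru_drain_gen;
      		static DEFINE_MUTEX(lock);
      		unsigned int this_gen;

      		/* Snapshot the generation before (maybe) blocking on the mutex. */
      		this_gen = smp_load_acquire(&lru_drain_gen);

      		mutex_lock(&lock);

      		/* A drain that began after our snapshot already covers us. */
      		if (unlikely(this_gen != lru_drain_gen))
      			goto done;

      		WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
      		smp_mb();	/* pagevec counter stores visible before draining */

      		/* ... queue per-CPU drain work and wait for completion ... */
      done:
      		mutex_unlock(&lock);
      	}
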
      Note, while at it, use the non-atomic variant of cpumask_set_cpu(),
      __cpumask_set_cpu(), due to the already existing mutex protection.
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/87y2pg9erj.fsf@vostro.fn.ogness.net
  10. 15 Aug 2020, 2 commits
    • mm/swap.c: annotate data races for lru_rotate_pvecs · 7e0cc01e
      Committed by Qian Cai
      A read of lru_add_pvec->nr can race with an interrupt writing to the
      same variable.  The write has local interrupts disabled, but the
      plain reads result in data races.  However, it is unlikely the
      compiler could do much damage here, given that lru_add_pvec->nr is an
      "unsigned char" and there is an existing compiler barrier.  Thus,
      annotate the reads using the data_race() macro (see the sketch after
      the report).  The data races were reported by KCSAN:
      
       BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page
      
       write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
        rotate_reclaimable_page+0x2df/0x490
        pagevec_add at include/linux/pagevec.h:81
        (inlined by) rotate_reclaimable_page at mm/swap.c:259
        end_page_writeback+0x1b5/0x2b0
        end_swap_bio_write+0x1d0/0x280
        bio_endio+0x297/0x560
        dec_pending+0x218/0x430 [dm_mod]
        clone_endio+0xe4/0x2c0 [dm_mod]
        bio_endio+0x297/0x560
        blk_update_request+0x201/0x920
        scsi_end_request+0x6b/0x4a0
        scsi_io_completion+0xb7/0x7e0
        scsi_finish_command+0x1ed/0x2a0
        scsi_softirq_done+0x1c9/0x1d0
        blk_done_softirq+0x181/0x1d0
        __do_softirq+0xd9/0x57c
        irq_exit+0xa2/0xc0
        do_IRQ+0x8b/0x190
        ret_from_intr+0x0/0x42
        delay_tsc+0x46/0x80
        __const_udelay+0x3c/0x40
        __udelay+0x10/0x20
        kcsan_setup_watchpoint+0x202/0x3a0
        __tsan_read1+0xc2/0x100
        lru_add_drain_cpu+0xb8/0x3f0
        lru_add_drain+0x25/0x40
        shrink_active_list+0xe1/0xc80
        shrink_lruvec+0x766/0xb70
        shrink_node+0x2d6/0xca0
        do_try_to_free_pages+0x1f7/0x9a0
        try_to_free_pages+0x252/0x5b0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
        lru_add_drain_cpu+0xb8/0x3f0
        lru_add_drain_cpu at mm/swap.c:602
        lru_add_drain+0x25/0x40
        shrink_active_list+0xe1/0xc80
        shrink_lruvec+0x766/0xb70
        shrink_node+0x2d6/0xca0
        do_try_to_free_pages+0x1f7/0x9a0
        try_to_free_pages+0x252/0x5b0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       2 locks held by oom02/37761:
        #0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
        #1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
       irq event stamp: 1949217
       trace_hardirqs_on_thunk+0x1a/0x1c
       __do_softirq+0x2e7/0x57c
       __do_softirq+0x34c/0x57c
       irq_exit+0xa2/0xc0
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
       Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018
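
      The annotation itself is a one-liner at each plain read; a sketch of
      the pattern (surrounding drain logic abbreviated):

      	/* Sketch: tell KCSAN the lockless read is intentionally racy. */
      	if (data_race(pagevec_count(&per_cpu(lru_rotate_pvecs, cpu))))
      		queue_work_on(cpu, mm_percpu_wq, &per_cpu(lru_add_drain_work, cpu));
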
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace hpage_nr_pages with thp_nr_pages · 6c357848
      Committed by Matthew Wilcox (Oracle)
      The thp prefix is more frequently used than hpage, and we should be
      consistent between the various functions.
      
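      For reference, the helper is a thin wrapper; a sketch of its shape in
      this era of the kernel:

      	/* Sketch: number of base pages in this (possibly huge) page. */
      	static inline int thp_nr_pages(struct page *page)
      	{
      		VM_BUG_ON_PGFLAGS(PageTail(page), page);
      		if (PageHead(page))
      			return HPAGE_PMD_NR;
      		return 1;
      	}
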
      [akpm@linux-foundation.org: fix mm/migrate.c]
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 13 Aug 2020, 1 commit
    • mm/vmscan: protect the workingset on anonymous LRU · b518154e
      Committed by Joonsoo Kim
      In the current implementation, a newly created or swapped-in
      anonymous page starts out on the active list.  Growing the active
      list results in rebalancing the active/inactive lists, so old pages
      on the active list are demoted to the inactive list.  Hence, pages on
      the active list aren't protected at all.
      
      Following is an example of this situation.
      
      Assume there are 50 hot pages on the active list.  Numbers denote the
      number of pages on the active/inactive list (active | inactive).
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      This patch fixes this issue.  As with the file LRU, newly created or
      swapped-in anonymous pages will be inserted into the inactive list.
      They are promoted to the active list if enough references happen.
      This simple modification changes the above example as follows (a
      code-level sketch follows the example).
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      As you can see, hot pages on active list would be protected.
      
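      At fault/swap-in time the change amounts to swapping the LRU-add
      helper; a sketch (the rename matches this patch, the call site is
      abbreviated):

      	/* Before: new anon pages were activated up front. */
      	lru_cache_add_active_or_unevictable(page, vma);

      	/* After: start on the inactive list; promote only on reference. */
      	lru_cache_add_inactive_or_unevictable(page, vma);
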
      Note that this implementation has a drawback: a page cannot be
      promoted and will be swapped out if its re-access interval is greater
      than the size of the inactive list but less than the size of the
      total list (active + inactive).  To solve this potential issue, a
      following patch will apply workingset detection similar to the one
      that's already applied to the file LRU.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 17 Jul 2020, 1 commit
  13. 26 Jun 2020, 1 commit
  14. 04 Jun 2020, 6 commits
    • mm: swap: memcg: fix memcg stats for huge pages · 21e330fc
      Committed by Shakeel Butt
      Commit 2262185c ("mm: per-cgroup memory reclaim stats") added
      PGLAZYFREE, PGACTIVATE and PGDEACTIVATE stats for cgroups, but missed
      a couple of places, and PGLAZYFREE missed huge page handling.  Fix
      that.  Also, for PGLAZYFREE, use the irq-unsafe function to update
      the stat, as irqs are already disabled.
      
      Fixes: 2262185c ("mm: per-cgroup memory reclaim stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/20200527182947.251343-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: swap: fix vmstats for huge pages · 5d91f31f
      Committed by Shakeel Butt
      Many of the callbacks called by pagevec_lru_move_fn() do not
      correctly update the vmstats for huge pages.  Fix that.  Also have
      __pagevec_lru_add_fn() use the irq-unsafe alternative to update the
      stat, as irqs are already disabled.
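
      The pattern of the fix, sketched: account hpage_nr_pages() base pages
      instead of 1, and use the __-prefixed (irq-unsafe) counters where
      irqs are already off (callback body abbreviated):

      	/* Sketch: inside a pagevec_lru_move_fn() callback, irqs disabled. */
      	int nr_pages = hpage_nr_pages(page);	/* 1, or HPAGE_PMD_NR for THP */

      	del_page_from_lru_list(page, lruvec, page_lru(page));
      	SetPageActive(page);
      	add_page_to_lru_list(page, lruvec, page_lru(page));

      	__count_vm_events(PGACTIVATE, nr_pages);	/* irq-unsafe is safe here */
      	__count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, nr_pages);
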
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/20200527182916.249910-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: reclaim writepage is IO cost · 96f8bf4f
      Committed by Johannes Weiner
      The VM tries to balance reclaim pressure between anon and file so as to
      reduce the amount of IO incurred due to the memory shortage.  It already
      counts refaults and swapins, but in addition it should also count
      writepage calls during reclaim.
      
      For swap, this is obvious: it's IO that wouldn't have occurred if the
      anonymous memory hadn't been under memory pressure.  From a relative
      balancing point of view this makes sense as well: even if anon is cold and
      reclaimable, a cache that isn't thrashing may have equally cold pages that
      don't require IO to reclaim.
      
      For file writeback, it's trickier: some of the reclaim writepage IO would
      have likely occurred anyway due to dirty expiration.  But not all of it -
      premature writeback reduces batching and generates additional writes.
      Since the flushers are already woken up by the time the VM starts writing
      cache pages one by one, let's assume that we're likely causing writes that
      wouldn't have happened without memory pressure.  In addition, the per-page
      cost of IO would have probably been much cheaper if written in larger
      batches from the flusher thread rather than the single-page-writes from
      kswapd.
      
      For our purposes - getting the trend right to accelerate convergence on a
      stable state that doesn't require paging at all - this is sufficiently
      accurate.  If we later wanted to optimize for sustained thrashing, we can
      still refine the measurements.
      
      Count all writepage calls from kswapd as IO cost toward the LRU that the
      page belongs to.
      
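      Sketched, the accounting hooks in at reclaim time (simplified; the
      exact placement in mm/vmscan.c per this patch is abbreviated):

      	/* Sketch: in shrink_page_list(), count successful reclaim writepages... */
      	switch (pageout(page, mapping)) {
      	case PAGE_SUCCESS:
      		stat->nr_pageout += hpage_nr_pages(page);
      		/* ... existing success handling ... */
      		break;
      	/* ... other cases unchanged ... */
      	}

      	/* ...and later charge them as IO cost against the LRU they came from. */
      	lru_note_cost(lruvec, file, stat.nr_pageout);
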
      Why do this dynamically?  Don't we know in advance that anon pages require
      IO to reclaim, and so could build in a static bias?
      
      First, scanning is not the same as reclaiming.  If all the anon pages are
      referenced, we may not swap for a while just because we're scanning the
      anon list.  During this time, however, it's important that we age
      anonymous memory and the page cache at the same rate so that their
      hot-cold gradients are comparable.  Everything else being equal, we still
      want to reclaim the coldest memory overall.
      
      Second, we keep copies in swap unless the page changes.  If there is
      swap-backed data that's mostly read (tmpfs file) and has been swapped out
      before, we can reclaim it without incurring additional IO.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: determine anon/file pressure balance at the reclaim root · 7cf111bc
      Committed by Johannes Weiner
      We split the LRU lists into anon and file, and we rebalance the scan
      pressure between them when one of them begins thrashing: if the file cache
      experiences workingset refaults, we increase the pressure on anonymous
      pages; if the workload is stalled on swapins, we increase the pressure on
      the file cache instead.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, LRU pressure balancing is
      done on an individual cgroup LRU level.  As a result, when one cgroup is
      thrashing on the filesystem cache while a sibling may have cold anonymous
      pages, pressure doesn't get equalized between them.
      
      This patch moves LRU balancing decision to the root of reclaim - the same
      level where the LRU order is established.
      
      It does this by tracking LRU cost recursively, so that every level of the
      cgroup tree knows the aggregate LRU cost of all memory within its domain.
      When the page scanner calculates the scan balance for any given individual
      cgroup's LRU list, it uses the values from the ancestor cgroup that
      initiated the reclaim cycle.
      
      If one sibling is then thrashing on the cache, it will tip the pressure
      balance inside its ancestors, and the next hierarchical reclaim iteration
      will go more after the anon pages in the tree.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: balance LRU lists based on relative thrashing · 314b57fb
      Committed by Johannes Weiner
      Since the LRUs were split into anon and file lists, the VM has been
      balancing between page cache and anonymous pages based on per-list ratios
      of scanned vs.  rotated pages.  In most cases that tips page reclaim
      towards the list that is easier to reclaim and has the fewest actively
      used pages, but there are a few problems with it:
      
      1. Refaults and LRU rotations are weighted the same way, even though
         one costs IO and the other costs a bit of CPU.
      
      2. The less we scan an LRU list based on already observed rotations,
         the more we increase the sampling interval for new references, and
         rotations become even more likely on that list. This can enter a
         death spiral in which we stop looking at one list completely until
         the other one is all but annihilated by page reclaim.
      
      Since commit a528910e ("mm: thrash detection-based file cache sizing")
      we have refault detection for the page cache.  Along with swapin events,
      they are good indicators of when the file or anon list, respectively, is
      too small for its workingset and needs to grow.
      
      For example, if the page cache is thrashing, the cache pages need more
      time in memory, while there may be colder pages on the anonymous list.
      Likewise, if swapped pages are faulting back in, it indicates that we
      reclaim anonymous pages too aggressively and should back off.
      
      Replace LRU rotations with refaults and swapins as the basis for relative
      reclaim cost of the two LRUs.  This will have the VM target list balances
      that incur the least amount of IO on aggregate.
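
      Both signals feed the same cost accounting; a sketch of where they
      hook in (function bodies abbreviated; lru_note_cost_page() charges
      the cost to the LRU the page belongs to):

      	/* Sketch: a refaulting cache page shows the file list is too small. */
      	void workingset_refault(struct page *page, void *shadow)
      	{
      		/* ... detect the refault from the shadow entry ... */
      		lru_note_cost_page(page);
      	}

      	/*
      	 * Sketch: a swapin shows the anon list is squeezed too hard; the
      	 * swap-in path makes the equivalent lru_note_cost_page(page) call.
      	 */
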
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: deactivations shouldn't bias the LRU balance · fbbb602e
      Committed by Johannes Weiner
      Operations like MADV_FREE, FADV_DONTNEED etc.  currently move any affected
      active pages to the inactive list to accelerate their reclaim (good) but
      also steer page reclaim toward that LRU type, or away from the other
      (bad).
      
      The reason why this is undesirable is that such operations are not part of
      the regular page aging cycle, and rather a fluke that doesn't say much
      about the remaining pages on that list; they might all be in heavy use,
      and once the chunk of easy victims has been purged, the VM continues to
      apply elevated pressure on those remaining hot pages.  The other LRU,
      meanwhile, might have easily reclaimable pages, and there was never a need
      to steer away from it in the first place.
      
      As the previous patch outlined, we should focus on recording actually
      observed cost to steer the balance rather than speculating about the
      potential value of one LRU list over the other.  In that spirit, leave
      explicitly deactivated pages to the LRU algorithm to pick up, and let
      rotations decide which list is the easiest to reclaim.
      
      [cai@lca.pw: fix set-but-not-used warning]
        Link: http://lkml.kernel.org/r/20200522133335.GA624@Qians-MacBook-Air.local
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200520232525.798933-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>