1. 14 Jul, 2021 (2 commits)
    • mm/compaction: do page isolation first in compaction · d74ec531
      Authored by Alex Shi
      mainline inclusion
      from mainline-v5.11-rc1
      commit 9df41314
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZF7C?from=project-issue
      CVE: NA
      
      --------------------------------------
      
      Currently, compaction takes the lru_lock and then does page isolation,
      which works fine with pgdat->lru_lock, since any page isolation would
      compete for the lru_lock.  If we want to change to a memcg lru_lock, we
      have to isolate the page before taking the lru_lock; isolation then
      blocks the page's memcg change, which relies on page isolation too.
      Then we can safely use a per-memcg lru_lock later.
      
      The new page isolation uses the previously introduced TestClearPageLRU()
      plus pgdat lru locking, which will be changed to a memcg lru lock later.
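
      As a rough sketch of the resulting order in isolate_migratepages_block()
      (names follow the mainline code, but the control flow shown here is
      illustrative, not a verbatim copy of the patch):

          /* take a reference first so the page cannot be freed under us */
          if (unlikely(!get_page_unless_zero(page)))
                  goto isolate_fail;

          if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
                  goto isolate_fail_put;

          /* clearing PG_lru pins the page's memcg, blocking migration */
          if (!TestClearPageLRU(page))
                  goto isolate_fail_put;

          /* only now is it safe to take the (still per-pgdat) lru lock */
          spin_lock_irq(&pgdat->lru_lock);
          del_page_from_lru_list(page, lruvec, page_lru(page));
          spin_unlock_irq(&pgdat->lru_lock);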
      
      Hugh Dickins <hughd@google.com> fixed following bugs in this patch's early
      version:
      
      Fix lots of crashes under compaction load: isolate_migratepages_block()
      must clean up appropriately when rejecting a page, setting PageLRU again
      if it had been cleared; and a put_page() after get_page_unless_zero()
      cannot safely be done while holding locked_lruvec - it may turn out to be
      the final put_page(), which will take an lruvec lock when PageLRU.
      
      And move __isolate_lru_page_prepare back after get_page_unless_zero to
      make trylock_page() safe: trylock_page() is not safe to use at this time:
      its setting PG_locked can race with the page being freed or allocated
      ("Bad page"), and can also erase flags being set by one of those "sole
      owners" of a freshly allocated page who use non-atomic __SetPageFlag().
      
      Link: https://lkml.kernel.org/r/1604566549-62481-16-git-send-email-alex.shi@linux.alibaba.com
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: chenwandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • mm/thp: move lru_add_page_tail() to huge_memory.c · 1604052c
      Authored by Alex Shi
      mainline inclusion
      from mainline-v5.11-rc1
      commit 88dcb9a3
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZF7C?from=project-issue
      CVE: NA
      
      --------------------------------------
      
      Patch series "per memcg lru lock", v21.
      
      This patchset includes 3 parts:
      
       1) some code cleanup and minimum optimization as preparation
      
       2) use TestClearPageLRU as page isolation's precondition
      
       3) replace per node lru_lock with per memcg per node lru_lock
      
      Currently there is one lru_lock per node, pgdat->lru_lock, guarding the
      lru lists, but the lru lists were moved into memcg a long time ago.
      Still using a per-node lru_lock is clearly unscalable: pages of
      different memcgs have to compete with each other for a single lru_lock.
      This patchset uses a per-lruvec/memcg lru_lock to replace the per-node
      lru lock guarding the lru lists, making it scalable for memcgs and
      gaining performance.
      
      Currently lru_lock still guards both the lru list and the page's lru
      bit; that's fine.  But if we want to use a specific lruvec lock for the
      page, we need to pin down the page's lruvec/memcg during locking.  Just
      taking the lruvec lock first may be undermined by the page's memcg
      charge/migration.  To fix this problem, we take the clearing of the
      page's lru bit and use it as a pin-down action to block memcg changes.
      That's the reason for the new atomic function TestClearPageLRU.  So
      isolating a page now needs both actions: TestClearPageLRU and holding
      the lru_lock.
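
      A minimal sketch of the two-step isolation this enables (assuming the
      per-node lock; later patches in the series swap in the per-memcg
      lruvec lock):

          if (TestClearPageLRU(page)) {
                  /* lru bit cleared: the page's memcg can no longer change */
                  spin_lock_irq(&pgdat->lru_lock);
                  del_page_from_lru_list(page, lruvec, page_lru(page));
                  spin_unlock_irq(&pgdat->lru_lock);
          }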
      
      The typical user of this is isolate_migratepages_block() in
      compaction.c: we have to take the lru bit before the lru lock, which
      serializes page isolation against memcg page charge/migration, since
      those change the page's lruvec and hence the lru_lock within it.
      
      The above solution was suggested by Johannes Weiner and builds on his
      new memcg charge path; that led to this patchset.  (Hugh Dickins tested
      and contributed much code, from the compaction fix to general code
      polish, thanks a lot!)
      
      Daniel Jordan's testing showed a 62% improvement on a modified
      readtwice case on his 2P * 10-core * 2-HT Broadwell box with v18, which
      does not differ much from this v20.
      
       https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
      
      Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up
      this idea 8 years ago, and to the others who gave comments as well:
      Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander
      Duyck, etc.

      Thanks for the testing support from Intel 0day and Rong Chen, Fengguang
      Wu, and Yun Wang.  Hugh Dickins also shared his kbuild-swap case.
      
      This patch (of 19):
      
      lru_add_page_tail() is only used in huge_memory.c; defining it in
      another file under a CONFIG_TRANSPARENT_HUGEPAGE macro restriction just
      looks weird.

      Let's move it into the THP code, and make it static as Hugh Dickins
      suggested.
      
      Link: https://lkml.kernel.org/r/1604566549-62481-1-git-send-email-alex.shi@linux.alibaba.com
      Link: https://lkml.kernel.org/r/1604566549-62481-2-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: chenwandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 14 Apr, 2021 (1 commit)
  3. 09 Apr, 2021 (1 commit)
  4. 14 Oct, 2020 (3 commits)
  5. 24 Sep, 2020 (1 commit)
    • mm: split swap_type_of · 21bd9005
      Authored by Christoph Hellwig
      swap_type_of is used for two entirely different purposes:
      
       (1) check what swap type a given device/offset corresponds to
       (2) find the first available swap device that can be written to
      
      Mixing both in a single function creates an unreadable mess.  Create two
      separate functions instead, and switch both to pass a dev_t instead of
      a struct block_device to further simplify the code.
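
      A sketch of the two resulting interfaces (signatures as they landed in
      mainline; bodies elided):

          /* (1) which swap type does this device/offset correspond to? */
          int swap_type_of(dev_t device, sector_t offset);

          /* (2) first available writable swap device, for hibernation */
          int find_first_swap(dev_t *device);
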
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 13 Aug, 2020 (3 commits)
  7. 08 Aug, 2020 (2 commits)
  8. 26 Jun, 2020 (1 commit)
  9. 04 Jun, 2020 (5 commits)
    • mm: vmscan: reclaim writepage is IO cost · 96f8bf4f
      Authored by Johannes Weiner
      The VM tries to balance reclaim pressure between anon and file so as to
      reduce the amount of IO incurred due to the memory shortage.  It already
      counts refaults and swapins, but in addition it should also count
      writepage calls during reclaim.
      
      For swap, this is obvious: it's IO that wouldn't have occurred if the
      anonymous memory hadn't been under memory pressure.  From a relative
      balancing point of view this makes sense as well: even if anon is cold and
      reclaimable, a cache that isn't thrashing may have equally cold pages that
      don't require IO to reclaim.
      
      For file writeback, it's trickier: some of the reclaim writepage IO would
      have likely occurred anyway due to dirty expiration.  But not all of it -
      premature writeback reduces batching and generates additional writes.
      Since the flushers are already woken up by the time the VM starts writing
      cache pages one by one, let's assume that we're likely causing writes that
      wouldn't have happened without memory pressure.  In addition, the per-page
      cost of IO would have probably been much cheaper if written in larger
      batches from the flusher thread rather than the single-page-writes from
      kswapd.
      
      For our purposes - getting the trend right to accelerate convergence on a
      stable state that doesn't require paging at all - this is sufficiently
      accurate.  If we later wanted to optimize for sustained thrashing, we can
      still refine the measurements.
      
      Count all writepage calls from kswapd as IO cost toward the LRU that the
      page belongs to.
      
      Why do this dynamically?  Don't we know in advance that anon pages require
      IO to reclaim, and so could build in a static bias?
      
      First, scanning is not the same as reclaiming.  If all the anon pages are
      referenced, we may not swap for a while just because we're scanning the
      anon list.  During this time, however, it's important that we age
      anonymous memory and the page cache at the same rate so that their
      hot-cold gradients are comparable.  Everything else being equal, we still
      want to reclaim the coldest memory overall.
      
      Second, we keep copies in swap unless the page changes.  If there is
      swap-backed data that's mostly read (tmpfs file) and has been swapped out
      before, we can reclaim it without incurring additional IO.
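
      A hedged sketch of where the accounting hooks in (the placement and
      guard shown here are illustrative; the real hunk lives in mm/vmscan.c's
      reclaim writeback path):

          /* a reclaim-driven writepage is IO cost on this page's LRU */
          if (current_is_kswapd())
                  lru_note_cost_page(page);
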
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: balance LRU lists based on relative thrashing · 314b57fb
      Authored by Johannes Weiner
      Since the LRUs were split into anon and file lists, the VM has been
      balancing between page cache and anonymous pages based on per-list ratios
      of scanned vs.  rotated pages.  In most cases that tips page reclaim
      towards the list that is easier to reclaim and has the fewest actively
      used pages, but there are a few problems with it:
      
      1. Refaults and LRU rotations are weighted the same way, even though
         one costs IO and the other costs a bit of CPU.
      
      2. The less we scan an LRU list based on already observed rotations,
         the more we increase the sampling interval for new references, and
         rotations become even more likely on that list. This can enter a
         death spiral in which we stop looking at one list completely until
         the other one is all but annihilated by page reclaim.
      
      Since commit a528910e ("mm: thrash detection-based file cache sizing")
      we have refault detection for the page cache.  Along with swapin events,
      they are good indicators of when the file or anon list, respectively, is
      too small for its workingset and needs to grow.
      
      For example, if the page cache is thrashing, the cache pages need more
      time in memory, while there may be colder pages on the anonymous list.
      Likewise, if swapped pages are faulting back in, it indicates that we
      reclaim anonymous pages too aggressively and should back off.
      
      Replace LRU rotations with refaults and swapins as the basis for relative
      reclaim cost of the two LRUs.  This will have the VM target list balances
      that incur the least amount of IO on aggregate.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: base LRU balancing on an explicit cost model · 1431d4d1
      Authored by Johannes Weiner
      Currently, scan pressure between the anon and file LRU lists is balanced
      based on a mixture of reclaim efficiency and a somewhat vague notion of
      "value" of having certain pages in memory over others.  That concept of
      value is problematic, because it has caused us to count any event that
      remotely makes one LRU list more or less preferable for reclaim, even
      when these events are not directly comparable and impose very different
      costs on the system.  One example is referenced file pages that we still
      deactivate and referenced anonymous pages that we actually rotate back to
      the head of the list.
      
      There is also conceptual overlap with the LRU algorithm itself.  By
      rotating recently used pages instead of reclaiming them, the algorithm
      already biases the applied scan pressure based on page value.  Thus, when
      rebalancing scan pressure due to rotations, we should think of reclaim
      cost, and leave assessing the page value to the LRU algorithm.
      
      Lastly, considering both value-increasing as well as value-decreasing
      events can sometimes cause the same type of event to be counted twice,
      i.e.  how rotating a page increases the LRU value, while reclaiming it
      successfully decreases the value.  In itself this will balance out fine,
      but it quietly skews the impact of events that are only recorded once.
      
      The abstract metric of "value", the murky relationship with the LRU
      algorithm, and accounting both negative and positive events make the
      current pressure balancing model hard to reason about and modify.
      
      This patch switches to a balancing model of accounting the concrete,
      actually observed cost of reclaiming one LRU over another.  For now, that
      cost includes pages that are scanned but rotated back to the list head.
      Subsequent patches will add consideration for IO caused by refaulting of
      recently evicted pages.
      
      Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
      make everything that affects cost go through a new lru_note_cost()
      function.
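
      A simplified sketch of the new entry point (the real function also
      propagates the cost up the memcg hierarchy and ages the counters):

          void lru_note_cost(struct lruvec *lruvec, bool file,
                             unsigned int nr_pages)
          {
                  if (file)
                          lruvec->file_cost += nr_pages;
                  else
                          lruvec->anon_cost += nr_pages;
          }
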
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() · 6058eaec
      Authored by Johannes Weiner
      They're the same function, and for the purpose of all callers they are
      equivalent to lru_cache_add().
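
      Callers that previously picked one of the two helpers now simply do:

          lru_cache_add(page);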
      
      [akpm@linux-foundation.org: fix it for local_lock changes]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: move out cgroup swaprate throttling · 6caa6a07
      Authored by Johannes Weiner
      The cgroup swaprate throttling is about matching new anon allocations to
      the rate of available IO when that is being throttled.  It's the io
      controller hooking into the VM, rather than a memory controller thing.
      
      Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and
      drop the @memcg argument which is only used to check whether the preceding
      page charge has succeeded and the fault is proceeding.
      
      We could decouple the call from mem_cgroup_try_charge() here as well, but
      that would cause unnecessary churn: the following patches convert all
      callsites to a new charge API and we'll decouple as we go along.
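
      After the rename, call sites look like this (sketch; the new helper
      takes just the page and the gfp mask):

          cgroup_throttle_swaprate(page, gfp_mask);
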
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 03 Jun, 2020 (3 commits)
    • include/linux/swap.h: delete meaningless __add_to_swap_cache() declaration · 251af0cd
      Authored by Miaohe Lin
      Since commit 8d93b41c ("mm: Convert add_to_swap_cache to XArray"),
      __add_to_swap_cache and add_to_swap_cache have been combined into one
      function, and __add_to_swap_cache() no longer exists.
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Link: http://lkml.kernel.org/r/1590810326-2493-1-git-send-email-linmiaohe@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: reduce lock contention on swap cache from swap slots allocation · 49070588
      Authored by Huang Ying
      In some swap scalability tests, it was found that there is heavy lock
      contention on the swap cache even though we have split the single swap
      cache radix tree per swap device into one radix tree per 64 MB trunk,
      in commit 4b3ef9da ("mm/swap: split swap cache into 64MB trunks").

      The reason is as follows.  After the swap device becomes fragmented so
      that there's no free swap cluster, the swap device will be scanned
      linearly to find free swap slots.  swap_info_struct->cluster_next is
      the next scanning base and is shared by all CPUs, so nearby free swap
      slots will be allocated to different CPUs.  The probability of multiple
      CPUs operating on the same 64 MB trunk is high.  This causes the lock
      contention on the swap cache.
      
      To solve the issue, this patch adds, for SSD swap devices, a per-CPU
      next scanning base (cluster_next_cpu).  Every CPU will use its own
      per-CPU next scanning base.  And after finishing scanning a 64 MB
      trunk, the per-CPU scanning base will be changed to the beginning of
      another randomly selected 64 MB trunk.  In this way, the probability of
      multiple CPUs operating on the same 64 MB trunk is greatly reduced, and
      thus the lock contention is reduced too.  For HDDs, because sequential
      access is more important for IO performance, the original shared next
      scanning base is kept.
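
      A rough sketch of the scan-base selection (condensed from
      scan_swap_map_slots(); the random trunk reseeding is elided):

          /* SSDs get a per-CPU scan base; HDDs keep the shared one */
          if (si->flags & SWP_SOLIDSTATE)
                  scan_base = this_cpu_read(*si->cluster_next_cpu);
          else
                  scan_base = si->cluster_next;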
      
      To test the patch, we ran a 16-process pmbench memory benchmark on a
      2-socket server machine with 48 cores.  One ram disk is configured as
      the swap device per socket.  The pmbench working-set size is much
      larger than the available memory so that swapping is triggered.  The
      memory read/write ratio is 80/20 and the access pattern is random.  In
      the original implementation, the lock contention on the swap cache is
      heavy.  The perf profiling data of the lock contention code path is as
      follows:
      
       _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.91
       _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:               7.11
       _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
       _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     1.66
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.29
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.03
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        0.93
      
      After applying this patch, it becomes,
      
       _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
       _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      2.3
       _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     2.26
       _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        1.8
       _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.19
      
      The lock contention on the swap cache is almost eliminated.
      
      And the pmbench score increases by 18.5%.  The swapin throughput
      increases by 18.7%, from 2.96 GB/s to 3.51 GB/s, while the swapout
      throughput increases by 18.5%, from 2.99 GB/s to 3.54 GB/s.
      
      We need a really fast disk to show the benefit.  I have tried this on 2
      Intel P3600 NVMe disks; the performance improvement is only about 1%.
      The improvement should be better on faster disks, such as an Intel
      Optane disk.
      
      [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
        Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
      [ying.huang@intel.com: v4]
        Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile.c: classify SWAP_MAP_XXX to make it more readable · 4b4bb6bb
      Authored by Wei Yang
      swap_info_struct->swap_map[] encodes flags and a count.  And to support
      some condition checks, it also introduces some special values.

      Currently those macros are defined in a somewhat magic order, which
      makes it hard for the audience to understand their exact meaning.

      This patch splits those macros into three categories:

          flag
          special value for first swap_map
          special value for continued swap_map

      May this help the audience a little.
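
      The resulting grouping in include/linux/swap.h looks roughly like this
      (values as in the mainline header):

          /* Bit flags in swap_map */
          #define SWAP_HAS_CACHE  0x40  /* page is in swap cache */
          #define COUNT_CONTINUED 0x80  /* swap_map continuation */

          /* Special values in first swap_map */
          #define SWAP_MAP_MAX    0x3e  /* max count */
          #define SWAP_MAP_BAD    0x3f  /* page is bad */
          #define SWAP_MAP_SHMEM  0xbf  /* owned by shmem/tmpfs */

          /* Special value in each swap_map continuation */
          #define SWAP_CONT_MAX   0x7f  /* max count */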
      
      [akpm@linux-foundation.org: tweak capitalization in comments]
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200501015259.32237-1-richard.weiyang@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 28 May, 2020 (1 commit)
    • mm/swap: Use local_lock for protection · b01b2141
      Authored by Ingo Molnar
      The various struct pagevec per CPU variables are protected by disabling
      either preemption or interrupts across the critical sections. Inside
      these sections spinlocks have to be acquired.
      
      These spinlocks are regular spinlock_t types which are converted to
      "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
      locks cannot be acquired in preemption or interrupt disabled sections.
      
      local locks provide a trivial way to substitute preempt and interrupt
      disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
      to preempt_disable() and local_lock_irq() to local_irq_disable().
      
      Create lru_rotate_pvecs containing the pagevec and the locallock.
      Create lru_pvecs containing the remaining pagevecs and the locallock.
      Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
      exporting the pvec structure.
      
      Change the relevant call sites to acquire these locks instead of using
      preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
      local_irq_save().
      
      There is neither a functional change nor a change in the generated
      binary code for non PREEMPT_RT enabled non-debug kernels.
      
      When lockdep is enabled, local locks have lockdep maps embedded.  These
      allow lockdep to validate the protections, i.e. inappropriate usage of
      a preemption-only protected section would result in a lockdep warning,
      while the same problem would not be noticed with a plain
      preempt_disable()-based protection.
      
      local locks also improve readability as they provide a named scope for
      the protections, while preempt/interrupt disable are opaque and scopeless.
      
      Finally local locks allow PREEMPT_RT to substitute them with real
      locking primitives to ensure the correctness of operation in a fully
      preemptible kernel.
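
      A condensed sketch of the pattern (after this change in mm/swap.c;
      only one pagevec is shown):

          struct lru_pvecs {
                  local_lock_t lock;
                  struct pagevec lru_add;
                  /* ... the other pagevecs ... */
          };
          static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
                  .lock = INIT_LOCAL_LOCK(lock),
          };

          void lru_cache_add(struct page *page)
          {
                  struct pagevec *pvec;

                  get_page(page);
                  local_lock(&lru_pvecs.lock);    /* was: get_cpu_var() */
                  pvec = this_cpu_ptr(&lru_pvecs.lru_add);
                  if (!pagevec_add(pvec, page) || PageCompound(page))
                          __pagevec_lru_add(pvec);
                  local_unlock(&lru_pvecs.lock);
          }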
      
      [ bigeasy: Adopted to use local_lock ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
  12. 19 Apr, 2020 (1 commit)
    • swap.h: Replace zero-length array with flexible-array member · 16c3380f
      Authored by Gustavo A. R. Silva
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
  13. 03 Apr, 2020 (1 commit)
    • mm: swap: make page_evictable() inline · 1eb6234e
      Authored by Yang Shi
      When backporting commit 9c4e6b1a ("mm, mlock, vmscan: no more skipping
      pagevecs") to our 4.9 kernel, our test bench noticed around a 10%
      regression with a couple of vm-scalability's test cases
      (lru-file-readonce, lru-file-readtwice and lru-file-mmap-read).  I
      didn't see that much of a drop on my VM (32c-64g-2nodes).  It might be
      caused by the test configuration, which is 32c-256g with NUMA disabled,
      and the tests were run in the root memcg, so the tests actually stress
      only one inactive and one active lru.  That is not very usual in a
      modern production environment.
      
      That commit made two major changes:
      1. Call page_evictable()
      2. Use smp_mb to force the PG_lru set to be visible
      
      It looks like they contribute most of the overhead.  page_evictable()
      is a function which does a function prologue and epilogue, and it was
      used by the page reclaim path only.  However, lru add is a very hot
      path, so it is better to make it inline.  It calls page_mapping(),
      which is not inlined either, but disassembly shows that page_mapping()
      doesn't do push and pop operations, and it doesn't look very
      straightforward to inline it.
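
      The helper after the move, for reference (as it appears in
      mm/internal.h following this change):

          static inline bool page_evictable(struct page *page)
          {
                  bool ret;

                  /* Prevent address_space of inode and swap cache from being freed */
                  rcu_read_lock();
                  ret = !mapping_unevictable(page_mapping(page)) &&
                        !PageMlocked(page);
                  rcu_read_unlock();
                  return ret;
          }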
      
      Other than this, it seems smp_mb() is not necessary for x86 since
      SetPageLRU is atomic, which enforces a memory barrier already; replace
      it with smp_mb__after_atomic() in the following patch.
      
      With the two fixes applied, the tests get back around 5% on that test
      bench and get back to normal on my VM.  Since the test bench
      configuration is not that usual and I also saw around a 6% improvement
      on the latest upstream, it sounds good enough IMHO.
      
      The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
      	mainline	w/ inline fix
                150MB            154MB
      
      With this patch the throughput is 2.67% higher.  The data with
      smp_mb__after_atomic() is shown in the following patch.
      
      Shakeel Butt did the below test:
      
      On a real machine, with the 'dd' limited to a single node and reading a
      100 GiB sparse file (less than a single node).  Just a single instance
      was run, to not cause lru lock contention.  The cmdline used is "dd
      if=file-100GiB of=/dev/null bs=4k".  The cmd was run 10 times with
      drop_caches in between, and the time it took was measured.
      
      Without patch: 56.64143 +- 0.672 sec
      
      With patches: 56.10 +- 0.21 sec
      
      [akpm@linux-foundation.org: move page_evictable() to internal.h]
      Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 02 Dec, 2019 (1 commit)
    • mm: vmscan: detect file thrashing at the reclaim root · b910718a
      Authored by Johannes Weiner
      We use refault information to determine whether the cache workingset is
      stable or transitioning, and dynamically adjust the inactive:active file
      LRU ratio so as to maximize protection from one-off cache during stable
      periods, and minimize IO during transitions.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, refaults only affect the
      local LRU order in the cgroup in which they are occurring.  As a result,
      cache transitions can take longer in a cgrouped system as the active pages
      of sibling cgroups aren't challenged when they should be.
      
      [ Right now, this is somewhat theoretical, because the siblings, under
        continued regular reclaim pressure, should eventually run out of
        inactive pages - and since inactive:active *size* balancing is also
        done on a cgroup-local level, we will challenge the active pages
        eventually in most cases. But the next patch will move that relative
        size enforcement to the reclaim root as well, and then this patch
        here will be necessary to propagate refault pressure to siblings. ]
      
      This patch moves refault detection to the root of reclaim.  Instead of
      remembering the cgroup owner of an evicted page, remember the cgroup that
      caused the reclaim to happen.  When refaults later occur, they'll
      correctly influence the cross-cgroup LRU order that reclaim follows.
      
      I.e.  if global reclaim kicked out pages in some subgroup A/B/C, the
      refault of those pages will challenge the global LRU order, and not just
      the local order down inside C.
      
      [hannes@cmpxchg.org:  use page_memcg() instead of another lookup]
        Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
      Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 26 Sep, 2019 (2 commits)
    • mm: introduce MADV_PAGEOUT · 1a4e58cc
      Authored by Minchan Kim
      When a process expects no accesses to a certain memory range for a long
      time, it can hint to the kernel that the pages can be reclaimed
      instantly but the data should be preserved for future use.  This could
      reduce workingset eviction and so end up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
      syscall.  MADV_PAGEOUT can be used by a process to mark a memory range
      as not expected to be used for a long time, so that the kernel reclaims
      *any LRU* pages instantly.  The hint can help the kernel decide which
      pages to evict proactively.
      
      A note: it intentionally doesn't apply the SWAP_CLUSTER_MAX LRU page
      isolation limit because it's automatically bounded by the PMD size.  If
      the PMD size (e.g., 256) causes trouble, we could fix it later by
      limiting it to SWAP_CLUSTER_MAX [1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future, so pages in the specified
      regions can be reclaimed instantly regardless of memory pressure.
      Thus, an access in the range after a successful operation could cause a
      major page fault, but the up-to-date contents are never lost, unlike
      with MADV_DONTNEED.  Pages belonging to a shared mapping are only
      processed if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
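
      A minimal userspace usage sketch (error handling kept short; needs a
      kernel and libc headers that define MADV_PAGEOUT):

          #include <sys/mman.h>
          #include <stdio.h>

          /* ask the kernel to reclaim this cold range right away */
          if (madvise(addr, length, MADV_PAGEOUT) != 0)
                  perror("madvise(MADV_PAGEOUT)");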
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce MADV_COLD · 9c276cc6
      Authored by Minchan Kim
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an
      app from scratch is a cold start, while resuming an existing app is a
      hot start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as
      well as faster, so we are trying to make hot starts more likely than
      cold starts.
      
      To increase the likelihood of hot starts, Android userspace manages the
      order in which apps should be killed in a process called
      ActivityManagerService.  ActivityManagerService tracks every Android
      app or service that the user could be interacting with at any time and
      translates that into a ranked list for lmkd (the low memory killer
      daemon).  They are likely to be killed by lmkd if the system has to
      reclaim memory.  In that sense they are similar to entries in any other
      cache.  Those apps are kept alive for opportunistic performance
      improvements, but those performance improvements will vary based on the
      memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap, even though they
      are good candidates for swap.  Under investigation, swapping out only
      begins once the low zone watermark is hit and kswapd wakes up, but the
      overall allocation rate in the system might trip lmkd thresholds and
      cause a cached process to be killed (we measured the performance of
      swapping out vs. zapping the memory by killing a process;
      unsurprisingly, zapping is 10x faster even though we use zram, which is
      much faster than real storage), so a kill from lmkd will often satisfy
      the high zone watermark, resulting in very few pages actually being
      moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform
      information.  This allowed us to bypass the inaccuracy of the kernel’s
      LRUs for pages that are known to be cold from userspace, and to avoid
      races with lmkd by reclaiming apps as soon as they entered the cached
      state.  Additionally, it gives the platform many chances to use its
      information to optimize memory efficiency.
      
      To achieve the goal, the patchset introduces two new options for
      madvise.  One is MADV_COLD, which will deactivate activated pages, and
      the other is MADV_PAGEOUT, which will reclaim private pages instantly.
      These new options complement MADV_DONTNEED and MADV_FREE by adding
      non-destructive ways to gain some free memory space.  MADV_PAGEOUT is
      similar to MADV_DONTNEED in that it hints to the kernel that the memory
      region is not currently needed and should be reclaimed immediately;
      MADV_COLD is similar to MADV_FREE in that it hints to the kernel that
      the memory region is not currently needed and should be reclaimed when
      memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it can
      give the kernel a hint that the pages can be reclaimed when memory
      pressure happens, but the data should be preserved for future use.
      This could reduce workingset eviction and so end up increasing
      performance.

      This patch introduces the new MADV_COLD hint to the madvise(2)
      syscall.  MADV_COLD can be used by a process to mark a memory range as
      not expected to be used in the near future.  The hint can help the
      kernel decide which pages to evict early during memory pressure.
      
      It works on every LRU page, like MADV_[DONTNEED|FREE].  IOW, it moves

      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the
      inactive file LRU's head, because MADV_COLD has slightly different
      semantics.  MADV_FREE means it's okay to discard the page under memory
      pressure because its content is *garbage*, so freeing such pages has
      almost zero overhead: we don't need to swap them out, and a later
      access causes just a minor fault.  Thus, it makes sense to put those
      freeable pages on the inactive file LRU to compete with other used-once
      pages.  It also makes sense from an implementation point of view,
      because such a page is no longer swap-backed memory until it is
      re-dirtied.  It could even be a bonus that they can be reclaimed on a
      swapless system.  However, MADV_COLD doesn't mean garbage, so
      reclaiming such pages eventually requires swap-out/in, which is a
      bigger cost.  Since VM LRU aging is designed around a cost model, cold
      anonymous pages are better positioned on the inactive anon LRU list,
      not the file LRU.  Furthermore, that helps avoid unnecessary scanning
      if the system doesn't have a swap device.  Let's start with the simpler
      way, without adding complexity at this moment.  Keep in mind the
      caveat, though, that workloads with a lot of page cache are likely to
      effectively ignore MADV_COLD on anonymous memory, because we rarely age
      anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
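
      A minimal userspace usage sketch (needs headers that define MADV_COLD):

          #include <sys/mman.h>

          /* deactivate the range; contents are preserved, unlike MADV_FREE */
          madvise(addr, length, MADV_COLD);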
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 13 Jul, 2019 (2 commits)
    • mm, swap: use rbtree for swap_extent · 4efaceb1
      Authored by Aaron Lu
      swap_extent is used to map a swap page offset to the backing device's
      block offset.  For a continuous block range, one swap_extent is used,
      and all these swap_extents are managed in a linked list.

      These swap_extents are used by map_swap_entry() during swap's read and
      write paths.  To find the backing device's block offset for a page
      offset, the swap_extent list is traversed linearly, with
      curr_swap_extent used as a cache to speed up the search.
      
      This works well as long as swap_extents are not huge or the number of
      processes that access the swap device is small, but when the swap
      device has many extents and a number of processes access the swap
      device concurrently, it can be a problem.  On one of our servers, the
      disk's remaining size is tight:
      
        $df -h
        Filesystem      Size  Used Avail Use% Mounted on
        ... ...
        /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4
      
      When creating an 80G swapfile there, there are as many as 84656 swap
      extents.  The end result is that the kernel spends about 30% of its
      time in map_swap_entry() and swap throughput is only 70MB/s.
      
      As a comparison, when I used a smaller swapfile, like 4G, whose
      swap_extent count dropped to 2000, swap throughput was back to
      400-500MB/s and map_swap_entry() was about 3%.
      
      One downside of using an rbtree for swap_extent is that 'struct
      rb_node' takes 24 bytes while 'struct list_head' takes 16 bytes, that's
      8 bytes more for each swap_extent.  For a swapfile that has 80k
      swap_extents, that means 625KiB more memory consumed.
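
      The structural change, roughly (field layout as in the mainline
      header):

          struct swap_extent {
                  struct rb_node rb_node;    /* was: struct list_head list */
                  pgoff_t start_page;
                  pgoff_t nr_pages;
                  sector_t start_block;
          };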
      
      Test:
      
      Since it's not possible to reboot that server, I cannot test this patch
      directly there.  Instead, I tested it on another server with an NVMe disk.
      
      I created a 20G swapfile on an NVMe-backed XFS fs.  By default, the
      filesystem is quite clean and the created swapfile has only 2 extents.
      Testing vanilla and this patch shows no obvious performance difference
      when the swapfile is not fragmented.
      
      To see the patch's effects, I used some tweaks to manually fragment the
      swapfile by breaking the extents at 1M boundaries.  This made the
      swapfile have 20K extents.
      
        nr_task=4
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  165191           90.77%             171798          90.21%
        patched  858993 +420%      2.16%             715827 +317%     0.77%
      
        nr_task=8
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  306783           92.19%             318145          87.76%
        patched  954437 +211%      2.35%            1073741 +237%     1.57%
      
      swapout: the throughput of swap out, in KB/s, higher is better
      1st map_swap_entry: cpu cycles percent sampled by perf
      swapin: the throughput of swap in, in KB/s, higher is better
      2nd map_swap_entry: cpu cycles percent sampled by perf
      
      nr_task=1 doesn't show any difference; this is because curr_swap_extent
      can be effectively used to cache the correct swap extent for a
      single-task workload.
      
      [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
      Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
      Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: fix race between swapoff and some swap operations · eb085574
      Authored by Huang Ying
      When swapin is performed, after getting the swap entry information from
      the page table, the system will swap in the swap entry without any lock
      held to prevent the swap device from being swapped off.  This may cause
      a race like the one below:
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done only when the system shuts down, the
      race may not hit many people in practice.  But it is still a race that
      needs to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the
      specified swap entry is valid in its swap device.  If so, it will keep
      the swap entry valid by preventing the swap device from being swapped
      off, until put_swap_device() is called.
      
      Because swapoff() is a very rare code path, to make the normal path run
      as fast as possible, rcu_read_lock/unlock() and synchronize_rcu(),
      instead of a reference count, are used to implement
      get/put_swap_device().  From get_swap_device() to put_swap_device(),
      the RCU read side is locked, so synchronize_rcu() in swapoff() will
      wait until put_swap_device() is called.
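
      A sketch of the resulting usage pattern (see __read_swap_cache_async()
      and friends):

          struct swap_info_struct *si;

          si = get_swap_device(entry);   /* rcu_read_lock() + validity check */
          if (!si)
                  return;                /* raced with swapoff */
          /* ... si->swap_map etc. are now safe to dereference ... */
          put_swap_device(si);           /* rcu_read_unlock() */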
      
      In addition to the swap_map, cluster_info, etc. data structures in
      struct swap_info_struct, the swap cache radix tree will be freed after
      swapoff, so this patch fixes the race between swap cache lookup and
      swapoff too.
      
      Races between some other swap cache usages and swapoff are fixed too,
      by calling synchronize_rcu() between clearing PageSwapCache() and
      freeing the swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapoff when its data
      structure is being accessed.  The overhead in hot-path of both methods is
      similar.  The advantages of RCU based method are,
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 15 Mar, 2019 (1 commit)
  18. 06 Mar, 2019 (2 commits)
  19. 29 Dec, 2018 (3 commits)
  20. 07 Nov, 2018 (1 commit)
  21. 27 Oct, 2018 (2 commits)
  22. 21 Oct, 2018 (1 commit)