  1. 15 August 2020 (2 commits)
    • mm/swap.c: annotate data races for lru_rotate_pvecs · 7e0cc01e
      Committed by Qian Cai
      A read of lru_add_pvec->nr can be interrupted by a write to the same
      variable.  The write runs with local interrupts disabled, but the plain
      reads still constitute data races.  However, it is unlikely the compiler
      could do much damage here given that lru_add_pvec->nr is an "unsigned
      char" and there is an existing compiler barrier.  Thus, annotate the
      reads using the data_race() macro.  The data races were reported by
      KCSAN:
      
       BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page
      
       write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
        rotate_reclaimable_page+0x2df/0x490
        pagevec_add at include/linux/pagevec.h:81
        (inlined by) rotate_reclaimable_page at mm/swap.c:259
        end_page_writeback+0x1b5/0x2b0
        end_swap_bio_write+0x1d0/0x280
        bio_endio+0x297/0x560
        dec_pending+0x218/0x430 [dm_mod]
        clone_endio+0xe4/0x2c0 [dm_mod]
        bio_endio+0x297/0x560
        blk_update_request+0x201/0x920
        scsi_end_request+0x6b/0x4a0
        scsi_io_completion+0xb7/0x7e0
        scsi_finish_command+0x1ed/0x2a0
        scsi_softirq_done+0x1c9/0x1d0
        blk_done_softirq+0x181/0x1d0
        __do_softirq+0xd9/0x57c
        irq_exit+0xa2/0xc0
        do_IRQ+0x8b/0x190
        ret_from_intr+0x0/0x42
        delay_tsc+0x46/0x80
        __const_udelay+0x3c/0x40
        __udelay+0x10/0x20
        kcsan_setup_watchpoint+0x202/0x3a0
        __tsan_read1+0xc2/0x100
        lru_add_drain_cpu+0xb8/0x3f0
        lru_add_drain+0x25/0x40
        shrink_active_list+0xe1/0xc80
        shrink_lruvec+0x766/0xb70
        shrink_node+0x2d6/0xca0
        do_try_to_free_pages+0x1f7/0x9a0
        try_to_free_pages+0x252/0x5b0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
        lru_add_drain_cpu+0xb8/0x3f0
        lru_add_drain_cpu at mm/swap.c:602
        lru_add_drain+0x25/0x40
        shrink_active_list+0xe1/0xc80
        shrink_lruvec+0x766/0xb70
        shrink_node+0x2d6/0xca0
        do_try_to_free_pages+0x1f7/0x9a0
        try_to_free_pages+0x252/0x5b0
        __alloc_pages_slowpath+0x458/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       2 locks held by oom02/37761:
        #0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
        #1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
       irq event stamp: 1949217
       trace_hardirqs_on_thunk+0x1a/0x1c
       __do_softirq+0x2e7/0x57c
       __do_softirq+0x34c/0x57c
       irq_exit+0xa2/0xc0
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
       Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Marco Elver <elver@google.com>
      Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e0cc01e
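      A minimal sketch of the annotation the commit describes, assuming the
      per-CPU pagevec drained from lru_add_drain_cpu() (treat the exact code
      as illustrative rather than a verbatim excerpt of the patch):

        struct pagevec *pvec = &per_cpu(lru_rotate_pvecs, cpu);

        /* Plain racy read; disabling interrupts below acts as the barrier. */
        if (data_race(pagevec_count(pvec))) {
                unsigned long flags;

                /* No harm done if a racing interrupt already drained it. */
                local_irq_save(flags);
                pagevec_move_tail(pvec);
                local_irq_restore(flags);
        }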
    • mm: replace hpage_nr_pages with thp_nr_pages · 6c357848
      Committed by Matthew Wilcox (Oracle)
      The thp prefix is more frequently used than hpage and we should be
      consistent between the various functions.
      
      [akpm@linux-foundation.org: fix mm/migrate.c]
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c357848
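      Conceptually the renamed helper stays a one-liner along these lines (a
      sketch, not the verbatim kernel definition):

        static inline int thp_nr_pages(struct page *page)
        {
                /* A THP head page stands for HPAGE_PMD_NR base pages. */
                if (PageHead(page))
                        return HPAGE_PMD_NR;
                return 1;
        }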
  2. 13 August 2020 (1 commit)
    • mm/vmscan: protect the workingset on anonymous LRU · b518154e
      Committed by Joonsoo Kim
      In the current implementation, newly created or swapped-in anonymous
      pages start out on the active list.  A growing active list triggers
      rebalancing of the active/inactive lists, so old pages on the active
      list are demoted to the inactive list.  Hence, pages on the active list
      aren't protected at all.
      
      The following example illustrates the situation.
      
      Assume there are 50 hot pages on the active list.  The numbers denote
      the pages on the active/inactive lists (active | inactive).
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      This patch tries to fix this issue.  As with the file LRU, newly created
      or swapped-in anonymous pages are inserted on the inactive list.  They
      are promoted to the active list if enough references happen.  This
      simple modification changes the above example as follows.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      As you can see, the hot pages on the active list are now protected.
      
      Note that this implementation has a drawback: a page cannot be promoted
      and will be swapped out if its re-access interval is greater than the
      size of the inactive list but less than the total size
      (active + inactive).  To solve this potential issue, a following patch
      will apply workingset detection similar to the one already applied to
      the file LRU.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b518154e
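      A rough sketch of the behavioral change in the anonymous fault path
      (the helper calls are illustrative of the idea, not a quote of the
      patch):

        /* Before: new anonymous pages started out active. */
        SetPageActive(page);
        lru_cache_add(page);

        /* After: start on the inactive list; activation has to be earned
         * by subsequent references, just as on the file LRU. */
        lru_cache_add(page);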
  3. 17 July 2020 (1 commit)
  4. 26 June 2020 (1 commit)
  5. 04 June 2020 (11 commits)
    • mm: swap: memcg: fix memcg stats for huge pages · 21e330fc
      Committed by Shakeel Butt
      Commit 2262185c ("mm: per-cgroup memory reclaim stats") added
      PGLAZYFREE, PGACTIVATE & PGDEACTIVATE stats for cgroups but missed a
      couple of places, and PGLAZYFREE missed huge page handling. Fix that.
      Also, for PGLAZYFREE, use the irq-unsafe function to update the stat,
      since interrupts are already disabled.
      
      Fixes: 2262185c ("mm: per-cgroup memory reclaim stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/20200527182947.251343-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21e330fc
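      A sketch of the two points above - huge-page-aware counts and the
      irq-unsafe update while interrupts are already off (names follow the
      mm code of that era, but treat the snippet as illustrative):

        int nr_pages = hpage_nr_pages(page);    /* 1, or HPAGE_PMD_NR for THP */

        /* Interrupts are already disabled here, so the __-prefixed,
         * irq-unsafe variants are sufficient and cheaper. */
        __count_vm_events(PGLAZYFREE, nr_pages);
        __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, nr_pages);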
    • mm: swap: fix vmstats for huge pages · 5d91f31f
      Committed by Shakeel Butt
      Many of the callbacks called by pagevec_lru_move_fn() do not correctly
      update the vmstats for huge pages. Fix that. Also make
      __pagevec_lru_add_fn() use the irq-unsafe alternative to update the
      stat, as interrupts are already disabled.
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/20200527182916.249910-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d91f31f
    • mm: vmscan: reclaim writepage is IO cost · 96f8bf4f
      Committed by Johannes Weiner
      The VM tries to balance reclaim pressure between anon and file so as to
      reduce the amount of IO incurred due to the memory shortage.  It already
      counts refaults and swapins, but in addition it should also count
      writepage calls during reclaim.
      
      For swap, this is obvious: it's IO that wouldn't have occurred if the
      anonymous memory hadn't been under memory pressure.  From a relative
      balancing point of view this makes sense as well: even if anon is cold and
      reclaimable, a cache that isn't thrashing may have equally cold pages that
      don't require IO to reclaim.
      
      For file writeback, it's trickier: some of the reclaim writepage IO would
      have likely occurred anyway due to dirty expiration.  But not all of it -
      premature writeback reduces batching and generates additional writes.
      Since the flushers are already woken up by the time the VM starts writing
      cache pages one by one, let's assume that we're likely causing writes that
      wouldn't have happened without memory pressure.  In addition, the per-page
      cost of IO would have probably been much cheaper if written in larger
      batches from the flusher thread rather than the single-page-writes from
      kswapd.
      
      For our purposes - getting the trend right to accelerate convergence on a
      stable state that doesn't require paging at all - this is sufficiently
      accurate.  If we later wanted to optimize for sustained thrashing, we can
      still refine the measurements.
      
      Count all writepage calls from kswapd as IO cost toward the LRU that the
      page belongs to.
      
      Why do this dynamically?  Don't we know in advance that anon pages require
      IO to reclaim, and so could build in a static bias?
      
      First, scanning is not the same as reclaiming.  If all the anon pages are
      referenced, we may not swap for a while just because we're scanning the
      anon list.  During this time, however, it's important that we age
      anonymous memory and the page cache at the same rate so that their
      hot-cold gradients are comparable.  Everything else being equal, we still
      want to reclaim the coldest memory overall.
      
      Second, we keep copies in swap unless the page changes.  If there is
      swap-backed data that's mostly read (tmpfs file) and has been swapped out
      before, we can reclaim it without incurring additional IO.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96f8bf4f
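      A sketch of the accounting idea, assuming the lru_note_cost() helper
      introduced elsewhere in this series (the call sites are illustrative):

        unsigned int nr_pageout = 0;
        bool file = page_is_file_lru(page);

        /* After successfully starting writeback on a page from kswapd,
         * remember how many base pages were written... */
        if (current_is_kswapd())
                nr_pageout += hpage_nr_pages(page);

        /* ...and once the reclaim batch completes, charge that as IO
         * cost to the LRU (anon or file) the pages belonged to. */
        lru_note_cost(lruvec, file, nr_pageout);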
    • mm: vmscan: determine anon/file pressure balance at the reclaim root · 7cf111bc
      Committed by Johannes Weiner
      We split the LRU lists into anon and file, and we rebalance the scan
      pressure between them when one of them begins thrashing: if the file cache
      experiences workingset refaults, we increase the pressure on anonymous
      pages; if the workload is stalled on swapins, we increase the pressure on
      the file cache instead.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, LRU pressure balancing is
      done on an individual cgroup LRU level.  As a result, when one cgroup is
      thrashing on the filesystem cache while a sibling may have cold anonymous
      pages, pressure doesn't get equalized between them.
      
      This patch moves the LRU balancing decision to the root of reclaim - the same
      level where the LRU order is established.
      
      It does this by tracking LRU cost recursively, so that every level of the
      cgroup tree knows the aggregate LRU cost of all memory within its domain.
      When the page scanner calculates the scan balance for any given individual
      cgroup's LRU list, it uses the values from the ancestor cgroup that
      initiated the reclaim cycle.
      
      If one sibling is then thrashing on the cache, it will tip the pressure
      balance inside its ancestors, and the next hierarchical reclaim iteration
      will go more after the anon pages in the tree.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf111bc
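      A sketch of the recursive bookkeeping described above, assuming the
      anon_cost/file_cost counters from the cost-model patch in this series
      (illustrative, not a verbatim excerpt):

        void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
        {
                do {
                        /* Record the cost at this level... */
                        if (file)
                                lruvec->file_cost += nr_pages;
                        else
                                lruvec->anon_cost += nr_pages;
                        /* ...and in every ancestor, so the cgroup that
                         * initiated reclaim sees the aggregate cost of
                         * its whole subtree. */
                } while ((lruvec = parent_lruvec(lruvec)));
        }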
    • mm: balance LRU lists based on relative thrashing · 314b57fb
      Committed by Johannes Weiner
      Since the LRUs were split into anon and file lists, the VM has been
      balancing between page cache and anonymous pages based on per-list ratios
      of scanned vs.  rotated pages.  In most cases that tips page reclaim
      towards the list that is easier to reclaim and has the fewest actively
      used pages, but there are a few problems with it:
      
      1. Refaults and LRU rotations are weighted the same way, even though
         one costs IO and the other costs a bit of CPU.
      
      2. The less we scan an LRU list based on already observed rotations,
         the more we increase the sampling interval for new references, and
         rotations become even more likely on that list. This can enter a
         death spiral in which we stop looking at one list completely until
         the other one is all but annihilated by page reclaim.
      
      Since commit a528910e ("mm: thrash detection-based file cache sizing")
      we have refault detection for the page cache.  Along with swapin events,
      they are good indicators of when the file or anon list, respectively, is
      too small for its workingset and needs to grow.
      
      For example, if the page cache is thrashing, the cache pages need more
      time in memory, while there may be colder pages on the anonymous list.
      Likewise, if swapped pages are faulting back in, it indicates that we
      reclaim anonymous pages too aggressively and should back off.
      
      Replace LRU rotations with refaults and swapins as the basis for relative
      reclaim cost of the two LRUs.  This will have the VM target list balances
      that incur the least amount of IO on aggregate.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      314b57fb
    • mm: deactivations shouldn't bias the LRU balance · fbbb602e
      Committed by Johannes Weiner
      Operations like MADV_FREE, FADV_DONTNEED etc.  currently move any affected
      active pages to the inactive list to accelerate their reclaim (good) but
      also steer page reclaim toward that LRU type, or away from the other
      (bad).
      
      The reason why this is undesirable is that such operations are not part of
      the regular page aging cycle, and rather a fluke that doesn't say much
      about the remaining pages on that list; they might all be in heavy use,
      and once the chunk of easy victims has been purged, the VM continues to
      apply elevated pressure on those remaining hot pages.  The other LRU,
      meanwhile, might have easily reclaimable pages, and there was never a need
      to steer away from it in the first place.
      
      As the previous patch outlined, we should focus on recording actually
      observed cost to steer the balance rather than speculating about the
      potential value of one LRU list over the other.  In that spirit, leave
      explicitly deactivated pages to the LRU algorithm to pick up, and let
      rotations decide which list is the easiest to reclaim.
      
      [cai@lca.pw: fix set-but-not-used warning]
        Link: http://lkml.kernel.org/r/20200522133335.GA624@Qians-MacBook-Air.local
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200520232525.798933-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbbb602e
    • mm: base LRU balancing on an explicit cost model · 1431d4d1
      Committed by Johannes Weiner
      Currently, scan pressure between the anon and file LRU lists is balanced
      based on a mixture of reclaim efficiency and a somewhat vague notion of
      "value" of having certain pages in memory over others.  That concept of
      value is problematic, because it has caused us to count any event that
      remotely makes one LRU list more or less preferable for reclaim, even
      when these events are not directly comparable and impose very different
      costs on the system.  One example is referenced file pages that we still
      deactivate and referenced anonymous pages that we actually rotate back to
      the head of the list.
      
      There is also conceptual overlap with the LRU algorithm itself.  By
      rotating recently used pages instead of reclaiming them, the algorithm
      already biases the applied scan pressure based on page value.  Thus, when
      rebalancing scan pressure due to rotations, we should think of reclaim
      cost, and leave assessing the page value to the LRU algorithm.
      
      Lastly, considering both value-increasing as well as value-decreasing
      events can sometimes cause the same type of event to be counted twice,
      i.e.  how rotating a page increases the LRU value, while reclaiming it
      successfully decreases the value.  In itself this will balance out fine,
      but it quietly skews the impact of events that are only recorded once.
      
      The abstract metric of "value", the murky relationship with the LRU
      algorithm, and accounting both negative and positive events make the
      current pressure balancing model hard to reason about and modify.
      
      This patch switches to a balancing model of accounting the concrete,
      actually observed cost of reclaiming one LRU over another.  For now, that
      cost includes pages that are scanned but rotated back to the list head.
      Subsequent patches will add consideration for IO caused by refaulting of
      recently evicted pages.
      
      Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
      make everything that affects cost go through a new lru_note_cost()
      function.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1431d4d1
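      A sketch of the shape of the change - two counters replace struct
      zone_reclaim_stat and a single helper records cost (the field and
      function names are as described above, everything else illustrative):

        struct lruvec {
                /* ...existing fields elided... */

                /* Observed cost of reclaiming each LRU type: pages scanned
                 * but rotated back to the head (later patches in the series
                 * add refault and swapin IO). */
                unsigned long   anon_cost;
                unsigned long   file_cost;
        };

        /* Everything that affects reclaim cost funnels through here. */
        void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages);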
    • mm: remove use-once cache bias from LRU balancing · 96824687
      Committed by Johannes Weiner
      When the splitlru patches divided page cache and swap-backed pages into
      separate LRU lists, the pressure balance between the lists was biased to
      account for the fact that streaming IO can cause memory pressure with a
      flood of pages that are used only once.  New page cache additions would
      tip the balance toward the file LRU, and repeat access would neutralize
      that bias again.  This ensured that page reclaim would always go for
      used-once cache first.
      
      Since e9868505 ("mm,vmscan: only evict file pages when we have
      plenty"), page reclaim generally skips over swap-backed memory entirely as
      long as there is used-once cache present, and will apply the LRU balancing
      when only repeatedly accessed cache pages are left - at which point the
      previous use-once bias will have been neutralized.  This makes the
      use-once cache balancing bias unnecessary.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-7-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96824687
    • mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() · 6058eaec
      Committed by Johannes Weiner
      They're the same function, and for the purpose of all callers they are
      equivalent to lru_cache_add().
      
      [akpm@linux-foundation.org: fix it for local_lock changes]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6058eaec
    • mm: fix LRU balancing effect of new transparent huge pages · 5df74196
      Committed by Johannes Weiner
      The reclaim code that balances between swapping and cache reclaim tries to
      predict likely reuse based on in-memory reference patterns alone.  This
      works in many cases, but when it fails it cannot detect when the cache is
      thrashing pathologically, or when we're in the middle of a swap storm.
      
      The high seek cost of rotational drives under which the algorithm evolved
      also meant that mistakes could quickly result in lockups from too
      aggressive swapping (which is predominantly random IO).  As a result, the
      balancing code has been tuned over time to a point where it mostly goes
      for page cache and defers swapping until the VM is under significant
      memory pressure.
      
      The resulting strategy doesn't make optimal caching decisions - where
      optimal is the least amount of IO required to execute the workload.
      
      The proliferation of fast random IO devices such as SSDs, in-memory
      compression such as zswap, and persistent memory technologies on the
      horizon, has made this undesirable behavior very noticeable: Even in the
      presence of large amounts of cold anonymous memory and a capable swap
      device, the VM refuses to even seriously scan these pages, and can leave
      the page cache thrashing needlessly.
      
      This series sets out to address this.  Since commit a528910e ("mm:
      thrash detection-based file cache sizing") we have exact tracking of
      refault IO - the ultimate cost of reclaiming the wrong pages.  This allows
      us to use an IO cost based balancing model that is more aggressive about
      scanning anonymous memory when the cache is thrashing, while being able to
      avoid unnecessary swap storms.
      
      These patches base the LRU balance on the rate of refaults on each list,
      times the relative IO cost between swap device and filesystem
      (swappiness), in order to optimize reclaim for least IO cost incurred.
      
      	History
      
      I floated these changes in 2016.  At the time they were incomplete and
      full of workarounds due to a lack of infrastructure in the reclaim code:
      We didn't have PageWorkingset, we didn't have hierarchical cgroup
      statistics, and there were problems with the cgroup swap controller.  As swapping
      wasn't too high a priority then, the patches stalled out.  With all
      dependencies in place now, here we are again with much cleaner,
      feature-complete patches.
      
      I kept the acks for patches that stayed materially the same :-)
      
      Below is a series of test results that demonstrate certain problematic
      behavior of the current code, as well as showcase the new code's more
      predictable and appropriate balancing decisions.
      
      	Test #1: No convergence
      
      This test shows an edge case where the VM currently doesn't converge at
      all on a new file workingset with a stale anon/tmpfs set.
      
      The test sets up a cold anon set the size of 3/4 RAM, then tries to
      establish a new file set half the size of RAM (flat access pattern).
      
      The vanilla kernel refuses to even scan anon pages and never converges.
      The file set is perpetually served from the filesystem.
      
      The first test kernel is with the series up to the workingset patch
      applied.  This allows thrashing page cache to challenge the anonymous
      workingset.  The VM then scans the lists based on the current
      scanned/rotated balancing algorithm.  It converges on a stable state where
      all cold anon pages are pushed out and the fileset is served entirely from
      cache:
      
      			    noconverge/5.7-rc5-mm	noconverge/5.7-rc5-mm-workingset
      Scanned			417719308.00 (    +0.00%)		64091155.00 (   -84.66%)
      Reclaimed		417711094.00 (    +0.00%)		61640308.00 (   -85.24%)
      Reclaim efficiency %	      100.00 (    +0.00%)		      96.18 (    -3.78%)
      Scanned file		417719308.00 (    +0.00%)		59211118.00 (   -85.83%)
      Scanned anon			0.00 (    +0.00%)	         4880037.00 (          )
      Swapouts			0.00 (    +0.00%)	         2439957.00 (          )
      Swapins				0.00 (    +0.00%)		     257.00 (          )
      Refaults		415246605.00 (    +0.00%)		59183722.00 (   -85.75%)
      Restore refaults		0.00 (    +0.00%)	        54988252.00 (          )
      
      The second test kernel is with the full patch series applied, which
      replaces the scanned/rotated ratios with refault/swapin rate-based
      balancing.  It evicts the cold anon pages more aggressively in the
      presence of a thrashing cache and the absence of swapins, and so converges
      with about 60% of the IO and reclaim activity:
      
      			noconverge/5.7-rc5-mm-workingset	noconverge/5.7-rc5-mm-lrubalance
      Scanned				64091155.00 (    +0.00%)		37579741.00 (   -41.37%)
      Reclaimed			61640308.00 (    +0.00%)		35129293.00 (   -43.01%)
      Reclaim efficiency %		      96.18 (    +0.00%)		      93.48 (    -2.78%)
      Scanned file			59211118.00 (    +0.00%)		32708385.00 (   -44.76%)
      Scanned anon			 4880037.00 (    +0.00%)		 4871356.00 (    -0.18%)
      Swapouts			 2439957.00 (    +0.00%)		 2435565.00 (    -0.18%)
      Swapins				     257.00 (    +0.00%)		     262.00 (    +1.94%)
      Refaults			59183722.00 (    +0.00%)		32675667.00 (   -44.79%)
      Restore refaults		54988252.00 (    +0.00%)		28480430.00 (   -48.21%)
      
      We're triggering this case in host sideloading scenarios: When a host's
      primary workload is not saturating the machine (primary load is usually
      driven by user activity), we can optimistically sideload a batch job; if
      user activity picks up and the primary workload needs the whole host
      during this time, we freeze the sideload and rely on it getting pushed to
      swap.  Frequently that swapping doesn't happen and the completely inactive
      sideload simply stays resident while the expanding primary workload is
      struggling to gain ground.
      
      	Test #2: Kernel build
      
      This test is a kernel build that is slightly memory-restricted (make -j4
      inside a 400M cgroup).
      
      Despite the very aggressive swapping of cold anon pages in test #1, this
      test shows that the new kernel carefully balances swap against cache
      refaults when both the anon and the file set are pressured.
      
      It shows the patched kernel to be slightly better at finding the coldest
      memory from the combined anon and file set to evict under pressure.  The
      result is lower aggregate reclaim and paging activity:
      
      				    5.7-rc5-mm	5.7-rc5-mm-lrubalance
      Real time		   210.60 (    +0.00%)	   210.97 (    +0.18%)
      User time		   745.42 (    +0.00%)	   746.48 (    +0.14%)
      System time		    69.78 (    +0.00%)	    69.79 (    +0.02%)
      Scanned file		354682.00 (    +0.00%)	293661.00 (   -17.20%)
      Scanned anon		465381.00 (    +0.00%)	378144.00 (   -18.75%)
      Swapouts		185920.00 (    +0.00%)	147801.00 (   -20.50%)
      Swapins			 34583.00 (    +0.00%)	 32491.00 (    -6.05%)
      Refaults		212664.00 (    +0.00%)	172409.00 (   -18.93%)
      Restore refaults	 48861.00 (    +0.00%)	 80091.00 (   +63.91%)
      Total paging IO		433167.00 (    +0.00%)	352701.00 (   -18.58%)
      
      	Test #3: Overload
      
      This next test is not about performance, but rather about the
      predictability of the algorithm.  The current balancing behavior doesn't
      always lead to comprehensible results, which makes performance analysis
      and parameter tuning (swappiness e.g.) very difficult.
      
      The test shows the balancing behavior under equivalent anon and file
      input.  Anon and file sets are created of equal size (3/4 RAM), have the
      same access patterns (a hot-cold gradient), and synchronized access rates.
      Swappiness is raised from the default of 60 to 100 to indicate equal IO
      cost between swap and cache.
      
      With the vanilla balancing code, anon scans make up around 9% of the total
      pages scanned, or a ~1:10 ratio.  This is a surprisingly skewed ratio, and
      it's an outcome that is hard to explain given the input parameters to the
      VM.
      
      The new balancing model targets a 1:2 balance: All else being equal,
      reclaiming a file page costs one page IO - the refault; reclaiming an anon
      page costs two IOs - the swapout and the swapin.  In the test we observe a
      ~1:3 balance.
      
      The scanned and paging IO numbers indicate that the anon LRU algorithm we
      have in place right now does a slightly worse job at picking the coldest
      pages compared to the file algorithm.  There is ongoing work to improve
      this, like Joonsoo's anon workingset patches; however, it's difficult to
      compare the two aging strategies when the balancing between them is
      behaving unintuitively.
      
      The slightly less efficient anon reclaim results in a deviation from the
      optimal 1:2 scan ratio we would like to see here - however, 1:3 is much
      closer to what we'd want to see in this test than the vanilla kernel's
      aging of 10+ cache pages for every anonymous one:
      
      			overload-100/5.7-rc5-mm-workingset	overload-100/5.7-rc5-mm-lrubalance-realfile
      Scanned				 533633725.00 (    +0.00%)			  595687785.00 (   +11.63%)
      Reclaimed			 494325440.00 (    +0.00%)			  518154380.00 (    +4.82%)
      Reclaim efficiency %			92.63 (    +0.00%)				 86.98 (    -6.03%)
      Scanned file			 484532894.00 (    +0.00%)			  456937722.00 (    -5.70%)
      Scanned anon			  49100831.00 (    +0.00%)			  138750063.00 (  +182.58%)
      Swapouts			   8096423.00 (    +0.00%)			   48982142.00 (  +504.98%)
      Swapins				  10027384.00 (    +0.00%)			   62325044.00 (  +521.55%)
      Refaults			 479819973.00 (    +0.00%)			  451309483.00 (    -5.94%)
      Restore refaults		 426422087.00 (    +0.00%)			  399914067.00 (    -6.22%)
      Total paging IO			 497943780.00 (    +0.00%)			  562616669.00 (   +12.99%)
      
      	Test #4: Parallel IO
      
      It's important to note that these patches only affect the situation where
      the kernel has to reclaim workingset memory, which is usually a
      transitional period.  The vast majority of page reclaim occurring in a
      system is from trimming the ever-expanding page cache.
      
      These patches don't affect cache trimming behavior.  We never swap as long
      as we only have use-once cache moving through the file LRU; we only
      consider swapping when the cache is actively thrashing.
      
      The following test demonstrates this.  It has an anon workingset that
      takes up half of RAM and then writes a file that is twice the size of RAM
      out to disk.
      
      As the cache is funneled through the inactive file list, no anon pages are
      scanned (aside from apparently some background noise of 10 pages):
      
      					  5.7-rc5-mm		          5.7-rc5-mm-lrubalance
      Scanned			    10714722.00 (    +0.00%)		       10723445.00 (    +0.08%)
      Reclaimed		    10703596.00 (    +0.00%)		       10712166.00 (    +0.08%)
      Reclaim efficiency %		  99.90 (    +0.00%)			     99.89 (    -0.00%)
      Scanned file		    10714722.00 (    +0.00%)		       10723435.00 (    +0.08%)
      Scanned anon			   0.00 (    +0.00%)			     10.00 (          )
      Swapouts			   0.00 (    +0.00%)			      7.00 (          )
      Swapins				   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Refaults			  92.00 (    +0.00%)			     41.00 (   -54.84%)
      Restore refaults		   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Total paging IO			  92.00 (    +0.00%)			     48.00 (   -47.31%)
      
      This patch (of 14):
      
      Currently, THP are counted as single pages until they are split right
      before being swapped out.  However, at that point the VM is already in the
      middle of reclaim, and adjusting the LRU balance then is useless.
      
      Always account THP by the number of basepages, and remove the fixup from
      the splitting path.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-1-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20200520232525.798933-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5df74196
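      A sketch of the idea in this first patch of the series, borrowing the
      cost bookkeeping described in the later patches purely for
      illustration (treat names and call sites as assumptions):

        /* Credit a THP with all of its base pages when recording LRU
         * cost, instead of counting it as one page and fixing the
         * statistics up when the huge page is split before swap-out. */
        int nr_pages = hpage_nr_pages(page);    /* 1, or e.g. 512 for a THP */

        lru_note_cost(lruvec, page_is_file_lru(page), nr_pages);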
    • mm: simplify calling a compound page destructor · ff45fc3c
      Committed by Matthew Wilcox (Oracle)
      None of the three callers of get_compound_page_dtor() want to know the
      value; they just want to call the function.  Replace it with
      destroy_compound_page() which calls the dtor for them.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff45fc3c
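      A sketch of the simplification (only the function names come from the
      changelog; the snippet itself is illustrative):

        /* Before: each caller fetched the destructor only to invoke it. */
        compound_page_dtor *dtor = get_compound_page_dtor(page);
        (*dtor)(page);

        /* After: a single helper does the lookup and the call. */
        destroy_compound_page(page);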
  6. 28 May 2020 (1 commit)
    • mm/swap: Use local_lock for protection · b01b2141
      Committed by Ingo Molnar
      The various struct pagevec per CPU variables are protected by disabling
      either preemption or interrupts across the critical sections. Inside
      these sections spinlocks have to be acquired.
      
      These spinlocks are regular spinlock_t types which are converted to
      "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
      locks cannot be acquired in preemption or interrupt disabled sections.
      
      local locks provide a trivial way to substitute preempt and interrupt
      disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
      to preempt_disable() and local_lock_irq() to local_irq_disable().
      
      Create lru_rotate_pvecs containing the pagevec and the locallock.
      Create lru_pvecs containing the remaining pagevecs and the locallock.
      Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
      exporting the pvec structure.
      
      Change the relevant call sites to acquire these locks instead of using
      preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
      local_irq_save().
      
      There is neither a functional change nor a change in the generated
      binary code for non PREEMPT_RT enabled non-debug kernels.
      
      When lockdep is enabled local locks have lockdep maps embedded. These
      allow lockdep to validate the protections, i.e. inappropriate usage of a
      preemption-only protected section would result in a lockdep warning,
      while the same problem would not be noticed with a plain
      preempt_disable() based protection.
      
      Local locks also improve readability as they provide a named scope for
      the protections, while preempt/interrupt disable are opaque and scopeless.
      
      Finally local locks allow PREEMPT_RT to substitute them with real
      locking primitives to ensure the correctness of operation in a fully
      preemptible kernel.
      
      [ bigeasy: Adopted to use local_lock ]
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
      b01b2141
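      A sketch of the structure the changelog describes, assuming the
      local_lock API from <linux/local_lock.h> (the field layout and the
      call site are illustrative):

        #include <linux/local_lock.h>

        struct lru_rotate {
                local_lock_t lock;
                struct pagevec pvec;
        };
        static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
                .lock = INIT_LOCAL_LOCK(lock),
        };

        void rotate_reclaimable_page(struct page *page)
        {
                unsigned long flags;

                /* Replaces local_irq_save(): on PREEMPT_RT this becomes a
                 * real per-CPU lock, elsewhere it disables interrupts and
                 * gives lockdep a named scope to check. */
                local_lock_irqsave(&lru_rotate.lock, flags);
                /* ... add the page to this CPU's pagevec ... */
                local_unlock_irqrestore(&lru_rotate.lock, flags);
        }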
  7. 08 April 2020 (2 commits)
    • mm: huge tmpfs: try to split_huge_page() when punching hole · 71725ed1
      Committed by Hugh Dickins
      Yang Shi writes:
      
      Currently, when truncating a shmem file, if the range is partly in a THP
      (start or end is in the middle of THP), the pages actually will just get
      cleared rather than being freed, unless the range covers the whole THP.
      Even though all the subpages are truncated (randomly or sequentially), the
      THP may still be kept in page cache.
      
      This might be fine for some usecases which prefer preserving THP, but
      balloon inflation is handled in base page size.  So when using shmem THP
      as memory backend, QEMU inflation actually doesn't work as expected since
      it doesn't free memory.  But the inflation usecase really needs to get the
      memory freed.  (Anonymous THP will also not get freed right away, but will
      be freed eventually when all subpages are unmapped: whereas shmem THP
      still stays in page cache.)
      
      Split THP right away when doing partial hole punch, and if split fails
      just clear the page so that read of the punched area will return zeroes.
      
      Hugh Dickins adds:
      
      Our earlier "team of pages" huge tmpfs implementation worked in the way
      that Yang Shi proposes; and we have been using this patch to continue to
      split the huge page when hole-punched or truncated, since converting over
      to the compound page implementation.  Although huge tmpfs gives out huge
      pages when available, if the user specifically asks to truncate or punch a
      hole (perhaps to free memory, perhaps to reduce the memcg charge), then
      the filesystem should do so as best it can, splitting the huge page.
      
      That is not always possible: any additional reference to the huge page
      prevents split_huge_page() from succeeding, so the result can be flaky.
      But in practice it works successfully enough that we've not seen any
      problem from that.
      
      Add shmem_punch_compound() to encapsulate the decision of when a split is
      needed, and doing the split if so.  Using this simplifies the flow in
      shmem_undo_range(); and the first (trylock) pass does not need to do any
      page clearing on failure, because the second pass will either succeed or
      do that clearing.  Following the example of zero_user_segment() when
      clearing a partial page, add flush_dcache_page() and set_page_dirty() when
      clearing a hole - though I'm not certain that either is needed.
      
      But: split_huge_page() would be sure to fail if shmem_undo_range()'s
      pagevec holds further references to the huge page.  The easiest way to fix
      that is for find_get_entries() to return early, as soon as it has put one
      compound head or tail into the pagevec.  At first this felt like a hack;
      but on examination, this convention better suits all its callers - or will
      do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
      and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
      speedup by checking for compound pages there.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71725ed1
    • mm: code cleanup for MADV_FREE · 9de4f22a
      Committed by Huang Ying
      Some comments for MADV_FREE are revised and added to help people
      understand the MADV_FREE code, especially the page flag PG_swapbacked.
      This makes page_is_file_cache() inconsistent with its comments, so the
      function is renamed to page_is_file_lru() to make them consistent again.
      All of this is put in one patch as one logical change.
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9de4f22a
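      The renamed helper's semantics, as a minimal sketch consistent with
      the PG_swapbacked description above (not a verbatim quote of the
      kernel header):

        /* MADV_FREE clears PG_swapbacked, so "file" here really means
         * "managed like the file LRU", not "backed by a file". */
        static inline int page_is_file_lru(struct page *page)
        {
                return !PageSwapBacked(page);
        }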
  8. 03 April 2020 (2 commits)
  9. 01 February 2020 (1 commit)
    • mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages · 07d80269
      Committed by John Hubbard
      An upcoming patch changes and complicates the refcounting and especially
      the "put page" aspects of it.  In order to keep everything clean,
      refactor the devmap page release routines:
      
      * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
        limit the functionality to "read only": return a bool, with no side
        effects.
      
      * Add a new routine, put_devmap_managed_page(), to handle decrementing
        the refcount for ZONE_DEVICE pages.
      
      * Change callers (just release_pages() and put_page()) to check
        page_is_devmap_managed() before calling the new
        put_devmap_managed_page() routine.  This is a performance point:
        put_page() is a hot path, so we need to avoid non-inline function calls
        where possible.
      
      * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
        limit the functionality to unconditionally freeing a devmap page.
      
      This is originally based on a separate patch by Ira Weiny, which applied
      to an early version of the put_user_page() experiments.  Since then,
      Jérôme Glisse suggested the refactoring described above.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Suggested-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07d80269
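      A sketch of the refactored caller pattern described above (only the
      helper names come from the changelog; the surrounding code is
      illustrative):

        void put_page(struct page *page)
        {
                page = compound_head(page);

                /* Cheap, read-only, inlineable check on the hot path; the
                 * out-of-line refcount handling runs only for ZONE_DEVICE
                 * pages that actually need it. */
                if (page_is_devmap_managed(page)) {
                        put_devmap_managed_page(page);
                        return;
                }

                if (put_page_testzero(page))
                        __put_page(page);
        }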
  10. 01 December 2019 (2 commits)
  11. 26 September 2019 (1 commit)
    • mm: introduce MADV_COLD · 9c276cc6
      Committed by Minchan Kim
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidates for swap.  Upon investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed (we measured the performance of swapping out
      vs.  zapping the memory by killing a process.  Unsurprisingly, zapping is
      10x faster even though we use zram, which is much faster than real
      storage), so a kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could give the platform many opportunities to use the rich
      information it has to optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in that it hints to the kernel that the memory region is
      not currently needed and should be reclaimed immediately; MADV_COLD is
      similar to MADV_FREE in that it hints to the kernel that the memory
      region is not currently needed and should be reclaimed when memory
      pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it can
      give a hint to the kernel that the pages can be reclaimed when memory
      pressure happens, but that the data should be preserved for future use.
      This can reduce workingset eviction and so end up increasing
      performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for all LRU pages, like MADV_[DONTNEED|FREE]. IOW, it moves
      
      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
      the inactive file LRU, because MADV_COLD has slightly different
      semantics.  MADV_FREE means it's okay to discard the page under memory
      pressure because its content is *garbage*, so freeing such pages has
      almost zero overhead: we don't need to swap them out, and a later access
      causes just a minor fault.  Thus, it makes sense to put those freeable
      pages on the inactive file LRU to compete with other used-once pages.
      It also makes sense from an implementation point of view, because the
      page is no longer swap-backed memory until it is re-dirtied.  It could
      even give a bonus of letting them be reclaimed on a swapless system.
      However, MADV_COLD doesn't mean garbage, so reclaiming those pages
      eventually requires swap-out and swap-in, which is a bigger cost.  Since
      VM LRU aging is designed around a cost model, cold anonymous pages are
      better positioned on the inactive anon LRU list, not the file LRU.
      Furthermore, this helps avoid unnecessary scanning if the system doesn't
      have a swap device.  Let's start with the simpler way without adding
      complexity at this moment.  Keep in mind, though, the caveat that
      workloads with a lot of page cache are likely to effectively ignore
      MADV_COLD on anonymous memory, because we rarely age anonymous LRU
      lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
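      A minimal userspace usage example of the new hint (the wrapper function
      is illustrative):

        #include <stdio.h>
        #include <sys/mman.h>

        static void mark_cold(void *addr, size_t length)
        {
                /* Hint that this range is unlikely to be touched soon;
                 * unlike MADV_FREE, its contents are preserved. */
                if (madvise(addr, length, MADV_COLD) != 0)
                        perror("madvise(MADV_COLD)");
        }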
  12. 25 September 2019 (2 commits)
  13. 15 July 2019 (2 commits)
  14. 03 July 2019 (1 commit)
  15. 02 July 2019 (1 commit)
  16. 21 May 2019 (1 commit)
  17. 15 May 2019 (1 commit)
  18. 06 March 2019 (1 commit)
  19. 22 February 2019 (1 commit)
  20. 05 January 2019 (1 commit)
  21. 29 December 2018 (1 commit)
  22. 13 November 2018 (1 commit)
    • mm: Replace spin_is_locked() with lockdep · 35f3aa39
      Committed by Lance Roy
      lockdep_assert_held() is better suited to checking locking requirements,
      since it checks that the current thread actually holds the lock, rather
      than merely that some thread does. This is also a step towards possibly
      removing
      spin_is_locked().
      Signed-off-by: Lance Roy <ldr709@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      35f3aa39
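      The substitution pattern, as a sketch (the particular lock is
      illustrative):

        /* Before: only proves that *some* CPU holds the lock. */
        VM_BUG_ON(!spin_is_locked(&pgdat->lru_lock));

        /* After: proves the current thread holds it, and is compiled out
         * entirely when lockdep is disabled. */
        lockdep_assert_held(&pgdat->lru_lock);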
  23. 27 October 2018 (1 commit)
  24. 21 October 2018 (1 commit)