1. 03 4月, 2020 4 次提交
    • Q
      mm/vmscan.c: fix data races using kswapd_classzone_idx · 5644e1fb
      Qian Cai 提交于
      pgdat->kswapd_classzone_idx could be accessed concurrently in
      wakeup_kswapd().  Plain writes and reads without any lock protection
      result in data races.  Fix them by adding a pair of READ|WRITE_ONCE() as
      well as saving a branch (compilers might well optimize the original code
      in an unintentional way anyway).  While at it, also take care of
      pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim().  The
      data races were reported by KCSAN,
      
       BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd
      
       write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13:
        wakeup_kswapd+0xf1/0x400
        wakeup_kswapd at mm/vmscan.c:3967
        wake_all_kswapds+0x59/0xc0
        wake_all_kswapds at mm/page_alloc.c:4241
        __alloc_pages_slowpath+0xdcc/0x1290
        __alloc_pages_slowpath at mm/page_alloc.c:4512
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       1 lock held by mtest01/7454:
        #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at:
       do_page_fault+0x143/0x6f9
       do_user_addr_fault at arch/x86/mm/fault.c:1405
       (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
       irq event stamp: 6944085
       count_memcg_event_mm+0x1a6/0x270
       count_memcg_event_mm+0x119/0x270
       __do_softirq+0x34c/0x57c
       irq_exit+0xa2/0xc0
      
       read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38:
        wakeup_kswapd+0xc8/0x400
        wake_all_kswapds+0x59/0xc0
        __alloc_pages_slowpath+0xdcc/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       1 lock held by mtest01/7472:
        #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at:
       do_page_fault+0x143/0x6f9
       irq event stamp: 6793561
       count_memcg_event_mm+0x1a6/0x270
       count_memcg_event_mm+0x119/0x270
       __do_softirq+0x34c/0x57c
       irq_exit+0xa2/0xc0
      
       BUG: KCSAN: data-race in kswapd / wakeup_kswapd
      
       write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6:
        kswapd+0x27c/0x8d0
        kthread+0x1e0/0x200
        ret_from_fork+0x27/0x50
      
       read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0:
        wakeup_kswapd+0xf3/0x450
        wake_all_kswapds+0x59/0xc0
        __alloc_pages_slowpath+0xdcc/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x170/0x700
        __handle_mm_fault+0xc9f/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5644e1fb
    • W
      mm/vmscan.c: remove cpu online notification for now · 6b700b5b
      Wei Yang 提交于
      kswapd kernel thread starts either with a CPU affinity set to the full cpu
      mask of its target node or without any affinity at all if the node is
      CPUless.  There is a cpu hotplug callback (kswapd_cpu_online) that
      implements an elaborate way to update this mask when a cpu is onlined.
      
      It is not really clear whether there is any actual benefit from this
      scheme. Completely CPU-less NUMA nodes rarely gain a new CPU during
      runtime. Drop the code for that reason. If there is a real usecase then
      we can resurrect and simplify the code.
      
      [mhocko@suse.com rewrite changelog]
      Suggested-by: NMichal Hocko <mhocko@suse.org>
      Signed-off-by: NWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200218224422.3407-1-richardw.yang@linux.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b700b5b
    • Y
      mm: vmscan: replace open codings to NUMA_NO_NODE · f661d007
      Yang Shi 提交于
      The commit 98fa15f3 ("mm: replace all open encodings for
      NUMA_NO_NODE") did the replacement across the kernel tree, but we got
      some more in vmscan.c since then.
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/1581568298-45317-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f661d007
    • Y
      mm: swap: make page_evictable() inline · 1eb6234e
      Yang Shi 提交于
      When backporting commit 9c4e6b1a ("mm, mlock, vmscan: no more skipping
      pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with
      a couple of vm-scalability's test cases (lru-file-readonce,
      lru-file-readtwice and lru-file-mmap-read).  I didn't see that much down
      on my VM (32c-64g-2nodes).  It might be caused by the test configuration,
      which is 32c-256g with NUMA disabled and the tests were run in root memcg,
      so the tests actually stress only one inactive and active lru.  It sounds
      not very usual in mordern production environment.
      
      That commit did two major changes:
      1. Call page_evictable()
      2. Use smp_mb to force the PG_lru set visible
      
      It looks they contribute the most overhead.  The page_evictable() is a
      function which does function prologue and epilogue, and that was used by
      page reclaim path only.  However, lru add is a very hot path, so it sounds
      better to make it inline.  However, it calls page_mapping() which is not
      inlined either, but the disassemble shows it doesn't do push and pop
      operations and it sounds not very straightforward to inline it.
      
      Other than this, it sounds smp_mb() is not necessary for x86 since
      SetPageLRU is atomic which enforces memory barrier already, replace it
      with smp_mb__after_atomic() in the following patch.
      
      With the two fixes applied, the tests can get back around 5% on that test
      bench and get back normal on my VM.  Since the test bench configuration is
      not that usual and I also saw around 6% up on the latest upstream, so it
      sounds good enough IMHO.
      
      The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
      	mainline	w/ inline fix
                150MB            154MB
      
      With this patch the throughput gets 2.67% up.  The data with using
      smp_mb__after_atomic() is showed in the following patch.
      
      Shakeel Butt did the below test:
      
      On a real machine with limiting the 'dd' on a single node and reading 100
      GiB sparse file (less than a single node).  Just ran a single instance to
      not cause the lru lock contention.  The cmdline used is "dd if=file-100GiB
      of=/dev/null bs=4k".  Ran the cmd 10 times with drop_caches in between and
      measured the time it took.
      
      Without patch: 56.64143 +- 0.672 sec
      
      With patches: 56.10 +- 0.21 sec
      
      [akpm@linux-foundation.org: move page_evictable() to internal.h]
      Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs")
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1eb6234e
  2. 22 2月, 2020 1 次提交
    • G
      mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Gavin Shan 提交于
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behavior: the corresponding LRU list is eligible for
      reclaiming only when its size logically right shifted by @sc->priority
      is bigger than zero in the former formula.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaing the further iteration on @sc->priority and the silbing and child
      memory cgroups will be skipped.  The overall behaviour has been changed.
      This fixes the issue by applying the roundup to offlined memory cgroups
      only, to give more preference to reclaim memory from offlined memory
      cgroup.  It sounds reasonable as those memory is unlikedly to be used by
      anyone.
      
      The issue was found by starting up 8 VMs on a Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given with 2 vCPUs and
      2GB memory.  It took 264 seconds for all VMs to be completely up and
      784MB swap is consumed after that.  With this patch applied, it took 236
      seconds and 60MB swap to do same thing.  So there is 10% performance
      improvement for my case.  Note that KSM is disable while THP is enabled
      in the testing.
      
               total     used    free   shared  buff/cache   available
         Mem:  16196    10065    2049       16        4081        3749
         Swap:  8175      784    7391
               total     used    free   shared  buff/cache   available
         Mem:  16196    11324    3656       24        1215        2936
         Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: NGavin Shan <gshan@redhat.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76073c64
  3. 01 2月, 2020 3 次提交
  4. 18 12月, 2019 1 次提交
  5. 02 12月, 2019 14 次提交
  6. 01 12月, 2019 1 次提交
  7. 19 10月, 2019 2 次提交
    • W
      mm/vmscan.c: support removing arbitrary sized pages from mapping · 906d278d
      William Kucharski 提交于
      __remove_mapping() assumes that pages can only be either base pages or
      HPAGE_PMD_SIZE.  Ask the page what size it is.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NYang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      906d278d
    • H
      mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size · b11edebb
      Honglei Wang 提交于
      Commit 1a61ab80 ("mm: memcontrol: replace zone summing with
      lruvec_page_state()") has made lruvec_page_state to use per-cpu counters
      instead of calculating it directly from lru_zone_size with an idea that
      this would be more effective.
      
      Tim has reported that this is not really the case for their database
      benchmark which is showing an opposite results where lruvec_page_state
      is taking up a huge chunk of CPU cycles (about 25% of the system time
      which is roughly 7% of total cpu cycles) on 5.3 kernels.  The workload
      is running on a larger machine (96cpus), it has many cgroups (500) and
      it is heavily direct reclaim bound.
      
      Tim Chen said:
      
      : The problem can also be reproduced by running simple multi-threaded
      : pmbench benchmark with a fast Optane SSD swap (see profile below).
      :
      :
      : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
      :             |
      :             |--3.07%--lruvec_lru_size
      :             |          |
      :             |          |--2.11%--cpumask_next
      :             |          |          |
      :             |          |           --1.66%--find_next_bit
      :             |          |
      :             |           --0.57%--call_function_interrupt
      :             |                     |
      :             |                      --0.55%--smp_call_function_interrupt
      :             |
      :             |--1.59%--0x441f0fc3d009
      :             |          _ops_rdtsc_init_base_freq
      :             |          access_histogram
      :             |          page_fault
      :             |          __do_page_fault
      :             |          handle_mm_fault
      :             |          __handle_mm_fault
      :             |          |
      :             |           --1.54%--do_swap_page
      :             |                     swapin_readahead
      :             |                     swap_cluster_readahead
      :             |                     |
      :             |                      --1.53%--read_swap_cache_async
      :             |                                __read_swap_cache_async
      :             |                                alloc_pages_vma
      :             |                                __alloc_pages_nodemask
      :             |                                __alloc_pages_slowpath
      :             |                                try_to_free_pages
      :             |                                do_try_to_free_pages
      :             |                                shrink_node
      :             |                                shrink_node_memcg
      :             |                                |
      :             |                                |--0.77%--lruvec_lru_size
      :             |                                |
      :             |                                 --0.76%--inactive_list_is_low
      :             |                                           |
      :             |                                            --0.76%--lruvec_lru_size
      :             |
      :              --1.50%--measure_read
      :                        page_fault
      :                        __do_page_fault
      :                        handle_mm_fault
      :                        __handle_mm_fault
      :                        do_swap_page
      :                        swapin_readahead
      :                        swap_cluster_readahead
      :                        |
      :                         --1.48%--read_swap_cache_async
      :                                   __read_swap_cache_async
      :                                   alloc_pages_vma
      :                                   __alloc_pages_nodemask
      :                                   __alloc_pages_slowpath
      :                                   try_to_free_pages
      :                                   do_try_to_free_pages
      :                                   shrink_node
      :                                   shrink_node_memcg
      :                                   |
      :                                   |--0.75%--inactive_list_is_low
      :                                   |          |
      :                                   |           --0.75%--lruvec_lru_size
      :                                   |
      :                                    --0.73%--lruvec_lru_size
      
      The likely culprit is the cache traffic the lruvec_page_state_local
      generates.  Dave Hansen says:
      
      : I was thinking purely of the cache footprint.  If it's reading
      : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
      : bytes of cache *96 CPUs = 18k of data, mostly read-only.  1 cgroup would
      : be 18k of data for the whole system and the caching would be pretty
      : efficient and all 18k would probably survive a tight page fault loop in
      : the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
      : fit in the L1 and probably wouldn't survive a tight page fault loop if
      : both logical threads were banging on different cgroups.
      :
      : It's just a theory, but it's why I noted the number of cgroups when I
      : initially saw this show up in profiles
      
      Fix the regression by partially reverting the said commit and calculate
      the lru size explicitly.
      
      Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
      Fixes: 1a61ab80 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
      Signed-off-by: NHonglei Wang <honglei.wang@oracle.com>
      Reported-by: NTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: NTim Chen <tim.c.chen@linux.intel.com>
      Tested-by: NTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b11edebb
  8. 08 10月, 2019 3 次提交
    • C
      mm, memcg: make scan aggression always exclude protection · 1bc63fb1
      Chris Down 提交于
      This patch is an incremental improvement on the existing
      memory.{low,min} relative reclaim work to base its scan pressure
      calculations on how much protection is available compared to the current
      usage, rather than how much the current usage is over some protection
      threshold.
      
      This change doesn't change the experience for the user in the normal
      case too much.  One benefit is that it replaces the (somewhat arbitrary)
      100% cutoff with an indefinite slope, which makes it easier to ballpark
      a memory.low value.
      
      As well as this, the old methodology doesn't quite apply generically to
      machines with varying amounts of physical memory.  Let's say we have a
      top level cgroup, workload.slice, and another top level cgroup,
      system-management.slice.  We want to roughly give 12G to
      system-management.slice, so on a 32GB machine we set memory.low to 20GB
      in workload.slice, and on a 64GB machine we set memory.low to 52GB.
      However, because these are relative amounts to the total machine size,
      while the amount of memory we want to generally be willing to yield to
      system.slice is absolute (12G), we end up putting more pressure on
      system.slice just because we have a larger machine and a larger workload
      to fill it, which seems fairly unintuitive.  With this new behaviour, we
      don't end up with this unintended side effect.
      
      Previously the way that memory.low protection works is that if you are
      50% over a certain baseline, you get 50% of your normal scan pressure.
      This is certainly better than the previous cliff-edge behaviour, but it
      can be improved even further by always considering memory under the
      currently enforced protection threshold to be out of bounds.  This means
      that we can set relatively low memory.low thresholds for variable or
      bursty workloads while still getting a reasonable level of protection,
      whereas with the previous version we may still trivially hit the 100%
      clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
      one is more concretely based on the currently enforced protection
      threshold, which is likely easier to reason about.
      
      There is also a subtle issue with the way that proportional reclaim
      worked previously -- it promotes having no memory.low, since it makes
      pressure higher during low reclaim.  This happens because we base our
      scan pressure modulation on how far memory.current is between memory.min
      and memory.low, but if memory.low is unset, we only use the overage
      method.  In most cromulent configurations, this then means that we end
      up with *more* pressure than with no memory.low at all when we're in low
      reclaim, which is not really very usable or expected.
      
      With this patch, memory.low and memory.min affect reclaim pressure in a
      more understandable and composable way.  For example, from a user
      standpoint, "protected" memory now remains untouchable from a reclaim
      aggression standpoint, and users can also have more confidence that
      bursty workloads will still receive some amount of guaranteed
      protection.
      
      Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1bc63fb1
    • C
      mm, memcg: make memory.emin the baseline for utilisation determination · 9de7ca46
      Chris Down 提交于
      Roman points out that when when we do the low reclaim pass, we scale the
      reclaim pressure relative to position between 0 and the maximum
      protection threshold.
      
      However, if the maximum protection is based on memory.elow, and
      memory.emin is above zero, this means we still may get binary behaviour
      on second-pass low reclaim.  This is because we scale starting at 0, not
      starting at memory.emin, and since we don't scan at all below emin, we
      end up with cliff behaviour.
      
      This should be a fairly uncommon case since usually we don't go into the
      second pass, but it makes sense to scale our low reclaim pressure
      starting at emin.
      
      You can test this by catting two large sparse files, one in a cgroup
      with emin set to some moderate size compared to physical RAM, and
      another cgroup without any emin.  In both cgroups, set an elow larger
      than 50% of physical RAM.  The one with emin will have less page
      scanning, as reclaim pressure is lower.
      
      Rebase on top of and apply the same idea as what was applied to handle
      cgroup_memory=disable properly for the original proportional patch
      http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
      memcg: Handle cgroup_disable=memory when getting memcg protection").
      
      Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Suggested-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9de7ca46
    • C
      mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Chris Down 提交于
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9783aa99
  9. 26 9月, 2019 2 次提交
    • M
      mm: introduce MADV_PAGEOUT · 1a4e58cc
      Minchan Kim 提交于
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a4e58cc
    • M
      mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM · 8940b34a
      Minchan Kim 提交于
      The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
      as default.  It is for preventing to reclaim dirty pages when CMA try to
      migrate pages.  Strictly speaking, we don't need it because CMA didn't
      allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
      
      Moreover, it has a problem to prevent anonymous pages's swap out even
      though force_reclaim = true in shrink_page_list on upcoming patch.  So
      this patch makes references's default value to PAGEREF_RECLAIM and rename
      force_reclaim with ignore_references to make it more clear.
      
      This is a preparatory work for next patch.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8940b34a
  10. 25 9月, 2019 6 次提交
  11. 31 8月, 2019 1 次提交
  12. 14 8月, 2019 1 次提交
    • M
      mm, vmscan: do not special-case slab reclaim when watermarks are boosted · 28360f39
      Mel Gorman 提交于
      Dave Chinner reported a problem pointing a finger at commit 1c30844d
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction and compaction cannot
      move slab pages the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance with slab
      and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that slab
      shrinker is less surprising (arguably more sane but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system and properly obeys VM sysctls.
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtest configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark is using 50% of RAM which will barely be throttled
      and look like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances which I didn't replicate.  Finally, the number of
      iterations differ from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                           5.3.0-rc3   5.3.0-rc3
                             vanillashrinker-v1r1
        Duration User         501.82      497.29
        Duration System      4401.44     4424.08
        Duration Elapsed     8124.76     8358.05
      
      This is showing a slight skew for the max result representing a large
      outlier for the 1st, 2nd and 3rd quartile are similar indicating that
      the bulk of the results show little difference.  Note that an earlier
      version of the fsmark configuration showed a regression but that
      included more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration included time to delete all the test files when the test
      completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident which is part of what the patch is
      addressing.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, dirty pages is
         generally kept at 10% which matches vm.dirty_background_ratio which
         is normal expected historical behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45 with the bulk of the measured ratios
         roughly half way between those values. This is a different balance to
         what Dave reported but it was at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanille kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      4. Overall vmstats are closer to normal expectations
      
      	                                5.3.0-rc3      5.3.0-rc3
      	                                  vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o Vanilla kernel is hitting direct reclaim more frequently,
      	  not very much in absolute terms but the fact the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab
      
      A more complete set of tests were run that were part of the basis for
      introducing boosting and while there are some differences, they are well
      within tolerances.
      
      Bottom line, the special casing kswapd to avoid slab behaviour is
      unpredictable and can lead to abnormal results for normal workloads.
      
      This patch restores the expected behaviour that slab and page cache is
      balanced consistently for a workload with a steady allocation ratio of
      slab/pagecache pages.  It also means that if there are workloads that
      favour the preservation of slab over pagecache that it can be tuned via
      vm.vfs_cache_pressure where as the vanilla kernel effectively ignores
      the parameter when boosting is active.
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28360f39
  13. 03 8月, 2019 1 次提交