1. 15 May 2019, 8 commits
  2. 20 April 2019, 1 commit
  3. 09 April 2019, 1 commit
    • treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively · d75f773c
      Committed by Sakari Ailus
      %pF and %pf are functionally equivalent to %pS and %ps conversion
      specifiers. The former are deprecated, therefore switch the current users
      to use the preferred variant.
      
      The changes have been produced by the following command:
      
      	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
      	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done
      
      And verifying the result.
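
      For context, a small hypothetical illustration of the preferred specifiers
      (not part of the patch): %ps prints the bare symbol name, while %pS prints
      the symbol plus offset, which is what matters for function pointers:
      
      	void (*fn)(void) = some_handler;	/* 'some_handler' is hypothetical */
      
      	pr_info("calling %ps\n", fn);	/* e.g. "calling some_handler" */
      	pr_info("calling %pS\n", fn);	/* e.g. "calling some_handler+0x0/0x10" */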
      
      Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: xen-devel@lists.xenproject.org
      Cc: linux-acpi@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: drbd-dev@lists.linbit.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-mmc@vger.kernel.org
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-pci@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: linux-mm@kvack.org
      Cc: ceph-devel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
      Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
      Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      d75f773c
  4. 06 March 2019, 7 commits
  5. 13 February 2019, 1 commit
    • Revert "mm: slowly shrink slabs with a relatively small number of objects" · a9a238e8
      Committed by Dave Chinner
      This reverts commit 172b06c3 ("mm: slowly shrink slabs with a
      relatively small number of objects").
      
      This change alters the aggressiveness of shrinker reclaim, causing small
      cache and low priority reclaim to greatly increase scanning pressure on
      small caches.  As a result, light memory pressure has a disproportionate
      effect on small caches, and causes large caches to be reclaimed much
      faster than previously.
      
      As a result, it greatly perturbs the delicate balance of the VFS caches
      (dentry/inode vs file page cache) such that the inode/dentry caches are
      reclaimed much, much faster than the page cache and this drives us into
      several other caching imbalance related problems.
      
      As such, this is a bad change and needs to be reverted.
      
      [ Needs some massaging to retain the later seekless shrinker
        modifications.]
      
      Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
      Fixes: 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Cc: Wolfgang Walter <linux@stwm.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Spock <dairinin@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a238e8
  6. 29 December 2018, 2 commits
    • mm: put_and_wait_on_page_locked() while page is migrated · 9a1ea439
      Committed by Hugh Dickins
      Waiting on a page migration entry has used wait_on_page_locked() all along
      since 2006: but you cannot safely wait_on_page_locked() without holding a
      reference to the page, and that extra reference is enough to make
      migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
      on the entry before migrate_page_move_mapping() gets there.
      
      And that failure is retried nine times, amplifying the pain when trying to
      migrate a popular page.  With a single persistent faulter, migration
      sometimes succeeds; with two or three concurrent faulters, success becomes
      much less likely (and the more the page was mapped, the worse the overhead
      of unmapping and remapping it on each try).
      
      This is especially a problem for memory offlining, where the outer level
      retries forever (or until terminated from userspace), because a heavy
      refault workload can trigger an endless loop of migration failures.
      wait_on_page_locked() is the wrong tool for the job.
      
      David Herrmann (but was he the first?) noticed this issue in 2014:
      https://marc.info/?l=linux-mm&m=140110465608116&w=2
      
      Tim Chen started a thread in August 2017 which appears relevant:
      https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
      on to implicate __migration_entry_wait():
      https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
      up with the v4.14 commits: 2554db91 ("sched/wait: Break up long wake
      list walk") 11a19c7b ("sched/wait: Introduce wakeup boomark in
      wake_up_page_bit")
      
      Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
      https://marc.info/?l=linux-mm&m=154217936431300&w=2
      
      We have all assumed that it is essential to hold a page reference while
      waiting on a page lock: partly to guarantee that there is still a struct
      page when MEMORY_HOTREMOVE is configured, but also to protect against
      reuse of the struct page going to someone who then holds the page locked
      indefinitely, when the waiter can reasonably expect timely unlocking.
      
      But in fact, so long as wait_on_page_bit_common() does the put_page(), and
      is careful not to rely on struct page contents thereafter, there is no
      need to hold a reference to the page while waiting on it.  That does mean
      that this case cannot go back through the loop: but that's fine for the
      page migration case, and even if used more widely, is limited by the "Stop
      walking if it's locked" optimization in wake_page_function().
      
      Add interface put_and_wait_on_page_locked() to do this, using "behavior"
      enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
      No interruptible or killable variant needed yet, but they might follow: I
      have a vague notion that reporting -EINTR should take precedence over
      return from wait_on_page_bit_common() without knowing the page state, so
      arrange it accordingly - but that may be nothing but pedantic.
      
      __migration_entry_wait() still has to take a brief reference to the page,
      prior to calling put_and_wait_on_page_locked(): but now that it is dropped
      before waiting, the chance of impeding page migration is very much
      reduced.  Should we perhaps disable preemption across this?
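
      A rough sketch of the resulting call pattern in __migration_entry_wait()
      (illustrative only; details may differ from the actual hunk):
      
      	page = migration_entry_to_page(entry);
      	if (!get_page_unless_zero(page))	/* the brief reference noted above */
      		goto out;
      	pte_unmap_unlock(ptep, ptl);
      	put_and_wait_on_page_locked(page);	/* reference dropped before waiting */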
      
      shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
      survived a lot of testing before that showed up.  PageWaiters may have
      been set by wait_on_page_bit_common(), and the reference dropped, just
      before shrink_page_list() succeeds in freezing its last page reference: in
      such a case, unlock_page() must be used.  Follow the suggestion from
      Michal Hocko, just revert a978d6f5 ("mm: unlockless reclaim") now:
      that optimization predates PageWaiters, and won't buy much these days; but
      we can reinstate it for the !PageWaiters case if anyone notices.
      
      It does raise the question: should vmscan.c's is_page_cache_freeable() and
      __remove_mapping() now treat a PageWaiters page as if an extra reference
      were held?  Perhaps, but I don't think it matters much, since
      shrink_page_list() already had to win its trylock_page(), so waiters are
      not very common there: I noticed no difference when trying the bigger
      change, and it's surely not needed while put_and_wait_on_page_locked() is
      only used for page migration.
      
      [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Baoquan He <bhe@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a1ea439
    • mm: reclaim small amounts of memory when an external fragmentation event occurs · 1c30844d
      Committed by Mel Gorman
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation-causing
      event occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark, and kswapd is woken, if the
      calling context allows it, to reclaim an amount of memory relative to the
      size of the high watermark and the watermark_boost_factor until the boost
      is cleared.  When kswapd finishes, it wakes kcompactd at the pageblock
      order to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
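
      A rough sketch of the boost arithmetic described above (the field and helper
      names here are assumptions for illustration, not necessarily those used in
      the patch):
      
      	/* boost grows by one pageblock per event, capped in proportion to the
      	 * high watermark; watermark_boost_factor is in units of 1/10000 */
      	max_boost = mult_frac(high_wmark_pages(zone), watermark_boost_factor, 10000);
      	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
      				    max_boost);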
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation-causing events are massively reduced by
      this patch, whether compared to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c30844d
  7. 07 November 2018, 1 commit
  8. 27 October 2018, 4 commits
    • mm: zero-seek shrinkers · 4b85afbd
      Committed by Johannes Weiner
      The page cache and most shrinkable slab caches hold data that has been
      read from disk, but there are some caches that only cache CPU work, such
      as the dentry and inode caches of procfs and sysfs, as well as the subset
      of radix tree nodes that track non-resident page cache.
      
      Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
      the shrinker's seeks setting tells the reclaim algorithm that for every
      two page cache pages scanned it should scan one slab object.
      
      This is a bogus setting.  A virtual inode that required no IO to create is
      not twice as valuable as a page cache page; shadow cache entries with
      eviction distances beyond the size of memory aren't either.
      
      In most cases, the behavior in practice is still fine.  Such virtual
      caches don't tend to grow and assert themselves aggressively, and usually
      get picked up before they cause problems.  But there are scenarios where
      that's not true.
      
      Our database workloads suffer from two of those.  For one, their file
      workingset is several times bigger than available memory, which has the
      kernel aggressively create shadow page cache entries for the non-resident
      parts of it.  The workingset code does tell the VM that most of these are
      expendable, but the VM ends up balancing them 2:1 to cache pages as per
      the seeks setting.  This is a huge waste of memory.
      
      These workloads also deal with tens of thousands of open files and use
      /proc for introspection, which ends up growing the proc_inode_cache to
      absurdly large sizes - again at the cost of valuable cache space, which
      isn't a reasonable trade-off, given that proc inodes can be re-created
      without involving the disk.
      
      This patch implements a "zero-seek" setting for shrinkers that results in
      a target ratio of 0:1 between their objects and IO-backed caches.  This
      allows such virtual caches to grow when memory is available (they do
      cache/avoid CPU work after all), but effectively disables them as soon as
      IO-backed objects are under pressure.
      
      It then switches the shrinkers for procfs and sysfs metadata, as well as
      excess page cache shadow nodes, to the new zero-seek setting.
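
      For illustration, a hedged sketch of what a zero-seek shrinker registration
      looks like (the callbacks below are placeholders, not the actual procfs or
      workingset code):
      
      	static struct shrinker example_shrinker = {
      		.count_objects	= example_count,	/* hypothetical callback */
      		.scan_objects	= example_scan,		/* hypothetical callback */
      		.seeks		= 0,	/* zero-seek: 0:1 ratio against IO-backed pages */
      	};
      	/* DEFAULT_SEEKS would keep the old 2:1 balance with the page cache */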
      
      Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Domas Mituzas <dmituzas@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b85afbd
    • psi: pressure stall information for CPU, memory, and IO · eb414681
      Committed by Johannes Weiner
      When systems are overcommitted and resources become contended, it's hard
      to tell exactly the impact this has on workload productivity, or how close
      the system is to lockups and OOM kills.  In particular, when machines work
      multiple jobs concurrently, the impact of overcommit in terms of latency
      and throughput on the individual job can be enormous.
      
      In order to maximize hardware utilization without sacrificing individual
      job health or risk complete machine lockups, this patch implements a way
      to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the loadaverage).
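
      As an illustration (assumed sample output; the exact formatting is defined
      by the patch and may differ), reading one of the new files looks roughly
      like:
      
      	$ cat /proc/pressure/memory
      	some avg10=0.12 avg60=0.05 avg300=0.01 total=1193086
      	full avg10=0.00 avg60=0.01 avg300=0.00 total=612233
      
      where "some" is the share of walltime in which at least one task was stalled
      on memory, and "full" the share in which all non-idle tasks were stalled
      simultaneously.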
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb414681
    • mm: workingset: tell cache transitions from workingset thrashing · 1899ad18
      Committed by Johannes Weiner
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bona fide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1899ad18
    • mm: don't miss the last page because of round-off error · 68600f62
      Committed by Roman Gushchin
      I've noticed that dying memory cgroups are often pinned in memory by a
      single pagecache page.  Even under moderate memory pressure they sometimes
      stayed in that state for a long time.  That looked strange.
      
      My investigation showed that the problem is caused by applying the LRU
      pressure balancing math:
      
        scan = div64_u64(scan * fraction[lru], denominator),
      
      where
      
        denominator = fraction[anon] + fraction[file] + 1.
      
      Because fraction[lru] is always less than denominator, if the initial scan
      size is 1, the result is always 0.
      
      This means the last page is not scanned and has no chance of being
      reclaimed.
      
      Fix this by rounding up the result of the division.
      
      In practice this change significantly improves the speed of dying cgroups
      reclaim.
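
      A hedged sketch of the fix as described (the round-up helper is the
      DIV64_U64_ROUND_UP() mentioned in the note below; this is illustrative, not
      the verbatim hunk):
      
      	/* before: truncating division could yield 0 when scan == 1 */
      	scan = div64_u64(scan * fraction[lru], denominator);
      	
      	/* after: round up, so a non-zero numerator scans at least one page */
      	scan = DIV64_U64_ROUND_UP(scan * fraction[lru], denominator);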
      
      [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
        Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
      Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68600f62
  9. 21 October 2018, 2 commits
  10. 06 October 2018, 1 commit
  11. 21 September 2018, 1 commit
    • mm: slowly shrink slabs with a relatively small number of objects · 172b06c3
      Committed by Roman Gushchin
      9092c71b ("mm: use sc->priority for slab shrink targets") changed the
      way that the target slab pressure is calculated and made it
      priority-based:
      
          delta = freeable >> priority;
          delta *= 4;
          do_div(delta, shrinker->seeks);
      
      The problem is that on a default priority (which is 12) no pressure is
      applied at all, if the number of potentially reclaimable objects is less
      than 4096 (1<<12).
      
      This causes the last objects on slab caches of no longer used cgroups to
      (almost) never get reclaimed.  It's obviously a waste of memory.
      
      It can be especially painful, if these stale objects are holding a
      reference to a dying cgroup.  Slab LRU lists are reparented on memcg
      offlining, but corresponding objects are still holding a reference to the
      dying cgroup.  If we don't scan these objects, the dying cgroup can't go
      away.  Most likely, the parent cgroup doesn't have any directly charged
      objects, only remaining objects from dying child cgroups.  So it can
      easily hold a reference to hundreds of dying cgroups.
      
      If there are no big spikes in memory pressure, and new memory cgroups are
      created and destroyed periodically, this causes the number of dying
      cgroups to grow steadily, resulting in a slow-ish and hard-to-detect memory
      "leak".  It's not a real leak, as the memory can eventually be reclaimed,
      but in real life that may never happen.  I've seen hosts with a steadily
      climbing number of dying cgroups, showing no signs of decline for months,
      despite the host being loaded with a production workload.
      
      It is an obvious waste of memory, and to prevent it, let's apply a minimal
      pressure even on small shrinker lists.  E.g.  if there are freeable
      objects, let's scan at least min(freeable, scan_batch) objects.
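
      A hedged sketch of that clamp (an assumption drawn from the sentence above,
      not necessarily the exact patch hunk), applied after the priority-based
      delta computation quoted earlier:
      
      	delta = freeable >> priority;
      	delta *= 4;
      	do_div(delta, shrinker->seeks);
      	/* apply minimal pressure even when freeable >> priority == 0 */
      	delta = max_t(unsigned long long, delta, min(freeable, scan_batch));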
      
      This fix significantly improves the chance of a dying cgroup being
      reclaimed, and together with some previous patches it stops the steady
      growth of the number of dying cgroups on some of our hosts.
      
      Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com
      Fixes: 9092c71b ("mm: use sc->priority for slab shrink targets")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      172b06c3
  12. 23 August 2018, 2 commits
  13. 18 August 2018, 9 commits
    • mm: use special value SHRINKER_REGISTERING instead of list_empty() check · 7e010df5
      Committed by Kirill Tkhai
      The patch introduces a special value, SHRINKER_REGISTERING, to use instead
      of a list_empty() check to distinguish a shrinker that is still registering
      from an unregistered one.  Why do we need that at all?
      
      Shrinker registration is split into two parts.  The first one is
      prealloc_shrinker(), which allocates shrinker memory and reserves an ID in
      shrinker_idr.  This function can fail.  The second is
      register_shrinker_prepared(), which finalizes the registration.  This
      function actually makes the shrinker available to be used from
      shrink_slab(), and it can't fail.
      
      One shrinker may be based on more than one LRU list.  So, we never
      clear the bit in the memcg shrinker maps when one of the corresponding
      LRU lists becomes empty, since the other LRU lists may not be empty.  See
      the superblock shrinker for example: it is based on two LRU lists,
      s_inode_lru and s_dentry_lru.  We do not want to clear the shrinker bit
      when there are no inodes in s_inode_lru, as s_dentry_lru may contain
      dentries.
      
      Instead, we use a special algorithm to detect shrinkers that have no
      elements on any of their LRU lists; this is done in shrink_slab_memcg().
      See the comment in this function for the details.
      
      Also, in shrink_slab_memcg() we clear the shrinker bit in the map when we
      meet an unregistered shrinker (the bit is set while there is no shrinker
      in the IDR).  Otherwise, we would have had to do that at the moment of
      shrinker unregistration for all memcgs (and this looks worse, since
      iterating over all memcgs may take much time).  It would also have imposed
      restrictions on shrinker unregistration order for its users: they would
      have had to guarantee that there are no new elements after
      unregister_shrinker() (otherwise, a newly added element would have set a
      bit).
      
      So, if we meet a set bit in the map and no shrinker in the IDR while
      iterating over the map in shrink_slab_memcg(), this means the
      corresponding shrinker is unregistered, and we must clear the bit.
      
      Another case is shrinker registration.  We want two things there:
      
      1) do_shrink_slab() can be called only for completely registered
         shrinkers;
      
      2) shrinker internal lists may be populated in any order relative to
         register_shrinker_prepared() (let's use the sb shrinker as an example).
         Both of:
      
        a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
          memcg_set_shrinker_bit();                               [cpu0]
          ...
          register_shrinker_prepared();                           [cpu1]
      
        and
      
        b)register_shrinker_prepared();                           [cpu0]
          ...
          list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
          memcg_set_shrinker_bit();                               [cpu1]
      
         are legitimate.  We don't want to impose a restriction here and force
         people to use only variant (b).  We don't want to force people to make
         sure there are no elements on the LRU lists before the shrinker is
         completely registered.  Internal users of the LRU lists and the shrinker
         code are two different subsystems, and they should remain self-contained
         rather than depend on each other.
      
      In case (a) we have the bit set before the shrinker is completely
      registered.  We don't want do_shrink_slab() to be called at this moment,
      so we have to detect such registering shrinkers.
      
      Before this patch, a list_empty() check (shrinker not linked to the list)
      was used for that.  So, in (a) a bit could be set, but we would not call
      do_shrink_slab() unless the shrinker was linked to the list.  It was
      just an indicator; linking to the list was simply overloaded for it.
      
      This was not the best solution, since it's better not to touch the
      shrinker memory from shrink_slab_memcg() before it's completely
      registered (this also will be useful in the future to make shrink_slab()
      completely lockless).
      
      So, this patch introduces a better way to detect a registering shrinker,
      one that does not have to dereference the shrinker memory.  It's just a
      ~0UL value, which we insert into the IDR during ID allocation.  After the
      shrinker is ready to be used, we insert the actual shrinker pointer in the
      IDR, and it becomes available to shrink_slab_memcg().
      
      We can't use NULL instead of this new value for this purpose, because
      shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
      and we don't want the function to see NULL and clear the bit, otherwise
      (a) won't work.
      
      This is the only thing the patch does: a better way to detect a
      registering shrinker.  Nothing else.
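
      A hedged sketch of the scheme (abridged and based on the description above;
      not necessarily line-for-line identical to the patch):
      
      	#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
      	
      	/* prealloc_shrinker(): reserve the ID, publish only the placeholder */
      	id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
      	
      	/* register_shrinker_prepared(): publish the real pointer */
      	idr_replace(&shrinker_idr, shrinker, shrinker->id);
      	
      	/* shrink_slab_memcg(): skip placeholders, clear bits of stale entries */
      	shrinker = idr_find(&shrinker_idr, i);
      	if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
      		if (!shrinker)
      			clear_bit(i, map->map);
      		continue;
      	}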
      
      This also gives better assembly, but that is a minor side effect of the patch:
      
      Before:
        callq  <idr_find>
        mov    %rax,%r15
        test   %rax,%rax
        je     <shrink_slab_memcg+0x1d5>
        mov    0x20(%rax),%rax
        lea    0x20(%r15),%rdx
        cmp    %rax,%rdx
        je     <shrink_slab_memcg+0xbd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  <do_shrink_slab>
      
      After:
        callq  <idr_find>
        mov    %rax,%r15
        lea    -0x1(%rax),%rax
        cmp    $0xfffffffffffffffd,%rax
        ja     <shrink_slab_memcg+0x1cd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  ffffffff810cefd0 <do_shrink_slab>
      
      [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
        Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
      Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e010df5
    • mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab() · ac7fb3ad
      Committed by Kirill Tkhai
      In the case of shrink_slab_memcg() we do not zero nid when the shrinker is
      not NUMA-aware.  This is not a real problem, since currently all
      memcg-aware shrinkers are NUMA-aware too (we have two: the super_block
      shrinker and the workingset shrinker), but something may change in the future.
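
      A minimal sketch of the check being moved, as described (illustrative, not
      the exact diff):
      
      	/* in do_shrink_slab(): non-NUMA-aware shrinkers always use node 0 */
      	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
      		nid = 0;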
      
      Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac7fb3ad
    • mm/vmscan.c: clear shrinker bit if there are no objects related to memcg · f90280d6
      Committed by Kirill Tkhai
      To avoid further unneeded calls of do_shrink_slab() for shrinkers which
      already do not have any charged objects in a memcg, their bits have to
      be cleared.
      
      This patch introduces a lockless mechanism to do that without races
      with parallel list_lru_add().  After do_shrink_slab() returns
      SHRINK_EMPTY the first time, we clear the bit and call it once again.
      Then we restore the bit if the new return value is different.
      
      Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers
      two situations:
      
      1)list_lru_add()     shrink_slab_memcg
          list_add_tail()    for_each_set_bit() <--- read bit
                               do_shrink_slab() <--- missed list update (no barrier)
          <MB>                 <MB>
          set_bit()            do_shrink_slab() <--- seen list update
      
      This situation, when the first do_shrink_slab() sees the set bit but
      doesn't see the list update (i.e., a race with the first element being
      queued), is rare.  So we don't add <MB> before the first call of
      do_shrink_slab(), to avoid slowing down the generic case.  This is also
      why the second call is needed, as seen below in (2).
      
      2)list_lru_add()      shrink_slab_memcg()
          list_add_tail()     ...
          set_bit()           ...
        ...                   for_each_set_bit()
        do_shrink_slab()        do_shrink_slab()
          clear_bit()           ...
        ...                     ...
        list_lru_add()          ...
          list_add_tail()       clear_bit()
          <MB>                  <MB>
          set_bit()             do_shrink_slab()
      
      The barriers guarantee that the second do_shrink_slab() in the right-hand
      task sees the list update if it really cleared the bit.  This case is
      drawn in the code comment.
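
      A hedged sketch of the clear/recheck/restore sequence described above
      (helper names follow the surrounding patches; the real code may differ in
      detail):
      
      	ret = do_shrink_slab(&sc, shrinker, priority);
      	if (ret == SHRINK_EMPTY) {
      		clear_bit(i, map->map);
      		/* pairs with the <MB> in list_lru_add(); see diagrams above */
      		smp_mb__after_atomic();
      		ret = do_shrink_slab(&sc, shrinker, priority);
      		if (ret == SHRINK_EMPTY)
      			ret = 0;
      		else
      			memcg_set_shrinker_bit(memcg, nid, i);
      	}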
      
      [Results/performance of the patchset]
      
      With the whole patchset applied, the test below shows a significant
      increase in performance:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
            $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
      			    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
      			    mkdir -p s/$i; mount -t tmpfs $i s/$i;
      			    touch s/$i/file; done
      
      Then, 5 sequential calls of drop caches:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
      1)Before:
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      2)After
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show a performance increase of at least 548 times.
      
      Shakeel Butt tested this patchset with fork-bomb on his configuration:
      
       > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
       > file containing few KiBs on corresponding mount. Then in a separate
       > memcg of 200 MiB limit ran a fork-bomb.
       >
       > I ran the "perf record -ag -- sleep 60" and below are the results:
       >
       > Without the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
       > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
       > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
       > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
       > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
       > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       >
       > With the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
       > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
       > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
       > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
       > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
       > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
       > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f90280d6
    • mm: add SHRINK_EMPTY shrinker methods return value · 9b996468
      Committed by Kirill Tkhai
      We need to distinguish the situation when a shrinker has a very small
      number of objects (see vfs_pressure_ratio() called from
      super_cache_count()) from the one where it has no objects at all.
      Currently, in both of these cases, shrinker::count_objects() returns 0.
      
      The patch introduces a new SHRINK_EMPTY return value, which will be used
      for the "no objects at all" case.  It is mostly a refactoring, as
      SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
      patch; the real changes will happen in further patches.
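
      A hedged illustration of the distinction (the constant's exact definition is
      an assumption; example_count and its list_lru are placeholders):
      
      	/* ~0UL is already taken by SHRINK_STOP, so use the next value down */
      	#define SHRINK_EMPTY	(~0UL - 1)
      	
      	static unsigned long example_count(struct shrinker *s,
      					   struct shrink_control *sc)
      	{
      		unsigned long n = list_lru_shrink_count(&example_lru, sc);
      	
      		return n ? n : SHRINK_EMPTY;	/* a plain 0 would be ambiguous */
      	}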
      
      Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b996468
    • mm/vmscan.c: generalize shrink_slab() calls in shrink_node() · aeed1d32
      Committed by Vladimir Davydov
      The patch makes shrink_slab() be called for root_mem_cgroup in the same
      way as it's called for the rest of the cgroups.  This simplifies the logic
      and improves readability.
      
      [ktkhai@virtuozzo.com: wrote changelog]
      Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aeed1d32
    • mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab() · b0dedc49
      Committed by Kirill Tkhai
      Using the preparations made in the previous patches, in the case of memcg
      shrink we may skip shrinkers which are not set in the memcg's shrinkers
      bitmap.  To do that, we separate iterations over memcg-aware and
      !memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
      for_each_set_bit() from the bitmap.  On big nodes with many
      isolated environments, this gives significant performance growth.  See the
      next patches for the details.
      
      Note that the patch does not yet handle empty memcg shrinkers, since we
      never clear the bitmap bits after they are set once.  Their shrinkers will
      be called again, with no objects shrunk as a result.  That functionality
      is provided by the next patches.
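
      A hedged sketch of the iteration described above (map lookup, locking and
      error handling elided; not the exact patch hunk):
      
      	map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, true);
      	for_each_set_bit(i, map->map, shrinker_nr_max) {
      		struct shrinker *shrinker = idr_find(&shrinker_idr, i);
      	
      		if (!shrinker)
      			continue;	/* later patches also clear the bit here */
      	
      		freed += do_shrink_slab(&sc, shrinker, priority);
      	}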
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0dedc49
    • mm, memcg: assign memcg-aware shrinkers bitmap to memcg · 0a4465d3
      Committed by Kirill Tkhai
      Imagine a big node with many CPUs, memory cgroups and containers.  Suppose
      we have 200 containers, and every container has 10 mounts and 10 cgroups.
      All container tasks don't touch foreign containers' mounts.  If there is
      intensive page writing and global reclaim happens, a writing task has to
      iterate over all memcgs to shrink slab before it's able to go to
      shrink_page_list().
      
      Iteration over all the memcg slabs is very expensive: the task has to
      visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
      2000 memcgs, the total calls are 2000 * 2000 = 4000000.
      
      So, the shrinker makes 4 million do_shrink_slab() calls just to try to
      isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
      shrink_page_list().  I've observed a node spending almost 100% of its time
      in the kernel, making useless iterations over already shrunk slabs.
      
      This patch adds a bitmap of memcg-aware shrinkers to the memcg.  The size
      of the bitmap depends on bitmap_nr_ids, and during the memcg's lifetime it
      is maintained to be large enough to fit bitmap_nr_ids shrinkers.  Every bit
      in the map corresponds to a shrinker id.
      
      The next patches will keep a bit set only for memcgs that really have
      charged objects.  This will allow shrink_slab() to improve its performance
      significantly.  See the last patch for the numbers.
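
      A hedged sketch of the per-node map structure this adds (based on the
      description above; the exact layout in the patch may differ slightly):
      
      	/* one map per memcg per NUMA node, resized via RCU as shrinker IDs grow */
      	struct memcg_shrinker_map {
      		struct rcu_head rcu;
      		unsigned long map[0];	/* bit i: shrinker id i may have objects */
      	};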
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
      [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
        Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
      Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a4465d3
    • mm: assign id to every memcg-aware shrinker · b4c2b231
      Committed by Kirill Tkhai
      Introduce a shrinker::id number, which is used to enumerate memcg-aware
      shrinkers.  The numbers start from 0, and the code tries to keep them
      as small as possible.
      
      This will be used to represent a memcg-aware shrinker in the memcg
      shrinkers map.
      
      Since all memcg-aware shrinkers are based on list_lru, which is
      per-memcg in case of CONFIG_MEMCG_KMEM only, the new functionality will
      be under this config option.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4c2b231
    • mm/vmscan.c: condense scan_control · bb451fdf
      Committed by Greg Thelen
      Use smaller scan_control fields for order, priority, and reclaim_idx.
      Convert fields from int => s8.  All easily fit within a byte:
      
       - allocation order range: 0..MAX_ORDER(64?)
       - priority range:         0..12(DEF_PRIORITY)
       - reclaim_idx range:      0..6(__MAX_NR_ZONES)
      
      Since 6538b8ea ("x86_64: expand kernel stack to 16K") x86_64 stack
      overflows are not an issue.  But it's inefficient to use ints.
      
      Use s8 (signed byte) rather than u8 to allow for loops like:
      	do {
      		...
      	} while (--sc.priority >= 0);
      
      Add BUILD_BUG_ON to verify that s8 is capable of storing max values.
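
      A hedged illustration of the change (field names and checks as the message
      describes them; not necessarily the exact hunk):
      
      	struct scan_control {
      		/* ... other fields unchanged ... */
      		s8	order;		/* was int: 0..MAX_ORDER */
      		s8	priority;	/* was int: 0..DEF_PRIORITY */
      		s8	reclaim_idx;	/* was int: 0..__MAX_NR_ZONES */
      	};
      	
      	/* evaluated at compile time to catch overflow of the s8 fields */
      	BUILD_BUG_ON(MAX_ORDER > S8_MAX);
      	BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
      	BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);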
      
      This reduces sizeof(struct scan_control):
       - 96 => 80 bytes (x86_64)
       - 68 => 56 bytes (i386)
      
      scan_control structure field order is changed to utilize padding.  After
      this patch there is 1 bit of scan_control padding.
      
      akpm: makes my vmscan.o's .text 572 bytes smaller as well.
      
      Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb451fdf