1. 27 Sep, 2021 (1 commit)
  2. 24 Sep, 2021 (1 commit)
    • memcg: flush lruvec stats in the refault · 1f828223
      Committed by Shakeel Butt
      Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat")
      and the commit aa48e47e ("memcg: infrastructure to flush memcg
      stats"), each lruvec memcg stat could be off by (nr_cgroups * nr_cpus *
      32) at worst, for an unbounded amount of time.  The commit aa48e47e
      moved the lruvec stats to rstat infrastructure and the commit
      7e1c0d6f bounded the error for all the lruvec stats to (nr_cpus *
      32) at worst for at most 2 seconds.  More specifically it decoupled the
      number of stats and the number of cgroups from the error rate.
      
      However, this reduction in error comes at the cost of triggering the
      stats-update slowpath more frequently.  Previously, the slowpath added
      the stats up the memcg tree.  After aa48e47e, the kernel instead
      triggers an async lruvec stats flush through queue_work().  This
      causes regression reports from 0day kernel bot [1] as well as from
      phoronix test suite [2].
      
      We tried two options to fix the regression:
      
       1) Increase the threshold to trigger the slowpath in lruvec stats
          update codepath from 32 to 512.
      
       2) Remove the slowpath from lruvec stats update codepath and instead
          flush the stats in the page refault codepath. The assumption is that
          the kernel flushes the stats in a timely manner, so the update tree
          seen in the refault codepath stays small and does not cause a
          performance impact.
      
      The following are the results of the will-it-scale/page_fault[1|2|3]
      benchmark for four settings: (1) 5.15-rc1 as baseline, (2) 5.15-rc1 with
      aa48e47e and 7e1c0d6f reverted, (3) 5.15-rc1 with option-1, and
      (4) 5.15-rc1 with option-2.
      
        test       (1)      (2)               (3)               (4)
        pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
        pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
        pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)
      
      From the above results, option-2 not only resolves the regression but
      also improves performance for at least these benchmarks.
      
      Feng Tang (Intel) ran the aim7 benchmark with both options and
      confirmed that option-1 reduces the regression while option-2
      eliminates it.
      
      Michael Larabel (Phoronix) ran multiple benchmarks with these options
      and reported the results at [3]; they show that for most benchmarks
      option-2 removes the regression introduced by the commit aa48e47e
      ("memcg: infrastructure to flush memcg stats").
      
      Based on these experimental results, this patch adopts option-2 as the
      fix for the regression.
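
      A minimal sketch of where option-2 moves the flush, assuming the
      mem_cgroup_flush_stats() helper from the memcg rstat flushing
      infrastructure; the function body is abbreviated and only marks the new
      call site:

        /* mm/workingset.c (illustrative sketch, not the verbatim patch) */
        void workingset_refault(struct page *page, void *shadow)
        {
                /* Flush batched memcg stats before the refault math reads them. */
                mem_cgroup_flush_stats();

                rcu_read_lock();
                /* ... unpack the shadow entry, compute the refault distance ... */
                rcu_read_unlock();
        }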
      
      Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
      Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
      Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
      Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Tested-by: Michael Larabel <Michael@phoronix.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>,
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>,
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 09 Sep, 2021 (1 commit)
  4. 01 Jul, 2021 (1 commit)
  5. 30 Jun, 2021 (1 commit)
  6. 06 May, 2021 (1 commit)
  7. 25 Feb, 2021 (2 commits)
  8. 16 Dec, 2020 (2 commits)
  9. 03 Dec, 2020 (1 commit)
    • mm: memcontrol: Use helpers to read page's memcg data · bcfe06bf
      Committed by Roman Gushchin
      Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
      
      Currently a non-slab kernel page which has been charged to a memory cgroup
      can't be mapped to userspace.  The underlying reason is simple: PageKmemcg
      flag is defined as a page type (like buddy, offline, etc), so it takes a
      bit from the page->_mapcount counter.  Pages with a type set can't be mapped to
      userspace.
      
      But in general the kmemcg flag has nothing to do with mapping to
      userspace.  It only means that the page has been accounted by the page
      allocator, so it has to be properly uncharged on release.
      
      Some bpf maps are mapping the vmalloc-based memory to userspace, and their
      memory can't be accounted because of this implementation detail.
      
      This patchset removes this limitation by moving the PageKmemcg flag into
      one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
      accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
      adds several checks and removes a couple of obsolete functions.  As a
      result, the code becomes more robust, with fewer open-coded bit tricks.
      
      This patch (of 4):
      
      Currently there are many open-coded reads of the page->mem_cgroup pointer,
      as well as a couple of read helpers, which are barely used.
      
      This is an obstacle on the way to reusing some bits of the pointer for
      storing additional bits of information.  In fact, we already do this for
      slab pages, where the last bit indicates that a pointer has an attached
      vector of objcg pointers instead of a regular memcg pointer.
      
      This commit uses two existing helpers and introduces a new one,
      converting all read sides to calls of these helpers:
        struct mem_cgroup *page_memcg(struct page *page);
        struct mem_cgroup *page_memcg_rcu(struct page *page);
        struct mem_cgroup *page_memcg_check(struct page *page);
      
      page_memcg_check() is intended to be used in cases when the page can be a
      slab page whose memcg pointer points at an objcg vector.  It checks
      the lowest bit and, if it is set, returns NULL.  page_memcg() contains a
      VM_BUG_ON_PAGE() check for the page not being a slab page.
      
      To make sure nobody uses a direct access, struct page's
      mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
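
      A sketch of the accessors, assuming the lowest bit of memcg_data flags
      an attached objcg vector (the mask and flag names used upstream may
      differ slightly):

        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                /* Slab pages carry objcg vectors, not a memcg pointer. */
                VM_BUG_ON_PAGE(PageSlab(page), page);
                return (struct mem_cgroup *)page->memcg_data;
        }

        static inline struct mem_cgroup *page_memcg_check(struct page *page)
        {
                unsigned long memcg_data = READ_ONCE(page->memcg_data);

                /* Lowest bit set: this is an objcg vector, not a memcg. */
                if (memcg_data & 0x1UL)
                        return NULL;
                return (struct mem_cgroup *)memcg_data;
        }
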
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
      Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
      Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
  10. 17 Oct, 2020 (1 commit)
  11. 13 Oct, 2020 (1 commit)
  12. 15 Aug, 2020 (1 commit)
  13. 13 Aug, 2020 (2 commits)
  14. 08 Aug, 2020 (1 commit)
  15. 26 Jun, 2020 (1 commit)
  16. 04 Jun, 2020 (3 commits)
    • mm: vmscan: reclaim writepage is IO cost · 96f8bf4f
      Committed by Johannes Weiner
      The VM tries to balance reclaim pressure between anon and file so as to
      reduce the amount of IO incurred due to the memory shortage.  It already
      counts refaults and swapins, but in addition it should also count
      writepage calls during reclaim.
      
      For swap, this is obvious: it's IO that wouldn't have occurred if the
      anonymous memory hadn't been under memory pressure.  From a relative
      balancing point of view this makes sense as well: even if anon is cold and
      reclaimable, a cache that isn't thrashing may have equally cold pages that
      don't require IO to reclaim.
      
      For file writeback, it's trickier: some of the reclaim writepage IO would
      have likely occurred anyway due to dirty expiration.  But not all of it -
      premature writeback reduces batching and generates additional writes.
      Since the flushers are already woken up by the time the VM starts writing
      cache pages one by one, let's assume that we're likely causing writes that
      wouldn't have happened without memory pressure.  In addition, the per-page
      cost of IO would have probably been much cheaper if written in larger
      batches from the flusher thread rather than the single-page-writes from
      kswapd.
      
      For our purposes - getting the trend right to accelerate convergence on a
      stable state that doesn't require paging at all - this is sufficiently
      accurate.  If we later wanted to optimize for sustained thrashing, we can
      still refine the measurements.
      
      Count all writepage calls from kswapd as IO cost toward the LRU that the
      page belongs to.
      
      Why do this dynamically?  Don't we know in advance that anon pages require
      IO to reclaim, and so could build in a static bias?
      
      First, scanning is not the same as reclaiming.  If all the anon pages are
      referenced, we may not swap for a while just because we're scanning the
      anon list.  During this time, however, it's important that we age
      anonymous memory and the page cache at the same rate so that their
      hot-cold gradients are comparable.  Everything else being equal, we still
      want to reclaim the coldest memory overall.
      
      Second, we keep copies in swap unless the page changes.  If there is
      swap-backed data that's mostly read (tmpfs file) and has been swapped out
      before, we can reclaim it without incurring additional IO.
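
      A heavily simplified sketch of the bookkeeping, assuming the per-lruvec
      anon/file cost counters from this series; the hierarchical propagation
      and decay of the real lru_note_cost() are omitted:

        /* Charge reclaim-initiated writeback as IO cost to the page's LRU. */
        static void note_reclaim_writeback(struct lruvec *lruvec, bool file,
                                           unsigned int nr_pageout)
        {
                if (file)
                        lruvec->file_cost += nr_pageout;
                else
                        lruvec->anon_cost += nr_pageout;
        }

      This feeds the same cost currency as refaults and swapins, so kswapd's
      single-page writes tip the anon/file balance like any other IO.
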
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: balance LRU lists based on relative thrashing · 314b57fb
      Committed by Johannes Weiner
      Since the LRUs were split into anon and file lists, the VM has been
      balancing between page cache and anonymous pages based on per-list ratios
      of scanned vs.  rotated pages.  In most cases that tips page reclaim
      towards the list that is easier to reclaim and has the fewest actively
      used pages, but there are a few problems with it:
      
      1. Refaults and LRU rotations are weighted the same way, even though
         one costs IO and the other costs a bit of CPU.
      
      2. The less we scan an LRU list based on already observed rotations,
         the more we increase the sampling interval for new references, and
         rotations become even more likely on that list. This can enter a
         death spiral in which we stop looking at one list completely until
         the other one is all but annihilated by page reclaim.
      
      Since commit a528910e ("mm: thrash detection-based file cache sizing")
      we have refault detection for the page cache.  Along with swapin events,
      they are good indicators of when the file or anon list, respectively, is
      too small for its workingset and needs to grow.
      
      For example, if the page cache is thrashing, the cache pages need more
      time in memory, while there may be colder pages on the anonymous list.
      Likewise, if swapped pages are faulting back in, it indicates that we
      reclaim anonymous pages too aggressively and should back off.
      
      Replace LRU rotations with refaults and swapins as the basis for relative
      reclaim cost of the two LRUs.  This will have the VM target list balances
      that incur the least amount of IO on aggregate.
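
      Roughly how the scan balance then derives from the accumulated IO
      costs; a sketch of the get_scan_count() arithmetic, with names and the
      damping treated as approximate:

        /* Relative cost of each LRU, dampened so neither list starves. */
        total_cost = sc->anon_cost + sc->file_cost;
        anon_cost  = total_cost + sc->anon_cost;
        file_cost  = total_cost + sc->file_cost;
        total_cost = anon_cost + file_cost;

        ap = swappiness * (total_cost + 1);
        ap /= anon_cost + 1;

        fp = (200 - swappiness) * (total_cost + 1);
        fp /= file_cost + 1;
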
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: let cache workingset challenge anon · 34e58cac
      Committed by Johannes Weiner
      We activate cache refaults with reuse distances in pages smaller than the
      size of the total cache.  This allows new pages with competitive access
      frequencies to establish themselves, as well as challenge and potentially
      displace pages on the active list that have gone cold.
      
      However, that assumes that active cache can only replace other active
      cache in a competition for the hottest memory.  This is not a great
      default assumption.  The page cache might be thrashing while there are
      enough completely cold and unused anonymous pages sitting around that we'd
      only have to write to swap once to stop all IO from the cache.
      
      Activate cache refaults when their reuse distance in pages is smaller than
      the total userspace workingset, including anonymous pages.
      
      Reclaim can still decide how to balance pressure among the two LRUs
      depending on the IO situation.  Rotational drives will prefer avoiding
      random IO from swap and go harder after cache.  But fundamentally, hot
      cache should be able to compete with anon pages for a place in RAM.
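
      A sketch of the new activation test in the refault path (stat and
      helper names approximate; anonymous pages only count toward the
      workingset when swap space is actually available):

        workingset_size = lruvec_page_state(lruvec, NR_ACTIVE_FILE);
        if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
                workingset_size += lruvec_page_state(lruvec, NR_INACTIVE_ANON);
                workingset_size += lruvec_page_state(lruvec, NR_ACTIVE_ANON);
        }
        if (refault_distance <= workingset_size)
                SetPageActive(page);    /* cache gets to challenge anon */
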
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 02 Dec, 2019 (2 commits)
    • mm: vmscan: detect file thrashing at the reclaim root · b910718a
      Committed by Johannes Weiner
      We use refault information to determine whether the cache workingset is
      stable or transitioning, and dynamically adjust the inactive:active file
      LRU ratio so as to maximize protection from one-off cache during stable
      periods, and minimize IO during transitions.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, refaults only affect the
      local LRU order in the cgroup in which they are occurring.  As a result,
      cache transitions can take longer in a cgrouped system as the active pages
      of sibling cgroups aren't challenged when they should be.
      
      [ Right now, this is somewhat theoretical, because the siblings, under
        continued regular reclaim pressure, should eventually run out of
        inactive pages - and since inactive:active *size* balancing is also
        done on a cgroup-local level, we will challenge the active pages
        eventually in most cases. But the next patch will move that relative
        size enforcement to the reclaim root as well, and then this patch
        here will be necessary to propagate refault pressure to siblings. ]
      
      This patch moves refault detection to the root of reclaim.  Instead of
      remembering the cgroup owner of an evicted page, remember the cgroup that
      caused the reclaim to happen.  When refaults later occur, they'll
      correctly influence the cross-cgroup LRU order that reclaim follows.
      
      I.e.  if global reclaim kicked out pages in some subgroup A/B/C, the
      refault of those pages will challenge the global LRU order, and not just
      the local order down inside C.
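
      A simplified sketch of the eviction side: the shadow entry records the
      cgroup that is doing the reclaiming rather than the page's own cgroup.
      pack_shadow() mirrors the workingset.c helper; advance_nonresident_age()
      is a made-up stand-in for the lruvec eviction clock:

        void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
        {
                struct pglist_data *pgdat = page_pgdat(page);
                struct lruvec *lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
                /* Bump the eviction clock of the reclaim root, not the owner. */
                unsigned long eviction = advance_nonresident_age(lruvec);

                return pack_shadow(mem_cgroup_id(target_memcg), pgdat, eviction);
        }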
      
      [hannes@cmpxchg.org:  use page_memcg() instead of another lookup]
        Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
      Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clean up and clarify lruvec lookup procedure · 867e5e1d
      Committed by Johannes Weiner
      There is a per-memcg lruvec and a NUMA node lruvec.  Which one is being
      used is somewhat confusing right now, and it's easy to make mistakes -
      especially when it comes to global reclaim.
      
      How it works: when memory cgroups are enabled, we always use the
      root_mem_cgroup's per-node lruvecs.  When memory cgroups are not compiled
      in or disabled at runtime, we use pgdat->lruvec.
      
      Document that in a comment.
      
      Due to the way the reclaim code is generalized, all lookups use the
      mem_cgroup_lruvec() helper function, and nobody should have to find the
      right lruvec manually right now.  But to avoid future mistakes, rename the
      pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper
      that suggests it's a commonly accessed member.
      
      While in this area, swap the mem_cgroup_lruvec() argument order.  The name
      suggests a memcg operation, yet it takes a pgdat first and a memcg second.
      I have to double take every time I call this.  Fix that.
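
      The resulting lookup, sketched: memcg argument first, node second, and
      the renamed pgdat->__lruvec as the fallback when memcg is disabled
      (details such as lruvec->pgdat maintenance are left out):

        static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
                                                       struct pglist_data *pgdat)
        {
                if (mem_cgroup_disabled())
                        return &pgdat->__lruvec;

                if (!memcg)
                        memcg = root_mem_cgroup;        /* global reclaim */

                return &memcg->nodeinfo[pgdat->node_id]->lruvec;
        }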
      
      Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 14 Aug, 2019 (1 commit)
  19. 15 May, 2019 (2 commits)
    • mm: memcontrol: make cgroup stats and events query API explicitly local · 205b20cc
      Committed by Johannes Weiner
      Patch series "mm: memcontrol: memory.stat cost & correctness".
      
      The cgroup memory.stat file holds recursive statistics for the entire
      subtree.  The current implementation does this tree walk on-demand
      whenever the file is read.  This is giving us problems in production.
      
      1. The cost of aggregating the statistics on-demand is high.  A lot of
         system service cgroups are mostly idle and their stats don't change
         between reads, yet we always have to check them.  There are also always
         some lazily-dying cgroups sitting around that are pinned by a handful
         of remaining page cache; the same applies to them.
      
         In an application that periodically monitors memory.stat in our
         fleet, we have seen the aggregation consume up to 5% CPU time.
      
      2. When cgroups die and disappear from the cgroup tree, so do their
         accumulated vm events.  The result is that the event counters at
         higher-level cgroups can go backwards and confuse some of our
         automation, let alone people looking at the graphs over time.
      
      To address both issues, this patch series changes the stat
      implementation to spill counts upwards when the counters change.
      
      The upward spilling is batched using the existing per-cpu cache.  In a
      sparse file stress test with 5 level cgroup nesting, the additional cost
      of the flushing was negligible (a little under 1% of CPU at 100% CPU
      utilization, compared to the 5% of reading memory.stat during regular
      operation).
      
      This patch (of 4):
      
      memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
      currently returning the state of the local memcg or lruvec, not the
      recursive state.
      
      In practice there is a demand for both versions, although the callers
      that want the recursive counts currently sum them up by hand.
      
      By default, cgroups are considered recursive entities, and generally we
      expect more users of the recursive counters, with the local counts being
      special cases.  To reflect that in the name, add a _local suffix to the
      current implementations.
      
      The following patch will re-incarnate these functions with recursive
      semantics, but with an O(1) implementation.
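
      A signature-level sketch of the split described above; the recursive
      readers shown second are the ones the follow-up patch re-adds with an
      O(1) implementation:

        /* Local counts (this cgroup only), as renamed by this patch: */
        unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
        unsigned long lruvec_page_state_local(struct lruvec *lruvec,
                                              enum node_stat_item idx);

        /* Recursive counts (cgroup plus descendants): */
        unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
        unsigned long lruvec_page_state(struct lruvec *lruvec,
                                        enum node_stat_item idx);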
      
      [hannes@cmpxchg.org: fix bisection hole]
        Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
      Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: push down mem_cgroup_node_nr_lru_pages() · 2b487e59
      Committed by Johannes Weiner
      mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
      lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
      counts for those.
      
      Replace callsites where the bitmask is simple enough with direct
      lruvec_page_state() calls.
      
      This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
      make that function private again, too.
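
      A before/after sketch of the kind of call site this converts, assuming
      lruvec is the memcg's lruvec for the node in question (actual call
      sites vary):

        /* Before: aggregate through the LRU-bitmask convenience wrapper. */
        nr = mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(LRU_ACTIVE_FILE));

        /* After: read the per-lruvec node stat directly. */
        nr = lruvec_page_state(lruvec, NR_ACTIVE_FILE);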
      
      Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 06 Mar, 2019 (1 commit)
  21. 29 Dec, 2018 (1 commit)
  22. 27 Oct, 2018 (5 commits)
    • mm: zero-seek shrinkers · 4b85afbd
      Committed by Johannes Weiner
      The page cache and most shrinkable slab caches hold data that has been
      read from disk, but there are some caches that only cache CPU work, such
      as the dentry and inode caches of procfs and sysfs, as well as the subset
      of radix tree nodes that track non-resident page cache.
      
      Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
      the shrinker's seeks setting tells the reclaim algorithm that for every
      two page cache pages scanned it should scan one slab object.
      
      This is a bogus setting.  A virtual inode that required no IO to create is
      not twice as valuable as a page cache page; shadow cache entries with
      eviction distances beyond the size of memory aren't either.
      
      In most cases, the behavior in practice is still fine.  Such virtual
      caches don't tend to grow and assert themselves aggressively, and usually
      get picked up before they cause problems.  But there are scenarios where
      that's not true.
      
      Our database workloads suffer from two of those.  For one, their file
      workingset is several times bigger than available memory, which has the
      kernel aggressively create shadow page cache entries for the non-resident
      parts of it.  The workingset code does tell the VM that most of these are
      expendable, but the VM ends up balancing them 2:1 to cache pages as per
      the seeks setting.  This is a huge waste of memory.
      
      These workloads also deal with tens of thousands of open files and use
      /proc for introspection, which ends up growing the proc_inode_cache to
      absurdly large sizes - again at the cost of valuable cache space, which
      isn't a reasonable trade-off, given that proc inodes can be re-created
      without involving the disk.
      
      This patch implements a "zero-seek" setting for shrinkers that results in
      a target ratio of 0:1 between their objects and IO-backed caches.  This
      allows such virtual caches to grow when memory is available (they do
      cache/avoid CPU work after all), but effectively disables them as soon as
      IO-backed objects are under pressure.
      
      It then switches the shrinkers for procfs and sysfs metadata, as well as
      excess page cache shadow nodes, to the new zero-seek setting.
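
      What the setting looks like at a shrinker registration site; the
      shrinker and callback names here are made up, while the patch itself
      flips the procfs/sysfs superblock shrinkers and the shadow-node
      shrinker:

        static struct shrinker virtual_cache_shrinker = {
                .count_objects  = virtual_cache_count,
                .scan_objects   = virtual_cache_scan,
                /*
                 * Zero-seek: these objects cost no IO to recreate, so give
                 * them a 0:1 ratio against IO-backed caches and let them be
                 * reclaimed first whenever those caches are under pressure.
                 */
                .seeks          = 0,
        };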
      
      Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Domas Mituzas <dmituzas@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: add vmstat counter for shadow nodes · 68d48e6a
      Committed by Johannes Weiner
      Make it easier to catch bugs in the shadow node shrinker by adding a
      counter for the shadow nodes in circulation.
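
      A sketch of the bookkeeping, assuming the counter is the per-node
      WORKINGSET_NODES stat and is adjusted as shadow nodes enter and leave
      the shrinker LRU (becomes_tracked is a made-up condition standing in
      for the real node state checks):

        if (becomes_tracked) {
                list_lru_add(&shadow_nodes, &node->private_list);
                __inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODES);
        } else {
                list_lru_del(&shadow_nodes, &node->private_list);
                __dec_lruvec_page_state(virt_to_page(node), WORKINGSET_NODES);
        }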
      
      [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
      [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
      Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: use cheaper __inc_lruvec_state in irqsafe node reclaim · 505802a5
      Committed by Johannes Weiner
      No need to use the preemption-safe lruvec state function inside the
      reclaim region that has irqs disabled.
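
      The change is essentially the double-underscore prefix, along these
      lines (a diff-style sketch; the counter shown is illustrative, the
      point is that the call already runs with interrupts disabled):

        -       inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
        +       __inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);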
      
      Link: http://lkml.kernel.org/r/20181009184732.762-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: tell cache transitions from workingset thrashing · 1899ad18
      Committed by Johannes Weiner
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bona fide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
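
      A sketch of how the flag feeds refault classification; pack_shadow()
      and unpack_shadow() stand for the workingset.c shadow-entry helpers,
      and WORKINGSET_RESTORE is the thrashing-side statistic:

        /* On eviction: remember whether the page had ever been active. */
        shadow = pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));

        /* On refault: a set bit means thrashing rather than a transition. */
        unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
        if (workingset) {
                SetPageWorkingset(page);
                inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
        }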
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: don't drop refault information prematurely · 95f9ab2d
      Committed by Johannes Weiner
      Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
      
      		Overview
      
      PSI reports the overall wallclock time in which the tasks in a system (or
      cgroup) wait for (contended) hardware resources.
      
      This helps users understand the resource pressure their workloads are
      under, which allows them to rootcause and fix throughput and latency
      problems caused by overcommitting, underprovisioning, suboptimal job
      placement in a grid; as well as anticipate major disruptions like OOM.
      
      		Real-world applications
      
      We're using the data collected by PSI (and its previous incarnation,
      memdelay) quite extensively at Facebook, and with several success stories.
      
      One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
      because the OOM killer is triggered by reclaim not being able to free
      pages, but with fast flash devices there is *always* some clean and
      uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
      spend 90% of the time thrashing the cache pages of their own executables.
      There is no situation where this ever makes sense in practice.  We wrote a
      <100 line POC python script to monitor memory pressure and kill stuff way
      before such pathological thrashing leads to full system losses that would
      require forcible hard resets.
      
      We've since extended and deployed this code into other places to guarantee
      latency and throughput SLAs, since they're usually violated way before the
      kernel OOM killer would ever kick in.
      
      It is available here: https://github.com/facebookincubator/oomd
      
      Eventually we probably want to trigger the in-kernel OOM killer based on
      extreme sustained pressure as well, so that Linux can avoid memory
      livelocks - which technically aren't deadlocks, but to the user
      indistinguishable from them - out of the box.  We'd continue using OOMD as
      the first line of defense to ensure workload health and implement complex
      kill policies that are beyond the scope of the kernel.
      
      We also use PSI memory pressure for loadshedding.  Our batch job
      infrastructure used to use heuristics based on various VM stats to
      anticipate OOM situations, with lackluster success.  We switched it to PSI
      and managed to anticipate and avoid OOM kills and lockups fairly reliably.
      The reduction of OOM outages in the worker pool raised the pool's
      aggregate productivity, and we were able to switch that service to smaller
      machines.
      
      Lastly, we use cgroups to isolate a machine's main workload from
      maintenance crap like package upgrades, logging, configuration, as well as
      to prevent multiple workloads on a machine from stepping on each others'
      toes.  We were not able to configure this properly without the pressure
      metrics; we would see latency or bandwidth drops, but it would often be
      hard to impossible to rootcause it post-mortem.
      
      We now log and graph pressure for the containers in our fleet and can
      trivially link latency spikes and throughput drops to shortages of
      specific resources after the fact, and fix the job config/scheduling.
      
      PSI has also received testing, feedback, and feature requests from Android
      and EndlessOS for the purpose of low-latency OOM killing, to intervene in
      pressure situations before the UI starts hanging.
      
      		How do you use this feature?
      
      A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
      files: cpu, memory, and io.  If using cgroup2, cgroups will also have
      cpu.pressure, memory.pressure and io.pressure files, which simply
      aggregate task stalls at the cgroup level instead of system-wide.
      
      The cpu file contains one line:
      
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      
      The averages give the percentage of walltime in which one or more tasks
      are delayed on the runqueue while another task has the CPU.  They're
      recent averages over 10s, 1m, 5m windows, so you can tell short term
      trends from long term ones, similarly to the load average.
      
      The total= value gives the absolute stall time in microseconds.  This
      allows detecting latency spikes that might be too short to sway the
      running averages.  It also allows custom time averaging in case the
      10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
      future hardware).
      
      What to make of this "some" metric?  If CPU utilization is at 100% and CPU
      pressure is 0, it means the system is perfectly utilized, with one
      runnable thread per CPU and nobody waiting.  At two or more runnable tasks
      per CPU, the system is 100% overcommitted and the pressure average will
      indicate as much.  From a utilization perspective this is a great state of
      course: no CPU cycles are being wasted, even when 50% of the threads were
      to go idle (as most workloads do vary).  From the perspective of the
      individual job it's not great, however, and they would do better with more
      resources.  Depending on what your priority and options are, raised "some"
      numbers may or may not require action.
      
      The memory file contains two lines:
      
      some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
      full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
      
      The some line is the same as for cpu, the time in which at least one task
      is stalled on the resource.  In the case of memory, this includes waiting
      on swap-in, page cache refaults and page reclaim.
      
      The full line, however, indicates time in which *nobody* is using the CPU
      productively due to pressure: all non-idle tasks are waiting for memory in
      one form or another.  Significant time spent in there is a good trigger
      for killing things, moving jobs to other machines, or dropping incoming
      requests, since neither the jobs nor the machine overall are making too
      much headway.
      
      The io file is similar to memory.  Because the block layer doesn't have a
      concept of hardware contention right now (how much longer is my IO request
      taking due to other tasks?), it reports CPU potential lost on all IO
      delays, not just the potential lost due to competition.
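
      For completeness, a small userspace sketch of consuming these files
      (path and line format as documented above; the threshold and error
      handling are illustrative only):

        #include <stdio.h>

        int main(void)
        {
                char line[256];
                FILE *f = fopen("/proc/pressure/memory", "r");

                if (!f)
                        return 1;       /* e.g. kernel built without CONFIG_PSI */

                while (fgets(line, sizeof(line), f)) {
                        float avg10;

                        /* Lines look like: "some avg10=0.22 avg60=0.17 ..." */
                        if (sscanf(line, "some avg10=%f", &avg10) == 1 && avg10 > 10.0f)
                                fprintf(stderr, "memory pressure high: %s", line);
                }
                fclose(f);
                return 0;
        }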
      
      		FAQ
      
      Q: How is PSI's CPU component different from the load average?
      
      A: There are several quirks in the load average that make it hard to
         impossible to tell how overcommitted the CPU really is.
      
         1. The load average is reported as a raw number of active tasks.
            You need to know how many CPUs there are in the system, how many
            CPUs the workload is allowed to use, then think about what the
            proportion between load and the number of CPUs mean for the
            tasks trying to run.
      
            PSI reports the percentage of wallclock time in which tasks are
            waiting for a CPU to run on. It doesn't matter how many CPUs are
            present or usable. The number always tells the quality of life
            of tasks in the system or in a particular cgroup.
      
         2. The shortest averaging window is 1m, which is extremely coarse,
            and it's sampled in 5s intervals. A *lot* can happen on a CPU in
            5 seconds. This *may* be able to identify persistent long-term
            trends and very clear and obvious overloads, but it's unusable
            for latency spikes and more subtle overutilization.
      
            PSI's shortest window is 10s. It also exports the cumulative
            stall times (in microseconds) of synchronously recorded events.
      
         3. On Linux, the load average for historical reasons includes all
            TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
            busy the system is, but on the flipside it doesn't distinguish
            whether tasks are likely to contend over the CPU or IO - which
            obviously requires very different interventions from a sys admin
            or a job scheduler.
      
            PSI reports independent metrics for CPU and IO. You can tell
            which resource is making the tasks wait, but in conjunction
            still see how overloaded the system is overall.
      
      Q: What's the cost / performance impact of this feature?
      
      A: PSI's primary cost is in the scheduler, in particular task wakeups
         and sleeps.
      
         I benchmarked this code using Facebook's two most scheduling
         sensitive workloads: memcache and webserver. They handle a ton of
         small requests - lots of wakeups and sleeps with little actual work
         in between - so they tend to be canaries for scheduler regressions.
      
         In the tests, the boxes were handling live traffic over the course
         of several hours. Half the machines, the control, ran with
         CONFIG_PSI=n.
      
         For memcache I used eight machines total. They're 2-socket, 14
         core, 56 thread boxes. The test runs for half the test period,
         flips the test and control kernels on the hardware to rule out HW
         factors, DC location etc., then runs the other half of the test.
      
         For the webservers, I used 32 machines total. They're single
         socket, 16 core, 32 thread machines.
      
         During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
         the first half and nopsi=77.52% psi=78.25%, so PSI added between
         0.7 and 0.9 percentage points to the CPU load, a difference of
         about 1%.
      
         UPDATE: I re-ran this test with the v3 version of this patch set
         and the CPU utilization was equivalent between test and control.
      
         UPDATE: v4 is on par with v3.
      
         As far as end-to-end request latency from the client perspective
         goes, we don't sample those finely enough to capture the requests
         going to those particular machines during the test, but we know the
         p50 turnaround time in this workload is 54us, and perf bench sched
         pipe on those machines show nopsi=5.232666 us/op and psi=5.587347
         us/op, so this doesn't add much here either.
      
         The profile for the pipe benchmark shows:
      
              0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
              0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
              0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
              0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change
      
         The webserver load is running inside 4 nested cgroup levels. The
         CPU load with both nopsi and psi kernels was indistinguishable at
         81%.
      
         For comparison, we had to disable the cgroup cpu controller on the
         webservers because it added 4 percentage points to the CPU% during
         this same exact test.
      
         Versions of this accounting code now run on 80% of our fleet. None
         of our workloads have reported regressions during the rollout.
      
      Daniel Drake said:
      
      : I just retested the latest version at
      : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
      : are great.
      :
      : Test setup:
      : Endless OS
      : GeminiLake N4200 low end laptop
      : 2GB RAM
      : swap (and zram swap) disabled
      :
      : Baseline test: open a handful of large-ish apps and several website
      : tabs in Google Chrome.
      :
      : Results: after a couple of minutes, system is excessively thrashing, mouse
      : cursor can barely be moved, UI is not responding to mouse clicks, so it's
      : impractical to recover from this situation as an ordinary user
      :
      : Add my simple killer:
      : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
      :
      : Results: when the thrashing causes the UI to become sluggish, the killer
      : steps in and kills something (usually a chrome tab), and the system
      : remains usable.  I repeatedly opened more apps and more websites over a 15
      : minute period but I wasn't able to get the system to a point of UI
      : unresponsiveness.
      
      Suren said:
      
      : Backported to 4.9 and retested on an ARMv8 8-core system running Android.
      : Signals behave as expected reacting to memory pressure, no jumps in
      : "total" counters that would indicate overflow/underflow issues.  Nicely
      : done!
      
      This patch (of 9):
      
      If we keep just enough refault information to match the *current* page
      cache during reclaim time, we could lose a lot of events when there is
      only a temporary spike in non-cache memory consumption that pushes out all
      the cache.  Once cache comes back, we won't see those refaults.  They
      might not be actionable for LRU aging, but we want to know about them for
      measuring memory pressure.
      
      [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
        Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <jweiner@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 21 Oct, 2018 (2 commits)
  24. 30 Sep, 2018 (1 commit)
    • xarray: Replace exceptional entries · 3159f943
      Committed by Matthew Wilcox
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
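
      A sketch of the value-entry helpers this introduces (from
      <linux/xarray.h>), here storing and recovering a small integer such as
      an eviction distance:

        void *entry = xa_mk_value(distance);    /* tag an integer as a value entry */

        if (xa_is_value(entry))                 /* tell values from pointers */
                distance = xa_to_value(entry);  /* up to BITS_PER_LONG - 1 bits */
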
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
  25. 18 Aug, 2018 (4 commits)