提交 · df442a4ec740864ff65cbfa7e71669603ddbe2af · openeuler / Kernel

11 12月, 2021 1 次提交

mm/memcg: relocate mod_objcg_mlstate(), get_obj_stock() and put_obj_stock() · a7ebf564

由 Waiman Long 提交于 12月 10, 2021

All the calls to mod_objcg_mlstate(), get_obj_stock() and
put_obj_stock() are done by functions defined within the same "#ifdef
CONFIG_MEMCG_KMEM" compilation block.  When CONFIG_MEMCG_KMEM isn't
defined, the following compilation warnings will be issued [1] and [2].

  mm/memcontrol.c:785:20: warning: unused function 'mod_objcg_mlstate'
  mm/memcontrol.c:2113:33: warning: unused function 'get_obj_stock'

Fix these warning by moving those functions to under the same
CONFIG_MEMCG_KMEM compilation block.  There is no functional change.

[1] https://lore.kernel.org/lkml/202111272014.WOYNLUV6-lkp@intel.com/
[2] https://lore.kernel.org/lkml/202111280551.LXsWYt1T-lkp@intel.com/

Link: https://lkml.kernel.org/r/20211129161140.306488-1-longman@redhat.com
Fixes: 55927114 ("mm/memcg: optimize user context object stock access")
Fixes: 68ac5b3c ("mm/memcg: cache vmstat data in percpu memcg_stock_pcp")
Signed-off-by: NWaiman Long <longman@redhat.com>
Reported-by: Nkernel test robot <lkp@intel.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a7ebf564

17 11月, 2021 1 次提交
- M
  mm: Rename folio_test_multi to folio_test_large · 9c325215
  由 Matthew Wilcox (Oracle) 提交于 11月 16, 2021
```
This is a better name.  Also add kernel-doc.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
```
  9c325215
12 11月, 2021 2 次提交

mm: unexport {,un}lock_page_memcg · ab2f9d2d

由 Christoph Hellwig 提交于 11月 10, 2021

These are only used in built-in core mm code.

Link: https://lkml.kernel.org/r/20210820095815.445392-3-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ab2f9d2d

mm: unexport folio_memcg_{,un}lock · 913ffbdd

由 Christoph Hellwig 提交于 11月 10, 2021

Patch series "unexport memcg locking helpers".

Neither the old page-based nor the new folio-based memcg locking helpers
are used in modular code at all, so drop the exports.

This patch (of 2):

folio_memcg_{,un}lock are only used in built-in core mm code.

Link: https://lkml.kernel.org/r/20210820095815.445392-1-hch@lst.de
Link: https://lkml.kernel.org/r/20210820095815.445392-2-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

913ffbdd

07 11月, 2021 9 次提交

mm/vmscan: throttle reclaim when no progress is being made · 69392a40

由 Mel Gorman 提交于 11月 05, 2021

Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of other
factors.

For !memcg reclaim, it's messy.  Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress.  Kswapd
throttles if too many pages are under writeback and marked for immediate
reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

[vbabka@suse.cz: Remove redundant code]

Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

69392a40

memcg: prohibit unconditional exceeding the limit of dying tasks · a4ebf1b6

由 Vasily Averin 提交于 11月 05, 2021

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit.  It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting.  This is resembling a heuristic for the global OOM situation
when tasks get access to memory reserves.  There is no global memory
shortage at the memcg level so the memcg heuristic is more relieved.

The above assumption is overly optimistic though.  E.g.  vmalloc can
scale to really large requests and the heuristic would allow that.  We
used to have an early break in the vmalloc allocator for killed tasks
but this has been reverted by commit b8c8a338 ("Revert "vmalloc:
back off when the current task is killed"").  There are likely other
similar code paths which do not check for fatal signals in an
allocation&charge loop.  Also there are some kernel objects charged to a
memcg which are not bound to a process life time.

It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves).  This
is certainly possible but it is not really clear how much of an excess
is desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether.  Bypass is only allowed for requests which either cannot
fail or where the failure is not desirable while excess should be still
limited (e.g.  atomic requests).  Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage.  That should
give all forms of reclaim chance to restore the limit before the failure
(ENOMEM) and tell the caller to back off.

In addition, this patch renames should_force_charge() helper to
task_is_dying() because now its use is not associated witch forced
charging.

This patch depends on pagefault_out_of_memory() to not trigger
out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
and cause a global OOM killer.

Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com>
Suggested-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a4ebf1b6

mm: memcontrol: remove the kmem states · e80216d9

由 Muchun Song 提交于 11月 05, 2021

Now the kmem states is only used to indicate whether the kmem is
offline.  However, we can set ->kmemcg_id to -1 to indicate whether the
kmem is offline.  Finally, we can remove the kmem states to simplify the
code.

Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e80216d9

mm: memcontrol: remove kmemcg_id reparenting · 64268868

由 Muchun Song 提交于 11月 05, 2021

Since slab objects and kmem pages are charged to object cgroup instead
of memory cgroup, memcg_reparent_objcgs() will reparent this cgroup and
all its descendants to its parent cgroup. This already makes further
list_lru_add()'s add elements to the parent's list. So it is
unnecessary to change kmemcg_id of an offline cgroup to its parent's id.
It just wastes CPU cycles. Just remove the redundant code.

Link: https://lkml.kernel.org/r/20211025125102.56533-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

64268868

memcg, kmem: further deprecate kmem.limit_in_bytes · 58056f77

由 Shakeel Butt 提交于 11月 05, 2021

The deprecation process of kmem.limit_in_bytes started with the commit
0158115f ("memcg, kmem: deprecate kmem.limit_in_bytes") which also
explains in detail the motivation behind the deprecation.  To summarize,
it is the unexpected behavior on hitting the kmem limit.  This patch
moves the deprecation process to the next stage by disallowing to set
the kmem limit.  In future we might just remove the kmem.limit_in_bytes
file completely.

[akpm@linux-foundation.org: s/ENOTSUPP/EOPNOTSUPP/]
[arnd@arndb.de: mark cancel_charge() inline]
  Link: https://lkml.kernel.org/r/20211022070542.679839-1-arnd@kernel.org

Link: https://lkml.kernel.org/r/20211019153408.2916808-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NRoman Gushchin <guro@fb.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

58056f77

mm/memcg: remove obsolete memcg_free_kmem() · 38d4ef44

由 Waiman Long 提交于 11月 05, 2021

Since commit d648bcc7 ("mm: kmem: make memcg_kmem_enabled()
irreversible"), the only thing memcg_free_kmem() does is to call
memcg_offline_kmem() when the memcg is still online which can happen
when online_css() fails due to -ENOMEM.

However, the name memcg_free_kmem() is confusing and it is more clear
and straight forward to call memcg_offline_kmem() directly from
mem_cgroup_css_free().

Link: https://lkml.kernel.org/r/20211005202450.11775-1-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
Suggested-by: NRoman Gushchin <guro@fb.com>
Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NRoman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

38d4ef44

memcg: unify memcg stat flushing · fd25a9e0

由 Shakeel Butt 提交于 11月 05, 2021

The memcg stats can be flushed in multiple context and potentially in
parallel too.  For example multiple parallel user space readers for
memcg stats will contend on the rstat locks with each other.  There is
no need for that.  We just need one flusher and everyone else can
benefit.

In addition after aa48e47e ("memcg: infrastructure to flush memcg
stats") the kernel periodically flush the memcg stats from the root, so,
the other flushers will potentially have much less work to do.

Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fd25a9e0

memcg: flush stats only if updated · 11192d9c

由 Shakeel Butt 提交于 11月 05, 2021

At the moment, the kernel flushes the memcg stats on every refault and
also on every reclaim iteration. Although rstat maintains per-cpu
update tree but on the flush the kernel still has to go through all the
cpu rstat update tree to check if there is anything to flush. This
patch adds the tracking on the stats update side to make flush side more
clever by skipping the flush if there is no update.

The stats update codepath is very sensitive performance wise for many
workloads and benchmarks. So, we can not follow what the commit
aa48e47e ("memcg: infrastructure to flush memcg stats") did which
was triggering async flush through queue_work() and caused a lot
performance regression reports. That got reverted by the commit
1f828223 ("memcg: flush lruvec stats in the refault").

In this patch we kept the stats update codepath very minimal and let the
stats reader side to flush the stats only when the updates are over a
specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).

To evaluate the impact of this patch, an 8 GiB tmpfs file is created on
a system with swap-on-zram and the file was pushed to swap through
memory.force_empty interface. On reading the whole file, the memcg stat
flush in the refault code path is triggered. With this patch, we
observed 63% reduction in the read time of 8 GiB file.

Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: N"Michal Koutný" <mkoutny@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

11192d9c

mm/memcg: drop swp_entry_t* in mc_handle_file_pte() · 48384b0b

由 Peter Xu 提交于 11月 05, 2021

It is unused after the rework of commit f5df8635 ("mm: use
find_get_incore_page in memcontrol").

Link: https://lkml.kernel.org/r/20210916193014.80129-1-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

48384b0b

27 9月, 2021 15 次提交

mm/memcg: Add folio_lruvec_lock() and similar functions · e809c3fe

由 Matthew Wilcox (Oracle) 提交于 6月 28, 2021

These are the folio equivalents of lock_page_lruvec() and similar
functions.  Also convert lruvec_memcg_debug() to take a folio.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

e809c3fe

mm/memcg: Add folio_lruvec() · b1baabd9

由 Matthew Wilcox (Oracle) 提交于 6月 28, 2021

This replaces mem_cgroup_page_lruvec().  All callers converted.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMike Rapoport <rppt@linux.ibm.com>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

b1baabd9

mm/memcg: Convert mem_cgroup_move_account() to use a folio · fcce4672

由 Matthew Wilcox (Oracle) 提交于 3月 01, 2021

This saves dozens of bytes of text by eliminating a lot of calls to
compound_head().
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

fcce4672

mm/memcg: Add folio_memcg_lock() and folio_memcg_unlock() · f70ad448

由 Matthew Wilcox (Oracle) 提交于 6月 28, 2021

These are the folio equivalents of lock_page_memcg() and
unlock_page_memcg().

lock_page_memcg() and unlock_page_memcg() have too many callers to be
easily replaced in a single patch, so reimplement them as wrappers for
now to be cleaned up later when enough callers have been converted to
use folios.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMike Rapoport <rppt@linux.ibm.com>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

f70ad448

mm/memcg: Convert mem_cgroup_track_foreign_dirty_slowpath() to folio · 9d8053fc

由 Matthew Wilcox (Oracle) 提交于 5月 04, 2021

The page was only being used for the memcg and to gather trace
information, so this is a simple conversion.  The only caller of
mem_cgroup_track_foreign_dirty() will be converted to folios in a later
patch, so doing this now makes that patch simpler.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

9d8053fc

mm/memcg: Convert mem_cgroup_migrate() to take folios · d21bba2b

由 Matthew Wilcox (Oracle) 提交于 5月 06, 2021

Convert all callers of mem_cgroup_migrate() to call page_folio() first.
They all look like they're using head pages already, but this proves it.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMike Rapoport <rppt@linux.ibm.com>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

d21bba2b

mm/memcg: Convert mem_cgroup_uncharge() to take a folio · bbc6b703

由 Matthew Wilcox (Oracle) 提交于 5月 01, 2021

Convert all the callers to call page_folio().  Most of them were already
using a head page, but a few of them I can't prove were, so this may
actually fix a bug.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMike Rapoport <rppt@linux.ibm.com>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

bbc6b703

mm/memcg: Convert uncharge_page() to uncharge_folio() · c4ed6ebf

由 Matthew Wilcox (Oracle) 提交于 6月 29, 2021

Use a folio rather than a page to ensure that we're only operating on
base or head pages, and not tail pages.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

c4ed6ebf

mm/memcg: Convert mem_cgroup_charge() to take a folio · 8f425e4e

由 Matthew Wilcox (Oracle) 提交于 6月 25, 2021

Convert all callers of mem_cgroup_charge() to call page_folio() on the
page they're currently passing in.  Many of them will be converted to
use folios themselves soon.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

8f425e4e

mm/memcg: Convert commit_charge() to take a folio · 118f2875

由 Matthew Wilcox (Oracle) 提交于 4月 29, 2021

The memcg_data is only set on the head page, so enforce that by
typing it as a folio.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

118f2875

mm/memcg: Add folio_memcg() and related functions · 1b7e4464

由 Matthew Wilcox (Oracle) 提交于 6月 28, 2021

memcg information is only stored in the head page, so the memcg
subsystem needs to assure that all accesses are to the head page.
The first step is converting page_memcg() to folio_memcg().

The callers of page_memcg() and PageMemcgKmem() are not yet ready to be
converted to use folios, so retain them as wrappers around folio_memcg()
and folio_memcg_kmem().  They will be converted in a later patch set.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

1b7e4464

mm/memcg: Convert memcg_check_events to take a node ID · 8e88bd2d

由 Matthew Wilcox (Oracle) 提交于 6月 25, 2021

memcg_check_events only uses the page's nid, so call page_to_nid in the
callers to make the interface easier to understand.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

8e88bd2d

mm/memcg: Remove soft_limit_tree_node() · 2ab082ba

由 Matthew Wilcox (Oracle) 提交于 6月 25, 2021

Opencode this one-line function in its three callers.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

2ab082ba

mm/memcg: Use the node id in mem_cgroup_update_tree() · 658b69c9

由 Matthew Wilcox (Oracle) 提交于 4月 29, 2021

By using the node id in mem_cgroup_update_tree(), we can delete
soft_limit_tree_from_page() and mem_cgroup_page_nodeinfo().  Saves 42
bytes of kernel text on my config.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

658b69c9

mm/memcg: Remove 'page' parameter to mem_cgroup_charge_statistics() · 6e0110c2

由 Matthew Wilcox (Oracle) 提交于 4月 29, 2021

The last use of 'page' was removed by commit 468c3982 ("mm:
memcontrol: switch to native NR_ANON_THPS counter"), so we can now remove
the parameter from the function.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>

6e0110c2

24 9月, 2021 1 次提交

memcg: flush lruvec stats in the refault · 1f828223

由 Shakeel Butt 提交于 9月 22, 2021

Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat")
and the commit aa48e47e ("memcg: infrastructure to flush memcg
stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus *
32) at worst and for unbounded amount of time.  The commit aa48e47e
moved the lruvec stats to rstat infrastructure and the commit
7e1c0d6f bounded the error for all the lruvec stats to (nr_cpus *
32) at worst for at most 2 seconds.  More specifically it decoupled the
number of stats and the number of cgroups from the error rate.

However this reduction in error comes with the cost of triggering the
slowpath of stats update more frequently.  Previously in the slowpath
the kernel adds the stats up the memcg tree.  After aa48e47e, the
kernel triggers the asyn lruvec stats flush through queue_work().  This
causes regression reports from 0day kernel bot [1] as well as from
phoronix test suite [2].

We tried two options to fix the regression:

 1) Increase the threshold to trigger the slowpath in lruvec stats
    update codepath from 32 to 512.

 2) Remove the slowpath from lruvec stats update codepath and instead
    flush the stats in the page refault codepath. The assumption is that
    the kernel timely flush the stats, so, the update tree would be
    small in the refault codepath to not cause the preformance impact.

Following are the results of will-it-scale/page_fault[1|2|3] benchmark
on four settings i.e.  (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1
(4) 5.15-rc1 with option-2.

  test       (1)      (2)               (3)               (4)
  pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
  pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
  pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)

From the above result, it seems like the option-2 not only solves the
regression but also improves the performance for at least these
benchmarks.

Feng Tang (intel) ran the aim7 benchmark with these two options and
confirms that option-1 reduces the regression but option-2 removes the
regression.

Michael Larabel (phoronix) ran multiple benchmarks with these options
and reported the results at [3] and it shows for most benchmarks
option-2 removes the regression introduced by the commit aa48e47e
("memcg: infrastructure to flush memcg stats").

Based on the experiment results, this patch proposed the option-2 as the
solution to resolve the regression.

Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats")
Signed-off-by: NShakeel Butt <shakeelb@google.com>
Tested-by: NMichael Larabel <Michael@phoronix.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hillf Danton <hdanton@sina.com>,
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1f828223

04 9月, 2021 11 次提交

mm/vmpressure: replace vmpressure_to_css() with vmpressure_to_memcg() · 9647875b

由 Hui Su 提交于 9月 02, 2021

We can get memcg directly form vmpr instead of vmpr->memcg->css->memcg, so
add a new func helper vmpressure_to_memcg().  And no code will use
vmpressure_to_css(), so delete it.

Link: https://lkml.kernel.org/r/20210630112146.455103-1-suhui@zeku.comSigned-off-by: NHui Su <suhui@zeku.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NChris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9647875b

memcg: make memcg->event_list_lock irqsafe · 4ba9515d

由 Shakeel Butt 提交于 9月 02, 2021

The memcg->event_list_lock is usually taken in the normal context but when
the userspace closes the corresponding eventfd, eventfd_release through
memcg_event_wake takes memcg->event_list_lock with interrupts disabled.
This is not an issue on its own but it creates a nested dependency from
eventfd_ctx->wqh.lock to memcg->event_list_lock.

Independently, for unrelated eventfd, eventfd_signal() can be called in
the irq context, thus making eventfd_ctx->wqh.lock an irq lock. For
example, FPGA DFL driver, VHOST VPDA driver and couple of VFIO drivers.
This will force memcg->event_list_lock to be an irqsafe lock as well.

One way to break the nested dependency between eventfd_ctx->wqh.lock and
memcg->event_list_lock is to add an indirection. However the simplest
solution would be to make memcg->event_list_lock irqsafe. This is cgroup
v1 feature, is in maintenance and may get deprecated in near future. So,
no need to add more code.

BTW this has been discussed previously [1] but there weren't irq users of
eventfd_signal() at the time.

[1] https://www.spinics.net/lists/cgroups/msg06248.html

Link: https://lkml.kernel.org/r/20210830172953.207257-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4ba9515d

memcg: fix up drain_local_stock comment · 5c49cf9a

由 Michal Hocko 提交于 9月 02, 2021

Thomas and Vlastimil have noticed that the comment in drain_local_stock
doesn't quite make sense. It talks about a synchronization with the
memory hotplug but there is no actual memory hotplug involvement here. I
meant to talk about cpu hotplug here. Fix that up and hopefuly make the
comment more helpful by referencing the cpu hotplug callback as well.

Link: https://lkml.kernel.org/r/YRDwOhVglJmY7ES5@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5c49cf9a

mm, memcg: save some atomic ops when flush is already true · 27fb0956

由 Miaohe Lin 提交于 9月 02, 2021

Add 'else' to save some atomic ops in obj_stock_flush_required() when
flush is already true.  No functional change intended here.

Link: https://lkml.kernel.org/r/20210807082835.61281-3-linmiaohe@huawei.comSigned-off-by: NMiaohe Lin <linmiaohe@huawei.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

27fb0956

mm: memcontrol: set the correct memcg swappiness restriction · 37bc3cb9

由 Baolin Wang 提交于 9月 02, 2021

Since commit c843966c ("mm: allow swappiness that prefers reclaiming
anon over the file workingset") has expended the swappiness value to make
swap to be preferred in some systems. We should also change the memcg
swappiness restriction to allow memcg swap-preferred.

Link: https://lkml.kernel.org/r/d77469b90c45c49953ccbc51e54a1d465bc18f70.1627626255.git.baolin.wang@linux.alibaba.com
Fixes: c843966c ("mm: allow swappiness that prefers reclaiming anon over the file workingset")
Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

37bc3cb9

memcg: replace in_interrupt() by !in_task() in active_memcg() · 55a68c82

由 Vasily Averin 提交于 9月 02, 2021

set_active_memcg() uses in_interrupt() check to select proper storage for
cgroup: pointer on task struct or per-cpu pointer.

It isn't fully correct: obsoleted in_interrupt() includes tasks with
disabled BH.  It's better to use '!in_task()' instead.

Link: https://lkml.org/lkml/2021/7/26/487
Link: https://lkml.kernel.org/r/ed4448b0-4970-616f-7368-ef9dd3cb628d@virtuozzo.com
Fixes: 37d5985c ("mm: kmem: prepare remote memcg charging infra for interrupt contexts")
Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

55a68c82

memcg: infrastructure to flush memcg stats · aa48e47e

由 Shakeel Butt 提交于 9月 02, 2021

At the moment memcg stats are read in four contexts:

1. memcg stat user interfaces
2. dirty throttling
3. page fault
4. memory reclaim

Currently the kernel flushes the stats for first two cases.  Flushing the
stats for remaining two casese may have performance impact.  Always
flushing the memcg stats on the page fault code path may negatively
impacts the performance of the applications.  In addition flushing in the
memory reclaim code path, though treated as slowpath, can become the
source of contention for the global lock taken for stat flushing because
when system or memcg is under memory pressure, many tasks may enter the
reclaim path.

This patch uses following mechanisms to solve these challenges:

1. Periodically flush the stats from root memcg every 2 seconds.  This
   will time limit the out of sync stats.

2. Asynchronously flush the stats after fixed number of stat updates.
   In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for
   2 seconds.

3. For avoiding thundering herd to flush the stats particularly from
   the memory reclaim context, introduce memcg local spinlock and let only
   one flusher active at a time.  This could have been done through
   cgroup_rstat_lock lock but that lock is used by other subsystem and for
   userspace reading memcg stats.  So, it is better to keep flushers
   introduced by this patch decoupled from cgroup_rstat_lock.  However we
   would have to use irqsafe version of rstat flush but that is fine as
   this code path will be flushing for whole tree and do the work for
   everyone.  No one will be waiting for that worker.

[shakeelb@google.com: fix sleep-in-wrong context bug]
  Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com

Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aa48e47e

memcg: switch lruvec stats to rstat · 7e1c0d6f

由 Shakeel Butt 提交于 9月 02, 2021

The commit 2d146aa3 ("mm: memcontrol: switch to rstat") switched memcg
stats to rstat infrastructure but skipped the conversion of the lruvec
stats as such stats are read in the performance critical code paths and
flushing stats may have impacted the performances of the applications.
This patch converts the lruvec stats to rstat and later patches add
mechanisms to keep the performance impact to minimum.

The rstat conversion comes with the price i.e. memory cost. Effectively
this patch reverts the savings done by the commit f3344adf ("mm:
memcontrol: optimize per-lruvec stats counter memory usage"). However
this cost is justified due to negative impact of the inaccurate lruvec
stats on many heuristics. One such case is reported in [1].

The memory reclaim code is filled with plethora of heuristics and many of
those heuristics reads the lruvec stats. So, inaccurate stats can make
such heuristics ineffective. [1] reports the impact of inaccurate lruvec
stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can
impact the deactivation and aging anon heuristics as well.

[1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/

Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7e1c0d6f

mm, memcg: inline swap-related functions to improve disabled memcg config · 01c4b28c

由 Suren Baghdasaryan 提交于 9月 02, 2021

Inline mem_cgroup_try_charge_swap, mem_cgroup_uncharge_swap and
cgroup_throttle_swaprate functions to perform mem_cgroup_disabled static
key check inline before calling the main body of the function.  This
minimizes the memcg overhead in the pagefault and exit_mmap paths when
memcgs are disabled using cgroup_disable=memory command-line option.  This
change results in ~1% overhead reduction when running PFT test [1]
comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory}
configuration on an 8-core ARM64 Android device.

[1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite

Link: https://lkml.kernel.org/r/20210713010934.299876-3-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

01c4b28c

mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg config · 2c8d8f97

由 Suren Baghdasaryan 提交于 9月 02, 2021

Inline mem_cgroup_{charge/uncharge} and mem_cgroup_uncharge_list functions
functions to perform mem_cgroup_disabled static key check inline before
calling the main body of the function.  This minimizes the memcg overhead
in the pagefault and exit_mmap paths when memcgs are disabled using
cgroup_disable=memory command-line option.

This change results in ~0.4% overhead reduction when running PFT test [1]
comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory}
configuration on an 8-core ARM64 Android device.

[1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite

Link: https://lkml.kernel.org/r/20210713010934.299876-2-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alexs@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2c8d8f97

mm, memcg: add mem_cgroup_disabled checks in vmpressure and swap-related functions · 56cab285

由 Suren Baghdasaryan 提交于 9月 02, 2021

Add mem_cgroup_disabled check in vmpressure, mem_cgroup_uncharge_swap and
cgroup_throttle_swaprate functions.  This minimizes the memcg overhead in
the pagefault and exit_mmap paths when memcgs are disabled using
cgroup_disable=memory command-line option.

This change results in ~2.1% overhead reduction when running PFT test [1]
comparing {CONFIG_MEMCG=n, CONFIG_MEMCG_SWAP=n} against {CONFIG_MEMCG=y,
CONFIG_MEMCG_SWAP=y, cgroup_disable=memory} configuration on an 8-core
ARM64 Android device.

[1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite

Link: https://lkml.kernel.org/r/20210713010934.299876-1-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

56cab285

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功