提交 15b44736 编写于 作者: H Hugh Dickins 提交者: Linus Torvalds

mm/lru: revise the comments of lru_lock

Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to fix
the incorrect comments in code.  Also fixed some zone->lru_lock comment
error from ancient time.  etc.

I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of
it had got separated from shrink_active_list(): move that comment back.

Link: https://lkml.kernel.org/r/1604566549-62481-20-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
上级 2a5e4e34
......@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
8. LRU
======
Each memcg has its own private LRU. Now, its handling is under global
VM's control (means that it's handled under global pgdat->lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
list management functions under pgdat->lru_lock.
A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
from LRU.
(By __isolate_lru_page(), the page is removed from both of global and
private LRU.)
Each memcg has its own vector of LRUs (inactive anon, active anon,
inactive file, active file, unevictable) of pages from each node,
each LRU handled under a single lru_lock for that memcg and node.
9. Typical Tests.
=================
......
......@@ -287,20 +287,17 @@ When oom event notifier is registered, event will be delivered.
2.6 Locking
-----------
lock_page_cgroup()/unlock_page_cgroup() should not be called under
the i_pages lock.
Lock order is as follows:
Other lock order is following:
Page lock (PG_locked bit of page->flags)
mm->page_table_lock or split pte_lock
lock_page_memcg (memcg->move_lock)
mapping->i_pages lock
lruvec->lru_lock.
PG_locked.
mm->page_table_lock
pgdat->lru_lock
lock_page_cgroup.
In many cases, just lock_page_cgroup() is called.
per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
pgdat->lru_lock, it has no lock of its own.
Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
......
......@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
contention on the zone->lru_lock.
contention on the lruvec->lru_lock.
4. Per-CPU Allocator Activity
=============================
......
......@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.
To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single zone. When a large
main memory will have over 32 million 4k pages in a single node. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
......@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
The Unevictable Page List
-------------------------
The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.
......@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.
The unevictable list benefits from the "arrayification" of the per-zone LRU
The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.
The unevictable list does not use the LRU pagevec mechanism. Rather,
unevictable pages are placed directly on the page's zone's unevictable list
under the zone lru_lock. This allows us to prevent the stranding of pages on
the unevictable list when one task has the page isolated from the LRU and other
tasks are changing the "evictability" state of the page.
Memory Control Group Interaction
--------------------------------
......@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
lru_list enum.
The memory controller data structure automatically gets a per-zone unevictable
list as a result of the "arrayification" of the per-zone LRU lists (one per
The memory controller data structure automatically gets a per-node unevictable
list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.
......@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
unevictable list for the zone being scanned.
unevictable list for the node being scanned.
There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
......@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
back the page - by calling putback_lru_page() - which will notice that the page
is now mlocked and divert the page to the zone's unevictable list. If
is now mlocked and divert the page to the node's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.
......@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
unevictable list in mlock_vma_page().
shrink_inactive_list() also diverts any unevictable pages that it finds on the
inactive lists to the appropriate zone's unevictable list.
inactive lists to the appropriate node's unevictable list.
shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
......
......@@ -79,7 +79,7 @@ struct page {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
* pgdat->lru_lock. Sometimes used as a generic list
* lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
......
......@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
struct pglist_data;
/*
* zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
* So add a wild amount of padding here to ensure that they fall into separate
* Add a wild amount of padding here to ensure datas fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
*/
......
......@@ -102,8 +102,8 @@
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
* ->i_pages lock (try_to_unmap_one)
* ->pgdat->lru_lock (follow_page->mark_page_accessed)
* ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
* ->lruvec->lru_lock (follow_page->mark_page_accessed)
* ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->i_pages lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
......
......@@ -28,12 +28,12 @@
* hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
* mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
* lock_page_memcg move_lock (in __set_page_dirty_buffers)
* i_pages lock (widely used)
* lruvec->lru_lock (in lock_page_lruvec_irq)
* inode->i_lock (in set_page_dirty's __mark_inode_dirty)
* bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
* sb_lock (within inode_lock in fs/fs-writeback.c)
......
......@@ -1613,14 +1613,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
}
/**
* pgdat->lru_lock is heavily contended. Some of the functions that
* Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
*
* lruvec->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
*
* For pagecache intensive workloads, this function is the hottest
* spot in the kernel (apart from copy_*_user functions).
*
* Appropriate locks must be held before calling this function.
* Lru_lock must be held before calling this function.
*
* @nr_to_scan: The number of eligible pages to look through on the list.
* @lruvec: The LRU vector to pull pages from.
......@@ -1814,25 +1816,11 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
}
/*
* This moves pages from @list to corresponding LRU list.
*
* We move them the other way if the page is referenced by one or more
* processes, from rmap.
*
* If the pages are mostly unmapped, the processing is fast and it is
* appropriate to hold zone_lru_lock across the whole operation. But if
* the pages are mapped, the processing is slow (page_referenced()) so we
* should drop zone_lru_lock around each page. It's impossible to balance
* this, so instead we remove the pages from the LRU while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
*
* The downside is that we have to touch page->_refcount against each page.
* But we had to alter page->flags anyway.
* move_pages_to_lru() moves pages from private @list to appropriate LRU list.
* On return, @list is reused as a list of pages to be freed by the caller.
*
* Returns the number of pages moved to the given lruvec.
*/
static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
struct list_head *list)
{
......@@ -2010,6 +1998,23 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return nr_reclaimed;
}
/*
* shrink_active_list() moves pages from the active LRU to the inactive LRU.
*
* We move them the other way if the page is referenced by one or more
* processes.
*
* If the pages are mostly unmapped, the processing is fast and it is
* appropriate to hold lru_lock across the whole operation. But if
* the pages are mapped, the processing is slow (page_referenced()), so
* we should drop lru_lock around each page. It's impossible to balance
* this, so instead we remove the pages from the LRU while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
*
* The downside is that we have to touch page->_refcount against each page.
* But we had to alter page->flags anyway.
*/
static void shrink_active_list(unsigned long nr_to_scan,
struct lruvec *lruvec,
struct scan_control *sc,
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册