commit c55e8d03 authored by Johannes Weiner, committed by Linus Torvalds

mm: vmscan: move dirty pages out of the way until they're flushed

We noticed a performance regression when moving hadoop workloads from
3.10 kernels to 4.0 and 4.6.  This is accompanied by increased pageout
activity initiated by kswapd as well as frequent bursts of allocation
stalls and direct reclaim scans.  Even lowering the dirty ratios to the
equivalent of less than 1% of memory would not eliminate the issue,
suggesting that dirty pages concentrate where the scanner is looking.

This can be traced back to recent efforts of thrash avoidance.  Where
3.10 would not detect refaulting pages and continuously supply clean
cache to the inactive list, a thrashing workload on 4.0+ will detect and
activate refaulting pages right away, distilling used-once pages on the
inactive list much more effectively.  This is by design, and it makes
sense for clean cache.  But for the most part our workload's cache
faults are refaults and its use-once cache is from streaming writes.  We
end up with most of the inactive list dirty, and we don't go after the
active cache as long as we have use-once pages around.

But waiting for writes to avoid reclaiming clean cache that *might*
refault is a bad trade-off.  Even if the refaults happen, reads are
faster than writes.  Before getting bogged down on writeback, reclaim
should first look at *all* cache in the system, even active cache.

To accomplish this, activate pages that are dirty or under writeback
when they reach the end of the inactive LRU.  The pages are marked for
immediate reclaim, meaning they'll get moved back to the inactive LRU
tail as soon as they're written back and become reclaimable.  But in the
meantime, by reducing the inactive list to only immediately reclaimable
pages, we allow the scanner to deactivate and refill the inactive list
with clean cache from the active list tail to guarantee forward
progress.

[hannes@cmpxchg.org: update comment]
  Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 4eda4823
@@ -50,6 +50,13 @@ static __always_inline void add_page_to_lru_list(struct page *page,
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
+static __always_inline void add_page_to_lru_list_tail(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
+{
+	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	list_add_tail(&page->lru, &lruvec->lists[lru]);
+}
+
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
@@ -209,9 +209,10 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
 {
 	int *pgmoved = arg;
 
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &lruvec->lists[lru]);
+	if (PageLRU(page) && !PageUnevictable(page)) {
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		ClearPageActive(page);
+		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
 		(*pgmoved)++;
 	}
 }

@@ -235,7 +236,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
  */
 void rotate_reclaimable_page(struct page *page)
 {
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	if (!PageLocked(page) && !PageDirty(page) &&
 	    !PageUnevictable(page) && PageLRU(page)) {
 		struct pagevec *pvec;
 		unsigned long flags;
@@ -1056,6 +1056,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * throttling so we could easily OOM just because too many
 		 * pages are in writeback and there is nothing else to
 		 * reclaim. Wait for the writeback to complete.
+		 *
+		 * In cases 1) and 2) we activate the pages to get them out of
+		 * the way while we continue scanning for clean pages on the
+		 * inactive list and refilling from the active list. The
+		 * observation here is that waiting for disk writes is more
+		 * expensive than potentially causing reloads down the line.
+		 * Since they're marked for immediate reclaim, they won't put
+		 * memory pressure on the cache working set any longer than it
+		 * takes to write them to disk.
 		 */
 		if (PageWriteback(page)) {
 			/* Case 1 above */

@@ -1063,7 +1072,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			    PageReclaim(page) &&
 			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 2 above */
 			} else if (sane_reclaim(sc) ||

@@ -1081,7 +1090,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 3 above */
 			} else {

@@ -1174,7 +1183,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 			SetPageReclaim(page);
-			goto keep_locked;
+			goto activate_locked;
 		}
 
 		if (references == PAGEREF_RECLAIM_CLEAN)