提交 · a75d4c33377277b6034dd1e2663bce444f952c14 · openeuler / Kernel

16 3月, 2019 1 次提交

filemap: kill page_cache_read usage in filemap_fault · a75d4c33

由 Josef Bacik 提交于 3月 13, 2019

Patch series "drop the mmap_sem when doing IO in the fault path", v6.

Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions.  Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.

We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such.  This involves running
'ps' or other such utilities.  These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.

This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer.  This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal.  But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO.  This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock.  With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.

This patch (of 3):

If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it.  This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache.  Then use the normal page locking and readpage logic already in
filemap_fault.  This simplifies the no page in page cache case
significantly.

[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
  Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a75d4c33

15 3月, 2019 1 次提交

filemap: pass vm_fault to the mmap ra helpers · 2a1180f1

由 Josef Bacik 提交于 3月 13, 2019

All of the arguments to these functions come from the vmf.

Cut down on the amount of arguments passed by simply passing in the vmf
to these two helpers.

Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2a1180f1

06 3月, 2019 5 次提交

mm: remove zone_lru_lock() function, access ->lru_lock directly · f4b7e272

由 Andrey Ryabinin 提交于 3月 05, 2019

We have common pattern to access lru_lock from a page pointer:
	zone_lru_lock(page_zone(page))

Which is silly, because it unfolds to this:
	&NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
while we can simply do
	&NODE_DATA(page_to_nid(page))->lru_lock

Remove zone_lru_lock() function, since it's only complicate things.  Use
'page_pgdat(page)->lru_lock' pattern instead.

[aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
  Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.comSigned-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NMel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f4b7e272

mm/shmem: make find_get_pages_range() work for huge page · 5d3ee42f

由 Yu Zhao 提交于 3月 05, 2019

find_get_pages_range() and find_get_pages_range_tag() already correctly
increment reference count on head when seeing compound page, but they
may still use page index from tail.  Page index from tail is always
zero, so these functions don't work on huge shmem.  This hasn't been a
problem because, AFAIK, nobody calls these functions on (huge) shmem.
Fix them anyway just in case.

Link: http://lkml.kernel.org/r/20190110030838.84446-1-yuzhao@google.comSigned-off-by: NYu Zhao <yuzhao@google.com>
Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5d3ee42f

docs/core-api/mm: fix return value descriptions in mm/ · a862f68a

由 Mike Rapoport 提交于 3月 05, 2019

Many kernel-doc comments in mm/ have the return value descriptions
either misformatted or omitted at all which makes kernel-doc script
unhappy:

$ make V=1 htmldocs
...
./mm/util.c:36: info: Scanning doc for kstrdup
./mm/util.c:41: warning: No description found for return value of 'kstrdup'
./mm/util.c:57: info: Scanning doc for kstrdup_const
./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
./mm/util.c:75: info: Scanning doc for kstrndup
./mm/util.c:83: warning: No description found for return value of 'kstrndup'
...

Fixing the formatting and adding the missing return value descriptions
eliminates ~100 such warnings.

Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a862f68a

mm/filemap: pass inclusive 'end_byte' parameter to filemap_range_has_page · 35f12f0f

由 zhengbin 提交于 3月 05, 2019

The 'end_byte' parameter of filemap_range_has_page is required to be
inclusive, so follow the rule.

Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
Fixes: 6be96d3a ("fs: return if direct I/O will trigger writeback")
Signed-off-by: Nzhengbin <zhengbin13@huawei.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NMatthew Wilcox <willy@infradead.org>
Acked-by: NChristoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: zhangyi (F) <yi.zhang@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

35f12f0f

mm/filemap.c: remove redundant test from find_get_pages_contig · 14ef1fc7

由 Matthew Wilcox 提交于 3月 05, 2019

After we establish a reference on the page, we check the pointer
continues to be in the correct position in i_pages. Checking
page->index afterwards is unnecessary; if it were to change, then the
pointer to it from the page cache would also move. The check used to be
done before grabbing a reference on the page which was racy (see commit
9cbb4cb2 ("mm: find_get_pages_contig fixlet")), but nobody noticed
that moving the check after grabbing the reference was redundant.

Link: http://lkml.kernel.org/r/20190107200224.13260-1-willy@infradead.orgSigned-off-by: NMatthew Wilcox <willy@infradead.org>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

14ef1fc7

05 1月, 2019 1 次提交

mm/: remove caller signal_pending branch predictions · fa45f116

由 Davidlohr Bueso 提交于 1月 03, 2019

This is already done for us internally by the signal machinery.

Link: http://lkml.kernel.org/r/20181116002713.8474-5-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dave@stgolabs.net>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fa45f116

29 12月, 2018 3 次提交

mm, fault_around: do not take a reference to a locked page · e0975b2a

由 Michal Hocko 提交于 12月 28, 2018

filemap_map_pages takes a speculative reference to each page in the range
before it tries to lock that page.  While this is correct it also can
influence page migration which will bail out when seeing an elevated
reference count.  The faultaround code would bail on seeing a locked page
so we can pro-actively check the PageLocked bit before
page_cache_get_speculative and prevent from pointless reference count
churn.

Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Suggested-by: NJan Kara <jack@suse.cz>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Acked-by: NHugh Dickins <hughd@google.com>
Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e0975b2a

mm/filemap.c: remove useless check in pagecache_get_page() · c16eb000

由 Kirill Tkhai 提交于 12月 28, 2018

page always is not NULL, so we may remove this useless check.

Link: http://lkml.kernel.org/r/154419752044.18559.2452963074922917720.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: NCyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c16eb000

mm: put_and_wait_on_page_locked() while page is migrated · 9a1ea439

由 Hugh Dickins 提交于 12月 28, 2018

Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.

And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).

This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.

David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2

Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db91 ("sched/wait: Break up long wake
list walk") 11a19c7b ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")

Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.

__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?

shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f5 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.

It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.

[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvilsSigned-off-by: NHugh Dickins <hughd@google.com>
Reported-by: NBaoquan He <bhe@redhat.com>
Tested-by: NBaoquan He <bhe@redhat.com>
Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9a1ea439

30 10月, 2018 5 次提交

vfs: enable remap callers that can handle short operations · eca3654e

由 Darrick J. Wong 提交于 10月 30, 2018

Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it.  Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).

We also enable CAN_SHORTEN for fideduperange to maintain existing
userspace-visible behavior where xfs/btrfs shorten the dedupe range to
avoid stale post-eof data exposure.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

eca3654e

vfs: make remap_file_range functions take and return bytes completed · 42ec3d4c

由 Darrick J. Wong 提交于 10月 30, 2018

Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on.  This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value.  For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

42ec3d4c

vfs: pass remap flags to generic_remap_checks · 3d28193e

由 Darrick J. Wong 提交于 10月 30, 2018

Pass the same remap flags to generic_remap_checks for consistency.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

3d28193e

vfs: strengthen checking of file range inputs to generic_remap_checks · 9fd91a90

由 Darrick J. Wong 提交于 10月 30, 2018

File range remapping, if allowed to run past the destination file's EOF,
is an optimization on a regular file write.  Regular file writes that
extend the file length are subject to various constraints which are not
checked by range cloning.

This is a correctness problem because we're never allowed to touch
ranges that the page cache can't support (s_maxbytes); we're not
supposed to deal with large offsets (MAX_NON_LFS) if O_LARGEFILE isn't
set; and we must obey resource limits (RLIMIT_FSIZE).

Therefore, add these checks to the new generic_remap_checks function so
that we curtail unexpected behavior.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

9fd91a90

vfs: check file ranges before cloning files · 1383a7ed

由 Darrick J. Wong 提交于 10月 30, 2018

Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

1383a7ed

27 10月, 2018 6 次提交

mm/filemap.c: use vmf_error() · 3c051324

由 Souptick Joarder 提交于 10月 26, 2018

These codes can be replaced with new inline vmf_error().

Link: http://lkml.kernel.org/r/20180927171411.GA23331@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3c051324

mm/filemap.c: Use existing variable · 3cb7b121

由 haiqing.shq 提交于 10月 26, 2018

Use the variable write_len instead of ov_iter_count(from).

Link: http://lkml.kernel.org/r/1537375855-2088-1-git-send-email-leviathan0992@gmail.comSigned-off-by: Nhaiqing.shq <leviathan0992@gmail.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3cb7b121

psi: pressure stall information for CPU, memory, and IO · eb414681

由 Johannes Weiner 提交于 10月 26, 2018

When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: NDaniel Drake <drake@endlessm.com>
Tested-by: NSuren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eb414681

delayacct: track delays from thrashing cache pages · b1d29ba8

由 Johannes Weiner 提交于 10月 26, 2018

Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache.  This
isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms

Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: NDaniel Drake <drake@endlessm.com>
Tested-by: NSuren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b1d29ba8

mm: workingset: tell cache transitions from workingset thrashing · 1899ad18

由 Johannes Weiner 提交于 10月 26, 2018

Refaults happen during transitions between workingsets as well as in-place
thrashing.  Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache.  When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime.  This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags

	21 if you have an MMU

	23 with the zone bits for DMA, Normal, HighMem, Movable

	29 with the sparsemem section bits

	30 if PAE is enabled

	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: NDaniel Drake <drake@endlessm.com>
Tested-by: NSuren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1899ad18

mm: convert to use vm_fault_t · 4b96a37d

由 Souptick Joarder 提交于 10月 26, 2018

As part of vm_fault_t conversion filemap_page_mkwrite() for the NOMMU case
was missed. Now converted.

Link: http://lkml.kernel.org/r/20180828174952.GA29229@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4b96a37d

24 10月, 2018 1 次提交

iov_iter: Use accessor function · 00e23707

由 David Howells 提交于 10月 22, 2018

Use accessor functions to access an iterator's type and direction. This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

00e23707

21 10月, 2018 13 次提交

page cache: Convert filemap_range_has_page to XArray · 8fa8e538

由 Matthew Wilcox 提交于 1月 16, 2018

Instead of calling find_get_pages_range() and putting any reference,
use xas_find() to iterate over any entries in the range, skipping the
shadow/swap entries.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

8fa8e538

M
page cache: Remove stray radix comment · 22ecdb4f
由 Matthew Wilcox 提交于 12月 04, 2017
```
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
22ecdb4f

page cache: Convert delete_batch to XArray · ef8e5717

由 Matthew Wilcox 提交于 12月 04, 2017

Rename the function from page_cache_tree_delete_batch to just
page_cache_delete_batch.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

ef8e5717

page cache: Convert filemap_map_pages to XArray · 070e807c

由 Matthew Wilcox 提交于 5月 17, 2018

Slight change of strategy here; if we have trouble getting hold of a
page for whatever reason (eg a compound page is split underneath us),
don't spin to stabilise the page, just continue the iteration, like we
would if we failed to trylock the page. Since this is a speculative
optimisation, it feels like we should allow the process to take an extra
fault if it turns out to need this page instead of spending time to pin
down a page it may not need.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

070e807c

M
page cache: Convert find_get_entries_tag to XArray · c1901cd3
由 Matthew Wilcox 提交于 5月 16, 2018
```
Slightly shorter and simpler code.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
c1901cd3

page cache; Convert find_get_pages_range_tag to XArray · a6906972

由 Matthew Wilcox 提交于 5月 16, 2018

The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

a6906972

page cache: Convert find_get_pages_contig to XArray · 3ece58a2

由 Matthew Wilcox 提交于 5月 16, 2018

There's no direct replacement for radix_tree_for_each_contig()
in the XArray API as it's an unusual thing to do.  Instead,
open-code a loop using xas_next().  This removes the only user of
radix_tree_for_each_contig() so delete the iterator from the API and
the test suite code for it.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

3ece58a2

page cache: Convert find_get_pages_range to XArray · fd1b3cee

由 Matthew Wilcox 提交于 5月 16, 2018

The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

fd1b3cee

M
page cache: Convert find_get_entries to XArray · f280bf09
由 Matthew Wilcox 提交于 5月 16, 2018
```
Slightly shorter and simpler code.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
f280bf09
M
page cache: Convert find_get_entry to XArray · 4c7472c0
由 Matthew Wilcox 提交于 5月 16, 2018
```
Slightly shorter and simpler code.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
4c7472c0
M
page cache: Convert page deletion to XArray · 5c024e6a
由 Matthew Wilcox 提交于 11月 21, 2017
```
The code is slightly shorter and simpler.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
5c024e6a

page cache: Add and replace pages using the XArray · 74d60958

由 Matthew Wilcox 提交于 11月 17, 2017

Use the XArray APIs to add and replace pages in the page cache.  This
removes two uses of the radix tree preload API and is significantly
shorter code.  It also removes the last user of __radix_tree_create()
outside radix-tree.c itself, so make it static.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

74d60958

page cache: Convert hole search to XArray · 0d3f9296

由 Matthew Wilcox 提交于 11月 21, 2017

The page cache offers the ability to search for a miss in the previous or
next N locations. Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array. This should be more efficient as it does not have to start the
lookup from the top for each index.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

0d3f9296

30 9月, 2018 1 次提交

xarray: Replace exceptional entries · 3159f943

由 Matthew Wilcox 提交于 11月 03, 2017

Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries.  This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry).  It is also a change in emphasis; exceptional entries are
intimidating and different.  As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
Reviewed-by: NJosef Bacik <jbacik@fb.com>

3159f943

08 6月, 2018 1 次提交

mm: use new return type vm_fault_t · 2bcd6454

由 Souptick Joarder 提交于 6月 07, 2018

Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.

Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2bcd6454

21 4月, 2018 1 次提交

mm/filemap.c: fix NULL pointer in page_cache_tree_insert() · abc1be13

由 Matthew Wilcox 提交于 4月 20, 2018

f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
Unfortunately, the page cache also uses the mapping's GFP flags for
allocating radix tree nodes.  It always masked off the __GFP_HIGHMEM
flag, and masks off __GFP_ZERO in some paths, but not all.  That causes
radix tree nodes to be allocated with a NULL list_head, which causes
backtraces like:

  __list_del_entry+0x30/0xd0
  list_lru_del+0xac/0x1ac
  page_cache_tree_insert+0xd8/0x110

The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
if they are ever used.  Fix them all by using GFP_RECLAIM_MASK at the
innermost location, and remove it from earlier in the callchain.

Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Reported-by: NChris Fries <cfries@google.com>
Debugged-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

abc1be13

14 4月, 2018 1 次提交

mm/filemap.c: provide dummy filemap_page_mkwrite() for NOMMU · 45397228

由 Arnd Bergmann 提交于 4月 13, 2018

Building orangefs on MMU-less machines now results in a link error
because of the newly introduced use of the filemap_page_mkwrite()
function:

  ERROR: "filemap_page_mkwrite" [fs/orangefs/orangefs.ko] undefined!

This adds a dummy version for it, similar to the existing
generic_file_mmap and generic_file_readonly_mmap stubs in the same file,
to avoid the link error without adding #ifdefs in each file system that
uses these.

Link: http://lkml.kernel.org/r/20180409105555.2439976-1-arnd@arndb.de
Fixes: a5135eea ("orangefs: implement vm_ops->fault")
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

45397228

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功