提交 · c1901cd33cf407d77a181f8dd4ffff98041ef480 · openeuler / Kernel

21 10月, 2018 9 次提交

M
page cache: Convert find_get_entries_tag to XArray · c1901cd3
由 Matthew Wilcox 提交于 5月 16, 2018
```
Slightly shorter and simpler code.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
c1901cd3

page cache; Convert find_get_pages_range_tag to XArray · a6906972

由 Matthew Wilcox 提交于 5月 16, 2018

The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

a6906972

page cache: Convert find_get_pages_contig to XArray · 3ece58a2

由 Matthew Wilcox 提交于 5月 16, 2018

There's no direct replacement for radix_tree_for_each_contig()
in the XArray API as it's an unusual thing to do.  Instead,
open-code a loop using xas_next().  This removes the only user of
radix_tree_for_each_contig() so delete the iterator from the API and
the test suite code for it.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

3ece58a2

page cache: Convert find_get_pages_range to XArray · fd1b3cee

由 Matthew Wilcox 提交于 5月 16, 2018

The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

fd1b3cee

M
page cache: Convert find_get_entries to XArray · f280bf09
由 Matthew Wilcox 提交于 5月 16, 2018
```
Slightly shorter and simpler code.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
f280bf09
M
page cache: Convert find_get_entry to XArray · 4c7472c0
由 Matthew Wilcox 提交于 5月 16, 2018
```
Slightly shorter and simpler code.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
4c7472c0
M
page cache: Convert page deletion to XArray · 5c024e6a
由 Matthew Wilcox 提交于 11月 21, 2017
```
The code is slightly shorter and simpler.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
```
5c024e6a

page cache: Add and replace pages using the XArray · 74d60958

由 Matthew Wilcox 提交于 11月 17, 2017

Use the XArray APIs to add and replace pages in the page cache.  This
removes two uses of the radix tree preload API and is significantly
shorter code.  It also removes the last user of __radix_tree_create()
outside radix-tree.c itself, so make it static.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

74d60958

page cache: Convert hole search to XArray · 0d3f9296

由 Matthew Wilcox 提交于 11月 21, 2017

The page cache offers the ability to search for a miss in the previous or
next N locations. Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array. This should be more efficient as it does not have to start the
lookup from the top for each index.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>

0d3f9296

30 9月, 2018 1 次提交

xarray: Replace exceptional entries · 3159f943

由 Matthew Wilcox 提交于 11月 03, 2017

Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries.  This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry).  It is also a change in emphasis; exceptional entries are
intimidating and different.  As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
Reviewed-by: NJosef Bacik <jbacik@fb.com>

3159f943

08 6月, 2018 1 次提交

mm: use new return type vm_fault_t · 2bcd6454

由 Souptick Joarder 提交于 6月 07, 2018

Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.

Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2bcd6454

21 4月, 2018 1 次提交

mm/filemap.c: fix NULL pointer in page_cache_tree_insert() · abc1be13

由 Matthew Wilcox 提交于 4月 20, 2018

f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
Unfortunately, the page cache also uses the mapping's GFP flags for
allocating radix tree nodes.  It always masked off the __GFP_HIGHMEM
flag, and masks off __GFP_ZERO in some paths, but not all.  That causes
radix tree nodes to be allocated with a NULL list_head, which causes
backtraces like:

  __list_del_entry+0x30/0xd0
  list_lru_del+0xac/0x1ac
  page_cache_tree_insert+0xd8/0x110

The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
if they are ever used.  Fix them all by using GFP_RECLAIM_MASK at the
innermost location, and remove it from earlier in the callchain.

Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Reported-by: NChris Fries <cfries@google.com>
Debugged-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

abc1be13

14 4月, 2018 1 次提交

mm/filemap.c: provide dummy filemap_page_mkwrite() for NOMMU · 45397228

由 Arnd Bergmann 提交于 4月 13, 2018

Building orangefs on MMU-less machines now results in a link error
because of the newly introduced use of the filemap_page_mkwrite()
function:

  ERROR: "filemap_page_mkwrite" [fs/orangefs/orangefs.ko] undefined!

This adds a dummy version for it, similar to the existing
generic_file_mmap and generic_file_readonly_mmap stubs in the same file,
to avoid the link error without adding #ifdefs in each file system that
uses these.

Link: http://lkml.kernel.org/r/20180409105555.2439976-1-arnd@arndb.de
Fixes: a5135eea ("orangefs: implement vm_ops->fault")
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

45397228

12 4月, 2018 1 次提交

page cache: use xa_lock · b93b0163

由 Matthew Wilcox 提交于 4月 10, 2018

Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NJeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b93b0163

01 2月, 2018 1 次提交

mm/filemap.c: remove include of hardirq.h · 2b9fceb3

由 Yang Shi 提交于 1月 31, 2018

in_atomic() has been moved to include/linux/preempt.h, and the filemap.c
doesn't use in_atomic() directly at all, so it sounds unnecessary to
include hardirq.h.

Link: http://lkml.kernel.org/r/1509985319-38633-1-git-send-email-yang.s@alibaba-inc.comSigned-off-by: NYang Shi <yang.s@alibaba-inc.com>
Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2b9fceb3

16 11月, 2017 11 次提交

mm: remove __GFP_COLD · 453f85d4

由 Mel Gorman 提交于 11月 15, 2017

As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of.  Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact.  Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache.  Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path.  A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

453f85d4

mm, pagevec: remove cold parameter for pagevecs · 86679820

由 Mel Gorman 提交于 11月 15, 2017

Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

86679820

mm, truncate: do not check mapping for every page being truncated · c7df8ad2

由 Mel Gorman 提交于 11月 15, 2017

During truncation, the mapping has already been checked for shmem and
dax so it's known that workingset_update_node is required.

This patch avoids the checks on mapping for each page being truncated.
In all other cases, a lookup helper is used to determine if
workingset_update_node() needs to be called.  The one danger is that the
API is slightly harder to use as calling workingset_update_node directly
without checking for dax or shmem mappings could lead to surprises.
However, the API rarely needs to be used and hopefully the comment is
enough to give people the hint.

sparsetruncate (tiny)
                              4.14.0-rc4             4.14.0-rc4
                             oneirq-v1r1        pickhelper-v1r1
Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)

As you'd expect, the gain is marginal but it can be detected.  The
differences in bonnie are all within the noise which is not surprising
given the impact on the microbenchmark.

radix_tree_update_node_t is a callback for some radix operations that
optionally passes in a private field.  The only user of the callback is
workingset_update_node and as it no longer requires a mapping, the
private field is removed.

Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c7df8ad2

mm: batch radix tree operations when truncating pages · aa65c29c

由 Jan Kara 提交于 11月 15, 2017

Currently we remove pages from the radix tree one by one.  To speed up
page cache truncation, lock several pages at once and free them in one
go.  This allows us to batch radix tree operations in a more efficient
way and also save round-trips on mapping->tree_lock.  As a result we
gain about 20% speed improvement in page cache truncation.

Data from a simple benchmark timing 10000 truncates of 1024 pages (on
ext4 on ramdisk but the filesystem is barely visible in the profiles).
The range shows 1% and 95% percentiles of the measured times:

  4.14-rc2	4.14-rc2 + batched truncation
  248-256	209-219
  249-258	209-217
  248-255	211-239
  248-255	209-217
  247-256	210-218

[jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
  Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
[akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aa65c29c

mm: factor out checks and accounting from __delete_from_page_cache() · 5ecc4d85

由 Jan Kara 提交于 11月 15, 2017

Move checks and accounting updates from __delete_from_page_cache() into
a separate function.  We will reuse it when batching page cache
truncation operations.

Link: http://lkml.kernel.org/r/20171010151937.26984-7-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ecc4d85

mm: move clearing of page->mapping to page_cache_tree_delete() · 2300638b

由 Jan Kara 提交于 11月 15, 2017

Clearing of page->mapping makes sense in page_cache_tree_delete() as
well and it will help us with batching things this way.

Link: http://lkml.kernel.org/r/20171010151937.26984-6-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2300638b

mm: move accounting updates before page_cache_tree_delete() · 76253fbc

由 Jan Kara 提交于 11月 15, 2017

Move updates of various counters before page_cache_tree_delete() call.
It will be easier to batch things this way and there is no difference
whether the counters get updated before or after removal from the radix
tree.

Link: http://lkml.kernel.org/r/20171010151937.26984-5-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

76253fbc

mm: factor out page cache page freeing into a separate function · 59c66c5f

由 Jan Kara 提交于 11月 15, 2017

Factor out page freeing from delete_from_page_cache() into a separate
function.  We will need to call the same when batching pagecache
deletion operations.

invalidate_complete_page2() and replace_page_cache_page() might want to
call this function as well however they currently don't seem to handle
THPs so it's unnecessary for them to take the hit of checking whether a
page is THP or not.

Link: http://lkml.kernel.org/r/20171010151937.26984-4-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

59c66c5f

mm: remove nr_pages argument from pagevec_lookup_{,range}_tag() · 67fd707f

由 Jan Kara 提交于 11月 15, 2017

All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

67fd707f

mm: use pagevec_lookup_range_tag() in __filemap_fdatawait_range() · 312e9d2f

由 Jan Kara 提交于 11月 15, 2017

Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is
interested only in pages from given range. Remove unnecessary code
resulting from this.

Link: http://lkml.kernel.org/r/20171009151359.31984-11-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

312e9d2f

mm: implement find_get_pages_range_tag() · 72b045ae

由 Jan Kara 提交于 11月 15, 2017

Patch series "Ranged pagevec tagged lookup", v3.

In this series I provide a ranged variant of pagevec_lookup_tag() and
use it in places where it makes sense.  This series removes some common
code and it also has a potential for speeding up some operations
similarly as for pagevec_lookup_range() (but for now I can think of only
artificial cases where this happens).

This patch (of 16):

Implement a variant of find_get_pages_tag() that stops iterating at
given index.  Lots of users of this function (through pagevec_lookup())
actually want a range lookup and all of them are currently open-coding
this.

Also create corresponding pagevec_lookup_range_tag() function.

Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Steve French <sfrench@samba.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

72b045ae

13 11月, 2017 1 次提交

afs: Get rid of the afs_writeback record · 4343d008

由 David Howells 提交于 11月 02, 2017

Get rid of the afs_writeback record that kAFS is using to match keys with
writes made by that key.

Instead, keep a list of keys that have a file open for writing and/or
sync'ing and iterate through those.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

4343d008

04 10月, 2017 1 次提交

mm: have filemap_check_and_advance_wb_err clear AS_EIO/AS_ENOSPC · f4e222c5

由 Jeff Layton 提交于 10月 03, 2017

Eryu noticed that he could sometimes get a leftover error reported when
it shouldn't be on fsync with ext2 and non-journalled ext4.

The problem is that writeback_single_inode still uses filemap_fdatawait.
That picks up a previously set AS_EIO flag, which would ordinarily have
been cleared before.

Since we're mostly using this function as a replacement for
filemap_check_errors, have filemap_check_and_advance_wb_err clear AS_EIO
and AS_ENOSPC when reporting an error.  That should allow the new
function to better emulate the behavior of the old with respect to these
flags.

Link: http://lkml.kernel.org/r/20170922133331.28812-1-jlayton@kernel.orgSigned-off-by: NJeff Layton <jlayton@redhat.com>
Reported-by: NEryu Guan <eguan@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f4e222c5

25 9月, 2017 1 次提交

fs: Fix page cache inconsistency when mixing buffered and AIO DIO · 332391a9

由 Lukas Czerner 提交于 9月 21, 2017

Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised it does pose a thread
for a data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later before the completion in dio_complete() invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on inode that has pages mapped. However
since the consensus is that this is ill-advised practice the performance
implication should not be a problem.

This was based on proposal from Jeff Moyer, thanks!
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

332391a9

15 9月, 2017 1 次提交

sched/wait: Introduce wakeup boomark in wake_up_page_bit · 11a19c7b

由 Tim Chen 提交于 8月 25, 2017

Now that we have added breaks in the wait queue scan and allow bookmark
on scan position, we put this logic in the wake_up_page_bit function.

We can have very long page wait list in large system where multiple
pages share the same wait list. We break the wake up walk here to allow
other cpus a chance to access the list, and not to disable the interrupts
when traversing the list for too long.  This reduces the interrupt and
rescheduling latency, and excessive page wait queue lock hold time.

[ v2: Remove bookmark_wake_function ]
Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

11a19c7b

07 9月, 2017 4 次提交

mm: use find_get_pages_range() in filemap_range_has_page() · f7b68046

由 Jan Kara 提交于 9月 06, 2017

We want only pages from given range in filemap_range_has_page(),
furthermore we want at most a single page.

So use find_get_pages_range() instead of pagevec_lookup() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20170726114704.7626-10-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f7b68046

mm: implement find_get_pages_range() · b947cee4

由 Jan Kara 提交于 9月 06, 2017

Implement a variant of find_get_pages() that stops iterating at given
index. This may be substantial performance gain if the mapping is
sparse. See following commit for details. Furthermore lots of users of
this function (through pagevec_lookup()) actually want a range lookup
and all of them are currently open-coding this.

Also create corresponding pagevec_lookup_range() function.

Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b947cee4

mm: make pagevec_lookup() update index · d72dc8a2

由 Jan Kara 提交于 9月 06, 2017

Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue. Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d72dc8a2

dax: remove DAX code from page_cache_tree_insert() · d01ad197

由 Ross Zwisler 提交于 9月 06, 2017

Now that we no longer insert struct page pointers in DAX radix trees we
can remove the special casing for DAX in page_cache_tree_insert().

This also allows us to make dax_wake_mapping_entry_waiter() local to
fs/dax.c, removing it from dax.h.

Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d01ad197

05 9月, 2017 2 次提交

fs: support IOCB_NOWAIT in generic_file_buffered_read · 3239d834

由 Milosz Tanski 提交于 8月 29, 2017

Allow generic_file_buffered_read to bail out early instead of waiting for
the page lock or reading a page if IOCB_NOWAIT is specified.
Signed-off-by: NMilosz Tanski <milosz@adfin.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Acked-by: NSage Weil <sage@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3239d834

fs: pass iocb to do_generic_file_read · 47c27bc4

由 Christoph Hellwig 提交于 8月 29, 2017

And rename it to the more descriptive generic_file_buffered_read while
at it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

47c27bc4

29 8月, 2017 1 次提交

page waitqueue: always add new entries at the end · 9c3a815f

由 Linus Torvalds 提交于 8月 28, 2017

Commit 3510ca20 ("Minor page waitqueue cleanups") made the page
queue code always add new waiters to the back of the queue, which helps
upcoming patches to batch the wakeups for some horrid loads where the
wait queues grow to thousands of entries.

However, I forgot about the nasrt add_page_wait_queue() special case
code that is only used by the cachefiles code.  That one still continued
to add the new wait queue entries at the beginning of the list.

Fix it, because any sane batched wakeup will require that we don't
suddenly start getting new entries at the beginning of the list that we
already handled in a previous batch.

[ The current code always does the whole list while holding the lock, so
  wait queue ordering doesn't matter for correctness, but even then it's
  better to add later entries at the end from a fairness standpoint ]
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9c3a815f

28 8月, 2017 2 次提交

Avoid page waitqueue race leaving possible page locker waiting · a8b169af

由 Linus Torvalds 提交于 8月 27, 2017

The "lock_page_killable()" function waits for exclusive access to the
page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
set.

That means that if it gets woken up, other waiters may have been
skipped.

That, in turn, means that if it sees the page being unlocked, it *must*
take that lock and return success, even if a lethal signal is also
pending.

So instead of checking for lethal signals first, we need to check for
them after we've checked the actual bit that we were waiting for.  Even
if that might then delay the killing of the process.

This matches the order of the old "wait_on_bit_lock()" infrastructure
that the page locking used to use (and is still used in a few other
areas).

Note that if we still return an error after having unsuccessfully tried
to acquire the page lock, that is ok: that means that some other thread
was able to get ahead of us and lock the page, and when that other
thread then unlocks the page, the wakeup event will be repeated.  So any
other pending waiters will now get properly woken up.

Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Kara <jack@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a8b169af

Minor page waitqueue cleanups · 3510ca20

由 Linus Torvalds 提交于 8月 27, 2017

Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists.  The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.

Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes.  That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.

In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably.  If you have thousands of threads
waiting for the same page, it will be painful.  We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.

That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:

 (a) we don't want to continue walking the page wait list if the bit
     we're waiting for already got set again (which seems to be one of
     the patterns of the bad load).  That makes no progress and just
     causes pointless cache pollution chasing the pointers.

 (b) we don't want to put the non-locking waiters always on the front of
     the queue, and the locking waiters always on the back.  Not only is
     that unfair, it means that we wake up thousands of reading threads
     that will just end up being blocked by the writer later anyway.

Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members.  It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3510ca20

01 8月, 2017 1 次提交

mm: remove optimizations based on i_size in mapping writeback waits · ffb959bb

由 Jeff Layton 提交于 7月 31, 2017

Marcelo added this i_size based optimization with a patch in 2004
(commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
    Date:   Tue Sep 7 17:51:17 2004 -0700

	small wait_on_page_writeback_range() optimization

	filemap_fdatawait() calls wait_on_page_writeback_range() with -1
	as "end" parameter.  This is not needed since we know the EOF
	from the inode.  Use that instead.

There may be races here, particularly with clustered or network
filesystems. It also seems like a bit of a layering violation since
we're operating on an address_space here, not an inode.

Finally, it's also questionable whether this optimization really helps
on workloads that we care about. Should we be optimizing for writeback
vs. truncate races in a codepath where we expect to wait anyway? It
doesn't seem worth the risk.

Remove this optimization from the filemap_fdatawait codepaths. This
means that filemap_fdatawait becomes a trivial wrapper around
filemap_fdatawait_range.
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJeff Layton <jlayton@redhat.com>

ffb959bb

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功