提交 · 0e4b0743bbe5807535ba1b0389281f9a4c1b2bb7 · openeuler / raspberrypi-kernel

16 11月, 2013 3 次提交

fold try_to_ascend() into the sole remaining caller · 31dec132

由 Al Viro 提交于 10月 25, 2013

There used to be a bunch of tree-walkers in dcache.c, all alike.
try_to_ascend() had been introduced to abstract a piece of logics
duplicated in all of them. These days all these tree-walkers are
implemented via the same iterator (d_walk()), which is the only
remaining caller of try_to_ascend(), so let's fold it back...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

31dec132

dcache.c: get rid of pointless macros · 482db906

由 Al Viro 提交于 10月 25, 2013

D_HASH{MASK,BITS} are used once each, both in the same function (d_hash()).
At this point they are actively misguiding - they imply that values are
compiler constants, which is no longer true.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

482db906

A
take read_seqbegin_or_lock() and friends to seqlock.h · 2bc74feb
由 Al Viro 提交于 10月 25, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
2bc74feb

13 11月, 2013 2 次提交

A
prepend_path() needs to reinitialize dentry/vfsmount/mnt on restarts · ede4cebc
由 Al Viro 提交于 11月 13, 2013
```
... and equivalent is needed in 3.12; it's broken there as well
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
ede4cebc

fix unpaired rcu lock in prepend_path() · 4ec6c2ae

由 Li Zhong 提交于 11月 13, 2013

Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4ec6c2ae

09 11月, 2013 7 次提交

dcache: don't clear DCACHE_DISCONNECTED too early · f80de2cd

由 J. Bruce Fields 提交于 7月 18, 2012

DCACHE_DISCONNECTED should not be cleared until we're sure the dentry is
connected all the way up to the root of the filesystem.  It *shouldn't*
be cleared as soon as the dentry is connected to a parent.  That will
cause bugs at least on exportable filesystems.
Acked-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

f80de2cd

dcache: Don't set DISCONNECTED on "pseudo filesystem" dentries · e1a24bb0

由 J. Bruce Fields 提交于 6月 29, 2012

I can't for the life of me see any reason why anyone should care whether
a dentry that is never hooked into the dentry cache would need
DCACHE_DISCONNECTED set.

This originates from 4b936885 "fs:
improve scalability of pseudo filesystems", which probably just made the
false assumption the DCACHE_DISCONNECTED was meant to be set on anything
not connected to a parent somehow.

So this is just confusing.  Ideally the only uses of DCACHE_DISCONNECTED
would be in the filehandle-lookup code, which needs it to ensure
dentries are connected into the dentry tree before use.

I left d_alloc_pseudo there even though it's now equivalent to
__d_alloc(), just on the theory the name is better documentation of its
intended use outside dcache.c.

Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e1a24bb0

dcache: use IS_ROOT to decide where dentry is hashed · 7632e465

由 J. Bruce Fields 提交于 6月 28, 2012

Every hashed dentry is either hashed in the dentry_hashtable, or a
superblock's s_anon list.

__d_drop() assumes it can determine which is the case by checking
DCACHE_DISCONNECTED; this is not true.

It is true that when DCACHE_DISCONNECTED is cleared, the dentry is not
only hashed on dentry_hashtable, but is fully connected to its parents
back to the root.

But the converse is *not* true: fs/exportfs/expfs.c:reconnect_path()
attempts to connect a directory (found by filehandle lookup) back to
root by ascending to parents and performing lookups one at a time.  It
does not clear DCACHE_DISCONNECTED until it's done, and that is not at
all an atomic process.

In particular, it is possible for DCACHE_DISCONNECTED to be set on a
dentry which is hashed on the dentry_hashtable.

Instead, use IS_ROOT() to check which hash chain a dentry is on.  This
*does* work:

Dentries are hashed only by:

	- d_obtain_alias, which adds an IS_ROOT() dentry to sb_anon.

	- __d_rehash, called by _d_rehash: hashes to the dentry's
	  parent, and all callers of _d_rehash appear to have d_parent
	  set to a "real" parent.
	- __d_rehash, called by __d_move: rehashes the moved dentry to
	  hash chain determined by target, and assigns target's d_parent
	  to its d_parent, before dropping the dentry's d_lock.

Therefore I believe it's safe for a holder of a dentry's d_lock to
assume that it is hashed on sb_anon if and only if IS_ROOT(dentry) is
true.

I believe the incorrect assumption about DCACHE_DISCONNECTED was
originally introduced by ceb5bdc2 "fs: dcache per-bucket dcache hash
locking".

Also add a comment while we're here.

Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: NChristoph Hellwig <hch@infradead.org>
Reviewed-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7632e465

VFS: Put a small type field into struct dentry::d_flags · b18825a7

由 David Howells 提交于 9月 12, 2013

Put a type field into struct dentry::d_flags to indicate if the dentry is one
of the following types that relate particularly to pathwalk:

	Miss (negative dentry)
	Directory
	"Automount" directory (defective - no i_op->lookup())
	Symlink
	Other (regular, socket, fifo, device)

The type field is set to one of the first five types on a dentry by calls to
__d_instantiate() and d_obtain_alias() from information in the inode (if one is
given).

The type is cleared by dentry_unlink_inode() when it reconstitutes an existing
dentry as a negative dentry.

Accessors provided are:

	d_set_type(dentry, type)
	d_is_directory(dentry)
	d_is_autodir(dentry)
	d_is_symlink(dentry)
	d_is_file(dentry)
	d_is_negative(dentry)
	d_is_positive(dentry)

A bunch of checks in pathname resolution switched to those.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b18825a7

A
fold __d_shrink() into its only remaining caller · b61625d2
由 Al Viro 提交于 10月 04, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
b61625d2

RCU'd vfsmounts · 48a066e7

由 Al Viro 提交于 9月 29, 2013

* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode.  We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock.  It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

48a066e7

A
switch shrink_dcache_for_umount() to use of d_walk() · 42c32608
由 Al Viro 提交于 11月 08, 2013
```
we have too many iterators in fs/dcache.c...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
42c32608

06 11月, 2013 1 次提交

seqcount: Add lockdep functionality to seqcount/seqlock structures · 1ca7d67c

由 John Stultz 提交于 10月 07, 2013

Currently seqlocks and seqcounts don't support lockdep.

After running across a seqcount related deadlock in the timekeeping
code, I used a less-refined and more focused variant of this patch
to narrow down the cause of the issue.

This is a first-pass attempt to properly enable lockdep functionality
on seqlocks and seqcounts.

Since seqcounts are used in the vdso gettimeofday code, I've provided
non-lockdep accessors for those needs.

I've also handled one case where there were nested seqlock writers
and there may be more edge cases.

Comments and feedback would be appreciated!
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-3-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

1ca7d67c

01 11月, 2013 1 次提交

vfs: decrapify dput(), fix cache behavior under normal load · 358eec18

由 Linus Torvalds 提交于 10月 31, 2013

We do not want to dirty the dentry->d_flags cacheline in dput() just to
set the DCACHE_REFERENCED flag when it is already set in the common case
anyway.  This way the first cacheline of the dentry (which contains the
RCU lookup information etc) can stay shared among multiple CPU's.

This finishes off some of the details of all the scalability patches
merged during the merge window.

Also don't mark dentry_kill() for inlining, since it's the uncommon path
and inlining it just makes the common path slower due to extra function
entry/exit overhead.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

358eec18

25 10月, 2013 2 次提交

vfs: introduce d_instantiate_no_diralias() · b70a80e7

由 Miklos Szeredi 提交于 10月 01, 2013

...which just returns -EBUSY if a directory alias would be created.

This is to be used by fuse mkdir to make sure that a buggy or malicious
userspace filesystem doesn't do anything nasty.  Previously fuse used a
private mutex for this purpose, which can now go away.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

b70a80e7

A
move taking vfsmount_lock down into prepend_path() · 94e92a6e
由 Al Viro 提交于 10月 01, 2013
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
94e92a6e

22 10月, 2013 1 次提交

vfs: fix new kernel-doc warnings · 69c88dc7

由 Randy Dunlap 提交于 10月 19, 2013

Move kernel-doc notation to immediately before its function to eliminate
kernel-doc warnings introduced by commit db14fc3a ("vfs: add
d_walk()")

  Warning(fs/dcache.c:1343): No description found for parameter 'data'
  Warning(fs/dcache.c:1343): No description found for parameter 'dentry'
  Warning(fs/dcache.c:1343): Excess function parameter 'parent' description in 'check_mount'
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

69c88dc7

15 9月, 2013 1 次提交

vfs: fix typo in comment in recent dentry work · 05a8252b

由 Linus Torvalds 提交于 9月 15, 2013

Sedat points out that I transposed some letters in "LRU" and wrote "RLU"
instead in one of the new comments explaining the flow.  Let's just fix
it.
Reported-by: NSedat Dilek <sedat.dilek@jpberlin.de>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

05a8252b

14 9月, 2013 1 次提交

vfs: fix dentry LRU list handling and nr_dentry_unused accounting · 89dc77bc

由 Linus Torvalds 提交于 9月 13, 2013

The LRU list changes interacted badly with our nr_dentry_unused
accounting, and even worse with the new DCACHE_LRU_LIST bit logic.

This introduces helper functions to make sure everything follows the
proper dcache d_lru list rules: the dentry cache is complicated by the
fact that some of the hotpaths don't even want to look at the LRU list
at all, and the fact that we use the same list entry in the dentry for
both the LRU list and for our temporary shrinking lists when removing
things from the LRU.

The helper functions temporarily have some extra sanity checking for the
flag bits that have to match the current LRU state of the dentry.  We'll
remove that before the final 3.12 release, but considering how easy it
is to get wrong, this first cleanup version has some very particular
sanity checking.
Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

89dc77bc

13 9月, 2013 6 次提交

vfs: make d_path() get the root path under RCU · 68f0d9d9

由 Linus Torvalds 提交于 9月 12, 2013

This avoids the spinlocks and refcounts in the d_path() sequence too
(used by /proc and various other entities).  See commit 8b19e341 for
the equivalent getcwd() system call path.

And unlike getcwd(), d_path() doesn't copy the result to user space, so
I don't need to fear _that_ particular bug happening again.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

68f0d9d9

vfs: use __getname/__putname for getcwd() system call · 3272c544

由 Linus Torvalds 提交于 9月 12, 2013

It's a pathname.  It should use the pathname allocators and
deallocators, and PATH_MAX instead of PAGE_SIZE.  Never mind that the
two are commonly the same.

With this, the allocations scale up nicely too, and I can do getcwd()
system calls at a rate of about 300M/s, with no lock contention
anywhere.

Of course, nobody sane does that, especially since getcwd() is
traditionally a very slow operation in Unix.  But this was also the
simplest way to benchmark the prepend_path() improvements by Waiman, and
once I saw the profiles I couldn't leave it well enough alone.

But apart from being an performance improvement (from using per-cpu slab
allocators instead of the raw page allocator), it's actually a valid and
real cleanup.
Signed-off-by: NLinus "OCD" Torvalds <torvalds@linux-foundation.org>

3272c544

vfs: don't copy things to user space holding the rcu readlock · ff812d72

由 Linus Torvalds 提交于 9月 12, 2013

Oops.  That wasn't very smart.  We don't actually need the RCU lock any
more by the time we copy the cwd string to user space, but I had
stupidly surrounded the whole thing with it.

Introduced by commit 8b19e341 ("vfs: make getcwd() get the root and
pwd path under rcu")

Is-a-big-hairy-idiot: Linus Torvalds <torvalds@linux-foundation.org>

ff812d72

vfs: make getcwd() get the root and pwd path under rcu · 8b19e341

由 Linus Torvalds 提交于 9月 12, 2013

This allows us to skip all the crazy spinlocks and reference count
updates, and instead use the fs sequence read-lock to get an atomic
snapshot of the root and cwd information.

We might want to make the rule that "prepend_path()" is always called
with the RCU lock held, but the RCU lock nests fine and this is the
minimal fix.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8b19e341

vfs: move get_fs_root_and_pwd() to single caller · 5762482f

由 Linus Torvalds 提交于 9月 12, 2013

Let's not pollute the include files with inline functions that are only
used in a single place. Especially not if we decide we might want to
change the semantics of said function to make it more efficient..
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5762482f

dcache: get/release read lock in read_seqbegin_or_lock() & friend · 18129977

由 Waiman Long 提交于 9月 12, 2013

This patch modifies read_seqbegin_or_lock() and need_seqretry() to use
newly introduced read_seqlock_excl() and read_sequnlock_excl()
primitives so that they won't change the sequence number even if they
fall back to take the lock.  This is OK as no change to the protected
data structure is being made.

It will prevent one fallback to lock taking from cascading into a series
of lock taking reducing performance because of the sequence number
change.  It will also allow other sequence readers to go forward while
an exclusive reader lock is taken.

This patch also updates some of the inaccurate comments in the code.
Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

18129977

11 9月, 2013 8 次提交

fs: convert inode and dentry shrinking to be node aware · 9b17c623

由 Dave Chinner 提交于 8月 28, 2013

Now that the shrinker is passing a node in the scan control structure, we
can pass this to the the generic LRU list code to isolate reclaim to the
lists on matching nodes.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NGlauber Costa <glommer@parallels.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9b17c623

list_lru: remove special case function list_lru_dispose_all. · 4e717f5c

由 Glauber Costa 提交于 8月 28, 2013

The list_lru implementation has one function, list_lru_dispose_all, with
only one user (the dentry code). At first, such function appears to make
sense because we are really not interested in the result of isolating each
dentry separately - all of them are going away anyway. However, it's
implementation is buggy in the following way:

When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries
marking them with DCACHE_SHRINK_LIST. However, this is done without the
nlru->lock taken. The imediate result of that is that someone else may
add or remove the dentry from the LRU at the same time. When list_lru_del
happens in that scenario we will see an element that is not yet marked
with DCACHE_SHRINK_LIST (even though it will be in the future) and
obviously remove it from an lru where the element no longer is. Since
list_lru_dispose_all will in effect count down nlru's nr_items and
list_lru_del will do the same, this will lead to an imbalance.

The solution for this would not be so simple: we can obviously just keep
the lru_lock taken, but then we have no guarantees that we will be able to
acquire the dentry lock (dentry->d_lock). To properly solve this, we need
a communication mechanism between the lru and dentry code, so they can
coordinate this with each other.

Such mechanism already exists in the form of the list_lru_walk_cb
callback. So it is possible to construct a dcache-side prune function
that does the right thing only by calling list_lru_walk in a loop until no
more dentries are available.

With only one user, plus the fact that a sane solution for the problem
would involve boucing between dcache and list_lru anyway, I see little
justification to keep the special case list_lru_dispose_all in tree.
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4e717f5c

dcache: convert to use new lru list infrastructure · f6041567

由 Dave Chinner 提交于 8月 28, 2013

[glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

f6041567

shrinker: convert superblock shrinkers to new API · 0a234c6d

由 Dave Chinner 提交于 8月 28, 2013

Convert superblock shrinker to use the new count/scan API, and propagate
the API changes through to the filesystem callouts.  The filesystem
callouts already use a count/scan API, so it's just changing counters to
longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted to
the count/scan API.  This is mainly a mechanical change.

[glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Acked-by: NMel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

0a234c6d

dcache: remove dentries from LRU before putting on dispose list · dd1f6b2e

由 Dave Chinner 提交于 8月 28, 2013

One of the big problems with modifying the way the dcache shrinker and LRU
implementation works is that the LRU is abused in several ways.  One of
these is shrink_dentry_list().

Basically, we can move a dentry off the LRU onto a different list without
doing any accounting changes, and then use dentry_lru_prune() to remove it
from what-ever list it is now on to do the LRU accounting at that point.

This makes it -really hard- to change the LRU implementation.  The use of
the per-sb LRU lock serialises movement of the dentries between the
different lists and the removal of them, and this is the only reason that
it works.  If we want to break up the dentry LRU lock and lists into, say,
per-node lists, we remove the only serialisation that allows this lru
list/dispose list abuse to work.

To make this work effectively, the dispose list has to be isolated from
the LRU list - dentries have to be removed from the LRU *before* being
placed on the dispose list.  This means that the LRU accounting and
isolation is completed before disposal is started, and that means we can
change the LRU implementation freely in future.

This means that dentries *must* be marked with DCACHE_SHRINK_LIST when
they are placed on the dispose list so that we don't think that parent
dentries found in try_prune_one_dentry() are on the LRU when the are
actually on the dispose list.  This would result in accounting the dentry
to the LRU a second time.  Hence dentry_lru_del() has to handle the
DCACHE_SHRINK_LIST case
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

dd1f6b2e

dentry: move to per-sb LRU locks · 19156840

由 Dave Chinner 提交于 8月 28, 2013

With the dentry LRUs being per-sb structures, there is no real need for
a global dentry_lru_lock. The locking can be made more fine-grained by
moving to a per-sb LRU lock, isolating the LRU operations of different
filesytsems completely from each other. The need for this is independent
of any performance consideration that may arise: in the interest of
abstracting the lru operations away, it is mandatory that each lru works
around its own lock instead of a global lock for all of them.

[glommer@openvz.org: updated changelog ]
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

19156840

dcache: convert dentry_stat.nr_unused to per-cpu counters · 62d36c77

由 Dave Chinner 提交于 8月 28, 2013

Before we split up the dcache_lru_lock, the unused dentry counter needs to
be made independent of the global dcache_lru_lock.  Convert it to per-cpu
counters to do this.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NMel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

62d36c77

fs: bump inode and dentry counters to long · 3942c07c

由 Glauber Costa 提交于 8月 28, 2013

This series reworks our current object cache shrinking infrastructure in
two main ways:

 * Noticing that a lot of users copy and paste their own version of LRU
   lists for objects, we put some effort in providing a generic version.
   It is modeled after the filesystem users: dentries, inodes, and xfs
   (for various tasks), but we expect that other users could benefit in
   the near future with little or no modification.  Let us know if you
   have any issues.

 * The underlying list_lru being proposed automatically and
   transparently keeps the elements in per-node lists, and is able to
   manipulate the node lists individually.  Given this infrastructure, we
   are able to modify the up-to-now hammer called shrink_slab to proceed
   with node-reclaim instead of always searching memory from all over like
   it has been doing.

Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock.  The locks usually disappear from the profilers with this
change.

Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.

With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim.  Historically,
those two pieces of work have been posted together.  This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested.  You can see more about the
history of such work at http://lwn.net/Articles/552769/

Dave Chinner (18):
  dcache: convert dentry_stat.nr_unused to per-cpu counters
  dentry: move to per-sb LRU locks
  dcache: remove dentries from LRU before putting on dispose list
  mm: new shrinker API
  shrinker: convert superblock shrinkers to new API
  list: add a new LRU list type
  inode: convert inode lru list to generic lru list code.
  dcache: convert to use new lru list infrastructure
  list_lru: per-node list infrastructure
  shrinker: add node awareness
  fs: convert inode and dentry shrinking to be node aware
  xfs: convert buftarg LRU to generic code
  xfs: rework buffer dispose list tracking
  xfs: convert dquot cache lru to list_lru
  fs: convert fs shrinkers to new scan/count API
  drivers: convert shrinkers to new count/scan API
  shrinker: convert remaining shrinkers to count/scan API
  shrinker: Kill old ->shrink API.

Glauber Costa (7):
  fs: bump inode and dentry counters to long
  super: fix calculation of shrinkable objects for small numbers
  list_lru: per-node API
  vmscan: per-node deferred work
  i915: bail out earlier when shrinker cannot acquire mutex
  hugepage: convert huge zero page shrinker to new shrinker API
  list_lru: dynamically adjust node arrays

This patch:

There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc.  This is particularly true
when umounting a filesystem, where eventually since every live object will
eventually be discarded.

Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset.  So we believe it is time for a change.  This
patch just moves int to longs.  Machines where it matters should have a
big long anyway.
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3942c07c

10 9月, 2013 2 次提交

split read_seqretry_or_unlock(), convert d_walk() to resulting primitives · 48f5ec21

由 Al Viro 提交于 9月 09, 2013

Separate "check if we need to retry" from "unlock if we are done and
had seq_writelock"; that allows to use these guys in d_walk(), where
we need to recheck every time we ascend back to parent, but do *not*
want to unlock until the very end.  Lift rcu_read_lock/rcu_read_unlock
out into callers.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

48f5ec21

dcache: Translating dentry into pathname without taking rename_lock · 232d2d60

由 Waiman Long 提交于 9月 09, 2013

When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

     8.46%     reaim  [kernel.kallsyms]     [k] _raw_spin_lock
                 |--42.21%-- d_path
                 |          proc_pid_readlink
                 |          SyS_readlinkat
                 |          SyS_readlink
                 |          system_call
                 |          __GI___readlink
                 |
                 |--40.97%-- sys_getcwd
                 |          system_call
                 |          __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The need to take the rename_lock is to make sure that no rename
operation can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions. As a result, the code will have to
retry if one or more rename operations had been performed. In addition,
RCU read lock will be taken during the translation process to make sure
that no dentries will go away. To prevent live-lock from happening,
the code will switch back to take the rename_lock if read_seqretry()
fails for three times.

To further reduce spinlock contention, this patch does not take the
dentry's d_lock when copying the filename from the dentries. Instead,
it treats the name pointer and length as unreliable and just copy
the string byte-by-byte over until it hits a null byte or the end of
string as specified by the length. This should avoid stepping into
invalid memory address. The error cases are left to be handled by
the sequence number check.

The following code re-factoring are also made:
1. Move prepend('/') into prepend_name() to remove one conditional
   check.
2. Move the global root check in prepend_path() back to the top of
   the while loop.

With this patch, the _raw_spin_lock will now account for only 1.2%
of the total CPU cycles for the short workload. This patch also has
the effect of reducing the effect of running perf on its profile
since the perf command itself can be a heavy user of the d_path()
function depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
the running of perf will go away and we will have a more accurate
perf profile.
Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

232d2d60

09 9月, 2013 2 次提交

vfs: use lockred "dead" flag to mark unrecoverably dead dentries · 0d98439e

由 Linus Torvalds 提交于 9月 08, 2013

This simplifies the RCU to refcounting code in particular.

I was originally intending to leave this for later, but walking through
all the dput() logic (see previous commit), I realized that the dput()
"might_sleep()" check was misleadingly weak. And I removed it as
misleading, both for performance profiling and for debugging.

However, the might_sleep() debugging case is actually true: the final
dput() can indeed sleep, if the inode of the dentry that you are
releasing ends up sleeping at iput time (see dentry_iput()). So the
problem with the might_sleep() in dput() wasn't that it wasn't true, it
was that it wasn't actually testing and triggering on the interesting
case.

In particular, just about *any* dput() can indeed sleep, if you happen
to race with another thread deleting the file in question, and you then
lose the race to the be the last dput() for that file. But because it's
a very rare race, the debugging code would never trigger it in practice.

Why is this problematic? The new d_rcu_to_refcount() (see commit
15570086: "vfs: reimplement d_rcu_to_refcount() using
lockref_get_or_lock()") does a dput() for the failure case, and it does
it under the RCU lock. So potentially sleeping really is a bug.

But there's no way I'm going to fix this with the previous complicated
"lockref_get_or_lock()" interface. And rather than revert to the old
and crufty nested dentry locking code (which did get this right by
delaying the reference count updates until they were verified to be
safe), let's make forward progress.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0d98439e

vfs: reorganize dput() memory accesses · 8aab6a27

由 Linus Torvalds 提交于 9月 08, 2013

This is me being a bit OCD after all the dentry optimization work this
merge window: profiles end up showing 'dput()' as a rather expensive
operation, and there were two unrelated bad reasons for that.

The first reason was reading d_lockref.count for debugging purposes,
which touches the lockref cacheline (for reads) before really need to.
More importantly, the debugging test in question is _wrong_, and has
hidden bugs. It's true that we can only sleep when the count goes down
to zero, but the test as-is hides the much more subtle bug that happens
if we race with somebody else deleting the file.

Anyway we _will_ touch that cacheline, but let's do it for a write and
in the right routine (ie in "lockref_put_or_lock()") which annotates the
costs better. So remove the misleading debug code.

The other was an unnecessary access to the cacheline that contains the
d_lru list, just to check whether we already were on the LRU list or
not. This is exactly what we have d_flags for, so that we can avoid
touching extra cache lines for the common case. So just add another bit
for "is this dentry on the LRU".

Finally, mark the tests properly likely/unlikely, so that the common
fast-paths are dense in the instruction stream.

This makes the profiles look much saner.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8aab6a27

06 9月, 2013 3 次提交

vfs: check unlinked ancestors before mount · eed81007

由 Miklos Szeredi 提交于 9月 05, 2013

We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount. Nor do we
prevent mounts to be added to the disconnected subtree using relative paths
after the d_drop().

This patch fixes these issues by checking for unlinked (unhashed, non-root)
ancestors before proceeding with the mount. This is done with rename
seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
including our dentry itself. This ensures that the only one of
check_submounts_and_drop() or has_unlinked_ancestor() can succeed.
Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

eed81007

vfs: check submounts and drop atomically · 848ac114

由 Miklos Szeredi 提交于 9月 05, 2013

We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount.

 Process A: have_submounts() -> returns false
 Process B: mount() -> success
 Process A: d_drop()

This patch prepares the ground for the fix by doing the following
operations all under the same rename lock:

  have_submounts()
  shrink_dcache_parent()
  d_drop()

This is actually an optimization since have_submounts() and
shrink_dcache_parent() both traverse the same dentry tree separately.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
CC: David Howells <dhowells@redhat.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

848ac114

vfs: add d_walk() · db14fc3a

由 Miklos Szeredi 提交于 9月 05, 2013

This one replaces three instances open coded tree walking (have_submounts,
select_parent, d_genocide) with a common helper.

In addition to slightly reducing the kernel size, this simplifies the
callers and makes them less bug prone.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

db14fc3a