1. 10 Aug 2018, 1 commit
    • make sure that __dentry_kill() always invalidates d_seq, unhashed or not · 4c0d7cd5
      Committed by Al Viro
      RCU pathwalk relies upon the assumption that anything that changes
      ->d_inode of a dentry will invalidate its ->d_seq.  That's almost
      true - the one exception is that the final dput() of an already
      unhashed dentry does *not* touch ->d_seq at all.  Unhashing does,
      though, so for anything we'd found by RCU dcache lookup we are fine.
      Unfortunately, we can *start* with an unhashed dentry or jump into
      one.
      
      We could try and be careful in the (few) places where that could
      happen.  Or we could just make the final dput() invalidate the damn
      thing, unhashed or not.  The latter is much simpler and easier to
      backport, so let's do it that way.
      Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
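      
      A sketch of the idea (hedged: paraphrased from the description above,
      not the verbatim diff; ___d_drop() is the unhash-without-seq-bump
      helper, write_seqcount_invalidate() does the bump):
      
          static void __dentry_kill(struct dentry *dentry)
          {
                  ...
                  /* invalidate ->d_seq unconditionally, hashed or not */
                  if (!d_unhashed(dentry))
                          ___d_drop(dentry);      /* unhash, no ->d_seq bump */
                  write_seqcount_invalidate(&dentry->d_seq);
                  ...
          }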
  2. 06 Aug 2018, 1 commit
    • root dentries need RCU-delayed freeing · 90bad5e0
      Committed by Al Viro
      Since mountpoint crossing can happen without leaving lazy mode,
      root dentries do need the same protection against having their
      memory freed without RCU delay as everything else in the tree.
      
      It's partially hidden by RCU delay between detaching from the
      mount tree and dropping the vfsmount reference, but the starting
      point of pathwalk can be on an already detached mount, in which
      case umount-caused RCU delay has already passed by the time the
      lazy pathwalk grabs rcu_read_lock().  If the starting point
      happens to be at the root of that vfsmount *and* that vfsmount
      covers the entire filesystem, we get trouble.
      
      Fixes: 48a066e7 ("RCU'd vfsmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
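      
      The mechanics, sketched under the assumption that the DCACHE_RCUACCESS
      flag of that era is what gates RCU-delayed freeing; marking root
      dentries with it is the essence of the fix:
      
          /* dentry_free(): only flagged dentries get the RCU delay */
          if (dentry->d_flags & DCACHE_RCUACCESS)
                  call_rcu(&dentry->d_u.d_rcu, __d_free);
          else
                  __d_free(&dentry->d_u.d_rcu);
      
          /* d_make_root() and friends: root dentries now set the flag */
          res->d_flags |= DCACHE_RCUACCESS;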
  3. 04 Aug 2018, 1 commit
    • new primitive: discard_new_inode() · c2b6d621
      Committed by Al Viro
      	We don't want open-by-handle picking a half-set-up in-core
      struct inode from e.g. a mkdir() that failed halfway through.
      In other words, we don't want such inodes returned by iget_locked()
      on their way to extinction.  However, we can't just leave them
      unhashed - otherwise an open-by-handle immediately *after* that would
      end up creating a new in-core inode over the on-disk one that is in
      the process of being freed right under us.
      
      	Solution: a new flag (I_CREATING), set by insert_inode_locked() and
      removed by unlock_new_inode(), and a new primitive (discard_new_inode())
      to be used by such halfway-through-setup failure exits instead of
      unlock_new_inode() / iput() combinations.  That primitive unlocks the
      new inode, but leaves I_CREATING in place.
      
      	iget_locked() treats finding an I_CREATING inode as failure
      (-ESTALE, once we sort out the error propagation).
      	insert_inode_locked() treats the same as instant -EBUSY.
      	ilookup() treats those as icache miss.
      
      [Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
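      
      The new primitive, sketched (this follows the shape described above;
      the I_NEW wakeup protocol is the standard one from unlock_new_inode()):
      
          void discard_new_inode(struct inode *inode)
          {
                  lockdep_annotate_inode_mutex_key(inode);
                  spin_lock(&inode->i_lock);
                  WARN_ON(!(inode->i_state & I_NEW));
                  inode->i_state &= ~I_NEW;       /* I_CREATING stays set */
                  smp_mb();
                  wake_up_bit(&inode->i_state, __I_NEW);
                  spin_unlock(&inode->i_lock);
                  iput(inode);
          }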
  4. 02 Aug 2018, 1 commit
    • kill d_instantiate_no_diralias() · c971e6a0
      Committed by Al Viro
      The only user is fuse_create_new_entry(), and there it's used to
      mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
      The same solution applies - unhash the mkdir argument, then
      call d_splice_alias(), and if that returns a reference to a preexisting
      alias, dput() it and report success.  Leaving the ->mkdir() argument
      unhashed and negative, with the preexisting alias moved into the right
      place, is just fine from the ->mkdir() caller's point of view.
      
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
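      
      The replacement pattern, sketched (d_drop(), d_splice_alias() and
      dput() are the real primitives; fuse-specific details are elided):
      
          d_drop(entry);                  /* unhash the mkdir argument */
          alias = d_splice_alias(inode, entry);
          if (IS_ERR(alias))
                  return PTR_ERR(alias);
          if (alias)
                  dput(alias);    /* preexisting alias moved into place:
                                   * report success */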
  5. 14 May 2018, 1 commit
    • get rid of dead code in d_find_alias() · 61fec493
      Committed by Al Viro
      All "try disconnected alias if nothing else fits" logics in d_find_alias()
      got accidentally disabled by Neil a while ago; for most of the callers it
      was the right thing to do, so fixes belong in few callers that *do* want
      disconnected aliases.  This just takes the now-dead code in d_find_alias()
      out.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
  6. 12 May 2018, 1 commit
    • do d_instantiate/unlock_new_inode combinations safely · 1e2e547a
      Committed by Al Viro
      For anything NFS-exported we do _not_ want to unlock a new inode
      before it has grown an alias; the original set of fixes got the
      ordering right, but missed a nasty complication in case of
      lockdep being enabled - unlock_new_inode() does
      	lockdep_annotate_inode_mutex_key(inode)
      which can only be done before anyone gets a chance to touch
      ->i_mutex.  Unfortunately, flipping the order and doing
      unlock_new_inode() before d_instantiate() opens a window where
      mkdir can race with open-by-fhandle on a guessed fhandle, leading
      to multiple aliases for a directory inode and all the breakage
      that follows from that.
      
      	Correct solution: a new primitive (d_instantiate_new())
      combining these two in the right order - lockdep annotate, then
      d_instantiate(), then the rest of unlock_new_inode().  All
      combinations of d_instantiate() with unlock_new_inode() should
      be converted to that.
      
      Cc: stable@kernel.org	# 2.6.29 and later
      Tested-by: Mike Marshall <hubcap@omnibond.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
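      
      The combined primitive, sketched (assuming the standard
      __d_instantiate() internals and the I_NEW wakeup from
      unlock_new_inode()):
      
          void d_instantiate_new(struct dentry *entry, struct inode *inode)
          {
                  BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
                  BUG_ON(!inode);
                  lockdep_annotate_inode_mutex_key(inode);
                  spin_lock(&inode->i_lock);
                  __d_instantiate(entry, inode);
                  WARN_ON(!(inode->i_state & I_NEW));
                  inode->i_state &= ~I_NEW;
                  smp_mb();
                  wake_up_bit(&inode->i_state, __I_NEW);
                  spin_unlock(&inode->i_lock);
          }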
  7. 20 Apr 2018, 1 commit
  8. 16 Apr 2018, 4 commits
  9. 12 Apr 2018, 2 commits
    • fs/dcache.c: add cond_resched() in shrink_dentry_list() · 32785c05
      Committed by Nikolay Borisov
      As previously reported (https://patchwork.kernel.org/patch/8642031/),
      it's possible to call shrink_dentry_list() with a large number of
      dentries (> 10000).  This, in turn, can trigger the softlockup detector
      and possibly a panic.  In addition to the unmount path being vulnerable
      to this scenario, at SuSE we've observed a similar situation happening
      during process exit on processes that touch a lot of dentries.  Here is
      an excerpt from a crash dump.  The number after each colon is the number
      of dentries on the list passed to shrink_dentry_list():
      
      PID 99760: 10722
      PID 107530: 215
      PID 108809: 24134
      PID 108877: 21331
      PID 141708: 16487
      
      So we may be asked to kill between 15k and 25k dentries without yielding.
      
      And one possible call stack looks like:
      
      4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
      5 [ffff8839ece41db0] evict at ffffffff811c3026
      6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
      7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
      8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
      9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
      10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
      11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
      12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
      13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909
      
      While some of the callers of shrink_dentry_list() do use cond_resched(),
      this is not sufficient to prevent softlockups.  So just move
      cond_resched() into shrink_dentry_list() from its callers.
      
      David Rientjes said: "I've found hundreds of occurrences of warnings
      that we emit when need_resched stays set for a prolonged period of
      time, with the stack trace that is included in the changelog."
      
      Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
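      
      The change itself is a single cond_resched() at the top of the loop;
      sketched:
      
          static void shrink_dentry_list(struct list_head *list)
          {
                  while (!list_empty(list)) {
                          struct dentry *dentry;
      
                          cond_resched();         /* yield between dentries */
                          dentry = list_entry(list->prev, struct dentry, d_lru);
                          ...
                  }
          }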
    • dcache: account external names as indirectly reclaimable memory · f1782c9b
      Committed by Roman Gushchin
      I received a report about suspicious growth of unreclaimable slabs on
      some machines.  I've found that it happens on machines with low memory
      pressure, and these unreclaimable slabs are external names attached to
      dentries.
      
      External names are allocated using the generic kmalloc() function, so
      they are accounted as unreclaimable.  But they are held by dentries,
      which are reclaimable, and they will be reclaimed under memory
      pressure.
      
      In particular, this breaks the MemAvailable calculation, as it doesn't
      take unreclaimable slabs into account.  This leads to a silly situation
      where a machine is almost idle, has no memory pressure and therefore
      has a big dentry cache, yet the resulting MemAvailable is too low to
      start a new workload.
      
      To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
      used to track the amount of memory consumed by external names.  The
      counter is increased in the dentry allocation path, if an external name
      structure is allocated, and decreased in the dentry freeing path.
      
      To reproduce the problem I've used the following Python script:
      
        import os
      
        for iter in range (0, 10000000):
            try:
                name = ("/some_long_name_%d" % iter) + "_" * 220
                os.stat(name)
            except Exception:
                pass
      
      Without this patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7811688 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    2753052 kB
      
      With the patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7809516 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7749144 kB
      
      [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
        Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
      [guro@fb.com: fix indirectly reclaimable memory accounting]
        Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
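      
      The accounting hook is small; roughly (hedged: paraphrased, assuming
      the counter introduced earlier in this series and the external-name
      allocation in __d_alloc()):
      
          /* __d_alloc(): after the external name is kmalloc()ed */
          mod_node_page_state(page_pgdat(virt_to_page(name)),
                              NR_INDIRECTLY_RECLAIMABLE_BYTES,
                              ksize(name));
      
          /* the dentry freeing path makes the same call with a
           * negated size when the external name is released */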
  10. 30 Mar 2018, 12 commits
    • d_genocide: move export to definition · cbd4a5bc
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • 42177007
    • make non-exchanging __d_move() copy ->d_parent rather than swap them · 076515fc
      Committed by Al Viro
      Currently d_move(from, to) does the following:
      	* name/parent of from <- old name/parent of to; from is hashed there
      	* to is unhashed
      	* name of to is preserved
      	* if from used to be detached, to gets detached
      	* if from used to be attached, parent of to <- old parent of from.
      
      That's user-visibly bogus and complicates reasoning a lot.
      Much saner semantics would be:
      	* name/parent of from <- name/parent of to; from is hashed there.
      	* to is unhashed
      	* name/parent of to is unchanged.
      
      The price, of course, is that old parent of from might lose a reference.
      However,
      	* all potentially cross-directory callers of d_move() have both
      parents pinned directly; typically, dentries themselves are grabbed
      only after we have grabbed and locked both parents.  IOW, the decrement
      of old parent's refcount in case of d_move() won't reach zero.
      	* __d_move() from d_splice_alias() is done to a detached alias;
      no refcount decrements in that case.
      	* __d_move() from __d_unalias() *can* get the refcount to zero.
      So let's grab a reference to the alias's old parent before calling
      __d_unalias() and dput() it after we've dropped rename_lock.
      
      That does make d_splice_alias() potentially blocking.  However, it has
      no callers in non-sleepable contexts (and the case where we'd grown
      that dget/dput pair is _very_ rare, so performance is not an issue).
      
      Another thing that needs adjustment is the unlocking at the end of
      __d_move(); folded it in.  Also cleaned out the remnants of bogus
      ordering from the "lock them in the beginning" counterpart - it's
      never been right, and by now (well, for 7 years now) that thing is
      always serialized on rename_lock anyway.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
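      
      The dget()/dput() pair described above, sketched in its
      d_splice_alias() surroundings (error handling elided):
      
          struct dentry *old_parent = dget(alias->d_parent);
          err = __d_unalias(inode, dentry, alias);
          write_sequnlock(&rename_lock);
          ...
          dput(old_parent);       /* may drop the last reference, hence
                                   * potentially blocking */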
    • split d_path() and friends into a separate file · 7a5cf791
      Committed by Al Viro
      Those parts of fs/dcache.c are pretty much self-contained.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache.c: trim includes · 43986d63
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs/dcache: Avoid a try_lock loop in shrink_dentry_list() · 8f04da2a
      Committed by John Ogness
      shrink_dentry_list() holds dentry->d_lock and needs to acquire
      dentry->d_inode->i_lock. This cannot be done with a spin_lock()
      operation because it's the reverse of the regular lock order.
      To avoid ABBA deadlocks it is done with a trylock loop.
      
      Trylock loops are problematic in two scenarios:
      
        1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
           preemptible.  As a consequence the i_lock holder can be preempted
           by a higher-priority task.  If that task executes the trylock loop
           it will do so forever and livelock.
      
        2) In virtual machines trylock loops are problematic as well.  The
           VCPU on which the i_lock holder runs can be scheduled out, and a
           task on a different VCPU can then loop for a whole time slice.  In
           the worst case this can lead to starvation.  Commits 47be6184
           ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b
           ("shrink_dentry_list(): take parent's d_lock earlier") address
           exactly those symptoms.
      
      Avoid the trylock loop by using dentry_kill().  When pruning ancestors,
      the same code applies that is used to kill a dentry in dput().  This
      also has the benefit that the locking order is now the same: first
      the inode is locked, then the parent.
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
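      
      The ancestor-pruning loop then becomes the same shape as dput();
      sketched (lockref_put_or_lock() drops a reference, or takes ->d_lock
      when the count would hit zero):
      
          dentry = parent;
          while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
                  dentry = dentry_kill(dentry);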
    • get rid of trylock loop around dentry_kill() · f657a666
      Committed by Al Viro
      In the case when the trylock in there fails, deal with it directly in
      dentry_kill().  Note that in cases when we drop and retake ->d_lock,
      we need to recheck whether to retain the dentry.  Another thing is
      that dropping/retaking ->d_lock might end up with a negative dentry
      turning into a positive one; that, of course, can happen only once...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • handle move to LRU in retain_dentry() · 62d9956c
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • split the slow part of lock_parent() off · 8b987a46
      Committed by Al Viro
      Turn the "trylock failed" part into uninlined __lock_parent().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • now lock_parent() can't run into killed dentry · 65d8eb5a
      Committed by Al Viro
      All remaining callers hold either a reference or ->i_lock.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of trylock loop in locking dentries on shrink list · 3b3f09f4
      Committed by Al Viro
      In case of trylock failure don't re-add to the list - drop the locks
      and carefully get them in the right order.  For shrink_dentry_list(),
      somebody having grabbed a reference to the dentry means that we can
      kick it off-list, so if we find the dentry being modified under us we
      don't need to play silly buggers with retries anyway - off the list
      it is.
      
      The locking logic is taken out into a helper of its own; lock_parent()
      is no longer used for dentries that can be killed under us.
      
      [fix from Eric Biggers folded]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
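      
      The helper, sketched from the description (hedged: reconstructed, not
      the verbatim code; on trylock failure it drops ->d_lock and retakes
      the locks in inode-then-dentry order, rechecking as it goes):
      
          static bool shrink_lock_dentry(struct dentry *dentry)
          {
                  struct inode *inode = dentry->d_inode;
      
                  if (dentry->d_lockref.count)
                          return false;   /* somebody grabbed it: off-list */
                  if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
                          /* drop ->d_lock, take i_lock, retake ->d_lock,
                           * then recheck that nothing changed under us */
                          spin_unlock(&dentry->d_lock);
                          spin_lock(&inode->i_lock);
                          spin_lock(&dentry->d_lock);
                          if (dentry->d_lockref.count ||
                              inode != dentry->d_inode) {
                                  spin_unlock(&inode->i_lock);
                                  return false;
                          }
                  }
                  /* ... and the same careful dance for the parent's ->d_lock */
                  return true;
          }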
  11. 12 Mar 2018, 4 commits
  12. 26 Feb 2018, 2 commits
  13. 24 Feb 2018, 1 commit
    • lock_parent() needs to recheck if dentry got __dentry_kill'ed under it · 3b821409
      Committed by Al Viro
      In the case when the dentry passed to lock_parent() is protected from
      freeing only by the fact that it's on a shrink list, and the trylock
      of the parent fails, we could get hit by __dentry_kill() (and a
      subsequent dentry_kill(parent)) between unlocking the dentry and
      locking the presumed parent.  We need to recheck that the dentry is
      alive once we lock both it and the parent *and* postpone
      rcu_read_unlock() until after that point.  Otherwise we could return
      a pointer to a struct dentry that is already RCU-scheduled for freeing,
      with ->d_lock held on it; the caller's subsequent attempt to unlock it
      can end up with memory corruption.
      
      Cc: stable@vger.kernel.org # 3.12+, counting backports
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
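      
      The shape of the fix, sketched (hedged: reconstructed from the
      description; the liveness check assumes a dentry already through
      __dentry_kill() is marked by a negative ->d_lockref.count):
      
          /* lock_parent(), slow path, after retaking both locks */
          if (parent != dentry) {
                  spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
                  if (unlikely(dentry->d_lockref.count < 0)) {
                          /* dentry got killed under us */
                          spin_unlock(&parent->d_lock);
                          parent = NULL;
                  }
          } else {
                  parent = NULL;
          }
          rcu_read_unlock();      /* only now is it safe to drop RCU */
          return parent;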
  14. 02 Feb 2018, 2 commits
  15. 26 Jan 2018, 2 commits
  16. 24 Jan 2018, 2 commits
  17. 16 Jan 2018, 2 commits
    • vfs: Define usercopy region in names_cache slab caches · 6a9b8820
      Committed by David Windsor
      VFS pathnames are stored in the names_cache slab cache, either inline
      or across an entire allocation entry (when approaching PATH_MAX). These
      are copied to/from userspace, so they must be entirely whitelisted.
      
      cache object allocation:
          include/linux/fs.h:
              #define __getname()    kmem_cache_alloc(names_cachep, GFP_KERNEL)
      
      example usage trace:
          strncpy_from_user+0x4d/0x170
          getname_flags+0x6f/0x1f0
          user_path_at_empty+0x23/0x40
          do_mount+0x69/0xda0
          SyS_mount+0x83/0xd0
      
          fs/namei.c:
              getname_flags(...):
                  ...
                  result = __getname();
                  ...
                  kname = (char *)result->iname;
                  result->name = kname;
                  len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
                  ...
                  if (unlikely(len == EMBEDDED_NAME_MAX)) {
                      const size_t size = offsetof(struct filename, iname[1]);
                      kname = (char *)result;
      
                      result = kzalloc(size, GFP_KERNEL);
                      ...
                      result->name = kname;
                      len = strncpy_from_user(kname, filename, PATH_MAX);
      
      In support of usercopy hardening, this patch defines the entire cache
      object in the names_cache slab cache as whitelisted, since it may entirely
      hold name strings to be copied to/from userspace.
      
      This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: David Windsor <dave@nullcore.net>
      [kees: adjust commit log, add usage trace]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
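      
      The whitelisting is a one-call change at cache creation; a sketch
      using the real kmem_cache_create_usercopy() API, with the usercopy
      region covering the whole object as the description says:
      
          names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX,
                              0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                              0, PATH_MAX, NULL);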
    • dcache: Define usercopy region in dentry_cache slab cache · 80344266
      Committed by David Windsor
      When a dentry name is short enough, it can be stored directly in the
      dentry itself (instead of in a separate kmalloc allocation).  These
      dentry short names, stored in struct dentry.d_iname and therefore
      contained in the dentry_cache slab cache, need to be copied to userspace.
      
      cache object allocation:
          fs/dcache.c:
              __d_alloc(...):
                  ...
                  dentry = kmem_cache_alloc(dentry_cache, ...);
                  ...
                  dentry->d_name.name = dentry->d_iname;
      
      example usage trace:
          filldir+0xb0/0x140
          dcache_readdir+0x82/0x170
          iterate_dir+0x142/0x1b0
          SyS_getdents+0xb5/0x160
      
          fs/readdir.c:
              (called via ctx.actor by dir_emit)
              filldir(..., const char *name, ...):
                  ...
                  copy_to_user(..., name, namlen)
      
          fs/libfs.c:
              dcache_readdir(...):
                  ...
                  next = next_positive(dentry, p, 1)
                  ...
                  dir_emit(..., next->d_name.name, ...)
      
      In support of usercopy hardening, this patch defines a region in the
      dentry_cache slab cache in which userspace copy operations are allowed.
      
      This region is known as the slab cache's usercopy region. Slab caches can
      now check that each dynamic copy operation involving cache-managed memory
      falls entirely within the slab's usercopy region.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: David Windsor <dave@nullcore.net>
      [kees: adjust hunks for kmalloc-specific things moved later]
      [kees: adjust commit log, provide usage trace]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
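      
      Here the usercopy region covers only d_iname; a sketch using the
      KMEM_CACHE_USERCOPY() convenience macro (flags as in mainline of that
      era, not guaranteed verbatim):
      
          dentry_cache = KMEM_CACHE_USERCOPY(dentry,
                              SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD,
                              d_iname);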