提交 · 7f879e1a94ac99586abf0659c03f35c1e48279c4 · openeuler / Kernel

01 2月, 2020 1 次提交

mm/filemap.c: clean up filemap_write_and_wait() · ddf8f376

由 Ira Weiny 提交于 1月 30, 2020

At some point filemap_write_and_wait() and
filemap_write_and_wait_range() got the exact same implementation with
the exception of the range being specified in *_range()

Similar to other functions in fs.h which call *_range(..., 0,
LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range()

Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.comSigned-off-by: NIra Weiny <ira.weiny@intel.com>
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ddf8f376

24 1月, 2020 1 次提交

vfs: add vfs_iocb_iter_[read|write] helper functions · 5dcdc43e

由 Jiufei Xue 提交于 11月 20, 2019

This doesn't cause any behavior changes and will be used by overlay async
IO implementation.
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

5dcdc43e

03 1月, 2020 1 次提交

compat_ioctl: simplify the implementation · 77b90401

由 Arnd Bergmann 提交于 11月 27, 2019

Now that both native and compat ioctl syscalls are
in the same file, a couple of simplifications can
be made, bringing the implementation closer together:

- do_vfs_ioctl(), ioctl_preallocate(), and compat_ioctl_preallocate()
  can become static, allowing the compiler to optimize better

- slightly update the coding style for consistency between
  the functions.

- rather than listing each command in two switch statements
  for the compat case, just call a single function that has
  all the common commands.

As a side-effect, FS_IOC_RESVSP/FS_IOC_RESVSP64 are now available
to x86 compat tasks, along with FS_IOC_RESVSP_32/FS_IOC_RESVSP64_32.
This is harmless for i386 emulation, and can be considered a bugfix
for x32 emulation, which never supported these in the past.
Reviewed-by: NBen Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>

77b90401

08 12月, 2019 1 次提交

sched/rt, fs: Use CONFIG_PREEMPTION · 2496396f

由 Thomas Gleixner 提交于 10月 15, 2019

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the i_size() and part_nr_sects_…() code over to use
CONFIG_PREEMPTION. Update the comment for fsstack_copy_inode_size() also
to refer to CONFIG_PREEMPTION.

[bigeasy: +PREEMPT comments]
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-24-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

2496396f

01 12月, 2019 1 次提交

fs/direct-io.c: keep dio_warn_stale_pagecache() when CONFIG_BLOCK=n · a92853b6

由 Konstantin Khlebnikov 提交于 11月 30, 2019

This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.

See also commit 5a9d929d ("iomap: report collisions between directio
and buffered writes to userspace").

Direct I/O is supported by non-disk filesystems, for example NFS. Thus
generic code needs this even in kernel without CONFIG_BLOCK.

Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzzSigned-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a92853b6

27 11月, 2019 1 次提交

Revert "vfs: properly and reliably lock f_pos in fdget_pos()" · 2be7d348

由 Linus Torvalds 提交于 11月 26, 2019

This reverts commit 0be0ee71.

I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.

While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads.  Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service.  In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.

There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor.  Which doesn't work well for
them.

The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side).  It may be that it's just this that causes problems, but we
clearly weren't ready yet.

Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:

  /dev/fuse		(fuse_dev_operations)
  /dev/mcelog		(mce_chrdev_ops)
  /dev/mei0		(mei_fops)
  /dev/net/tun		(tun_fops)
  /dev/nvme0		(nvme_dev_fops)
  /dev/tpm0		(tpm_fops)
  /proc/self/ns/mnt	(ns_file_operations)
  /dev/snd/pcm*		(snd_pcm_f_ops[])

and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.

Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.

We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).
Reported-by: NKenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2be7d348

26 11月, 2019 1 次提交

vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71

由 Linus Torvalds 提交于 11月 11, 2019

fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).

However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.

The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file.  Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.

The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it.  With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.

Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()".  We've done that for all the obvious and common
situations, we may need a few more.  Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):

 "And I appreciate if people could help at least somehow with "getting
  rid of mixed case entirely" (i.e. always lock f_pos_lock on
  !FMODE_STREAM), because this transition starts to diverge from my
  particular use-case too far. To me it makes sense to do that
  transition as follows:

   - convert nonseekable_open -> stream_open via stream_open.cocci;
   - audit other nonseekable_open calls and convert left users that
     truly don't depend on position to stream_open;
   - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
     will cover pipes and sockets), or maybe convert pipes and sockets
     to FMODE_STREAM manually;
   - extend stream_open.cocci to analyze file_operations that use
     no_llseek or noop_llseek, but do not use nonseekable_open or
     alloc_file_pseudo. This might find files that have stream semantic
     but are opened differently;
   - extend stream_open.cocci to analyze file_operations whose
     .read/.write do not use ppos at all (independently of how file was
     opened);
   - ...
   - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
     !FMODE_STREAM;
   - gather bug reports for deadlocked read/write and convert missed
     cases to FMODE_STREAM, probably extending stream_open.cocci along
     the road to catch similar cases

  i.e. always take f_pos_lock unless a file is explicitly marked as
  being stream, and try to find and cover all files that are streams"

We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.

The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".

The second test was correct, although a bit subtle and worth just
re-iterating here.  There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).

The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.

But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations.  Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.

Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0be0ee71

14 11月, 2019 2 次提交

block: remove (__)blkdev_reread_part as an exported API · f0b870df

由 Christoph Hellwig 提交于 11月 14, 2019

In general drivers should never mess with partition tables directly.
Unfortunately s390 and loop do for somewhat historic reasons, but they
can use bdev_disk_changed directly instead when we export it as they
satisfy the sanity checks we have in __blkdev_reread_part.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>	[dasd]
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f0b870df

block: move rescan_partitions to fs/block_dev.c · a1548b67

由 Christoph Hellwig 提交于 11月 14, 2019

Large parts of rescan_partitions aren't about partitions, and
moving it to block_dev.c will allow for some further cleanups by
merging it into its only caller.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a1548b67

28 10月, 2019 1 次提交

fs: add generic UNRESVSP and ZERO_RANGE ioctl handlers · 837a6e7f

由 Christoph Hellwig 提交于 10月 24, 2019

These use the same scheme as the pre-existing mapping of the XFS
RESVP ioctls to ->falloc, so just extend it and remove the XFS
implementation.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
[darrick: fix compile error on s390]
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

837a6e7f

23 10月, 2019 1 次提交

compat_ioctl: add compat_ptr_ioctl() · 2952db0f

由 Arnd Bergmann 提交于 9月 11, 2018

Many drivers have ioctl() handlers that are completely compatible between
32-bit and 64-bit architectures, except for the argument that is passed
down from user space and may have to be passed through compat_ptr()
in order to become a valid 64-bit pointer.

Using ".compat_ptr = compat_ptr_ioctl" in file operations should let
us simplify a lot of those drivers to avoid #ifdef checks, and convert
additional drivers that don't have proper compat handling yet.

On most architectures, the compat_ptr_ioctl() just passes all arguments
to the corresponding ->ioctl handler. The exception is arch/s390, where
compat_ptr() clears the top bit of a 32-bit pointer value, so user space
pointers to the second 2GB alias the first 2GB, as is the case for native
32-bit s390 user space.

The compat_ptr_ioctl() function must therefore be used only with
ioctl functions that either ignore the argument or pass a pointer to a
compatible data type.

If any ioctl command handled by fops->unlocked_ioctl passes a plain
integer instead of a pointer, or any of the passed data types is
incompatible between 32-bit and 64-bit architectures, a proper handler
is required instead of compat_ptr_ioctl.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
---
v3: add a better description
v2: use compat_ptr_ioctl instead of generic_compat_ioctl_ptrarg,
as suggested by Al Viro

2952db0f

25 9月, 2019 1 次提交

mm,thp: avoid writes to file with THP in pagecache · 09d91cda

由 Song Liu 提交于 9月 23, 2019

In previous patch, an application could put part of its text section in
THP via madvise().  These THPs will be protected from writes when the
application is still running (TXTBSY).  However, after the application
exits, the file is available for writes.

This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write.  A new counter nr_thps is added to struct
address_space.  In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.

Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.comSigned-off-by: NSong Liu <songliubraving@fb.com>
Reported-by: Nkbuild test robot <lkp@intel.com>
Acked-by: NRik van Riel <riel@surriel.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

09d91cda

31 8月, 2019 1 次提交

fs: Export generic_fadvise() · cf1ea059

由 Jan Kara 提交于 8月 29, 2019

Filesystems will need to call this function from their fadvise handlers.

CC: stable@vger.kernel.org
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

cf1ea059

30 8月, 2019 2 次提交

vfs: Add timestamp_truncate() api · 50e17c00

由 Deepa Dinamani 提交于 1月 21, 2018

timespec_trunc() function is used to truncate a
filesystem timestamp to the right granularity.
But, the function does not clamp tv_sec part of the
timestamps according to the filesystem timestamp limits.

The replacement api: timestamp_truncate() also alters the
signature of the function to accommodate filesystem
timestamp clamping according to flesystem limits.

Note that the tv_nsec part is set to 0 if tv_sec is not within
the range supported for the filesystem.
Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
Acked-by: NJeff Layton <jlayton@kernel.org>

50e17c00

vfs: Add file timestamp range support · 188d20bc

由 Deepa Dinamani 提交于 1月 21, 2018

Add fields to the superblock to track the min and max
timestamps supported by filesystems.

Initially, when a superblock is allocated, initialize
it to the max and min values the fields can hold.
Individual filesystems override these to match their
actual limits.

Pseudo filesystems are assumed to always support the
min and max allowable values for the fields.
Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
Acked-by: NJeff Layton <jlayton@kernel.org>

188d20bc

20 8月, 2019 1 次提交

vfs: don't allow writes to swap files · dc617f29

由 Darrick J. Wong 提交于 8月 20, 2019

Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

dc617f29

19 8月, 2019 1 次提交

locks: create a new notifier chain for lease attempts · 18f6622e

由 Jeff Layton 提交于 8月 18, 2019

With the new file caching infrastructure in nfsd, we can end up holding
files open for an indefinite period of time, even when they are still
idle. This may prevent the kernel from handing out leases on the file,
which is something we don't want to block.

Fix this by running a SRCU notifier call chain whenever on any
lease attempt. nfsd can then purge the cache for that inode before
returning.

Since SRCU is only conditionally compiled in, we must only define the
new chain if it's enabled, and users of the chain must ensure that
SRCU is enabled.
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

18f6622e

13 8月, 2019 1 次提交

fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl · 22d94f49

由 Eric Biggers 提交于 8月 04, 2019

Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY.  This ioctl adds an
encryption key to the filesystem's fscrypt keyring ->s_master_keys,
making any files encrypted with that key appear "unlocked".

Why we need this
~~~~~~~~~~~~~~~~

The main problem is that the "locked/unlocked" (ciphertext/plaintext)
status of encrypted files is global, but the fscrypt keys are not.
fscrypt only looks for keys in the keyring(s) the process accessing the
filesystem is subscribed to: the thread keyring, process keyring, and
session keyring, where the session keyring may contain the user keyring.

Therefore, userspace has to put fscrypt keys in the keyrings for
individual users or sessions.  But this means that when a process with a
different keyring tries to access encrypted files, whether they appear
"unlocked" or not is nondeterministic.  This is because it depends on
whether the files are currently present in the inode cache.

Fixing this by consistently providing each process its own view of the
filesystem depending on whether it has the key or not isn't feasible due
to how the VFS caches work.  Furthermore, while sometimes users expect
this behavior, it is misguided for two reasons.  First, it would be an
OS-level access control mechanism largely redundant with existing access
control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
Encryption is actually for protecting the data at rest.

Second, almost all users of fscrypt actually do need the keys to be
global.  The largest users of fscrypt, Android and Chromium OS, achieve
this by having PID 1 create a "session keyring" that is inherited by
every process.  This works, but it isn't scalable because it prevents
session keyrings from being used for any other purpose.

On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
similarly abuse the session keyring, so to make 'sudo' work on all
systems it has to link all the user keyrings into root's user keyring
[2].  This is ugly and raises security concerns.  Moreover it can't make
the keys available to system services, such as sshd trying to access the
user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
read certificates from the user's home directory (see [5]); or to Docker
containers (see [6], [7]).

By having an API to add a key to the *filesystem* we'll be able to fix
the above bugs, remove userspace workarounds, and clearly express the
intended semantics: the locked/unlocked status of an encrypted directory
is global, and encryption is orthogonal to OS-level access control.

Why not use the add_key() syscall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use an ioctl for this API rather than the existing add_key() system
call because the ioctl gives us the flexibility needed to implement
fscrypt-specific semantics that will be introduced in later patches:

- Supporting key removal with the semantics such that the secret is
  removed immediately and any unused inodes using the key are evicted;
  also, the eviction of any in-use inodes can be retried.

- Calculating a key-dependent cryptographic identifier and returning it
  to userspace.

- Allowing keys to be added and removed by non-root users, but only keys
  for v2 encryption policies; and to prevent denial-of-service attacks,
  users can only remove keys they themselves have added, and a key is
  only really removed after all users who added it have removed it.

Trying to shoehorn these semantics into the keyrings syscalls would be
very difficult, whereas the ioctls make things much easier.

However, to reuse code the implementation still uses the keyrings
service internally.  Thus we get lockless RCU-mode key lookups without
having to re-implement it, and the keys automatically show up in
/proc/keys for debugging purposes.

References:

    [1] https://github.com/google/fscrypt
    [2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
    [3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
    [4] https://github.com/google/fscrypt/issues/116
    [5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
    [6] https://github.com/google/fscrypt/issues/128
    [7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystemReviewed-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NEric Biggers <ebiggers@google.com>

22d94f49

31 7月, 2019 1 次提交

loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0

由 Jan Kara 提交于 7月 30, 2019

Commit 33ec3e53 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:

for i in {a..z}{a..z}; do
        dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
        mkfs.ext2 $i.image
        mkdir mnt$i
done

echo "Run"
for i in {a..z}{a..z}; do
        mount -o loop -t ext2 $i.image mnt$i &
done

Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.

Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

89e524c0

29 7月, 2019 1 次提交

fs-verity: add inode and superblock fields · 5585f2af

由 Eric Biggers 提交于 7月 22, 2019

Analogous to fs/crypto/, add fields to the VFS inode and superblock for
use by the fs/verity/ support layer:

- ->s_vop: points to the fsverity_operations if the filesystem supports
  fs-verity, otherwise is NULL.

- ->i_verity_info: points to cached fs-verity information for the inode
  after someone opens it, otherwise is NULL.

- S_VERITY: bit in ->i_flags that identifies verity inodes, even when
  they haven't been opened yet and thus still have NULL ->i_verity_info.
Reviewed-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: NEric Biggers <ebiggers@google.com>

5585f2af

04 7月, 2019 1 次提交

locks: Cleanup lm_compare_owner and lm_owner_key · f85d9338

由 Benjamin Coddington 提交于 5月 23, 2019

After the update to use nlm_lockowners for the NLM server, there are no
more users of lm_compare_owner and lm_owner_key.
Signed-off-by: NBenjamin Coddington <bcodding@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

f85d9338

01 7月, 2019 2 次提交

vfs: create a generic checking function for FS_IOC_FSSETXATTR · 7b0e492e

由 Darrick J. Wong 提交于 7月 01, 2019

Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation
behaviors.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NJan Kara <jack@suse.cz>

7b0e492e

vfs: create a generic checking and prep function for FS_IOC_SETFLAGS · 5aca2842

由 Darrick J. Wong 提交于 7月 01, 2019

Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.

Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NDavid Sterba <dsterba@suse.com>
Reviewed-by: NBob Peterson <rpeterso@redhat.com>

5aca2842

21 6月, 2019 1 次提交

mm: add filemap_fdatawait_range_keep_errors() · aa0bfcd9

由 Ross Zwisler 提交于 6月 20, 2019

In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.
Signed-off-by: NRoss Zwisler <zwisler@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org

aa0bfcd9

19 6月, 2019 1 次提交

locks: eliminate false positive conflicts for write lease · 387e3746

由 Amir Goldstein 提交于 6月 07, 2019

check_conflicting_open() is checking for existing fd's open for read or
for write before allowing to take a write lease.  The check that was
implemented using i_count and d_count is an approximation that has
several false positives.  For example, overlayfs since v4.19, takes an
extra reference on the dentry; An open with O_PATH takes a reference on
the dentry although the file cannot be read nor written.

Change the implementation to use i_readcount and i_writecount to
eliminate the false positive conflicts and allow a write lease to be
taken on an overlayfs file.

The change of behavior with existing fd's open with O_PATH is symmetric
w.r.t. current behavior of lease breakers - an open with O_PATH currently
does not break a write lease.

This increases the size of struct inode by 4 bytes on 32bit archs when
CONFIG_FILE_LOCKING is defined and CONFIG_IMA was not already
defined.
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NJeff Layton <jlayton@kernel.org>

387e3746

10 6月, 2019 4 次提交

vfs: introduce file_modified() helper · e38f7f53

由 Amir Goldstein 提交于 6月 05, 2019

The combination of file_remove_privs() and file_update_mtime() is
quite common in filesystem ->write_iter() methods.

Modelled after the helper file_accessed(), introduce file_modified()
and use it from generic_remap_file_range_prep().

Note that the order of calling file_remove_privs() before
file_update_mtime() in the helper was matched to the more common order by
filesystems and not the current order in generic_remap_file_range_prep().
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

e38f7f53

vfs: add missing checks to copy_file_range · 96e6e8f4

由 Amir Goldstein 提交于 6月 05, 2019

Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.

[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

96e6e8f4

vfs: introduce generic_file_rw_checks() · a3171351

由 Amir Goldstein 提交于 6月 05, 2019

Factor out helper with some checks on in/out file that are
common to clone_file_range and copy_file_range.
Suggested-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

a3171351

vfs: introduce generic_copy_file_range() · f16acc9d

由 Dave Chinner 提交于 6月 05, 2019

Right now if vfs_copy_file_range() does not use any offload
mechanism, it falls back to calling do_splice_direct(). This fails
to do basic sanity checks on the files being copied. Before we
start adding this necessarily functionality to the fallback path,
separate it out into generic_copy_file_range().

generic_copy_file_range() has the same prototype as
->copy_file_range() so that filesystems can use it in their custom
->copy_file_range() method if they so choose.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

f16acc9d

09 6月, 2019 1 次提交

docs: fs: fix broken links to vfs.txt with was renamed to vfs.rst · 5c437fa2

由 Mauro Carvalho Chehab 提交于 6月 07, 2019

A recent documentation conversion renamed this file but forgot
to update the links.

Fixes: af96c1e3 ("docs: filesystems: vfs: Convert vfs.txt to RST")
Signed-off-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

5c437fa2

29 5月, 2019 1 次提交

fanotify: Disallow permission events for proc filesystem · 0b3b094a

由 Jan Kara 提交于 5月 15, 2019

Proc filesystem has special locking rules for various files. Thus
fanotify which opens files on event delivery can easily deadlock
against another process that waits for fanotify permission event to be
handled. Since permission events on /proc have doubtful value anyway,
just disallow them.

Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NJan Kara <jack@suse.cz>

0b3b094a

26 5月, 2019 4 次提交

vfs: Kill sget_userns() · 023d066a

由 David Howells 提交于 3月 25, 2019

Kill sget_userns(), folding it into sget() as that's the only remaining
user.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org

023d066a

A
fold mount_pseudo_xattr() into pseudo_fs_get_tree() · 8d9e46d8
由 Al Viro 提交于 5月 11, 2019
```
... now that all other callers are gone
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
8d9e46d8

vfs: Kill mount_ns() · bb7b6b2b

由 David Howells 提交于 3月 25, 2019

Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount
API.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

bb7b6b2b

mount_pseudo(): drop 'name' argument, switch to d_make_root() · 1f58bb18

由 Al Viro 提交于 5月 20, 2019

Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work.  These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root.  However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it.  Including those that
didn't *have* any non-root dentries to start with...

All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1f58bb18

03 5月, 2019 1 次提交

fs: add sync_file_range() helper · 22f96b38

由 Jens Axboe 提交于 4月 09, 2019

This just pulls out the ksys_sync_file_range() code to work on a struct
file instead of an fd, so we can use it elsewhere.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

22f96b38

02 5月, 2019 1 次提交

new inode method: ->free_inode() · fdb0da89

由 Al Viro 提交于 4月 10, 2019

A lot of ->destroy_inode() instances end with call_rcu() of a callback
that does RCU-delayed part of freeing.  Introduce a new method for
doing just that, with saner signature.

Rules:
->destroy_inode		->free_inode
	f			g		immediate call of f(),
						RCU-delayed call of g()
	f			NULL		immediate call of f(),
						no RCU-delayed calls
	NULL			g		RCU-delayed call of g()
	NULL			NULL		RCU-delayed default freeing

IOW, NULL ->free_inode gives the same behaviour as now.

Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could
mandate the latter form, but that would have very little benefit beyond
making rules a bit more symmetric.  It would break backwards compatibility,
require extra boilerplate and expected semantics for (NULL, NULL) pair
would have no use whatsoever...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

fdb0da89

26 4月, 2019 1 次提交

ext4: Support case-insensitive file name lookups · b886ee3e

由 Gabriel Krisman Bertazi 提交于 4月 25, 2019

This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string.  This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children.  This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries.  This is bad for performance but requires some leveraging of
the vfs layer to fix.  We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware.  The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file.  This means that case-insensitive
file name lookup will not work only for that file.  An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding.  When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in ext4 is implemented
lives in fs/unicode, and is based on a previous version developed by
SGI.  It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for EXT4 because:

  - It has a lower cost than NFC/NFKC (which requires
    decomposing to NFD as an intermediary step)
  - It doesn't eliminate important semantic meaning like
    compatibility decompositions.

Although:

  - This implementation is not completely linguistic accurate, because
  different languages have conflicting rules, which would require the
  specialization of the filesystem to a given locale, which brings all
  sorts of problems for removable media and for users who use more than
  one language.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

b886ee3e

25 4月, 2019 1 次提交

afs: Add file locking tracepoints · d4696601

由 David Howells 提交于 4月 25, 2019

Add two tracepoints for monitoring AFS file locking.  Firstly, add one that
follows the operational part:

    echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_op/enable

And add a second that more follows the event-driven part:

    echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_ev/enable

Individual file_lock structs seen by afs are tagged with debugging IDs that
are displayed in the trace log to make it easier to see what's going on,
especially as setting the first lock always seems to involve copying the
file_lock twice.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

d4696601

13 4月, 2019 1 次提交

fs: drop unused fput_atomic definition · 1caf7a70

由 Lukas Bulwahn 提交于 4月 11, 2019

commit d7065da0 ("get rid of the magic around f_count in aio") added
fput_atomic to include/linux/fs.h, motivated by its use in __aio_put_req()
in fs/aio.c.

Later, commit 3ffa3c0e ("aio: now fput() is OK from interrupt context;
get rid of manual delayed __fput()") removed the only use of fput_atomic
in __aio_put_req(), but did not remove the since then unused fput_atomic
definition in include/linux/fs.h.

We curate this now and finally remove the unused definition.

This issue was identified during a code review due to a coccinelle warning
from the atomic_as_refcounter.cocci rule pointing to the use of atomic_t
in fput_atomic.
Suggested-by: NKrystian Radlak <kradlak@exida.com>
Signed-off-by: NLukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1caf7a70

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功