- 31 7月, 2019 1 次提交
-
-
由 Jan Kara 提交于
Commit 33ec3e53 ("loop: Don't change loop device under exclusive opener") made LOOP_SET_FD ioctl acquire exclusive block device reference while it updates loop device binding. However this can make perfectly valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding temporarily the exclusive bdev reference in cases like this: for i in {a..z}{a..z}; do dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024 mkfs.ext2 $i.image mkdir mnt$i done echo "Run" for i in {a..z}{a..z}; do mount -o loop -t ext2 $i.image mnt$i & done Fix the problem by not getting full exclusive bdev reference in LOOP_SET_FD but instead just mark the bdev as being claimed while we update the binding information. This just blocks new exclusive openers instead of failing them with EBUSY thus fixing the problem. Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener") Cc: stable@vger.kernel.org Tested-by: NKai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 04 7月, 2019 1 次提交
-
-
由 Benjamin Coddington 提交于
After the update to use nlm_lockowners for the NLM server, there are no more users of lm_compare_owner and lm_owner_key. Signed-off-by: NBenjamin Coddington <bcodding@redhat.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 01 7月, 2019 2 次提交
-
-
由 Darrick J. Wong 提交于
Create a generic checking function for the incoming FS_IOC_FSSETXATTR fsxattr values so that we can standardize some of the implementation behaviors. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NJan Kara <jack@suse.cz>
-
由 Darrick J. Wong 提交于
Create a generic function to check incoming FS_IOC_SETFLAGS flag values and later prepare the inode for updates so that we can standardize the implementations that follow ext4's flag values. Note that the efivarfs implementation no longer fails a no-op SETFLAGS without CAP_LINUX_IMMUTABLE since that's the behavior in ext*. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NDavid Sterba <dsterba@suse.com> Reviewed-by: NBob Peterson <rpeterso@redhat.com>
-
- 21 6月, 2019 1 次提交
-
-
由 Ross Zwisler 提交于
In the spirit of filemap_fdatawait_range() and filemap_fdatawait_keep_errors(), introduce filemap_fdatawait_range_keep_errors() which both takes a range upon which to wait and does not clear errors from the address space. Signed-off-by: NRoss Zwisler <zwisler@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
-
- 19 6月, 2019 1 次提交
-
-
由 Amir Goldstein 提交于
check_conflicting_open() is checking for existing fd's open for read or for write before allowing to take a write lease. The check that was implemented using i_count and d_count is an approximation that has several false positives. For example, overlayfs since v4.19, takes an extra reference on the dentry; An open with O_PATH takes a reference on the dentry although the file cannot be read nor written. Change the implementation to use i_readcount and i_writecount to eliminate the false positive conflicts and allow a write lease to be taken on an overlayfs file. The change of behavior with existing fd's open with O_PATH is symmetric w.r.t. current behavior of lease breakers - an open with O_PATH currently does not break a write lease. This increases the size of struct inode by 4 bytes on 32bit archs when CONFIG_FILE_LOCKING is defined and CONFIG_IMA was not already defined. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NJeff Layton <jlayton@kernel.org>
-
- 10 6月, 2019 4 次提交
-
-
由 Amir Goldstein 提交于
The combination of file_remove_privs() and file_update_mtime() is quite common in filesystem ->write_iter() methods. Modelled after the helper file_accessed(), introduce file_modified() and use it from generic_remap_file_range_prep(). Note that the order of calling file_remove_privs() before file_update_mtime() in the helper was matched to the more common order by filesystems and not the current order in generic_remap_file_range_prep(). Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Amir Goldstein 提交于
Like the clone and dedupe interfaces we've recently fixed, the copy_file_range() implementation is missing basic sanity, limits and boundary condition tests on the parameters that are passed to it from userspace. Create a new "generic_copy_file_checks()" function modelled on the generic_remap_checks() function to provide this missing functionality. [Amir] Shorten copy length instead of checking pos_in limits because input file size already abides by the limits. Signed-off-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Amir Goldstein 提交于
Factor out helper with some checks on in/out file that are common to clone_file_range and copy_file_range. Suggested-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Dave Chinner 提交于
Right now if vfs_copy_file_range() does not use any offload mechanism, it falls back to calling do_splice_direct(). This fails to do basic sanity checks on the files being copied. Before we start adding this necessarily functionality to the fallback path, separate it out into generic_copy_file_range(). generic_copy_file_range() has the same prototype as ->copy_file_range() so that filesystems can use it in their custom ->copy_file_range() method if they so choose. Signed-off-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 09 6月, 2019 1 次提交
-
-
由 Mauro Carvalho Chehab 提交于
A recent documentation conversion renamed this file but forgot to update the links. Fixes: af96c1e3 ("docs: filesystems: vfs: Convert vfs.txt to RST") Signed-off-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: NJonathan Corbet <corbet@lwn.net>
-
- 29 5月, 2019 1 次提交
-
-
由 Jan Kara 提交于
Proc filesystem has special locking rules for various files. Thus fanotify which opens files on event delivery can easily deadlock against another process that waits for fanotify permission event to be handled. Since permission events on /proc have doubtful value anyway, just disallow them. Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
- 26 5月, 2019 4 次提交
-
-
由 David Howells 提交于
Kill sget_userns(), folding it into sget() as that's the only remaining user. Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org
-
由 Al Viro 提交于
... now that all other callers are gone Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount API. Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Once upon a time we used to set ->d_name of e.g. pipefs root so that d_path() on pipes would work. These days it's completely pointless - dentries of pipes are not even connected to pipefs root. However, mount_pseudo() had set the root dentry name (passed as the second argument) and callers kept inventing names to pass to it. Including those that didn't *have* any non-root dentries to start with... All of that had been pointless for about 8 years now; it's time to get rid of that cargo-culting... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 03 5月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
This just pulls out the ksys_sync_file_range() code to work on a struct file instead of an fd, so we can use it elsewhere. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 5月, 2019 1 次提交
-
-
由 Al Viro 提交于
A lot of ->destroy_inode() instances end with call_rcu() of a callback that does RCU-delayed part of freeing. Introduce a new method for doing just that, with saner signature. Rules: ->destroy_inode ->free_inode f g immediate call of f(), RCU-delayed call of g() f NULL immediate call of f(), no RCU-delayed calls NULL g RCU-delayed call of g() NULL NULL RCU-delayed default freeing IOW, NULL ->free_inode gives the same behaviour as now. Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could mandate the latter form, but that would have very little benefit beyond making rules a bit more symmetric. It would break backwards compatibility, require extra boilerplate and expected semantics for (NULL, NULL) pair would have no use whatsoever... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 26 4月, 2019 1 次提交
-
-
由 Gabriel Krisman Bertazi 提交于
This patch implements the actual support for case-insensitive file name lookups in ext4, based on the feature bit and the encoding stored in the superblock. A filesystem that has the casefold feature set is able to configure directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups to succeed in that directory in a case-insensitive fashion, i.e: match a directory entry even if the name used by userspace is not a byte per byte match with the disk name, but is an equivalent case-insensitive version of the Unicode string. This operation is called a case-insensitive file name lookup. The feature is configured as an inode attribute applied to directories and inherited by its children. This attribute can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. * dcache handling: For a +F directory, Ext4 only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite which equivalent string was used in a previous lookup, without having to resort to ->lookup(). d_hash() of casefolded directories is implemented as the hash of the casefolded string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare() uses the utf8_strncasecmp() infrastructure, which handles the comparison of equivalent, same case, names as well. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. * on-disk data: Despite using a specific version of the name as the internal representation within the dcache, the name stored and fetched from the disk is a byte-per-byte match with what the user requested, making this implementation 'name-preserving'. i.e. no actual information is lost when writing to storage. DX is supported by modifying the hashes used in +F directories to make them case/encoding-aware. The new disk hashes are calculated as the hash of the full casefolded string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide an exact name. * Dealing with invalid sequences: By default, when a invalid UTF-8 sequence is identified, ext4 will treat it as an opaque byte sequence, ignoring the encoding and reverting to the old behavior for that unique file. This means that case-insensitive file name lookup will not work only for that file. An optional bit can be set in the superblock telling the filesystem code and userspace tools to enforce the encoding. When that optional bit is set, any attempt to create a file name using an invalid UTF-8 sequence will fail and return an error to userspace. * Normalization algorithm: The UTF-8 algorithms used to compare strings in ext4 is implemented lives in fs/unicode, and is based on a previous version developed by SGI. It implements the Canonical decomposition (NFD) algorithm described by the Unicode specification 12.1, or higher, combined with the elimination of ignorable code points (NFDi) and full case-folding (CF) as documented in fs/unicode/utf8_norm.c. NFD seems to be the best normalization method for EXT4 because: - It has a lower cost than NFC/NFKC (which requires decomposing to NFD as an intermediary step) - It doesn't eliminate important semantic meaning like compatibility decompositions. Although: - This implementation is not completely linguistic accurate, because different languages have conflicting rules, which would require the specialization of the filesystem to a given locale, which brings all sorts of problems for removable media and for users who use more than one language. Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 25 4月, 2019 1 次提交
-
-
由 David Howells 提交于
Add two tracepoints for monitoring AFS file locking. Firstly, add one that follows the operational part: echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_op/enable And add a second that more follows the event-driven part: echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_ev/enable Individual file_lock structs seen by afs are tagged with debugging IDs that are displayed in the trace log to make it easier to see what's going on, especially as setting the first lock always seems to involve copying the file_lock twice. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
- 13 4月, 2019 1 次提交
-
-
由 Lukas Bulwahn 提交于
commit d7065da0 ("get rid of the magic around f_count in aio") added fput_atomic to include/linux/fs.h, motivated by its use in __aio_put_req() in fs/aio.c. Later, commit 3ffa3c0e ("aio: now fput() is OK from interrupt context; get rid of manual delayed __fput()") removed the only use of fput_atomic in __aio_put_req(), but did not remove the since then unused fput_atomic definition in include/linux/fs.h. We curate this now and finally remove the unused definition. This issue was identified during a code review due to a coccinelle warning from the atomic_as_refcounter.cocci rule pointing to the use of atomic_t in fput_atomic. Suggested-by: NKrystian Radlak <kradlak@exida.com> Signed-off-by: NLukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 07 4月, 2019 1 次提交
-
-
由 Kirill Smelkov 提交于
fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes to /proc/xen/xenbus") which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006. However even though 2006'th version of Linus's patch was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", the 2014 version - the one that actually made it into the tree as 9c225f26 - is doing so irregardless of whether a file is seekable or not. See https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/ https://lwn.net/Articles/180387 https://lwn.net/Articles/180396 for historic context. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... Despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse: implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: See https://github.com/libfuse/osspd https://lwn.net/Articles/308445 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f26 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in their open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yongzhi Pan <panyongzhi@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Nikolaus Rath <Nikolaus@rath.org> Cc: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: NKirill Smelkov <kirr@nexedi.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 3月, 2019 1 次提交
-
-
由 Al Viro 提交于
open_tree(dfd, pathname, flags) Returns an O_PATH-opened file descriptor or an error. dfd and pathname specify the location to open, in usual fashion (see e.g. fstatat(2)). flags should be an OR of some of the following: * AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW - same meanings as usual * OPEN_TREE_CLOEXEC - make the resulting descriptor close-on-exec * OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE - instead of opening the location in question, create a detached mount tree matching the subtree rooted at location specified by dfd/pathname. With AT_RECURSIVE the entire subtree is cloned, without it - only the part within in the mount containing the location in question. In other words, the same as mount --rbind or mount --bind would've taken. The detached tree will be dissolved on the final close of obtained file. Creation of such detached trees requires the same capabilities as doing mount --bind. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 08 3月, 2019 1 次提交
-
-
由 Rasmus Villemoes 提交于
Instead of doing this compile-time check in some slightly arbitrary user of struct filename, put it next to the definition. Link: http://lkml.kernel.org/r/20190208203015.29702-3-linux@rasmusvillemoes.dkSigned-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 3月, 2019 1 次提交
-
-
由 Greg Thelen 提交于
Commit 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") refers to inode_switch_wb_work_fn() which never got merged. Switch the comments to inode_switch_wbs_work_fn(). Link: http://lkml.kernel.org/r/20190305004617.142590-1-gthelen@google.com Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Signed-off-by: NGreg Thelen <gthelen@google.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 3月, 2019 1 次提交
-
-
由 Linus Torvalds 提交于
Al Viro root-caused a race where the IOCB_CMD_POLL handling of fget/fput() could cause us to access the file pointer after it had already been freed: "In more details - normally IOCB_CMD_POLL handling looks so: 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() 2) aio_poll() resolves the descriptor to struct file by req->file = fget(iocb->aio_fildes) 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that aio_kiocb to 2 (bumps by 1, that is). 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait() had been called and only once") it locks the queue. That's what the extra reference to iocb had been for - we know we can safely access it. 5) With queue locked, we check if ->woken has already been set to true (by aio_poll_wake()) and, if it had been, we unlock the queue, drop a reference to aio_kiocb and bugger off - at that point it's a responsibility to aio_poll_wake() and the stuff called/scheduled by it. That code will drop the reference to file in req->file, along with the other reference to our aio_kiocb. 6) otherwise, we see whether we need to wait. If we do, we unlock the queue, drop one reference to aio_kiocb and go away - eventual wakeup (or cancel) will deal with the reference to file and with the other reference to aio_kiocb 7) otherwise we remove ourselves from waitqueue (still under the queue lock), so that wakeup won't get us. No async activity will be happening, so we can safely drop req->file and iocb ourselves. If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb won't get freed under us, so we can do all the checks and locking safely. And we don't touch ->file if we detect that case. However, vfs_poll() most certainly *does* touch the file it had been given. So wakeup coming while we are still in ->poll() might end up doing fput() on that file. That case is not too rare, and usually we are saved by the still present reference from descriptor table - that fput() is not the final one. But if another thread closes that descriptor right after our fget() and wakeup does happen before ->poll() returns, we are in trouble - final fput() done while we are in the middle of a method: Al also wrote a patch to take an extra reference to the file descriptor to fix this, but I instead suggested we just streamline the whole file pointer handling by submit_io() so that the generic aio submission code simply keeps the file pointer around until the aio has completed. Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL") Acked-by: NAl Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 2月, 2019 5 次提交
-
-
由 Jens Axboe 提交于
Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 David Howells 提交于
The kern_mount_data() isn't used any more so remove it. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
the former is an analogue of mount_{single,nodev} for use in ->get_tree() instances, the latter - analogue of sget() for the same. These are fairly similar to the originals, but the callback signature for sget_fc() is different from sget() ones, so getting bits and pieces shared would be too convoluted; we might get around to that later, but for now let's just remember to keep them in sync. They do live next to each other, and changes in either won't be hard to spot. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
[AV - unfuck kern_mount_data(); we want non-NULL ->mnt_ns on long-living mounts] [AV - reordering fs/namespace.c is badly overdue, but let's keep it separate from that series] [AV - drop simple_pin_fs() change] [AV - clean vfs_kern_mount() failure exits up] Implement a filesystem context concept to be used during superblock creation for mount and superblock reconfiguration for remount. The mounting procedure then becomes: (1) Allocate new fs_context context. (2) Configure the context. (3) Create superblock. (4) Query the superblock. (5) Create a mount for the superblock. (6) Destroy the context. Rather than calling fs_type->mount(), an fs_context struct is created and fs_type->init_fs_context() is called to set it up. Pointers exist for the filesystem and LSM to hang their private data off. A set of operations has to be set by ->init_fs_context() to provide freeing, duplication, option parsing, binary data parsing, validation, mounting and superblock filling. Legacy filesystems are supported by the provision of a set of legacy fs_context operations that build up a list of mount options and then invoke fs_type->mount() from within the fs_context ->get_tree() operation. This allows all filesystems to be accessed using fs_context. It should be noted that, whilst this patch adds a lot of lines of code, there is quite a bit of duplication with existing code that can be eliminated should all filesystems be converted over. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 24 2月, 2019 1 次提交
-
-
由 Christoph Hellwig 提交于
This new methods is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is with a non-null ki_complete) which has the IOCB_HIPRI flag set. The method is assisted by a new ki_cookie field in struct iocb to store the polling cookie. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 31 1月, 2019 4 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Replace do_remount_sb() with a function, reconfigure_super(), that's fs_context aware. The fs_context is expected to be parameterised already and have ->root pointing to the superblock to be reconfigured. A legacy wrapper is provided that is intended to be called from the fs_context ops when those appear, but for now is called directly from reconfigure_super(). This wrapper invokes the ->remount_fs() superblock op for the moment. It is intended that the remount_fs() op will be phased out. The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate that the context is being used for reconfiguration. do_umount_root() is provided to consolidate remount-to-R/O for umount and emergency remount by creating a context and invoking reconfiguration. do_remount(), do_umount() and do_emergency_remount_callback() are switched to use the new process. [AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the umount / bug, gets rid of pointless complexity] [AV -- set ->net_ns in all cases; nfs remount will need that] [AV -- shift security_sb_remount() call into reconfigure_super(); the callers that didn't do security_sb_remount() have NULL fc->security anyway, so it's a no-op for them] Signed-off-by: NDavid Howells <dhowells@redhat.com> Co-developed-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Roll the handling of subtypes into do_new_mount() and vfs_get_tree(). The former determines any subtype string and hangs it off the fs_context; the latter applies it. Make do_new_mount() create, parameterise and commit an fs_context and create a mount for itself rather than calling vfs_kern_mount(). [AV -- missing kstrdup()] [AV -- ... and no kstrdup() if we get to setting ->s_submount - we simply transfer it from fc, leaving NULL behind] [AV -- constify ->s_submount, while we are at it] Reviewed-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Waiman Long 提交于
The list_lru structure is essentially just a pointer to a table of per-node LRU lists. Even if CONFIG_MEMCG_KMEM is defined, the list field is just used for LRU list registration and shrinker_id is set at initialization. Those fields won't need to be touched that often. So there is no point to make the list_lru structures to sit in their own cachelines. Signed-off-by: NWaiman Long <longman@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 1月, 2019 1 次提交
-
-
由 Chandan Rajendra 提交于
In order to have a common code base for fscrypt "post read" processing for all filesystems which support encryption, this commit removes filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION) and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose value affects all the filesystems making use of fscrypt. Reviewed-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NEric Biggers <ebiggers@google.com>
-
- 22 1月, 2019 1 次提交
-
-
由 Phillip Potter 提交于
Many file systems use a copy&paste implementation of dirent to on-disk file type conversions. Create a common implementation to be used by file systems with some useful conversion helpers to reduce open coded file type conversions in file system code. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NPhillip Potter <phil@philpotter.co.uk> Signed-off-by: NJan Kara <jack@suse.cz>
-
- 29 12月, 2018 1 次提交
-
-
由 Jan Kara 提交于
Provide a variant of buffer_migrate_page() that also checks whether there are no unexpected references to buffer heads. This function will then be safe to use for block device pages. [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)] Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 12月, 2018 1 次提交
-
-
由 NeilBrown 提交于
posix_unblock_lock() is not specific to posix locks, and behaves nearly identically to locks_delete_block() - the former returning a status while the later doesn't. So discard posix_unblock_lock() and use locks_delete_block() instead, after giving that function an appropriate return value. Signed-off-by: NNeilBrown <neilb@suse.com> Reviewed-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NJeff Layton <jlayton@kernel.org>
-