1. 31 7月, 2019 1 次提交
    • J
      loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0
      Jan Kara 提交于
      Commit 33ec3e53 ("loop: Don't change loop device under exclusive
      opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
      while it updates loop device binding. However this can make perfectly
      valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
      temporarily the exclusive bdev reference in cases like this:
      
      for i in {a..z}{a..z}; do
              dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
              mkfs.ext2 $i.image
              mkdir mnt$i
      done
      
      echo "Run"
      for i in {a..z}{a..z}; do
              mount -o loop -t ext2 $i.image mnt$i &
      done
      
      Fix the problem by not getting full exclusive bdev reference in
      LOOP_SET_FD but instead just mark the bdev as being claimed while we
      update the binding information. This just blocks new exclusive openers
      instead of failing them with EBUSY thus fixing the problem.
      
      Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
      Cc: stable@vger.kernel.org
      Tested-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      89e524c0
  2. 04 7月, 2019 1 次提交
  3. 01 7月, 2019 2 次提交
  4. 21 6月, 2019 1 次提交
  5. 19 6月, 2019 1 次提交
    • A
      locks: eliminate false positive conflicts for write lease · 387e3746
      Amir Goldstein 提交于
      check_conflicting_open() is checking for existing fd's open for read or
      for write before allowing to take a write lease.  The check that was
      implemented using i_count and d_count is an approximation that has
      several false positives.  For example, overlayfs since v4.19, takes an
      extra reference on the dentry; An open with O_PATH takes a reference on
      the dentry although the file cannot be read nor written.
      
      Change the implementation to use i_readcount and i_writecount to
      eliminate the false positive conflicts and allow a write lease to be
      taken on an overlayfs file.
      
      The change of behavior with existing fd's open with O_PATH is symmetric
      w.r.t. current behavior of lease breakers - an open with O_PATH currently
      does not break a write lease.
      
      This increases the size of struct inode by 4 bytes on 32bit archs when
      CONFIG_FILE_LOCKING is defined and CONFIG_IMA was not already
      defined.
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJeff Layton <jlayton@kernel.org>
      387e3746
  6. 10 6月, 2019 4 次提交
  7. 09 6月, 2019 1 次提交
  8. 29 5月, 2019 1 次提交
  9. 26 5月, 2019 4 次提交
  10. 03 5月, 2019 1 次提交
  11. 02 5月, 2019 1 次提交
    • A
      new inode method: ->free_inode() · fdb0da89
      Al Viro 提交于
      A lot of ->destroy_inode() instances end with call_rcu() of a callback
      that does RCU-delayed part of freeing.  Introduce a new method for
      doing just that, with saner signature.
      
      Rules:
      ->destroy_inode		->free_inode
      	f			g		immediate call of f(),
      						RCU-delayed call of g()
      	f			NULL		immediate call of f(),
      						no RCU-delayed calls
      	NULL			g		RCU-delayed call of g()
      	NULL			NULL		RCU-delayed default freeing
      
      IOW, NULL ->free_inode gives the same behaviour as now.
      
      Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could
      mandate the latter form, but that would have very little benefit beyond
      making rules a bit more symmetric.  It would break backwards compatibility,
      require extra boilerplate and expected semantics for (NULL, NULL) pair
      would have no use whatsoever...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fdb0da89
  12. 26 4月, 2019 1 次提交
    • G
      ext4: Support case-insensitive file name lookups · b886ee3e
      Gabriel Krisman Bertazi 提交于
      This patch implements the actual support for case-insensitive file name
      lookups in ext4, based on the feature bit and the encoding stored in the
      superblock.
      
      A filesystem that has the casefold feature set is able to configure
      directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
      to succeed in that directory in a case-insensitive fashion, i.e: match
      a directory entry even if the name used by userspace is not a byte per
      byte match with the disk name, but is an equivalent case-insensitive
      version of the Unicode string.  This operation is called a
      case-insensitive file name lookup.
      
      The feature is configured as an inode attribute applied to directories
      and inherited by its children.  This attribute can only be enabled on
      empty directories for filesystems that support the encoding feature,
      thus preventing collision of file names that only differ by case.
      
      * dcache handling:
      
      For a +F directory, Ext4 only stores the first equivalent name dentry
      used in the dcache. This is done to prevent unintentional duplication of
      dentries in the dcache, while also allowing the VFS code to quickly find
      the right entry in the cache despite which equivalent string was used in
      a previous lookup, without having to resort to ->lookup().
      
      d_hash() of casefolded directories is implemented as the hash of the
      casefolded string, such that we always have a well-known bucket for all
      the equivalencies of the same string. d_compare() uses the
      utf8_strncasecmp() infrastructure, which handles the comparison of
      equivalent, same case, names as well.
      
      For now, negative lookups are not inserted in the dcache, since they
      would need to be invalidated anyway, because we can't trust missing file
      dentries.  This is bad for performance but requires some leveraging of
      the vfs layer to fix.  We can live without that for now, and so does
      everyone else.
      
      * on-disk data:
      
      Despite using a specific version of the name as the internal
      representation within the dcache, the name stored and fetched from the
      disk is a byte-per-byte match with what the user requested, making this
      implementation 'name-preserving'. i.e. no actual information is lost
      when writing to storage.
      
      DX is supported by modifying the hashes used in +F directories to make
      them case/encoding-aware.  The new disk hashes are calculated as the
      hash of the full casefolded string, instead of the string directly.
      This allows us to efficiently search for file names in the htree without
      requiring the user to provide an exact name.
      
      * Dealing with invalid sequences:
      
      By default, when a invalid UTF-8 sequence is identified, ext4 will treat
      it as an opaque byte sequence, ignoring the encoding and reverting to
      the old behavior for that unique file.  This means that case-insensitive
      file name lookup will not work only for that file.  An optional bit can
      be set in the superblock telling the filesystem code and userspace tools
      to enforce the encoding.  When that optional bit is set, any attempt to
      create a file name using an invalid UTF-8 sequence will fail and return
      an error to userspace.
      
      * Normalization algorithm:
      
      The UTF-8 algorithms used to compare strings in ext4 is implemented
      lives in fs/unicode, and is based on a previous version developed by
      SGI.  It implements the Canonical decomposition (NFD) algorithm
      described by the Unicode specification 12.1, or higher, combined with
      the elimination of ignorable code points (NFDi) and full
      case-folding (CF) as documented in fs/unicode/utf8_norm.c.
      
      NFD seems to be the best normalization method for EXT4 because:
      
        - It has a lower cost than NFC/NFKC (which requires
          decomposing to NFD as an intermediary step)
        - It doesn't eliminate important semantic meaning like
          compatibility decompositions.
      
      Although:
      
        - This implementation is not completely linguistic accurate, because
        different languages have conflicting rules, which would require the
        specialization of the filesystem to a given locale, which brings all
        sorts of problems for removable media and for users who use more than
        one language.
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b886ee3e
  13. 25 4月, 2019 1 次提交
    • D
      afs: Add file locking tracepoints · d4696601
      David Howells 提交于
      Add two tracepoints for monitoring AFS file locking.  Firstly, add one that
      follows the operational part:
      
          echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_op/enable
      
      And add a second that more follows the event-driven part:
      
          echo 1 >/sys/kernel/debug/tracing/events/afs/afs_flock_ev/enable
      
      Individual file_lock structs seen by afs are tagged with debugging IDs that
      are displayed in the trace log to make it easier to see what's going on,
      especially as setting the first lock always seems to involve copying the
      file_lock twice.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      d4696601
  14. 13 4月, 2019 1 次提交
    • L
      fs: drop unused fput_atomic definition · 1caf7a70
      Lukas Bulwahn 提交于
      commit d7065da0 ("get rid of the magic around f_count in aio") added
      fput_atomic to include/linux/fs.h, motivated by its use in __aio_put_req()
      in fs/aio.c.
      
      Later, commit 3ffa3c0e ("aio: now fput() is OK from interrupt context;
      get rid of manual delayed __fput()") removed the only use of fput_atomic
      in __aio_put_req(), but did not remove the since then unused fput_atomic
      definition in include/linux/fs.h.
      
      We curate this now and finally remove the unused definition.
      
      This issue was identified during a code review due to a coccinelle warning
      from the atomic_as_refcounter.cocci rule pointing to the use of atomic_t
      in fput_atomic.
      Suggested-by: NKrystian Radlak <kradlak@exida.com>
      Signed-off-by: NLukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1caf7a70
  15. 07 4月, 2019 1 次提交
    • K
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov 提交于
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: NKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
  16. 21 3月, 2019 1 次提交
    • A
      vfs: syscall: Add open_tree(2) to reference or clone a mount · a07b2000
      Al Viro 提交于
      open_tree(dfd, pathname, flags)
      
      Returns an O_PATH-opened file descriptor or an error.
      dfd and pathname specify the location to open, in usual
      fashion (see e.g. fstatat(2)).  flags should be an OR of
      some of the following:
      	* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
      same meanings as usual
      	* OPEN_TREE_CLOEXEC - make the resulting descriptor
      close-on-exec
      	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
      instead of opening the location in question, create a detached
      mount tree matching the subtree rooted at location specified by
      dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
      without it - only the part within in the mount containing the
      location in question.  In other words, the same as mount --rbind
      or mount --bind would've taken.  The detached tree will be
      dissolved on the final close of obtained file.  Creation of such
      detached trees requires the same capabilities as doing mount --bind.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a07b2000
  17. 08 3月, 2019 1 次提交
  18. 06 3月, 2019 1 次提交
  19. 05 3月, 2019 1 次提交
    • L
      aio: simplify - and fix - fget/fput for io_submit() · 84c4e1f8
      Linus Torvalds 提交于
      Al Viro root-caused a race where the IOCB_CMD_POLL handling of
      fget/fput() could cause us to access the file pointer after it had
      already been freed:
      
       "In more details - normally IOCB_CMD_POLL handling looks so:
      
         1) io_submit(2) allocates aio_kiocb instance and passes it to
            aio_poll()
      
         2) aio_poll() resolves the descriptor to struct file by req->file =
            fget(iocb->aio_fildes)
      
         3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
            aio_kiocb to 2 (bumps by 1, that is).
      
         4) aio_poll() calls vfs_poll(). After sanity checks (basically,
            "poll_wait() had been called and only once") it locks the queue.
            That's what the extra reference to iocb had been for - we know we
            can safely access it.
      
         5) With queue locked, we check if ->woken has already been set to
            true (by aio_poll_wake()) and, if it had been, we unlock the
            queue, drop a reference to aio_kiocb and bugger off - at that
            point it's a responsibility to aio_poll_wake() and the stuff
            called/scheduled by it. That code will drop the reference to file
            in req->file, along with the other reference to our aio_kiocb.
      
         6) otherwise, we see whether we need to wait. If we do, we unlock the
            queue, drop one reference to aio_kiocb and go away - eventual
            wakeup (or cancel) will deal with the reference to file and with
            the other reference to aio_kiocb
      
         7) otherwise we remove ourselves from waitqueue (still under the
            queue lock), so that wakeup won't get us. No async activity will
            be happening, so we can safely drop req->file and iocb ourselves.
      
        If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
        won't get freed under us, so we can do all the checks and locking
        safely. And we don't touch ->file if we detect that case.
      
        However, vfs_poll() most certainly *does* touch the file it had been
        given. So wakeup coming while we are still in ->poll() might end up
        doing fput() on that file. That case is not too rare, and usually we
        are saved by the still present reference from descriptor table - that
        fput() is not the final one.
      
        But if another thread closes that descriptor right after our fget()
        and wakeup does happen before ->poll() returns, we are in trouble -
        final fput() done while we are in the middle of a method:
      
      Al also wrote a patch to take an extra reference to the file descriptor
      to fix this, but I instead suggested we just streamline the whole file
      pointer handling by submit_io() so that the generic aio submission code
      simply keeps the file pointer around until the aio has completed.
      
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c4e1f8
  20. 28 2月, 2019 5 次提交
    • J
      fs: add fget_many() and fput_many() · 091141a4
      Jens Axboe 提交于
      Some uses cases repeatedly get and put references to the same file, but
      the only exposed interface is doing these one at the time. As each of
      these entail an atomic inc or dec on a shared structure, that cost can
      add up.
      
      Add fget_many(), which works just like fget(), except it takes an
      argument for how many references to get on the file. Ditto fput_many(),
      which can drop an arbitrary number of references to a file.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      091141a4
    • J
      Add io_uring IO interface · 2b188cc1
      Jens Axboe 提交于
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b188cc1
    • D
      vfs: Remove kern_mount_data() · d911b458
      David Howells 提交于
      The kern_mount_data() isn't used any more so remove it.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d911b458
    • A
      convenience helpers: vfs_get_super() and sget_fc() · cb50b348
      Al Viro 提交于
      the former is an analogue of mount_{single,nodev} for use in
      ->get_tree() instances, the latter - analogue of sget() for the
      same.
      
      These are fairly similar to the originals, but the callback signature
      for sget_fc() is different from sget() ones, so getting bits and
      pieces shared would be too convoluted; we might get around to that
      later, but for now let's just remember to keep them in sync.  They
      do live next to each other, and changes in either won't be hard
      to spot.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cb50b348
    • D
      vfs: Implement a filesystem superblock creation/configuration context · 3e1aeb00
      David Howells 提交于
      [AV - unfuck kern_mount_data(); we want non-NULL ->mnt_ns on long-living
      mounts]
      [AV - reordering fs/namespace.c is badly overdue, but let's keep it
      separate from that series]
      [AV - drop simple_pin_fs() change]
      [AV - clean vfs_kern_mount() failure exits up]
      
      Implement a filesystem context concept to be used during superblock
      creation for mount and superblock reconfiguration for remount.
      
      The mounting procedure then becomes:
      
       (1) Allocate new fs_context context.
      
       (2) Configure the context.
      
       (3) Create superblock.
      
       (4) Query the superblock.
      
       (5) Create a mount for the superblock.
      
       (6) Destroy the context.
      
      Rather than calling fs_type->mount(), an fs_context struct is created and
      fs_type->init_fs_context() is called to set it up.  Pointers exist for the
      filesystem and LSM to hang their private data off.
      
      A set of operations has to be set by ->init_fs_context() to provide
      freeing, duplication, option parsing, binary data parsing, validation,
      mounting and superblock filling.
      
      Legacy filesystems are supported by the provision of a set of legacy
      fs_context operations that build up a list of mount options and then invoke
      fs_type->mount() from within the fs_context ->get_tree() operation.  This
      allows all filesystems to be accessed using fs_context.
      
      It should be noted that, whilst this patch adds a lot of lines of code,
      there is quite a bit of duplication with existing code that can be
      eliminated should all filesystems be converted over.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3e1aeb00
  21. 24 2月, 2019 1 次提交
  22. 31 1月, 2019 4 次提交
    • A
      introduce fs_context methods · f3a09c92
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f3a09c92
    • D
      convert do_remount_sb() to fs_context · 8d0347f6
      David Howells 提交于
      Replace do_remount_sb() with a function, reconfigure_super(), that's
      fs_context aware.  The fs_context is expected to be parameterised already
      and have ->root pointing to the superblock to be reconfigured.
      
      A legacy wrapper is provided that is intended to be called from the
      fs_context ops when those appear, but for now is called directly from
      reconfigure_super().  This wrapper invokes the ->remount_fs() superblock op
      for the moment.  It is intended that the remount_fs() op will be phased
      out.
      
      The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate
      that the context is being used for reconfiguration.
      
      do_umount_root() is provided to consolidate remount-to-R/O for umount and
      emergency remount by creating a context and invoking reconfiguration.
      
      do_remount(), do_umount() and do_emergency_remount_callback() are switched
      to use the new process.
      
      [AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the
      umount / bug, gets rid of pointless complexity]
      [AV -- set ->net_ns in all cases; nfs remount will need that]
      [AV -- shift security_sb_remount() call into reconfigure_super(); the callers
      that didn't do security_sb_remount() have NULL fc->security anyway, so it's
      a no-op for them]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Co-developed-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8d0347f6
    • A
      teach vfs_get_tree() to handle subtype, switch do_new_mount() to it · a0c9a8b8
      Al Viro 提交于
      Roll the handling of subtypes into do_new_mount() and vfs_get_tree().  The
      former determines any subtype string and hangs it off the fs_context; the
      latter applies it.
      
      Make do_new_mount() create, parameterise and commit an fs_context and
      create a mount for itself rather than calling vfs_kern_mount().
      
      [AV -- missing kstrdup()]
      [AV -- ... and no kstrdup() if we get to setting ->s_submount - we
      simply transfer it from fc, leaving NULL behind]
      [AV -- constify ->s_submount, while we are at it]
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a0c9a8b8
    • W
      fs: Don't need to put list_lru into its own cacheline · 7d10f70f
      Waiman Long 提交于
      The list_lru structure is essentially just a pointer to a table of
      per-node LRU lists.  Even if CONFIG_MEMCG_KMEM is defined, the list
      field is just used for LRU list registration and shrinker_id is set at
      initialization.  Those fields won't need to be touched that often.
      
      So there is no point to make the list_lru structures to sit in their own
      cachelines.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d10f70f
  23. 24 1月, 2019 1 次提交
  24. 22 1月, 2019 1 次提交
  25. 29 12月, 2018 1 次提交
  26. 07 12月, 2018 1 次提交