1. 03 May, 2019 1 commit
  2. 02 May, 2019 1 commit
    • new inode method: ->free_inode() · fdb0da89
      Al Viro committed
      A lot of ->destroy_inode() instances end with call_rcu() of a callback
      that does RCU-delayed part of freeing.  Introduce a new method for
      doing just that, with saner signature.
      
      Rules:
      ->destroy_inode		->free_inode
      	f			g		immediate call of f(),
      						RCU-delayed call of g()
      	f			NULL		immediate call of f(),
      						no RCU-delayed calls
      	NULL			g		RCU-delayed call of g()
      	NULL			NULL		RCU-delayed default freeing
      
      IOW, NULL ->free_inode gives the same behaviour as now.
      
      Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could
      mandate the latter form, but that would have very little benefit beyond
      making rules a bit more symmetric.  It would break backwards compatibility,
      require extra boilerplate and expected semantics for (NULL, NULL) pair
      would have no use whatsoever...
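The rule table above can be modelled in a few lines. This is a Python sketch of the dispatch logic only, not kernel code: `SuperOps`, `Inode`, `destroy_inode_model`, `run_grace_period` and the `rcu_queue` list (standing in for call_rcu()) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SuperOps:
    # Either method may be left as None, mirroring a NULL pointer.
    destroy_inode: Optional[Callable[["Inode"], None]] = None
    free_inode: Optional[Callable[["Inode"], None]] = None


@dataclass
class Inode:
    ops: SuperOps
    log: List[str] = field(default_factory=list)


def default_free(inode):
    # Stand-in for the default freeing used in the (NULL, NULL) case.
    inode.log.append("default free")


def destroy_inode_model(inode, rcu_queue):
    """Run ->destroy_inode immediately and queue ->free_inode for later."""
    ops = inode.ops
    if ops.destroy_inode is not None:
        ops.destroy_inode(inode)      # immediate call of f()
        if ops.free_inode is None:
            return                    # (f, NULL): no RCU-delayed call
        delayed = ops.free_inode      # (f, g): RCU-delayed call of g()
    else:
        # (NULL, g) queues g(); (NULL, NULL) queues the default freeing.
        delayed = ops.free_inode or default_free
    rcu_queue.append(lambda: delayed(inode))


def run_grace_period(rcu_queue):
    """Pretend a grace period elapsed: run every queued callback in order."""
    while rcu_queue:
        rcu_queue.pop(0)()
```

In this model an instance that sets only `destroy_inode` never touches the queue, which is the "same behaviour as now" case from the table.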
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 07 Apr, 2019 1 commit
    • fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov committed
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However, even though the 2006 version of Linus's patch added f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      does so regardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write can never proceed until the
      read is done, and the read is waiting for some, potentially external,
      event for a potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above, another regression caused by f_pos
      locking is that FUSE filesystems that implement open with the
      FOPEN_NONSEEKABLE flag can no longer implement bidirectional
      stream-like files - for the same reason as above: read can deadlock
      write, because both take the lock on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         stream_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so making such a change would break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14 (the kernel where 9c225f26 first appeared).
      
         This will allow patching OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
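The difference between the two openers described in the plan can be sketched as a small bit-flag model. This is a Python illustration, not the kernel implementation: the numeric flag values and the helper names ending in `_model` are assumptions made for the example, and the real f_pos locking decision also depends on conditions not modelled here.

```python
# Illustrative f_mode bits; the numeric values are assumptions for this sketch.
FMODE_LSEEK = 0x4
FMODE_PREAD = 0x8
FMODE_PWRITE = 0x10
FMODE_ATOMIC_POS = 0x8000
FMODE_STREAM = 0x200000

# A regular file as described above: seekable, with f_pos locking enabled.
REGULAR_FILE_MODE = FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS


def nonseekable_open_model(f_mode):
    """Drop seek and positional I/O, but keep the f_pos lock bit."""
    return f_mode & ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE)


def stream_open_model(f_mode):
    """Additionally drop FMODE_ATOMIC_POS and mark the file as a stream."""
    f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS)
    return f_mode | FMODE_STREAM


def takes_f_pos_lock(f_mode):
    """Simplified rule: read and write serialize on f_pos only if the bit is set."""
    return bool(f_mode & FMODE_ATOMIC_POS)
```

Under this model a file opened with `nonseekable_open_model()` still serializes read against write, while one opened with `stream_open_model()` does not, which is the property the plan relies on.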
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 21 Mar, 2019 1 commit
    • vfs: syscall: Add open_tree(2) to reference or clone a mount · a07b2000
      Al Viro committed
      open_tree(dfd, pathname, flags)
      
      Returns an O_PATH-opened file descriptor or an error.
      dfd and pathname specify the location to open, in usual
      fashion (see e.g. fstatat(2)).  flags should be an OR of
      some of the following:
      	* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
      	  same meanings as usual
      	* OPEN_TREE_CLOEXEC - make the resulting descriptor
      	  close-on-exec
      	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
      	  instead of opening the location in question, create a detached
      	  mount tree matching the subtree rooted at the location specified by
      	  dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
      	  without it - only the part within the mount containing the
      	  location in question.  In other words, the same as mount --rbind
      	  or mount --bind would've taken.  The detached tree will be
      	  dissolved on the final close of the obtained file.  Creation of such
      	  detached trees requires the same capabilities as doing mount --bind.
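How the clone-related flags select between the three behaviours described above can be sketched as follows. This is a Python model of the flag interpretation only; it does not call open_tree(2). The numeric values are illustrative, `classify_open_tree_flags` is a hypothetical helper, and rejecting AT_RECURSIVE without OPEN_TREE_CLONE is an assumption based on the description listing only the two clone forms.

```python
# Illustrative flag values for the sketch; treat them as assumptions.
AT_RECURSIVE = 0x8000
OPEN_TREE_CLONE = 0x1
OPEN_TREE_CLOEXEC = 0o2000000


def classify_open_tree_flags(flags):
    """Return (behaviour, close_on_exec) for a flags word.

    behaviour is "open" (plain O_PATH-style open of the location),
    "bind" (detached clone of the containing mount only) or
    "rbind" (detached clone of the entire subtree).
    """
    cloexec = bool(flags & OPEN_TREE_CLOEXEC)
    clone = bool(flags & OPEN_TREE_CLONE)
    recursive = bool(flags & AT_RECURSIVE)
    if recursive and not clone:
        # Assumption: AT_RECURSIVE only makes sense together with a clone.
        raise ValueError("AT_RECURSIVE requires OPEN_TREE_CLONE")
    if not clone:
        return ("open", cloexec)
    return ("rbind" if recursive else "bind", cloexec)
```

The "bind" and "rbind" results correspond to what mount --bind and mount --rbind would have taken, as the text puts it.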
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 08 Mar, 2019 1 commit
  6. 06 Mar, 2019 1 commit
  7. 05 Mar, 2019 1 commit
    • aio: simplify - and fix - fget/fput for io_submit() · 84c4e1f8
      Linus Torvalds committed
      Al Viro root-caused a race where the IOCB_CMD_POLL handling of
      fget/fput() could cause us to access the file pointer after it had
      already been freed:
      
       "In more details - normally IOCB_CMD_POLL handling looks so:
      
         1) io_submit(2) allocates aio_kiocb instance and passes it to
            aio_poll()
      
         2) aio_poll() resolves the descriptor to struct file by req->file =
            fget(iocb->aio_fildes)
      
         3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
            aio_kiocb to 2 (bumps by 1, that is).
      
         4) aio_poll() calls vfs_poll(). After sanity checks (basically,
            "poll_wait() had been called and only once") it locks the queue.
            That's what the extra reference to iocb had been for - we know we
            can safely access it.
      
         5) With queue locked, we check if ->woken has already been set to
            true (by aio_poll_wake()) and, if it had been, we unlock the
            queue, drop a reference to aio_kiocb and bugger off - at that
            point it's a responsibility to aio_poll_wake() and the stuff
            called/scheduled by it. That code will drop the reference to file
            in req->file, along with the other reference to our aio_kiocb.
      
         6) otherwise, we see whether we need to wait. If we do, we unlock the
            queue, drop one reference to aio_kiocb and go away - eventual
            wakeup (or cancel) will deal with the reference to file and with
            the other reference to aio_kiocb
      
         7) otherwise we remove ourselves from waitqueue (still under the
            queue lock), so that wakeup won't get us. No async activity will
            be happening, so we can safely drop req->file and iocb ourselves.
      
        If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
        won't get freed under us, so we can do all the checks and locking
        safely. And we don't touch ->file if we detect that case.
      
        However, vfs_poll() most certainly *does* touch the file it had been
        given. So wakeup coming while we are still in ->poll() might end up
        doing fput() on that file. That case is not too rare, and usually we
        are saved by the still present reference from descriptor table - that
        fput() is not the final one.
      
        But if another thread closes that descriptor right after our fget()
        and wakeup does happen before ->poll() returns, we are in trouble -
        final fput() done while we are in the middle of a method."
      
      Al also wrote a patch to take an extra reference to the file descriptor
      to fix this, but I instead suggested we just streamline the whole file
      pointer handling by io_submit() so that the generic aio submission code
      simply keeps the file pointer around until the aio has completed.
      
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 28 Feb, 2019 5 commits
    • fs: add fget_many() and fput_many() · 091141a4
      Jens Axboe committed
      Some use cases repeatedly get and put references to the same file, but
      the only exposed interface is doing these one at a time. As each of
      these entails an atomic inc or dec on a shared structure, that cost can
      add up.
      
      Add fget_many(), which works just like fget(), except it takes an
      argument for how many references to get on the file. Ditto fput_many(),
      which can drop an arbitrary number of references to a file.
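The saving is easiest to see by counting updates to the shared counter. The following Python sketch is a model, not the kernel code: `FileModel`, its `atomic_ops` counter and the lock standing in for an atomic variable are all hypothetical.

```python
import threading


class FileModel:
    """Toy reference-counted file; the lock stands in for an atomic counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 1          # the reference held by the descriptor table
        self.atomic_ops = 0     # how many times the shared counter was touched
        self.released = False

    def get_many(self, refs):
        """Take `refs` references with a single update of the counter."""
        with self._lock:
            if self.count == 0:
                return False    # file is already on its way out
            self.count += refs
            self.atomic_ops += 1
            return True

    def put_many(self, refs):
        """Drop `refs` references with a single update of the counter."""
        with self._lock:
            self.count -= refs
            self.atomic_ops += 1
            if self.count == 0:
                self.released = True

    def get(self):
        return self.get_many(1)

    def put(self):
        self.put_many(1)
```

Taking eight references one at a time costs eight counter updates in this model, while `get_many(8)` costs one; the same holds for dropping them.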
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Add io_uring IO interface · 2b188cc1
      Jens Axboe committed
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
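The indirection between the SQ ring and the io_uring_sqe array can be illustrated with a toy model. This Python sketch does not use the real interface: `RingModel` and its methods are hypothetical, plain lists replace the shared memory mappings, and every request "completes" immediately with result 0.

```python
class RingModel:
    """Toy SQ/CQ pair: the SQ ring stores indices into a separate sqe array."""

    def __init__(self, entries):
        self.entries = entries
        self.sqes = [None] * entries    # stands in for the io_uring_sqe array
        self.sq_array = [0] * entries   # SQ ring: indices into self.sqes
        self.sq_head = 0                # advanced by the "kernel" side
        self.sq_tail = 0                # advanced by the application
        self.cq = []                    # completions, appended contiguously

    def queue(self, sqe_index, sqe):
        """Fill any free sqe slot and publish its index at the SQ tail."""
        if self.sq_tail - self.sq_head == self.entries:
            raise BufferError("SQ ring full")
        self.sqes[sqe_index] = sqe
        self.sq_array[self.sq_tail % self.entries] = sqe_index
        self.sq_tail += 1

    def kernel_consume(self):
        """Consume published indices in order and post one cqe per sqe."""
        while self.sq_head != self.sq_tail:
            index = self.sq_array[self.sq_head % self.entries]
            sqe = self.sqes[index]
            self.sq_head += 1
            # Each completion points back to its submission via user_data.
            self.cq.append({"user_data": sqe["user_data"], "res": 0})
```

Because the ring holds indices rather than the entries themselves, a batch can use sqe slots 3 and 0 in that order and still be submitted as two adjacent ring entries.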
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • vfs: Remove kern_mount_data() · d911b458
      David Howells committed
      The kern_mount_data() isn't used any more so remove it.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • convenience helpers: vfs_get_super() and sget_fc() · cb50b348
      Al Viro committed
      the former is an analogue of mount_{single,nodev} for use in
      ->get_tree() instances, the latter - analogue of sget() for the
      same.
      
      These are fairly similar to the originals, but the callback signature
      for sget_fc() is different from sget() ones, so getting bits and
      pieces shared would be too convoluted; we might get around to that
      later, but for now let's just remember to keep them in sync.  They
      do live next to each other, and changes in either won't be hard
      to spot.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: Implement a filesystem superblock creation/configuration context · 3e1aeb00
      David Howells committed
      [AV - unfuck kern_mount_data(); we want non-NULL ->mnt_ns on long-living
      mounts]
      [AV - reordering fs/namespace.c is badly overdue, but let's keep it
      separate from that series]
      [AV - drop simple_pin_fs() change]
      [AV - clean vfs_kern_mount() failure exits up]
      
      Implement a filesystem context concept to be used during superblock
      creation for mount and superblock reconfiguration for remount.
      
      The mounting procedure then becomes:
      
       (1) Allocate new fs_context context.
      
       (2) Configure the context.
      
       (3) Create superblock.
      
       (4) Query the superblock.
      
       (5) Create a mount for the superblock.
      
       (6) Destroy the context.
      
      Rather than calling fs_type->mount(), an fs_context struct is created and
      fs_type->init_fs_context() is called to set it up.  Pointers exist for the
      filesystem and LSM to hang their private data off.
      
      A set of operations has to be set by ->init_fs_context() to provide
      freeing, duplication, option parsing, binary data parsing, validation,
      mounting and superblock filling.
      
      Legacy filesystems are supported by the provision of a set of legacy
      fs_context operations that build up a list of mount options and then invoke
      fs_type->mount() from within the fs_context ->get_tree() operation.  This
      allows all filesystems to be accessed using fs_context.
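The legacy path described above, where options are accumulated and handed to the old mount method from inside ->get_tree(), can be sketched as a small model. This is Python pseudostructure rather than the kernel API: `LegacyFsType`, `FsContextModel`, the `legacy_*` helpers and `mount_model` are hypothetical names, and steps (4) and (5) are collapsed into returning the root.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LegacyFsType:
    name: str
    mount: Callable[[str], str]   # old-style entry point taking an option string


@dataclass
class FsContextModel:
    fs_type: LegacyFsType
    options: List[str] = field(default_factory=list)
    root: Optional[str] = None


def legacy_parse_param(fc, key, value=None):
    """Configuration step: just remember the option for later."""
    fc.options.append(key if value is None else f"{key}={value}")


def legacy_get_tree(fc):
    """Superblock creation: pass the accumulated options to the old method."""
    fc.root = fc.fs_type.mount(",".join(fc.options))


def mount_model(fs_type, params):
    fc = FsContextModel(fs_type)              # (1) allocate the context
    for key, value in params:                 # (2) configure it
        legacy_parse_param(fc, key, value)
    legacy_get_tree(fc)                       # (3) create the superblock
    root = fc.root                            # (4)+(5) query and mount, collapsed
    del fc                                    # (6) destroy the context
    return root
```

A filesystem that only knows the old interface needs no changes in this model; the wrapper is what adapts it to the context-based procedure.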
      
      It should be noted that, whilst this patch adds a lot of lines of code,
      there is quite a bit of duplication with existing code that can be
      eliminated should all filesystems be converted over.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 24 Feb, 2019 1 commit
  10. 31 Jan, 2019 4 commits
    • introduce fs_context methods · f3a09c92
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • convert do_remount_sb() to fs_context · 8d0347f6
      David Howells committed
      Replace do_remount_sb() with a function, reconfigure_super(), that's
      fs_context aware.  The fs_context is expected to be parameterised already
      and have ->root pointing to the superblock to be reconfigured.
      
      A legacy wrapper is provided that is intended to be called from the
      fs_context ops when those appear, but for now is called directly from
      reconfigure_super().  This wrapper invokes the ->remount_fs() superblock op
      for the moment.  It is intended that the remount_fs() op will be phased
      out.
      
      The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate
      that the context is being used for reconfiguration.
      
      do_umount_root() is provided to consolidate remount-to-R/O for umount and
      emergency remount by creating a context and invoking reconfiguration.
      
      do_remount(), do_umount() and do_emergency_remount_callback() are switched
      to use the new process.
      
      [AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the
      umount / bug, gets rid of pointless complexity]
      [AV -- set ->net_ns in all cases; nfs remount will need that]
      [AV -- shift security_sb_remount() call into reconfigure_super(); the callers
      that didn't do security_sb_remount() have NULL fc->security anyway, so it's
      a no-op for them]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • teach vfs_get_tree() to handle subtype, switch do_new_mount() to it · a0c9a8b8
      Al Viro committed
      Roll the handling of subtypes into do_new_mount() and vfs_get_tree().  The
      former determines any subtype string and hangs it off the fs_context; the
      latter applies it.
      
      Make do_new_mount() create, parameterise and commit an fs_context and
      create a mount for itself rather than calling vfs_kern_mount().
      
      [AV -- missing kstrdup()]
      [AV -- ... and no kstrdup() if we get to setting ->s_submount - we
      simply transfer it from fc, leaving NULL behind]
      [AV -- constify ->s_submount, while we are at it]
      Reviewed-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: Don't need to put list_lru into its own cacheline · 7d10f70f
      Waiman Long committed
      The list_lru structure is essentially just a pointer to a table of
      per-node LRU lists.  Even if CONFIG_MEMCG_KMEM is defined, the list
      field is just used for LRU list registration and shrinker_id is set at
      initialization.  Those fields won't need to be touched that often.
      
      So there is no point in making the list_lru structures sit in their own
      cachelines.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 24 Jan, 2019 1 commit
  12. 22 Jan, 2019 1 commit
  13. 29 Dec, 2018 1 commit
  14. 07 Dec, 2018 1 commit
  15. 01 Dec, 2018 1 commit
    • fs/locks: rename some lists and pointers. · ada5c1da
      NeilBrown committed
      struct file_lock contains an 'fl_next' pointer which
      is used to point to the lock that this request is blocked
      waiting for.  So rename it to fl_blocker.
      
      The fl_blocked list_head in an active lock is the head of a list of
      blocked requests.  In a request it is a node in that list.
      These are two distinct uses, so replace with two list_heads
      with different names.
      fl_blocked_requests is the head of a list of blocked requests
      fl_blocked_member is a node in a member of that list.
      
      The two different list_heads are never used at the same time, but that
      will change in a future patch.
      
      Note that a tracepoint is changed to report fl_blocker instead
      of fl_next.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Reviewed-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
  16. 20 Nov, 2018 1 commit
    • block: Initialize BIO I/O priority early · 20578bdf
      Damien Le Moal committed
      For the synchronous I/O path case (read(), write() etc system calls), a
      BIO I/O priority is not initialized until the execution of
      blk_init_request_from_bio() when the BIO is submitted and a request
      initialized for the BIO execution. This is due to the ki_ioprio field of
      the struct kiocb defined on stack being always initialized to
      IOPRIO_CLASS_NONE, regardless of the calling process I/O context ioprio
      value set with ioprio_set(). This late initialization can result in the
      BIO being merged to pending requests even when the I/O priorities
      differ.
      
      Fix this by initializing the ki_ioprio field of the on-stack struct
      kiocb using the get_current_ioprio() helper, ensuring that all BIOs
      allocated and submitted for the system call execution see the correct
      intended I/O priority early. With this, since a BIO I/O priority is
      always set to the intended effective value for both the sync and async
      path, blk_init_request_from_bio() can be simplified.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 30 Oct, 2018 10 commits
  18. 25 Oct, 2018 1 commit
    • fsnotify: Fix busy inodes during unmount · 721fb6fb
      Jan Kara committed
      Detaching of mark connector from fsnotify_put_mark() can race with
      unmounting of the filesystem like:
      
        CPU1				CPU2
      fsnotify_put_mark()
        spin_lock(&conn->lock);
        ...
        inode = fsnotify_detach_connector_from_object(conn)
        spin_unlock(&conn->lock);
      				generic_shutdown_super()
      				  fsnotify_unmount_inodes()
      				    sees connector detached for inode
      				      -> nothing to do
      				  evict_inode()
      				    barfs on pending inode reference
        iput(inode);
      
      Resulting in "Busy inodes after unmount" message and possible kernel
      oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
      references from detached connectors.
      
      Note that the accounting of outstanding inode references in the
      superblock can cause some cacheline contention on the counter. OTOH it
      happens only during deletion of the last notification mark from an inode
      (or during unlinking of watched inode) and that is not too bad. I have
      measured time to create & delete inotify watch 100000 times from 64
      processes in parallel (each process having its own inotify group and its
      own file on a shared superblock) on a 64 CPU machine. Average and
      standard deviation of 15 runs look like:
      
      	Avg		Stddev
      Vanilla	9.817400	0.276165
      Fixed	9.710467	0.228294
      
      So there's no statistically significant difference.
      
      Fixes: 6b3f05d2 ("fsnotify: Detach mark from object list when last reference is dropped")
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
  19. 21 Oct, 2018 2 commits
  20. 19 Oct, 2018 1 commit
    • fs: group frequently accessed fields of struct super_block together · 99c228a9
      Amir Goldstein committed
      Kernel test robot reported [1] a 6% performance regression in a
      concurrent unlink(2) workload on commit 60f7ed8c ("fsnotify: send
      path type events to group with super block marks").
      
      The performance test was run with no fsnotify marks at all on the
      data set, so the only extra instructions added by the offending
      commit are tests of the super_block fields s_fsnotify_{marks,mask}
      and these tests happen on almost every single inode access.
      
      When adding those fields to the super_block struct, we did not give much
      thought to placing them on hot cache lines (we just placed them at the
      end of the struct).
      
      Re-organize struct super_block to try and keep some frequently accessed
      fields on the same cache line.
      
      Move the frequently accessed fields s_fsnotify_{marks,mask} near the
      frequently accessed fields s_fs_info,s_time_gran, while filling a 64bit
      alignment hole after s_time_gran.
      
      Move the seldom accessed fields s_id,s_uuid,s_max_links,s_mode near the
      seldom accessed fields s_vfs_rename_mutex,s_subtype.
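The effect of filling an alignment hole can be reproduced with a toy layout. The Python sketch below uses ctypes, which follows the platform's native C alignment rules; the two structures are drastically cut-down stand-ins for struct super_block, and `s_cold_ptr` is a hypothetical placeholder for a seldom accessed pointer-sized field.

```python
import ctypes


class Scattered(ctypes.Structure):
    # A pointer-sized field separates the two 32-bit fields, so on a
    # 64-bit build each 32-bit field is followed by padding.
    _fields_ = [
        ("s_fs_info", ctypes.c_void_p),
        ("s_time_gran", ctypes.c_uint32),
        ("s_cold_ptr", ctypes.c_void_p),
        ("s_fsnotify_mask", ctypes.c_uint32),
    ]


class Grouped(ctypes.Structure):
    # The second 32-bit field sits directly after the first one and
    # occupies what would otherwise be an alignment hole.
    _fields_ = [
        ("s_fs_info", ctypes.c_void_p),
        ("s_time_gran", ctypes.c_uint32),
        ("s_fsnotify_mask", ctypes.c_uint32),
        ("s_cold_ptr", ctypes.c_void_p),
    ]


def layout(struct):
    """Map each field name to its byte offset within the structure."""
    return {name: getattr(struct, name).offset for name, _ in struct._fields_}
```

Comparing `layout(Scattered)` with `layout(Grouped)` shows the hot fields moving next to each other, and `ctypes.sizeof()` shows the grouped variant is never larger.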
      
      Rong Chen confirmed that this patch solved the reported problem.
      
      [1] https://lkml.org/lkml/2018/9/30/206
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Tested-by: kernel test robot <rong.a.chen@intel.com>
      Fixes: 1e6cb723 ("fsnotify: add super block object type")
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
  21. 24 Sep, 2018 1 commit
    • vfs: swap names of {do,vfs}_clone_file_range() · a725356b
      Amir Goldstein committed
      Commit 031a072a ("vfs: call vfs_clone_file_range() under freeze
      protection") created a wrapper do_clone_file_range() around
      vfs_clone_file_range(), moving the freeze protection to the former, so
      overlayfs could call the latter.
      
      The more common vfs practice is to call do_xxx helpers from vfs_xxx
      helpers, where freeze protection is taken in the vfs_xxx helper, so
      this anomaly could be a source of confusion.
      
      It seems that commit 8ede2055 ("ovl: add reflink/copyfile/dedup
      support") may have fallen victim to this confusion -
      ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
      hope of getting freeze protection on upper fs, but in fact this results in
      overlayfs allowing upper fs freeze protection to be bypassed.
      
      Swap the names of the two helpers to conform to common vfs practice
      and call the correct helpers from overlayfs and nfsd.
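The naming convention after the swap can be sketched as a wrapper pair. This is a Python model under stated assumptions: `SuperBlockModel` and its `freeze_protection()` context manager are hypothetical, and the `writers` counter merely stands in for whatever the kernel does to hold off a freeze.

```python
from contextlib import contextmanager


class SuperBlockModel:
    """Toy superblock that only tracks how many writers hold freeze protection."""

    def __init__(self):
        self.writers = 0

    @contextmanager
    def freeze_protection(self):
        self.writers += 1       # a freeze would have to wait while this is held
        try:
            yield
        finally:
            self.writers -= 1


def do_clone_file_range(sb, log):
    """Inner helper: does the work and takes no freeze protection itself."""
    log.append(("clone", sb.writers > 0))


def vfs_clone_file_range(sb, log):
    """Outer helper: takes freeze protection, then calls the do_ helper."""
    with sb.freeze_protection():
        do_clone_file_range(sb, log)
```

A caller that wants protection on the upper filesystem calls the vfs_ helper; a caller that already holds protection, or deliberately manages it itself, calls the do_ helper.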
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  22. 03 Sep, 2018 1 commit
  23. 30 Aug, 2018 1 commit