1. 21 Jan 2020, 23 commits
    • io_uring: batch getting pcpu references · 2b85edfc
      Committed by Pavel Begunkov
      percpu_ref_tryget() has its own overhead. Instead of getting a
      reference for each request, grab a bunch once per io_submit_sqes().
      
      ~5% throughput boost for a "submit and wait 128 nops" benchmark.
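
      As a hedged sketch of the idea (the io_submit_sqes() call site is
      approximated, and do_submit() is a hypothetical stand-in for the
      submit loop), the batching builds on the existing percpu-ref helpers:

        /* grab enough references for the whole batch up front */
        if (!percpu_ref_tryget_many(&ctx->refs, nr))
                return -EAGAIN;

        submitted = do_submit(ctx, nr);  /* hypothetical submit loop */

        /* hand back whatever the loop didn't consume */
        if (submitted != nr)
                percpu_ref_put_many(&ctx->refs, nr - submitted);
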
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

      __io_req_free_empty() -> __io_req_do_free()
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_MADVISE · c1ca757b
      Committed by Jens Axboe
      This adds support for doing madvise(2) through io_uring. We assume that
      any operation can block, and hence punt everything async. This could be
      improved, but hard to make bullet proof. The async punt ensures it's
      safe.
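
      A minimal usage sketch via liburing (assuming a liburing build that
      exposes io_uring_prep_madvise(); ring, addr and length are set up
      elsewhere, and error handling is elided):

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        /* queue an async MADV_DONTNEED for [addr, addr + length) */
        io_uring_prep_madvise(sqe, addr, length, MADV_DONTNEED);
        io_uring_submit(&ring);
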
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_FADVISE · 4840e418
      Committed by Jens Axboe
      This adds support for doing fadvise through io_uring. We assume that
      WILLNEED doesn't block, but that DONTNEED may block.
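
      The liburing usage mirrors the madvise sketch above (same
      assumptions; fd and length set up elsewhere):

        /* async equivalent of posix_fadvise(fd, 0, length, POSIX_FADV_DONTNEED) */
        io_uring_prep_fadvise(sqe, fd, 0, length, POSIX_FADV_DONTNEED);
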
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow use of offset == -1 to mean file position · ba04291e
      Committed by Jens Axboe
      This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
      update) the current file position. This obviously comes with the caveat
      that if the application has multiple read/writes in flight, then the
      end result will not be as expected. This is similar to threads sharing
      a file descriptor and doing IO using the current file position.
      
      Since this feature isn't easily detectable by doing a read or write,
      add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to
      detect presence of this feature.
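
      A short detection sketch (standard liburing setup; error handling
      elided):

        struct io_uring_params p = { 0 };
        struct io_uring ring;

        io_uring_queue_init_params(8, &ring, &p);

        /* nonzero iff offset == -1 uses (and updates) the file position */
        int cur_pos_ok = p.features & IORING_FEAT_RW_CUR_POS;
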
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add non-vectored read/write commands · 3a6820f2
      Committed by Jens Axboe
      For use cases that don't already naturally have an iovec, it's easier
      (or more convenient) to just use a buffer address + length. This is
      particularly true if the use case is from languages that want to create
      a memory safe abstraction on top of io_uring, and where introducing
      the need for the iovec may impose an ownership issue. For those cases,
      they currently need an indirection buffer, which means allocating data
      just for this purpose.
      
      Add basic read/write that don't require the iovec.
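
      A hedged sketch of the non-vectored variant via liburing (fd opened
      elsewhere; completion handling elided):

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        /* IORING_OP_READ: plain pointer + length, no iovec to own */
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);
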
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: improve poll completion performance · e94f141b
      Committed by Jens Axboe
      For busy IORING_OP_POLL_ADD workloads, we can have enough contention
      on the completion lock that we fail the inline completion path quite
      often as we fail the trylock on that lock. Add a list for deferred
      completions that we can use in that case. This helps reduce the number
      of async offloads we have to do, as if we get multiple completions in
      a row, we'll piggy back on to the poll_llist instead of having to queue
      our own offload.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: split overflow state into SQ and CQ side · ad3eb2c8
      Committed by Jens Axboe
      We currently check ->cq_overflow_list from both SQ and CQ context, which
      causes some bouncing of that cache line. Add separate bits of state for
      this instead, so that the SQ side can check using its own state, and
      likewise for the CQ side.
      
      This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow
      with the CQ state. If we hit an overflow condition, both of these bits
      are set. Likewise for overflow flush clear, we clear both bits. For the
      fast path of just checking if there's an overflow condition on either
      the SQ or CQ side, we can use our own private bit for this.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add lookup table for various opcode needs · d3656344
      Committed by Jens Axboe
      We currently have various switch statements that check if an opcode needs
      a file, mm, etc. These are hard to keep in sync as opcodes are added. Add
      a struct io_op_def that holds all of this information, so we have just
      one spot to update when opcodes are added.
      
      This also enables us to NOT allocate req->io if a deferred command
      doesn't need it, and corrects some mistakes we had in terms of what
      commands need mm context.
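
      A simplified sketch of the table's shape (field names approximate the
      in-kernel struct io_op_def of this era; the real table covers every
      opcode and more flags):

        struct io_op_def {
                unsigned needs_mm : 1;    /* needs current->mm */
                unsigned needs_file : 1;  /* needs req->file assigned */
                unsigned async_ctx : 1;   /* needs req->io for deferral */
        };

        static const struct io_op_def io_op_defs[] = {
                [IORING_OP_NOP]   = {},
                [IORING_OP_READV] = { .needs_mm = 1, .needs_file = 1,
                                      .async_ctx = 1 },
                [IORING_OP_FSYNC] = { .needs_file = 1 },
        };
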
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove two unnecessary function declarations · add7b6b8
      Committed by Jens Axboe
      __io_free_req() and io_double_put_req() aren't used before they are
      defined, so we can kill these two forwards.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: move *queue_link_head() from common path · 32fe525b
      Committed by Pavel Begunkov
      Move io_queue_link_head() to the link-handling code in io_submit_sqe(),
      so it doesn't need extra checks and has better data locality.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: rename prev to head · 9d76377f
      Committed by Pavel Begunkov
      Calling "prev" a head of a link is a bit misleading. Rename it
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_ASYNC · ce35a47a
      Committed by Jens Axboe
      io_uring defaults to always doing inline submissions, if at all
      possible. But for larger copies, even if the data is fully cached, that
      can take a long time. Add an IOSQE_ASYNC flag that the application can
      set on the SQE - if set, it'll ensure that we always go async for those
      kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
      get the concurrency we desire for this case.
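
      From the application side this is a one-liner on the SQE (liburing
      helper shown; the raw equivalent is sqe->flags |= IOSQE_ASYNC):

        /* skip the inline attempt, go straight to async context */
        io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
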
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: support concurrent non-blocking work · 895e2ca0
      Committed by Jens Axboe
      io-wq assumes that work will complete fast (and not block), so it
      doesn't create a new worker when work is enqueued, if we already have
      at least one worker running. This is done on the assumption that if work
      is running, then it will complete fast.
      
      Add an option to force io-wq to fork a new worker for work queued. This
      is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
      case, io-wq will create a new worker, even though workers are already
      running.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_STATX · eddc7ef5
      Committed by Jens Axboe
      This provides support for async statx(2) through io_uring.
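
      A hedged liburing sketch (path, mask and flags chosen for
      illustration; ring setup and completion handling elided):

        struct statx stx;
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_statx(sqe, AT_FDCWD, "somefile", 0, STATX_SIZE, &stx);
        io_uring_submit(&ring);
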
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: make two stat prep helpers available · 3934e36f
      Committed by Jens Axboe
      To implement an async stat, we need to provide the flags mapping and
      the statx user copy. Make them available internally, through
      fs/internal.h.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c
      Committed by Jens Axboe
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
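
      For illustration, the synchronous update path via liburing looks
      roughly like this (the slot index and file name are assumptions):

        int new_fd = open("newfile", O_RDONLY);

        /* swap the fd registered at slot 3 for new_fd */
        int ret = io_uring_register_files_update(&ring, 3, &new_fd, 1);

      IORING_OP_FILES_UPDATE performs the same update as an SQE, with the
      CQE signaling when the update has taken effect.
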
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_CLOSE · b5dba59e
      Committed by Jens Axboe
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
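
      Usage is as simple as it sounds (liburing sketch):

        /* like close(fd); the final fput may finish in async context */
        io_uring_prep_close(sqe, fd);
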
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: add support for uncancellable work · 0c9d5ccd
      Committed by Jens Axboe
      Not all work can be cancelled, some of it we may need to guarantee
      that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
      on work that must not be cancelled. Note that the caller work function
      must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
      IO_WQ_WORK_CANCEL.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: move filp_close() outside of __close_fd_get_file() · 6e802a4b
      Committed by Jens Axboe
      There is just one caller of this; use filp_close() there manually.
      This is important to allow async close/removal of the fd.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_OPENAT · 15b71abe
      Committed by Jens Axboe
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
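
      A hedged liburing sketch (path and flags chosen for illustration):

        io_uring_prep_openat(sqe, AT_FDCWD, "somefile", O_RDONLY, 0);
        /* on completion, cqe->res is the new fd, or -errno on failure */
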
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • fs: make build_open_flags() available internally · 35cb6d54
      Committed by Jens Axboe
      This is a prep patch for supporting non-blocking open from io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for fallocate() · d63d1b5e
      Committed by Jens Axboe
      This exposes fallocate(2) through io_uring.
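
      A one-line liburing sketch (sizes chosen for illustration):

        /* like fallocate(fd, 0, 0, 1 MiB), punted async when needed */
        io_uring_prep_fallocate(sqe, fd, 0, 0, 1 << 20);
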
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · 1292e972
      Committed by Eugene Syromiatnikov
      The fds field of struct io_uring_files_update is problematic with regard
      to compat user space, as the pointer size differs in 32-bit, 32-on-64-bit,
      and 64-bit user space.  In order to avoid custom handling of compat in
      the syscall implementation, make fds __u64 and use u64_to_user_ptr in
      order to retrieve it.  Also, align the field naturally and check that
      no garbage is passed there.
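
      The resulting uapi layout (the inline comments are mine):

        struct io_uring_files_update {
                __u32 offset;
                __u32 resv;        /* must be zero, checked by the kernel */
                __aligned_u64 fds; /* user pointer, read via u64_to_user_ptr() */
        };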
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 18 Jan 2020, 1 commit
    • open: introduce openat2(2) syscall · fddb5d43
      Committed by Aleksa Sarai
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64 bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags is as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
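
      As a usage sketch (assuming a libc without a wrapper, so the raw
      syscall is used; headers as added by this series):

        #include <fcntl.h>
        #include <linux/openat2.h>  /* struct open_how, RESOLVE_* */
        #include <sys/syscall.h>
        #include <unistd.h>

        struct open_how how = {
                .flags   = O_RDONLY,
                .resolve = RESOLVE_IN_ROOT, /* dirfd acts as "/" for lookup */
        };
        int fd = syscall(__NR_openat2, dirfd, "etc/passwd", &how, sizeof(how));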
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVs
      Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 17 Jan 2020, 3 commits
    • btrfs: check rw_devices, not num_devices for balance · b35cf1f0
      Committed by Josef Bacik
      The fstest btrfs/154 reports
      
        [ 8675.381709] BTRFS: Transaction aborted (error -28)
        [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
        [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
        [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
        [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
        [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
        [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
        [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
        [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
        [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
        [ 8675.413994] FS:  00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
        [ 8675.416146] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
        [ 8675.419801] Call Trace:
        [ 8675.420742]  btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
        [ 8675.422600]  btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
        [ 8675.424335]  reset_balance_state+0x14a/0x190 [btrfs]
        [ 8675.425824]  btrfs_balance.cold+0xe7/0x154 [btrfs]
        [ 8675.427313]  ? kmem_cache_alloc_trace+0x235/0x2c0
        [ 8675.428663]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
        [ 8675.430285]  btrfs_ioctl+0x466/0x2550 [btrfs]
        [ 8675.431788]  ? mem_cgroup_charge_statistics+0x51/0xf0
        [ 8675.433487]  ? mem_cgroup_commit_charge+0x56/0x400
        [ 8675.435122]  ? do_raw_spin_unlock+0x4b/0xc0
        [ 8675.436618]  ? _raw_spin_unlock+0x1f/0x30
        [ 8675.438093]  ? __handle_mm_fault+0x499/0x740
        [ 8675.439619]  ? do_vfs_ioctl+0x56e/0x770
        [ 8675.441034]  do_vfs_ioctl+0x56e/0x770
        [ 8675.442411]  ksys_ioctl+0x3a/0x70
        [ 8675.443718]  ? trace_hardirqs_off_thunk+0x1a/0x1c
        [ 8675.445333]  __x64_sys_ioctl+0x16/0x20
        [ 8675.446705]  do_syscall_64+0x50/0x210
        [ 8675.448059]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
      
      We now use btrfs_can_overcommit() to see if we can flip a block group
      read only.  Before, this would fail because we weren't taking into
      account the usable unallocated space for allocating chunks.  With my
      patches we were allowed to do the balance, which is technically correct.
      
      The test is trying to start a balance on a degraded mount.  So now we're
      trying to allocate a chunk and cannot, because we want to allocate a
      RAID1 chunk but there's only one device available for use.  This
      results in an ENOSPC.
      
      But we shouldn't even be making it this far; we don't have enough
      devices to restripe.  The problem is that we're using btrfs_num_devices(),
      which also includes missing devices.  That's not actually what we want;
      we need to use rw_devices.
      
      The chunk_mutex is not needed here; rw_devices changes only on device
      add, remove or replace, all of which are excluded by the EXCL_OP
      mechanism.
      
      Fixes: e4d8ec0f ("Btrfs: implement online profile changing")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add stacktrace, update changelog, drop chunk_mutex ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: always copy scrub arguments back to user space · 5afe6ce7
      Committed by Filipe Manana
      If scrub returns an error we are not copying back the scrub arguments
      structure to user space. This prevents user space from knowing how much
      progress scrub has made if an error happened - this includes -ECANCELED,
      which is returned when users ask for scrub to stop. A particular use
      case, used in btrfs-progs, is to resume scrub after it is canceled;
      in that case it relies on checking the progress from the scrub
      arguments structure and then uses that progress in a call to resume
      scrub.
      
      So fix this by always copying the scrub arguments structure to user
      space, overwriting the value returned to user space with -EFAULT only if
      copying the structure fails. This lets user space know either that the
      copy did not happen, and therefore the structure is stale, or that it
      happened partially and the structure is probably not valid and corrupt
      due to the partial copy.
      Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
      Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
      Fixes: 06fe39ab ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • io_uring: only allow submit from owning task · 44d28279
      Committed by Jens Axboe
      If the credentials or the mm doesn't match, don't allow the task to
      submit anything on behalf of this ring. The task that owns the ring can
      pass the file descriptor to another task, but we don't want to allow
      that task to submit an SQE that then assumes the ring mm and creds if
      it needs to go async.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 16 Jan 2020, 3 commits
    • fuse: fix fuse_send_readpages() in the synchronous read case · 7df1e988
      Committed by Miklos Szeredi
      Buffered read in fuse normally goes via:
      
       -> generic_file_buffered_read()
         -> fuse_readpages()
           -> fuse_send_readpages()
              -> fuse_simple_request() [called since v5.4]
      
      In the case of a read request, fuse_simple_request() will return a
      non-negative bytecount on success or a negative error value.  A positive
      bytecount was taken to be an error and the PG_error flag set on the page.
      This resulted in generic_file_buffered_read() falling back to ->readpage(),
      which would repeat the read request and succeed.  Because of the repeated
      read succeeding the bug was not detected with regression tests or other use
      cases.
      
      The FTP module in GVFS however fails the second read due to the
      non-seekable nature of FTP downloads.
      
      Fix by checking and ignoring positive return value from
      fuse_simple_request().
      Reported-by: Ondrej Holy <oholy@redhat.com>
      Link: https://gitlab.gnome.org/GNOME/gvfs/issues/441
      Fixes: 134831e3 ("fuse: convert readpages to simple api")
      Cc: <stable@vger.kernel.org> # v5.4
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • io_uring: ensure workqueue offload grabs ring mutex for poll list · 11ba820b
      Committed by Jens Axboe
      A previous commit moved the locking for the async sqthread, but didn't
      take into account that the io-wq workers still need it. We can't use
      req->in_async for this anymore, as both the sqthread and io-wq workers
      set it; gate the need for locking on io_wq_current_is_worker() instead.
      
      Fixes: 8a4955ff ("io_uring: sqthread should grab ctx->uring_lock for submissions")
      Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: clear req->result always before issuing a read/write request · 797f3f53
      Committed by Bijan Mottahedeh
      req->result is cleared when io_issue_sqe() calls io_read/write_pre()
      routines.  Those routines however are not called when the sqe
      argument is NULL, which is the case when io_issue_sqe() is called from
      io_wq_submit_work().  io_issue_sqe() may then examine a stale result if
      a polled request had previously failed with -EAGAIN:
      
              if (ctx->flags & IORING_SETUP_IOPOLL) {
                      if (req->result == -EAGAIN)
                              return -EAGAIN;
      
                      io_iopoll_req_issued(req);
              }
      
      and in turn cause a subsequently completed request to be re-issued in
      io_wq_submit_work().
      Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 15 Jan 2020, 6 commits
  6. 14 Jan 2020, 2 commits
    • io_uring: don't setup async context for read/write fixed · 74566df3
      Committed by Jens Axboe
      We don't need it, and if we have it, then the retry handler will attempt
      to copy the non-existent iovec with the inline iovec, with a segment
      count that doesn't make sense.
      
      Fixes: f67676d1 ("io_uring: ensure async punted read/write requests copy iovec")
      Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • btrfs: relocation: fix reloc_root lifespan and access · 6282675e
      Committed by Qu Wenruo
      [BUG]
      There are several different KASAN reports for balance + snapshot
      workloads.  Involved call paths include:
      
         should_ignore_root+0x54/0xb0 [btrfs]
         build_backref_tree+0x11af/0x2280 [btrfs]
         relocate_tree_blocks+0x391/0xb80 [btrfs]
         relocate_block_group+0x3e5/0xa00 [btrfs]
         btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
         btrfs_relocate_chunk+0x53/0xf0 [btrfs]
         btrfs_balance+0xc91/0x1840 [btrfs]
         btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
         btrfs_ioctl+0x8af/0x3e60 [btrfs]
         do_vfs_ioctl+0x831/0xb10
      
         create_reloc_root+0x9f/0x460 [btrfs]
         btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
         create_pending_snapshot+0xa9b/0x15f0 [btrfs]
         create_pending_snapshots+0x111/0x140 [btrfs]
         btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
         btrfs_mksubvol+0x915/0x960 [btrfs]
         btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
         btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
         btrfs_ioctl+0x241b/0x3e60 [btrfs]
         do_vfs_ioctl+0x831/0xb10
      
         btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
         create_pending_snapshot+0x209/0x15f0 [btrfs]
         create_pending_snapshots+0x111/0x140 [btrfs]
         btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
         btrfs_mksubvol+0x915/0x960 [btrfs]
         btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
         btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
         btrfs_ioctl+0x241b/0x3e60 [btrfs]
         do_vfs_ioctl+0x831/0xb10
      
      [CAUSE]
      All these call sites rely only on root->reloc_root, which can undergo
      btrfs_drop_snapshot(). Since we don't have real refcount-based
      protection for reloc roots, we can reach an already dropped reloc root,
      triggering KASAN.
      
      [FIX]
      To avoid such access to unstable root->reloc_root, we should check
      BTRFS_ROOT_DEAD_RELOC_TREE bit first.
      
      This patch introduces wrappers that provide the correct way to check the
      bit with memory barriers protection.
      
      Most callers don't distinguish between a merged reloc tree and no reloc
      tree.  The only exception is should_ignore_root(), as a merged reloc
      tree can be ignored, while no reloc tree shouldn't be.
      
      [CRITICAL SECTION ANALYSIS]
      Although test_bit()/set_bit()/clear_bit() don't imply a barrier, the
      DEAD_RELOC_TREE bit has extra help from the transaction as a higher-level
      barrier; the lifespans of root::reloc_root and the DEAD_RELOC_TREE bit are:
      
      	NULL: reloc_root is NULL	PTR: reloc_root is not NULL
      	0: DEAD_RELOC_ROOT bit not set	DEAD: DEAD_RELOC_ROOT bit set
      
      	(NULL, 0)    Initial state		 __
      	  |					 /\ Section A
              btrfs_init_reloc_root()			 \/
      	  |				 	 __
      	(PTR, 0)     reloc_root initialized      /\
                |					 |
      	btrfs_update_reloc_root()		 |  Section B
                |					 |
      	(PTR, DEAD)  reloc_root has been merged  \/
                |					 __
      	=== btrfs_commit_transaction() ====================
      	  |					 /\
      	clean_dirty_subvols()			 |
      	  |					 |  Section C
      	(NULL, DEAD) reloc_root cleanup starts   \/
                |					 __
      	btrfs_drop_snapshot()			 /\
      	  |					 |  Section D
      	(NULL, 0)    Back to initial state	 \/
      
      Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds a
      transaction handle, so no such caller can cross a transaction boundary.
      
      In Section A, every caller just sees no DEAD bit and grabs reloc_root.
      
      In the cross section A-B, a caller may see no DEAD bit, but since
      reloc_root is still completely valid, accessing it is completely safe.
      
      No test_bit() caller can cross the boundary of Section B and Section C.
      
      In Section C, every caller finds the DEAD bit, so no one will access
      reloc_root.
      
      In the cross section C-D, the caller either sees the DEAD bit set and
      avoids accessing reloc_root regardless of whether it would be safe, or
      sees the DEAD bit cleared and accesses reloc_root, which is already
      NULL, so nothing will go wrong.
      
      The memory write barriers sit between the reloc_root updates and the
      bit set/clear; the pairing read-side barrier is before test_bit().
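
      The wrapper's shape, approximately (a sketch; see the patch itself
      for the exact barrier placement):

        static bool have_reloc_root(struct btrfs_root *root)
        {
                /* pairs with the write barriers around the updates above */
                smp_rmb();
                if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
                        return false;
                if (!root->reloc_root)
                        return false;
                return true;
        }
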
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ barriers ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 09 Jan 2020, 2 commits
    • fs: move guard_bio_eod() after bio_set_op_attrs · 83c9c547
      Committed by Ming Lei
      Commit 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      adds bio_truncate() for handling bio EOD. However, bio_truncate()
      doesn't use the passed 'op' parameter from guard_bio_eod's callers.
      
      So bio_truncate() may retrieve the wrong 'op', and zeroing pages may
      not be done for a READ bio.
      
      Fix this issue by moving guard_bio_eod() after bio_set_op_attrs()
      in submit_bh_wbc() so that bio_truncate() can always retrieve the
      correct op info.
      
      Meanwhile, remove the 'op' parameter from guard_bio_eod() because it
      isn't used any more.
      
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Fixes: 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>

      Fold in kerneldoc and bio_op() change.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • pstore/ram: Regularize prz label allocation lifetime · e163fdb3
      Committed by Kees Cook
      In my attempt to fix a memory leak, I introduced a double-free in the
      pstore error path. Instead of trying to manage the allocation lifetime
      between persistent_ram_new() and its callers, adjust the logic so
      persistent_ram_new() always takes a kstrdup() copy, and leaves the
      caller's allocation lifetime up to the caller. Therefore callers are
      _always_ responsible for freeing their label. Before, it only needed
      freeing when the prz itself failed to allocate, and not in any of the
      other prz failure cases, which callers would have no visibility into,
      which is the root design problem that led to both the leak and now the
      double-free bug.
      Reported-by: Cengiz Can <cengiz@kernel.wtf>
      Link: https://lore.kernel.org/lkml/d4ec59002ede4aaf9928c7f7526da87c@kernel.wtf
      Fixes: 8df955a3 ("pstore/ram: Fix error-path memory leak in persistent_ram_new() callers")
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>