1. 21 Jan, 2020 32 commits
  2. 18 Jan, 2020 1 commit
    • A
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai committed
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags, these affect
          all path components). The current set of flags is as follows (at
          the moment, all of the RESOLVE_ flags are implemented as just
          passing the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
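The field rules described above can be modelled in user space roughly as follows. This is only a sketch: the `M_*` flag values and `model_check_how()` are hypothetical names invented for the example, not the kernel's constants or code, and rejecting unknown resolve bits is an assumption by analogy with the flags field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
#define M_O_RDWR        0x0002u
#define M_O_CREAT       0x0040u
#define M_O_PATH        0x1000u
#define M_O_TMPFILE     0x2000u
#define M_VALID_FLAGS   (M_O_RDWR | M_O_CREAT | M_O_PATH | M_O_TMPFILE)
#define M_VALID_RESOLVE 0x1fu /* five bits, one per RESOLVE_ flag in the model */

struct model_open_how {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Returns 0 if the request is acceptable, -EINVAL otherwise. */
static int model_check_how(const struct model_open_how *how)
{
    assert(how != NULL);

    /* Unknown flag bits are rejected instead of being silently ignored. */
    if (how->flags & ~(uint64_t)M_VALID_FLAGS)
        return -EINVAL;
    /* Assumption: unknown resolve bits are treated the same way. */
    if (how->resolve & ~(uint64_t)M_VALID_RESOLVE)
        return -EINVAL;
    /* Incorrect combinations such as O_PATH|O_RDWR are rejected. */
    if ((how->flags & M_O_PATH) && (how->flags & M_O_RDWR))
        return -EINVAL;
    /* mode must be zero unless O_CREAT or O_TMPFILE is requested. */
    if (!(how->flags & (M_O_CREAT | M_O_TMPFILE)) && how->mode != 0)
        return -EINVAL;
    return 0;
}
```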
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
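One way to picture how a size argument lets old and new layouts coexist is the user-space model below. It assumes the clone3(2)-style convention: a shorter user struct is zero-extended, and a longer one is accepted only if the extra bytes are zero. The function `model_copy_struct()` is a hypothetical stand-in written for this example, not the kernel helper itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a user-provided struct of usize bytes into a kernel-side struct of
 * ksize bytes. Returns 0 on success, or -E2BIG if the caller set fields
 * that this (older) side does not know about.
 */
static int model_copy_struct(void *dst, size_t ksize,
                             const void *src, size_t usize)
{
    const unsigned char *p = src;
    size_t n = usize < ksize ? usize : ksize;

    assert(dst != NULL && src != NULL);

    /* Newer userspace: any bytes past ksize must be zero (no-op defaults). */
    for (size_t i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -E2BIG;

    /* Older userspace: fields it does not know about default to zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, n);
    return 0;
}
```

Under this convention, a caller could probe for support of a newer field by setting it and checking for an error, which is one reading of how userspace can work out the kernel's struct size at runtime.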
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVs

      Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 17 Jan, 2020 3 commits
    • J
      btrfs: check rw_devices, not num_devices for balance · b35cf1f0
      Josef Bacik committed
      The fstest btrfs/154 reports
      
        [ 8675.381709] BTRFS: Transaction aborted (error -28)
        [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
        [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
        [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
        [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
        [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
        [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
        [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
        [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
        [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
        [ 8675.413994] FS:  00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
        [ 8675.416146] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
        [ 8675.419801] Call Trace:
        [ 8675.420742]  btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
        [ 8675.422600]  btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
        [ 8675.424335]  reset_balance_state+0x14a/0x190 [btrfs]
        [ 8675.425824]  btrfs_balance.cold+0xe7/0x154 [btrfs]
        [ 8675.427313]  ? kmem_cache_alloc_trace+0x235/0x2c0
        [ 8675.428663]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
        [ 8675.430285]  btrfs_ioctl+0x466/0x2550 [btrfs]
        [ 8675.431788]  ? mem_cgroup_charge_statistics+0x51/0xf0
        [ 8675.433487]  ? mem_cgroup_commit_charge+0x56/0x400
        [ 8675.435122]  ? do_raw_spin_unlock+0x4b/0xc0
        [ 8675.436618]  ? _raw_spin_unlock+0x1f/0x30
        [ 8675.438093]  ? __handle_mm_fault+0x499/0x740
        [ 8675.439619]  ? do_vfs_ioctl+0x56e/0x770
        [ 8675.441034]  do_vfs_ioctl+0x56e/0x770
        [ 8675.442411]  ksys_ioctl+0x3a/0x70
        [ 8675.443718]  ? trace_hardirqs_off_thunk+0x1a/0x1c
        [ 8675.445333]  __x64_sys_ioctl+0x16/0x20
        [ 8675.446705]  do_syscall_64+0x50/0x210
        [ 8675.448059]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
      
      We now use btrfs_can_overcommit() to see if we can flip a block group
      read only.  Previously this would fail because we weren't taking into
      account the usable un-allocated space for allocating chunks.  With my
      patches we were allowed to do the balance, which is technically correct.
      
      The test is trying to start balance on degraded mount.  So now we're
      trying to allocate a chunk and cannot because we want to allocate a
      RAID1 chunk, but there's only 1 device that's available for usage.  This
      results in an ENOSPC.
      
      But we shouldn't even be making it this far, because we don't have
      enough devices to restripe.  The problem is that we're using
      btrfs_num_devices(), which also includes missing devices. That's not
      actually what we want; we need to use rw_devices.
      
      The chunk_mutex is not needed here, rw_devices changes only in device
      add, remove or replace, all are excluded by EXCL_OP mechanism.
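The difference between the two counters can be sketched with a small user-space model. The struct, the helper name, and the choice of -EINVAL as the error code are assumptions made for illustration; the actual check lives in the btrfs balance code and is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the device counters discussed above. */
struct model_fs_devices {
    uint64_t num_devices; /* all devices, including missing ones */
    uint64_t rw_devices;  /* devices that can currently be written */
};

/*
 * Decide whether a balance to a profile needing min_devs devices may start.
 * Counting rw_devices rejects a degraded mount up front instead of failing
 * later with ENOSPC during chunk allocation.
 */
static int model_balance_allowed(const struct model_fs_devices *d,
                                 uint64_t min_devs)
{
    assert(d != NULL);
    return d->rw_devices >= min_devs ? 0 : -EINVAL;
}
```

For a two-device RAID1 filesystem mounted degraded, `num_devices` would be 2 while `rw_devices` is 1, so the model refuses the request.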
      
      Fixes: e4d8ec0f ("Btrfs: implement online profile changing")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add stacktrace, update changelog, drop chunk_mutex ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: always copy scrub arguments back to user space · 5afe6ce7
      Filipe Manana committed
      If scrub returns an error we are not copying back the scrub arguments
      structure to user space. This prevents user space from knowing how much
      progress scrub has made if an error happened. This includes -ECANCELED,
      which is returned when users ask for scrub to stop. A particular use
      case, found in btrfs-progs, is resuming scrub after it is canceled: it
      relies on reading the progress from the scrub arguments structure and
      then using that progress in a call to resume scrub.
      
      So fix this by always copying the scrub arguments structure to user
      space. The value returned to user space is overwritten with -EFAULT
      only if copying the structure failed. That tells user space that
      either the copy did not happen, and the structure is therefore stale,
      or the copy happened only partially, and the structure is probably
      invalid or corrupt.
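The return-value handling described above can be modelled in user space as follows. This is a sketch only: `model_copy_to_user()` is a hypothetical stand-in for the kernel's copy routine (a NULL destination simulates a fault), and `model_scrub_ioctl()` is not the actual ioctl handler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_scrub_args {
    unsigned long long progress;
};

/* Stand-in for copy_to_user(): returns nonzero if the copy failed. */
static int model_copy_to_user(struct model_scrub_args *udst,
                              const struct model_scrub_args *ksrc)
{
    if (udst == NULL)
        return 1;
    *udst = *ksrc;
    return 0;
}

/* Always copy the arguments back; only a failed copy overrides the result. */
static int model_scrub_ioctl(int scrub_ret,
                             const struct model_scrub_args *kargs,
                             struct model_scrub_args *uargs)
{
    int ret = scrub_ret;

    assert(kargs != NULL);
    if (model_copy_to_user(uargs, kargs))
        ret = -EFAULT;
    return ret;
}
```

With this shape, a canceled scrub still reports -ECANCELED while the caller receives the progress it needs to resume.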

      Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
      Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
      Fixes: 06fe39ab ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      io_uring: only allow submit from owning task · 44d28279
      Jens Axboe committed
      If the credentials or the mm doesn't match, don't allow the task to
      submit anything on behalf of this ring. The task that owns the ring can
      pass the file descriptor to another task, but we don't want to allow
      that task to submit an SQE that then assumes the ring mm and creds if
      it needs to go async.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 16 Jan, 2020 3 commits
    • M
      fuse: fix fuse_send_readpages() in the syncronous read case · 7df1e988
      Miklos Szeredi committed
      Buffered read in fuse normally goes via:
      
       -> generic_file_buffered_read()
         -> fuse_readpages()
           -> fuse_send_readpages()
             -> fuse_simple_request() [called since v5.4]
      
      In the case of a read request, fuse_simple_request() will return a
      non-negative bytecount on success or a negative error value.  A positive
      bytecount was taken to be an error and the PG_error flag set on the page.
      This resulted in generic_file_buffered_read() falling back to ->readpage(),
      which would repeat the read request and succeed.  Because of the repeated
      read succeeding the bug was not detected with regression tests or other use
      cases.
      
      The FTP module in GVFS however fails the second read due to the
      non-seekable nature of FTP downloads.
      
      Fix by checking and ignoring positive return value from
      fuse_simple_request().
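The fix amounts to collapsing a non-negative byte count into "no error" before the value is used as an error status. A minimal user-space sketch of that normalisation, with a hypothetical helper name, could look like this:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Map a "byte count or negative errno" return value to an error status:
 * negative values pass through, while zero and positive counts mean success.
 */
static int model_read_status(ssize_t ret)
{
    /* Negative errno values are small, so the cast is safe here. */
    assert(ret >= -4095);
    return ret < 0 ? (int)ret : 0;
}
```

Treating a positive count as success avoids marking the page as failed, so the fallback read that non-seekable backends cannot serve is never triggered.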

      Reported-by: Ondrej Holy <oholy@redhat.com>
      Link: https://gitlab.gnome.org/GNOME/gvfs/issues/441
      Fixes: 134831e3 ("fuse: convert readpages to simple api")
      Cc: <stable@vger.kernel.org> # v5.4
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • J
      io_uring: ensure workqueue offload grabs ring mutex for poll list · 11ba820b
      Jens Axboe committed
      A previous commit moved the locking for the async sqthread, but didn't
      take into account that the io-wq workers still need it. We can't use
      req->in_async for this anymore, as both the sqthread and io-wq workers
      set it. Gate the need for locking on io_wq_current_is_worker() instead.
      
      Fixes: 8a4955ff ("io_uring: sqthread should grab ctx->uring_lock for submissions")
      Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • B
      io_uring: clear req->result always before issuing a read/write request · 797f3f53
      Bijan Mottahedeh committed
      req->result is cleared when io_issue_sqe() calls io_read/write_pre()
      routines.  Those routines however are not called when the sqe
      argument is NULL, which is the case when io_issue_sqe() is called from
      io_wq_submit_work().  io_issue_sqe() may then examine a stale result if
      a polled request had previously failed with -EAGAIN:
      
              if (ctx->flags & IORING_SETUP_IOPOLL) {
                      if (req->result == -EAGAIN)
                              return -EAGAIN;
      
                      io_iopoll_req_issued(req);
              }
      
      and in turn cause a subsequently completed request to be re-issued in
      io_wq_submit_work().
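The stale-result problem can be illustrated with a toy model. The struct and function below are hypothetical and greatly simplified: `io_outcome` stands for whatever the current attempt reports, and the only point shown is that clearing `result` before each issue keeps an earlier -EAGAIN from leaking into the check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_req {
    int result; /* may still hold -EAGAIN from a previous polled attempt */
};

/* Issue a polled request; returns -EAGAIN if this attempt must be retried. */
static int model_issue_polled(struct model_req *req, int io_outcome)
{
    assert(req != NULL);

    /* Always start from a clean result, whichever path issued the request. */
    req->result = 0;

    if (io_outcome == -EAGAIN)
        req->result = -EAGAIN;

    if (req->result == -EAGAIN)
        return -EAGAIN;
    return 0;
}
```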

      Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 15 Jan, 2020 1 commit