1. 24 1月, 2021 4 次提交
  2. 11 12月, 2020 1 次提交
  3. 03 12月, 2020 1 次提交
  4. 13 8月, 2020 1 次提交
    • K
      exec: move S_ISREG() check earlier · 633fb6ac
      Kees Cook 提交于
      The execve(2)/uselib(2) syscalls have always rejected non-regular files.
      Recently, it was noticed that a deadlock was introduced when trying to
      execute pipes, as the S_ISREG() test was happening too late.  This was
      fixed in commit 73601ea5 ("fs/open.c: allow opening only regular files
      during execve()"), but it was added after inode_permission() had already
      run, which meant LSMs could see bogus attempts to execute non-regular
      files.
      
      Move the test into the other inode type checks (which already look for
      other pathological conditions[1]).  Since there is no need to use
      FMODE_EXEC while we still have access to "acc_mode", also switch the test
      to MAY_EXEC.
      
      Also include a comment with the redundant S_ISREG() checks at the end of
      execve(2)/uselib(2) to note that they are present to avoid any mistakes.
      
      My notes on the call path, and related arguments, checks, etc:
      
      do_open_execat()
          struct open_flags open_exec_flags = {
              .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
              .acc_mode = MAY_EXEC,
              ...
          do_filp_open(dfd, filename, open_flags)
              path_openat(nameidata, open_flags, flags)
                  file = alloc_empty_file(open_flags, current_cred());
                  do_open(nameidata, file, open_flags)
                      may_open(path, acc_mode, open_flag)
      		    /* new location of MAY_EXEC vs S_ISREG() test */
                          inode_permission(inode, MAY_OPEN | acc_mode)
                              security_inode_permission(inode, acc_mode)
                      vfs_open(path, file)
                          do_dentry_open(file, path->dentry->d_inode, open)
                              /* old location of FMODE_EXEC vs S_ISREG() test */
                              security_file_open(f)
                              open()
      
      [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      633fb6ac
  5. 31 7月, 2020 7 次提交
  6. 16 7月, 2020 2 次提交
  7. 17 6月, 2020 2 次提交
    • C
      close_range: add CLOSE_RANGE_UNSHARE · 60997c3d
      Christian Brauner 提交于
      One of the use-cases of close_range() is to drop file descriptors just before
      execve(). This would usually be expressed in the sequence:
      
      unshare(CLONE_FILES);
      close_range(3, ~0U);
      
      as pointed out by Linus it might be desirable to have this be a part of
      close_range() itself under a new flag CLOSE_RANGE_UNSHARE.
      
      This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
      maximum number of file descriptors to copy from the old struct files. When the
      user requests that all file descriptors are supposed to be closed via
      close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
      to do any of the heavy fput() work for everything above min.
      
      The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
      fact currently share our file descriptor table we create a new private copy.
      We then close all fds in the requested range and finally after we're done we
      install the new fd table.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      60997c3d
    • C
      open: add close_range() · 278a5fba
      Christian Brauner 提交于
      This adds the close_range() syscall. It allows to efficiently close a range
      of file descriptors up to all file descriptors of a calling task.
      
      I was contacted by FreeBSD as they wanted to have the same close_range()
      syscall as we proposed here. We've coordinated this and in the meantime, Kyle
      was fast enough to merge close_range() into FreeBSD already in April:
      https://reviews.freebsd.org/D21627
      https://svnweb.freebsd.org/base?view=revision&revision=359836
      and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
      once its merged in Linux too. Python is in the process of switching to
      close_range() on FreeBSD and they are waiting on us to merge this to switch on
      Linux as well: https://bugs.python.org/issue38061
      
      The syscall came up in a recent discussion around the new mount API and
      making new file descriptor types cloexec by default. During this
      discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
      syscall in this manner has been requested by various people over time.
      
      First, it helps to close all file descriptors of an exec()ing task. This
      can be done safely via (quoting Al's example from [1] verbatim):
      
              /* that exec is sensitive */
              unshare(CLONE_FILES);
              /* we don't want anything past stderr here */
              close_range(3, ~0U);
              execve(....);
      
      The code snippet above is one way of working around the problem that file
      descriptors are not cloexec by default. This is aggravated by the fact that
      we can't just switch them over without massively regressing userspace. For
      a whole class of programs having an in-kernel method of closing all file
      descriptors is very helpful (e.g. demons, service managers, programming
      language standard libraries, container managers etc.).
      (Please note, unshare(CLONE_FILES) should only be needed if the calling
      task is multi-threaded and shares the file descriptor table with another
      thread in which case two threads could race with one thread allocating file
      descriptors and the other one closing them via close_range(). For the
      general case close_range() before the execve() is sufficient.)
      
      Second, it allows userspace to avoid implementing closing all file
      descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
      file descriptor. From looking at various large(ish) userspace code bases
      this or similar patterns are very common in:
      - service managers (cf. [4])
      - libcs (cf. [6])
      - container runtimes (cf. [5])
      - programming language runtimes/standard libraries
        - Python (cf. [2])
        - Rust (cf. [7], [8])
      As Dmitry pointed out there's even a long-standing glibc bug about missing
      kernel support for this task (cf. [3]).
      In addition, the syscall will also work for tasks that do not have procfs
      mounted and on kernels that do not have procfs support compiled in. In such
      situations the only way to make sure that all file descriptors are closed
      is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
      OPEN_MAX trickery (cf. comment [8] on Rust).
      
      The performance is striking. For good measure, comparing the following
      simple close_all_fds() userspace implementation that is essentially just
      glibc's version in [6]:
      
      static int close_all_fds(void)
      {
              int dir_fd;
              DIR *dir;
              struct dirent *direntp;
      
              dir = opendir("/proc/self/fd");
              if (!dir)
                      return -1;
              dir_fd = dirfd(dir);
              while ((direntp = readdir(dir))) {
                      int fd;
                      if (strcmp(direntp->d_name, ".") == 0)
                              continue;
                      if (strcmp(direntp->d_name, "..") == 0)
                              continue;
                      fd = atoi(direntp->d_name);
                      if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                              continue;
                      close(fd);
              }
              closedir(dir);
              return 0;
      }
      
      to close_range() yields:
      1. closing 4 open files:
         - close_all_fds(): ~280 us
         - close_range():    ~24 us
      
      2. closing 1000 open files:
         - close_all_fds(): ~5000 us
         - close_range():   ~800 us
      
      close_range() is designed to allow for some flexibility. Specifically, it
      does not simply always close all open file descriptors of a task. Instead,
      callers can specify an upper bound.
      This is e.g. useful for scenarios where specific file descriptors are
      created with well-known numbers that are supposed to be excluded from
      getting closed.
      For extra paranoia close_range() comes with a flags argument. This can e.g.
      be used to implement extension. Once can imagine userspace wanting to stop
      at the first error instead of ignoring errors under certain circumstances.
      There might be other valid ideas in the future. In any case, a flag
      argument doesn't hurt and keeps us on the safe side.
      
      From an implementation side this is kept rather dumb. It saw some input
      from David and Jann but all nonsense is obviously my own!
      - Errors to close file descriptors are currently ignored. (Could be changed
        by setting a flag in the future if needed.)
      - __close_range() is a rather simplistic wrapper around __close_fd().
        My reasoning behind this is based on the nature of how __close_fd() needs
        to release an fd. But maybe I misunderstood specifics:
        We take the files_lock and rcu-dereference the fdtable of the calling
        task, we find the entry in the fdtable, get the file and need to release
        files_lock before calling filp_close().
        In the meantime the fdtable might have been altered so we can't just
        retake the spinlock and keep the old rcu-reference of the fdtable
        around. Instead we need to grab a fresh reference to the fdtable.
        If my reasoning is correct then there's really no point in fancyfying
        __close_range(): We just need to rcu-dereference the fdtable of the
        calling task once to cap the max_fd value correctly and then go on
        calling __close_fd() in a loop.
      
      /* References */
      [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
      [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
      [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
      [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
      [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
      [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
           Note that this is an internal implementation that is not exported.
           Currently, libc seems to not provide an exported version of this
           because of missing kernel support to do this.
           Note, in a recent patch series Florian made grantpt() a nop thereby
           removing the code referenced here.
      [7]: https://github.com/rust-lang/rust/issues/12148
      [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
           Rust's solution is slightly different but is equally unperformant.
           Rust calls getdtablesize() which is a glibc library function that
           simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
           goes on to call close() on each fd. That's obviously overkill for most
           tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
           OPEN_MAX.
           Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
           to 1024. Even in this case, there's a very high chance that in the
           common case Rust is calling the close() syscall 1021 times pointlessly
           if the task just has 0, 1, and 2 open.
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kyle Evans <self@kyle-evans.net>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      278a5fba
  8. 03 6月, 2020 1 次提交
    • J
      vfs: track per-sb writeback errors and report them to syncfs · 735e4ae5
      Jeff Layton 提交于
      Patch series "vfs: have syncfs() return error when there are writeback
      errors", v6.
      
      Currently, syncfs does not return errors when one of the inodes fails to
      be written back.  It will return errors based on the legacy AS_EIO and
      AS_ENOSPC flags when syncing out the block device fails, but that's not
      particularly helpful for filesystems that aren't backed by a blockdev.
      It's also possible for a stray sync to lose those errors.
      
      The basic idea in this set is to track writeback errors at the
      superblock level, so that we can quickly and easily check whether
      something bad happened without having to fsync each file individually.
      syncfs is then changed to reliably report writeback errors after they
      occur, much in the same fashion as fsync does now.
      
      This patch (of 2):
      
      Usually we suggest that applications call fsync when they want to ensure
      that all data written to the file has made it to the backing store, but
      that can be inefficient when there are a lot of open files.
      
      Calling syncfs on the filesystem can be more efficient in some
      situations, but the error reporting doesn't currently work the way most
      people expect.  If a single inode on a filesystem reports a writeback
      error, syncfs won't necessarily return an error.  syncfs only returns an
      error if __sync_blockdev fails, and on some filesystems that's a no-op.
      
      It would be better if syncfs reported an error if there were any
      writeback failures.  Then applications could call syncfs to see if there
      are any errors on any open files, and could then call fsync on all of
      the other descriptors to figure out which one failed.
      
      This patch adds a new errseq_t to struct super_block, and has
      mapping_set_error also record writeback errors there.
      
      To report those errors, we also need to keep an errseq_t in struct file
      to act as a cursor.  This patch adds a dedicated field for that purpose,
      which slots nicely into 4 bytes of padding at the end of struct file on
      x86_64.
      
      An earlier version of this patch used an O_PATH file descriptor to cue
      the kernel that the open file should track the superblock error and not
      the inode's writeback error.
      
      I think that API is just too weird though.  This is simpler and should
      make syncfs error reporting "just work" even if someone is multiplexing
      fsync and syncfs on the same fds.
      Signed-off-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andres Freund <andres@anarazel.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
      Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      735e4ae5
  9. 14 5月, 2020 2 次提交
    • M
      vfs: add faccessat2 syscall · c8ffd8bc
      Miklos Szeredi 提交于
      POSIX defines faccessat() as having a fourth "flags" argument, while the
      linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
      AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
      
      Add a new faccessat(2) syscall with the added flags argument and implement
      both flags.
      
      The value of AT_EACCESS is defined in glibc headers to be the same as
      AT_REMOVEDIR.  Use this value for the kernel interface as well, together
      with the explanatory comment.
      
      Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
      be useful and is trivial to implement.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      c8ffd8bc
    • M
      vfs: split out access_override_creds() · 94704515
      Miklos Szeredi 提交于
      Split out a helper that overrides the credentials in preparation for
      actually doing the access check.
      
      This prepares for the next patch that optionally disables the creds
      override.
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      94704515
  10. 13 3月, 2020 1 次提交
    • A
      cifs_atomic_open(): fix double-put on late allocation failure · d9a9f484
      Al Viro 提交于
      several iterations of ->atomic_open() calling conventions ago, we
      used to need fput() if ->atomic_open() failed at some point after
      successful finish_open().  Now (since 2016) it's not needed -
      struct file carries enough state to make fput() work regardless
      of the point in struct file lifecycle and discarding it on
      failure exits in open() got unified.  Unfortunately, I'd missed
      the fact that we had an instance of ->atomic_open() (cifs one)
      that used to need that fput(), as well as the stale comment in
      finish_open() demanding such late failure handling.  Trivially
      fixed...
      
      Fixes: fe9ec829 "do_last(): take fput() on error after opening to out:"
      Cc: stable@kernel.org # v4.7+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d9a9f484
  11. 28 2月, 2020 1 次提交
    • A
      make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW · 31d1726d
      Al Viro 提交于
      O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
      As it is, we might or might not have LOOKUP_FOLLOW in op->intent
      in that case - that depends upon having O_NOFOLLOW in open flags.
      It doesn't matter, since we won't be checking it in that case -
      do_last() bails out earlier.
      
      However, making sure it's not set (i.e. acting as if we had an explicit
      O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
      check for O_CREAT | O_EXCL in do_last() with the call of step_into()
      immediately following it.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      31d1726d
  12. 21 1月, 2020 1 次提交
  13. 18 1月, 2020 1 次提交
    • A
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai 提交于
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fddb5d43
  14. 27 11月, 2019 1 次提交
    • L
      Revert "vfs: properly and reliably lock f_pos in fdget_pos()" · 2be7d348
      Linus Torvalds 提交于
      This reverts commit 0be0ee71.
      
      I was hoping it would be benign to switch over entirely to FMODE_STREAM,
      and we'd have just a couple of small fixups we'd need, but it looks like
      we're not quite there yet.
      
      While it worked fine on both my desktop and laptop, they are fairly
      similar in other respects, and run mostly the same loads.  Kenneth
      Crudup reports that it seems to break both his vmware installation and
      the KDE upower service.  In both cases apparently leading to timeouts
      due to waitinmg for the f_pos lock.
      
      There are a number of character devices in particular that definitely
      want stream-like behavior, but that currently don't get marked as
      streams, and as a result get the exclusion between concurrent
      read()/write() on the same file descriptor.  Which doesn't work well for
      them.
      
      The most obvious example if this is /dev/console and /dev/tty, which use
      console_fops and tty_fops respectively (and ptmx_fops for the pty master
      side).  It may be that it's just this that causes problems, but we
      clearly weren't ready yet.
      
      Because there's a number of other likely common cases that don't have
      llseek implementations and would seem to act as stream devices:
      
        /dev/fuse		(fuse_dev_operations)
        /dev/mcelog		(mce_chrdev_ops)
        /dev/mei0		(mei_fops)
        /dev/net/tun		(tun_fops)
        /dev/nvme0		(nvme_dev_fops)
        /dev/tpm0		(tpm_fops)
        /proc/self/ns/mnt	(ns_file_operations)
        /dev/snd/pcm*		(snd_pcm_f_ops[])
      
      and while some of these could be trivially automatically detected by the
      vfs layer when the character device is opened by just noticing that they
      have no read or write operations either, it often isn't that obvious.
      
      Some character devices most definitely do use the file position, even if
      they don't allow seeking: the firmware update code, for example, uses
      simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
      back and forth.
      
      We'll revisit this when there's a better way to detect the problem and
      fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
      annotations).
      Reported-by: NKenneth R. Crudup <kenny@panix.com>
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2be7d348
  15. 26 11月, 2019 1 次提交
    • L
      vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71
      Linus Torvalds 提交于
      fdget_pos() is used by file operations that will read and update f_pos:
      things like "read()", "write()" and "lseek()" (but not, for example,
      "pread()/pwrite" that get their file positions elsewhere).
      
      However, it had two separate escape clauses for this, because not
      everybody wants or needs serialization of the file position.
      
      The first and most obvious case is the "file descriptor doesn't have a
      position at all", ie a stream-like file.  Except we didn't actually use
      FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
      was that FMODE_STREAM didn't exist back in the days, but also that we
      didn't want to mark all the special cases, so we only marked the ones
      that _required_ position atomicity according to POSIX - regular files
      and directories.
      
      The case one was intentionally lazy, but now that we _do_ have
      FMODE_STREAM we could and should just use it.  With the change to use
      FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
      the code to set it is deleted.
      
      Any cases where we don't want the serialization because the driver (or
      subsystem) doesn't use the file position should just be updated to do
      "stream_open()".  We've done that for all the obvious and common
      situations, we may need a few more.  Quoting Kirill Smelkov in the
      original FMODE_STREAM thread (see link below for full email):
      
       "And I appreciate if people could help at least somehow with "getting
        rid of mixed case entirely" (i.e. always lock f_pos_lock on
        !FMODE_STREAM), because this transition starts to diverge from my
        particular use-case too far. To me it makes sense to do that
        transition as follows:
      
         - convert nonseekable_open -> stream_open via stream_open.cocci;
         - audit other nonseekable_open calls and convert left users that
           truly don't depend on position to stream_open;
         - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
           will cover pipes and sockets), or maybe convert pipes and sockets
           to FMODE_STREAM manually;
         - extend stream_open.cocci to analyze file_operations that use
           no_llseek or noop_llseek, but do not use nonseekable_open or
           alloc_file_pseudo. This might find files that have stream semantic
           but are opened differently;
         - extend stream_open.cocci to analyze file_operations whose
           .read/.write do not use ppos at all (independently of how file was
           opened);
         - ...
         - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
           !FMODE_STREAM;
         - gather bug reports for deadlocked read/write and convert missed
           cases to FMODE_STREAM, probably extending stream_open.cocci along
           the road to catch similar cases
      
        i.e. always take f_pos_lock unless a file is explicitly marked as
        being stream, and try to find and cover all files that are streams"
      
      We have not done the "extend stream_open.cocci to analyze
      alloc_file_pseudo" as well, but the previous commit did manually handle
      the case of pipes and sockets.
      
      The other case where we can avoid locking f_pos is the "this file
      descriptor only has a single user and it is us, and thus there is no
      need to lock it".
      
      The second test was correct, although a bit subtle and worth just
      re-iterating here.  There are two kinds of other sources of references
      to the same file descriptor: file descriptors that have been explicitly
      shared across fork() or with dup(), and file tables having elevated
      reference counts due to threading (or explicit file sharing with
      clone()).
      
      The first case would have incremented the file count explicitly, and in
      the second case the previous __fdget() would have incremented it for us
      and set the FDPUT_FPUT flag.
      
      But in both cases the file count would be greater than one, so the
      "file_count(file) > 1" test catches both situations.  Also note that if
      file_count is 1, that also means that no other thread can have access to
      the file table, so there also cannot be races with concurrent calls to
      dup()/fork()/clone() that would increment the file count any other way.
      
      Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eic Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0be0ee71
  16. 27 9月, 2019 1 次提交
  17. 25 9月, 2019 1 次提交
  18. 25 7月, 2019 1 次提交
    • L
      access: avoid the RCU grace period for the temporary subjective credentials · d7852fbd
      Linus Torvalds 提交于
      It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
      work because it installs a temporary credential that gets allocated and
      freed for each system call.
      
      The allocation and freeing overhead is mostly benign, but because
      credentials can be accessed under the RCU read lock, the freeing
      involves a RCU grace period.
      
      Which is not a huge deal normally, but if you have a lot of access()
      calls, this causes a fair amount of seconday damage: instead of having a
      nice alloc/free patterns that hits in hot per-CPU slab caches, you have
      all those delayed free's, and on big machines with hundreds of cores,
      the RCU overhead can end up being enormous.
      
      But it turns out that all of this is entirely unnecessary.  Exactly
      because access() only installs the credential as the thread-local
      subjective credential, the temporary cred pointer doesn't actually need
      to be RCU free'd at all.  Once we're done using it, we can just free it
      synchronously and avoid all the RCU overhead.
      
      So add a 'non_rcu' flag to 'struct cred', which can be set by users that
      know they only use it in non-RCU context (there are other potential
      users for this).  We can make it a union with the rcu freeing list head
      that we need for the RCU case, so this doesn't need any extra storage.
      
      Note that this also makes 'get_current_cred()' clear the new non_rcu
      flag, in case we have filesystems that take a long-term reference to the
      cred and then expect the RCU delayed freeing afterwards.  It's not
      entirely clear that this is required, but it makes for clear semantics:
      the subjective cred remains non-RCU as long as you only access it
      synchronously using the thread-local accessors, but you _can_ use it as
      a generic cred if you want to.
      
      It is possible that we should just remove the whole RCU markings for
      ->cred entirely.  Only ->real_cred is really supposed to be accessed
      through RCU, and the long-term cred copies that nfs uses might want to
      explicitly re-enable RCU freeing if required, rather than have
      get_current_cred() do it implicitly.
      
      But this is a "minimal semantic changes" change for the immediate
      problem.
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Glauber <jglauber@marvell.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7852fbd
  19. 21 5月, 2019 1 次提交
  20. 06 5月, 2019 1 次提交
    • K
      vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files · 438ab720
      Kirill Smelkov 提交于
      This amends commit 10dce8af ("fs: stream_open - opener for
      stream-like files so that read and write can run simultaneously without
      deadlock") in how position is passed into .read()/.write() handler for
      stream-like files:
      
      Rasmus noticed that we currently pass 0 as position and ignore any position
      change if that is done by a file implementation. This papers over bugs if ppos
      is used in files that declare themselves as being stream-like as such bugs will
      go unnoticed. Even if a file implementation is correctly converted into using
      stream_open, its read/write later could be changed to use ppos and even though
      that won't be working correctly, that bug might go unnoticed without someone
      doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
      read/write for stream-like files as that don't give any chance for ppos usage
      bugs because it will oops if ppos is ever used inside .read() or .write().
      
      Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
      because they are called by vfs_read/vfs_write & friends before
      file_operations .read/.write .
      
      Note 2: if file backend uses new-style .read_iter/.write_iter, position
      is still passed into there as non-pointer kiocb.ki_pos . Currently
      stream_open.cocci (semantic patch added by 10dce8af) ignores files
      whose file_operations has *_iter methods.
      Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NKirill Smelkov <kirr@nexedi.com>
      438ab720
  21. 07 4月, 2019 1 次提交
    • K
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov 提交于
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: NKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
  22. 30 3月, 2019 1 次提交
  23. 18 7月, 2018 5 次提交
  24. 12 7月, 2018 1 次提交
    • A
      new helper: open_with_fake_path() · 2abc77af
      Al Viro 提交于
      open a file by given inode, faking ->f_path.  Use with shitloads
      of caution - at the very least you'd damn better make sure that
      some dentry alias of that inode is pinned down by the path in
      question.  Again, this is no general-purpose interface and I hope
      it will eventually go away.  Right now overlayfs wants something
      like that, but nothing else should.
      
      Any out-of-tree code with bright idea of using this one *will*
      eventually get hurt, with zero notice and great delight on my part.
      I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
      it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
      as "you are welcome to use it".
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2abc77af