1. 04 Dec, 2021 1 commit
    • fget: check that the fd still exists after getting a ref to it · 054aa8d4
      Linus Torvalds committed
      Jann Horn points out that there is another possible race wrt Unix domain
      socket garbage collection, somewhat reminiscent of the one fixed in
      commit cbcf0112 ("af_unix: fix garbage collect vs MSG_PEEK").
      
      See the extended comment about the garbage collection requirements added
      to unix_peek_fds() by that commit for details.
      
      The race comes from how we can locklessly look up a file descriptor just
      as it is in the process of being closed, and with the right artificial
      timing (Jann added a few strategic 'mdelay(500)' calls to do that), the
      Unix domain socket garbage collector could see the reference count
      decrement of the close() happen before fget() took its reference to the
      file and the file was attached onto a new file descriptor.
      
      This is all (intentionally) correct on the 'struct file *' side, with
      RCU lookups and lockless reference counting very much part of the
      design.  Getting that reference count out of order isn't a problem per
      se.
      
      But the garbage collector can get confused by seeing this situation of
      having seen a file not having any remaining external references and then
      seeing it being attached to an fd.
      
      In commit cbcf0112 ("af_unix: fix garbage collect vs MSG_PEEK") the
      fix was to serialize the file descriptor install with the garbage
      collector by taking and releasing the unix_gc_lock.
      
      That's not really an option here, but since this all happens when we are
      in the process of looking up a file descriptor, we can instead simply
      just re-check that the file hasn't been closed in the meantime, and just
      re-do the lookup if we raced with a concurrent close() of the same file
      descriptor.
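The re-check described above can be sketched in user space. This is only a toy model, not the kernel's code: `toy_file`, `toy_fdt`, and `toy_fget()` are hypothetical names, C11 atomics stand in for the kernel's RCU and refcount primitives, and nothing is ever freed, so the lifetime side of the problem is deliberately left out.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for 'struct file': only a reference count. */
struct toy_file {
	atomic_long refcount;
};

#define TOY_NR_SLOTS 8

/* Toy stand-in for the fd table: one atomic pointer per descriptor. */
static _Atomic(struct toy_file *) toy_fdt[TOY_NR_SLOTS];

/* Take a reference only if the count has not already dropped to zero. */
static int toy_get_ref_not_zero(struct toy_file *f)
{
	long old = atomic_load(&f->refcount);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return 1;
	}
	return 0;
}

static void toy_put_ref(struct toy_file *f)
{
	atomic_fetch_sub(&f->refcount, 1);
}

/*
 * Lockless lookup with the extra check: after taking the reference,
 * confirm that the slot still points at the same file. If a concurrent
 * close() replaced or cleared the slot, drop the reference and retry.
 */
struct toy_file *toy_fget(unsigned int fd)
{
	struct toy_file *f;

	if (fd >= TOY_NR_SLOTS)
		return NULL;

	for (;;) {
		f = atomic_load(&toy_fdt[fd]);
		if (!f)
			return NULL;		/* descriptor is not open */
		if (!toy_get_ref_not_zero(f))
			continue;		/* file is going away, look again */
		if (atomic_load(&toy_fdt[fd]) == f)
			return f;		/* still installed: reference is good */
		toy_put_ref(f);			/* raced with close(): redo the lookup */
	}
}
```

The final comparison is the step the commit message describes as re-checking that the file hasn't been closed in the meantime. Without it, the lookup could return a file whose descriptor had already been closed.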
      Reported-and-tested-by: Jann Horn <jannh@google.com>
      Acked-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Sep, 2021 1 commit
  3. 16 Apr, 2021 1 commit
  4. 02 Apr, 2021 3 commits
    • file: simplify logic in __close_range() · 03ba0fe4
      Christian Brauner committed
      It never looked too pleasant and it doesn't really buy us anything
      anymore now that CLOSE_RANGE_CLOEXEC exists and we need to retake the
      current maximum under the lock for it anyway. This also makes the logic
      easier to follow.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • file: fix close_range() for unshare+cloexec · 9b5b8722
      Christian Brauner committed
      syzbot reported a bug when putting the last reference to a task's file
      descriptor table. Debugging this showed we didn't recalculate the
      current maximum fd number for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
      after we unshared the file descriptor table. So max_fd could exceed the
      current fdtable maximum causing us to set excessive bits. As a concrete
      example, let's say the user requested everything from fd 4 to ~0UL to be
      closed and their current fdtable size is 256 with their highest open fd
      being 4. With CLOSE_RANGE_UNSHARE the caller will end up with a new
      fdtable which has room for 64 file descriptors since that is the lowest
      fdtable size we accept. But now max_fd will still point to 255 and needs
      to be adjusted. Fix this by retrieving the correct maximum fd value in
      __range_cloexec().
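The concrete example above can be reproduced with a small user-space model. This is a sketch under stated assumptions: `toy_range_cloexec()` and `TOY_TABLE_SIZE` are hypothetical, and a single 64-bit word stands in for the close-on-exec bitmap of the 64-entry table mentioned in the message.

```c
#include <stdint.h>

/* Smallest table size, taken from the example in the commit message. */
#define TOY_TABLE_SIZE 64u

/*
 * Return the close-on-exec bits that would be set for [fd, max_fd] in a
 * 64-entry toy table. max_fd is first clamped to the last slot the
 * table actually has, which is the adjustment the fix introduces.
 */
uint64_t toy_range_cloexec(unsigned int fd, unsigned int max_fd)
{
	uint64_t bits = 0;
	unsigned int i;

	if (max_fd > TOY_TABLE_SIZE - 1)
		max_fd = TOY_TABLE_SIZE - 1;	/* 255 or ~0U becomes 63 */
	if (fd > max_fd)
		return 0;			/* nothing in range */

	for (i = fd; i <= max_fd; i++)
		bits |= UINT64_C(1) << i;
	return bits;
}
```

With the clamp in place, a request for fds 4 through ~0U touches only bits 4 to 63. Without it, the loop would write past the end of the bitmap, which corresponds to the "excessive bits" in the report.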
      
      Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
      Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
      Fixes: fec8a6a6 ("close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • file: let pick_file() tell caller it's done · f49fd6d3
      Christian Brauner committed
      Let pick_file() report back that the fd it was passed exceeded the
      maximum fd in that fdtable. This allows us to simplify the caller of
      this helper because it doesn't need to care anymore whether the passed
      in max_fd is excessive. It can rely on pick_file() telling it that it's
      past the last valid fd.
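A minimal sketch of that contract, assuming a toy table rather than the kernel's fdtable: `toy_pick()` and `toy_close_range()` are hypothetical names, and a three-valued status replaces whatever return convention the real pick_file() uses.

```c
#include <stddef.h>

enum pick_status {
	PICK_FOUND,	/* slot held a file; it has been detached */
	PICK_EMPTY,	/* valid slot, nothing installed */
	PICK_PAST_END	/* fd is beyond the table: caller can stop */
};

struct toy_table {
	void **slots;
	unsigned int size;
};

/* Detach and return the entry at fd, or report why there is none. */
enum pick_status toy_pick(struct toy_table *t, unsigned int fd, void **out)
{
	*out = NULL;
	if (fd >= t->size)
		return PICK_PAST_END;
	if (!t->slots[fd])
		return PICK_EMPTY;
	*out = t->slots[fd];
	t->slots[fd] = NULL;
	return PICK_FOUND;
}

/*
 * The caller no longer clamps max_fd itself: it keeps going until the
 * helper says it is past the last valid fd. Returns how many entries
 * were detached.
 */
unsigned int toy_close_range(struct toy_table *t, unsigned int fd,
			     unsigned int max_fd)
{
	unsigned int closed = 0;

	for (; fd <= max_fd; fd++) {
		void *file;
		enum pick_status st = toy_pick(t, fd, &file);

		if (st == PICK_PAST_END)
			break;
		if (st == PICK_FOUND)
			closed++;
	}
	return closed;
}
```

Because the helper reports the end of the table, passing ~0U as max_fd is harmless: the loop terminates at the table boundary instead of relying on the caller to compute it.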
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  5. 02 Feb, 2021 1 commit
  6. 31 Dec, 2020 1 commit
    • kernel/io_uring: cancel io_uring before task works · b1b6b5a3
      Pavel Begunkov committed
      To cancel io_uring requests, io_uring must either be able to run the
      currently enqueued task_works or have task_work shut down by that point.
      Otherwise io_uring_cancel_files() may be waiting for requests that won't
      ever complete.
      
      Take the first approach and do the cancellations before setting
      PF_EXITING, and thus before the task_work infrastructure enters a
      transition state in which task_work_run() should not be called.
      
      Cc: stable@vger.kernel.org # 5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 19 Dec, 2020 1 commit
    • close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC · fec8a6a6
      Christian Brauner committed
      After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when
      CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE.
      When CLOSE_RANGE_UNSHARE is specified the caller will receive a private
      file descriptor table in case their file descriptor table is currently
      shared.
      
      For the case where the caller has requested all file descriptors to be
      actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that
      the caller does not need any of the file descriptors anymore and will
      optimize the close operation by only copying all files in the range from
      0 to 3 and no others.
      
      However, if the caller requested CLOSE_RANGE_CLOEXEC together with
      CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
      descriptors so the kernel needs to copy all of them and can't optimize.
      
      The original patch didn't account for this and thus could cause oopses
      as evidenced by the syzbot report because it assumed that all fds had
      been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.
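The decision described above can be written as a small pure function. This is a sketch, not the kernel's implementation: `toy_unshare_limit()` is a hypothetical name, `cur_max` is assumed to be the highest fd the current table can hold, and the flag values are defined locally in case the installed headers do not provide them.

```c
#include <limits.h>

#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif
#ifndef CLOSE_RANGE_CLOEXEC
#define CLOSE_RANGE_CLOEXEC (1U << 2)
#endif

/*
 * How many of the lowest descriptors must be copied into the private
 * table when CLOSE_RANGE_UNSHARE is requested.
 *
 * - Plain close up to the end of the table: everything from fd upwards
 *   is about to be closed, so copying [0, fd) is enough.
 * - CLOSE_RANGE_CLOEXEC: the descriptors stay open until exec, so the
 *   whole table has to be copied (UINT_MAX means "no cap" here).
 */
unsigned int toy_unshare_limit(unsigned int fd, unsigned int max_fd,
			       unsigned int cur_max, unsigned int flags)
{
	if (!(flags & CLOSE_RANGE_CLOEXEC) && max_fd >= cur_max)
		return fd;
	return UINT_MAX;
}
```

For `close_range(3, ~0U, CLOSE_RANGE_UNSHARE)` the model caps the copy at 3 entries. Adding CLOSE_RANGE_CLOEXEC removes the cap, which is the case the original patch missed.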
      
      syzbot reported
      ==================================================================
      BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
      BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
      BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
      BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
      Read of size 8 at addr 0000000000000077 by task syz-executor511/8522
      
      CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x107/0x163 lib/dump_stack.c:120
       __kasan_report mm/kasan/report.c:549 [inline]
       kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
       check_memory_region_inline mm/kasan/generic.c:186 [inline]
       check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
       instrument_atomic_read include/linux/instrumented.h:71 [inline]
       atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
       atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
       filp_close+0x22/0x170 fs/open.c:1274
       close_files fs/file.c:402 [inline]
       put_files_struct fs/file.c:417 [inline]
       put_files_struct+0x1cc/0x350 fs/file.c:414
       exit_files+0x12a/0x170 fs/file.c:435
       do_exit+0xb4f/0x2a00 kernel/exit.c:818
       do_group_exit+0x125/0x310 kernel/exit.c:920
       get_signal+0x428/0x2100 kernel/signal.c:2792
       arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x447039
      Code: Unable to access opcode bytes at RIP 0x44700f.
      RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
      RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
      RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
      RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
      R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
      ==================================================================
      
      syzbot has tested the proposed patch and the reproducer did not trigger any issue:
      
      Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
      
      Tested on:
      
      commit:         10f7cddd selftests/core: add regression test for CLOSE_RAN..
      git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
      kernel config:  https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
      dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
      compiler:       gcc (GCC) 10.1.0-syz 20200507
      
      Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
      Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  8. 11 Dec, 2020 12 commits
  9. 04 Dec, 2020 1 commit
    • fs, close_range: add flag CLOSE_RANGE_CLOEXEC · 582f1fb6
      Giuseppe Scrivano committed
      When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
      immediately close the files but it sets the close-on-exec bit.
      
      It is useful for e.g. container runtimes that usually install a
      seccomp profile "as late as possible" before execv'ing the container
      process itself.  The container runtime could either do:
        1                                  2
      - install_seccomp_profile();       - close_range(MIN_FD, MAX_INT, 0);
      - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
      - execve(...);                     - execve(...);
      
      Both alternatives have some disadvantages.
      
      In the first variant the seccomp_profile cannot block the close_range
      syscall, as well as opendir/read/close/... for the fallback on older
      kernels.
      In the second variant, close_range() can be used only on the fds
      that are not going to be needed by the runtime anymore, and it must be
      potentially called multiple times to account for the different ranges
      that must be closed.
      
      Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
      The runtime is able to use the existing open fds, the seccomp profile
      can block close_range() and the syscalls used for its fallback.
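From user space, the flag might be used roughly as follows. This is a sketch with assumptions: `mark_range_cloexec()` is a hypothetical helper, the raw syscall is used instead of a libc wrapper, the syscall number 436 is assumed to match the generic table when the headers do not define it, and the fcntl() loop is an illustrative fallback for kernels or sandboxes where the call fails.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436	/* assumption: generic syscall table value */
#endif
#ifndef CLOSE_RANGE_CLOEXEC
#define CLOSE_RANGE_CLOEXEC (1U << 2)
#endif

/*
 * Mark every open descriptor in [first, last] close-on-exec without
 * closing it now. Returns 0 on success, -1 on failure.
 */
int mark_range_cloexec(unsigned int first, unsigned int last)
{
	unsigned long fd;
	long open_max;

	if (first > last) {
		errno = EINVAL;
		return -1;
	}

	/* Preferred path: one syscall for the whole range. */
	if (syscall(__NR_close_range, first, last, CLOSE_RANGE_CLOEXEC) == 0)
		return 0;

	/* Fallback: set FD_CLOEXEC one descriptor at a time. */
	open_max = sysconf(_SC_OPEN_MAX);
	if (open_max <= 0)
		return -1;
	if ((unsigned long)last >= (unsigned long)open_max)
		last = (unsigned int)(open_max - 1);

	for (fd = first; fd <= last; fd++) {
		int fl = fcntl((int)fd, F_GETFD);

		if (fl < 0)
			continue;	/* not an open descriptor */
		if (fcntl((int)fd, F_SETFD, fl | FD_CLOEXEC) < 0)
			return -1;
	}
	return 0;
}
```

A container runtime in the scenario above could call `mark_range_cloexec(3, ~0U)` before installing its seccomp profile and keep using its descriptors until execve().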
      Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  10. 01 Oct, 2020 1 commit
    • io_uring: don't rely on weak ->files references · 0f212204
      Jens Axboe committed
      Grab actual references to the files_struct. To avoid circular reference
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the task execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 31 Jul, 2020 1 commit
  12. 14 Jul, 2020 3 commits
    • fs: Expand __receive_fd() to accept existing fd · 17381715
      Kees Cook committed
      Expand __receive_fd() with support for replace_fd() for the coming seccomp
      "addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
      and update existing wrappers to retain old behavior.
      
      Thanks to Colin Ian King <colin.king@canonical.com> for pointing out an
      uninitialized variable exposure in an earlier version of this patch.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Kadashev <dkadashev@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • fs: Add receive_fd() wrapper for __receive_fd() · deefa7f3
      Kees Cook committed
      For both pidfd and seccomp, the __user pointer is not used. Update
      __receive_fd() to make writing to ufd optional via a NULL check. However,
      for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
      can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
      new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
      argument. For the new helper, the allocated fd needs to be returned on
      success. Update the existing callers to handle it.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • fs: Move __scm_install_fd() to __receive_fd() · 66590610
      Kees Cook committed
      In preparation for users of the "install a received file" logic outside
      of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
      net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
      named receive_fd_user(), as future patches will change the interface
      to __receive_fd().
      
      Additionally add a comment to fd_install() as a counterpoint to how
      __receive_fd() interacts with fput().
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Dmitry Kadashev <dkadashev@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Ido Schimmel <idosch@idosch.org>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
  13. 17 Jun, 2020 2 commits
    • close_range: add CLOSE_RANGE_UNSHARE · 60997c3d
      Christian Brauner committed
      One of the use-cases of close_range() is to drop file descriptors just before
      execve(). This would usually be expressed in the sequence:
      
      unshare(CLONE_FILES);
      close_range(3, ~0U);
      
      As pointed out by Linus, it might be desirable to have this be a part of
      close_range() itself under a new flag CLOSE_RANGE_UNSHARE.
      
      This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
      maximum number of file descriptors to copy from the old struct files. When the
      user requests that all file descriptors are supposed to be closed via
      close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
      to do any of the heavy fput() work for everything above min.
      
      The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
      fact currently share our file descriptor table we create a new private copy.
      We then close all fds in the requested range and finally after we're done we
      install the new fd table.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • open: add close_range() · 278a5fba
      Christian Brauner committed
      This adds the close_range() syscall. It allows a task to efficiently close a
      range of its file descriptors, up to and including all of them.
      
      I was contacted by FreeBSD as they wanted to have the same close_range()
      syscall as we proposed here. We've coordinated this and in the meantime, Kyle
      was fast enough to merge close_range() into FreeBSD already in April:
      https://reviews.freebsd.org/D21627
      https://svnweb.freebsd.org/base?view=revision&revision=359836
      and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
      once it's merged in Linux too. Python is in the process of switching to
      close_range() on FreeBSD and they are waiting on us to merge this to switch on
      Linux as well: https://bugs.python.org/issue38061
      
      The syscall came up in a recent discussion around the new mount API and
      making new file descriptor types cloexec by default. During this
      discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
      syscall in this manner has been requested by various people over time.
      
      First, it helps to close all file descriptors of an exec()ing task. This
      can be done safely via (quoting Al's example from [1] verbatim):
      
              /* that exec is sensitive */
              unshare(CLONE_FILES);
              /* we don't want anything past stderr here */
              close_range(3, ~0U);
              execve(....);
      
      The code snippet above is one way of working around the problem that file
      descriptors are not cloexec by default. This is aggravated by the fact that
      we can't just switch them over without massively regressing userspace. For
      a whole class of programs having an in-kernel method of closing all file
      descriptors is very helpful (e.g. daemons, service managers, programming
      language standard libraries, container managers etc.).
      (Please note, unshare(CLONE_FILES) should only be needed if the calling
      task is multi-threaded and shares the file descriptor table with another
      thread in which case two threads could race with one thread allocating file
      descriptors and the other one closing them via close_range(). For the
      general case close_range() before the execve() is sufficient.)
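A hedged sketch of the single-threaded case: `close_fd_range()` is a hypothetical helper, it issues the raw syscall (number 436 is an assumption when the headers do not define it) with a flags value of 0, and it falls back to a plain close() loop when the call is unavailable. It does not call unshare(CLONE_FILES), which the note above says is only needed when the table is shared with another thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436	/* assumption: generic syscall table value */
#endif

/*
 * Close every descriptor in [first, last]. Tries close_range(2) first
 * and, if that fails for any reason (old kernel, seccomp filter), calls
 * close() on each number instead. Errors from individual close() calls
 * are ignored in the fallback.
 */
int close_fd_range(unsigned int first, unsigned int last)
{
	unsigned long fd;
	long open_max;

	if (first > last) {
		errno = EINVAL;
		return -1;
	}

	if (syscall(__NR_close_range, first, last, 0U) == 0)
		return 0;

	open_max = sysconf(_SC_OPEN_MAX);
	if (open_max <= 0)
		return -1;
	if ((unsigned long)last >= (unsigned long)open_max)
		last = (unsigned int)(open_max - 1);

	for (fd = first; fd <= last; fd++)
		close((int)fd);
	return 0;
}
```

Calling `close_fd_range(3, ~0U)` immediately before execve() gives the "nothing past stderr" behaviour from the quoted example.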
      
      Second, it allows userspace to avoid implementing closing all file
      descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
      file descriptor. From looking at various large(ish) userspace code bases
      this or similar patterns are very common in:
      - service managers (cf. [4])
      - libcs (cf. [6])
      - container runtimes (cf. [5])
      - programming language runtimes/standard libraries
        - Python (cf. [2])
        - Rust (cf. [7], [8])
      As Dmitry pointed out there's even a long-standing glibc bug about missing
      kernel support for this task (cf. [3]).
      In addition, the syscall will also work for tasks that do not have procfs
      mounted and on kernels that do not have procfs support compiled in. In such
      situations the only way to make sure that all file descriptors are closed
      is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
      OPEN_MAX trickery (cf. comment [8] on Rust).
      
      The performance is striking. For good measure, comparing the following
      simple close_all_fds() userspace implementation that is essentially just
      glibc's version in [6]:
      
      static int close_all_fds(void)
      {
              int dir_fd;
              DIR *dir;
              struct dirent *direntp;
      
              dir = opendir("/proc/self/fd");
              if (!dir)
                      return -1;
              dir_fd = dirfd(dir);
              while ((direntp = readdir(dir))) {
                      int fd;
                      if (strcmp(direntp->d_name, ".") == 0)
                              continue;
                      if (strcmp(direntp->d_name, "..") == 0)
                              continue;
                      fd = atoi(direntp->d_name);
                      if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                              continue;
                      close(fd);
              }
              closedir(dir);
              return 0;
      }
      
      to close_range() yields:
      1. closing 4 open files:
         - close_all_fds(): ~280 us
         - close_range():    ~24 us
      
      2. closing 1000 open files:
         - close_all_fds(): ~5000 us
         - close_range():   ~800 us
      
      close_range() is designed to allow for some flexibility. Specifically, it
      does not simply always close all open file descriptors of a task. Instead,
      callers can specify an upper bound.
      This is e.g. useful for scenarios where specific file descriptors are
      created with well-known numbers that are supposed to be excluded from
      getting closed.
      For extra paranoia close_range() comes with a flags argument. This can e.g.
      be used to implement extensions. One can imagine userspace wanting to stop
      at the first error instead of ignoring errors under certain circumstances.
      There might be other valid ideas in the future. In any case, a flag
      argument doesn't hurt and keeps us on the safe side.
      
      From an implementation side this is kept rather dumb. It saw some input
      from David and Jann but all nonsense is obviously my own!
      - Errors to close file descriptors are currently ignored. (Could be changed
        by setting a flag in the future if needed.)
      - __close_range() is a rather simplistic wrapper around __close_fd().
        My reasoning behind this is based on the nature of how __close_fd() needs
        to release an fd. But maybe I misunderstood specifics:
        We take the files_lock and rcu-dereference the fdtable of the calling
        task, we find the entry in the fdtable, get the file and need to release
        files_lock before calling filp_close().
        In the meantime the fdtable might have been altered so we can't just
        retake the spinlock and keep the old rcu-reference of the fdtable
        around. Instead we need to grab a fresh reference to the fdtable.
        If my reasoning is correct then there's really no point in fancyfying
        __close_range(): We just need to rcu-dereference the fdtable of the
        calling task once to cap the max_fd value correctly and then go on
        calling __close_fd() in a loop.
      
      /* References */
      [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
      [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
      [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
      [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
      [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
      [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
           Note that this is an internal implementation that is not exported.
           Currently, libc seems to not provide an exported version of this
           because of missing kernel support to do this.
           Note, in a recent patch series Florian made grantpt() a nop thereby
           removing the code referenced here.
      [7]: https://github.com/rust-lang/rust/issues/12148
      [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
           Rust's solution is slightly different but is equally unperformant.
           Rust calls getdtablesize() which is a glibc library function that
           simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
           goes on to call close() on each fd. That's obviously overkill for most
           tasks. Rarely, tasks - especially non-daemons - hit RLIMIT_NOFILE or
           OPEN_MAX.
           Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
           to 1024. Even in this case, there's a very high chance that in the
           common case Rust is calling the close() syscall 1021 times pointlessly
           if the task just has 0, 1, and 2 open.
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kyle Evans <self@kyle-evans.net>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
  14. 20 May, 2020 1 commit
  15. 20 Mar, 2020 1 commit
  16. 21 Jan, 2020 1 commit
  17. 14 Jan, 2020 1 commit
  18. 03 Jan, 2020 1 commit
    • Revert "fs: remove ksys_dup()" · 74f1a299
      Dominik Brodowski committed
      This reverts commit 8243186f ("fs: remove ksys_dup()") and the
      subsequent fix for it in commit 2d3145f8 ("early init: fix error
      handling when opening /dev/console").
      
      Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
      caused more trouble than it is worth: it requires accessing vfs
      internals and it turns out there were other bugs in it too.
      
      In particular, the file reference counting was wrong - because unlike
      the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
      thus had an extra leaked file reference.
      
      That in turn then caused odd problems with Android-x86 long after boot
      because of how the extra reference to the console kept the session active
      even after all file descriptors had been closed.
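The reference arithmetic behind the leak can be made explicit with a toy counter. This is only a bookkeeping model based on the description above: the `toy_*` names are hypothetical, and the assumption that filp_open() leaves the caller holding one reference with no descriptor, while each f_dupfd() adds one reference and one descriptor, is a reading of the commit message rather than a quote of the kernel source.

```c
/* Bookkeeping for one open console file. */
struct toy_console {
	int refs;	/* references held on the file */
	int fds;	/* descriptors that point at it */
};

/* open(): new file, one reference, owned by one descriptor. */
static void toy_open(struct toy_console *c)      { c->refs = 1; c->fds = 1; }
/* dup(): one more descriptor, one more reference. */
static void toy_dup(struct toy_console *c)       { c->refs++; c->fds++; }
/* filp_open(): one reference held by the caller, no descriptor yet. */
static void toy_filp_open(struct toy_console *c) { c->refs = 1; c->fds = 0; }
/* f_dupfd(): installs a descriptor and takes its own reference. */
static void toy_f_dupfd(struct toy_console *c)   { c->refs++; c->fds++; }

/* References not owned by any descriptor after "open+2*dup". */
int toy_leak_syscall_path(void)
{
	struct toy_console c;

	toy_open(&c);
	toy_dup(&c);
	toy_dup(&c);
	return c.refs - c.fds;
}

/* References not owned by any descriptor after "filp_open+3*f_dupfd". */
int toy_leak_filp_path(void)
{
	struct toy_console c;

	toy_filp_open(&c);
	toy_f_dupfd(&c);
	toy_f_dupfd(&c);
	toy_f_dupfd(&c);
	return c.refs - c.fds;
}
```

Both paths end with three descriptors, but the second holds four references. Closing all three descriptors therefore still leaves one reference to the console, which matches the symptom described above.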
      Reported-by: youling 257 <youling257@gmail.com>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 13 Dec, 2019 1 commit
  20. 27 Nov, 2019 1 commit
    • Revert "vfs: properly and reliably lock f_pos in fdget_pos()" · 2be7d348
      Linus Torvalds committed
      This reverts commit 0be0ee71.
      
      I was hoping it would be benign to switch over entirely to FMODE_STREAM,
      and we'd have just a couple of small fixups we'd need, but it looks like
      we're not quite there yet.
      
      While it worked fine on both my desktop and laptop, they are fairly
      similar in other respects, and run mostly the same loads.  Kenneth
      Crudup reports that it seems to break both his vmware installation and
      the KDE upower service.  In both cases apparently leading to timeouts
      due to waiting for the f_pos lock.
      
      There are a number of character devices in particular that definitely
      want stream-like behavior, but that currently don't get marked as
      streams, and as a result get the exclusion between concurrent
      read()/write() on the same file descriptor.  Which doesn't work well for
      them.
      
      The most obvious example of this is /dev/console and /dev/tty, which use
      console_fops and tty_fops respectively (and ptmx_fops for the pty master
      side).  It may be that it's just this that causes problems, but we
      clearly weren't ready yet.
      
      Because there's a number of other likely common cases that don't have
      llseek implementations and would seem to act as stream devices:
      
        /dev/fuse		(fuse_dev_operations)
        /dev/mcelog		(mce_chrdev_ops)
        /dev/mei0		(mei_fops)
        /dev/net/tun		(tun_fops)
        /dev/nvme0		(nvme_dev_fops)
        /dev/tpm0		(tpm_fops)
        /proc/self/ns/mnt	(ns_file_operations)
        /dev/snd/pcm*		(snd_pcm_f_ops[])
      
      and while some of these could be trivially automatically detected by the
      vfs layer when the character device is opened by just noticing that they
      have no read or write operations either, it often isn't that obvious.
      
      Some character devices most definitely do use the file position, even if
      they don't allow seeking: the firmware update code, for example, uses
      simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
      back and forth.
      
      We'll revisit this when there's a better way to detect the problem and
      fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
      annotations).
      Reported-by: Kenneth R. Crudup <kenny@panix.com>
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 26 Nov, 2019 1 commit
    • vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71
      Linus Torvalds committed
      fdget_pos() is used by file operations that will read and update f_pos:
      things like "read()", "write()" and "lseek()" (but not, for example,
      "pread()/pwrite" that get their file positions elsewhere).
      
      However, it had two separate escape clauses for this, because not
      everybody wants or needs serialization of the file position.
      
      The first and most obvious case is the "file descriptor doesn't have a
      position at all", ie a stream-like file.  Except we didn't actually use
      FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
      was that FMODE_STREAM didn't exist back in the days, but also that we
      didn't want to mark all the special cases, so we only marked the ones
      that _required_ position atomicity according to POSIX - regular files
      and directories.
      
      The case one was intentionally lazy, but now that we _do_ have
      FMODE_STREAM we could and should just use it.  With the change to use
      FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
      the code to set it is deleted.
      
      Any cases where we don't want the serialization because the driver (or
      subsystem) doesn't use the file position should just be updated to do
      "stream_open()".  We've done that for all the obvious and common
      situations, but we may need a few more.  Quoting Kirill Smelkov in the
      original FMODE_STREAM thread (see link below for full email):
      
       "And I appreciate if people could help at least somehow with "getting
        rid of mixed case entirely" (i.e. always lock f_pos_lock on
        !FMODE_STREAM), because this transition starts to diverge from my
        particular use-case too far. To me it makes sense to do that
        transition as follows:
      
         - convert nonseekable_open -> stream_open via stream_open.cocci;
         - audit other nonseekable_open calls and convert left users that
           truly don't depend on position to stream_open;
         - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
           will cover pipes and sockets), or maybe convert pipes and sockets
           to FMODE_STREAM manually;
         - extend stream_open.cocci to analyze file_operations that use
           no_llseek or noop_llseek, but do not use nonseekable_open or
           alloc_file_pseudo. This might find files that have stream semantic
           but are opened differently;
         - extend stream_open.cocci to analyze file_operations whose
           .read/.write do not use ppos at all (independently of how file was
           opened);
         - ...
         - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
           !FMODE_STREAM;
         - gather bug reports for deadlocked read/write and convert missed
           cases to FMODE_STREAM, probably extending stream_open.cocci along
           the road to catch similar cases
      
        i.e. always take f_pos_lock unless a file is explicitly marked as
        being stream, and try to find and cover all files that are streams"
      
      We have not done the "extend stream_open.cocci to analyze
      alloc_file_pseudo as well" step, but the previous commit did manually
      handle the case of pipes and sockets.
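
Pipes are the textbook stream-like file: there is no position to serialize at all. From userspace this shows up as lseek() failing with ESPIPE, as the small sketch below checks (the helper name is an assumption for illustration).

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the errno produced by lseek() on the read
 * end of a fresh pipe, or 0 if lseek() unexpectedly succeeded.
 */
static int pipe_lseek_errno(void)
{
    int fds[2];
    int err = 0;
    int rc = pipe(fds);

    assert(rc == 0);
    errno = 0;
    if (lseek(fds[0], 0, SEEK_CUR) == (off_t)-1)
        err = errno;                /* a pipe has no file position */
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

A file that behaves this way gains nothing from position locking, which is the argument for marking it as a stream.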
      
      The other case where we can avoid locking f_pos is the "this file
      descriptor only has a single user and it is us, and thus there is no
      need to lock it".
      
      The second test was correct, although a bit subtle and worth
      reiterating here.  There are two kinds of other sources of references
      to the same file descriptor: file descriptors that have been explicitly
      shared across fork() or with dup(), and file tables having elevated
      reference counts due to threading (or explicit file sharing with
      clone()).
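
The first kind of sharing can be observed directly: descriptors created by dup() refer to the same open file, so they share one position. A small sketch, with an illustrative helper name and temporary path:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read 5 bytes through one descriptor, then report
 * the offset seen through a dup() of it. Returns -1 on setup failure.
 */
static off_t offset_seen_by_dup_after_read(void)
{
    char path[] = "/tmp/fdup-demo-XXXXXX";
    char buf[5];
    off_t seen = -1;
    int fd = mkstemp(path);
    int fd2;

    if (fd < 0)
        return -1;
    unlink(path);
    fd2 = dup(fd);                  /* second descriptor, same open file */
    assert(fd2 != fd);
    if (fd2 >= 0 &&
        write(fd, "abcdefgh", 8) == 8 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf))
        seen = lseek(fd2, 0, SEEK_CUR);   /* offset as seen via the duplicate */
    if (fd2 >= 0)
        close(fd2);
    close(fd);
    return seen;
}
```

The duplicate reports offset 5 even though it was never read from, so two users of such descriptors can race on the same position.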
      
      The first case would have incremented the file count explicitly, and in
      the second case the previous __fdget() would have incremented it for us
      and set the FDPUT_FPUT flag.
      
      But in both cases the file count would be greater than one, so the
      "file_count(file) > 1" test catches both situations.  Also note that if
      file_count is 1, that also means that no other thread can have access to
      the file table, so there also cannot be races with concurrent calls to
      dup()/fork()/clone() that would increment the file count any other way.
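
The two escape clauses combine into a single predicate. The sketch below is a userspace model of the rule as this message describes it, not the code in fs/file.c: struct model_file, MODEL_FMODE_STREAM and its value, and model_needs_f_pos_lock() are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FMODE_STREAM 0x1u     /* illustrative value, not the kernel's */

/* Hypothetical stand-in for the two fields the rule looks at. */
struct model_file {
    unsigned int f_mode;            /* mode flags */
    long         f_count;           /* number of references to the file */
};

/* True when f_pos must be locked before the operation touches it. */
static bool model_needs_f_pos_lock(const struct model_file *file)
{
    assert(file != NULL && file->f_count >= 1);
    if (file->f_mode & MODEL_FMODE_STREAM)
        return false;               /* stream-like: no position to protect */
    return file->f_count > 1;       /* shared via dup()/fork() or a shared table */
}
```

Only a positional file with more than one reference takes the lock; a sole user or a stream skips it.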
      
      Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 06 Mar, 2019 1 commit
    • S
      fs/file.c: initialize init_files.resize_wait · 5704a068
      Shuriyc Chu committed
      (Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)
      
      Calling 'get_unused_fd_flags' from a kthread crashes the kernel.  It
      works fine on 4.1, but crashes after 64 fds have been allocated.  It
      also crashes on Ubuntu 14.04/16.04/18.04 and CentOS 7.5, and the crash
      messages are almost the same.
      
      The crash message on CentOS 7.5 is shown below:
      
        start fd 61
        start fd 62
        start fd 63
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: __wake_up_common+0x2e/0x90
        PGD 0
        Oops: 0000 [#1] SMP
        Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
         virtio floppy dm_mirror dm_region_hash dm_log dm_mod
        CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.3.3.el7.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
        RIP: 0010:__wake_up_common+0x2e/0x90
        RSP: 0018:ffff8e94247a2d18  EFLAGS: 00010086
        RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
        RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
        FS:  0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
        Call Trace:
          __wake_up+0x39/0x50
          expand_files+0x131/0x250
          __alloc_fd+0x47/0x170
          get_unused_fd_flags+0x30/0x40
          test_fd+0x12a/0x1c0 [test]
          kthread+0xd1/0xe0
          ret_from_fork_nospec_begin+0x21/0x21
        Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 <48> 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
        RIP   __wake_up_common+0x2e/0x90
         RSP <ffff8e94247a2d18>
        CR2: 0000000000000000
      
      This issue has existed since CentOS 7.5 (3.10.0-862); CentOS 7.4
      (3.10.0-693.21.1) is OK.  Root cause: the 'resize_wait' member is not
      initialized before being used.
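
A userspace sketch of this class of bug, assuming the wait queue is headed by a circular list that must point at itself when empty. All names here (model_list_head, model_files, MODEL_LIST_HEAD_INIT, model_count_waiters) are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_list_head {
    struct model_list_head *next, *prev;
};

struct model_files {
    struct model_list_head resize_wait;
};

/* An empty circular list points at itself. */
#define MODEL_LIST_HEAD_INIT(name) { &(name), &(name) }

/* Zero-filled static: next and prev are NULL, which is not a valid empty list. */
static struct model_files zeroed_files;

/* Statically initialized head, safe to walk even with no waiters. */
static struct model_files inited_files = {
    .resize_wait = MODEL_LIST_HEAD_INIT(inited_files.resize_wait),
};

/* Counts entries; returns -1 where an unchecked walk would dereference NULL. */
static int model_count_waiters(const struct model_list_head *head)
{
    int n = 0;

    assert(head != NULL);
    if (head->next == NULL)
        return -1;
    for (const struct model_list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

The zero-filled head is the failing case; the statically initialized one reports zero waiters, the state a wake-up on an idle queue expects.
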
      Reported-by: Richard Zhang <zhang.zijian@h3c.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 28 Feb, 2019 1 commit
    • J
      fs: add fget_many() and fput_many() · 091141a4
      Jens Axboe committed
      Some use cases repeatedly get and put references to the same file, but
      the only exposed interface does these one at a time. As each of
      these entails an atomic inc or dec on a shared structure, that cost can
      add up.
      
      Add fget_many(), which works just like fget(), except it takes an
      argument for how many references to get on the file. Ditto fput_many(),
      which can drop an arbitrary number of references to a file.
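
The saving comes from replacing N atomic operations with one. A userspace model using C11 atomics follows; struct model_file, model_get_many() and model_put_many() are hypothetical names, and the "refuse when the count is already zero" behaviour is an assumption of the model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted file. */
struct model_file {
    atomic_long f_count;
};

/* Take `refs` references in one atomic update, unless the count is zero. */
static bool model_get_many(struct model_file *f, long refs)
{
    long old = atomic_load(&f->f_count);

    assert(refs > 0);
    do {
        if (old == 0)
            return false;           /* the file is already on its way out */
    } while (!atomic_compare_exchange_weak(&f->f_count, &old, old + refs));
    return true;
}

/* Drop `refs` references at once; true if that released the last one. */
static bool model_put_many(struct model_file *f, long refs)
{
    assert(refs > 0);
    return atomic_fetch_sub(&f->f_count, refs) == refs;
}
```

A caller that will issue eight requests against one file can take eight references up front and drop them in a single call once the batch completes.
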
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  24. 19 Dec, 2018 1 commit
    • T
      binder: fix use-after-free due to ksys_close() during fdget() · 80cd7956
      Todd Kjos committed
      44d8047f ("binder: use standard functions to allocate fds")
      exposed a pre-existing issue in the binder driver.
      
      fdget() is used in ksys_ioctl() as a performance optimization.
      One of the rules associated with fdget() is that ksys_close() must
      not be called between the fdget() and the fdput(). There is a case
      where this requirement is not met in the binder driver which results
      in the reference count dropping to 0 when the device is still in
      use. This can result in use-after-free or other issues.
      
      If userspace has passed a file-descriptor for the binder driver using
      a BINDER_TYPE_FDA object, then ksys_close() is called on it when
      handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
      the assumptions for using fdget().
      
      The problem is fixed by deferring the close using task_work_add(). A
      new variant of __close_fd() was created that returns a struct file
      with a reference. The fput() is deferred instead of using ksys_close().
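
A simplified userspace model of the hazard and of the deferral. It assumes a one-entry descriptor table and a single pending deferred put, and it hands the table's reference to the deferred work rather than taking an extra one; every name below is hypothetical and the real driver code differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_file {
    int  count;                     /* reference count */
    bool released;                  /* stands in for the file being freed */
};

static struct model_file *fd_slot;       /* a one-entry "fd table" */
static struct model_file *deferred_put;  /* one pending deferred put */

static void model_fput(struct model_file *f)
{
    assert(f->count > 0);
    if (--f->count == 0)
        f->released = true;
}

/* Fast path: borrow the table's reference without incrementing the count. */
static struct model_file *model_fdget(void)
{
    return fd_slot;
}

/* Unsafe inside the borrow window: drops the only reference immediately. */
static void model_close(void)
{
    struct model_file *f = fd_slot;

    fd_slot = NULL;
    if (f)
        model_fput(f);
}

/* Detach from the table now, but queue the final put for later. */
static void model_close_deferred(void)
{
    assert(deferred_put == NULL);
    deferred_put = fd_slot;
    fd_slot = NULL;
}

/* Runs after the borrower has finished with the file. */
static void model_run_deferred(void)
{
    if (deferred_put) {
        model_fput(deferred_put);
        deferred_put = NULL;
    }
}
```

With model_close() the borrowed pointer refers to a released file while still in use; with model_close_deferred() the file survives until model_run_deferred() runs, which is the ordering the fix relies on.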
      
      Fixes: 44d8047f ("binder: use standard functions to allocate fds")
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Todd Kjos <tkjos@google.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>