1. 01 Nov 2022, 1 commit
    • fs: use acquire ordering in __fget_light() · 7ee47dcf
      Jann Horn authored
      We must prevent the CPU from reordering the files->count read with the
      FD table access like this, on architectures where read-read reordering is
      possible:
      
          files_lookup_fd_raw()
                                        close_fd()
                                        put_files_struct()
          atomic_read(&files->count)
      
      I would like to mark this for stable, but the stable rules explicitly say
      "no theoretical races", and given that the FD table pointer and
      files->count are explicitly stored in the same cacheline, this sort of
      reordering seems quite unlikely in practice...
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7ee47dcf
  2. 29 Jun 2022, 1 commit
    • gfs2: Add glockfd debugfs file · 4480c27c
      Andreas Gruenbacher authored
      When a process has a gfs2 file open, the file is keeping a reference on the
      underlying gfs2 inode, and the inode is keeping the inode's iopen glock held in
      shared mode.  In other words, the process depends on the iopen glock of each
      open gfs2 file.  Expose those dependencies in a new "glockfd" debugfs file.
      
      The new debugfs file contains one line for each gfs2 file descriptor,
      specifying the tgid, file descriptor number, and glock name, e.g.,
      
        1601 6 5/816d
      
      This list is compiled by iterating all tasks on the system using find_ge_pid(),
      and all file descriptors of each task using task_lookup_next_fd_rcu().  To make
      that work from gfs2, export those two functions.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      4480c27c
  3. 06 Jun 2022, 1 commit
  4. 15 May 2022, 2 commits
    • Unify the primitives for file descriptor closing · 6319194e
      Al Viro authored
      Currently we have 3 primitives for removing an opened file from the
      descriptor table - pick_file(), __close_fd_get_file() and
      close_fd_get_file().  Their calling conventions are rather odd and
      there's code duplication for no good reason.  They can be unified -
      
      1) have __range_close() cap max_fd in the very beginning; that way
      we don't need separate way for pick_file() to report being past the end
      of descriptor table.
      
      2) make {__,}close_fd_get_file() return file (or NULL) directly, rather
      than returning it via struct file ** argument.  Don't bother with
      (bogus) return value - nobody wants that -ENOENT.
      
      3) make pick_file() return NULL on unopened descriptor - the only caller
      that used to care about the distinction between descriptor past the end
      of descriptor table and finding NULL in descriptor table doesn't give
      a damn after (1).
      
      4) lift ->files_lock out of pick_file()
      
      That actually simplifies the callers, as well as the primitives themselves.
      Code duplication is also gone...
      Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6319194e
    • fs: remove fget_many and fput_many interface · 81132a39
      Gou Hao authored
      These two interfaces were added in commit 091141a4,
      but nothing calls them anymore.
      
      The only user of fput/fget_many() was removed in commit
      62906e89 ("io_uring: remove file batch-get optimisation").
      
      The last user of get_file_rcu_many() was removed in commit
      f0735310 ("init: add an init_dup helper").
      
      Replacing atomic_long_sub/add with atomic_long_dec/inc
      also improves performance.
      
      Here are the test results of unixbench:
      
      Cmd: ./Run -c 64 context1
      
      Without patch:
      System Benchmarks Partial Index              BASELINE       RESULT    INDEX
      Pipe-based Context Switching                   4000.0    2798407.0   6996.0
                                                                         ========
      System Benchmarks Index Score (Partial Only)                         6996.0
      
      With patch:
      System Benchmarks Partial Index              BASELINE       RESULT    INDEX
      Pipe-based Context Switching                   4000.0    3486268.8   8715.7
                                                                         ========
      System Benchmarks Index Score (Partial Only)                         8715.7
      Signed-off-by: Gou Hao <gouhao@uniontech.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      81132a39
  5. 30 Mar 2022, 2 commits
    • fs: fix fd table size alignment properly · d888c83f
      Linus Torvalds authored
      Jason Donenfeld reports that my commit 1c24a186 ("fs: fd tables have
      to be multiples of BITS_PER_LONG") doesn't work, and the reason is an
      embarrassing brown-paper-bag bug.
      
      Yes, we want to align the number of fds to BITS_PER_LONG, and yes, the
      reason they might not be aligned is because the incoming 'max_fd'
      argument might not be aligned.
      
      But aligning the argument - while simple - will cause an "infinitely
      big" maxfd (eg NR_OPEN_MAX) to just overflow to zero.  Which most
      definitely isn't what we want either.
      
      The obvious fix was always just to do the alignment last, but I had
      moved it earlier just to make the patch smaller and the code look
      simpler.  Duh.  It certainly made _me_ look simple.
      
      Fixes: 1c24a186 ("fs: fd tables have to be multiples of BITS_PER_LONG")
      Reported-and-tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Fedor Pchelkin <aissur0002@gmail.com>
      Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Cc: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d888c83f
    • fs: fd tables have to be multiples of BITS_PER_LONG · 1c24a186
      Linus Torvalds authored
      This has always been the rule: fdtables have several bitmaps in them,
      and as a result they have to be sized properly for bitmaps.  We walk
      those bitmaps in chunks of 'unsigned long' in several cases, but even
      when we don't, we use the regular kernel bitops that are defined to work
      on arrays of 'unsigned long', not on some byte array.
      
      Now, the distinction between arrays of bytes and 'unsigned long'
      normally only really ends up being noticeable on big-endian systems, but
      Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps()
      could be called with an argument that wasn't even a multiple of
      BITS_PER_BYTE.  And then it fails to do the proper copy even on
      little-endian machines.
      
      The bug wasn't in copy_fd_bitmaps(), but in sane_fdtable_size(), which
      didn't actually sanitize the fdtable size sufficiently, and never made
      sure it had the proper BITS_PER_LONG alignment.
      
      That's partly because the alignment historically came not from having to
      explicitly align things, but simply from previous fdtable sizes, and
      from count_open_files(), which counts the file descriptors by walking
      them one 'unsigned long' word at a time and thus naturally ends up doing
      sizing in the proper 'chunks of unsigned long'.
      
      But with the introduction of close_range(), we now have an external
      source of "this is how many files we want to have", and so
      sane_fdtable_size() needs to do a better job.
      
      This also adds that explicit alignment to alloc_fdtable(), although
      there it is mainly just for documentation at a source code level.  The
      arithmetic we do there to pick a reasonable fdtable size already aligns
      the result sufficiently.
      
      In fact, clang notices that the added ALIGN() in that function doesn't
      actually do anything, and does not generate any extra code for it.
      
      It turns out that gcc ends up confusing itself by combining a previous
      constant-sized shift operation with the variable-sized shift operations
      in roundup_pow_of_two().  And probably due to that doesn't notice that
      the ALIGN() is a no-op.  But that's a (tiny) gcc misfeature that doesn't
      matter.  Having the explicit alignment makes sense, and would actually
      matter on a 128-bit architecture if we ever go there.
      
      This also adds big comments above both functions about how fdtable sizes
      have to have that BITS_PER_LONG alignment.
      
      Fixes: 60997c3d ("close_range: add CLOSE_RANGE_UNSHARE")
      Reported-by: Fedor Pchelkin <aissur0002@gmail.com>
      Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Link: https://lore.kernel.org/all/20220326114009.1690-1-aissur0002@gmail.com/
      Tested-and-acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c24a186
  6. 14 Dec 2021, 1 commit
    • fget: clarify and improve __fget_files() implementation · e386dfc5
      Linus Torvalds authored
      Commit 054aa8d4 ("fget: check that the fd still exists after getting
      a ref to it") fixed a race with getting a reference to a file just as it
      was being closed.  It was a fairly minimal patch, and I didn't think
      re-checking the file pointer lookup would be a measurable overhead,
      since it was all right there and cached.
      
      But I was wrong, as pointed out by the kernel test robot.
      
      The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed
      quite noticeably.  Admittedly it seems to be a very artificial test:
      doing "poll()" system calls on regular files in a very tight loop in
      multiple threads.
      
      That means that basically all the time is spent just looking up file
      descriptors without ever doing anything useful with them (not that doing
      'poll()' on a regular file is useful to begin with).  And as a result it
      shows the extra "re-check fd" cost as a sore thumb.
      
      Happily, the regression is fixable by just writing the code to look up
      the fd to be better and clearer.  There's still a cost to verify the
      file pointer, but now it's basically in the noise even for that
      benchmark that does nothing else - and the code is more understandable
      and has better comments too.
      
      [ Side note: this patch is also a classic case of one that looks very
        messy with the default greedy Myers diff - it's much more legible with
        either the patience or histogram diff algorithm ]
      
      Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/
      Link: https://lore.kernel.org/lkml/20211213083154.GA20853@linux.intel.com/
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Carel Si <beibei.si@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e386dfc5
  7. 04 Dec 2021, 1 commit
    • fget: check that the fd still exists after getting a ref to it · 054aa8d4
      Linus Torvalds authored
      Jann Horn points out that there is another possible race wrt Unix domain
      socket garbage collection, somewhat reminiscent of the one fixed in
      commit cbcf0112 ("af_unix: fix garbage collect vs MSG_PEEK").
      
      See the extended comment about the garbage collection requirements added
      to unix_peek_fds() by that commit for details.
      
      The race comes from how we can locklessly look up a file descriptor just
      as it is in the process of being closed, and with the right artificial
      timing (Jann added a few strategic 'mdelay(500)' calls to do that), the
      Unix domain socket garbage collector could see the reference count
      decrement of the close() happen before fget() took its reference to the
      file and the file was attached onto a new file descriptor.
      
      This is all (intentionally) correct on the 'struct file *' side, with
      RCU lookups and lockless reference counting very much part of the
      design.  Getting that reference count out of order isn't a problem per
      se.
      
      But the garbage collector can get confused by seeing this situation of
      having seen a file not having any remaining external references and then
      seeing it being attached to an fd.
      
      In commit cbcf0112 ("af_unix: fix garbage collect vs MSG_PEEK") the
      fix was to serialize the file descriptor install with the garbage
      collector by taking and releasing the unix_gc_lock.
      
      That's not really an option here, but since this all happens when we are
      in the process of looking up a file descriptor, we can instead simply
      just re-check that the file hasn't been closed in the meantime, and just
      re-do the lookup if we raced with a concurrent close() of the same file
      descriptor.
      Reported-and-tested-by: Jann Horn <jannh@google.com>
      Acked-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      054aa8d4
  8. 06 Sep 2021, 1 commit
  9. 16 Apr 2021, 1 commit
  10. 02 Apr 2021, 3 commits
    • file: simplify logic in __close_range() · 03ba0fe4
      Christian Brauner authored
      It never looked too pleasant and it doesn't really buy us anything
      anymore now that CLOSE_RANGE_CLOEXEC exists and we need to retake the
      current maximum under the lock for it anyway. This also makes the logic
      easier to follow.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      03ba0fe4
    • file: fix close_range() for unshare+cloexec · 9b5b8722
      Christian Brauner authored
      syzbot reported a bug when putting the last reference to a tasks file
      descriptor table. Debugging this showed we didn't recalculate the
      current maximum fd number for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
      after we unshared the file descriptors table. So max_fd could exceed the
      current fdtable maximum causing us to set excessive bits. As a concrete
      example, let's say the user requested everything from fd 4 to ~0UL to be
      closed and their current fdtable size is 256 with their highest open fd
      being 4. With CLOSE_RANGE_UNSHARE the caller will end up with a new
      fdtable which has room for 64 file descriptors since that is the lowest
      fdtable size we accept. But now max_fd will still point to 255 and needs
      to be adjusted. Fix this by retrieving the correct maximum fd value in
      __range_cloexec().
      
      Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
      Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
      Fixes: fec8a6a6 ("close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      9b5b8722
    • file: let pick_file() tell caller it's done · f49fd6d3
      Christian Brauner authored
      Let pick_file() report back that the fd it was passed exceeded the
      maximum fd in that fdtable. This allows us to simplify the caller of
      this helper because it doesn't need to care anymore whether the passed
      in max_fd is excessive. It can rely on pick_file() telling it that it's
      past the last valid fd.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      f49fd6d3
  11. 02 Feb 2021, 1 commit
  12. 31 Dec 2020, 1 commit
    • kernel/io_uring: cancel io_uring before task works · b1b6b5a3
      Pavel Begunkov authored
      For cancelling io_uring requests it needs either to be able to run
      currently enqueued task_works or to have task_work shut down by that
      moment.  Otherwise io_uring_cancel_files() may be waiting for requests
      that won't ever complete.
      
      Go with the first way and do cancellations before setting PF_EXITING and
      so before putting the task_work infrastructure into a transition state
      where task_work_run() would better not be called.
      
      Cc: stable@vger.kernel.org # 5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b1b6b5a3
  13. 19 Dec 2020, 1 commit
    • close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC · fec8a6a6
      Christian Brauner authored
      After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when
      CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE.
      When CLOSE_RANGE_UNSHARE is specified the caller will receive a private
      file descriptor table in case their file descriptor table is currently
      shared.
      
      For the case where the caller has requested all file descriptors to be
      actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that
      the caller does not need any of the file descriptors anymore and will
      optimize the close operation by only copying all files in the range from
      0 to 3 and no others.
      
      However, if the caller requested CLOSE_RANGE_CLOEXEC together with
      CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
      descriptors so the kernel needs to copy all of them and can't optimize.
      
      The original patch didn't account for this and thus could cause oopses
      as evidenced by the syzbot report because it assumed that all fds had
      been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.
      
      syzbot reported
      ==================================================================
      BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
      BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
      BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
      BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
      Read of size 8 at addr 0000000000000077 by task syz-executor511/8522
      
      CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x107/0x163 lib/dump_stack.c:120
       __kasan_report mm/kasan/report.c:549 [inline]
       kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
       check_memory_region_inline mm/kasan/generic.c:186 [inline]
       check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
       instrument_atomic_read include/linux/instrumented.h:71 [inline]
       atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
       atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
       filp_close+0x22/0x170 fs/open.c:1274
       close_files fs/file.c:402 [inline]
       put_files_struct fs/file.c:417 [inline]
       put_files_struct+0x1cc/0x350 fs/file.c:414
       exit_files+0x12a/0x170 fs/file.c:435
       do_exit+0xb4f/0x2a00 kernel/exit.c:818
       do_group_exit+0x125/0x310 kernel/exit.c:920
       get_signal+0x428/0x2100 kernel/signal.c:2792
       arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x447039
      Code: Unable to access opcode bytes at RIP 0x44700f.
      RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
      RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
      RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
      RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
      R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
      ==================================================================
      
      syzbot has tested the proposed patch and the reproducer did not trigger any issue:
      
      Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
      
      Tested on:
      
      commit:         10f7cddd selftests/core: add regression test for CLOSE_RAN..
      git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
      kernel config:  https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
      dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
      compiler:       gcc (GCC) 10.1.0-syz 20200507
      
      Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
      Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      fec8a6a6
  14. 11 Dec 2020, 12 commits
  15. 04 Dec 2020, 1 commit
    • fs, close_range: add flag CLOSE_RANGE_CLOEXEC · 582f1fb6
      Giuseppe Scrivano authored
      When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
      immediately close the files but it sets the close-on-exec bit.
      
      It is useful for e.g. container runtimes that usually install a
      seccomp profile "as late as possible" before execv'ing the container
      process itself.  The container runtime could either do:
        1                                  2
      - install_seccomp_profile();       - close_range(MIN_FD, MAX_INT, 0);
      - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
      - execve(...);                     - execve(...);
      
      Both alternatives have some disadvantages.
      
      In the first variant the seccomp_profile cannot block the close_range
      syscall, as well as opendir/read/close/... for the fallback on older
      kernels.
      In the second variant, close_range() can be used only on the fds
      that are not going to be needed by the runtime anymore, and it must be
      potentially called multiple times to account for the different ranges
      that must be closed.
      
      Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
      The runtime is able to use the existing open fds, the seccomp profile
      can block close_range() and the syscalls used for its fallback.
      Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      582f1fb6
  16. 01 Oct 2020, 1 commit
    • io_uring: don't rely on weak ->files references · 0f212204
      Jens Axboe authored
      Grab actual references to the files_struct. To avoid circular references
      issues due to this, we add a per-task note that keeps track of what
      io_uring contexts a task has used. When the tasks execs or exits its
      assigned files, we cancel requests based on this tracking.
      
      With that, we can grab proper references to the files table, and no
      longer need to rely on stashing away ring_fd and ring_file to check
      if the ring_fd may have been closed.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f212204
  17. 31 Jul 2020, 1 commit
  18. 14 Jul 2020, 3 commits
    • fs: Expand __receive_fd() to accept existing fd · 17381715
      Kees Cook authored
      Expand __receive_fd() with support for replace_fd() for the coming seccomp
      "addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
      and update existing wrappers to retain old behavior.
      
      Thanks to Colin Ian King <colin.king@canonical.com> for pointing out an
      uninitialized variable exposure in an earlier version of this patch.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Kadashev <dkadashev@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      17381715
    • fs: Add receive_fd() wrapper for __receive_fd() · deefa7f3
      Kees Cook authored
      For both pidfd and seccomp, the __user pointer is not used. Update
      __receive_fd() to make writing to ufd optional via a NULL check. However,
      for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
      can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
      new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
      argument. For the new helper, the allocated fd needs to be returned on
      success. Update the existing callers to handle it.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      deefa7f3
    • fs: Move __scm_install_fd() to __receive_fd() · 66590610
      Kees Cook authored
      In preparation for users of the "install a received file" logic outside
      of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
      net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
      named receive_fd_user(), as future patches will change the interface
      to __receive_fd().
      
      Additionally add a comment to fd_install() as a counterpoint to how
      __receive_fd() interacts with fput().
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Dmitry Kadashev <dkadashev@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Ido Schimmel <idosch@idosch.org>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      66590610
  19. 17 Jun 2020, 2 commits
    • close_range: add CLOSE_RANGE_UNSHARE · 60997c3d
      Christian Brauner authored
      One of the use-cases of close_range() is to drop file descriptors just before
      execve(). This would usually be expressed in the sequence:
      
      unshare(CLONE_FILES);
      close_range(3, ~0U);
      
      as pointed out by Linus it might be desirable to have this be a part of
      close_range() itself under a new flag CLOSE_RANGE_UNSHARE.
      
      This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
      maximum number of file descriptors to copy from the old struct files. When the
      user requests that all file descriptors are supposed to be closed via
      close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
      to do any of the heavy fput() work for everything above min.
      
      The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
      fact currently share our file descriptor table we create a new private copy.
      We then close all fds in the requested range and finally after we're done we
      install the new fd table.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      60997c3d
    • open: add close_range() · 278a5fba
      Christian Brauner authored
      This adds the close_range() syscall. It allows efficiently closing a
      range of file descriptors, up to all file descriptors of the calling task.
      
      I was contacted by FreeBSD as they wanted to have the same close_range()
      syscall as we proposed here. We've coordinated this and in the meantime, Kyle
      was fast enough to merge close_range() into FreeBSD already in April:
      https://reviews.freebsd.org/D21627
      https://svnweb.freebsd.org/base?view=revision&revision=359836
      and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
      once it's merged in Linux too. Python is in the process of switching to
      close_range() on FreeBSD and they are waiting on us to merge this to switch on
      Linux as well: https://bugs.python.org/issue38061
      
      The syscall came up in a recent discussion around the new mount API and
      making new file descriptor types cloexec by default. During this
      discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
      syscall in this manner has been requested by various people over time.
      
      First, it helps to close all file descriptors of an exec()ing task. This
      can be done safely via (quoting Al's example from [1] verbatim):
      
              /* that exec is sensitive */
              unshare(CLONE_FILES);
              /* we don't want anything past stderr here */
              close_range(3, ~0U);
              execve(....);
      
      The code snippet above is one way of working around the problem that file
      descriptors are not cloexec by default. This is aggravated by the fact that
      we can't just switch them over without massively regressing userspace. For
      a whole class of programs having an in-kernel method of closing all file
      descriptors is very helpful (e.g. daemons, service managers, programming
      language standard libraries, container managers etc.).
      (Please note, unshare(CLONE_FILES) should only be needed if the calling
      task is multi-threaded and shares the file descriptor table with another
      thread in which case two threads could race with one thread allocating file
      descriptors and the other one closing them via close_range(). For the
      general case close_range() before the execve() is sufficient.)
      
      Second, it allows userspace to avoid implementing closing all file
      descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
      file descriptor. From looking at various large(ish) userspace code bases
      this or similar patterns are very common in:
      - service managers (cf. [4])
      - libcs (cf. [6])
      - container runtimes (cf. [5])
      - programming language runtimes/standard libraries
        - Python (cf. [2])
        - Rust (cf. [7], [8])
      As Dmitry pointed out there's even a long-standing glibc bug about missing
      kernel support for this task (cf. [3]).
      In addition, the syscall will also work for tasks that do not have procfs
      mounted and on kernels that do not have procfs support compiled in. In such
      situations the only way to make sure that all file descriptors are closed
      is to call close() on each file descriptor up to UINT_MAX, or to resort
      to RLIMIT_NOFILE/OPEN_MAX trickery (cf. comment [8] on Rust).
      
      The performance is striking. For good measure, comparing the following
      simple close_all_fds() userspace implementation that is essentially just
      glibc's version in [6]:
      
      #include <dirent.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      static int close_all_fds(void)
      {
              int dir_fd;
              DIR *dir;
              struct dirent *direntp;

              dir = opendir("/proc/self/fd");
              if (!dir)
                      return -1;
              dir_fd = dirfd(dir);
              while ((direntp = readdir(dir))) {
                      int fd;

                      if (strcmp(direntp->d_name, ".") == 0)
                              continue;
                      if (strcmp(direntp->d_name, "..") == 0)
                              continue;
                      fd = atoi(direntp->d_name);
                      /* skip stdin/stdout/stderr and the fd backing this DIR */
                      if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                              continue;
                      close(fd);
              }
              closedir(dir);
              return 0;
      }
      
      to close_range() yields:
      1. closing 4 open files:
         - close_all_fds(): ~280 us
         - close_range():    ~24 us
      
      2. closing 1000 open files:
         - close_all_fds(): ~5000 us
         - close_range():   ~800 us
      
      close_range() is designed to allow for some flexibility. Specifically, it
      does not simply always close all open file descriptors of a task. Instead,
      callers can specify an upper bound.
      This is e.g. useful for scenarios where specific file descriptors are
      created with well-known numbers that are supposed to be excluded from
      getting closed.
      For extra paranoia close_range() comes with a flags argument. This can e.g.
      be used to implement extensions. One can imagine userspace wanting to stop
      at the first error instead of ignoring errors under certain circumstances.
      There might be other valid ideas in the future. In any case, a flag
      argument doesn't hurt and keeps us on the safe side.
      
      From an implementation side this is kept rather dumb. It saw some input
      from David and Jann but all nonsense is obviously my own!
      - Errors to close file descriptors are currently ignored. (Could be changed
        by setting a flag in the future if needed.)
      - __close_range() is a rather simplistic wrapper around __close_fd().
        My reasoning behind this is based on the nature of how __close_fd() needs
        to release an fd. But maybe I misunderstood specifics:
        We take the files_lock and rcu-dereference the fdtable of the calling
        task, we find the entry in the fdtable, get the file and need to release
        files_lock before calling filp_close().
        In the meantime the fdtable might have been altered so we can't just
        retake the spinlock and keep the old rcu-reference of the fdtable
        around. Instead we need to grab a fresh reference to the fdtable.
      If my reasoning is correct then there's really no point in fancifying
      __close_range(): We just need to rcu-dereference the fdtable of the
      calling task once to cap the max_fd value correctly and then go on
      calling __close_fd() in a loop.
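      In pseudocode, the dumb loop described above amounts to the following (a
      simplified sketch of the idea, not the actual kernel code; locking and
      error details elided):

```c
static int __close_range(unsigned fd, unsigned max_fd,
			 struct files_struct *files)
{
	struct fdtable *fdt;

	/* rcu-dereference the fdtable once, only to cap max_fd */
	rcu_read_lock();
	fdt = files_fdtable(files);
	max_fd = min(max_fd, fdt->max_fds - 1);
	rcu_read_unlock();

	if (fd > max_fd)
		return -EINVAL;

	/* each __close_fd() re-derives the fdtable under files->file_lock */
	for (; fd <= max_fd; fd++)
		__close_fd(files, fd); /* errors deliberately ignored */

	return 0;
}
```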
      
      /* References */
      [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
      [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
      [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
      [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
      [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
      [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
           Note that this is an internal implementation that is not exported.
           Currently, libc seems to not provide an exported version of this
           because of missing kernel support to do this.
           Note, in a recent patch series Florian made grantpt() a nop thereby
           removing the code referenced here.
      [7]: https://github.com/rust-lang/rust/issues/12148
      [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
           Rust's solution is slightly different but is equally unperformant.
           Rust calls getdtablesize() which is a glibc library function that
           simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
           goes on to call close() on each fd. That's obviously overkill for most
           tasks. Rarely, tasks - especially non-daemons - hit RLIMIT_NOFILE or
           OPEN_MAX.
           Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
           to 1024. Even in this case, there's a very high chance that in the
           common case Rust is calling the close() syscall 1021 times pointlessly
           if the task just has fds 0, 1, and 2 open.
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kyle Evans <self@kyle-evans.net>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      278a5fba
  20. 20 May 2020, 1 commit
  21. 20 Mar 2020, 1 commit
  22. 21 Jan 2020, 1 commit