1. 23 May 2022, 1 commit
    • NFSD: Instantiate a struct file when creating a regular NFSv4 file · fb70bf12
      Authored by Chuck Lever
      There have been reports of races that cause NFSv4 OPEN(CREATE) to
      return an error even though the requested file was created. NFSv4
      does not provide a status code for this case.
      
      To mitigate some of these problems, reorganize the NFSv4
      OPEN(CREATE) logic to allocate resources before the file is actually
      created, and open the new file while the parent directory is still
      locked.
      
      Two new APIs are added:
      
      + Add an API that works like nfsd_file_acquire() but does not open
      the underlying file. The OPEN(CREATE) path can use this API when it
      already has an open file.
      
      + Add an API that is kin to dentry_open(). NFSD needs to create a
      file and grab an open "struct file *" atomically. The
      alloc_empty_file() call has to happen before the inode is created.
      If it fails (for example, because the NFS server has exceeded its
      max_files limit), we avoid creating the file and can still return
      an error to the NFS client.
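
      To make the ordering concrete, here is a minimal sketch of the idea behind
      the second API, written against generic VFS calls; the helper name and the
      exact error handling are illustrative assumptions, not the NFSD patch itself:

      #include <linux/cred.h>
      #include <linux/file.h>
      #include <linux/fs.h>

      /* Illustrative only: create a regular file and return it already opened,
       * failing *before* the inode is created if file allocation fails. */
      static struct file *example_create_and_open(struct user_namespace *mnt_userns,
                                                  struct inode *dir, struct dentry *dentry,
                                                  umode_t mode, const struct path *path)
      {
              struct file *file;
              int err;

              /* Allocate the struct file first; if the server is out of files,
               * bail out before anything is created on disk. */
              file = alloc_empty_file(O_RDWR, current_cred());
              if (IS_ERR(file))
                      return file;

              err = vfs_create(mnt_userns, dir, dentry, mode, true);
              if (err)
                      goto out_fput;

              /* Open the freshly created file while the parent is still locked. */
              err = vfs_open(path, file);
              if (err)
                      goto out_fput;

              return file;

      out_fput:
              fput(file);
              return ERR_PTR(err);
      }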
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=382
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: JianHong Yin <jiyin@redhat.com>
  2. 10 May 2022, 1 commit
  3. 27 Apr 2022, 1 commit
  4. 09 Mar 2022, 1 commit
  5. 05 Dec 2021, 1 commit
    • fs: support mapped mounts of mapped filesystems · bd303368
      Authored by Christian Brauner
      In previous patches we added new and modified existing helpers to handle
      idmapped mounts of filesystems mounted with an idmapping. In this final
      patch we convert all relevant places in the vfs to actually pass the
      filesystem's idmapping into these helpers.
      
      With this the vfs is in shape to handle idmapped mounts of filesystems
      mounted with an idmapping. Note that this is just the generic
      infrastructure. Actually adding support for idmapped mounts to a
      filesystem mountable with an idmapping is follow-up work.
      
      In this patch we extend the definition of an idmapped mount from a
      mount that has the initial idmapping attached to it to a mount that
      has an idmapping attached to it which differs from the idmapping the
      filesystem was mounted with.
      
      As before, we do not allow the initial idmapping to be attached to a
      mount. In addition, this patch prevents the idmapping the filesystem
      was mounted with from being attached to a mount created from this
      filesystem.
      
      This has multiple reasons and advantages. First, attaching the initial
      idmapping or the filesystem's idmapping doesn't make much sense as in
      both cases the values of the i_{g,u}id and other places where k{g,u}ids
      are used do not change. Second, a user that really wants to do this for
      whatever reason can just create a separate dedicated identical idmapping
      to attach to the mount. Third, we can continue to use the initial
      idmapping as an indicator that a mount is not idmapped, which lets us
      keep passing the initial idmapping into the mapping helpers to tell
      them that something isn't an idmapped mount even if the filesystem is
      mounted with an idmapping.
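
      As a rough userspace illustration of what "attaching an idmapping to a
      mount" means in practice (a sketch with assumed paths, requiring a kernel
      with mount_setattr(2), i.e. v5.12+, libc headers that define
      SYS_open_tree/SYS_mount_setattr, and appropriate privileges):

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <linux/mount.h>     /* struct mount_attr, MOUNT_ATTR_IDMAP */
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Clone a mount of `source` and attach the idmapping described by the
       * user namespace fd; returns a detached mount fd for move_mount(2). */
      static int attach_idmapping(const char *source, int userns_fd)
      {
              int mnt_fd = syscall(SYS_open_tree, AT_FDCWD, source,
                                   OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
              if (mnt_fd < 0)
                      return -1;

              struct mount_attr attr = {
                      .attr_set = MOUNT_ATTR_IDMAP,
                      .userns_fd = userns_fd,
              };
              if (syscall(SYS_mount_setattr, mnt_fd, "", AT_EMPTY_PATH,
                          &attr, sizeof(attr)) < 0) {
                      close(mnt_fd);
                      return -1;
              }
              return mnt_fd;
      }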
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  6. 04 Dec 2021, 2 commits
  7. 07 Nov 2021, 2 commits
    • mm, thp: fix incorrect unmap behavior for private pages · 8468e937
      Authored by Rongwei Wang
      When truncating the page cache on a file THP, the private pages of a
      process should not be unmapped. This incorrect behavior on dynamic
      shared libraries causes the related processes to core dump.
      
      A simple test for a DSO (the prerequisite is that the DSO is mapped as a file THP):
      
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(int argc, char *argv[])
          {
      	int fd;
      
      	fd = open(argv[1], O_WRONLY);
      	if (fd < 0) {
      		perror("open");
      		return 1;
      	}
      
      	close(fd);
      	return 0;
          }
      
      The test only opens the target DSO and does nothing else, yet this
      operation causes one or more processes to core dump. This patch
      fixes that bug.
      
      Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com
      Fixes: eb6ecbed ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Tested-by: Xu Yu <xuyu@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Song Liu <song@kernel.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Collin Fijalkovich <cfijalkovich@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: lock filemap when truncating page cache · 55fc0d91
      Authored by Rongwei Wang
      Patch series "fix two bugs for file THP".
      
      This patch (of 2):
      
      Transparent huge pages have supported read-only non-shmem files. A
      file-backed THP is collapsed by khugepaged and truncated when the file
      is written to (as happens for shared libraries).
      
      However, there is a race when multiple writers truncate the same page
      cache concurrently.
      
      In that case, subpage(s) of file THP can be revealed by find_get_entry
      in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
      truncate_inode_page, as follows:
      
          page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
          head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
          flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
          raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
          raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
          head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
          head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
          page dumped because: VM_BUG_ON_PAGE(PageTail(page))
          ------------[ cut here ]------------
          kernel BUG at mm/truncate.c:213!
          Internal error: Oops - BUG: 0 [#1] SMP
          Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
          CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
          Hardware name: ECS, BIOS 0.0.0 02/06/2015
          pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
          Call trace:
           truncate_inode_page+0x64/0x70
           truncate_inode_pages_range+0x550/0x7e4
           truncate_pagecache+0x58/0x80
           do_dentry_open+0x1e4/0x3c0
           vfs_open+0x38/0x44
           do_open+0x1f0/0x310
           path_openat+0x114/0x1dc
           do_filp_open+0x84/0x134
           do_sys_openat2+0xbc/0x164
           __arm64_sys_openat+0x74/0xc0
           el0_svc_common.constprop.0+0x88/0x220
           do_el0_svc+0x30/0xa0
           el0_svc+0x20/0x30
           el0_sync_handler+0x1a4/0x1b0
           el0_sync+0x180/0x1c0
          Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)
      
      This patch locks the filemap when entering truncate_pagecache(),
      avoiding truncating the same page cache concurrently.
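
      Conceptually, the change serializes page cache truncation against other
      writers of the same mapping. A simplified sketch of the locking pattern
      (not the literal diff):

      	/* In do_dentry_open(), when a THP-backed file is opened for write:
      	 * hold the invalidate lock so concurrent openers cannot race through
      	 * truncate_inode_pages_range() on the same mapping. */
      	if (filemap_nr_thps(inode->i_mapping)) {
      		struct address_space *mapping = inode->i_mapping;

      		filemap_invalidate_lock(mapping);
      		/* Re-check under the lock; another writer may already have
      		 * truncated the THPs away. */
      		if (filemap_nr_thps(mapping))
      			truncate_pagecache(inode, 0);
      		filemap_invalidate_unlock(mapping);
      	}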
      
      Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba.com
      Fixes: eb6ecbed ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Song Liu <song@kernel.org>
      Cc: Collin Fijalkovich <cfijalkovich@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 05 Oct 2021, 1 commit
    • audit: add OPENAT2 record to list "how" info · 571e5c0e
      Authored by Richard Guy Briggs
      Since the openat2(2) syscall uses a struct open_how pointer to communicate
      its parameters, they are not usefully recorded by the audit SYSCALL
      record's four existing arguments.
      
      Add a new audit record type OPENAT2 that reports the parameters in its
      third argument, struct open_how with fields oflag, mode and resolve.
      
      The new record in the context of an event would look like:
      time->Wed Mar 17 16:28:53 2021
      type=PROCTITLE msg=audit(1616012933.531:184): proctitle=
        73797363616C6C735F66696C652F6F70656E617432002F746D702F61756469742D
        7465737473756974652D737641440066696C652D6F70656E617432
      type=PATH msg=audit(1616012933.531:184): item=1 name="file-openat2"
        inode=29 dev=00:1f mode=0100600 ouid=0 ogid=0 rdev=00:00
        obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE
        cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
      type=PATH msg=audit(1616012933.531:184):
        item=0 name="/root/rgb/git/audit-testsuite/tests"
        inode=25 dev=00:1f mode=040700 ouid=0 ogid=0 rdev=00:00
        obj=unconfined_u:object_r:user_tmp_t:s0 nametype=PARENT
        cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
      type=CWD msg=audit(1616012933.531:184):
        cwd="/root/rgb/git/audit-testsuite/tests"
      type=OPENAT2 msg=audit(1616012933.531:184):
        oflag=0100302 mode=0600 resolve=0xa
      type=SYSCALL msg=audit(1616012933.531:184): arch=c000003e syscall=437
        success=yes exit=4 a0=3 a1=7ffe315f1c53 a2=7ffe315f1550 a3=18
        items=2 ppid=528 pid=540 auid=0 uid=0 gid=0 euid=0 suid=0
        fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="openat2"
        exe="/root/rgb/git/audit-testsuite/tests/syscalls_file/openat2"
        subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
        key="testsuite-1616012933-bjAUcEPO"
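
      For reference, an openat2(2) call of roughly the following shape produces
      such an event; the dirfd and path are assumptions for the example, and the
      resolve value 0xa above corresponds to RESOLVE_NO_MAGICLINKS | RESOLVE_BENEATH
      (requires kernel v5.6+ and headers that define SYS_openat2):

      	#include <fcntl.h>
      	#include <linux/openat2.h>   /* struct open_how, RESOLVE_* */
      	#include <sys/syscall.h>
      	#include <unistd.h>

      	static int open_in_dir(int dirfd)
      	{
      		struct open_how how = {
      			.flags   = O_WRONLY | O_CREAT,
      			.mode    = 0600,
      			.resolve = RESOLVE_NO_MAGICLINKS | RESOLVE_BENEATH, /* 0xa */
      		};

      		/* The OPENAT2 record reports oflag, mode and resolve from `how`. */
      		return syscall(SYS_openat2, dirfd, "file-openat2", &how, sizeof(how));
      	}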
      
      Link: https://lore.kernel.org/r/d23fbb89186754487850367224b060e26f9b7181.1621363275.git.rgb@redhat.com
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      [PM: tweak subject, wrap example, move AUDIT_OPENAT2 to 1337]
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  9. 23 Aug 2021, 1 commit
    • fs: remove mandatory file locking support · f7e33bdb
      Authored by Jeff Layton
      We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
      off in Fedora and RHEL8. Several other distros have followed suit.
      
      I've heard of one problem in all that time: Someone migrated from an
      older distro that supported "-o mand" to one that didn't, and the host
      had an fstab entry with "mand" in it, which broke on reboot. They didn't
      actually _use_ mandatory locking so they just removed the mount option
      and moved on.
      
      This patch rips out mandatory locking support wholesale from the kernel,
      along with the Kconfig option and the Documentation file. It also
      changes the mount code to ignore the "mand" mount option instead of
      erroring out, and to throw a big, ugly warning.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
  10. 01 Jul 2021, 1 commit
    • mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs · eb6ecbed
      Authored by Collin Fijalkovich
      Transparent huge pages are supported for read-only non-shmem files, but
      are only used for vmas with VM_DENYWRITE.  This condition ensures that
      file THPs are protected from writes while an application is running
      (ETXTBSY).  Any existing file THPs are then dropped from the page cache
      when a file is opened for write in do_dentry_open().  Since sys_mmap
      ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
      produced by execve().
      
      Systems that make heavy use of shared libraries (e.g.  Android) are unable
      to apply VM_DENYWRITE through the dynamic linker, preventing them from
      benefiting from the resultant reduced contention on the TLB.
      
      This patch reduces the constraint on file THPs allowing use with any
      executable mapping from a file not opened for write (see
      inode_is_open_for_write()).  It also introduces additional conditions to
      ensure that files opened for write will never be backed by file THPs.
      
      Restricting the use of THPs to executable mappings eliminates the risk
      that a read-only file later opened for write would encounter significant
      latencies due to page cache truncation.
      
      The ld linker flag '-z max-page-size=(hugepage size)' can be used to
      produce executables with the necessary layout.  The dynamic linker must
      map these files' segments at a hugepage-size-aligned vma for the mapping
      to be backed with THPs.
      
      Comparison of the performance characteristics of 4KB and 2MB-backed
      libraries follows; the Android dex2oat tool was used to AOT compile an
      example application on a single ARM core.
      
      4KB Pages:
      ==========
      
      count              event_name            # count / runtime
      598,995,035,942    cpu-cycles            # 1.800861 GHz
       81,195,620,851    raw-stall-frontend    # 244.112 M/sec
      347,754,466,597    iTLB-loads            # 1.046 G/sec
        2,970,248,900    iTLB-load-misses      # 0.854122% miss rate
      
      Total test time: 332.854998 seconds.
      
      2MB Pages:
      ==========
      
      count              event_name            # count / runtime
      592,872,663,047    cpu-cycles            # 1.800358 GHz
       76,485,624,143    raw-stall-frontend    # 232.261 M/sec
      350,478,413,710    iTLB-loads            # 1.064 G/sec
          803,233,322    iTLB-load-misses      # 0.229182% miss rate
      
      Total test time: 329.826087 seconds
      
      A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
      
      /apex/com.android.art/lib64/libart.so
      FilePmdMapped:      4096 kB
      
      /apex/com.android.art/lib64/libart-compiler.so
      FilePmdMapped:      2048 kB
      
      Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
      Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Song Liu <song@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Hridya Valsaraju <hridya@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 28 May 2021, 1 commit
    • open: don't silently ignore unknown O-flags in openat2() · cfe80306
      Authored by Christian Brauner
      The new openat2() syscall verifies that no unknown O-flag values are
      set and returns an error to userspace if they are while the older open
      syscalls like open() and openat() simply ignore unknown flag values:
      
        #define O_FLAG_CURRENTLY_INVALID (1 << 31)
        struct open_how how = {
                .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
                .resolve = 0,
        };
      
        /* fails */
        fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));
      
        /* succeeds */
        fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);
      
      However, openat2() silently truncates the upper 32 bits, meaning:
      
        #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
        #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1 << 40)
      
        struct open_how how_lower32 = {
                .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
        };
      
        struct open_how how_upper32 = {
                .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
        };
      
        /* fails */
        fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));
      
        /* succeeds */
        fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));
      
      Fix this by preventing the immediate truncation in build_open_flags().
      
      There's a snafu here, though: stripping FMODE_* directly from flags would
      cause the upper 32 bits to be truncated as well, due to integer promotion
      rules, since FMODE_* is unsigned int while the O_* flags are signed ints (yuck).
      
      In addition, struct open_flags currently defines flags to be 32 bit
      which is reasonable. If we simply were to bump it to 64 bit we would
      need to change a lot of code preemptively which doesn't seem worth it.
      So simply add a compile-time check verifying that all currently known
      O_* flags are within the 32 bit range and fail to build if they aren't
      anymore.
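
      A sketch of what such a compile-time guard looks like (paraphrased; the
      exact placement in build_open_flags() may differ):

      	/* flags arrives as a u64 straight from struct open_how. */
      	u64 flags = how->flags;

      	/*
      	 * Refuse to build if any known O_* flag ever moves past bit 31,
      	 * since struct open_flags still stores flags in a 32-bit int.
      	 */
      	BUILD_BUG_ON_MSG(upper_32_bits(VALID_OPEN_FLAGS),
      			 "struct open_flags doesn't yet handle flags > 32 bits");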
      
      This change shouldn't regress old open syscalls since they silently
      truncate any unknown values anyway. It is a tiny semantic change for
      openat2(), but it is very unlikely that people pass in unknown flags
      above 32 bits, and the syscall is relatively new too.
      
      Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reported-by: Richard Guy Briggs <rgb@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
      Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  12. 08 Apr 2021, 1 commit
  13. 24 Jan 2021, 5 commits
  14. 05 Jan 2021, 1 commit
  15. 11 Dec 2020, 1 commit
  16. 03 Dec 2020, 1 commit
  17. 13 Aug 2020, 1 commit
    • exec: move S_ISREG() check earlier · 633fb6ac
      Authored by Kees Cook
      The execve(2)/uselib(2) syscalls have always rejected non-regular files.
      Recently, it was noticed that a deadlock was introduced when trying to
      execute pipes, as the S_ISREG() test was happening too late.  This was
      fixed in commit 73601ea5 ("fs/open.c: allow opening only regular files
      during execve()"), but it was added after inode_permission() had already
      run, which meant LSMs could see bogus attempts to execute non-regular
      files.
      
      Move the test into the other inode type checks (which already look for
      other pathological conditions[1]).  Since there is no need to use
      FMODE_EXEC while we still have access to "acc_mode", also switch the test
      to MAY_EXEC.
      
      Also include a comment with the redundant S_ISREG() checks at the end of
      execve(2)/uselib(2) to note that they are present to avoid any mistakes.
      
      My notes on the call path, and related arguments, checks, etc:
      
      do_open_execat()
          struct open_flags open_exec_flags = {
              .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
              .acc_mode = MAY_EXEC,
              ...
          do_filp_open(dfd, filename, open_flags)
              path_openat(nameidata, open_flags, flags)
                  file = alloc_empty_file(open_flags, current_cred());
                  do_open(nameidata, file, open_flags)
                      may_open(path, acc_mode, open_flag)
      		    /* new location of MAY_EXEC vs S_ISREG() test */
                          inode_permission(inode, MAY_OPEN | acc_mode)
                              security_inode_permission(inode, acc_mode)
                      vfs_open(path, file)
                          do_dentry_open(file, path->dentry->d_inode, open)
                              /* old location of FMODE_EXEC vs S_ISREG() test */
                              security_file_open(f)
                              open()
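
      In code terms, the relocation amounts to rejecting MAY_EXEC on anything
      that is not a regular file inside may_open()'s inode-type switch; roughly
      (a paraphrased sketch, not the verbatim diff):

      	switch (inode->i_mode & S_IFMT) {
      	case S_IFLNK:
      		return -ELOOP;
      	case S_IFDIR:
      		if (acc_mode & (MAY_WRITE | MAY_EXEC))
      			return -EISDIR;
      		break;
      	case S_IFBLK:
      	case S_IFCHR:
      		if (!may_open_dev(path))
      			return -EACCES;
      		fallthrough;
      	case S_IFIFO:
      	case S_IFSOCK:
      		if (acc_mode & MAY_EXEC)
      			return -EACCES;	/* execve()/uselib() of non-regular files rejected here now */
      		flags &= ~O_TRUNC;
      		break;
      	case S_IFREG:
      		if ((acc_mode & MAY_EXEC) && path_noexec(path))
      			return -EACCES;
      		break;
      	}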
      
      [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 31 Jul 2020, 7 commits
  19. 16 Jul 2020, 2 commits
  20. 17 Jun 2020, 2 commits
    • close_range: add CLOSE_RANGE_UNSHARE · 60997c3d
      Authored by Christian Brauner
      One of the use-cases of close_range() is to drop file descriptors just before
      execve(). This would usually be expressed in the sequence:
      
      unshare(CLONE_FILES);
      close_range(3, ~0U);
      
      As pointed out by Linus, it might be desirable to have this be a part of
      close_range() itself under a new flag, CLOSE_RANGE_UNSHARE.
      
      This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
      maximum number of file descriptors to copy from the old struct files. When the
      user requests that all file descriptors be closed via close_range(min, max),
      we can cap via unshare_fd(min) and hence don't need to do any of the heavy
      fput() work for anything above min.
      
      The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
      fact currently share our file descriptor table, we create a new private copy.
      We then close all fds in the requested range and, once we're done, install
      the new fd table.
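
      With the new flag, the unshare(2) + close_range(2) pair above collapses into
      a single call. A minimal userspace sketch (assuming kernel v5.9+ and headers
      that define SYS_close_range and CLOSE_RANGE_UNSHARE; glibc 2.34+ also offers
      a close_range() wrapper):

      #include <linux/close_range.h>   /* CLOSE_RANGE_UNSHARE */
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Equivalent to unshare(CLONE_FILES) + close_range(3, ~0U): get a private
       * fd table with everything above stderr already closed, then exec. */
      static int drop_fds_and_exec(char *const argv[], char *const envp[])
      {
              if (syscall(SYS_close_range, 3, ~0U, CLOSE_RANGE_UNSHARE) < 0)
                      return -1;
              return execve(argv[0], argv, envp);
      }
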
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • open: add close_range() · 278a5fba
      Authored by Christian Brauner
      This adds the close_range() syscall. It allows a task to efficiently close a
      range of file descriptors, up to all of its file descriptors.
      
      I was contacted by FreeBSD as they wanted to have the same close_range()
      syscall as we proposed here. We've coordinated this and in the meantime, Kyle
      was fast enough to merge close_range() into FreeBSD already in April:
      https://reviews.freebsd.org/D21627
      https://svnweb.freebsd.org/base?view=revision&revision=359836
      and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
      once it's merged in Linux too. Python is in the process of switching to
      close_range() on FreeBSD and they are waiting on us to merge this to switch on
      Linux as well: https://bugs.python.org/issue38061
      
      The syscall came up in a recent discussion around the new mount API and
      making new file descriptor types cloexec by default. During this
      discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
      syscall in this manner has been requested by various people over time.
      
      First, it helps to close all file descriptors of an exec()ing task. This
      can be done safely via (quoting Al's example from [1] verbatim):
      
              /* that exec is sensitive */
              unshare(CLONE_FILES);
              /* we don't want anything past stderr here */
              close_range(3, ~0U);
              execve(....);
      
      The code snippet above is one way of working around the problem that file
      descriptors are not cloexec by default. This is aggravated by the fact that
      we can't just switch them over without massively regressing userspace. For
      a whole class of programs, having an in-kernel method of closing all file
      descriptors is very helpful (e.g. daemons, service managers, programming
      language standard libraries, container managers, etc.).
      (Please note, unshare(CLONE_FILES) should only be needed if the calling
      task is multi-threaded and shares the file descriptor table with another
      thread in which case two threads could race with one thread allocating file
      descriptors and the other one closing them via close_range(). For the
      general case close_range() before the execve() is sufficient.)
      
      Second, it allows userspace to avoid implementing closing all file
      descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
      file descriptor. From looking at various large(ish) userspace code bases
      this or similar patterns are very common in:
      - service managers (cf. [4])
      - libcs (cf. [6])
      - container runtimes (cf. [5])
      - programming language runtimes/standard libraries
        - Python (cf. [2])
        - Rust (cf. [7], [8])
      As Dmitry pointed out there's even a long-standing glibc bug about missing
      kernel support for this task (cf. [3]).
      In addition, the syscall will also work for tasks that do not have procfs
      mounted and on kernels that do not have procfs support compiled in. In such
      situations the only way to make sure that all file descriptors are closed
      is to call close() on each file descriptor up to UINT_MAX, or to resort to
      RLIMIT_NOFILE/OPEN_MAX trickery (cf. comment [8] on Rust).
      
      The performance is striking. For good measure, comparing the following
      simple close_all_fds() userspace implementation that is essentially just
      glibc's version in [6]:
      
      #include <dirent.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      static int close_all_fds(void)
      {
              int dir_fd;
              DIR *dir;
              struct dirent *direntp;
      
              dir = opendir("/proc/self/fd");
              if (!dir)
                      return -1;
              dir_fd = dirfd(dir);
              while ((direntp = readdir(dir))) {
                      int fd;
                      if (strcmp(direntp->d_name, ".") == 0)
                              continue;
                      if (strcmp(direntp->d_name, "..") == 0)
                              continue;
                      fd = atoi(direntp->d_name);
                      if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                              continue;
                      close(fd);
              }
              closedir(dir);
              return 0;
      }
      
      to close_range() yields:
      1. closing 4 open files:
         - close_all_fds(): ~280 us
         - close_range():    ~24 us
      
      2. closing 1000 open files:
         - close_all_fds(): ~5000 us
         - close_range():   ~800 us
      
      close_range() is designed to allow for some flexibility. Specifically, it
      does not simply always close all open file descriptors of a task. Instead,
      callers can specify an upper bound.
      This is e.g. useful for scenarios where specific file descriptors are
      created with well-known numbers that are supposed to be excluded from
      getting closed.
      For extra paranoia close_range() comes with a flags argument. This can e.g.
      be used to implement extensions. One can imagine userspace wanting to stop
      at the first error instead of ignoring errors under certain circumstances.
      There might be other valid ideas in the future. In any case, a flags
      argument doesn't hurt and keeps us on the safe side.
      
      From an implementation side this is kept rather dumb. It saw some input
      from David and Jann but all nonsense is obviously my own!
      - Errors to close file descriptors are currently ignored. (Could be changed
        by setting a flag in the future if needed.)
      - __close_range() is a rather simplistic wrapper around __close_fd().
        My reasoning behind this is based on the nature of how __close_fd() needs
        to release an fd. But maybe I misunderstood specifics:
        We take the files_lock and rcu-dereference the fdtable of the calling
        task, we find the entry in the fdtable, get the file and need to release
        files_lock before calling filp_close().
        In the meantime the fdtable might have been altered so we can't just
        retake the spinlock and keep the old rcu-reference of the fdtable
        around. Instead we need to grab a fresh reference to the fdtable.
        If my reasoning is correct then there's really no point in fancifying
        __close_range(): We just need to rcu-dereference the fdtable of the
        calling task once to cap the max_fd value correctly and then go on
        calling __close_fd() in a loop.
      
      /* References */
      [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
      [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
      [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
      [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
      [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
      [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
           Note that this is an internal implementation that is not exported.
           Currently, libc seems to not provide an exported version of this
           because of missing kernel support to do this.
           Note, in a recent patch series Florian made grantpt() a nop thereby
           removing the code referenced here.
      [7]: https://github.com/rust-lang/rust/issues/12148
      [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
           Rust's solution is slightly different but is equally unperformant.
           Rust calls getdtablesize() which is a glibc library function that
           simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
           goes on to call close() on each fd. That's obviously overkill for most
           tasks. Rarely, tasks - especially non-daemons - hit RLIMIT_NOFILE or
           OPEN_MAX.
           Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
           to 1024. Even in this case, there's a very high chance that in the
           common case Rust is calling the close() syscall 1021 times pointlessly
           if the task just has 0, 1, and 2 open.
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kyle Evans <self@kyle-evans.net>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
  21. 03 Jun 2020, 1 commit
    • vfs: track per-sb writeback errors and report them to syncfs · 735e4ae5
      Authored by Jeff Layton
      Patch series "vfs: have syncfs() return error when there are writeback
      errors", v6.
      
      Currently, syncfs does not return errors when one of the inodes fails to
      be written back.  It will return errors based on the legacy AS_EIO and
      AS_ENOSPC flags when syncing out the block device fails, but that's not
      particularly helpful for filesystems that aren't backed by a blockdev.
      It's also possible for a stray sync to lose those errors.
      
      The basic idea in this set is to track writeback errors at the
      superblock level, so that we can quickly and easily check whether
      something bad happened without having to fsync each file individually.
      syncfs is then changed to reliably report writeback errors after they
      occur, much in the same fashion as fsync does now.
      
      This patch (of 2):
      
      Usually we suggest that applications call fsync when they want to ensure
      that all data written to the file has made it to the backing store, but
      that can be inefficient when there are a lot of open files.
      
      Calling syncfs on the filesystem can be more efficient in some
      situations, but the error reporting doesn't currently work the way most
      people expect.  If a single inode on a filesystem reports a writeback
      error, syncfs won't necessarily return an error.  syncfs only returns an
      error if __sync_blockdev fails, and on some filesystems that's a no-op.
      
      It would be better if syncfs reported an error if there were any
      writeback failures.  Then applications could call syncfs to see if there
      are any errors on any open files, and could then call fsync on all of
      the other descriptors to figure out which one failed.
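
      The intended userspace pattern then becomes simple: sync the whole
      filesystem once and look at the return value, falling back to per-fd
      fsync() only when something went wrong. A small sketch:

      #define _GNU_SOURCE
      #include <stdio.h>
      #include <unistd.h>

      /* Returns 0 if all writeback on the filesystem containing `any_fd` has
       * succeeded since this file's error cursor was last advanced. */
      static int check_filesystem_writeback(int any_fd)
      {
              if (syncfs(any_fd) < 0) {
                      perror("syncfs");   /* e.g. EIO: some inode failed writeback */
                      return -1;
              }
              return 0;
      }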
      
      This patch adds a new errseq_t to struct super_block, and has
      mapping_set_error also record writeback errors there.
      
      To report those errors, we also need to keep an errseq_t in struct file
      to act as a cursor.  This patch adds a dedicated field for that purpose,
      which slots nicely into 4 bytes of padding at the end of struct file on
      x86_64.
      
      An earlier version of this patch used an O_PATH file descriptor to cue
      the kernel that the open file should track the superblock error and not
      the inode's writeback error.
      
      I think that API is just too weird though.  This is simpler and should
      make syncfs error reporting "just work" even if someone is multiplexing
      fsync and syncfs on the same fds.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andres Freund <andres@anarazel.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
      Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 14 May 2020, 2 commits
    • vfs: add faccessat2 syscall · c8ffd8bc
      Authored by Miklos Szeredi
      POSIX defines faccessat() as having a fourth "flags" argument, while the
      Linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
      AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
      
      Add a new faccessat(2) syscall with the added flags argument and implement
      both flags.
      
      The value of AT_EACCESS is defined in glibc headers to be the same as
      AT_REMOVEDIR.  Use this value for the kernel interface as well, together
      with the explanatory comment.
      
      Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
      be useful and is trivial to implement.
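
      A minimal sketch of calling the new syscall directly (the path is an
      example; requires kernel v5.8+ and headers that define SYS_faccessat2):

      #define _GNU_SOURCE
      #include <fcntl.h>        /* AT_FDCWD, AT_EACCESS, AT_SYMLINK_NOFOLLOW */
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Check whether the *effective* IDs may write the file, without
       * following a trailing symlink; both flags are handled in the kernel. */
      static int can_write_effective(const char *path)
      {
              return syscall(SYS_faccessat2, AT_FDCWD, path, W_OK,
                             AT_EACCESS | AT_SYMLINK_NOFOLLOW);
      }
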
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • vfs: split out access_override_creds() · 94704515
      Authored by Miklos Szeredi
      Split out a helper that overrides the credentials in preparation for
      actually doing the access check.
      
      This prepares for the next patch that optionally disables the creds
      override.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  23. 13 Mar 2020, 1 commit
    • cifs_atomic_open(): fix double-put on late allocation failure · d9a9f484
      Authored by Al Viro
      Several iterations of ->atomic_open() calling conventions ago, we
      used to need fput() if ->atomic_open() failed at some point after
      successful finish_open().  Now (since 2016) it's not needed -
      struct file carries enough state to make fput() work regardless
      of the point in struct file lifecycle and discarding it on
      failure exits in open() got unified.  Unfortunately, I'd missed
      the fact that we had an instance of ->atomic_open() (cifs one)
      that used to need that fput(), as well as the stale comment in
      finish_open() demanding such late failure handling.  Trivially
      fixed...
      
      Fixes: fe9ec829 "do_last(): take fput() on error after opening to out:"
      Cc: stable@kernel.org # v4.7+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  24. 28 Feb 2020, 1 commit
    • make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW · 31d1726d
      Authored by Al Viro
      O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
      As it is, we might or might not have LOOKUP_FOLLOW in op->intent
      in that case - that depends upon having O_NOFOLLOW in open flags.
      It doesn't matter, since we won't be checking it in that case -
      do_last() bails out earlier.
      
      However, making sure it's not set (i.e. acting as if we had an explicit
      O_NOFOLLOW) makes the behaviour more explicit and allows reordering the
      check for O_CREAT | O_EXCL in do_last() with the call of step_into()
      immediately following it.
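
      The corresponding change in build_open_flags() is tiny; roughly (a
      paraphrased sketch, not the verbatim diff):

      	if (flags & O_CREAT) {
      		op->intent |= LOOKUP_CREATE;
      		if (flags & O_EXCL) {
      			op->intent |= LOOKUP_EXCL;
      			/* O_CREAT | O_EXCL never follows a trailing symlink
      			 * anyway, so make that explicit. */
      			flags |= O_NOFOLLOW;
      		}
      	}
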
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  25. 21 Jan 2020, 1 commit