1. 07 6月, 2021 1 次提交
    • J
      quota: Change quotactl_path() systcall to an fd-based one · 64c2c2c6
      Jan Kara 提交于
      Some users have pointed out that path-based syscalls are problematic in
      some environments and at least directory fd argument and possibly also
      resolve flags are desirable for such syscalls. Rather than
      reimplementing all details of pathname lookup and following where it may
      eventually evolve, let's go for full file descriptor based syscall
      similar to how ioctl(2) works since the beginning. Managing of quotas
      isn't performance sensitive so the extra overhead of open does not
      matter and we are able to consume O_PATH descriptors as well which makes
      open cheap anyway. Also for frequent operations (such as retrieving
      usage information for all users) we can reuse single fd and in fact get
      even better performance as well as avoiding races with possible remounts
      etc.
      Tested-by: NSascha Hauer <s.hauer@pengutronix.de>
      Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      64c2c2c6
  2. 23 4月, 2021 1 次提交
    • M
      landlock: Add syscall implementations · 265885da
      Mickaël Salaün 提交于
      These 3 system calls are designed to be used by unprivileged processes
      to sandbox themselves:
      * landlock_create_ruleset(2): Creates a ruleset and returns its file
        descriptor.
      * landlock_add_rule(2): Adds a rule (e.g. file hierarchy access) to a
        ruleset, identified by the dedicated file descriptor.
      * landlock_restrict_self(2): Enforces a ruleset on the calling thread
        and its future children (similar to seccomp).  This syscall has the
        same usage restrictions as seccomp(2): the caller must have the
        no_new_privs attribute set or have CAP_SYS_ADMIN in the current user
        namespace.
      
      All these syscalls have a "flags" argument (not currently used) to
      enable extensibility.
      
      Here are the motivations for these new syscalls:
      * A sandboxed process may not have access to file systems, including
        /dev, /sys or /proc, but it should still be able to add more
        restrictions to itself.
      * Neither prctl(2) nor seccomp(2) (which was used in a previous version)
        fit well with the current definition of a Landlock security policy.
      
      All passed structs (attributes) are checked at build time to ensure that
      they don't contain holes and that they are aligned the same way for each
      architecture.
      
      See the user and kernel documentation for more details (provided by a
      following commit):
      * Documentation/userspace-api/landlock.rst
      * Documentation/security/landlock.rst
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NMickaël Salaün <mic@linux.microsoft.com>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Link: https://lore.kernel.org/r/20210422154123.13086-9-mic@digikod.netSigned-off-by: NJames Morris <jamorris@linux.microsoft.com>
      265885da
  3. 17 3月, 2021 1 次提交
  4. 28 1月, 2021 1 次提交
  5. 24 1月, 2021 1 次提交
    • C
      fs: add mount_setattr() · 2a186721
      Christian Brauner 提交于
      This implements the missing mount_setattr() syscall. While the new mount
      api allows to change the properties of a superblock there is currently
      no way to change the properties of a mount or a mount tree using file
      descriptors which the new mount api is based on. In addition the old
      mount api has the restriction that mount options cannot be applied
      recursively. This hasn't changed since changing mount options on a
      per-mount basis was implemented in [1] and has been a frequent request
      not just for convenience but also for security reasons. The legacy
      mount syscall is unable to accommodate this behavior without introducing
      a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
      MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
      mount. Changing MS_REC to apply to the whole mount tree would mean
      introducing a significant uapi change and would likely cause significant
      regressions.
      
      The new mount_setattr() syscall allows to recursively clear and set
      mount options in one shot. Multiple calls to change mount options
      requesting the same changes are idempotent:
      
      int mount_setattr(int dfd, const char *path, unsigned flags,
                        struct mount_attr *uattr, size_t usize);
      
      Flags to modify path resolution behavior are specified in the @flags
      argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
      and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
      restrict path resolution as introduced with openat2() might be supported
      in the future.
      
      The mount_setattr() syscall can be expected to grow over time and is
      designed with extensibility in mind. It follows the extensible syscall
      pattern we have used with other syscalls such as openat2(), clone3(),
      sched_{set,get}attr(), and others.
      The set of mount options is passed in the uapi struct mount_attr which
      currently has the following layout:
      
      struct mount_attr {
      	__u64 attr_set;
      	__u64 attr_clr;
      	__u64 propagation;
      	__u64 userns_fd;
      };
      
      The @attr_set and @attr_clr members are used to clear and set mount
      options. This way a user can e.g. request that a set of flags is to be
      raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
      @attr_set while at the same time requesting that another set of flags is
      to be lowered such as removing noexec from a mount tree by specifying
      MOUNT_ATTR_NOEXEC in @attr_clr.
      
      Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
      not a bitmap, users wanting to transition to a different atime setting
      cannot simply specify the atime setting in @attr_set, but must also
      specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
      MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
      can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
      @attr_clr.
      
      The @propagation field lets callers specify the propagation type of a
      mount tree. Propagation is a single property that has four different
      settings and as such is not really a flag argument but an enum.
      Specifically, it would be unclear what setting and clearing propagation
      settings in combination would amount to. The legacy mount() syscall thus
      forbids the combination of multiple propagation settings too. The goal
      is to keep the semantics of mount propagation somewhat simple as they
      are overly complex as it is.
      
      The @userns_fd field lets user specify a user namespace whose idmapping
      becomes the idmapping of the mount. This is implemented and explained in
      detail in the next patch.
      
      [1]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      2a186721
  6. 28 12月, 2020 1 次提交
  7. 20 12月, 2020 1 次提交
  8. 11 12月, 2020 2 次提交
  9. 10 12月, 2020 1 次提交
    • H
      io_uring: add timeout support for io_uring_enter() · c73ebb68
      Hao Xu 提交于
      Now users who want to get woken when waiting for events should submit a
      timeout command first. It is not safe for applications that split SQ and
      CQ handling between two threads, such as mysql. Users should synchronize
      the two threads explicitly to protect SQ and that will impact the
      performance.
      
      This patch adds support for timeout to existing io_uring_enter(). To
      avoid overloading arguments, it introduces a new parameter structure
      which contains sigmask and timeout.
      
      I have tested the workloads with one thread submiting nop requests
      while the other reaping the cqe with timeout. It shows 1.8~2x faster
      when the iodepth is 16.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
      [axboe: various cleanups/fixes, and name change to SIG_IS_DATA]
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c73ebb68
  10. 13 11月, 2020 1 次提交
  11. 26 10月, 2020 1 次提交
  12. 19 10月, 2020 1 次提交
    • M
      mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim 提交于
      There is usecase that System Management Software(SMS) want to give a
      memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
      case of Android, it is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses pidfd of an external process to give the
      hint.  It also supports vector address range because Android app has
      thousands of vmas due to zygote so it's totally waste of CPU and power if
      we should call the syscall one by one for each vma.(With testing 2000-vma
      syscall vs 1-vector syscall, it showed 15% performance improvement.  I
      think it would be bigger in real practice because the testing ran very
      cache friendly environment).
      
      Another potential use case for the vector range is to amortize the cost
      ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
      benefit users like TCP receive zerocopy and malloc implementations.  In
      future, we could find more usecases for other advises so let's make it
      happens as API since we introduce a new syscall at this moment.  With
      that, existing madvise(2) user could replace it with process_madvise(2)
      with their own pid if they want to have batch address ranges support
      feature.
      
      ince it could affect other process's address range, only privileged
      process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
      UID) gives it the right to ptrace the process could use it successfully.
      The flag argument is reserved for future use if we need to extend the API.
      
      I think supporting all hints madvise has/will supported/support to
      process_madvise is rather risky.  Because we are not sure all hints make
      sense from external process and implementation for the hint may rely on
      the caller being in the current context so it could be error-prone.  Thus,
      I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
      
      If someone want to add other hints, we could hear the usecase and review
      it for each hint.  It's safer for maintenance rather than introducing a
      buggy syscall but hard to fix it later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or directions
            to the kernel about the address ranges from external process as well as
            local process. It provides the advice to address ranges of process
            described by iovec and vlen. The goal of such advice is to improve
            system or application performance.
      
            The pidfd selects the process referred to by the PID file descriptor
            specified in pidfd. (See pidofd_open(2) for further information)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at address(iov_base)
            and with size length of bytes(iov_len).
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to external process is governed by a
            ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
      
            The process_madvise supports every advice madvise(2) has if target
            process is in same thread group with calling process so user could
            use process_madvise(2) to extend existing madvise(2) to support
            vector address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes, if an error occurred. The caller should check return value
            to determine whether a partial advice occurred.
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How to guarantee the race(i.e., object validation) between when
      giving a hint from an external process and get the hint from the target
      process?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the space
      target process can run between the time the process_madvise process
      inspects the target process address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  It
      also apply other APIs like move_pages, process_vm_write.
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization by using other API or
      design from userside.  It shouldn't be part of API itself.  If someone
      needs more fine-grained synchronization rather than process level,
      there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
      applicable via using last reserved argument of the API but I don't
      think it's necessary right now since we have already ways to prevent
      the race so don't want to add additional complexity with more
      fine-grained optimization model.
      
      To make the API extend, it reserved an unsigned long as last argument
      so we could support it in future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such injected madvise would have to be executed by the
      target process, which means that process would have to be runnable and
      that creates the risk of the abovementioned race and hinting a wrong
      VMA.  Furthermore, we want to act the hint in caller's context, not the
      callee's, because the callee is usually limited in cpuset/cgroups or
      even freezed state so they can't act by themselves quick enough, which
      causes more thrashing/kill.  It doesn't work if the target process are
      ptraced(e.g., strace, debugger, minidump) because a process can have at
      most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.auSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecb8ac8b
  13. 25 9月, 2020 1 次提交
  14. 15 8月, 2020 1 次提交
    • X
      all arch: remove system call sys_sysctl · 88db0aa2
      Xiaoming Ni 提交于
      Since commit 61a47c1a ("sysctl: Remove the sysctl system call"),
      sys_sysctl is actually unavailable: any input can only return an error.
      
      We have been warning about people using the sysctl system call for years
      and believe there are no more users.  Even if there are users of this
      interface if they have not complained or fixed their code by now they
      probably are not going to, so there is no point in warning them any
      longer.
      
      So completely remove sys_sysctl on all architectures.
      
      [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu]
       Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.comSigned-off-by: NXiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: Will Deacon <will@kernel.org>		[arm/arm64]
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bin Meng <bin.meng@windriver.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: chenzefeng <chenzefeng2@huawei.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Diego Elio Pettenò <flameeyes@flameeyes.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kars de Jong <jongk@linux-m68k.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Sven Schnelle <svens@stackframe.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com>
      Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88db0aa2
  15. 13 8月, 2020 1 次提交
  16. 31 7月, 2020 17 次提交
  17. 30 7月, 2020 1 次提交
  18. 20 7月, 2020 1 次提交
    • C
      net: remove compat_sys_{get,set}sockopt · 55db9c0e
      Christoph Hellwig 提交于
      Now that the ->compat_{get,set}sockopt proto_ops methods are gone
      there is no good reason left to keep the compat syscalls separate.
      
      This fixes the odd use of unsigned int for the compat_setsockopt
      optlen and the missing sock_use_custom_sol_socket.
      
      It would also easily allow running the eBPF hooks for the compat
      syscalls, but such a large change in behavior does not belong into
      a consolidation patch like this one.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      55db9c0e
  19. 17 6月, 2020 1 次提交
    • C
      open: add close_range() · 278a5fba
      Christian Brauner 提交于
      This adds the close_range() syscall. It allows to efficiently close a range
      of file descriptors up to all file descriptors of a calling task.
      
      I was contacted by FreeBSD as they wanted to have the same close_range()
      syscall as we proposed here. We've coordinated this and in the meantime, Kyle
      was fast enough to merge close_range() into FreeBSD already in April:
      https://reviews.freebsd.org/D21627
      https://svnweb.freebsd.org/base?view=revision&revision=359836
      and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
      once its merged in Linux too. Python is in the process of switching to
      close_range() on FreeBSD and they are waiting on us to merge this to switch on
      Linux as well: https://bugs.python.org/issue38061
      
      The syscall came up in a recent discussion around the new mount API and
      making new file descriptor types cloexec by default. During this
      discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
      syscall in this manner has been requested by various people over time.
      
      First, it helps to close all file descriptors of an exec()ing task. This
      can be done safely via (quoting Al's example from [1] verbatim):
      
              /* that exec is sensitive */
              unshare(CLONE_FILES);
              /* we don't want anything past stderr here */
              close_range(3, ~0U);
              execve(....);
      
      The code snippet above is one way of working around the problem that file
      descriptors are not cloexec by default. This is aggravated by the fact that
      we can't just switch them over without massively regressing userspace. For
      a whole class of programs having an in-kernel method of closing all file
      descriptors is very helpful (e.g. demons, service managers, programming
      language standard libraries, container managers etc.).
      (Please note, unshare(CLONE_FILES) should only be needed if the calling
      task is multi-threaded and shares the file descriptor table with another
      thread in which case two threads could race with one thread allocating file
      descriptors and the other one closing them via close_range(). For the
      general case close_range() before the execve() is sufficient.)
      
      Second, it allows userspace to avoid implementing closing all file
      descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
      file descriptor. From looking at various large(ish) userspace code bases
      this or similar patterns are very common in:
      - service managers (cf. [4])
      - libcs (cf. [6])
      - container runtimes (cf. [5])
      - programming language runtimes/standard libraries
        - Python (cf. [2])
        - Rust (cf. [7], [8])
      As Dmitry pointed out there's even a long-standing glibc bug about missing
      kernel support for this task (cf. [3]).
      In addition, the syscall will also work for tasks that do not have procfs
      mounted and on kernels that do not have procfs support compiled in. In such
      situations the only way to make sure that all file descriptors are closed
      is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
      OPEN_MAX trickery (cf. comment [8] on Rust).
      
      The performance is striking. For good measure, comparing the following
      simple close_all_fds() userspace implementation that is essentially just
      glibc's version in [6]:
      
      static int close_all_fds(void)
      {
              int dir_fd;
              DIR *dir;
              struct dirent *direntp;
      
              dir = opendir("/proc/self/fd");
              if (!dir)
                      return -1;
              dir_fd = dirfd(dir);
              while ((direntp = readdir(dir))) {
                      int fd;
                      if (strcmp(direntp->d_name, ".") == 0)
                              continue;
                      if (strcmp(direntp->d_name, "..") == 0)
                              continue;
                      fd = atoi(direntp->d_name);
                      if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                              continue;
                      close(fd);
              }
              closedir(dir);
              return 0;
      }
      
      to close_range() yields:
      1. closing 4 open files:
         - close_all_fds(): ~280 us
         - close_range():    ~24 us
      
      2. closing 1000 open files:
         - close_all_fds(): ~5000 us
         - close_range():   ~800 us
      
      close_range() is designed to allow for some flexibility. Specifically, it
      does not simply always close all open file descriptors of a task. Instead,
      callers can specify an upper bound.
      This is e.g. useful for scenarios where specific file descriptors are
      created with well-known numbers that are supposed to be excluded from
      getting closed.
      For extra paranoia close_range() comes with a flags argument. This can e.g.
      be used to implement extension. Once can imagine userspace wanting to stop
      at the first error instead of ignoring errors under certain circumstances.
      There might be other valid ideas in the future. In any case, a flag
      argument doesn't hurt and keeps us on the safe side.
      
      From an implementation side this is kept rather dumb. It saw some input
      from David and Jann but all nonsense is obviously my own!
      - Errors to close file descriptors are currently ignored. (Could be changed
        by setting a flag in the future if needed.)
      - __close_range() is a rather simplistic wrapper around __close_fd().
        My reasoning behind this is based on the nature of how __close_fd() needs
        to release an fd. But maybe I misunderstood specifics:
        We take the files_lock and rcu-dereference the fdtable of the calling
        task, we find the entry in the fdtable, get the file and need to release
        files_lock before calling filp_close().
        In the meantime the fdtable might have been altered so we can't just
        retake the spinlock and keep the old rcu-reference of the fdtable
        around. Instead we need to grab a fresh reference to the fdtable.
        If my reasoning is correct then there's really no point in fancyfying
        __close_range(): We just need to rcu-dereference the fdtable of the
        calling task once to cap the max_fd value correctly and then go on
        calling __close_fd() in a loop.
      
      /* References */
      [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
      [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
      [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
      [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
      [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
      [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
           Note that this is an internal implementation that is not exported.
           Currently, libc seems to not provide an exported version of this
           because of missing kernel support to do this.
           Note, in a recent patch series Florian made grantpt() a nop thereby
           removing the code referenced here.
      [7]: https://github.com/rust-lang/rust/issues/12148
      [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
           Rust's solution is slightly different but is equally unperformant.
           Rust calls getdtablesize() which is a glibc library function that
           simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
           goes on to call close() on each fd. That's obviously overkill for most
           tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
           OPEN_MAX.
           Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
           to 1024. Even in this case, there's a very high chance that in the
           common case Rust is calling the close() syscall 1021 times pointlessly
           if the task just has 0, 1, and 2 open.
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kyle Evans <self@kyle-evans.net>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      278a5fba
  20. 15 6月, 2020 1 次提交
  21. 14 5月, 2020 1 次提交
    • M
      vfs: add faccessat2 syscall · c8ffd8bc
      Miklos Szeredi 提交于
      POSIX defines faccessat() as having a fourth "flags" argument, while the
      linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
      AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
      
      Add a new faccessat(2) syscall with the added flags argument and implement
      both flags.
      
      The value of AT_EACCESS is defined in glibc headers to be the same as
      AT_REMOVEDIR.  Use this value for the kernel interface as well, together
      with the explanatory comment.
      
      Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
      be useful and is trivial to implement.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      c8ffd8bc
  22. 18 1月, 2020 1 次提交
    • A
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai 提交于
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fddb5d43
  23. 14 1月, 2020 1 次提交