1. 31 7月, 2020 4 次提交
  2. 30 7月, 2020 1 次提交
  3. 15 6月, 2020 1 次提交
  4. 14 5月, 2020 1 次提交
    • M
      vfs: add faccessat2 syscall · c8ffd8bc
      Miklos Szeredi 提交于
      POSIX defines faccessat() as having a fourth "flags" argument, while the
      linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
      AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
      
      Add a new faccessat(2) syscall with the added flags argument and implement
      both flags.
      
      The value of AT_EACCESS is defined in glibc headers to be the same as
      AT_REMOVEDIR.  Use this value for the kernel interface as well, together
      with the explanatory comment.
      
      Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
      be useful and is trivial to implement.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      c8ffd8bc
  5. 18 1月, 2020 1 次提交
    • A
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai 提交于
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fddb5d43
  6. 14 1月, 2020 1 次提交
  7. 03 1月, 2020 1 次提交
    • D
      Revert "fs: remove ksys_dup()" · 74f1a299
      Dominik Brodowski 提交于
      This reverts commit 8243186f ("fs: remove ksys_dup()") and the
      subsequent fix for it in commit 2d3145f8 ("early init: fix error
      handling when opening /dev/console").
      
      Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
      caused more trouble than what is worth it: it requires accessing vfs
      internals and it turns out there were other bugs in it too.
      
      In particular, the file reference counting was wrong - because unlike
      the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
      thus had an extra leaked file reference.
      
      That in turn then caused odd problems with Androidx86 long after boot
      becaue of how the extra reference to the console kept the session active
      even after all file descriptors had been closed.
      Reported-by: Nyouling 257 <youling257@gmail.com>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74f1a299
  8. 19 12月, 2019 1 次提交
  9. 13 12月, 2019 1 次提交
  10. 12 12月, 2019 1 次提交
    • D
      init: use do_mount() instead of ksys_mount() · cccaa5e3
      Dominik Brodowski 提交于
      In prepare_namespace(), do_mount() can be used instead of ksys_mount()
      as the first and third argument are const strings in the kernel, the
      second and fourth argument are passed through anyway, and the fifth
      argument is NULL.
      
      In do_mount_root(), ksys_mount() is called with the first and third
      argument being already kernelspace strings, which do not need to be
      copied over from userspace to kernelspace (again). The second and
      fourth arguments are passed through to do_mount() anyway. The fifth
      argument, while already residing in kernelspace, needs to be put into
      a page of its own. Then, do_mount() can be used instead of
      ksys_mount().
      
      Once this is done, there are no in-kernel users to ksys_mount() left,
      which can therefore be removed.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      cccaa5e3
  11. 15 11月, 2019 3 次提交
  12. 08 9月, 2019 1 次提交
    • A
      ipc: fix sparc64 ipc() wrapper · fb377eb8
      Arnd Bergmann 提交于
      Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
      to a commit from my y2038 series in linux-5.1, as I missed the custom
      sys_ipc() wrapper that sparc64 uses in place of the generic version that
      I patched.
      
      The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
      now do not allow being called with the IPC_64 flag any more, resulting
      in a -EINVAL error when they don't recognize the command.
      
      Instead, the correct way to do this now is to call the internal
      ksys_old_{sem,shm,msg}ctl() functions to select the API version.
      
      As we generally move towards these functions anyway, change all of
      sparc_ipc() to consistently use those in place of the sys_*() versions,
      and move the required ksys_*() declarations into linux/syscalls.h
      
      The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
      errors when ipc is disabled.
      Reported-by: NMatt Turner <mattst88@gmail.com>
      Fixes: 275f2214 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
      Cc: stable@vger.kernel.org
      Tested-by: NMatt Turner <mattst88@gmail.com>
      Tested-by: NAnatoly Pugachev <matorola@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      fb377eb8
  13. 05 7月, 2019 1 次提交
  14. 28 6月, 2019 1 次提交
    • C
      pid: add pidfd_open() · 32fcb426
      Christian Brauner 提交于
      This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
      pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
      process that is created via traditional fork()/clone() calls that is only
      referenced by a PID:
      
      int pidfd = pidfd_open(1234, 0);
      ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
      
      With the introduction of pidfds through CLONE_PIDFD it is possible to
      created pidfds at process creation time.
      However, a lot of processes get created with traditional PID-based calls
      such as fork() or clone() (without CLONE_PIDFD). For these processes a
      caller can currently not create a pollable pidfd. This is a problem for
      Android's low memory killer (LMK) and service managers such as systemd.
      Both are examples of tools that want to make use of pidfds to get reliable
      notification of process exit for non-parents (pidfd polling) and race-free
      signal sending (pidfd_send_signal()). They intend to switch to this API for
      process supervision/management as soon as possible. Having no way to get
      pollable pidfds from PID-only processes is one of the biggest blockers for
      them in adopting this api. With pidfd_open() making it possible to retrieve
      pidfds for PID-based processes we enable them to adopt this api.
      
      In line with Arnd's recent changes to consolidate syscall numbers across
      architectures, I have added the pidfd_open() syscall to all architectures
      at the same time.
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      32fcb426
  15. 09 6月, 2019 1 次提交
    • C
      fork: add clone3 · 7f192e3c
      Christian Brauner 提交于
      This adds the clone3 system call.
      
      As mentioned several times already (cf. [7], [8]) here's the promised
      patchset for clone3().
      
      We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
      free flag from clone().
      
      Independent of the CLONE_PIDFD patchset a time namespace has been discussed
      at Linux Plumber Conference last year and has been sent out and reviewed
      (cf. [5]). It is expected that it will go upstream in the not too distant
      future. However, it relies on the addition of the CLONE_NEWTIME flag to
      clone(). The only other good candidate - CLONE_DETACHED - is currently not
      recyclable as we have identified at least two large or widely used
      codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
      CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
      blocked. clone3() has the advantage that it will unblock this patchset
      again. In general, clone3() is extensible and allows for the implementation
      of new features.
      
      The idea is to keep clone3() very simple and close to the original clone(),
      specifically, to keep on supporting old clone()-based workloads.
      We know there have been various creative proposals how a new process
      creation syscall or even api is supposed to look like. Some people even
      going so far as to argue that the traditional fork()+exec() split should be
      abandoned in favor of an in-kernel version of spawn(). Independent of
      whether or not we personally think spawn() is a good idea this patchset has
      and does not want to have anything to do with this.
      One stance we take is that there's no real good alternative to
      clone()+exec() and we need and want to support this model going forward;
      independent of spawn().
      The following requirements guided clone3():
      - bump the number of available flags
      - move arguments that are currently passed as separate arguments
        in clone() into a dedicated struct clone_args
        - choose a struct layout that is easy to handle on 32 and on 64 bit
        - choose a struct layout that is extensible
        - give new flags that currently need to abuse another flag's dedicated
          return argument in clone() their own dedicated return argument
          (e.g. CLONE_PIDFD)
        - use a separate kernel internal struct kernel_clone_args that is
          properly typed according to current kernel conventions in fork.c and is
          different from  the uapi struct clone_args
      - port _do_fork() to use kernel_clone_args so that all process creation
        syscalls such as fork(), vfork(), clone(), and clone3() behave identical
        (Arnd suggested, that we can probably also port do_fork() itself in a
         separate patchset.)
      - ease of transition for userspace from clone() to clone3()
        This very much means that we do *not* remove functionality that userspace
        currently relies on as the latter is a good way of creating a syscall
        that won't be adopted.
      - do not try to be clever or complex: keep clone3() as dumb as possible
      
      In accordance with Linus suggestions (cf. [11]), clone3() has the following
      signature:
      
      /* uapi */
      struct clone_args {
              __aligned_u64 flags;
              __aligned_u64 pidfd;
              __aligned_u64 child_tid;
              __aligned_u64 parent_tid;
              __aligned_u64 exit_signal;
              __aligned_u64 stack;
              __aligned_u64 stack_size;
              __aligned_u64 tls;
      };
      
      /* kernel internal */
      struct kernel_clone_args {
              u64 flags;
              int __user *pidfd;
              int __user *child_tid;
              int __user *parent_tid;
              int exit_signal;
              unsigned long stack;
              unsigned long stack_size;
              unsigned long tls;
      };
      
      long sys_clone3(struct clone_args __user *uargs, size_t size)
      
      clone3() cleanly supports all of the supported flags from clone() and thus
      all legacy workloads.
      The advantage of sticking close to the old clone() is the low cost for
      userspace to switch to this new api. Quite a lot of userspace apis (e.g.
      pthreads) are based on the clone() syscall. With the new clone3() syscall
      supporting all of the old workloads and opening up the ability to add new
      features should make switching to it for userspace more appealing. In
      essence, glibc can just write a simple wrapper to switch from clone() to
      clone3().
      
      There has been some interest in this patchset already. We have received a
      patch from the CRIU corner for clone3() that would set the PID/TID of a
      restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
      
      /* User visible differences to legacy clone() */
      - CLONE_DETACHED will cause EINVAL with clone3()
      - CSIGNAL is deprecated
        It is superseeded by a dedicated "exit_signal" argument in struct
        clone_args freeing up space for additional flags.
        This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
      
      /* References */
      [1]: b3e58382
      [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
      [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
      [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
      [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
      [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
      [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
      [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
      [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
      [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
      [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      7f192e3c
  16. 05 6月, 2019 1 次提交
  17. 27 5月, 2019 1 次提交
  18. 21 3月, 2019 6 次提交
    • D
      vfs: syscall: Add fspick() to select a superblock for reconfiguration · cf3cba4a
      David Howells 提交于
      Provide an fspick() system call that can be used to pick an existing
      mountpoint into an fs_context which can thereafter be used to reconfigure a
      superblock (equivalent of the superblock side of -o remount).
      
      This looks like:
      
      	int fd = fspick(AT_FDCWD, "/mnt",
      			FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
      	fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
      	fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
      	fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
      
      At the point of fspick being called, the file descriptor referring to the
      filesystem context is in exactly the same state as the one that was created
      by fsopen() after fsmount() has been successfully called.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cf3cba4a
    • D
      vfs: syscall: Add fsmount() to create a mount for a superblock · 93766fbd
      David Howells 提交于
      Provide a system call by which a filesystem opened with fsopen() and
      configured by a series of fsconfig() calls can have a detached mount object
      created for it.  This mount object can then be attached to the VFS mount
      hierarchy using move_mount() by passing the returned file descriptor as the
      from directory fd.
      
      The system call looks like:
      
      	int mfd = fsmount(int fsfd, unsigned int flags,
      			  unsigned int attr_flags);
      
      where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
      FSMOUNT_CLOEXEC.  attr_flags is a bitwise-OR of the following flags:
      
      	MOUNT_ATTR_RDONLY	Mount read-only
      	MOUNT_ATTR_NOSUID	Ignore suid and sgid bits
      	MOUNT_ATTR_NODEV	Disallow access to device special files
      	MOUNT_ATTR_NOEXEC	Disallow program execution
      	MOUNT_ATTR__ATIME	Setting on how atime should be updated
      	MOUNT_ATTR_RELATIME	- Update atime relative to mtime/ctime
      	MOUNT_ATTR_NOATIME	- Do not update access times
      	MOUNT_ATTR_STRICTATIME	- Always perform atime updates
      	MOUNT_ATTR_NODIRATIME	Do not update directory access times
      
      In the event that fsmount() fails, it may be possible to get an error
      message by calling read() on fsfd.  If no message is available, ENODATA
      will be reported.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      93766fbd
    • D
      vfs: syscall: Add fsconfig() for configuring and managing a context · ecdab150
      David Howells 提交于
      Add a syscall for configuring a filesystem creation context and triggering
      actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
      
          long fsconfig(int fs_fd, unsigned int cmd, const char *key,
      		  const void *value, int aux);
      
      Where fs_fd indicates the context, cmd indicates the action to take, key
      indicates the parameter name for parameter-setting actions and, if needed,
      value points to a buffer containing the value and aux can give more
      information for the value.
      
      The following command IDs are proposed:
      
       (*) FSCONFIG_SET_FLAG: No value is specified.  The parameter must be
           boolean in nature.  The key may be prefixed with "no" to invert the
           setting. value must be NULL and aux must be 0.
      
       (*) FSCONFIG_SET_STRING: A string value is specified.  The parameter can
           be expecting boolean, integer, string or take a path.  A conversion to
           an appropriate type will be attempted (which may include looking up as
           a path).  value points to a NUL-terminated string and aux must be 0.
      
       (*) FSCONFIG_SET_BINARY: A binary blob is specified.  value points to
           the blob and aux indicates its size.  The parameter must be expecting
           a blob.
      
       (*) FSCONFIG_SET_PATH: A non-empty path is specified.  The parameter must
           be expecting a path object.  value points to a NUL-terminated string
           that is the path and aux is a file descriptor at which to start a
           relative lookup or AT_FDCWD.
      
       (*) FSCONFIG_SET_PATH_EMPTY: As fsconfig_set_path, but with AT_EMPTY_PATH
           implied.
      
       (*) FSCONFIG_SET_FD: An open file descriptor is specified.  value must
           be NULL and aux indicates the file descriptor.
      
       (*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
      
       (*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
      
      For the "set" command IDs, the idea is that the file_system_type will point
      to a list of parameters and the types of value that those parameters expect
      to take.  The core code can then do the parse and argument conversion and
      then give the LSM and FS a cooked option or array of options to use.
      
      Source specification is also done the same way same way, using special keys
      "source", "source1", "source2", etc..
      
      [!] Note that, for the moment, the key and value are just glued back
      together and handed to the filesystem.  Every filesystem that uses options
      uses match_token() and co. to do this, and this will need to be changed -
      but not all at once.
      
      Example usage:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
          fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd);
          fsconfig(fd, fsconfig_set_fd, "journal_fd", "", journal_fd);
          fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0);
          fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0);
          fsconfig(fd, fsconfig_set_string, "sb", "1", 0);
          fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
          fsconfig(fd, fsconfig_set_string, "data", "journal", 0);
          fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("afs", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("jffs2", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecdab150
    • D
      vfs: syscall: Add fsopen() to prepare for superblock creation · 24dcb3d9
      David Howells 提交于
      Provide an fsopen() system call that starts the process of preparing to
      create a superblock that will then be mountable, using an fd as a context
      handle.  fsopen() is given the name of the filesystem that will be used:
      
      	int mfd = fsopen(const char *fsname, unsigned int flags);
      
      where flags can be 0 or FSOPEN_CLOEXEC.
      
      For example:
      
      	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
      	fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
      	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
      	fsinfo(sfd, NULL, ...); // query new superblock attributes
      	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
      	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
      
      	sfd = fsopen("afs", -1);
      	fsconfig(fd, FSCONFIG_SET_STRING, "source",
      		 "#grand.central.org:root.cell", 0);
      	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
      	mfd = fsmount(sfd, 0, MS_NODEV);
      	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
      
      If an error is reported at any step, an error message may be available to be
      read() back (ENODATA will be reported if there isn't an error available) in
      the form:
      
      	"e <subsys>:<problem>"
      	"e SELinux:Mount on mountpoint not permitted"
      
      Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
      even if the fsmount() fails.  read() is still possible to retrieve error
      information.
      
      The fsopen() syscall creates a mount context and hangs it of the fd that it
      returns.
      
      Netlink is not used because it is optional and would make the core VFS
      dependent on the networking layer and also potentially add network
      namespace issues.
      
      Note that, for the moment, the caller must have SYS_CAP_ADMIN to use
      fsopen().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      24dcb3d9
    • D
      vfs: syscall: Add move_mount(2) to move mounts around · 2db154b3
      David Howells 提交于
      Add a move_mount() system call that will move a mount from one place to
      another and, in the next commit, allow to attach an unattached mount tree.
      
      The new system call looks like the following:
      
      	int move_mount(int from_dfd, const char *from_path,
      		       int to_dfd, const char *to_path,
      		       unsigned int flags);
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2db154b3
    • A
      vfs: syscall: Add open_tree(2) to reference or clone a mount · a07b2000
      Al Viro 提交于
      open_tree(dfd, pathname, flags)
      
      Returns an O_PATH-opened file descriptor or an error.
      dfd and pathname specify the location to open, in usual
      fashion (see e.g. fstatat(2)).  flags should be an OR of
      some of the following:
      	* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
      same meanings as usual
      	* OPEN_TREE_CLOEXEC - make the resulting descriptor
      close-on-exec
      	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
      instead of opening the location in question, create a detached
      mount tree matching the subtree rooted at location specified by
      dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
      without it - only the part within in the mount containing the
      location in question.  In other words, the same as mount --rbind
      or mount --bind would've taken.  The detached tree will be
      dissolved on the final close of obtained file.  Creation of such
      detached trees requires the same capabilities as doing mount --bind.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a07b2000
  19. 06 3月, 2019 1 次提交
    • C
      signal: add pidfd_send_signal() syscall · 3eb39f47
      Christian Brauner 提交于
      The kill() syscall operates on process identifiers (pid). After a process
      has exited its pid can be reused by another process. If a caller sends a
      signal to a reused pid it will end up signaling the wrong process. This
      issue has often surfaced and there has been a push to address this problem [1].
      
      This patch uses file descriptors (fd) from proc/<pid> as stable handles on
      struct pid. Even if a pid is recycled the handle will not change. The fd
      can be used to send signals to the process it refers to.
      Thus, the new syscall pidfd_send_signal() is introduced to solve this
      problem. Instead of pids it operates on process fds (pidfd).
      
      /* prototype and argument /*
      long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
      
      /* syscall number 424 */
      The syscall number was chosen to be 424 to align with Arnd's rework in his
      y2038 to minimize merge conflicts (cf. [25]).
      
      In addition to the pidfd and signal argument it takes an additional
      siginfo_t and flags argument. If the siginfo_t argument is NULL then
      pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
      is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
      The flags argument is added to allow for future extensions of this syscall.
      It currently needs to be passed as 0. Failing to do so will cause EINVAL.
      
      /* pidfd_send_signal() replaces multiple pid-based syscalls */
      The pidfd_send_signal() syscall currently takes on the job of
      rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
      positive pid is passed to kill(2). It will however be possible to also
      replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
      
      /* sending signals to threads (tid) and process groups (pgid) */
      Specifically, the pidfd_send_signal() syscall does currently not operate on
      process groups or threads. This is left for future extensions.
      In order to extend the syscall to allow sending signal to threads and
      process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
      PIDFD_TYPE_TID) should be added. This implies that the flags argument will
      determine what is signaled and not the file descriptor itself. Put in other
      words, grouping in this api is a property of the flags argument not a
      property of the file descriptor (cf. [13]). Clarification for this has been
      requested by Eric (cf. [19]).
      When appropriate extensions through the flags argument are added then
      pidfd_send_signal() can additionally replace the part of kill(2) which
      operates on process groups as well as the tgkill(2) and
      rt_tgsigqueueinfo(2) syscalls.
      How such an extension could be implemented has been very roughly sketched
      in [14], [15], and [16]. However, this should not be taken as a commitment
      to a particular implementation. There might be better ways to do it.
      Right now this is intentionally left out to keep this patchset as simple as
      possible (cf. [4]).
      
      /* naming */
      The syscall had various names throughout iterations of this patchset:
      - procfd_signal()
      - procfd_send_signal()
      - taskfd_send_signal()
      In the last round of reviews it was pointed out that given that if the
      flags argument decides the scope of the signal instead of different types
      of fds it might make sense to either settle for "procfd_" or "pidfd_" as
      prefix. The community was willing to accept either (cf. [17] and [18]).
      Given that one developer expressed strong preference for the "pidfd_"
      prefix (cf. [13]) and with other developers less opinionated about the name
      we should settle for "pidfd_" to avoid further bikeshedding.
      
      The  "_send_signal" suffix was chosen to reflect the fact that the syscall
      takes on the job of multiple syscalls. It is therefore intentional that the
      name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
      fomer because it might imply that pidfd_send_signal() is a replacement for
      kill(2), and not the latter because it is a hassle to remember the correct
      spelling - especially for non-native speakers - and because it is not
      descriptive enough of what the syscall actually does. The name
      "pidfd_send_signal" makes it very clear that its job is to send signals.
      
      /* zombies */
      Zombies can be signaled just as any other process. No special error will be
      reported since a zombie state is an unreliable state (cf. [3]). However,
      this can be added as an extension through the @flags argument if the need
      ever arises.
      
      /* cross-namespace signals */
      The patch currently enforces that the signaler and signalee either are in
      the same pid namespace or that the signaler's pid namespace is an ancestor
      of the signalee's pid namespace. This is done for the sake of simplicity
      and because it is unclear to what values certain members of struct
      siginfo_t would need to be set to (cf. [5], [6]).
      
      /* compat syscalls */
      It became clear that we would like to avoid adding compat syscalls
      (cf. [7]).  The compat syscall handling is now done in kernel/signal.c
      itself by adding __copy_siginfo_from_user_generic() which lets us avoid
      compat syscalls (cf. [8]). It should be noted that the addition of
      __copy_siginfo_from_user_any() is caused by a bug in the original
      implementation of rt_sigqueueinfo(2) (cf. 12).
      With upcoming rework for syscall handling things might improve
      significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
      any additional callers.
      
      /* testing */
      This patch was tested on x64 and x86.
      
      /* userspace usage */
      An asciinema recording for the basic functionality can be found under [9].
      With this patch a process can be killed via:
      
       #define _GNU_SOURCE
       #include <errno.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/stat.h>
       #include <sys/syscall.h>
       #include <sys/types.h>
       #include <unistd.h>
      
       static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                               unsigned int flags)
       {
       #ifdef __NR_pidfd_send_signal
               return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
       #else
               return -ENOSYS;
       #endif
       }
      
       int main(int argc, char *argv[])
       {
               int fd, ret, saved_errno, sig;
      
               if (argc < 3)
                       exit(EXIT_FAILURE);
      
               fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
               if (fd < 0) {
                       printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               sig = atoi(argv[2]);
      
               printf("Sending signal %d to process %s\n", sig, argv[1]);
               ret = do_pidfd_send_signal(fd, sig, NULL, 0);
      
               saved_errno = errno;
               close(fd);
               errno = saved_errno;
      
               if (ret < 0) {
                       printf("%s - Failed to send signal %d to process %s\n",
                              strerror(errno), sig, argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               exit(EXIT_SUCCESS);
       }
      
      /* Q&A
       * Given that it seems the same questions get asked again by people who are
       * late to the party it makes sense to add a Q&A section to the commit
       * message so it's hopefully easier to avoid duplicate threads.
       *
       * For the sake of progress please consider these arguments settled unless
       * there is a new point that desperately needs to be addressed. Please make
       * sure to check the links to the threads in this commit message whether
       * this has not already been covered.
       */
      Q-01: (Florian Weimer [20], Andrew Morton [21])
            What happens when the target process has exited?
      A-01: Sending the signal will fail with ESRCH (cf. [22]).
      
      Q-02:  (Andrew Morton [21])
             Is the task_struct pinned by the fd?
      A-02:  No. A reference to struct pid is kept. struct pid - as far as I
             understand - was created exactly for the reason to not require to
             pin struct task_struct (cf. [22]).
      
      Q-03: (Andrew Morton [21])
            Does the entire procfs directory remain visible? Just one entry
            within it?
      A-03: The same thing that happens right now when you hold a file descriptor
            to /proc/<pid> open (cf. [22]).
      
      Q-04: (Andrew Morton [21])
            Does the pid remain reserved?
      A-04: No. This patchset guarantees a stable handle not that pids are not
            recycled (cf. [22]).
      
      Q-05: (Andrew Morton [21])
            Do attempts to signal that fd return errors?
      A-05: See {Q,A}-01.
      
      Q-06: (Andrew Morton [22])
            Is there a cleaner way of obtaining the fd? Another syscall perhaps.
      A-06: Userspace can already trivially retrieve file descriptors from procfs
            so this is something that we will need to support anyway. Hence,
            there's no immediate need to add another syscalls just to make
            pidfd_send_signal() not dependent on the presence of procfs. However,
            adding a syscalls to get such file descriptors is planned for a
            future patchset (cf. [22]).
      
      Q-07: (Andrew Morton [21] and others)
            This fd-for-a-process sounds like a handy thing and people may well
            think up other uses for it in the future, probably unrelated to
            signals. Are the code and the interface designed to permit such
            future applications?
      A-07: Yes (cf. [22]).
      
      Q-08: (Andrew Morton [21] and others)
            Now I think about it, why a new syscall? This thing is looking
            rather like an ioctl?
      A-08: This has been extensively discussed. It was agreed that a syscall is
            preferred for a variety or reasons. Here are just a few taken from
            prior threads. Syscalls are safer than ioctl()s especially when
            signaling to fds. Processes are a core kernel concept so a syscall
            seems more appropriate. The layout of the syscall with its four
            arguments would require the addition of a custom struct for the
            ioctl() thereby causing at least the same amount or even more
            complexity for userspace than a simple syscall. The new syscall will
            replace multiple other pid-based syscalls (see description above).
            The file-descriptors-for-processes concept introduced with this
            syscall will be extended with other syscalls in the future. See also
            [22], [23] and various other threads already linked in here.
      
      Q-09: (Florian Weimer [24])
            What happens if you use the new interface with an O_PATH descriptor?
      A-09:
            pidfds opened as O_PATH fds cannot be used to send signals to a
            process (cf. [2]). Signaling processes through pidfds is the
            equivalent of writing to a file. Thus, this is not an operation that
            operates "purely at the file descriptor level" as required by the
            open(2) manpage. See also [4].
      
      /* References */
      [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
      [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
      [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
      [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
      [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
      [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
      [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
      [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
      [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
      [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
      [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
      [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
      [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
      [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
      [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
      [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
      [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
      [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
      [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
      [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
      [23]: https://lwn.net/Articles/773459/
      [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NTycho Andersen <tycho@tycho.ws>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Acked-by: NAleksa Sarai <cyphar@cyphar.com>
      3eb39f47
  20. 28 2月, 2019 2 次提交
    • J
      io_uring: add support for pre-mapped user IO buffers · edafccee
      Jens Axboe 提交于
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      edafccee
    • J
      Add io_uring IO interface · 2b188cc1
      Jens Axboe 提交于
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b188cc1
  21. 07 2月, 2019 3 次提交
    • A
      y2038: syscalls: rename y2038 compat syscalls · 8dabe724
      Arnd Bergmann 提交于
      A lot of system calls that pass a time_t somewhere have an implementation
      using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
      been reworked so that this implementation can now be used on 32-bit
      architectures as well.
      
      The missing step is to redefine them using the regular SYSCALL_DEFINEx()
      to get them out of the compat namespace and make it possible to build them
      on 32-bit architectures.
      
      Any system call that ends in 'time' gets a '32' suffix on its name for
      that version, while the others get a '_time32' suffix, to distinguish
      them from the normal version, which takes a 64-bit time argument in the
      future.
      
      In this step, only 64-bit architectures are changed, doing this rename
      first lets us avoid touching the 32-bit architectures twice.
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      8dabe724
    • D
      timex: change syscalls to use struct __kernel_timex · 3876ced4
      Deepa Dinamani 提交于
      struct timex is not y2038 safe.
      Switch all the syscall apis to use y2038 safe __kernel_timex.
      
      Note that sys_adjtimex() does not have a y2038 safe solution.  C libraries
      can implement it by calling clock_adjtime(CLOCK_REALTIME, ...).
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      3876ced4
    • A
      time: fix sys_timer_settime prototype · 50b93f30
      Arnd Bergmann 提交于
      A small typo has crept into the y2038 conversion of the timer_settime
      system call. So far this was completely harmless, but once we start
      using the new version, this has to be fixed.
      
      Fixes: 6ff84735 ("time: Change types to new y2038 safe __kernel_itimerspec")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      50b93f30
  22. 26 1月, 2019 1 次提交
    • A
      ipc: rename old-style shmctl/semctl/msgctl syscalls · 275f2214
      Arnd Bergmann 提交于
      The behavior of these system calls is slightly different between
      architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
      symbol. Most architectures that implement the split IPC syscalls don't set
      that symbol and only get the modern version, but alpha, arm, microblaze,
      mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.
      
      For the architectures that so far only implement sys_ipc(), i.e. m68k,
      mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
      when adding the split syscalls, so we need to distinguish between the
      two groups of architectures.
      
      The method I picked for this distinction is to have a separate system call
      entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
      does not. The system call tables of the five architectures are changed
      accordingly.
      
      As an additional benefit, we no longer need the configuration specific
      definition for ipc_parse_version(), it always does the same thing now,
      but simply won't get called on architectures with the modern interface.
      
      A small downside is that on architectures that do set
      ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
      that are never called. They only add a few bytes of bloat, so it seems
      better to keep them compared to adding yet another Kconfig symbol.
      I considered adding new syscall numbers for the IPC_64 variants for
      consistency, but decided against that for now.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      275f2214
  23. 18 1月, 2019 1 次提交
  24. 18 12月, 2018 2 次提交
    • A
      y2038: signal: Add sys_rt_sigtimedwait_time32 · df8522a3
      Arnd Bergmann 提交于
      Once sys_rt_sigtimedwait() gets changed to a 64-bit time_t, we have
      to provide compatibility support for existing binaries.
      
      An earlier version of this patch reused the compat_sys_rt_sigtimedwait
      entry point to avoid code duplication, but this newer approach
      duplicates the existing native entry point instead, which seems
      a bit cleaner.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      df8522a3
    • A
      y2038: socket: Add compat_sys_recvmmsg_time64 · e11d4284
      Arnd Bergmann 提交于
      recvmmsg() takes two arguments to pointers of structures that differ
      between 32-bit and 64-bit architectures: mmsghdr and timespec.
      
      For y2038 compatbility, we are changing the native system call from
      timespec to __kernel_timespec with a 64-bit time_t (in another patch),
      and use the existing compat system call on both 32-bit and 64-bit
      architectures for compatibility with traditional 32-bit user space.
      
      As we now have two variants of recvmmsg() for 32-bit tasks that are both
      different from the variant that we use on 64-bit tasks, this means we
      also require two compat system calls!
      
      The solution I picked is to flip things around: The existing
      compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
      and now handles the case for old user space on all architectures that
      have set CONFIG_COMPAT_32BIT_TIME.  A new compat_sys_recvmmsg_time64()
      call gets added in the old place for 64-bit architectures only, this
      one handles the case of a compat mmsghdr structure combined with
      __kernel_timespec.
      
      In the indirect sys_socketcall(), we now need to call either
      do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
      architecture we are on. For compat_sys_socketcall(), no such change is
      needed, we always call __compat_sys_recvmmsg().
      
      I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
      implementation for 64-bit time_t will need significant changes including
      an updated asm/unistd.h, and it seems better to consistently use the
      separate syscalls that configuration, leaving the socketcall only for
      backward compatibility with 32-bit time_t based libc.
      
      The naming is asymmetric for the moment, so both existing syscalls
      entry points keep their names, while the new ones are recvmmsg_time32
      and compat_recvmmsg_time64 respectively. I expect that we will rename
      the compat syscalls later as we start using generated syscall tables
      everywhere and add these entry points.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      e11d4284
  25. 12 12月, 2018 1 次提交
    • T
      seccomp: switch system call argument type to void * · a5662e4d
      Tycho Andersen 提交于
      The const qualifier causes problems for any code that wants to write to the
      third argument of the seccomp syscall, as we will do in a future patch in
      this series.
      
      The third argument to the seccomp syscall is documented as void *, so
      rather than just dropping the const, let's switch everything to use void *
      as well.
      
      I believe this is safe because of 1. the documentation above, 2. there's no
      real type information exported about syscalls anywhere besides the man
      pages.
      Signed-off-by: NTycho Andersen <tycho@tycho.ws>
      CC: Kees Cook <keescook@chromium.org>
      CC: Andy Lutomirski <luto@amacapital.net>
      CC: Oleg Nesterov <oleg@redhat.com>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      CC: "Serge E. Hallyn" <serge@hallyn.com>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      CC: Christian Brauner <christian@brauner.io>
      CC: Tyler Hicks <tyhicks@canonical.com>
      CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      a5662e4d
  26. 08 12月, 2018 1 次提交
    • A
      y2038: futex: Add support for __kernel_timespec · bec2f7cb
      Arnd Bergmann 提交于
      This prepares sys_futex for y2038 safe calling: the native
      syscall is changed to receive a __kernel_timespec argument, which
      will be switched to 64-bit time_t in the future. All the internal
      time handling gets changed to timespec64, and the compat_sys_futex
      entry point is moved under the CONFIG_COMPAT_32BIT_TIME check
      to provide compatibility for existing 32-bit architectures.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      bec2f7cb