1. 08 9月, 2019 1 次提交
    • A
      ipc: fix sparc64 ipc() wrapper · fb377eb8
      Arnd Bergmann 提交于
      Matt bisected a sparc64 specific issue with semctl, shmctl and msgctl
      to a commit from my y2038 series in linux-5.1, as I missed the custom
      sys_ipc() wrapper that sparc64 uses in place of the generic version that
      I patched.
      
      The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
      now do not allow being called with the IPC_64 flag any more, resulting
      in a -EINVAL error when they don't recognize the command.
      
      Instead, the correct way to do this now is to call the internal
      ksys_old_{sem,shm,msg}ctl() functions to select the API version.
      
      As we generally move towards these functions anyway, change all of
      sparc_ipc() to consistently use those in place of the sys_*() versions,
      and move the required ksys_*() declarations into linux/syscalls.h
      
      The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
      errors when ipc is disabled.
      Reported-by: NMatt Turner <mattst88@gmail.com>
      Fixes: 275f2214 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
      Cc: stable@vger.kernel.org
      Tested-by: NMatt Turner <mattst88@gmail.com>
      Tested-by: NAnatoly Pugachev <matorola@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      fb377eb8
  2. 05 7月, 2019 1 次提交
  3. 28 6月, 2019 1 次提交
    • C
      pid: add pidfd_open() · 32fcb426
      Christian Brauner 提交于
      This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
      pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
      process that is created via traditional fork()/clone() calls that is only
      referenced by a PID:
      
      int pidfd = pidfd_open(1234, 0);
      ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
      
      With the introduction of pidfds through CLONE_PIDFD it is possible to
      created pidfds at process creation time.
      However, a lot of processes get created with traditional PID-based calls
      such as fork() or clone() (without CLONE_PIDFD). For these processes a
      caller can currently not create a pollable pidfd. This is a problem for
      Android's low memory killer (LMK) and service managers such as systemd.
      Both are examples of tools that want to make use of pidfds to get reliable
      notification of process exit for non-parents (pidfd polling) and race-free
      signal sending (pidfd_send_signal()). They intend to switch to this API for
      process supervision/management as soon as possible. Having no way to get
      pollable pidfds from PID-only processes is one of the biggest blockers for
      them in adopting this api. With pidfd_open() making it possible to retrieve
      pidfds for PID-based processes we enable them to adopt this api.
      
      In line with Arnd's recent changes to consolidate syscall numbers across
      architectures, I have added the pidfd_open() syscall to all architectures
      at the same time.
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      32fcb426
  4. 09 6月, 2019 1 次提交
    • C
      fork: add clone3 · 7f192e3c
      Christian Brauner 提交于
      This adds the clone3 system call.
      
      As mentioned several times already (cf. [7], [8]) here's the promised
      patchset for clone3().
      
      We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
      free flag from clone().
      
      Independent of the CLONE_PIDFD patchset a time namespace has been discussed
      at Linux Plumber Conference last year and has been sent out and reviewed
      (cf. [5]). It is expected that it will go upstream in the not too distant
      future. However, it relies on the addition of the CLONE_NEWTIME flag to
      clone(). The only other good candidate - CLONE_DETACHED - is currently not
      recyclable as we have identified at least two large or widely used
      codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
      CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
      blocked. clone3() has the advantage that it will unblock this patchset
      again. In general, clone3() is extensible and allows for the implementation
      of new features.
      
      The idea is to keep clone3() very simple and close to the original clone(),
      specifically, to keep on supporting old clone()-based workloads.
      We know there have been various creative proposals how a new process
      creation syscall or even api is supposed to look like. Some people even
      going so far as to argue that the traditional fork()+exec() split should be
      abandoned in favor of an in-kernel version of spawn(). Independent of
      whether or not we personally think spawn() is a good idea this patchset has
      and does not want to have anything to do with this.
      One stance we take is that there's no real good alternative to
      clone()+exec() and we need and want to support this model going forward;
      independent of spawn().
      The following requirements guided clone3():
      - bump the number of available flags
      - move arguments that are currently passed as separate arguments
        in clone() into a dedicated struct clone_args
        - choose a struct layout that is easy to handle on 32 and on 64 bit
        - choose a struct layout that is extensible
        - give new flags that currently need to abuse another flag's dedicated
          return argument in clone() their own dedicated return argument
          (e.g. CLONE_PIDFD)
        - use a separate kernel internal struct kernel_clone_args that is
          properly typed according to current kernel conventions in fork.c and is
          different from  the uapi struct clone_args
      - port _do_fork() to use kernel_clone_args so that all process creation
        syscalls such as fork(), vfork(), clone(), and clone3() behave identical
        (Arnd suggested, that we can probably also port do_fork() itself in a
         separate patchset.)
      - ease of transition for userspace from clone() to clone3()
        This very much means that we do *not* remove functionality that userspace
        currently relies on as the latter is a good way of creating a syscall
        that won't be adopted.
      - do not try to be clever or complex: keep clone3() as dumb as possible
      
      In accordance with Linus suggestions (cf. [11]), clone3() has the following
      signature:
      
      /* uapi */
      struct clone_args {
              __aligned_u64 flags;
              __aligned_u64 pidfd;
              __aligned_u64 child_tid;
              __aligned_u64 parent_tid;
              __aligned_u64 exit_signal;
              __aligned_u64 stack;
              __aligned_u64 stack_size;
              __aligned_u64 tls;
      };
      
      /* kernel internal */
      struct kernel_clone_args {
              u64 flags;
              int __user *pidfd;
              int __user *child_tid;
              int __user *parent_tid;
              int exit_signal;
              unsigned long stack;
              unsigned long stack_size;
              unsigned long tls;
      };
      
      long sys_clone3(struct clone_args __user *uargs, size_t size)
      
      clone3() cleanly supports all of the supported flags from clone() and thus
      all legacy workloads.
      The advantage of sticking close to the old clone() is the low cost for
      userspace to switch to this new api. Quite a lot of userspace apis (e.g.
      pthreads) are based on the clone() syscall. With the new clone3() syscall
      supporting all of the old workloads and opening up the ability to add new
      features should make switching to it for userspace more appealing. In
      essence, glibc can just write a simple wrapper to switch from clone() to
      clone3().
      
      There has been some interest in this patchset already. We have received a
      patch from the CRIU corner for clone3() that would set the PID/TID of a
      restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
      
      /* User visible differences to legacy clone() */
      - CLONE_DETACHED will cause EINVAL with clone3()
      - CSIGNAL is deprecated
        It is superseeded by a dedicated "exit_signal" argument in struct
        clone_args freeing up space for additional flags.
        This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
      
      /* References */
      [1]: b3e58382
      [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
      [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
      [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
      [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
      [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
      [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
      [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
      [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
      [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
      [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      7f192e3c
  5. 05 6月, 2019 1 次提交
  6. 27 5月, 2019 1 次提交
  7. 21 3月, 2019 6 次提交
    • D
      vfs: syscall: Add fspick() to select a superblock for reconfiguration · cf3cba4a
      David Howells 提交于
      Provide an fspick() system call that can be used to pick an existing
      mountpoint into an fs_context which can thereafter be used to reconfigure a
      superblock (equivalent of the superblock side of -o remount).
      
      This looks like:
      
      	int fd = fspick(AT_FDCWD, "/mnt",
      			FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
      	fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
      	fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
      	fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
      
      At the point of fspick being called, the file descriptor referring to the
      filesystem context is in exactly the same state as the one that was created
      by fsopen() after fsmount() has been successfully called.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cf3cba4a
    • D
      vfs: syscall: Add fsmount() to create a mount for a superblock · 93766fbd
      David Howells 提交于
      Provide a system call by which a filesystem opened with fsopen() and
      configured by a series of fsconfig() calls can have a detached mount object
      created for it.  This mount object can then be attached to the VFS mount
      hierarchy using move_mount() by passing the returned file descriptor as the
      from directory fd.
      
      The system call looks like:
      
      	int mfd = fsmount(int fsfd, unsigned int flags,
      			  unsigned int attr_flags);
      
      where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
      FSMOUNT_CLOEXEC.  attr_flags is a bitwise-OR of the following flags:
      
      	MOUNT_ATTR_RDONLY	Mount read-only
      	MOUNT_ATTR_NOSUID	Ignore suid and sgid bits
      	MOUNT_ATTR_NODEV	Disallow access to device special files
      	MOUNT_ATTR_NOEXEC	Disallow program execution
      	MOUNT_ATTR__ATIME	Setting on how atime should be updated
      	MOUNT_ATTR_RELATIME	- Update atime relative to mtime/ctime
      	MOUNT_ATTR_NOATIME	- Do not update access times
      	MOUNT_ATTR_STRICTATIME	- Always perform atime updates
      	MOUNT_ATTR_NODIRATIME	Do not update directory access times
      
      In the event that fsmount() fails, it may be possible to get an error
      message by calling read() on fsfd.  If no message is available, ENODATA
      will be reported.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      93766fbd
    • D
      vfs: syscall: Add fsconfig() for configuring and managing a context · ecdab150
      David Howells 提交于
      Add a syscall for configuring a filesystem creation context and triggering
      actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
      
          long fsconfig(int fs_fd, unsigned int cmd, const char *key,
      		  const void *value, int aux);
      
      Where fs_fd indicates the context, cmd indicates the action to take, key
      indicates the parameter name for parameter-setting actions and, if needed,
      value points to a buffer containing the value and aux can give more
      information for the value.
      
      The following command IDs are proposed:
      
       (*) FSCONFIG_SET_FLAG: No value is specified.  The parameter must be
           boolean in nature.  The key may be prefixed with "no" to invert the
           setting. value must be NULL and aux must be 0.
      
       (*) FSCONFIG_SET_STRING: A string value is specified.  The parameter can
           be expecting boolean, integer, string or take a path.  A conversion to
           an appropriate type will be attempted (which may include looking up as
           a path).  value points to a NUL-terminated string and aux must be 0.
      
       (*) FSCONFIG_SET_BINARY: A binary blob is specified.  value points to
           the blob and aux indicates its size.  The parameter must be expecting
           a blob.
      
       (*) FSCONFIG_SET_PATH: A non-empty path is specified.  The parameter must
           be expecting a path object.  value points to a NUL-terminated string
           that is the path and aux is a file descriptor at which to start a
           relative lookup or AT_FDCWD.
      
       (*) FSCONFIG_SET_PATH_EMPTY: As fsconfig_set_path, but with AT_EMPTY_PATH
           implied.
      
       (*) FSCONFIG_SET_FD: An open file descriptor is specified.  value must
           be NULL and aux indicates the file descriptor.
      
       (*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
      
       (*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
      
      For the "set" command IDs, the idea is that the file_system_type will point
      to a list of parameters and the types of value that those parameters expect
      to take.  The core code can then do the parse and argument conversion and
      then give the LSM and FS a cooked option or array of options to use.
      
      Source specification is also done the same way same way, using special keys
      "source", "source1", "source2", etc..
      
      [!] Note that, for the moment, the key and value are just glued back
      together and handed to the filesystem.  Every filesystem that uses options
      uses match_token() and co. to do this, and this will need to be changed -
      but not all at once.
      
      Example usage:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
          fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd);
          fsconfig(fd, fsconfig_set_fd, "journal_fd", "", journal_fd);
          fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0);
          fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0);
          fsconfig(fd, fsconfig_set_string, "sb", "1", 0);
          fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
          fsconfig(fd, fsconfig_set_string, "data", "journal", 0);
          fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("afs", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("jffs2", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecdab150
    • D
      vfs: syscall: Add fsopen() to prepare for superblock creation · 24dcb3d9
      David Howells 提交于
      Provide an fsopen() system call that starts the process of preparing to
      create a superblock that will then be mountable, using an fd as a context
      handle.  fsopen() is given the name of the filesystem that will be used:
      
      	int mfd = fsopen(const char *fsname, unsigned int flags);
      
      where flags can be 0 or FSOPEN_CLOEXEC.
      
      For example:
      
      	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
      	fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
      	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
      	fsinfo(sfd, NULL, ...); // query new superblock attributes
      	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
      	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
      
      	sfd = fsopen("afs", -1);
      	fsconfig(fd, FSCONFIG_SET_STRING, "source",
      		 "#grand.central.org:root.cell", 0);
      	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
      	mfd = fsmount(sfd, 0, MS_NODEV);
      	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
      
      If an error is reported at any step, an error message may be available to be
      read() back (ENODATA will be reported if there isn't an error available) in
      the form:
      
      	"e <subsys>:<problem>"
      	"e SELinux:Mount on mountpoint not permitted"
      
      Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
      even if the fsmount() fails.  read() is still possible to retrieve error
      information.
      
      The fsopen() syscall creates a mount context and hangs it of the fd that it
      returns.
      
      Netlink is not used because it is optional and would make the core VFS
      dependent on the networking layer and also potentially add network
      namespace issues.
      
      Note that, for the moment, the caller must have SYS_CAP_ADMIN to use
      fsopen().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      24dcb3d9
    • D
      vfs: syscall: Add move_mount(2) to move mounts around · 2db154b3
      David Howells 提交于
      Add a move_mount() system call that will move a mount from one place to
      another and, in the next commit, allow to attach an unattached mount tree.
      
      The new system call looks like the following:
      
      	int move_mount(int from_dfd, const char *from_path,
      		       int to_dfd, const char *to_path,
      		       unsigned int flags);
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2db154b3
    • A
      vfs: syscall: Add open_tree(2) to reference or clone a mount · a07b2000
      Al Viro 提交于
      open_tree(dfd, pathname, flags)
      
      Returns an O_PATH-opened file descriptor or an error.
      dfd and pathname specify the location to open, in usual
      fashion (see e.g. fstatat(2)).  flags should be an OR of
      some of the following:
      	* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
      same meanings as usual
      	* OPEN_TREE_CLOEXEC - make the resulting descriptor
      close-on-exec
      	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
      instead of opening the location in question, create a detached
      mount tree matching the subtree rooted at location specified by
      dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
      without it - only the part within in the mount containing the
      location in question.  In other words, the same as mount --rbind
      or mount --bind would've taken.  The detached tree will be
      dissolved on the final close of obtained file.  Creation of such
      detached trees requires the same capabilities as doing mount --bind.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a07b2000
  8. 06 3月, 2019 1 次提交
    • C
      signal: add pidfd_send_signal() syscall · 3eb39f47
      Christian Brauner 提交于
      The kill() syscall operates on process identifiers (pid). After a process
      has exited its pid can be reused by another process. If a caller sends a
      signal to a reused pid it will end up signaling the wrong process. This
      issue has often surfaced and there has been a push to address this problem [1].
      
      This patch uses file descriptors (fd) from proc/<pid> as stable handles on
      struct pid. Even if a pid is recycled the handle will not change. The fd
      can be used to send signals to the process it refers to.
      Thus, the new syscall pidfd_send_signal() is introduced to solve this
      problem. Instead of pids it operates on process fds (pidfd).
      
      /* prototype and argument /*
      long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
      
      /* syscall number 424 */
      The syscall number was chosen to be 424 to align with Arnd's rework in his
      y2038 to minimize merge conflicts (cf. [25]).
      
      In addition to the pidfd and signal argument it takes an additional
      siginfo_t and flags argument. If the siginfo_t argument is NULL then
      pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
      is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
      The flags argument is added to allow for future extensions of this syscall.
      It currently needs to be passed as 0. Failing to do so will cause EINVAL.
      
      /* pidfd_send_signal() replaces multiple pid-based syscalls */
      The pidfd_send_signal() syscall currently takes on the job of
      rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
      positive pid is passed to kill(2). It will however be possible to also
      replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
      
      /* sending signals to threads (tid) and process groups (pgid) */
      Specifically, the pidfd_send_signal() syscall does currently not operate on
      process groups or threads. This is left for future extensions.
      In order to extend the syscall to allow sending signal to threads and
      process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
      PIDFD_TYPE_TID) should be added. This implies that the flags argument will
      determine what is signaled and not the file descriptor itself. Put in other
      words, grouping in this api is a property of the flags argument not a
      property of the file descriptor (cf. [13]). Clarification for this has been
      requested by Eric (cf. [19]).
      When appropriate extensions through the flags argument are added then
      pidfd_send_signal() can additionally replace the part of kill(2) which
      operates on process groups as well as the tgkill(2) and
      rt_tgsigqueueinfo(2) syscalls.
      How such an extension could be implemented has been very roughly sketched
      in [14], [15], and [16]. However, this should not be taken as a commitment
      to a particular implementation. There might be better ways to do it.
      Right now this is intentionally left out to keep this patchset as simple as
      possible (cf. [4]).
      
      /* naming */
      The syscall had various names throughout iterations of this patchset:
      - procfd_signal()
      - procfd_send_signal()
      - taskfd_send_signal()
      In the last round of reviews it was pointed out that given that if the
      flags argument decides the scope of the signal instead of different types
      of fds it might make sense to either settle for "procfd_" or "pidfd_" as
      prefix. The community was willing to accept either (cf. [17] and [18]).
      Given that one developer expressed strong preference for the "pidfd_"
      prefix (cf. [13]) and with other developers less opinionated about the name
      we should settle for "pidfd_" to avoid further bikeshedding.
      
      The  "_send_signal" suffix was chosen to reflect the fact that the syscall
      takes on the job of multiple syscalls. It is therefore intentional that the
      name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
      fomer because it might imply that pidfd_send_signal() is a replacement for
      kill(2), and not the latter because it is a hassle to remember the correct
      spelling - especially for non-native speakers - and because it is not
      descriptive enough of what the syscall actually does. The name
      "pidfd_send_signal" makes it very clear that its job is to send signals.
      
      /* zombies */
      Zombies can be signaled just as any other process. No special error will be
      reported since a zombie state is an unreliable state (cf. [3]). However,
      this can be added as an extension through the @flags argument if the need
      ever arises.
      
      /* cross-namespace signals */
      The patch currently enforces that the signaler and signalee either are in
      the same pid namespace or that the signaler's pid namespace is an ancestor
      of the signalee's pid namespace. This is done for the sake of simplicity
      and because it is unclear to what values certain members of struct
      siginfo_t would need to be set to (cf. [5], [6]).
      
      /* compat syscalls */
      It became clear that we would like to avoid adding compat syscalls
      (cf. [7]).  The compat syscall handling is now done in kernel/signal.c
      itself by adding __copy_siginfo_from_user_generic() which lets us avoid
      compat syscalls (cf. [8]). It should be noted that the addition of
      __copy_siginfo_from_user_any() is caused by a bug in the original
      implementation of rt_sigqueueinfo(2) (cf. 12).
      With upcoming rework for syscall handling things might improve
      significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
      any additional callers.
      
      /* testing */
      This patch was tested on x64 and x86.
      
      /* userspace usage */
      An asciinema recording for the basic functionality can be found under [9].
      With this patch a process can be killed via:
      
       #define _GNU_SOURCE
       #include <errno.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/stat.h>
       #include <sys/syscall.h>
       #include <sys/types.h>
       #include <unistd.h>
      
       static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                               unsigned int flags)
       {
       #ifdef __NR_pidfd_send_signal
               return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
       #else
               return -ENOSYS;
       #endif
       }
      
       int main(int argc, char *argv[])
       {
               int fd, ret, saved_errno, sig;
      
               if (argc < 3)
                       exit(EXIT_FAILURE);
      
               fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
               if (fd < 0) {
                       printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               sig = atoi(argv[2]);
      
               printf("Sending signal %d to process %s\n", sig, argv[1]);
               ret = do_pidfd_send_signal(fd, sig, NULL, 0);
      
               saved_errno = errno;
               close(fd);
               errno = saved_errno;
      
               if (ret < 0) {
                       printf("%s - Failed to send signal %d to process %s\n",
                              strerror(errno), sig, argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               exit(EXIT_SUCCESS);
       }
      
      /* Q&A
       * Given that it seems the same questions get asked again by people who are
       * late to the party it makes sense to add a Q&A section to the commit
       * message so it's hopefully easier to avoid duplicate threads.
       *
       * For the sake of progress please consider these arguments settled unless
       * there is a new point that desperately needs to be addressed. Please make
       * sure to check the links to the threads in this commit message whether
       * this has not already been covered.
       */
      Q-01: (Florian Weimer [20], Andrew Morton [21])
            What happens when the target process has exited?
      A-01: Sending the signal will fail with ESRCH (cf. [22]).
      
      Q-02:  (Andrew Morton [21])
             Is the task_struct pinned by the fd?
      A-02:  No. A reference to struct pid is kept. struct pid - as far as I
             understand - was created exactly for the reason to not require to
             pin struct task_struct (cf. [22]).
      
      Q-03: (Andrew Morton [21])
            Does the entire procfs directory remain visible? Just one entry
            within it?
      A-03: The same thing that happens right now when you hold a file descriptor
            to /proc/<pid> open (cf. [22]).
      
      Q-04: (Andrew Morton [21])
            Does the pid remain reserved?
      A-04: No. This patchset guarantees a stable handle not that pids are not
            recycled (cf. [22]).
      
      Q-05: (Andrew Morton [21])
            Do attempts to signal that fd return errors?
      A-05: See {Q,A}-01.
      
      Q-06: (Andrew Morton [22])
            Is there a cleaner way of obtaining the fd? Another syscall perhaps.
      A-06: Userspace can already trivially retrieve file descriptors from procfs
            so this is something that we will need to support anyway. Hence,
            there's no immediate need to add another syscalls just to make
            pidfd_send_signal() not dependent on the presence of procfs. However,
            adding a syscalls to get such file descriptors is planned for a
            future patchset (cf. [22]).
      
      Q-07: (Andrew Morton [21] and others)
            This fd-for-a-process sounds like a handy thing and people may well
            think up other uses for it in the future, probably unrelated to
            signals. Are the code and the interface designed to permit such
            future applications?
      A-07: Yes (cf. [22]).
      
      Q-08: (Andrew Morton [21] and others)
            Now I think about it, why a new syscall? This thing is looking
            rather like an ioctl?
      A-08: This has been extensively discussed. It was agreed that a syscall is
            preferred for a variety or reasons. Here are just a few taken from
            prior threads. Syscalls are safer than ioctl()s especially when
            signaling to fds. Processes are a core kernel concept so a syscall
            seems more appropriate. The layout of the syscall with its four
            arguments would require the addition of a custom struct for the
            ioctl() thereby causing at least the same amount or even more
            complexity for userspace than a simple syscall. The new syscall will
            replace multiple other pid-based syscalls (see description above).
            The file-descriptors-for-processes concept introduced with this
            syscall will be extended with other syscalls in the future. See also
            [22], [23] and various other threads already linked in here.
      
      Q-09: (Florian Weimer [24])
            What happens if you use the new interface with an O_PATH descriptor?
      A-09:
            pidfds opened as O_PATH fds cannot be used to send signals to a
            process (cf. [2]). Signaling processes through pidfds is the
            equivalent of writing to a file. Thus, this is not an operation that
            operates "purely at the file descriptor level" as required by the
            open(2) manpage. See also [4].
      
      /* References */
      [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
      [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
      [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
      [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
      [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
      [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
      [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
      [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
      [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
      [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
      [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
      [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
      [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
      [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
      [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
      [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
      [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
      [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
      [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
      [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
      [23]: https://lwn.net/Articles/773459/
      [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NTycho Andersen <tycho@tycho.ws>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Acked-by: NAleksa Sarai <cyphar@cyphar.com>
      3eb39f47
  9. 28 2月, 2019 2 次提交
    • J
      io_uring: add support for pre-mapped user IO buffers · edafccee
      Jens Axboe 提交于
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      edafccee
    • J
      Add io_uring IO interface · 2b188cc1
      Jens Axboe 提交于
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b188cc1
  10. 07 2月, 2019 3 次提交
    • A
      y2038: syscalls: rename y2038 compat syscalls · 8dabe724
      Arnd Bergmann 提交于
      A lot of system calls that pass a time_t somewhere have an implementation
      using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
      been reworked so that this implementation can now be used on 32-bit
      architectures as well.
      
      The missing step is to redefine them using the regular SYSCALL_DEFINEx()
      to get them out of the compat namespace and make it possible to build them
      on 32-bit architectures.
      
      Any system call that ends in 'time' gets a '32' suffix on its name for
      that version, while the others get a '_time32' suffix, to distinguish
      them from the normal version, which takes a 64-bit time argument in the
      future.
      
      In this step, only 64-bit architectures are changed, doing this rename
      first lets us avoid touching the 32-bit architectures twice.
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      8dabe724
    • D
      timex: change syscalls to use struct __kernel_timex · 3876ced4
      Deepa Dinamani 提交于
      struct timex is not y2038 safe.
      Switch all the syscall apis to use y2038 safe __kernel_timex.
      
      Note that sys_adjtimex() does not have a y2038 safe solution.  C libraries
      can implement it by calling clock_adjtime(CLOCK_REALTIME, ...).
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      3876ced4
    • A
      time: fix sys_timer_settime prototype · 50b93f30
      Arnd Bergmann 提交于
      A small typo has crept into the y2038 conversion of the timer_settime
      system call. So far this was completely harmless, but once we start
      using the new version, this has to be fixed.
      
      Fixes: 6ff84735 ("time: Change types to new y2038 safe __kernel_itimerspec")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      50b93f30
  11. 26 1月, 2019 1 次提交
    • A
      ipc: rename old-style shmctl/semctl/msgctl syscalls · 275f2214
      Arnd Bergmann 提交于
      The behavior of these system calls is slightly different between
      architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
      symbol. Most architectures that implement the split IPC syscalls don't set
      that symbol and only get the modern version, but alpha, arm, microblaze,
      mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.
      
      For the architectures that so far only implement sys_ipc(), i.e. m68k,
      mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
      when adding the split syscalls, so we need to distinguish between the
      two groups of architectures.
      
      The method I picked for this distinction is to have a separate system call
      entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
      does not. The system call tables of the five architectures are changed
      accordingly.
      
      As an additional benefit, we no longer need the configuration specific
      definition for ipc_parse_version(), it always does the same thing now,
      but simply won't get called on architectures with the modern interface.
      
      A small downside is that on architectures that do set
      ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
      that are never called. They only add a few bytes of bloat, so it seems
      better to keep them compared to adding yet another Kconfig symbol.
      I considered adding new syscall numbers for the IPC_64 variants for
      consistency, but decided against that for now.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      275f2214
  12. 18 1月, 2019 1 次提交
  13. 18 12月, 2018 2 次提交
    • A
      y2038: signal: Add sys_rt_sigtimedwait_time32 · df8522a3
      Arnd Bergmann 提交于
      Once sys_rt_sigtimedwait() gets changed to a 64-bit time_t, we have
      to provide compatibility support for existing binaries.
      
      An earlier version of this patch reused the compat_sys_rt_sigtimedwait
      entry point to avoid code duplication, but this newer approach
      duplicates the existing native entry point instead, which seems
      a bit cleaner.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      df8522a3
    • A
      y2038: socket: Add compat_sys_recvmmsg_time64 · e11d4284
      Arnd Bergmann 提交于
      recvmmsg() takes two arguments to pointers of structures that differ
      between 32-bit and 64-bit architectures: mmsghdr and timespec.
      
      For y2038 compatbility, we are changing the native system call from
      timespec to __kernel_timespec with a 64-bit time_t (in another patch),
      and use the existing compat system call on both 32-bit and 64-bit
      architectures for compatibility with traditional 32-bit user space.
      
      As we now have two variants of recvmmsg() for 32-bit tasks that are both
      different from the variant that we use on 64-bit tasks, this means we
      also require two compat system calls!
      
      The solution I picked is to flip things around: The existing
      compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
      and now handles the case for old user space on all architectures that
      have set CONFIG_COMPAT_32BIT_TIME.  A new compat_sys_recvmmsg_time64()
      call gets added in the old place for 64-bit architectures only, this
      one handles the case of a compat mmsghdr structure combined with
      __kernel_timespec.
      
      In the indirect sys_socketcall(), we now need to call either
      do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
      architecture we are on. For compat_sys_socketcall(), no such change is
      needed, we always call __compat_sys_recvmmsg().
      
      I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
      implementation for 64-bit time_t will need significant changes including
      an updated asm/unistd.h, and it seems better to consistently use the
      separate syscalls that configuration, leaving the socketcall only for
      backward compatibility with 32-bit time_t based libc.
      
      The naming is asymmetric for the moment, so both existing syscalls
      entry points keep their names, while the new ones are recvmmsg_time32
      and compat_recvmmsg_time64 respectively. I expect that we will rename
      the compat syscalls later as we start using generated syscall tables
      everywhere and add these entry points.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      e11d4284
  14. 12 12月, 2018 1 次提交
    • T
      seccomp: switch system call argument type to void * · a5662e4d
      Tycho Andersen 提交于
      The const qualifier causes problems for any code that wants to write to the
      third argument of the seccomp syscall, as we will do in a future patch in
      this series.
      
      The third argument to the seccomp syscall is documented as void *, so
      rather than just dropping the const, let's switch everything to use void *
      as well.
      
      I believe this is safe because of 1. the documentation above, 2. there's no
      real type information exported about syscalls anywhere besides the man
      pages.
      Signed-off-by: NTycho Andersen <tycho@tycho.ws>
      CC: Kees Cook <keescook@chromium.org>
      CC: Andy Lutomirski <luto@amacapital.net>
      CC: Oleg Nesterov <oleg@redhat.com>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      CC: "Serge E. Hallyn" <serge@hallyn.com>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      CC: Christian Brauner <christian@brauner.io>
      CC: Tyler Hicks <tyhicks@canonical.com>
      CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      a5662e4d
  15. 08 12月, 2018 1 次提交
    • A
      y2038: futex: Add support for __kernel_timespec · bec2f7cb
      Arnd Bergmann 提交于
      This prepares sys_futex for y2038 safe calling: the native
      syscall is changed to receive a __kernel_timespec argument, which
      will be switched to 64-bit time_t in the future. All the internal
      time handling gets changed to timespec64, and the compat_sys_futex
      entry point is moved under the CONFIG_COMPAT_32BIT_TIME check
      to provide compatibility for existing 32-bit architectures.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      bec2f7cb
  16. 07 12月, 2018 3 次提交
    • D
      io_pgetevents: use __kernel_timespec · 7a35397f
      Deepa Dinamani 提交于
      struct timespec is not y2038 safe.
      struct __kernel_timespec is the new y2038 safe structure for all
      syscalls that are using struct timespec.
      Update io_pgetevents interfaces to use struct __kernel_timespec.
      
      sigset_t also has different representations on 32 bit and 64 bit
      architectures. Hence, we need to support the following different
      syscalls:
      
      New y2038 safe syscalls:
      (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
      
      Native 64 bit(unchanged) and native 32 bit : sys_io_pgetevents
      Compat : compat_sys_io_pgetevents_time64
      
      Older y2038 unsafe syscalls:
      (Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
      
      Native 32 bit : sys_io_pgetevents_time32
      Compat : compat_sys_io_pgetevents
      
      Note that io_getevents syscalls do not have a y2038 safe solution.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      7a35397f
    • D
      pselect6: use __kernel_timespec · e024707b
      Deepa Dinamani 提交于
      struct timespec is not y2038 safe.
      struct __kernel_timespec is the new y2038 safe structure for all
      syscalls that are using struct timespec.
      Update pselect interfaces to use struct __kernel_timespec.
      
      sigset_t also has different representations on 32 bit and 64 bit
      architectures. Hence, we need to support the following different
      syscalls:
      
      New y2038 safe syscalls:
      (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
      
      Native 64 bit(unchanged) and native 32 bit : sys_pselect6
      Compat : compat_sys_pselect6_time64
      
      Older y2038 unsafe syscalls:
      (Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
      
      Native 32 bit : pselect6_time32
      Compat : compat_sys_pselect6
      
      Note that all other versions of select syscalls will not have
      y2038 safe versions.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      e024707b
    • D
      ppoll: use __kernel_timespec · 8bd27a30
      Deepa Dinamani 提交于
      struct timespec is not y2038 safe.
      struct __kernel_timespec is the new y2038 safe structure for all
      syscalls that are using struct timespec.
      Update ppoll interfaces to use struct __kernel_timespec.
      
      sigset_t also has different representations on 32 bit and 64 bit
      architectures. Hence, we need to support the following different
      syscalls:
      
      New y2038 safe syscalls:
      (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)
      
      Native 64 bit(unchanged) and native 32 bit : sys_ppoll
      Compat : compat_sys_ppoll_time64
      
      Older y2038 unsafe syscalls:
      (Controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs)
      
      Native 32 bit : ppoll_time32
      Compat : compat_sys_ppoll
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      8bd27a30
  17. 29 8月, 2018 5 次提交
    • A
      y2038: signal: Change rt_sigtimedwait to use __kernel_timespec · 49c39f84
      Arnd Bergmann 提交于
      This changes sys_rt_sigtimedwait() to use get_timespec64(), changing
      the timeout type to __kernel_timespec, which will be changed to use
      a 64-bit time_t in the future. Since the do_sigtimedwait() core
      function changes, we also have to modify the compat version of this
      system call in the same way.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      49c39f84
    • A
      y2038: socket: Change recvmmsg to use __kernel_timespec · c2e6c856
      Arnd Bergmann 提交于
      This converts the recvmmsg() system call in all its variations to use
      'timespec64' internally for its timeout, and have a __kernel_timespec64
      argument in the native entry point. This lets us change the type to use
      64-bit time_t at a later point while using the 32-bit compat system call
      emulation for existing user space.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      c2e6c856
    • A
      y2038: sched: Change sched_rr_get_interval to use __kernel_timespec · 474b9c77
      Arnd Bergmann 提交于
      This is a preparation patch for converting sys_sched_rr_get_interval to
      work with 64-bit time_t on 32-bit architectures. The 'interval' argument
      is changed to struct __kernel_timespec, which will be redefined using
      64-bit time_t in the future. The compat version of the system call in
      turn is enabled for compilation with CONFIG_COMPAT_32BIT_TIME so
      the individual 32-bit architectures can share the handling of the
      traditional argument with 64-bit architectures providing it for their
      compat mode.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      474b9c77
    • A
      y2038: Compile utimes()/futimesat() conditionally · 185cfaf7
      Arnd Bergmann 提交于
      There are four generations of utimes() syscalls: utime(), utimes(),
      futimesat() and utimensat(), each one being a superset of the previous
      one. For y2038 support, we have to add another one, which is the same
      as the existing utimensat() but always passes 64-bit times_t based
      timespec values.
      
      There are currently 10 architectures that only use utimensat(), two
      that use utimes(), futimesat() and utimensat() but not utime(), and 11
      architectures that have all four, and those define __ARCH_WANT_SYS_UTIME
      in order to get a sys_utime implementation. Since all the new
      architectures only want utimensat(), moving all the legacy entry points
      into a common __ARCH_WANT_SYS_UTIME guard simplifies the logic. Only alpha
      and ia64 grow a tiny bit as they now also get an unused sys_utime(),
      but it didn't seem worth the extra complexity of adding yet another
      ifdef for those.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      185cfaf7
    • A
      y2038: Change sys_utimensat() to use __kernel_timespec · a4f7a300
      Arnd Bergmann 提交于
      When 32-bit architectures get changed to support 64-bit time_t,
      utimensat() needs to use the new __kernel_timespec structure as its
      argument.
      
      The older utime(), utimes() and futimesat() system calls don't need a
      corresponding change as they are no longer used on C libraries that have
      64-bit time support.
      
      As we do for the other syscalls that have timespec arguments, we reuse
      the 'compat' syscall entry points to implement the traditional four
      interfaces, and only leave the new utimensat() as a native handler,
      so that the same code gets used on both 32-bit and 64-bit kernels
      on each syscall.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      a4f7a300
  18. 27 8月, 2018 1 次提交
    • A
      y2038: globally rename compat_time to old_time32 · 9afc5eee
      Arnd Bergmann 提交于
      Christoph Hellwig suggested a slightly different path for handling
      backwards compatibility with the 32-bit time_t based system calls:
      
      Rather than simply reusing the compat_sys_* entry points on 32-bit
      architectures unchanged, we get rid of those entry points and the
      compat_time types by renaming them to something that makes more sense
      on 32-bit architectures (which don't have a compat mode otherwise),
      and then share the entry points under the new name with the 64-bit
      architectures that use them for implementing the compatibility.
      
      The following types and interfaces are renamed here, and moved
      from linux/compat_time.h to linux/time32.h:
      
      old				new
      ---				---
      compat_time_t			old_time32_t
      struct compat_timeval		struct old_timeval32
      struct compat_timespec		struct old_timespec32
      struct compat_itimerspec	struct old_itimerspec32
      ns_to_compat_timeval()		ns_to_old_timeval32()
      get_compat_itimerspec64()	get_old_itimerspec32()
      put_compat_itimerspec64()	put_old_itimerspec32()
      compat_get_timespec64()		get_old_timespec32()
      compat_put_timespec64()		put_old_timespec32()
      
      As we already have aliases in place, this patch addresses only the
      instances that are relevant to the system call interface in particular,
      not those that occur in device drivers and other modules. Those
      will get handled separately, while providing the 64-bit version
      of the respective interfaces.
      
      I'm not renaming the timex, rusage and itimerval structures, as we are
      still debating what the new interface will look like, and whether we
      will need a replacement at all.
      
      This also doesn't change the names of the syscall entry points, which can
      be done more easily when we actually switch over the 32-bit architectures
      to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
      SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      9afc5eee
  19. 18 7月, 2018 1 次提交
  20. 12 7月, 2018 1 次提交
    • M
      kernel: add ksys_personality() · bf1c77b4
      Mark Rutland 提交于
      Using this helper allows us to avoid the in-kernel call to the
      sys_personality() syscall. The ksys_ prefix denotes that this function
      is meant as a drop-in replacement for the syscall. In particular, it
      uses the same calling convention as sys_personality().
      
      Since ksys_personality is trivial, it is implemented directly in
      <linux/syscalls.h>, as we do for ksys_close() and friends.
      
      This helper is necessary to enable conversion of arm64's syscall
      handling to use pt_regs wrappers.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Reviewed-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Martin <dave.martin@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      bf1c77b4
  21. 25 6月, 2018 1 次提交
    • A
      disable -Wattribute-alias warning for SYSCALL_DEFINEx() · bee20031
      Arnd Bergmann 提交于
      gcc-8 warns for every single definition of a system call entry
      point, e.g.:
      
      include/linux/compat.h:56:18: error: 'compat_sys_rt_sigprocmask' alias between functions of incompatible types 'long int(int,  compat_sigset_t *, compat_sigset_t *, compat_size_t)' {aka 'long int(int,  struct <anonymous> *, struct <anonymous> *, unsigned int)'} and 'long int(long int,  long int,  long int,  long int)' [-Werror=attribute-alias]
        asmlinkage long compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))\
                        ^~~~~~~~~~
      include/linux/compat.h:45:2: note: in expansion of macro 'COMPAT_SYSCALL_DEFINEx'
        COMPAT_SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
        ^~~~~~~~~~~~~~~~~~~~~~
      kernel/signal.c:2601:1: note: in expansion of macro 'COMPAT_SYSCALL_DEFINE4'
       COMPAT_SYSCALL_DEFINE4(rt_sigprocmask, int, how, compat_sigset_t __user *, nset,
       ^~~~~~~~~~~~~~~~~~~~~~
      include/linux/compat.h:60:18: note: aliased declaration here
        asmlinkage long compat_SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))\
                        ^~~~~~~~~~
      
      The new warning seems reasonable in principle, but it doesn't
      help us here, since we rely on the type mismatch to sanitize the
      system call arguments. After I reported this as GCC PR82435, a new
      -Wno-attribute-alias option was added that could be used to turn the
      warning off globally on the command line, but I'd prefer to do it a
      little more fine-grained.
      
      Interestingly, turning a warning off and on again inside of
      a single macro doesn't always work, in this case I had to add
      an extra statement inbetween and decided to copy the __SC_TEST
      one from the native syscall to the compat syscall macro.  See
      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83256 for more details
      about this.
      
      [paul.burton@mips.com:
        - Rebase atop current master.
        - Split GCC & version arguments to __diag_ignore() in order to match
          changes to the preceding patch.
        - Add the comment argument to match the preceding patch.]
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82435Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Burton <paul.burton@mips.com>
      Tested-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Tested-by: NStafford Horne <shorne@gmail.com>
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      bee20031
  22. 24 6月, 2018 1 次提交
  23. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  24. 03 5月, 2018 1 次提交
    • C
      aio: implement io_pgetevents · 7a074e96
      Christoph Hellwig 提交于
      This is the io_getevents equivalent of ppoll/pselect and allows to
      properly mix signals and aio completions (especially with IOCB_CMD_POLL)
      and atomically executes the following sequence:
      
      	sigset_t origmask;
      
      	pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
      	ret = io_getevents(ctx, min_nr, nr, events, timeout);
      	pthread_sigmask(SIG_SETMASK, &origmask, NULL);
      
      Note that unlike many other signal related calls we do not pass a sigmask
      size, as that would get us to 7 arguments, which aren't easily supported
      by the syscall infrastructure.  It seems a lot less painful to just add a
      new syscall variant in the unlikely case we're going to increase the
      sigset size.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7a074e96
  25. 20 4月, 2018 1 次提交
    • A
      y2038: ipc: Use __kernel_timespec · 21fc538d
      Arnd Bergmann 提交于
      This is a preparatation for changing over __kernel_timespec to 64-bit
      times, which involves assigning new system call numbers for mq_timedsend(),
      mq_timedreceive() and semtimedop() for compatibility with future y2038
      proof user space.
      
      The existing ABIs will remain available through compat code.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      21fc538d