1. 17 5月, 2019 1 次提交
  2. 21 3月, 2019 6 次提交
    • D
      vfs: syscall: Add fspick() to select a superblock for reconfiguration · cf3cba4a
      David Howells 提交于
      Provide an fspick() system call that can be used to pick an existing
      mountpoint into an fs_context which can thereafter be used to reconfigure a
      superblock (equivalent of the superblock side of -o remount).
      
      This looks like:
      
      	int fd = fspick(AT_FDCWD, "/mnt",
      			FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
      	fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
      	fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
      	fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
      
      At the point of fspick being called, the file descriptor referring to the
      filesystem context is in exactly the same state as the one that was created
      by fsopen() after fsmount() has been successfully called.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cf3cba4a
    • D
      vfs: syscall: Add fsmount() to create a mount for a superblock · 93766fbd
      David Howells 提交于
      Provide a system call by which a filesystem opened with fsopen() and
      configured by a series of fsconfig() calls can have a detached mount object
      created for it.  This mount object can then be attached to the VFS mount
      hierarchy using move_mount() by passing the returned file descriptor as the
      from directory fd.
      
      The system call looks like:
      
      	int mfd = fsmount(int fsfd, unsigned int flags,
      			  unsigned int attr_flags);
      
      where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
      FSMOUNT_CLOEXEC.  attr_flags is a bitwise-OR of the following flags:
      
      	MOUNT_ATTR_RDONLY	Mount read-only
      	MOUNT_ATTR_NOSUID	Ignore suid and sgid bits
      	MOUNT_ATTR_NODEV	Disallow access to device special files
      	MOUNT_ATTR_NOEXEC	Disallow program execution
      	MOUNT_ATTR__ATIME	Setting on how atime should be updated
      	MOUNT_ATTR_RELATIME	- Update atime relative to mtime/ctime
      	MOUNT_ATTR_NOATIME	- Do not update access times
      	MOUNT_ATTR_STRICTATIME	- Always perform atime updates
      	MOUNT_ATTR_NODIRATIME	Do not update directory access times
      
      In the event that fsmount() fails, it may be possible to get an error
      message by calling read() on fsfd.  If no message is available, ENODATA
      will be reported.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      93766fbd
    • D
      vfs: syscall: Add fsconfig() for configuring and managing a context · ecdab150
      David Howells 提交于
      Add a syscall for configuring a filesystem creation context and triggering
      actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
      
          long fsconfig(int fs_fd, unsigned int cmd, const char *key,
      		  const void *value, int aux);
      
      Where fs_fd indicates the context, cmd indicates the action to take, key
      indicates the parameter name for parameter-setting actions and, if needed,
      value points to a buffer containing the value and aux can give more
      information for the value.
      
      The following command IDs are proposed:
      
       (*) FSCONFIG_SET_FLAG: No value is specified.  The parameter must be
           boolean in nature.  The key may be prefixed with "no" to invert the
           setting. value must be NULL and aux must be 0.
      
       (*) FSCONFIG_SET_STRING: A string value is specified.  The parameter can
           be expecting boolean, integer, string or take a path.  A conversion to
           an appropriate type will be attempted (which may include looking up as
           a path).  value points to a NUL-terminated string and aux must be 0.
      
       (*) FSCONFIG_SET_BINARY: A binary blob is specified.  value points to
           the blob and aux indicates its size.  The parameter must be expecting
           a blob.
      
       (*) FSCONFIG_SET_PATH: A non-empty path is specified.  The parameter must
           be expecting a path object.  value points to a NUL-terminated string
           that is the path and aux is a file descriptor at which to start a
           relative lookup or AT_FDCWD.
      
       (*) FSCONFIG_SET_PATH_EMPTY: As fsconfig_set_path, but with AT_EMPTY_PATH
           implied.
      
       (*) FSCONFIG_SET_FD: An open file descriptor is specified.  value must
           be NULL and aux indicates the file descriptor.
      
       (*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
      
       (*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
      
      For the "set" command IDs, the idea is that the file_system_type will point
      to a list of parameters and the types of value that those parameters expect
      to take.  The core code can then do the parse and argument conversion and
      then give the LSM and FS a cooked option or array of options to use.
      
      Source specification is also done the same way same way, using special keys
      "source", "source1", "source2", etc..
      
      [!] Note that, for the moment, the key and value are just glued back
      together and handed to the filesystem.  Every filesystem that uses options
      uses match_token() and co. to do this, and this will need to be changed -
      but not all at once.
      
      Example usage:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
          fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd);
          fsconfig(fd, fsconfig_set_fd, "journal_fd", "", journal_fd);
          fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0);
          fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0);
          fsconfig(fd, fsconfig_set_string, "sb", "1", 0);
          fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
          fsconfig(fd, fsconfig_set_string, "data", "journal", 0);
          fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("afs", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("jffs2", FSOPEN_CLOEXEC);
          fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0);
          fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecdab150
    • D
      vfs: syscall: Add fsopen() to prepare for superblock creation · 24dcb3d9
      David Howells 提交于
      Provide an fsopen() system call that starts the process of preparing to
      create a superblock that will then be mountable, using an fd as a context
      handle.  fsopen() is given the name of the filesystem that will be used:
      
      	int mfd = fsopen(const char *fsname, unsigned int flags);
      
      where flags can be 0 or FSOPEN_CLOEXEC.
      
      For example:
      
      	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
      	fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
      	fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
      	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
      	fsinfo(sfd, NULL, ...); // query new superblock attributes
      	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
      	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
      
      	sfd = fsopen("afs", -1);
      	fsconfig(fd, FSCONFIG_SET_STRING, "source",
      		 "#grand.central.org:root.cell", 0);
      	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
      	mfd = fsmount(sfd, 0, MS_NODEV);
      	move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
      
      If an error is reported at any step, an error message may be available to be
      read() back (ENODATA will be reported if there isn't an error available) in
      the form:
      
      	"e <subsys>:<problem>"
      	"e SELinux:Mount on mountpoint not permitted"
      
      Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
      even if the fsmount() fails.  read() is still possible to retrieve error
      information.
      
      The fsopen() syscall creates a mount context and hangs it of the fd that it
      returns.
      
      Netlink is not used because it is optional and would make the core VFS
      dependent on the networking layer and also potentially add network
      namespace issues.
      
      Note that, for the moment, the caller must have SYS_CAP_ADMIN to use
      fsopen().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      24dcb3d9
    • D
      vfs: syscall: Add move_mount(2) to move mounts around · 2db154b3
      David Howells 提交于
      Add a move_mount() system call that will move a mount from one place to
      another and, in the next commit, allow to attach an unattached mount tree.
      
      The new system call looks like the following:
      
      	int move_mount(int from_dfd, const char *from_path,
      		       int to_dfd, const char *to_path,
      		       unsigned int flags);
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2db154b3
    • A
      vfs: syscall: Add open_tree(2) to reference or clone a mount · a07b2000
      Al Viro 提交于
      open_tree(dfd, pathname, flags)
      
      Returns an O_PATH-opened file descriptor or an error.
      dfd and pathname specify the location to open, in usual
      fashion (see e.g. fstatat(2)).  flags should be an OR of
      some of the following:
      	* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
      same meanings as usual
      	* OPEN_TREE_CLOEXEC - make the resulting descriptor
      close-on-exec
      	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
      instead of opening the location in question, create a detached
      mount tree matching the subtree rooted at location specified by
      dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
      without it - only the part within in the mount containing the
      location in question.  In other words, the same as mount --rbind
      or mount --bind would've taken.  The detached tree will be
      dissolved on the final close of obtained file.  Creation of such
      detached trees requires the same capabilities as doing mount --bind.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a07b2000
  3. 06 3月, 2019 1 次提交
    • C
      signal: add pidfd_send_signal() syscall · 3eb39f47
      Christian Brauner 提交于
      The kill() syscall operates on process identifiers (pid). After a process
      has exited its pid can be reused by another process. If a caller sends a
      signal to a reused pid it will end up signaling the wrong process. This
      issue has often surfaced and there has been a push to address this problem [1].
      
      This patch uses file descriptors (fd) from proc/<pid> as stable handles on
      struct pid. Even if a pid is recycled the handle will not change. The fd
      can be used to send signals to the process it refers to.
      Thus, the new syscall pidfd_send_signal() is introduced to solve this
      problem. Instead of pids it operates on process fds (pidfd).
      
      /* prototype and argument /*
      long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
      
      /* syscall number 424 */
      The syscall number was chosen to be 424 to align with Arnd's rework in his
      y2038 to minimize merge conflicts (cf. [25]).
      
      In addition to the pidfd and signal argument it takes an additional
      siginfo_t and flags argument. If the siginfo_t argument is NULL then
      pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
      is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
      The flags argument is added to allow for future extensions of this syscall.
      It currently needs to be passed as 0. Failing to do so will cause EINVAL.
      
      /* pidfd_send_signal() replaces multiple pid-based syscalls */
      The pidfd_send_signal() syscall currently takes on the job of
      rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
      positive pid is passed to kill(2). It will however be possible to also
      replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
      
      /* sending signals to threads (tid) and process groups (pgid) */
      Specifically, the pidfd_send_signal() syscall does currently not operate on
      process groups or threads. This is left for future extensions.
      In order to extend the syscall to allow sending signal to threads and
      process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
      PIDFD_TYPE_TID) should be added. This implies that the flags argument will
      determine what is signaled and not the file descriptor itself. Put in other
      words, grouping in this api is a property of the flags argument not a
      property of the file descriptor (cf. [13]). Clarification for this has been
      requested by Eric (cf. [19]).
      When appropriate extensions through the flags argument are added then
      pidfd_send_signal() can additionally replace the part of kill(2) which
      operates on process groups as well as the tgkill(2) and
      rt_tgsigqueueinfo(2) syscalls.
      How such an extension could be implemented has been very roughly sketched
      in [14], [15], and [16]. However, this should not be taken as a commitment
      to a particular implementation. There might be better ways to do it.
      Right now this is intentionally left out to keep this patchset as simple as
      possible (cf. [4]).
      
      /* naming */
      The syscall had various names throughout iterations of this patchset:
      - procfd_signal()
      - procfd_send_signal()
      - taskfd_send_signal()
      In the last round of reviews it was pointed out that given that if the
      flags argument decides the scope of the signal instead of different types
      of fds it might make sense to either settle for "procfd_" or "pidfd_" as
      prefix. The community was willing to accept either (cf. [17] and [18]).
      Given that one developer expressed strong preference for the "pidfd_"
      prefix (cf. [13]) and with other developers less opinionated about the name
      we should settle for "pidfd_" to avoid further bikeshedding.
      
      The  "_send_signal" suffix was chosen to reflect the fact that the syscall
      takes on the job of multiple syscalls. It is therefore intentional that the
      name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
      fomer because it might imply that pidfd_send_signal() is a replacement for
      kill(2), and not the latter because it is a hassle to remember the correct
      spelling - especially for non-native speakers - and because it is not
      descriptive enough of what the syscall actually does. The name
      "pidfd_send_signal" makes it very clear that its job is to send signals.
      
      /* zombies */
      Zombies can be signaled just as any other process. No special error will be
      reported since a zombie state is an unreliable state (cf. [3]). However,
      this can be added as an extension through the @flags argument if the need
      ever arises.
      
      /* cross-namespace signals */
      The patch currently enforces that the signaler and signalee either are in
      the same pid namespace or that the signaler's pid namespace is an ancestor
      of the signalee's pid namespace. This is done for the sake of simplicity
      and because it is unclear to what values certain members of struct
      siginfo_t would need to be set to (cf. [5], [6]).
      
      /* compat syscalls */
      It became clear that we would like to avoid adding compat syscalls
      (cf. [7]).  The compat syscall handling is now done in kernel/signal.c
      itself by adding __copy_siginfo_from_user_generic() which lets us avoid
      compat syscalls (cf. [8]). It should be noted that the addition of
      __copy_siginfo_from_user_any() is caused by a bug in the original
      implementation of rt_sigqueueinfo(2) (cf. 12).
      With upcoming rework for syscall handling things might improve
      significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
      any additional callers.
      
      /* testing */
      This patch was tested on x64 and x86.
      
      /* userspace usage */
      An asciinema recording for the basic functionality can be found under [9].
      With this patch a process can be killed via:
      
       #define _GNU_SOURCE
       #include <errno.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/stat.h>
       #include <sys/syscall.h>
       #include <sys/types.h>
       #include <unistd.h>
      
       static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                               unsigned int flags)
       {
       #ifdef __NR_pidfd_send_signal
               return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
       #else
               return -ENOSYS;
       #endif
       }
      
       int main(int argc, char *argv[])
       {
               int fd, ret, saved_errno, sig;
      
               if (argc < 3)
                       exit(EXIT_FAILURE);
      
               fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
               if (fd < 0) {
                       printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               sig = atoi(argv[2]);
      
               printf("Sending signal %d to process %s\n", sig, argv[1]);
               ret = do_pidfd_send_signal(fd, sig, NULL, 0);
      
               saved_errno = errno;
               close(fd);
               errno = saved_errno;
      
               if (ret < 0) {
                       printf("%s - Failed to send signal %d to process %s\n",
                              strerror(errno), sig, argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               exit(EXIT_SUCCESS);
       }
      
      /* Q&A
       * Given that it seems the same questions get asked again by people who are
       * late to the party it makes sense to add a Q&A section to the commit
       * message so it's hopefully easier to avoid duplicate threads.
       *
       * For the sake of progress please consider these arguments settled unless
       * there is a new point that desperately needs to be addressed. Please make
       * sure to check the links to the threads in this commit message whether
       * this has not already been covered.
       */
      Q-01: (Florian Weimer [20], Andrew Morton [21])
            What happens when the target process has exited?
      A-01: Sending the signal will fail with ESRCH (cf. [22]).
      
      Q-02:  (Andrew Morton [21])
             Is the task_struct pinned by the fd?
      A-02:  No. A reference to struct pid is kept. struct pid - as far as I
             understand - was created exactly for the reason to not require to
             pin struct task_struct (cf. [22]).
      
      Q-03: (Andrew Morton [21])
            Does the entire procfs directory remain visible? Just one entry
            within it?
      A-03: The same thing that happens right now when you hold a file descriptor
            to /proc/<pid> open (cf. [22]).
      
      Q-04: (Andrew Morton [21])
            Does the pid remain reserved?
      A-04: No. This patchset guarantees a stable handle not that pids are not
            recycled (cf. [22]).
      
      Q-05: (Andrew Morton [21])
            Do attempts to signal that fd return errors?
      A-05: See {Q,A}-01.
      
      Q-06: (Andrew Morton [22])
            Is there a cleaner way of obtaining the fd? Another syscall perhaps.
      A-06: Userspace can already trivially retrieve file descriptors from procfs
            so this is something that we will need to support anyway. Hence,
            there's no immediate need to add another syscalls just to make
            pidfd_send_signal() not dependent on the presence of procfs. However,
            adding a syscalls to get such file descriptors is planned for a
            future patchset (cf. [22]).
      
      Q-07: (Andrew Morton [21] and others)
            This fd-for-a-process sounds like a handy thing and people may well
            think up other uses for it in the future, probably unrelated to
            signals. Are the code and the interface designed to permit such
            future applications?
      A-07: Yes (cf. [22]).
      
      Q-08: (Andrew Morton [21] and others)
            Now I think about it, why a new syscall? This thing is looking
            rather like an ioctl?
      A-08: This has been extensively discussed. It was agreed that a syscall is
            preferred for a variety or reasons. Here are just a few taken from
            prior threads. Syscalls are safer than ioctl()s especially when
            signaling to fds. Processes are a core kernel concept so a syscall
            seems more appropriate. The layout of the syscall with its four
            arguments would require the addition of a custom struct for the
            ioctl() thereby causing at least the same amount or even more
            complexity for userspace than a simple syscall. The new syscall will
            replace multiple other pid-based syscalls (see description above).
            The file-descriptors-for-processes concept introduced with this
            syscall will be extended with other syscalls in the future. See also
            [22], [23] and various other threads already linked in here.
      
      Q-09: (Florian Weimer [24])
            What happens if you use the new interface with an O_PATH descriptor?
      A-09:
            pidfds opened as O_PATH fds cannot be used to send signals to a
            process (cf. [2]). Signaling processes through pidfds is the
            equivalent of writing to a file. Thus, this is not an operation that
            operates "purely at the file descriptor level" as required by the
            open(2) manpage. See also [4].
      
      /* References */
      [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
      [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
      [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
      [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
      [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
      [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
      [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
      [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
      [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
      [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
      [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
      [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
      [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
      [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
      [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
      [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
      [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
      [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
      [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
      [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
      [23]: https://lwn.net/Articles/773459/
      [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NTycho Andersen <tycho@tycho.ws>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Acked-by: NAleksa Sarai <cyphar@cyphar.com>
      3eb39f47
  4. 28 2月, 2019 2 次提交
    • J
      io_uring: add support for pre-mapped user IO buffers · edafccee
      Jens Axboe 提交于
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      edafccee
    • J
      Add io_uring IO interface · 2b188cc1
      Jens Axboe 提交于
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b188cc1
  5. 07 2月, 2019 4 次提交
    • A
      y2038: add 64-bit time_t syscalls to all 32-bit architectures · 48166e6e
      Arnd Bergmann 提交于
      This adds 21 new system calls on each ABI that has 32-bit time_t
      today. All of these have the exact same semantics as their existing
      counterparts, and the new ones all have macro names that end in 'time64'
      for clarification.
      
      This gets us to the point of being able to safely use a C library
      that has 64-bit time_t in user space. There are still a couple of
      loose ends to tie up in various areas of the code, but this is the
      big one, and should be entirely uncontroversial at this point.
      
      In particular, there are four system calls (getitimer, setitimer,
      waitid, and getrusage) that don't have a 64-bit counterpart yet,
      but these can all be safely implemented in the C library by wrapping
      around the existing system calls because the 32-bit time_t they
      pass only counts elapsed time, not time since the epoch. They
      will be dealt with later.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      48166e6e
    • A
      y2038: rename old time and utime syscalls · d33c577c
      Arnd Bergmann 提交于
      The time, stime, utime, utimes, and futimesat system calls are only
      used on older architectures, and we do not provide y2038 safe variants
      of them, as they are replaced by clock_gettime64, clock_settime64,
      and utimensat_time64.
      
      However, for consistency it seems better to have the 32-bit architectures
      that still use them call the "time32" entry points (leaving the
      traditional handlers for the 64-bit architectures), like we do for system
      calls that now require two versions.
      
      Note: We used to always define __ARCH_WANT_SYS_TIME and
      __ARCH_WANT_SYS_UTIME and only set __ARCH_WANT_COMPAT_SYS_TIME and
      __ARCH_WANT_SYS_UTIME32 for compat mode on 64-bit kernels. Now this is
      reversed: only 64-bit architectures set __ARCH_WANT_SYS_TIME/UTIME, while
      we need __ARCH_WANT_SYS_TIME32/UTIME32 for 32-bit architectures and compat
      mode. The resulting asm/unistd.h changes look a bit counterintuitive.
      
      This is only a cleanup patch and it should not change any behavior.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      d33c577c
    • A
      y2038: use time32 syscall names on 32-bit · 00bf25d6
      Arnd Bergmann 提交于
      This is the big flip, where all 32-bit architectures set COMPAT_32BIT_TIME
      and use the _time32 system calls from the former compat layer instead
      of the system calls that take __kernel_timespec and similar arguments.
      
      The temporary redirects for __kernel_timespec, __kernel_itimerspec
      and __kernel_timex can get removed with this.
      
      It would be easy to split this commit by architecture, but with the new
      generated system call tables, it's easy enough to do it all at once,
      which makes it a little easier to check that the changes are the same
      in each table.
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      00bf25d6
    • A
      y2038: syscalls: rename y2038 compat syscalls · 8dabe724
      Arnd Bergmann 提交于
      A lot of system calls that pass a time_t somewhere have an implementation
      using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
      been reworked so that this implementation can now be used on 32-bit
      architectures as well.
      
      The missing step is to redefine them using the regular SYSCALL_DEFINEx()
      to get them out of the compat namespace and make it possible to build them
      on 32-bit architectures.
      
      Any system call that ends in 'time' gets a '32' suffix on its name for
      that version, while the others get a '_time32' suffix, to distinguish
      them from the normal version, which takes a 64-bit time argument in the
      future.
      
      In this step, only 64-bit architectures are changed, doing this rename
      first lets us avoid touching the 32-bit architectures twice.
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      8dabe724
  6. 26 1月, 2019 1 次提交
    • A
      arch: add split IPC system calls where needed · 0d6040d4
      Arnd Bergmann 提交于
      The IPC system call handling is highly inconsistent across architectures,
      some use sys_ipc, some use separate calls, and some use both.  We also
      have some architectures that require passing IPC_64 in the flags, and
      others that set it implicitly.
      
      For the addition of a y2038 safe semtimedop() system call, I chose to only
      support the separate entry points, but that requires first supporting
      the regular ones with their own syscall numbers.
      
      The IPC_64 is now implied by the new semctl/shmctl/msgctl system
      calls even on the architectures that require passing it with the ipc()
      multiplexer.
      
      I'm not adding the new semtimedop() or semop() on 32-bit architectures,
      those will get implemented using the new semtimedop_time64() version
      that gets added along with the other time64 calls.
      Three 64-bit architectures (powerpc, s390 and sparc) get semtimedop().
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      0d6040d4
  7. 06 6月, 2018 1 次提交
    • M
      x86: Wire up restartable sequence system call · 05c17ced
      Mathieu Desnoyers 提交于
      Wire up the rseq system call on x86 32/64.
      
      This provides an ABI improving the speed of a user-space getcpu
      operation on x86 by removing the need to perform a function call, "lsl"
      instruction, or system call on the fast path, as well as improving the
      speed of user-space operations on per-cpu data.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-8-mathieu.desnoyers@efficios.com
      05c17ced
  8. 03 5月, 2018 1 次提交
    • C
      aio: implement io_pgetevents · 7a074e96
      Christoph Hellwig 提交于
      This is the io_getevents equivalent of ppoll/pselect and allows to
      properly mix signals and aio completions (especially with IOCB_CMD_POLL)
      and atomically executes the following sequence:
      
      	sigset_t origmask;
      
      	pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
      	ret = io_getevents(ctx, min_nr, nr, events, timeout);
      	pthread_sigmask(SIG_SETMASK, &origmask, NULL);
      
      Note that unlike many other signal related calls we do not pass a sigmask
      size, as that would get us to 7 arguments, which aren't easily supported
      by the syscall infrastructure.  It seems a lot less painful to just add a
      new syscall variant in the unlikely case we're going to increase the
      sigset size.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7a074e96
  9. 09 4月, 2018 3 次提交
    • D
      syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*() · d5a00528
      Dominik Brodowski 提交于
      This rename allows us to have a coherent syscall stub naming convention on
      64-bit x86 (0xffffffff prefix removed):
      
       810f0af0 t            kernel_waitid	# common (32/64) kernel helper
      
       <inline>            __do_sys_waitid	# inlined helper doing actual work
       810f0be0 t          __se_sys_waitid	# C func calling inlined helper
      
       <inline>     __do_compat_sys_waitid	# inlined helper doing actual work
       810f0d80 t   __se_compat_sys_waitid	# compat C func calling inlined helper
      
       810f2080 T         __x64_sys_waitid	# x64 64-bit-ptregs -> C stub
       810f20b0 T        __ia32_sys_waitid	# ia32 32-bit-ptregs -> C stub[*]
       810f2470 T __ia32_compat_sys_waitid	# ia32 32-bit-ptregs -> compat C stub
       810f2490 T  __x32_compat_sys_waitid	# x32 64-bit-ptregs -> compat C stub
      
          [*] This stub is unused, as the syscall table links
      	__ia32_compat_sys_waitid instead of __ia32_sys_waitid as we need
      	a compat variant here.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180409105145.5364-4-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d5a00528
    • D
      syscalls/core, syscalls/x86: Clean up compat syscall stub naming convention · 5ac9efa3
      Dominik Brodowski 提交于
      Tidy the naming convention for compat syscall subs. Hints which describe
      the purpose of the stub go in front and receive a double underscore to
      denote that they are generated on-the-fly by the COMPAT_SYSCALL_DEFINEx()
      macro.
      
      For the generic case, this means:
      
      t            kernel_waitid	# common C function (see kernel/exit.c)
      
          __do_compat_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
      T   __se_compat_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long,
      				# casts them to unsigned long and then to
      				# the declared type)
      
      T        compat_sys_waitid      # alias to __se_compat_sys_waitid()
      				# (taking parameters as declared), to
      				# be included in syscall table
      
      For x86, the naming is as follows:
      
      t            kernel_waitid	# common C function (see kernel/exit.c)
      
          __do_compat_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
      t   __se_compat_sys_waitid      # sign-extending C function calling inlined
      				# helper (takes parameters of type long,
      				# casts them to unsigned long and then to
      				# the declared type)
      
      T __ia32_compat_sys_waitid	# IA32_EMULATION 32-bit-ptregs -> C stub,
      				# calls __se_compat_sys_waitid(); to be
      				# included in syscall table
      
      T  __x32_compat_sys_waitid	# x32 64-bit-ptregs -> C stub, calls
      				# __se_compat_sys_waitid(); to be included
      				# in syscall table
      
      If only one of IA32_EMULATION and x32 is enabled, __se_compat_sys_waitid()
      may be inlined into the stub __{ia32,x32}_compat_sys_waitid().
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180409105145.5364-3-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5ac9efa3
    • D
      syscalls/core, syscalls/x86: Clean up syscall stub naming convention · e145242e
      Dominik Brodowski 提交于
      Tidy the naming convention for compat syscall subs. Hints which describe
      the purpose of the stub go in front and receive a double underscore to
      denote that they are generated on-the-fly by the SYSCALL_DEFINEx() macro.
      
      For the generic case, this means (0xffffffff prefix removed):
      
       810f08d0 t     kernel_waitid	# common C function (see kernel/exit.c)
      
       <inline>     __do_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
       810f1aa0 T   __se_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long;
      				# casts them to the declared type)
      
       810f1aa0 T        sys_waitid	# alias to __se_sys_waitid() (taking
      				# parameters as declared), to be included
      				# in syscall table
      
      For x86, the naming is as follows:
      
       810efc70 t     kernel_waitid	# common C function (see kernel/exit.c)
      
       <inline>     __do_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
       810efd60 t   __se_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long;
      				# casts them to the declared type)
      
       810f1140 T __ia32_sys_waitid	# IA32_EMULATION 32-bit-ptregs -> C stub,
      				# calls __se_sys_waitid(); to be included
      				# in syscall table
      
       810f1110 T        sys_waitid	# x86 64-bit-ptregs -> C stub, calls
      				# __se_sys_waitid(); to be included in
      				# syscall table
      
      For x86, sys_waitid() will be re-named to __x64_sys_waitid in a follow-up
      patch.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180409105145.5364-2-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e145242e
  10. 05 4月, 2018 1 次提交
    • D
      syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION and x32 · ebeb8c82
      Dominik Brodowski 提交于
      Extend ARCH_HAS_SYSCALL_WRAPPER for i386 emulation and for x32 on 64-bit
      x86.
      
      For x32, all we need to do is to create an additional stub for each
      compat syscall which decodes the parameters in x86-64 ordering, e.g.:
      
      	asmlinkage long __compat_sys_x32_xyzzy(struct pt_regs *regs)
      	{
      		return c_SyS_xyzzy(regs->di, regs->si, regs->dx);
      	}
      
      For i386 emulation, we need to teach compat_sys_*() to take struct
      pt_regs as its only argument, e.g.:
      
      	asmlinkage long __compat_sys_ia32_xyzzy(struct pt_regs *regs)
      	{
      		return c_SyS_xyzzy(regs->bx, regs->cx, regs->dx);
      	}
      
      In addition, we need to create additional stubs for common syscalls
      (that is, for syscalls which have the same parameters on 32-bit and
      64-bit), e.g.:
      
      	asmlinkage long __sys_ia32_xyzzy(struct pt_regs *regs)
      	{
      		return c_sys_xyzzy(regs->bx, regs->cx, regs->dx);
      	}
      
      This approach avoids leaking random user-provided register content down
      the call chain.
      
      This patch is based on an original proof-of-concept
      
       | From: Linus Torvalds <torvalds@linux-foundation.org>
       | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      and was split up and heavily modified by me, in particular to base it on
      ARCH_HAS_SYSCALL_WRAPPER.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180405095307.3730-6-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ebeb8c82
  11. 03 4月, 2018 2 次提交
  12. 07 3月, 2018 3 次提交
  13. 18 4月, 2017 1 次提交
    • A
      Remove compat_sys_getdents64() · 2611dc19
      Al Viro 提交于
      Unlike normal compat syscall variants, it is needed only for
      biarch architectures that have different alignement requirements for
      u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
      a store of 64bit value at 32bit-aligned address.  We used to have one
      such (ia64), but its biarch support has been gone since 2010 (after
      being broken in 2008, which went unnoticed since nobody had been using
      it).
      
      It had escaped removal at the same time only because back in 2004
      a patch that switched several syscalls on amd64 from private wrappers to
      generic compat ones had switched to use of compat_sys_getdents64(), which
      hadn't needed (or used) a compat wrapper on amd64.
      
      Let's bury it - it's at least 7 years overdue.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2611dc19
  14. 20 3月, 2017 1 次提交
    • K
      x86/syscalls/32: Wire up arch_prctl on x86-32 · 79170fda
      Kyle Huey 提交于
      Hook up arch_prctl to call do_arch_prctl() on x86-32, and in 32 bit compat
      mode on x86-64. This allows to have arch_prctls that are not specific to 64
      bits.
      
      On UML, simply stub out this syscall.
      Signed-off-by: NKyle Huey <khuey@kylehuey.com>
      Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: linux-kselftest@vger.kernel.org
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Robert O'Callahan <robert@ocallahan.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: user-mode-linux-user@lists.sourceforge.net
      Cc: David Matlack <dmatlack@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Link: http://lkml.kernel.org/r/20170320081628.18952-7-khuey@kylehuey.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      79170fda
  15. 03 3月, 2017 1 次提交
    • D
      statx: Add a system call to make enhanced file info available · a528d35e
      David Howells 提交于
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      function.
      
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      
      ========
      OVERVIEW
      ========
      
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      
      A number of requests were gathered for features to be included.  The
      following have been included:
      
       (1) Make the fields a consistent size on all arches and make them large.
      
       (2) Spare space, request flags and information flags are provided for
           future expansion.
      
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
           __s64).
      
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
      
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
      
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
      
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
           (AT_STATX_FORCE_SYNC).
      
      And the following have been left out for future extension:
      
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Kumar].
      
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
      
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
      
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
      
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
      
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
      
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
      
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
      
           (Deferred to fsinfo).
      
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
      
           (Deferred to fsinfo).
      
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
      
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were to ext[234]-specific and shouldn't
           be exposed through statx this way).
      
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
      
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabal might require extra filesystem operations).
      
      (16) Femtosecond-resolution timestamps [Dave Chinner].
      
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      
      (17) A set multiple attributes syscall to go with this.
      
      ===============
      NEW SYSTEM CALL
      ===============
      
      The new system call is:
      
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
      
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
           respect.
      
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
      
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
           approximate.
      
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      
      buffer points to the destination for the data.  This must be 256 bytes in
      size.
      
      ======================
      MAIN ATTRIBUTES RECORD
      ======================
      
      The following structures are defined in which to return the main attribute
      set:
      
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	};
      
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      	};
      
      The defined bits in request_mask and stx_mask are:
      
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spares*[] are where as-yet undefined fields can be
      placed.
      
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
      
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      
      Within the kernel, the supported flags are listed by:
      
      	KSTAT_ATTR_FS_IOC_FLAGS
      
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      
      New flags include:
      
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
      
      Fields in struct statx come in a number of classes:
      
       (0) stx_dev_*, stx_blksize.
      
           These are local system information and are always available.
      
       (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
      
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
      
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
      
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
      
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
      
       (2) stx_rdev_*.
      
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
      
       (3) stx_btime.
      
           Similar to (1), except this will be set to 0 if it doesn't exist.
      
      =======
      TESTING
      =======
      
      The following test program can be used to test the statx system call:
      
      	samples/statx/test-statx.c
      
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      
      Secondly, the result of automounting on that directory.
      
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a528d35e
  16. 18 10月, 2016 1 次提交
    • D
      x86, pkeys: remove cruft from never-merged syscalls · eac0ca77
      Dave Hansen 提交于
      pkey_set() and pkey_get() were syscalls present in older versions
      of the protection keys patches.  The syscall number definitions
      were inadvertently left in place.  This patch removes them.
      
      I did a git grep and verified that these are the last places in
      the tree that these appear, save for the protection_keys.c tests
      and Documentation.  Those spots talk about functions called
      pkey_get/set() which are wrappers for the direct PKRU
      instructions, not the syscalls.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: mgorman@techsingularity.net
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Fixes: f9afc619 ("x86: Wire up protection keys system calls")
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eac0ca77
  17. 09 9月, 2016 1 次提交
  18. 01 8月, 2016 1 次提交
  19. 12 5月, 2016 1 次提交
  20. 13 4月, 2016 1 次提交
  21. 05 3月, 2016 1 次提交
  22. 02 12月, 2015 1 次提交
  23. 06 11月, 2015 1 次提交
  24. 09 10月, 2015 1 次提交
  25. 12 9月, 2015 1 次提交
    • M
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers 提交于
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  26. 05 9月, 2015 1 次提交