1. 18 1月, 2020 1 次提交
    • A
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai 提交于
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fddb5d43
  2. 14 1月, 2020 1 次提交
  3. 07 9月, 2019 1 次提交
    • A
      ipc: fix semtimedop for generic 32-bit architectures · 78e05972
      Arnd Bergmann 提交于
      As Vincent noticed, the y2038 conversion of semtimedop in linux-5.1
      broke when commit 00bf25d6 ("y2038: use time32 syscall names on
      32-bit") changed all system calls on all architectures that take
      a 32-bit time_t to point to the _time32 implementation, but left out
      semtimedop in the asm-generic header.
      
      This affects all 32-bit architectures using asm-generic/unistd.h:
      h8300, unicore32, openrisc, nios2, hexagon, c6x, arc, nds32 and csky.
      
      The notable exception is riscv32, which has dropped support for the
      time32 system calls entirely.
      Reported-by: NVincent Chen <deanbo422@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Fixes: 00bf25d6 ("y2038: use time32 syscall names on 32-bit")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      78e05972
  4. 15 7月, 2019 1 次提交
  5. 28 6月, 2019 1 次提交
    • C
      arch: wire-up pidfd_open() · 7615d9e1
      Christian Brauner 提交于
      This wires up the pidfd_open() syscall into all arches at once.
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      7615d9e1
  6. 09 6月, 2019 1 次提交
    • C
      arch: wire-up clone3() syscall · 8f3220a8
      Christian Brauner 提交于
      Wire up the clone3() call on all arches that don't require hand-rolled
      assembly.
      
      Some of the arches look like they need special assembly massaging and it is
      probably smarter if the appropriate arch maintainers would do the actual
      wiring. Arches that are wired-up are:
      - x86{_32,64}
      - arm{64}
      - xtensa
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      8f3220a8
  7. 17 5月, 2019 1 次提交
  8. 06 3月, 2019 1 次提交
    • C
      signal: add pidfd_send_signal() syscall · 3eb39f47
      Christian Brauner 提交于
      The kill() syscall operates on process identifiers (pid). After a process
      has exited its pid can be reused by another process. If a caller sends a
      signal to a reused pid it will end up signaling the wrong process. This
      issue has often surfaced and there has been a push to address this problem [1].
      
      This patch uses file descriptors (fd) from proc/<pid> as stable handles on
      struct pid. Even if a pid is recycled the handle will not change. The fd
      can be used to send signals to the process it refers to.
      Thus, the new syscall pidfd_send_signal() is introduced to solve this
      problem. Instead of pids it operates on process fds (pidfd).
      
      /* prototype and argument /*
      long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
      
      /* syscall number 424 */
      The syscall number was chosen to be 424 to align with Arnd's rework in his
      y2038 to minimize merge conflicts (cf. [25]).
      
      In addition to the pidfd and signal argument it takes an additional
      siginfo_t and flags argument. If the siginfo_t argument is NULL then
      pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
      is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
      The flags argument is added to allow for future extensions of this syscall.
      It currently needs to be passed as 0. Failing to do so will cause EINVAL.
      
      /* pidfd_send_signal() replaces multiple pid-based syscalls */
      The pidfd_send_signal() syscall currently takes on the job of
      rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
      positive pid is passed to kill(2). It will however be possible to also
      replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
      
      /* sending signals to threads (tid) and process groups (pgid) */
      Specifically, the pidfd_send_signal() syscall does currently not operate on
      process groups or threads. This is left for future extensions.
      In order to extend the syscall to allow sending signal to threads and
      process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
      PIDFD_TYPE_TID) should be added. This implies that the flags argument will
      determine what is signaled and not the file descriptor itself. Put in other
      words, grouping in this api is a property of the flags argument not a
      property of the file descriptor (cf. [13]). Clarification for this has been
      requested by Eric (cf. [19]).
      When appropriate extensions through the flags argument are added then
      pidfd_send_signal() can additionally replace the part of kill(2) which
      operates on process groups as well as the tgkill(2) and
      rt_tgsigqueueinfo(2) syscalls.
      How such an extension could be implemented has been very roughly sketched
      in [14], [15], and [16]. However, this should not be taken as a commitment
      to a particular implementation. There might be better ways to do it.
      Right now this is intentionally left out to keep this patchset as simple as
      possible (cf. [4]).
      
      /* naming */
      The syscall had various names throughout iterations of this patchset:
      - procfd_signal()
      - procfd_send_signal()
      - taskfd_send_signal()
      In the last round of reviews it was pointed out that given that if the
      flags argument decides the scope of the signal instead of different types
      of fds it might make sense to either settle for "procfd_" or "pidfd_" as
      prefix. The community was willing to accept either (cf. [17] and [18]).
      Given that one developer expressed strong preference for the "pidfd_"
      prefix (cf. [13]) and with other developers less opinionated about the name
      we should settle for "pidfd_" to avoid further bikeshedding.
      
      The  "_send_signal" suffix was chosen to reflect the fact that the syscall
      takes on the job of multiple syscalls. It is therefore intentional that the
      name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
      fomer because it might imply that pidfd_send_signal() is a replacement for
      kill(2), and not the latter because it is a hassle to remember the correct
      spelling - especially for non-native speakers - and because it is not
      descriptive enough of what the syscall actually does. The name
      "pidfd_send_signal" makes it very clear that its job is to send signals.
      
      /* zombies */
      Zombies can be signaled just as any other process. No special error will be
      reported since a zombie state is an unreliable state (cf. [3]). However,
      this can be added as an extension through the @flags argument if the need
      ever arises.
      
      /* cross-namespace signals */
      The patch currently enforces that the signaler and signalee either are in
      the same pid namespace or that the signaler's pid namespace is an ancestor
      of the signalee's pid namespace. This is done for the sake of simplicity
      and because it is unclear to what values certain members of struct
      siginfo_t would need to be set to (cf. [5], [6]).
      
      /* compat syscalls */
      It became clear that we would like to avoid adding compat syscalls
      (cf. [7]).  The compat syscall handling is now done in kernel/signal.c
      itself by adding __copy_siginfo_from_user_generic() which lets us avoid
      compat syscalls (cf. [8]). It should be noted that the addition of
      __copy_siginfo_from_user_any() is caused by a bug in the original
      implementation of rt_sigqueueinfo(2) (cf. 12).
      With upcoming rework for syscall handling things might improve
      significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
      any additional callers.
      
      /* testing */
      This patch was tested on x64 and x86.
      
      /* userspace usage */
      An asciinema recording for the basic functionality can be found under [9].
      With this patch a process can be killed via:
      
       #define _GNU_SOURCE
       #include <errno.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/stat.h>
       #include <sys/syscall.h>
       #include <sys/types.h>
       #include <unistd.h>
      
       static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                               unsigned int flags)
       {
       #ifdef __NR_pidfd_send_signal
               return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
       #else
               return -ENOSYS;
       #endif
       }
      
       int main(int argc, char *argv[])
       {
               int fd, ret, saved_errno, sig;
      
               if (argc < 3)
                       exit(EXIT_FAILURE);
      
               fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
               if (fd < 0) {
                       printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               sig = atoi(argv[2]);
      
               printf("Sending signal %d to process %s\n", sig, argv[1]);
               ret = do_pidfd_send_signal(fd, sig, NULL, 0);
      
               saved_errno = errno;
               close(fd);
               errno = saved_errno;
      
               if (ret < 0) {
                       printf("%s - Failed to send signal %d to process %s\n",
                              strerror(errno), sig, argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               exit(EXIT_SUCCESS);
       }
      
      /* Q&A
       * Given that it seems the same questions get asked again by people who are
       * late to the party it makes sense to add a Q&A section to the commit
       * message so it's hopefully easier to avoid duplicate threads.
       *
       * For the sake of progress please consider these arguments settled unless
       * there is a new point that desperately needs to be addressed. Please make
       * sure to check the links to the threads in this commit message whether
       * this has not already been covered.
       */
      Q-01: (Florian Weimer [20], Andrew Morton [21])
            What happens when the target process has exited?
      A-01: Sending the signal will fail with ESRCH (cf. [22]).
      
      Q-02:  (Andrew Morton [21])
             Is the task_struct pinned by the fd?
      A-02:  No. A reference to struct pid is kept. struct pid - as far as I
             understand - was created exactly for the reason to not require to
             pin struct task_struct (cf. [22]).
      
      Q-03: (Andrew Morton [21])
            Does the entire procfs directory remain visible? Just one entry
            within it?
      A-03: The same thing that happens right now when you hold a file descriptor
            to /proc/<pid> open (cf. [22]).
      
      Q-04: (Andrew Morton [21])
            Does the pid remain reserved?
      A-04: No. This patchset guarantees a stable handle not that pids are not
            recycled (cf. [22]).
      
      Q-05: (Andrew Morton [21])
            Do attempts to signal that fd return errors?
      A-05: See {Q,A}-01.
      
      Q-06: (Andrew Morton [22])
            Is there a cleaner way of obtaining the fd? Another syscall perhaps.
      A-06: Userspace can already trivially retrieve file descriptors from procfs
            so this is something that we will need to support anyway. Hence,
            there's no immediate need to add another syscalls just to make
            pidfd_send_signal() not dependent on the presence of procfs. However,
            adding a syscalls to get such file descriptors is planned for a
            future patchset (cf. [22]).
      
      Q-07: (Andrew Morton [21] and others)
            This fd-for-a-process sounds like a handy thing and people may well
            think up other uses for it in the future, probably unrelated to
            signals. Are the code and the interface designed to permit such
            future applications?
      A-07: Yes (cf. [22]).
      
      Q-08: (Andrew Morton [21] and others)
            Now I think about it, why a new syscall? This thing is looking
            rather like an ioctl?
      A-08: This has been extensively discussed. It was agreed that a syscall is
            preferred for a variety or reasons. Here are just a few taken from
            prior threads. Syscalls are safer than ioctl()s especially when
            signaling to fds. Processes are a core kernel concept so a syscall
            seems more appropriate. The layout of the syscall with its four
            arguments would require the addition of a custom struct for the
            ioctl() thereby causing at least the same amount or even more
            complexity for userspace than a simple syscall. The new syscall will
            replace multiple other pid-based syscalls (see description above).
            The file-descriptors-for-processes concept introduced with this
            syscall will be extended with other syscalls in the future. See also
            [22], [23] and various other threads already linked in here.
      
      Q-09: (Florian Weimer [24])
            What happens if you use the new interface with an O_PATH descriptor?
      A-09:
            pidfds opened as O_PATH fds cannot be used to send signals to a
            process (cf. [2]). Signaling processes through pidfds is the
            equivalent of writing to a file. Thus, this is not an operation that
            operates "purely at the file descriptor level" as required by the
            open(2) manpage. See also [4].
      
      /* References */
      [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
      [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
      [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
      [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
      [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
      [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
      [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
      [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
      [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
      [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
      [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
      [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
      [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
      [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
      [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
      [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
      [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
      [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
      [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
      [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
      [23]: https://lwn.net/Articles/773459/
      [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NTycho Andersen <tycho@tycho.ws>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Acked-by: NAleksa Sarai <cyphar@cyphar.com>
      3eb39f47
  9. 28 2月, 2019 2 次提交
    • J
      io_uring: add support for pre-mapped user IO buffers · edafccee
      Jens Axboe 提交于
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      edafccee
    • J
      Add io_uring IO interface · 2b188cc1
      Jens Axboe 提交于
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b188cc1
  10. 20 2月, 2019 1 次提交
    • A
      asm-generic: Make time32 syscall numbers optional · c8ce48f0
      Arnd Bergmann 提交于
      We don't want new architectures to even provide the old 32-bit time_t
      based system calls any more, or define the syscall number macros.
      
      Add a new __ARCH_WANT_TIME32_SYSCALLS macro that gets enabled for all
      existing 32-bit architectures using the generic system call table,
      so we don't change any current behavior.
      Since this symbol is evaluated in user space as well, we cannot use
      a Kconfig CONFIG_* macro but have to define it in uapi/asm/unistd.h.
      
      On 64-bit architectures, the same system call numbers mostly refer to
      the system calls we want to keep, as they already pass 64-bit time_t.
      
      As new architectures no longer provide these, we need new exceptions
      in checksyscalls.sh.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      c8ce48f0
  11. 19 2月, 2019 1 次提交
    • Y
      asm-generic: Drop getrlimit and setrlimit syscalls from default list · 80d7da1c
      Yury Norov 提交于
      The newer prlimit64 syscall provides all the functionality of getrlimit
      and setrlimit syscalls and adds the pid of target process, so future
      architectures won't need to include getrlimit and setrlimit.
      
      Therefore drop getrlimit and setrlimit syscalls from the generic syscall
      list unless __ARCH_WANT_SET_GET_RLIMIT is defined by the architecture's
      unistd.h prior to including asm-generic/unistd.h, and adjust all
      architectures using the generic syscall list to define it so that no
      in-tree architectures are affected.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-hexagon@vger.kernel.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Signed-off-by: NYury Norov <ynorov@caviumnetworks.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: Mark Salter <msalter@redhat.com> [c6x]
      Acked-by: James Hogan <james.hogan@imgtec.com> [metag]
      Acked-by: Ley Foon Tan <lftan@altera.com> [nios2]
      Acked-by: Stafford Horne <shorne@gmail.com> [openrisc]
      Acked-by: Will Deacon <will.deacon@arm.com> [arm64]
      Acked-by: Vineet Gupta <vgupta@synopsys.com> #arch/arc bits
      Signed-off-by: NYury Norov <ynorov@marvell.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      80d7da1c
  12. 18 2月, 2019 1 次提交
  13. 07 2月, 2019 3 次提交
    • A
      y2038: add 64-bit time_t syscalls to all 32-bit architectures · 48166e6e
      Arnd Bergmann 提交于
      This adds 21 new system calls on each ABI that has 32-bit time_t
      today. All of these have the exact same semantics as their existing
      counterparts, and the new ones all have macro names that end in 'time64'
      for clarification.
      
      This gets us to the point of being able to safely use a C library
      that has 64-bit time_t in user space. There are still a couple of
      loose ends to tie up in various areas of the code, but this is the
      big one, and should be entirely uncontroversial at this point.
      
      In particular, there are four system calls (getitimer, setitimer,
      waitid, and getrusage) that don't have a 64-bit counterpart yet,
      but these can all be safely implemented in the C library by wrapping
      around the existing system calls because the 32-bit time_t they
      pass only counts elapsed time, not time since the epoch. They
      will be dealt with later.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      48166e6e
    • A
      y2038: use time32 syscall names on 32-bit · 00bf25d6
      Arnd Bergmann 提交于
      This is the big flip, where all 32-bit architectures set COMPAT_32BIT_TIME
      and use the _time32 system calls from the former compat layer instead
      of the system calls that take __kernel_timespec and similar arguments.
      
      The temporary redirects for __kernel_timespec, __kernel_itimerspec
      and __kernel_timex can get removed with this.
      
      It would be easy to split this commit by architecture, but with the new
      generated system call tables, it's easy enough to do it all at once,
      which makes it a little easier to check that the changes are the same
      in each table.
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      00bf25d6
    • A
      y2038: syscalls: rename y2038 compat syscalls · 8dabe724
      Arnd Bergmann 提交于
      A lot of system calls that pass a time_t somewhere have an implementation
      using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
      been reworked so that this implementation can now be used on 32-bit
      architectures as well.
      
      The missing step is to redefine them using the regular SYSCALL_DEFINEx()
      to get them out of the compat namespace and make it possible to build them
      on 32-bit architectures.
      
      Any system call that ends in 'time' gets a '32' suffix on its name for
      that version, while the others get a '_time32' suffix, to distinguish
      them from the normal version, which takes a 64-bit time argument in the
      future.
      
      In this step, only 64-bit architectures are changed, doing this rename
      first lets us avoid touching the 32-bit architectures twice.
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      8dabe724
  14. 26 1月, 2019 1 次提交
    • A
      arch: add split IPC system calls where needed · 0d6040d4
      Arnd Bergmann 提交于
      The IPC system call handling is highly inconsistent across architectures,
      some use sys_ipc, some use separate calls, and some use both.  We also
      have some architectures that require passing IPC_64 in the flags, and
      others that set it implicitly.
      
      For the addition of a y2038 safe semtimedop() system call, I chose to only
      support the separate entry points, but that requires first supporting
      the regular ones with their own syscall numbers.
      
      The IPC_64 is now implied by the new semctl/shmctl/msgctl system
      calls even on the architectures that require passing it with the ipc()
      multiplexer.
      
      I'm not adding the new semtimedop() or semop() on 32-bit architectures,
      those will get implemented using the new semtimedop_time64() version
      that gets added along with the other time64 calls.
      Three 64-bit architectures (powerpc, s390 and sparc) get semtimedop().
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      0d6040d4
  15. 06 12月, 2018 2 次提交
  16. 29 8月, 2018 1 次提交
    • A
      y2038: Remove stat64 family from default syscall set · bf4b6a7d
      Arnd Bergmann 提交于
      New architectures should no longer need stat64, which is not y2038
      safe and has been replaced by statx(). This removes the 'select
      __ARCH_WANT_STAT64' statement from asm-generic/unistd.h and instead
      moves it into the respective asm/unistd.h UAPI header files for each
      architecture that uses it today.
      
      In the generic file, the system call number and entry points are now
      made conditional, so newly added architectures (e.g. riscv32 or csky)
      will never need to carry backwards compatiblity for it.
      
      arm64 is the only 64-bit architecture using the asm-generic/unistd.h
      file, and it already sets __ARCH_WANT_NEW_STAT in its headers, and I
      use the same #ifdef here: future 64-bit architectures therefore won't
      see newstat or stat64 any more. They don't suffer from the y2038 time_t
      overflow, but for consistency it seems best to also let them use statx().
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      bf4b6a7d
  17. 11 7月, 2018 1 次提交
  18. 03 5月, 2018 1 次提交
    • C
      aio: implement io_pgetevents · 7a074e96
      Christoph Hellwig 提交于
      This is the io_getevents equivalent of ppoll/pselect and allows to
      properly mix signals and aio completions (especially with IOCB_CMD_POLL)
      and atomically executes the following sequence:
      
      	sigset_t origmask;
      
      	pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
      	ret = io_getevents(ctx, min_nr, nr, events, timeout);
      	pthread_sigmask(SIG_SETMASK, &origmask, NULL);
      
      Note that unlike many other signal related calls we do not pass a sigmask
      size, as that would get us to 7 arguments, which aren't easily supported
      by the syscall infrastructure.  It seems a lot less painful to just add a
      new syscall variant in the unlikely case we're going to increase the
      sigset size.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7a074e96
  19. 26 3月, 2018 1 次提交
    • A
      asm-generic: clean up asm/unistd.h · a0673fdb
      Arnd Bergmann 提交于
      The score architecture used a number of old system calls for compatibility
      with a traditional libc port, all architectures that got added later
      skip these. With score out of the way, we can finally clean up the
      syscall list to no longer provide these.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      a0673fdb
  20. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX license identifier to uapi header files with no license · 6f52b16c
      Greg Kroah-Hartman 提交于
      Many user space API headers are missing licensing information, which
      makes it hard for compliance tools to determine the correct license.
      
      By default are files without license information under the default
      license of the kernel, which is GPLV2.  Marking them GPLV2 would exclude
      them from being included in non GPLV2 code, which is obviously not
      intended. The user space API headers fall under the syscall exception
      which is in the kernels COPYING file:
      
         NOTE! This copyright does *not* cover user programs that use kernel
         services by normal system calls - this is merely considered normal use
         of the kernel, and does *not* fall under the heading of "derived work".
      
      otherwise syscall usage would not be possible.
      
      Update the files which contain no license information with an SPDX
      license identifier.  The chosen identifier is 'GPL-2.0 WITH
      Linux-syscall-note' which is the officially assigned identifier for the
      Linux syscall exception.  SPDX license identifiers are a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.  See the previous patch in this series for the
      methodology of how this patch was researched.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f52b16c
  21. 18 4月, 2017 1 次提交
    • A
      Remove compat_sys_getdents64() · 2611dc19
      Al Viro 提交于
      Unlike normal compat syscall variants, it is needed only for
      biarch architectures that have different alignement requirements for
      u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
      a store of 64bit value at 32bit-aligned address.  We used to have one
      such (ia64), but its biarch support has been gone since 2010 (after
      being broken in 2008, which went unnoticed since nobody had been using
      it).
      
      It had escaped removal at the same time only because back in 2004
      a patch that switched several syscalls on amd64 from private wrappers to
      generic compat ones had switched to use of compat_sys_getdents64(), which
      hadn't needed (or used) a compat wrapper on amd64.
      
      Let's bury it - it's at least 7 years overdue.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2611dc19
  22. 20 3月, 2017 1 次提交
    • S
      generic syscalls: Wire up statx syscall · fdfe4a39
      Stafford Horne 提交于
      The new syscall statx is implemented as generic code, so enable it
      for architectures like openrisc which use the generic syscall table.
      
      Fixes: a528d35e ("statx: Add a system call to make enhanced file info available")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Signed-off-by: NStafford Horne <shorne@gmail.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      fdfe4a39
  23. 18 10月, 2016 1 次提交
    • D
      generic syscalls: kill cruft from removed pkey syscalls · 71757904
      Dave Hansen 提交于
      pkey_set() and pkey_get() were syscalls present in older versions
      of the protection keys patches.  They were fully excised from the
      x86 code, but some cruft was left in the generic syscall code.  The
      C++ comments were intended to help to make it more glaring to me to
      fix them before actually submitting them.  That technique worked,
      but later than I would have liked.
      
      I test-compiled this for arm64.
      
      Fixes: a60f7b69 ("generic syscalls: Wire up memory protection keys syscalls")
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: mgorman@techsingularity.net
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71757904
  24. 09 9月, 2016 1 次提交
  25. 05 5月, 2016 2 次提交
    • J
      asm-generic: Drop renameat syscall from default list · b0da6d44
      James Hogan 提交于
      The newer renameat2 syscall provides all the functionality provided by
      the renameat syscall and adds flags, so future architectures won't need
      to include renameat.
      
      Therefore drop the renameat syscall from the generic syscall list unless
      __ARCH_WANT_RENAMEAT is defined by the architecture's unistd.h prior to
      including asm-generic/unistd.h, and adjust all architectures using the
      generic syscall list to define it so that no in-tree architectures are
      affected.
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Acked-by: NVineet Gupta <vgupta@synopsys.com>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-metag@vger.kernel.org
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: linux@lists.openrisc.net
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: nios2-dev@lists.rocketboards.org
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      b0da6d44
    • Y
      asm-generic: use compat version for preadv2 and pwritev2 · 1f93e9f2
      Yury Norov 提交于
      Compat architectures that does not use generic unistd (mips, s390),
      declare compat version in their syscall tables for preadv2 and
      pwritev2. Generic unistd syscall table should do it as well.
      
      [arnd: this initially slipped through the review and an
       incorrect patch got merged. arch/tile/ is the only architecture
       that could be affected for their 32-bit compat mode, every
       other architecture we support today is fine.]
      Signed-off-by: NYury Norov <ynorov@caviumnetworks.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      1f93e9f2
  26. 24 4月, 2016 1 次提交
  27. 02 12月, 2015 1 次提交
    • Z
      vfs: add copy_file_range syscall and vfs helper · 29732938
      Zach Brown 提交于
      Add a copy_file_range() system call for offloading copies between
      regular files.
      
      This gives an interface to underlying layers of the storage stack which
      can copy without reading and writing all the data.  There are a few
      candidates that should support copy offloading in the nearer term:
      
      - btrfs shares extent references with its clone ioctl
      - NFS has patches to add a COPY command which copies on the server
      - SCSI has a family of XCOPY commands which copy in the device
      
      This system call avoids the complexity of also accelerating the creation
      of the destination file by operating on an existing destination file
      descriptor, not a path.
      
      Currently the high level vfs entry point limits copy offloading to files
      on the same mount and super (and not in the same file).  This can be
      relaxed if we get implementations which can copy between file systems
      safely.
      Signed-off-by: NZach Brown <zab@redhat.com>
      [Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                       Change flags parameter from int to unsigned int,
                       Add function to include/linux/syscalls.h,
                       Check copy len after file open mode,
                       Don't forbid ranges inside the same file,
                       Use rw_verify_area() to veriy ranges,
                       Use file_out rather than file_in,
                       Add COPY_FR_REFLINK flag]
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      29732938
  28. 06 11月, 2015 1 次提交
  29. 23 9月, 2015 1 次提交
  30. 12 9月, 2015 1 次提交
    • M
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers 提交于
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  31. 14 12月, 2014 1 次提交
    • D
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale 提交于
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
  32. 27 9月, 2014 1 次提交
  33. 19 8月, 2014 1 次提交
  34. 06 8月, 2014 1 次提交
    • T
      random: introduce getrandom(2) system call · c6e9d6f3
      Theodore Ts'o 提交于
      The getrandom(2) system call was requested by the LibreSSL Portable
      developers.  It is analoguous to the getentropy(2) system call in
      OpenBSD.
      
      The rationale of this system call is to provide resiliance against
      file descriptor exhaustion attacks, where the attacker consumes all
      available file descriptors, forcing the use of the fallback code where
      /dev/[u]random is not available.  Since the fallback code is often not
      well-tested, it is better to eliminate this potential failure mode
      entirely.
      
      The other feature provided by this new system call is the ability to
      request randomness from the /dev/urandom entropy pool, but to block
      until at least 128 bits of entropy has been accumulated in the
      /dev/urandom entropy pool.  Historically, the emphasis in the
      /dev/urandom development has been to ensure that urandom pool is
      initialized as quickly as possible after system boot, and preferably
      before the init scripts start execution.
      
      This is because changing /dev/urandom reads to block represents an
      interface change that could potentially break userspace which is not
      acceptable.  In practice, on most x86 desktop and server systems, in
      general the entropy pool can be initialized before it is needed (and
      in modern kernels, we will printk a warning message if not).  However,
      on an embedded system, this may not be the case.  And so with this new
      interface, we can provide the functionality of blocking until the
      urandom pool has been initialized.  Any userspace program which uses
      this new functionality must take care to assure that if it is used
      during the boot process, that it will not cause the init scripts or
      other portions of the system startup to hang indefinitely.
      
      SYNOPSIS
      	#include <linux/random.h>
      
      	int getrandom(void *buf, size_t buflen, unsigned int flags);
      
      DESCRIPTION
      	The system call getrandom() fills the buffer pointed to by buf
      	with up to buflen random bytes which can be used to seed user
      	space random number generators (i.e., DRBG's) or for other
      	cryptographic uses.  It should not be used for Monte Carlo
      	simulations or other programs/algorithms which are doing
      	probabilistic sampling.
      
      	If the GRND_RANDOM flags bit is set, then draw from the
      	/dev/random pool instead of the /dev/urandom pool.  The
      	/dev/random pool is limited based on the entropy that can be
      	obtained from environmental noise, so if there is insufficient
      	entropy, the requested number of bytes may not be returned.
      	If there is no entropy available at all, getrandom(2) will
      	either block, or return an error with errno set to EAGAIN if
      	the GRND_NONBLOCK bit is set in flags.
      
      	If the GRND_RANDOM bit is not set, then the /dev/urandom pool
      	will be used.  Unlike using read(2) to fetch data from
      	/dev/urandom, if the urandom pool has not been sufficiently
      	initialized, getrandom(2) will block (or return -1 with the
      	errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags).
      
      	The getentropy(2) system call in OpenBSD can be emulated using
      	the following function:
      
                  int getentropy(void *buf, size_t buflen)
                  {
                          int     ret;
      
                          if (buflen > 256)
                                  goto failure;
                          ret = getrandom(buf, buflen, 0);
                          if (ret < 0)
                                  return ret;
                          if (ret == buflen)
                                  return 0;
                  failure:
                          errno = EIO;
                          return -1;
                  }
      
      RETURN VALUE
             On success, the number of bytes that was filled in the buf is
             returned.  This may not be all the bytes requested by the
             caller via buflen if insufficient entropy was present in the
             /dev/random pool, or if the system call was interrupted by a
             signal.
      
             On error, -1 is returned, and errno is set appropriately.
      
      ERRORS
      	EINVAL		An invalid flag was passed to getrandom(2)
      
      	EFAULT		buf is outside the accessible address space.
      
      	EAGAIN		The requested entropy was not available, and
      			getentropy(2) would have blocked if the
      			GRND_NONBLOCK flag was not set.
      
      	EINTR		While blocked waiting for entropy, the call was
      			interrupted by a signal handler; see the description
      			of how interrupted read(2) calls on "slow" devices
      			are handled with and without the SA_RESTART flag
      			in the signal(7) man page.
      
      NOTES
      	For small requests (buflen <= 256) getrandom(2) will not
      	return EINTR when reading from the urandom pool once the
      	entropy pool has been initialized, and it will return all of
      	the bytes that have been requested.  This is the recommended
      	way to use getrandom(2), and is designed for compatibility
      	with OpenBSD's getentropy() system call.
      
      	However, if you are using GRND_RANDOM, then getrandom(2) may
      	block until the entropy accounting determines that sufficient
      	environmental noise has been gathered such that getrandom(2)
      	will be operating as a NRBG instead of a DRBG for those people
      	who are working in the NIST SP 800-90 regime.  Since it may
      	block for a long time, these guarantees do *not* apply.  The
      	user may want to interrupt a hanging process using a signal,
      	so blocking until all of the requested bytes are returned
      	would be unfriendly.
      
      	For this reason, the user of getrandom(2) MUST always check
      	the return value, in case it returns some error, or if fewer
      	bytes than requested was returned.  In the case of
      	!GRND_RANDOM and small request, the latter should never
      	happen, but the careful userspace code (and all crypto code
      	should be careful) should check for this anyway!
      
      	Finally, unless you are doing long-term key generation (and
      	perhaps not even then), you probably shouldn't be using
      	GRND_RANDOM.  The cryptographic algorithms used for
      	/dev/urandom are quite conservative, and so should be
      	sufficient for all purposes.  The disadvantage of GRND_RANDOM
      	is that it can block, and the increased complexity required to
      	deal with partially fulfilled getrandom(2) requests.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NZach Brown <zab@zabbo.net>
      c6e9d6f3
  35. 19 7月, 2014 1 次提交
    • K
      seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook 提交于
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      48dc92b9