- 18 1月, 2020 1 次提交
-
-
由 Aleksa Sarai 提交于
/* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014b ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com> Signed-off-by: NAleksa Sarai <cyphar@cyphar.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 1月, 2020 1 次提交
-
-
由 Sargun Dhillon 提交于
This wires up the pidfd_getfd syscall for all architectures. Signed-off-by: NSargun Dhillon <sargun@sargun.me> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NArnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-4-sargun@sargun.meSigned-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 07 9月, 2019 1 次提交
-
-
由 Arnd Bergmann 提交于
As Vincent noticed, the y2038 conversion of semtimedop in linux-5.1 broke when commit 00bf25d6 ("y2038: use time32 syscall names on 32-bit") changed all system calls on all architectures that take a 32-bit time_t to point to the _time32 implementation, but left out semtimedop in the asm-generic header. This affects all 32-bit architectures using asm-generic/unistd.h: h8300, unicore32, openrisc, nios2, hexagon, c6x, arc, nds32 and csky. The notable exception is riscv32, which has dropped support for the time32 system calls entirely. Reported-by: NVincent Chen <deanbo422@gmail.com> Cc: stable@vger.kernel.org Cc: Vincent Chen <deanbo422@gmail.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Stafford Horne <shorne@gmail.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Ley Foon Tan <lftan@altera.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com> Cc: Guo Ren <guoren@kernel.org> Fixes: 00bf25d6 ("y2038: use time32 syscall names on 32-bit") Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 15 7月, 2019 1 次提交
-
-
由 Christian Brauner 提交于
This lets us catch new architectures that implicitly make use of clone3 without setting __ARCH_WANT_SYS_CLONE3. Failing on missing __ARCH_WANT_SYS_CLONE3 is a good indicator that they either did not really want this syscall or haven't really thought about whether it needs special treatment and just accidently included it in their entrypoints by e.g. generating their syscall table automatically via asm-generic/unistd.h This patch has been compile-tested for the h8300 architecture which is one of the architectures that does not yet implement clone3 and generates its syscall table via asm-generic/unistd.h. Signed-off-by: NChristian Brauner <christian@brauner.io> Suggested-by: NArnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20190714192205.27190-3-christian@brauner.ioReviewed-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NChristian Brauner <christian@brauner.io>
-
- 28 6月, 2019 1 次提交
-
-
由 Christian Brauner 提交于
This wires up the pidfd_open() syscall into all arches at once. Signed-off-by: NChristian Brauner <christian@brauner.io> Reviewed-by: NDavid Howells <dhowells@redhat.com> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
-
- 09 6月, 2019 1 次提交
-
-
由 Christian Brauner 提交于
Wire up the clone3() call on all arches that don't require hand-rolled assembly. Some of the arches look like they need special assembly massaging and it is probably smarter if the appropriate arch maintainers would do the actual wiring. Arches that are wired-up are: - x86{_32,64} - arm{64} - xtensa Signed-off-by: NChristian Brauner <christian@brauner.io> Acked-by: NArnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <adrian@lisas.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
-
- 17 5月, 2019 1 次提交
-
-
由 David Howells 提交于
Wire up the mount API syscalls on non-x86 arches. Reported-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NDavid Howells <dhowells@redhat.com> Reviewed-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 06 3月, 2019 1 次提交
-
-
由 Christian Brauner 提交于
The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. With this patch a process can be killed via: #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: NChristian Brauner <christian@brauner.io> Reviewed-by: NTycho Andersen <tycho@tycho.ws> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NDavid Howells <dhowells@redhat.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NSerge Hallyn <serge@hallyn.com> Acked-by: NAleksa Sarai <cyphar@cyphar.com>
-
- 28 2月, 2019 2 次提交
-
-
由 Jens Axboe 提交于
If we have fixed user buffers, we can map them into the kernel when we setup the io_uring. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must call io_uring_register() after having setup an io_uring instance, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and the nr_args should contain how many iovecs the application wishes to map. If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring instance. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring instance. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. For now, buffers must not be file backed. If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 20 2月, 2019 1 次提交
-
-
由 Arnd Bergmann 提交于
We don't want new architectures to even provide the old 32-bit time_t based system calls any more, or define the syscall number macros. Add a new __ARCH_WANT_TIME32_SYSCALLS macro that gets enabled for all existing 32-bit architectures using the generic system call table, so we don't change any current behavior. Since this symbol is evaluated in user space as well, we cannot use a Kconfig CONFIG_* macro but have to define it in uapi/asm/unistd.h. On 64-bit architectures, the same system call numbers mostly refer to the system calls we want to keep, as they already pass 64-bit time_t. As new architectures no longer provide these, we need new exceptions in checksyscalls.sh. Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 19 2月, 2019 1 次提交
-
-
由 Yury Norov 提交于
The newer prlimit64 syscall provides all the functionality of getrlimit and setrlimit syscalls and adds the pid of target process, so future architectures won't need to include getrlimit and setrlimit. Therefore drop getrlimit and setrlimit syscalls from the generic syscall list unless __ARCH_WANT_SET_GET_RLIMIT is defined by the architecture's unistd.h prior to including asm-generic/unistd.h, and adjust all architectures using the generic syscall list to define it so that no in-tree architectures are affected. Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-hexagon@vger.kernel.org Cc: uclinux-h8-devel@lists.sourceforge.jp Signed-off-by: NYury Norov <ynorov@caviumnetworks.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Acked-by: Mark Salter <msalter@redhat.com> [c6x] Acked-by: James Hogan <james.hogan@imgtec.com> [metag] Acked-by: Ley Foon Tan <lftan@altera.com> [nios2] Acked-by: Stafford Horne <shorne@gmail.com> [openrisc] Acked-by: Will Deacon <will.deacon@arm.com> [arm64] Acked-by: Vineet Gupta <vgupta@synopsys.com> #arch/arc bits Signed-off-by: NYury Norov <ynorov@marvell.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 18 2月, 2019 1 次提交
-
-
由 Yury Norov 提交于
The only difference between native and compat openat and open_by_handle_at is that non-compat version forces O_LARGEFILE, and it should be the default behaviour for all architectures, as we are going to drop the support of 32-bit userspace off_t. Signed-off-by: NYury Norov <ynorov@caviumnetworks.com> Signed-off-by: NYury Norov <ynorov@marvell.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 07 2月, 2019 3 次提交
-
-
由 Arnd Bergmann 提交于
This adds 21 new system calls on each ABI that has 32-bit time_t today. All of these have the exact same semantics as their existing counterparts, and the new ones all have macro names that end in 'time64' for clarification. This gets us to the point of being able to safely use a C library that has 64-bit time_t in user space. There are still a couple of loose ends to tie up in various areas of the code, but this is the big one, and should be entirely uncontroversial at this point. In particular, there are four system calls (getitimer, setitimer, waitid, and getrusage) that don't have a 64-bit counterpart yet, but these can all be safely implemented in the C library by wrapping around the existing system calls because the 32-bit time_t they pass only counts elapsed time, not time since the epoch. They will be dealt with later. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
-
由 Arnd Bergmann 提交于
This is the big flip, where all 32-bit architectures set COMPAT_32BIT_TIME and use the _time32 system calls from the former compat layer instead of the system calls that take __kernel_timespec and similar arguments. The temporary redirects for __kernel_timespec, __kernel_itimerspec and __kernel_timex can get removed with this. It would be easy to split this commit by architecture, but with the new generated system call tables, it's easy enough to do it all at once, which makes it a little easier to check that the changes are the same in each table. Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 Arnd Bergmann 提交于
A lot of system calls that pass a time_t somewhere have an implementation using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have been reworked so that this implementation can now be used on 32-bit architectures as well. The missing step is to redefine them using the regular SYSCALL_DEFINEx() to get them out of the compat namespace and make it possible to build them on 32-bit architectures. Any system call that ends in 'time' gets a '32' suffix on its name for that version, while the others get a '_time32' suffix, to distinguish them from the normal version, which takes a 64-bit time argument in the future. In this step, only 64-bit architectures are changed, doing this rename first lets us avoid touching the 32-bit architectures twice. Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 26 1月, 2019 1 次提交
-
-
由 Arnd Bergmann 提交于
The IPC system call handling is highly inconsistent across architectures, some use sys_ipc, some use separate calls, and some use both. We also have some architectures that require passing IPC_64 in the flags, and others that set it implicitly. For the addition of a y2038 safe semtimedop() system call, I chose to only support the separate entry points, but that requires first supporting the regular ones with their own syscall numbers. The IPC_64 is now implied by the new semctl/shmctl/msgctl system calls even on the architectures that require passing it with the ipc() multiplexer. I'm not adding the new semtimedop() or semop() on 32-bit architectures, those will get implemented using the new semtimedop_time64() version that gets added along with the other time64 calls. Three 64-bit architectures (powerpc, s390 and sparc) get semtimedop(). Signed-off-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
-
- 06 12月, 2018 2 次提交
-
-
由 Guo Ren 提交于
The broken macros make the glibc compile error. If there is no __NR3264_fstat*, we should also removed related definitions. Reported-by: NMarcin Juszkiewicz <marcin.juszkiewicz@linaro.org> Fixes: bf4b6a7d ("y2038: Remove stat64 family from default syscall set") [arnd: Both Marcin and Guo provided this patch to fix up my clearly broken commit, I applied the version with the better changelog.] Signed-off-by: NGuo Ren <ren_guo@c-sky.com> Signed-off-by: NMao Han <han_mao@c-sky.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 AKASHI Takahiro 提交于
The initial user of this system call number is arm64. Signed-off-by: NAKASHI Takahiro <takahiro.akashi@linaro.org> Acked-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NWill Deacon <will.deacon@arm.com>
-
- 29 8月, 2018 1 次提交
-
-
由 Arnd Bergmann 提交于
New architectures should no longer need stat64, which is not y2038 safe and has been replaced by statx(). This removes the 'select __ARCH_WANT_STAT64' statement from asm-generic/unistd.h and instead moves it into the respective asm/unistd.h UAPI header files for each architecture that uses it today. In the generic file, the system call number and entry points are now made conditional, so newly added architectures (e.g. riscv32 or csky) will never need to carry backwards compatiblity for it. arm64 is the only 64-bit architecture using the asm-generic/unistd.h file, and it already sets __ARCH_WANT_NEW_STAT in its headers, and I use the same #ifdef here: future 64-bit architectures therefore won't see newstat or stat64 any more. They don't suffer from the y2038 time_t overflow, but for consistency it seems best to also let them use statx(). Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 11 7月, 2018 1 次提交
-
-
由 Will Deacon 提交于
The new rseq call arrived in 4.18-rc1, so provide it in the asm-generic unistd.h for architectures such as arm64. Acked-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: NWill Deacon <will.deacon@arm.com>
-
- 03 5月, 2018 1 次提交
-
-
由 Christoph Hellwig 提交于
This is the io_getevents equivalent of ppoll/pselect and allows to properly mix signals and aio completions (especially with IOCB_CMD_POLL) and atomically executes the following sequence: sigset_t origmask; pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); ret = io_getevents(ctx, min_nr, nr, events, timeout); pthread_sigmask(SIG_SETMASK, &origmask, NULL); Note that unlike many other signal related calls we do not pass a sigmask size, as that would get us to 7 arguments, which aren't easily supported by the syscall infrastructure. It seems a lot less painful to just add a new syscall variant in the unlikely case we're going to increase the sigset size. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 26 3月, 2018 1 次提交
-
-
由 Arnd Bergmann 提交于
The score architecture used a number of old system calls for compatibility with a traditional libc port, all architectures that got added later skip these. With score out of the way, we can finally clean up the syscall list to no longer provide these. Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 02 11月, 2017 1 次提交
-
-
由 Greg Kroah-Hartman 提交于
Many user space API headers are missing licensing information, which makes it hard for compliance tools to determine the correct license. By default are files without license information under the default license of the kernel, which is GPLV2. Marking them GPLV2 would exclude them from being included in non GPLV2 code, which is obviously not intended. The user space API headers fall under the syscall exception which is in the kernels COPYING file: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". otherwise syscall usage would not be possible. Update the files which contain no license information with an SPDX license identifier. The chosen identifier is 'GPL-2.0 WITH Linux-syscall-note' which is the officially assigned identifier for the Linux syscall exception. SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org> Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 18 4月, 2017 1 次提交
-
-
由 Al Viro 提交于
Unlike normal compat syscall variants, it is needed only for biarch architectures that have different alignement requirements for u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle a store of 64bit value at 32bit-aligned address. We used to have one such (ia64), but its biarch support has been gone since 2010 (after being broken in 2008, which went unnoticed since nobody had been using it). It had escaped removal at the same time only because back in 2004 a patch that switched several syscalls on amd64 from private wrappers to generic compat ones had switched to use of compat_sys_getdents64(), which hadn't needed (or used) a compat wrapper on amd64. Let's bury it - it's at least 7 years overdue. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 20 3月, 2017 1 次提交
-
-
由 Stafford Horne 提交于
The new syscall statx is implemented as generic code, so enable it for architectures like openrisc which use the generic syscall table. Fixes: a528d35e ("statx: Add a system call to make enhanced file info available") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Signed-off-by: NStafford Horne <shorne@gmail.com> Signed-off-by: NWill Deacon <will.deacon@arm.com>
-
- 18 10月, 2016 1 次提交
-
-
由 Dave Hansen 提交于
pkey_set() and pkey_get() were syscalls present in older versions of the protection keys patches. They were fully excised from the x86 code, but some cruft was left in the generic syscall code. The C++ comments were intended to help to make it more glaring to me to fix them before actually submitting them. That technique worked, but later than I would have liked. I test-compiled this for arm64. Fixes: a60f7b69 ("generic syscalls: Wire up memory protection keys syscalls") Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Cc: linux-arch@vger.kernel.org Cc: mgorman@techsingularity.net Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 9月, 2016 1 次提交
-
-
由 Dave Hansen 提交于
These new syscalls are implemented as generic code, so enable them for architectures like arm64 which use the generic syscall table. According to Arnd: Even if the support is x86 specific for the forseeable future, it may be good to reserve the number just in case. The other architecture specific syscall lists are usually left to the individual arch maintainers, most a lot of the newer architectures share this table. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Cc: linux-arch@vger.kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: mgorman@techsingularity.net Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20160729163018.505A6875@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 05 5月, 2016 2 次提交
-
-
由 James Hogan 提交于
The newer renameat2 syscall provides all the functionality provided by the renameat syscall and adds flags, so future architectures won't need to include renameat. Therefore drop the renameat syscall from the generic syscall list unless __ARCH_WANT_RENAMEAT is defined by the architecture's unistd.h prior to including asm-generic/unistd.h, and adjust all architectures using the generic syscall list to define it so that no in-tree architectures are affected. Signed-off-by: NJames Hogan <james.hogan@imgtec.com> Acked-by: NVineet Gupta <vgupta@synopsys.com> Cc: linux-arch@vger.kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: linux-c6x-dev@linux-c6x.org Cc: Richard Kuo <rkuo@codeaurora.org> Cc: linux-hexagon@vger.kernel.org Cc: linux-metag@vger.kernel.org Cc: Jonas Bonn <jonas@southpole.se> Cc: linux@lists.openrisc.net Cc: Chen Liqin <liqin.linux@gmail.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ley Foon Tan <lftan@altera.com> Cc: nios2-dev@lists.rocketboards.org Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: uclinux-h8-devel@lists.sourceforge.jp Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 Yury Norov 提交于
Compat architectures that does not use generic unistd (mips, s390), declare compat version in their syscall tables for preadv2 and pwritev2. Generic unistd syscall table should do it as well. [arnd: this initially slipped through the review and an incorrect patch got merged. arch/tile/ is the only architecture that could be affected for their 32-bit compat mode, every other architecture we support today is fine.] Signed-off-by: NYury Norov <ynorov@caviumnetworks.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 24 4月, 2016 1 次提交
-
-
由 Andre Przywara 提交于
These new syscalls are implemented as generic code, so enable them for architectures like arm64 which use the generic syscall table. Signed-off-by: NAndre Przywara <andre.przywara@arm.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 02 12月, 2015 1 次提交
-
-
由 Zach Brown 提交于
Add a copy_file_range() system call for offloading copies between regular files. This gives an interface to underlying layers of the storage stack which can copy without reading and writing all the data. There are a few candidates that should support copy offloading in the nearer term: - btrfs shares extent references with its clone ioctl - NFS has patches to add a COPY command which copies on the server - SCSI has a family of XCOPY commands which copy in the device This system call avoids the complexity of also accelerating the creation of the destination file by operating on an existing destination file descriptor, not a path. Currently the high level vfs entry point limits copy offloading to files on the same mount and super (and not in the same file). This can be relaxed if we get implementations which can copy between file systems safely. Signed-off-by: NZach Brown <zab@redhat.com> [Anna Schumaker: Change -EINVAL to -EBADF during file verification, Change flags parameter from int to unsigned int, Add function to include/linux/syscalls.h, Check copy len after file open mode, Don't forbid ranges inside the same file, Use rw_verify_area() to veriy ranges, Use file_out rather than file_in, Add COPY_FR_REFLINK flag] Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 06 11月, 2015 1 次提交
-
-
由 Eric B Munson 提交于
With the refactored mlock code, introduce a new system call for mlock. The new call will allow the user to specify what lock states are being added. mlock2 is trivial at the moment, but a follow on patch will add a new mlock state making it useful. Signed-off-by: NEric B Munson <emunson@akamai.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 9月, 2015 1 次提交
-
-
由 Dr. David Alan Gilbert 提交于
Add the userfaultfd syscalls to uapi asm-generic, it was tested with postcopy live migration on aarch64 with both 4k and 64k pagesize kernels. Signed-off-by: NDr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 9月, 2015 1 次提交
-
-
由 Mathieu Desnoyers 提交于
Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier on all threads running on the system. It is implemented by calling synchronize_sched(). It can be used to distribute the cost of user-space memory barriers asymmetrically by transforming pairs of memory barriers into pairs consisting of sys_membarrier() and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU [1], rwlocks), the read-side can be accelerated significantly by moving the bulk of the memory barrier overhead to the write-side. The existing applications of which I am aware that would be improved by this system call are as follows: * Through Userspace RCU library (http://urcu.so) - DNS server (Knot DNS) https://www.knot-dns.cz/ - Network sniffer (http://netsniff-ng.org/) - Distributed object storage (https://sheepdog.github.io/sheepdog/) - User-space tracing (http://lttng.org) - Network storage system (https://www.gluster.org/) - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf) - Financial software (https://lkml.org/lkml/2015/3/23/189) Those projects use RCU in userspace to increase read-side speed and scalability compared to locking. Especially in the case of RCU used by libraries, sys_membarrier can speed up the read-side by moving the bulk of the memory barrier cost to synchronize_rcu(). * Direct users of sys_membarrier - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198) Microsoft core dotnet GC developers are planning to use the mprotect() side-effect of issuing memory barriers through IPIs as a way to implement Windows FlushProcessWriteBuffers() on Linux. They are referring to sys_membarrier in their github thread, specifically stating that sys_membarrier() is what they are looking for. To explain the benefit of this scheme, let's introduce two example threads: Thread A (non-frequent, e.g. executing liburcu synchronize_rcu()) Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock()) In a scheme where all smp_mb() in thread A are ordering memory accesses with respect to smp_mb() present in Thread B, we can change each smp_mb() within Thread A into calls to sys_membarrier() and each smp_mb() within Thread B into compiler barriers "barrier()". Before the change, we had, for each smp_mb() pairs: Thread A Thread B previous mem accesses previous mem accesses smp_mb() smp_mb() following mem accesses following mem accesses After the change, these pairs become: Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2). 1) Non-concurrent Thread A vs Thread B accesses: Thread A Thread B prev mem accesses sys_membarrier() follow mem accesses prev mem accesses barrier() follow mem accesses In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses. 2) Concurrent Thread A vs Thread B accesses Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() by synchronize_sched(). * Benchmarks On Intel Xeon E5405 (8 cores) (one thread is calling sys_membarrier, the other 7 threads are busy looping) 1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call. * User-space user of this system call: Userspace RCU library Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches. Results in liburcu: Operations in 10s, 6 readers, 2 writers: memory barriers in reader: 1701557485 reads, 2202847 writes signal-based scheme: 9830061167 reads, 6700 writes sys_membarrier: 9952759104 reads, 425 writes sys_membarrier (dyn. check): 7970328887 reads, 425 writes The dynamic sys_membarrier availability check adds some overhead to the read-side compared to the signal-based scheme, but besides that, sys_membarrier slightly outperforms the signal-based scheme. However, this non-expedited sys_membarrier implementation has a much slower grace period than signal and memory barrier schemes. Besides diminishing the number of wake-ups, one major advantage of the membarrier system call over the signal-based scheme is that it does not need to reserve a signal. This plays much more nicely with libraries, and with processes injected into for tracing purposes, for which we cannot expect that signals will be unused by the application. An expedited version of this system call can be added later on to speed up the grace period. Its implementation will likely depend on reading the cpu_curr()->mm without holding each CPU's rq lock. This patch adds the system call to x86 and to asm-generic. [1] http://urcu.so membarrier(2) man page: MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2) NAME membarrier - issue memory barriers on a set of threads SYNOPSIS #include <linux/membarrier.h> int membarrier(int cmd, int flags); DESCRIPTION The cmd argument is one of the following: MEMBARRIER_CMD_QUERY Query the set of supported commands. It returns a bitmask of supported commands. MEMBARRIER_CMD_SHARED Execute a memory barrier on all threads running on the system. Upon return from system call, the caller thread is ensured that all running threads have passed through a state where all memory accesses to user-space addresses match program order between entry to and return from the system call (non-running threads are de facto in such a state). This covers threads from all pro=E2=80=90 cesses running on the system. This command returns 0. The flags argument needs to be 0. For future extensions. All memory accesses performed in program order from each targeted thread is guaranteed to be ordered with respect to sys_membarrier(). If we use the semantic "barrier()" to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pair of barrier(), sys_membarrier() and smp_mb(): The pair ordering is detailed as (O: ordered, X: not ordered): barrier() smp_mb() sys_membarrier() barrier() X X O smp_mb() X O O sys_membarrier() O O O RETURN VALUE On success, these system calls return zero. On error, -1 is returned, and errno is set appropriately. For a given command, with flags argument set to 0, this system call is guaranteed to always return the same value until reboot. ERRORS ENOSYS System call is not implemented. EINVAL Invalid arguments. Linux 2015-04-15 MEMBARRIER(2) Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nicholas Miell <nmiell@comcast.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 12月, 2014 1 次提交
-
-
由 David Drysdale 提交于
This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: NDavid Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 9月, 2014 1 次提交
-
-
由 Alexei Starovoitov 提交于
done as separate commit to ease conflict resolution Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 8月, 2014 1 次提交
-
-
由 Will Deacon 提交于
Commit 9183df25 ("shm: add memfd_create() syscall") added a new system call (memfd_create) but didn't update the asm-generic unistd header. This patch adds the new system call to the asm-generic version of unistd.h so that it can be used by architectures such as arm64. Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: NDavid Herrmann <dh.herrmann@gmail.com> Signed-off-by: NWill Deacon <will.deacon@arm.com>
-
- 06 8月, 2014 1 次提交
-
-
由 Theodore Ts'o 提交于
The getrandom(2) system call was requested by the LibreSSL Portable developers. It is analoguous to the getentropy(2) system call in OpenBSD. The rationale of this system call is to provide resiliance against file descriptor exhaustion attacks, where the attacker consumes all available file descriptors, forcing the use of the fallback code where /dev/[u]random is not available. Since the fallback code is often not well-tested, it is better to eliminate this potential failure mode entirely. The other feature provided by this new system call is the ability to request randomness from the /dev/urandom entropy pool, but to block until at least 128 bits of entropy has been accumulated in the /dev/urandom entropy pool. Historically, the emphasis in the /dev/urandom development has been to ensure that urandom pool is initialized as quickly as possible after system boot, and preferably before the init scripts start execution. This is because changing /dev/urandom reads to block represents an interface change that could potentially break userspace which is not acceptable. In practice, on most x86 desktop and server systems, in general the entropy pool can be initialized before it is needed (and in modern kernels, we will printk a warning message if not). However, on an embedded system, this may not be the case. And so with this new interface, we can provide the functionality of blocking until the urandom pool has been initialized. Any userspace program which uses this new functionality must take care to assure that if it is used during the boot process, that it will not cause the init scripts or other portions of the system startup to hang indefinitely. SYNOPSIS #include <linux/random.h> int getrandom(void *buf, size_t buflen, unsigned int flags); DESCRIPTION The system call getrandom() fills the buffer pointed to by buf with up to buflen random bytes which can be used to seed user space random number generators (i.e., DRBG's) or for other cryptographic uses. It should not be used for Monte Carlo simulations or other programs/algorithms which are doing probabilistic sampling. If the GRND_RANDOM flags bit is set, then draw from the /dev/random pool instead of the /dev/urandom pool. The /dev/random pool is limited based on the entropy that can be obtained from environmental noise, so if there is insufficient entropy, the requested number of bytes may not be returned. If there is no entropy available at all, getrandom(2) will either block, or return an error with errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags. If the GRND_RANDOM bit is not set, then the /dev/urandom pool will be used. Unlike using read(2) to fetch data from /dev/urandom, if the urandom pool has not been sufficiently initialized, getrandom(2) will block (or return -1 with the errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags). The getentropy(2) system call in OpenBSD can be emulated using the following function: int getentropy(void *buf, size_t buflen) { int ret; if (buflen > 256) goto failure; ret = getrandom(buf, buflen, 0); if (ret < 0) return ret; if (ret == buflen) return 0; failure: errno = EIO; return -1; } RETURN VALUE On success, the number of bytes that was filled in the buf is returned. This may not be all the bytes requested by the caller via buflen if insufficient entropy was present in the /dev/random pool, or if the system call was interrupted by a signal. On error, -1 is returned, and errno is set appropriately. ERRORS EINVAL An invalid flag was passed to getrandom(2) EFAULT buf is outside the accessible address space. EAGAIN The requested entropy was not available, and getentropy(2) would have blocked if the GRND_NONBLOCK flag was not set. EINTR While blocked waiting for entropy, the call was interrupted by a signal handler; see the description of how interrupted read(2) calls on "slow" devices are handled with and without the SA_RESTART flag in the signal(7) man page. NOTES For small requests (buflen <= 256) getrandom(2) will not return EINTR when reading from the urandom pool once the entropy pool has been initialized, and it will return all of the bytes that have been requested. This is the recommended way to use getrandom(2), and is designed for compatibility with OpenBSD's getentropy() system call. However, if you are using GRND_RANDOM, then getrandom(2) may block until the entropy accounting determines that sufficient environmental noise has been gathered such that getrandom(2) will be operating as a NRBG instead of a DRBG for those people who are working in the NIST SP 800-90 regime. Since it may block for a long time, these guarantees do *not* apply. The user may want to interrupt a hanging process using a signal, so blocking until all of the requested bytes are returned would be unfriendly. For this reason, the user of getrandom(2) MUST always check the return value, in case it returns some error, or if fewer bytes than requested was returned. In the case of !GRND_RANDOM and small request, the latter should never happen, but the careful userspace code (and all crypto code should be careful) should check for this anyway! Finally, unless you are doing long-term key generation (and perhaps not even then), you probably shouldn't be using GRND_RANDOM. The cryptographic algorithms used for /dev/urandom are quite conservative, and so should be sufficient for all purposes. The disadvantage of GRND_RANDOM is that it can block, and the increased complexity required to deal with partially fulfilled getrandom(2) requests. Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NZach Brown <zab@zabbo.net>
-
- 19 7月, 2014 1 次提交
-
-
由 Kees Cook 提交于
This adds the new "seccomp" syscall with both an "operation" and "flags" parameter for future expansion. The third argument is a pointer value, used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...). In addition to the TSYNC flag later in this patch series, there is a non-zero chance that this syscall could be used for configuring a fixed argument area for seccomp-tracer-aware processes to pass syscall arguments in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter" for this syscall. Additionally, this syscall uses operation, flags, and user pointer for arguments because strictly passing arguments via a user pointer would mean seccomp itself would be unable to trivially filter the seccomp syscall itself. Signed-off-by: NKees Cook <keescook@chromium.org> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
-