1. 18 Jul 2018: 1 commit
  2. 06 Jun 2018: 1 commit
• rseq: Introduce restartable sequences system call · d7822b1e
By Mathieu Desnoyers
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
      
Work on restartable critical sections (per-cpu atomics) was started by
Paul Turner and Andrew Hunter; it lets the kernel handle restarting
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitate porting to other
architectures and speed up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
The production workload sees a 1-2% improvement in average
response-time latency, and overall P99 latency drops by 2-3%.
      
      * Reading the current CPU number
      
Reading the current CPU number on which the caller thread is running
is sped up by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
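
[Editor's illustration, not part of the original commit: a minimal
user-space sketch of registering a struct rseq and reading cpu_id from
it. It assumes a CONFIG_RSEQ kernel whose headers provide
<linux/rseq.h> and __NR_rseq; 0x53053053 is the x86 signature value
used by the kernel selftests and is architecture-specific.]

	#include <linux/rseq.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdio.h>

	/* Per-thread rseq area; the ABI requires 32-byte alignment. */
	static __thread volatile struct rseq rseq_area
			__attribute__((aligned(32)));

	int main(void)
	{
		/* Register the area with the kernel; the last argument is
		 * the architecture-specific signature (hypothetical x86
		 * value from the selftests). */
		if (syscall(__NR_rseq, &rseq_area, sizeof(rseq_area),
			    0, 0x53053053))
			return 1;

		/* The kernel keeps cpu_id up to date across preemption,
		 * so a plain memory load replaces a getcpu() syscall. */
		printf("running on CPU %u\n", rseq_area.cpu_id);
		return 0;
	}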
      
Keeping the current CPU id in a memory area shared between kernel and
user-space has the following benefits over the mechanisms currently
available for reading the current CPU number:
      
- 35x speedup on ARM vs a system call through glibc,
- 20x speedup on x86 compared to calling glibc, which calls the vdso
  executing an "lsl" instruction,
- 14x speedup on x86 compared to an inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from inline
  assembly, which makes it a useful building block for restartable
  sequences,
- The approach of reading the cpu id through a memory mapping shared
  between kernel and user-space is portable (e.g. to ARM), which is
  not the case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
- Speed (benchmark taken on v8 of the patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
3. 03 May 2018: 1 commit
• aio: implement io_pgetevents · 7a074e96
By Christoph Hellwig
This is the io_getevents equivalent of ppoll/pselect and allows
signals and aio completions to be mixed properly (especially with
IOCB_CMD_POLL); it atomically executes the following sequence:
      
      	sigset_t origmask;
      
      	pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
      	ret = io_getevents(ctx, min_nr, nr, events, timeout);
      	pthread_sigmask(SIG_SETMASK, &origmask, NULL);
      
Note that unlike many other signal-related calls we do not pass a
sigmask size, as that would take us to 7 arguments, which aren't
easily supported by the syscall infrastructure.  It seems a lot less
painful to just add a new syscall variant in the unlikely case we're
going to increase the sigset size.
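
[Editor's illustration, not from the original commit: a hedged
user-space wrapper for the new syscall, assuming headers that provide
__NR_io_pgetevents. The two-field struct mirrors the layout of the
kernel's struct __aio_sigset described above; the wrapper and struct
names here are hypothetical.]

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stddef.h>
	#include <time.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/aio_abi.h>

	/* Pointer-plus-size pair, packed into one struct so the syscall
	 * stays within six arguments. */
	struct aio_sigset_arg {
		const sigset_t *sigmask;
		size_t sigsetsize;
	};

	static long my_io_pgetevents(aio_context_t ctx, long min_nr,
				     long nr, struct io_event *events,
				     struct timespec *timeout,
				     const sigset_t *sigmask)
	{
		/* 8 = size of the kernel's 64-bit sigset_t, not glibc's
		 * much larger sigset_t. */
		struct aio_sigset_arg usig = { sigmask, 8 };

		return syscall(__NR_io_pgetevents, ctx, min_nr, nr,
			       events, timeout, sigmask ? &usig : NULL);
	}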
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      7a074e96
4. 20 Apr 2018: 1 commit
• y2038: ipc: Use __kernel_timespec · 21fc538d
By Arnd Bergmann
This is a preparation for changing __kernel_timespec over to 64-bit
times, which involves assigning new system call numbers to
mq_timedsend(), mq_timedreceive() and semtimedop() for compatibility
with future y2038-proof user space.
      
      The existing ABIs will remain available through compat code.
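
[Editor's note, added for reference: the y2038-safe type these
syscalls move to uses 64-bit seconds on all ABIs; its uapi definition
is essentially the following (tv_sec is __kernel_time64_t, i.e.
long long):]

	struct __kernel_timespec {
		long long tv_sec;	/* 64-bit seconds on all ABIs */
		long long tv_nsec;	/* nanoseconds */
	};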
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      21fc538d
5. 19 Apr 2018: 2 commits
• time: Change nanosleep to safe __kernel_* types · 01909974
By Deepa Dinamani
Change the clock_nanosleep syscalls over to the y2038-safe
__kernel_timespec type. This will enable switching these syscalls
to new y2038-safe syscalls once the architectures define
CONFIG_64BIT_TIME.

Note that the nanosleep syscall is deprecated and there is no plan to
make it y2038-safe. It continues to work as before on 64-bit machines,
and on 32-bit machines it keeps working correctly until y2038 through
the existing compat syscall version. There is no new syscall for
supporting 64-bit time_t on 32-bit architectures.
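
[Editor's illustration, not from the commit: user-space keeps calling
the POSIX interface unchanged; only the kernel-internal type changes
to __kernel_timespec.]

	#define _POSIX_C_SOURCE 200112L
	#include <time.h>

	int main(void)
	{
		/* Sleep for one second; the user-visible POSIX API is
		 * unaffected by the kernel-internal type change. */
		struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
		return clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
	}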
      
      Cc: linux-api@vger.kernel.org
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      01909974
• time: Change types to new y2038 safe __kernel_* types · 6d5b8413
By Deepa Dinamani
Change the clock_settime, clock_gettime and clock_getres syscalls
over to __kernel_timespec. This will enable switching these syscalls
to new y2038-safe syscalls once the architectures define
CONFIG_64BIT_TIME.
      
      Cc: linux-api@vger.kernel.org
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      6d5b8413
6. 09 Apr 2018: 1 commit
• syscalls/core, syscalls/x86: Clean up syscall stub naming convention · e145242e
By Dominik Brodowski
Tidy the naming convention for compat syscall stubs. Hints which
describe the purpose of the stub go in front and receive a double
underscore to denote that they are generated on the fly by the
SYSCALL_DEFINEx() macro.
      
      For the generic case, this means (0xffffffff prefix removed):
      
       810f08d0 t     kernel_waitid	# common C function (see kernel/exit.c)
      
       <inline>     __do_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
       810f1aa0 T   __se_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long;
      				# casts them to the declared type)
      
       810f1aa0 T        sys_waitid	# alias to __se_sys_waitid() (taking
      				# parameters as declared), to be included
      				# in syscall table
      
      For x86, the naming is as follows:
      
       810efc70 t     kernel_waitid	# common C function (see kernel/exit.c)
      
       <inline>     __do_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
       810efd60 t   __se_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long;
      				# casts them to the declared type)
      
       810f1140 T __ia32_sys_waitid	# IA32_EMULATION 32-bit-ptregs -> C stub,
      				# calls __se_sys_waitid(); to be included
      				# in syscall table
      
       810f1110 T        sys_waitid	# x86 64-bit-ptregs -> C stub, calls
      				# __se_sys_waitid(); to be included in
      				# syscall table
      
For x86, sys_waitid() will be renamed to __x64_sys_waitid() in a
follow-up patch.
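
[Editor's illustration: a rough, simplified sketch (not the actual
macro output) of the generic __do_/__se_ split that SYSCALL_DEFINEx()
generates for waitid, to make the listing above concrete.]

	/* inlined helper doing the actual work; takes the parameters
	 * with their originally declared types */
	static inline long __do_sys_waitid(int which, pid_t upid,
					   struct siginfo __user *infop,
					   int options,
					   struct rusage __user *ru)
	{
		/* ... body of the syscall ... */
		return 0;
	}

	/* sign-extending stub: receives raw longs from the syscall
	 * entry path and casts them to the declared types */
	asmlinkage long __se_sys_waitid(long which, long upid, long infop,
					long options, long ru)
	{
		return __do_sys_waitid((int)which, (pid_t)upid,
				       (struct siginfo __user *)infop,
				       (int)options,
				       (struct rusage __user *)ru);
	}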
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180409105145.5364-2-linux@dominikbrodowski.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e145242e
7. 05 Apr 2018: 2 commits
• syscalls/x86: Use 'struct pt_regs' based syscall calling convention for 64-bit syscalls · fa697140
By Dominik Brodowski
      Let's make use of ARCH_HAS_SYSCALL_WRAPPER=y on pure 64-bit x86-64 systems:
      
Each syscall defines a stub which takes struct pt_regs as its only
argument. It decodes just those parameters it needs, e.g.:
      
      	asmlinkage long sys_xyzzy(const struct pt_regs *regs)
      	{
      		return SyS_xyzzy(regs->di, regs->si, regs->dx);
      	}
      
      This approach avoids leaking random user-provided register content down
      the call chain.
      
      For example, for sys_recv() which is a 4-parameter syscall, the assembly
      now is (in slightly reordered fashion):
      
      	<sys_recv>:
      		callq	<__fentry__>
      
		/* decode regs->si, ->dx, ->r10 and finally ->di
		   (loaded last, since %rdi holds the pt_regs pointer) */
		mov	0x68(%rdi),%rsi
		mov	0x60(%rdi),%rdx
		mov	0x38(%rdi),%rcx
		mov	0x70(%rdi),%rdi
      
      		[ SyS_recv() is automatically inlined by the compiler,
      		  as it is not [yet] used anywhere else ]
      		/* clear %r9 and %r8, the 5th and 6th args */
      		xor	%r9d,%r9d
      		xor	%r8d,%r8d
      
      		/* do the actual work */
      		callq	__sys_recvfrom
      
      		/* cleanup and return */
      		cltq
      		retq
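
[Editor's illustration: in C terms, a hedged sketch of what this stub
amounts to; the real code is generated by macros, and the exact
__sys_recvfrom() call shown is simplified.]

	asmlinkage long sys_recv(const struct pt_regs *regs)
	{
		/* decode only the four parameters recv() uses; the 5th
		 * and 6th recvfrom() arguments are passed as NULL */
		return __sys_recvfrom(regs->di, (void __user *)regs->si,
				      regs->dx, regs->r10, NULL, NULL);
	}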
      
      The only valid place in an x86-64 kernel which rightfully calls
      a syscall function on its own -- vsyscall -- needs to be modified
      to pass struct pt_regs onwards as well.
      
      To keep the syscall table generation working independent of
      SYSCALL_PTREGS being enabled, the stubs are named the same as the
      "original" syscall stubs, i.e. sys_*().
      
      This patch is based on an original proof-of-concept
      
       | From: Linus Torvalds <torvalds@linux-foundation.org>
       | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      and was split up and heavily modified by me, in particular to base it on
      ARCH_HAS_SYSCALL_WRAPPER, to limit it to 64-bit-only for the time being,
      and to update the vsyscall to the new calling convention.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180405095307.3730-4-linux@dominikbrodowski.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fa697140
• syscalls/core: Introduce CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y · 1bd21c6c
By Dominik Brodowski
      It may be useful for an architecture to override the definitions of the
      SYSCALL_DEFINE0() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>,
      in particular to use a different calling convention for syscalls.
      
This patch provides a mechanism to do so: it introduces
CONFIG_ARCH_HAS_SYSCALL_WRAPPER. If it is enabled,
<asm/syscall_wrapper.h> is included in <linux/syscalls.h> and may be
used to define the macros mentioned above. Moreover, as the syscall
calling convention may be different if CONFIG_ARCH_HAS_SYSCALL_WRAPPER
is set, the syscall function prototypes in <linux/syscalls.h> are
#ifndef'd out in that case.
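
[Editor's illustration: a hedged, paraphrased sketch of the resulting
hook in <linux/syscalls.h>, not the literal patch.]

	#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
	/* the architecture supplies SYSCALL_DEFINE0() and
	 * __SYSCALL_DEFINEx(), e.g. emitting pt_regs-based stubs */
	#include <asm/syscall_wrapper.h>
	#else
	/* generic definitions of the SYSCALL_DEFINEx() macros, plus
	 * the usual asmlinkage long sys_*() prototypes */
	#endif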
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180405095307.3730-3-linux@dominikbrodowski.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1bd21c6c
8. 03 Apr 2018: 31 commits