1. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  2. 26 5月, 2018 1 次提交
  3. 18 5月, 2018 1 次提交
  4. 12 5月, 2018 1 次提交
  5. 07 5月, 2018 1 次提交
  6. 12 4月, 2018 3 次提交
  7. 08 4月, 2018 1 次提交
  8. 06 4月, 2018 2 次提交
    • S
      init, tracing: Have printk come through the trace events for initcall_debug · 4e37958d
      Steven Rostedt (VMware) 提交于
      With trace events set before and after the initcall function calls, instead
      of having a separate routine for printing out the initcalls when
      initcall_debug is specified on the kernel command line, have the code
      register a callback to the tracepoints where the initcall trace events are.
      
      This removes the need for having a separate function to do the initcalls as
      the tracepoint callbacks can handle the printk. It also includes other
      initcalls that are not called by the do_one_initcall() which includes
      console and security initcalls.
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      4e37958d
    • S
      init, tracing: Add initcall trace events · 4ee7c60d
      Steven Rostedt (VMware) 提交于
      Being able to trace the start and stop of initcalls is useful to see where
      the timings are an issue. There is already an "initcall_debug" parameter,
      but that can cause a large overhead itself, as the printing of the
      information may take longer than the initcall functions.
      
      Adding in a start and finish trace event around the initcall functions, as
      well as a trace event that records the level of the initcalls, one can get a
      much finer measurement of the times and interactions of the initcalls
      themselves, as trace events are much lighter than printk()s.
      Suggested-by: NAbderrahmane Benbachir <abderrahmane.benbachir@polymtl.ca>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      4ee7c60d
  9. 05 4月, 2018 2 次提交
    • D
      syscalls/core: Prepare CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y for compat syscalls · 7303e30e
      Dominik Brodowski 提交于
      It may be useful for an architecture to override the definitions of the
      COMPAT_SYSCALL_DEFINE0() and __COMPAT_SYSCALL_DEFINEx() macros in
      <linux/compat.h>, in particular to use a different calling convention
      for syscalls. This patch provides a mechanism to do so, based on the
      previously introduced CONFIG_ARCH_HAS_SYSCALL_WRAPPER. If it is enabled,
      <asm/sycall_wrapper.h> is included in <linux/compat.h> and may be used
      to define the macros mentioned above. Moreover, as the syscall calling
      convention may be different if CONFIG_ARCH_HAS_SYSCALL_WRAPPER is set,
      the compat syscall function prototypes in <linux/compat.h> are #ifndef'd
      out in that case.
      
      As some of the syscalls and/or compat syscalls may not be present,
      the COND_SYSCALL() and COND_SYSCALL_COMPAT() macros in kernel/sys_ni.c
      as well as the SYS_NI() and COMPAT_SYS_NI() macros in
      kernel/time/posix-stubs.c can be re-defined in <asm/syscall_wrapper.h> iff
      CONFIG_ARCH_HAS_SYSCALL_WRAPPER is enabled.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180405095307.3730-5-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7303e30e
    • D
      syscalls/core: Introduce CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y · 1bd21c6c
      Dominik Brodowski 提交于
      It may be useful for an architecture to override the definitions of the
      SYSCALL_DEFINE0() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>,
      in particular to use a different calling convention for syscalls.
      
      This patch provides a mechanism to do so: It introduces
      CONFIG_ARCH_HAS_SYSCALL_WRAPPER. If it is enabled, <asm/sycall_wrapper.h>
      is included in <linux/syscalls.h> and may be used to define the macros
      mentioned above. Moreover, as the syscall calling convention may be
      different if CONFIG_ARCH_HAS_SYSCALL_WRAPPER is set, the syscall function
      prototypes in <linux/syscalls.h> are #ifndef'd out in that case.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180405095307.3730-3-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1bd21c6c
  10. 03 4月, 2018 24 次提交
  11. 26 3月, 2018 1 次提交
  12. 23 3月, 2018 1 次提交
    • S
      init: Fix initcall0 name as it is "pure" not "early" · a6fb6012
      Steven Rostedt (VMware) 提交于
      The early_initcall() functions get assigned to __initcall_start[]. These are
      called by do_pre_smp_initcalls(). The initcall_levels[] array starts with
      __initcall0_start[], and initcall_levels[] are to match the
      initcall_level_names[] array. The first name in that array is "early", but
      that is not correct. As pure_initcall() functions get assigned to
      __initcall0_start[] array.
      
      Change the first name in initcall_level_names[] array to "pure".
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      a6fb6012
  13. 20 3月, 2018 1 次提交
    • J
      jump_label: Disable jump labels in __exit code · 578ae447
      Josh Poimboeuf 提交于
      With the following commit:
      
        33352244 ("jump_label: Explicitly disable jump labels in __init code")
      
      ... we explicitly disabled jump labels in __init code, so they could be
      detected and not warned about in the following commit:
      
        dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      
      In-kernel __exit code has the same issue.  It's never used, so it's
      freed along with the rest of initmem.  But jump label entries in __exit
      code aren't explicitly disabled, so we get the following warning when
      enabling pr_debug() in __exit code:
      
        can't patch jump_label at dmi_sysfs_exit+0x0/0x2d
        WARNING: CPU: 0 PID: 22572 at kernel/jump_label.c:376 __jump_label_update+0x9d/0xb0
      
      Fix the warning by disabling all jump labels in initmem (which includes
      both __init and __exit code).
      Reported-and-tested-by: NLi Wang <liwang@redhat.com>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      Link: http://lkml.kernel.org/r/7121e6e595374f06616c505b6e690e275c0054d1.1521483452.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      578ae447