1. 06 Jun 2018 (2 commits)
    • rseq: Introduce restartable sequences system call · d7822b1e
      Committed by Mathieu Desnoyers
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
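      
      As a minimal user-space sketch of the registration step (assuming a
      kernel built with CONFIG_RSEQ=y, uapi headers that provide struct rseq
      and __NR_rseq, and no other rseq registration active for the thread;
      the RSEQ_SIG value below is the x86 signature used by the kernel
      selftests and is architecture-specific):
      
        #define _GNU_SOURCE
        #include <linux/rseq.h>      /* struct rseq (uapi) */
        #include <sys/syscall.h>     /* __NR_rseq */
        #include <unistd.h>
        #include <stdio.h>
        
        #define RSEQ_SIG 0x53053053  /* arch-specific abort signature */
        
        /* One registered area per thread, aligned as the ABI requires. */
        static __thread struct rseq rseq_area __attribute__((aligned(32)));
        
        int main(void)
        {
                if (syscall(__NR_rseq, &rseq_area, sizeof(rseq_area), 0,
                            RSEQ_SIG)) {
                        perror("rseq registration");
                        return 1;
                }
                printf("registered; current cpu: %d\n",
                       (int)rseq_area.cpu_id);
                return 0;
        }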
      
      * Restartable sequences (per-cpu atomics)
      
      Restartable sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work was started
      by Paul Turner and Andrew Hunter; it lets the kernel handle restarts
      of critical sections [1] [2]. The re-implementation proposed here
      brings a few simplifications to the ABI which facilitate porting to
      other architectures and speed up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      On the production workload, average response-time latency improves
      by 1-2%, and overall P99 latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Reading the current CPU number of the caller thread is sped up by
      keeping it up to date within the cpu_id field of the memory area
      registered by the thread. This is done by making scheduler preemption
      set the TIF_NOTIFY_RESUME flag on the current thread. Upon return to
      user-space, a notify-resume handler updates the current CPU value
      within the registered user-space memory area. User-space can then
      read the current CPU number directly from memory.
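      
      A hedged sketch of the resulting fast path, building on the
      registration sketch above (the negative cpu_id values cover the
      uninitialized and failed-registration cases defined by the uapi
      header; sched_getcpu() is the portable glibc fallback):
      
        #include <sched.h>                   /* sched_getcpu() */
        
        /* rseq_area: the per-thread area registered in the sketch above. */
        static inline int read_current_cpu(void)
        {
                /* Single load from the kernel-updated shared area. */
                int cpu = (int)__atomic_load_n(&rseq_area.cpu_id,
                                               __ATOMIC_RELAXED);
        
                if (cpu < 0)                 /* not (yet) registered */
                        cpu = sched_getcpu();
                return cpu;
        }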
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space has the following benefits over existing mechanisms for
      reading the current CPU number:
      
      - 35x speedup on ARM vs a system call through glibc.
      - 20x speedup on x86 compared to calling glibc, which calls the vdso
        executing a "lsl" instruction.
      - 14x speedup on x86 compared to an inlined "lsl" instruction.
      - Unlike vdso approaches, this cpu_id value can be read from inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through a memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not
        the case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
      
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
    • uapi/headers: Provide types_32_64.h · b575e837
      Committed by Mathieu Desnoyers
      Provide helper macros for fields which represent pointers in the
      kernel-userspace ABI. This facilitates handling of 32-bit user-space
      by 64-bit kernels: on 32-bit architectures, those fields are defined
      as a 32-bit zero-padding word plus a 32-bit integer, which allows the
      kernel to treat them as 64-bit integers. The order of the padding and
      the 32-bit integer depends on endianness; a sketch of the pattern
      follows below.
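      
      A sketch of the pattern under those assumptions (macro and struct
      names here are illustrative, not necessarily the header's; byte-order
      macros as provided by <endian.h> on glibc):
      
        #include <endian.h>
        #include <linux/types.h>
        
        #ifdef __LP64__
        # define FIELD_u32_u64(field)   __u64 field
        #else  /* 32-bit: padding word + value, value in the low 32 bits */
        # if __BYTE_ORDER == __BIG_ENDIAN
        #  define FIELD_u32_u64(field)  __u32 field ## _padding, field
        # else
        #  define FIELD_u32_u64(field)  __u32 field, field ## _padding
        # endif
        #endif
        
        /* A pointer-carrying ABI field that is 64 bits on every ABI. */
        struct example_abi {
                FIELD_u32_u64(ptr);
        };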
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-2-mathieu.desnoyers@efficios.com
      b575e837
  2. 05 Jun 2018 (1 commit)
    • swait: strengthen language to discourage use · c5e7a7ea
      Committed by Linus Torvalds
      We already earlier discouraged people from using this interface in
      commit 88796e7e ("sched/swait: Document it clearly that the swait
      facilities are special and shouldn't be used"), but I just got a pull
      request with a new broken user.
      
      So make the comment *really* clear.
      
      The swait interfaces are bad, and should not be used unless you have
      some *very* strong reasons that include tons of hard performance numbers
      on just why you want to use them, and you show that you actually
      understand that they aren't at all like the normal wait/wakeup
      interfaces.
      
      So far, every single user has been suspect.  The main user is KVM, which
      is completely pointless (there is only ever one waiter, which avoids the
      interface subtleties, but also means that having a queue instead of a
      pointer is counter-productive and certainly not an "optimization").
      
      So make the comments much stronger.
      
      Not that anybody likely reads them anyway, but there's always some
      slight hope that it will cause somebody to think twice.
      
      I'd like to remove this interface entirely, but there is the theoretical
      possibility that it's actually the right thing to use in some situation,
      most likely some deep RT use.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5e7a7ea
  3. 03 Jun 2018 (1 commit)
    • block: don't use blocking queue entered for recursive bio submits · cd4a4ae4
      Committed by Jens Axboe
      If we end up splitting a bio and the queue goes away between
      the initial submission and the later split submission, then we
      can block forever in blk_queue_enter() waiting for the reference
      to drop to zero. This will never happen, since we already hold
      a reference.
      
      Mark a split bio as already having entered the queue, so we can
      just use the live non-blocking queue enter variant.
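      
      A hedged user-space model of the pattern (all names illustrative,
      not the kernel's): a freeze waits for the reference count to reach
      zero, and the blocking entry waits for the freeze to finish, so a
      submitter that already holds a reference must take the non-blocking
      variant or it waits on itself:
      
        #include <stdatomic.h>
        
        struct queue {
                atomic_int  refs;    /* submitters inside the queue */
                atomic_bool frozen;  /* freeze pending; waits for refs==0 */
        };
        
        /* Blocking variant: waits until no freeze is pending. Deadlocks
         * if the caller itself holds one of the references the freeze is
         * waiting on. */
        static void queue_enter(struct queue *q)
        {
                while (atomic_load(&q->frozen))
                        ;            /* the kernel would sleep here */
                atomic_fetch_add(&q->refs, 1);
        }
        
        /* Non-blocking variant for callers that already hold a reference
         * keeping the queue alive, e.g. the split half of a bio. */
        static void queue_enter_live(struct queue *q)
        {
                atomic_fetch_add(&q->refs, 1);
        }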
      
      Thanks to Tetsuo Handa for the analysis.
      
      Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cd4a4ae4
  4. 02 Jun 2018 (1 commit)
    • bpf: fix uapi hole for 32 bit compat applications · 36f9814a
      Committed by Daniel Borkmann
      On 64 bit, we have a 4 byte hole between ifindex and netns_dev in
      both struct bpf_map_info and struct bpf_prog_info. Net-next commit
      b85fab0e ("bpf: Add gpl_compatible flag to struct bpf_prog_info")
      added a bitfield into the latter to expose some flags related to
      programs. Thus, add an unnamed __u32 bitfield to both so that the
      alignment stays the same in the 32 and 64 bit cases, and can be
      naturally extended from there as in b85fab0e. A sketch of the
      resulting struct tail follows the pahole output below.
      
      Before:
      
        # file test.o
        test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
        # pahole test.o
        struct bpf_map_info {
      	__u32                      type;                 /*     0     4 */
      	__u32                      id;                   /*     4     4 */
      	__u32                      key_size;             /*     8     4 */
      	__u32                      value_size;           /*    12     4 */
      	__u32                      max_entries;          /*    16     4 */
      	__u32                      map_flags;            /*    20     4 */
      	char                       name[16];             /*    24    16 */
      	__u32                      ifindex;              /*    40     4 */
      	__u64                      netns_dev;            /*    44     8 */
      	__u64                      netns_ino;            /*    52     8 */
      
      	/* size: 64, cachelines: 1, members: 10 */
      	/* padding: 4 */
        };
      
      After (same as on 64 bit):
      
        # file test.o
        test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
        # pahole test.o
        struct bpf_map_info {
      	__u32                      type;                 /*     0     4 */
      	__u32                      id;                   /*     4     4 */
      	__u32                      key_size;             /*     8     4 */
      	__u32                      value_size;           /*    12     4 */
      	__u32                      max_entries;          /*    16     4 */
      	__u32                      map_flags;            /*    20     4 */
      	char                       name[16];             /*    24    16 */
      	__u32                      ifindex;              /*    40     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	__u64                      netns_dev;            /*    48     8 */
      	__u64                      netns_ino;            /*    56     8 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      
      	/* size: 64, cachelines: 1, members: 10 */
      	/* sum members: 60, holes: 1, sum holes: 4 */
        };
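      
      A sketch of the fix's shape (field names from the pahole output
      above; the unnamed __u32 bitfield is the addition the commit message
      describes, shown here on the struct tail only):
      
        #include <linux/types.h>
        
        struct bpf_map_info {
                /* ... fields up to and including name[16], as above ... */
                __u32 ifindex;
                __u32 :32;       /* unnamed pad: makes the 4-byte hole
                                    explicit so the 32- and 64-bit layouts
                                    stay identical */
                __u64 netns_dev;
                __u64 netns_ino;
        };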
      Reported-by: Dmitry V. Levin <ldv@altlinux.org>
      Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
      Fixes: 52775b33 ("bpf: offload: report device information about offloaded maps")
      Fixes: 675fc275 ("bpf: offload: report device information for offloaded programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      36f9814a
  5. 01 Jun 2018 (5 commits)
  6. 31 May 2018 (9 commits)
  7. 30 May 2018 (1 commit)
  8. 29 May 2018 (18 commits)
  9. 28 May 2018 (2 commits)