1. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  2. 05 6月, 2018 1 次提交
    • L
      swait: strengthen language to discourage use · c5e7a7ea
      Linus Torvalds 提交于
      We already earlier discouraged people from using this interface in
      commit 88796e7e ("sched/swait: Document it clearly that the swait
      facilities are special and shouldn't be used"), but I just got a pull
      request with a new broken user.
      
      So make the comment *really* clear.
      
      The swait interfaces are bad, and should not be used unless you have
      some *very* strong reasons that include tons of hard performance numbers
      on just why you want to use them, and you show that you actually
      understand that they aren't at all like the normal wait/wakeup
      interfaces.
      
      So far, every single user has been suspect.  The main user is KVM, which
      is completely pointless (there is only ever one waiter, which avoids the
      interface subtleties, but also means that having a queue instead of a
      pointer is counter-productive and certainly not an "optimization").
      
      So make the comments much stronger.
      
      Not that anybody likely reads them anyway, but there's always some
      slight hope that it will cause somebody to think twice.
      
      I'd like to remove this interface entirely, but there is the theoretical
      possibility that it's actually the right thing to use in some situation,
      most likely some deep RT use.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5e7a7ea
  3. 03 6月, 2018 1 次提交
    • J
      block: don't use blocking queue entered for recursive bio submits · cd4a4ae4
      Jens Axboe 提交于
      If we end up splitting a bio and the queue goes away between
      the initial submission and the later split submission, then we
      can block forever in blk_queue_enter() waiting for the reference
      to drop to zero. This will never happen, since we already hold
      a reference.
      
      Mark a split bio as already having entered the queue, so we can
      just use the live non-blocking queue enter variant.
      
      Thanks to Tetsuo Handa for the analysis.
      
      Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cd4a4ae4
  4. 01 6月, 2018 5 次提交
  5. 31 5月, 2018 5 次提交
  6. 30 5月, 2018 1 次提交
  7. 29 5月, 2018 8 次提交
  8. 28 5月, 2018 2 次提交
  9. 26 5月, 2018 9 次提交
  10. 25 5月, 2018 5 次提交
    • H
      884571f0
    • S
      perf/core: Fix bad use of igrab() · 9511bce9
      Song Liu 提交于
      As Miklos reported and suggested:
      
       "This pattern repeats two times in trace_uprobe.c and in
        kernel/events/core.c as well:
      
            ret = kern_path(filename, LOOKUP_FOLLOW, &path);
            if (ret)
                goto fail_address_parse;
      
            inode = igrab(d_inode(path.dentry));
            path_put(&path);
      
        And it's wrong.  You can only hold a reference to the inode if you
        have an active ref to the superblock as well (which is normally
        through path.mnt) or holding s_umount.
      
        This way unmounting the containing filesystem while the tracepoint is
        active will give you the "VFS: Busy inodes after unmount..." message
        and a crash when the inode is finally put.
      
        Solution: store path instead of inode."
      
      This patch fixes the issue in kernel/event/core.c.
      Reviewed-and-tested-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 375637bc ("perf/core: Introduce address range filtering")
      Link: http://lkml.kernel.org/r/20180418062907.3210386-2-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9511bce9
    • S
      perf/core: Fix group scheduling with mixed hw and sw events · a1150c20
      Song Liu 提交于
      When hw and sw events are mixed in the same group, they are all attached
      to the hw perf_event_context. This sometimes requires moving group of
      perf_event to a different context.
      
      We found a bug in how the kernel handles this, for example if we do:
      
         perf stat -e '{faults,ref-cycles,faults}'  -I 1000
      
           1.005591180              1,297      faults
           1.005591180        457,476,576      ref-cycles
           1.005591180    <not supported>      faults
      
      First, sw event "faults" is attached to the sw context, and becomes the
      group leader. Then, hw event "ref-cycles" is attached, so both events
      are moved to the hw context. Last, another sw "faults" tries to attach,
      but it fails because of mismatch between the new target ctx (from sw
      pmu) and the group_leader's ctx (hw context, same as ref-cycles).
      
      The broken condition is:
         group_leader is sw event;
         group_leader is on hw context;
         add a sw event to the group.
      
      Fix this scenario by checking group_leader's context (instead of just
      event type). If group_leader is on hw context, use the ->pmu of this
      context to look up context for the new event.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: b04243ef ("perf: Complete software pmu grouping")
      Link: http://lkml.kernel.org/r/20180503194716.162815-1-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a1150c20
    • J
      Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" · d883c6cf
      Joonsoo Kim 提交于
      This reverts the following commits that change CMA design in MM.
      
       3d2054ad ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
      
       1d47a3ec ("mm/cma: remove ALLOC_CMA")
      
       bad8c6c0 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
      
      Ville reported a following error on i386.
      
        Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
        microcode: microcode updated early to revision 0x4, date = 2013-06-28
        Initializing CPU#0
        Initializing HighMem for node 0 (000377fe:00118000)
        Initializing Movable for node 0 (00000001:00118000)
        BUG: Bad page state in process swapper  pfn:377fe
        page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
        flags: 0x80000000()
        raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
        page dumped because: nonzero mapcount
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
        Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
        Call Trace:
         dump_stack+0x60/0x96
         bad_page+0x9a/0x100
         free_pages_check_bad+0x3f/0x60
         free_pcppages_bulk+0x29d/0x5b0
         free_unref_page_commit+0x84/0xb0
         free_unref_page+0x3e/0x70
         __free_pages+0x1d/0x20
         free_highmem_page+0x19/0x40
         add_highpages_with_active_regions+0xab/0xeb
         set_highmem_pages_init+0x66/0x73
         mem_init+0x1b/0x1d7
         start_kernel+0x17a/0x363
         i386_start_kernel+0x95/0x99
         startup_32_smp+0x164/0x168
      
      The reason for this error is that the span of MOVABLE_ZONE is extended
      to whole node span for future CMA initialization, and, normal memory is
      wrongly freed here.  I submitted the fix and it seems to work, but,
      another problem happened.
      
      It's so late time to fix the later problem so I decide to reverting the
      series.
      Reported-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Acked-by: NLaura Abbott <labbott@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d883c6cf
    • M
      blk-mq: avoid starving tag allocation after allocating process migrates · e6fc4649
      Ming Lei 提交于
      When the allocation process is scheduled back and the mapped hw queue is
      changed, fake one extra wake up on previous queue for compensating wake
      up miss, so other allocations on the previous queue won't be starved.
      
      This patch fixes one request allocation hang issue, which can be
      triggered easily in case of very low nr_request.
      
      The race is as follows:
      
      1) 2 hw queues, nr_requests are 2, and wake_batch is one
      
      2) there are 3 waiters on hw queue 0
      
      3) two in-flight requests in hw queue 0 are completed, and only two
         waiters of 3 are waken up because of wake_batch, but both the two
         waiters can be scheduled to another CPU and cause to switch to hw
         queue 1
      
      4) then the 3rd waiter will wait for ever, since no in-flight request
         is in hw queue 0 any more.
      
      5) this patch fixes it by the fake wakeup when waiter is scheduled to
         another hw queue
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      
      Modified commit message to make it clearer, and make it apply on
      top of the 4.18 branch.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e6fc4649
  11. 24 5月, 2018 1 次提交
    • D
      bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Daniel Borkmann 提交于
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out of bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") seems to be incorrect
      in what it is enforcing. In the 1st variant, the mask is applied
      from the map with the significantly larger number of entries where
      we would allow to a certain degree out of bounds speculation for
      the smaller map, and in the 2nd variant where the mask is applied
      from the map with the smaller number of entries, we get buggy
      behavior since we truncate the index of the larger map.
      
      The original intent from commit b2157399 is to reject such
      occasions where two or more different tail call maps are used
      in the same tail call helper invocation. However, the check on
      the BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups but in case of tail calls we basically used the tail
      call map in insn_aux_data that was processed in the most recent
      path which the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that in
      fixup_bpf_calls() the poison check is then hit for tail calls
      and the program rejected. Latter only happens in unprivileged
      case since this is the *only* occasion where a rewrite needs to
      happen, and where such rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      this will turn the generic instruction sequence into a non-
      generic one. Thus the patch_call_imm will simply update the
      insn->imm location where the bpf_map_lookup_elem() will later
      take care of the dispatch. Given we need this 'poison' state
      as a check, the information of whether a map is an unpriv_array
      gets lost, so enforcing it prior to that needs an additional
      state. In general this check is needed since there are some
      complex and tail call intensive BPF programs out there where
      LLVM tends to generate such code occasionally. We therefore
      convert the map_ptr rather into map_state to store all this
      w/o extra memory overhead, and the bit whether one of the maps
      involved in the collision was from an unpriv_array thus needs
      to be retained as well there.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c93552c4
  12. 23 5月, 2018 1 次提交