1. 05 6月, 2018 1 次提交
    • L
      swait: strengthen language to discourage use · c5e7a7ea
      Linus Torvalds 提交于
      We already earlier discouraged people from using this interface in
      commit 88796e7e ("sched/swait: Document it clearly that the swait
      facilities are special and shouldn't be used"), but I just got a pull
      request with a new broken user.
      
      So make the comment *really* clear.
      
      The swait interfaces are bad, and should not be used unless you have
      some *very* strong reasons that include tons of hard performance numbers
      on just why you want to use them, and you show that you actually
      understand that they aren't at all like the normal wait/wakeup
      interfaces.
      
      So far, every single user has been suspect.  The main user is KVM, which
      is completely pointless (there is only ever one waiter, which avoids the
      interface subtleties, but also means that having a queue instead of a
      pointer is counter-productive and certainly not an "optimization").
      
      So make the comments much stronger.
      
      Not that anybody likely reads them anyway, but there's always some
      slight hope that it will cause somebody to think twice.
      
      I'd like to remove this interface entirely, but there is the theoretical
      possibility that it's actually the right thing to use in some situation,
      most likely some deep RT use.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5e7a7ea
  2. 03 6月, 2018 1 次提交
    • J
      block: don't use blocking queue entered for recursive bio submits · cd4a4ae4
      Jens Axboe 提交于
      If we end up splitting a bio and the queue goes away between
      the initial submission and the later split submission, then we
      can block forever in blk_queue_enter() waiting for the reference
      to drop to zero. This will never happen, since we already hold
      a reference.
      
      Mark a split bio as already having entered the queue, so we can
      just use the live non-blocking queue enter variant.
      
      Thanks to Tetsuo Handa for the analysis.
      
      Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cd4a4ae4
  3. 01 6月, 2018 5 次提交
  4. 31 5月, 2018 5 次提交
  5. 30 5月, 2018 1 次提交
  6. 29 5月, 2018 8 次提交
  7. 28 5月, 2018 2 次提交
  8. 26 5月, 2018 9 次提交
  9. 25 5月, 2018 5 次提交
    • H
      884571f0
    • S
      perf/core: Fix bad use of igrab() · 9511bce9
      Song Liu 提交于
      As Miklos reported and suggested:
      
       "This pattern repeats two times in trace_uprobe.c and in
        kernel/events/core.c as well:
      
            ret = kern_path(filename, LOOKUP_FOLLOW, &path);
            if (ret)
                goto fail_address_parse;
      
            inode = igrab(d_inode(path.dentry));
            path_put(&path);
      
        And it's wrong.  You can only hold a reference to the inode if you
        have an active ref to the superblock as well (which is normally
        through path.mnt) or holding s_umount.
      
        This way unmounting the containing filesystem while the tracepoint is
        active will give you the "VFS: Busy inodes after unmount..." message
        and a crash when the inode is finally put.
      
        Solution: store path instead of inode."
      
      This patch fixes the issue in kernel/event/core.c.
      Reviewed-and-tested-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 375637bc ("perf/core: Introduce address range filtering")
      Link: http://lkml.kernel.org/r/20180418062907.3210386-2-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9511bce9
    • S
      perf/core: Fix group scheduling with mixed hw and sw events · a1150c20
      Song Liu 提交于
      When hw and sw events are mixed in the same group, they are all attached
      to the hw perf_event_context. This sometimes requires moving group of
      perf_event to a different context.
      
      We found a bug in how the kernel handles this, for example if we do:
      
         perf stat -e '{faults,ref-cycles,faults}'  -I 1000
      
           1.005591180              1,297      faults
           1.005591180        457,476,576      ref-cycles
           1.005591180    <not supported>      faults
      
      First, sw event "faults" is attached to the sw context, and becomes the
      group leader. Then, hw event "ref-cycles" is attached, so both events
      are moved to the hw context. Last, another sw "faults" tries to attach,
      but it fails because of mismatch between the new target ctx (from sw
      pmu) and the group_leader's ctx (hw context, same as ref-cycles).
      
      The broken condition is:
         group_leader is sw event;
         group_leader is on hw context;
         add a sw event to the group.
      
      Fix this scenario by checking group_leader's context (instead of just
      event type). If group_leader is on hw context, use the ->pmu of this
      context to look up context for the new event.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: b04243ef ("perf: Complete software pmu grouping")
      Link: http://lkml.kernel.org/r/20180503194716.162815-1-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a1150c20
    • J
      Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" · d883c6cf
      Joonsoo Kim 提交于
      This reverts the following commits that change CMA design in MM.
      
       3d2054ad ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
      
       1d47a3ec ("mm/cma: remove ALLOC_CMA")
      
       bad8c6c0 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
      
      Ville reported a following error on i386.
      
        Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
        microcode: microcode updated early to revision 0x4, date = 2013-06-28
        Initializing CPU#0
        Initializing HighMem for node 0 (000377fe:00118000)
        Initializing Movable for node 0 (00000001:00118000)
        BUG: Bad page state in process swapper  pfn:377fe
        page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
        flags: 0x80000000()
        raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
        page dumped because: nonzero mapcount
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
        Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
        Call Trace:
         dump_stack+0x60/0x96
         bad_page+0x9a/0x100
         free_pages_check_bad+0x3f/0x60
         free_pcppages_bulk+0x29d/0x5b0
         free_unref_page_commit+0x84/0xb0
         free_unref_page+0x3e/0x70
         __free_pages+0x1d/0x20
         free_highmem_page+0x19/0x40
         add_highpages_with_active_regions+0xab/0xeb
         set_highmem_pages_init+0x66/0x73
         mem_init+0x1b/0x1d7
         start_kernel+0x17a/0x363
         i386_start_kernel+0x95/0x99
         startup_32_smp+0x164/0x168
      
      The reason for this error is that the span of MOVABLE_ZONE is extended
      to whole node span for future CMA initialization, and, normal memory is
      wrongly freed here.  I submitted the fix and it seems to work, but,
      another problem happened.
      
      It's so late time to fix the later problem so I decide to reverting the
      series.
      Reported-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Acked-by: NLaura Abbott <labbott@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d883c6cf
    • M
      blk-mq: avoid starving tag allocation after allocating process migrates · e6fc4649
      Ming Lei 提交于
      When the allocation process is scheduled back and the mapped hw queue is
      changed, fake one extra wake up on previous queue for compensating wake
      up miss, so other allocations on the previous queue won't be starved.
      
      This patch fixes one request allocation hang issue, which can be
      triggered easily in case of very low nr_request.
      
      The race is as follows:
      
      1) 2 hw queues, nr_requests are 2, and wake_batch is one
      
      2) there are 3 waiters on hw queue 0
      
      3) two in-flight requests in hw queue 0 are completed, and only two
         waiters of 3 are waken up because of wake_batch, but both the two
         waiters can be scheduled to another CPU and cause to switch to hw
         queue 1
      
      4) then the 3rd waiter will wait for ever, since no in-flight request
         is in hw queue 0 any more.
      
      5) this patch fixes it by the fake wakeup when waiter is scheduled to
         another hw queue
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      
      Modified commit message to make it clearer, and make it apply on
      top of the 4.18 branch.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e6fc4649
  10. 24 5月, 2018 1 次提交
    • D
      bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Daniel Borkmann 提交于
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out of bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") seems to be incorrect
      in what it is enforcing. In the 1st variant, the mask is applied
      from the map with the significantly larger number of entries where
      we would allow to a certain degree out of bounds speculation for
      the smaller map, and in the 2nd variant where the mask is applied
      from the map with the smaller number of entries, we get buggy
      behavior since we truncate the index of the larger map.
      
      The original intent from commit b2157399 is to reject such
      occasions where two or more different tail call maps are used
      in the same tail call helper invocation. However, the check on
      the BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups but in case of tail calls we basically used the tail
      call map in insn_aux_data that was processed in the most recent
      path which the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that in
      fixup_bpf_calls() the poison check is then hit for tail calls
      and the program rejected. Latter only happens in unprivileged
      case since this is the *only* occasion where a rewrite needs to
      happen, and where such rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      this will turn the generic instruction sequence into a non-
      generic one. Thus the patch_call_imm will simply update the
      insn->imm location where the bpf_map_lookup_elem() will later
      take care of the dispatch. Given we need this 'poison' state
      as a check, the information of whether a map is an unpriv_array
      gets lost, so enforcing it prior to that needs an additional
      state. In general this check is needed since there are some
      complex and tail call intensive BPF programs out there where
      LLVM tends to generate such code occasionally. We therefore
      convert the map_ptr rather into map_state to store all this
      w/o extra memory overhead, and the bit whether one of the maps
      involved in the collision was from an unpriv_array thus needs
      to be retained as well there.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c93552c4
  11. 23 5月, 2018 1 次提交
  12. 20 5月, 2018 1 次提交
    • A
      bpf: Prevent memory disambiguation attack · af86ca4e
      Alexei Starovoitov 提交于
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      af86ca4e