1. 25 Jan, 2018 (1 commit)
    • perf/core: Fix lock inversion between perf,trace,cpuhp · 82d94856
      Committed by Peter Zijlstra
      Lockdep gifted us with noticing the following 4-way lockup scenario:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1             mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                 static_key_enable()
      
       #2     do_cpu_up()
                perf_event_init_cpu()
       #3         mutex_lock(&pmus_lock)
       #4         mutex_lock(&ctx->mutex)
      
              perf_event_task_disable()
                mutex_lock(&current->perf_event_mutex)
       #4       ctx = perf_event_ctx_lock()
       #5       perf_event_for_each_child()
      
              do_exit()
                task_work_run()
                  __fput()
                    perf_release()
                      perf_event_release_kernel()
       #4               mutex_lock(&ctx->mutex)
       #5               mutex_lock(&event->child_mutex)
                        free_event()
                          _free_event()
                            event->destroy() := perf_trace_destroy
       #0                     mutex_lock(&event_mutex);
      
      Fix that by moving the free_event() out from under the locks.
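
      Roughly, the fix moves the children onto a local list while the locks are
      held and frees them only afterwards (a sketch of the pattern, not the
      literal diff; free_list is illustrative):

        struct perf_event *child, *tmp;
        LIST_HEAD(free_list);

        mutex_lock(&ctx->mutex);
        mutex_lock(&event->child_mutex);
        list_for_each_entry_safe(child, tmp, &event->child_list, child_list)
                list_move(&child->child_list, &free_list);   /* defer the free */
        mutex_unlock(&event->child_mutex);
        mutex_unlock(&ctx->mutex);

        /* event_mutex (taken via event->destroy()) no longer nests in #4/#5 */
        list_for_each_entry_safe(child, tmp, &free_list, child_list)
                free_event(child);
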
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
2. 19 Jan, 2018 (2 commits)
    • tracing: Fix converting enum's from the map in trace_event_eval_update() · 1ebe1eaf
      Committed by Steven Rostedt (VMware)
      Since enums do not get converted by the TRACE_EVENT macro into their values,
      the event format displays the enum name and not the value. This breaks
      tools like perf and trace-cmd that need to interpret the raw binary data. To
      solve this, an enum map was created to convert these enums into their actual
      numbers on boot up. This is done by TRACE_EVENTS() adding a
      TRACE_DEFINE_ENUM() macro.
      
      Some enums were not being converted. This was caused by an optimization that
      had a bug in it.
      
      All calls get checked against this enum map to see if it should be converted
      or not, and it compares the call's system to the system that the enum map
      was created under. If they match, then the call is processed.
      
      To cut down on the number of iterations needed to find the maps with a
      matching system, since calls and maps are grouped by system, when a match is
      made, the index into the map array is saved, so that the next call, if it
      belongs to the same system as the previous call, could start right at that
      array index and not have to scan all the previous arrays.
      
      The problem was, the saved index was used as the variable to know if this is
      a call in a new system or not. If the index was zero, it was assumed that
      the call is in a new system and would keep incrementing the saved index
      until it found a matching system. The issue arises when the first matching
      system was at index zero. The next map, if it belonged to the same system,
      would then think it was the first match and increment the index to one. If
      the next call belonged to the same system, it would begin its search of the
      maps off by one, and miss the first enum that should be converted. This left
      a single enum not converted properly.
      
      Also add a comment to describe exactly what that index was for. It took me a
      bit too long to figure out what I was thinking when debugging this issue.
      
      Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0c564a53 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ring-buffer: Fix duplicate results in mapping context to bits in recursive lock · 0164e0d7
      Committed by Steven Rostedt (VMware)
      In bringing back the context checks, the code first checks whether it is in
      normal (non-interrupt) context, and then for NMI, then IRQ, then softirq. The
      final check is redundant: since the if branch is only hit if the context is
      one of NMI, IRQ, or SOFTIRQ, if it is not NMI or IRQ there is no reason to
      check whether it is SOFTIRQ. The current code returns the same result even if
      it is not a SOFTIRQ, which is confusing.
      
        pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ
      
      Is redundant as RB_CTX_SOFTIRQ *is* 2!
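
      With the redundant branch gone, the mapping reduces to something like
      this (sketch):

        unsigned long pc = preempt_count();
        int bit;

        if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
                bit = RB_CTX_NORMAL;
        else
                bit = pc & NMI_MASK ? RB_CTX_NMI :
                      pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ;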
      
      Fixes: a0e3a18f ("ring-buffer: Bring back context level recursive checks")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
3. 18 Jan, 2018 (2 commits)
    • irq/matrix: Spread interrupts on allocation · a0c9259d
      Committed by Thomas Gleixner
      Keith reported an issue with vector space exhaustion on a server machine
      which is caused by the i40e driver allocating 168 MSI interrupts when the
      driver is initialized, even when most of these interrupts are not used at
      all.
      
      The x86 vector allocation code tries to avoid the immediate allocation with
      the reservation mode, but the card uses MSI and does not support MSI entry
      masking, which prevents reservation mode and requires immediate vector
      allocation.
      
      The matrix allocator is a bit naive and prefers the first CPU in the
      cpumask which describes the possible target CPUs for an allocation. That
      results in allocating all 168 vectors on CPU0 which later causes vector
      space exhaustion when the NVMe driver tries to allocate managed interrupts
      on each CPU for the per CPU queues.
      
      Avoid this by finding the CPU which has the lowest vector allocation count
      to spread out the non-managed interrupts across the possible target CPUs.
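
      A sketch of that selection (simplified; the real allocator's field names
      may differ slightly):

        static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
                                                 const struct cpumask *msk)
        {
                unsigned int cpu, best_cpu = UINT_MAX, maxavl = 0;
                struct cpumap *cm;

                /* prefer the online CPU with the most vectors still available */
                for_each_cpu(cpu, msk) {
                        cm = per_cpu_ptr(m->maps, cpu);
                        if (!cm->online || cm->available <= maxavl)
                                continue;
                        best_cpu = cpu;
                        maxavl = cm->available;
                }
                return best_cpu;
        }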
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Keith Busch <keith.busch@intel.com>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos
    • bpf: mark dst unknown on inconsistent {s, u}bounds adjustments · 6f16101e
      Committed by Daniel Borkmann
      syzkaller generated a BPF proglet and triggered a warning with
      the following:
      
        0: (b7) r0 = 0
        1: (d5) if r0 s<= 0x0 goto pc+0
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        2: (1f) r0 -= r1
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        verifier internal error: known but bad sbounds
      
      What happens is that in the first insn, r0's min/max value
      are both 0 due to the immediate assignment, later in the jsle
      test the bounds are updated for the min value in the false
      path, meaning, they yield smin_val = 1, smax_val = 0, and when
      ctx pointer is subtracted from r0, verifier bails out with the
      internal error and throwing a WARN since smin_val != smax_val
      for the known constant.
      
      For the min_val > max_val scenario this means that reg_set_min_max()
      and reg_set_min_max_inv() (which both refine existing bounds)
      demonstrated that such a branch cannot be taken at runtime.
      
      In above scenario for the case where it will be taken, the
      existing [0, 0] bounds are kept intact. Meaning, the rejection
      is not due to a verifier internal error, and therefore the
      WARN() is not necessary either.
      
      We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
      when either known scalars have smin_val != smax_val or
      umin_val != umax_val or any scalar reg with bounds
      smin_val > smax_val or umin_val > umax_val. However, there
      may be a small risk of breakage of buggy programs, so handle
      this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
      just taint the dst reg as unknown scalar when we see ops with
      such kind of src reg.
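
      Conceptually the handling looks like this (a sketch, not the exact
      verifier diff; the helper call is illustrative):

        /* "known constant" src with inconsistent bounds: the branch analysis
         * proved this combination cannot happen at runtime, so do not WARN,
         * just stop tracking the destination register */
        if (src_known && (src_reg->smin_value != src_reg->smax_value ||
                          src_reg->umin_value != src_reg->umax_value)) {
                mark_reg_unknown(env, regs, insn->dst_reg);
                return 0;
        }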
      
      Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4. 17 Jan, 2018 (1 commit)
5. 16 Jan, 2018 (3 commits)
    • delayacct: Account blkio completion on the correct task · c96f5471
      Committed by Josh Snyder
      Before commit:
      
        e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      
      delayacct_blkio_end() was called after context-switching into the task which
      completed I/O.
      
      This resulted in double counting: the task would account a delay both waiting
      for I/O and for time spent in the runqueue.
      
      With e33a9bba, delayacct_blkio_end() is called by try_to_wake_up().
      In ttwu, we have not yet context-switched. This is more correct, in that
      the delay accounting ends when the I/O is complete.
      
      But delayacct_blkio_end() relies on 'get_current()', and we have not yet
      context-switched into the task whose I/O completed. This results in the
      wrong task having its delay accounting statistics updated.
      
      Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
      so that it can update the statistics of the correct task.
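
      The interface change boils down to this (sketch):

        /* before: implicitly accounted on current, which is not yet the
         * task whose I/O completed */
        void delayacct_blkio_end(void);

        /* after: the waker passes the task explicitly */
        void delayacct_blkio_end(struct task_struct *p);

        /* try_to_wake_up() then does, roughly: */
        if (p->in_iowait) {
                delayacct_blkio_end(p);         /* account on p, not current */
                atomic_dec(&task_rq(p)->nr_iowait);
        }
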
      Signed-off-by: Josh Snyder <joshs@netflix.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-block@vger.kernel.org
      Fixes: e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing: Prevent PROFILE_ALL_BRANCHES when FORTIFY_SOURCE=y · 68e76e03
      Committed by Randy Dunlap
      I regularly get 50 MB - 60 MB files during kernel randconfig builds.
      These large files mostly contain many repeats (e.g., 124,594 of them) of:
      
      In file included from ../include/linux/string.h:6:0,
                       from ../include/linux/uuid.h:20,
                       from ../include/linux/mod_devicetable.h:13,
                       from ../scripts/mod/devicetable-offsets.c:3:
      ../include/linux/compiler.h:64:4: warning: '______f' is static but declared in inline function 'strcpy' which is not static [enabled by default]
          ______f = {     \
          ^
      ../include/linux/compiler.h:56:23: note: in expansion of macro '__trace_if'
                             ^
      ../include/linux/string.h:425:2: note: in expansion of macro 'if'
        if (p_size == (size_t)-1 && q_size == (size_t)-1)
        ^
      
      This only happens when CONFIG_FORTIFY_SOURCE=y and
      CONFIG_PROFILE_ALL_BRANCHES=y, so prevent PROFILE_ALL_BRANCHES if
      FORTIFY_SOURCE=y.
      
      Link: http://lkml.kernel.org/r/9199446b-a141-c0c3-9678-a3f9107f2750@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ring-buffer: Bring back context level recursive checks · a0e3a18f
      Committed by Steven Rostedt (VMware)
      Commit 1a149d7d ("ring-buffer: Rewrite trace_recursive_(un)lock() to be
      simpler") replaced the context level recursion checks with a simple counter.
      This would prevent the ring buffer code from recursively calling itself more
      than the max number of contexts that exist (Normal, softirq, irq, nmi). But
      this change caused a lockup in a specific case, which was during suspend and
      resume using a global clock. Adding a stack dump to see where this occurred,
      the issue was in the trace global clock itself:
      
        trace_buffer_lock_reserve+0x1c/0x50
        __trace_graph_entry+0x2d/0x90
        trace_graph_entry+0xe8/0x200
        prepare_ftrace_return+0x69/0xc0
        ftrace_graph_caller+0x78/0xa8
        queued_spin_lock_slowpath+0x5/0x1d0
        trace_clock_global+0xb0/0xc0
        ring_buffer_lock_reserve+0xf9/0x390
      
      The function graph tracer traced queued_spin_lock_slowpath that was called
      by trace_clock_global. This pointed out that the trace_clock_global() is not
      reentrant, as it takes a spin lock. It depended on the ring buffer recursive
      lock to keep that from happening.
      
      By removing the context detection and adding just a max number of allowable
      recursions, it allowed the trace_clock_global() to be entered again and try
      to retake the spinlock it already held, causing a deadlock.
      
      Fixes: 1a149d7d ("ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler")
      Reported-by: David Weinehall <david.weinehall@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
6. 15 Jan, 2018 (5 commits)
    • timers: Unconditionally check deferrable base · ed4bbf79
      Committed by Thomas Gleixner
      When the timer base is checked for expired timers then the deferrable base
      must be checked as well. This was missed when making the deferrable base
      independent of base::nohz_active.
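
      The check then looks roughly like this (sketch of run_local_timers()):

        struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);

        /* Raise the softirq only if required. */
        if (time_before(jiffies, base->clk)) {
                if (!IS_ENABLED(CONFIG_NO_HZ_COMMON))
                        return;
                /* CPU is awake, so check the deferrable base as well. */
                base++;
                if (time_before(jiffies, base->clk))
                        return;
        }
        raise_softirq(TIMER_SOFTIRQ);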
      
      Fixes: ced6d5c1 ("timers: Use deferrable base independent of base::nohz_active")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Cc: rt@linutronix.de
    • bpf: fix 32-bit divide by zero · 68fda450
      Committed by Alexei Starovoitov
      Because some JITs do the if (src_reg == 0) check in 64-bit mode, for
      div/mod operations mask the upper 32 bits of the src register before
      doing the check.
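
      A userspace illustration of the problem (not kernel code): a divisor
      whose low 32 bits are zero passes a 64-bit non-zero check but still
      divides by zero in 32-bit arithmetic.

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t src = 0xffffffff00000000ULL;

                /* a 64-bit "src != 0" check does not catch this ... */
                printf("64-bit view: %s\n", src ? "non-zero" : "zero");
                /* ... but a 32-bit divide only sees the low half, which is 0 */
                printf("32-bit divisor: %u\n", (uint32_t)src);
                return 0;
        }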
      
      Fixes: 62258278 ("net: filter: x86: internal BPF JIT")
      Fixes: 7a12b503 ("sparc64: Add eBPF JIT.")
      Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • futex: Prevent overflow by strengthen input validation · fbe0e839
      Committed by Li Jinyue
      UBSAN reports signed integer overflow in kernel/futex.c:
      
       UBSAN: Undefined behaviour in kernel/futex.c:2041:18
       signed integer overflow:
       0 - -2147483648 cannot be represented in type 'int'
      
      Add a sanity check to catch negative values of nr_wake and nr_requeue.
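
      The added validation amounts to (sketch):

        /* negative wake/requeue counts make no sense and later feed signed
         * arithmetic such as 0 - INT_MIN, which overflows */
        if (nr_wake < 0 || nr_requeue < 0)
                return -EINVAL;
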
      Signed-off-by: Li Jinyue <lijinyue@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Cc: dvhart@infradead.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com
    • futex: Avoid violating the 10th rule of futex · c1e2f0ea
      Committed by Peter Zijlstra
      Julia reported futex state corruption in the following scenario:
      
         waiter                                  waker                                            stealer (prio > waiter)
      
         futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
               timeout=[N ms])
            futex_wait_requeue_pi()
               futex_wait_queue_me()
                  freezable_schedule()
                  <scheduled out>
                                                 futex(LOCK_PI, uaddr2)
                                                 futex(CMP_REQUEUE_PI, uaddr,
                                                       uaddr2, 1, 0)
                                                    /* requeues waiter to uaddr2 */
                                                 futex(UNLOCK_PI, uaddr2)
                                                       wake_futex_pi()
                                                          cmp_futex_value_locked(uaddr2, waiter)
                                                          wake_up_q()
                 <woken by waker>
                 <hrtimer_wakeup() fires,
                  clears sleeper->task>
                                                                                                 futex(LOCK_PI, uaddr2)
                                                                                                    __rt_mutex_start_proxy_lock()
                                                                                                       try_to_take_rt_mutex() /* steals lock */
                                                                                                          rt_mutex_set_owner(lock, stealer)
                                                                                                    <preempted>
               <scheduled in>
               rt_mutex_wait_proxy_lock()
                  __rt_mutex_slowlock()
                     try_to_take_rt_mutex() /* fails, lock held by stealer */
                     if (timeout && !timeout->task)
                        return -ETIMEDOUT;
                  fixup_owner()
                     /* lock wasn't acquired, so,
                        fixup_pi_state_owner skipped */
      
         return -ETIMEDOUT;
      
         /* At this point, we've returned -ETIMEDOUT to userspace, but the
          * futex word shows waiter to be the owner, and the pi_mutex has
          * stealer as the owner */
      
         futex_lock(LOCK_PI, uaddr2)
           -> bails with EDEADLK, futex word says we're owner.
      
      And suggested that what commit:
      
        73d786bd ("futex: Rework inconsistent rt_mutex/futex_q state")
      
      removes from fixup_owner() looks to be just what is needed. And indeed
      it is -- I completely missed that requeue_pi could also result in this
      case. So we need to restore that, except that subsequent patches, like
      commit:
      
        16ffa12d ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")
      
      changed all the locking rules. Even without that, the sequence:
      
      -               if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
      -                       locked = 1;
      -                       goto out;
      -               }
      
      -               raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
      -               owner = rt_mutex_owner(&q->pi_state->pi_mutex);
      -               if (!owner)
      -                       owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
      -               raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
      -               ret = fixup_pi_state_owner(uaddr, q, owner);
      
      already suggests there were races; otherwise we'd never have to look
      at next_owner.
      
      So instead of doing 3 consecutive wait_lock sections with who knows
      what races, we do it all in a single section. Additionally, the usage
      of pi_state->owner in fixup_owner() was only safe because only the
      rt_mutex owner would modify it, which this additional case wrecks.
      
      Luckily the values can only change away from, and not to, the value we're
      testing; this means we can do a speculative test and double check once
      we have the wait_lock.
      
      Fixes: 73d786bd ("futex: Rework inconsistent rt_mutex/futex_q state")
      Reported-by: Julia Cartwright <julia@ni.com>
      Reported-by: Gratian Crisan <gratian.crisan@ni.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Julia Cartwright <julia@ni.com>
      Tested-by: Gratian Crisan <gratian.crisan@ni.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net
    • bpf: fix divides by zero · c366287e
      Committed by Eric Dumazet
      Divides by zero are not nice, let's avoid them if possible.
      
      Also, do_div() does not seem to be needed when dealing with 32-bit operands,
      but this is a minor detail.
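
      In the interpreter the 32-bit guard is essentially (sketch):

        ALU_DIV_X:
                if (unlikely((u32)SRC == 0))
                        return 0;       /* terminate instead of dividing by 0 */
                DST = (u32) DST / (u32) SRC;    /* plain division, no do_div() */
                CONT;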
      
      Fixes: bd4cf0ed ("net: filter: rework/optimize internal BPF interpreter's instruction set")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7. 14 Jan, 2018 (1 commit)
8. 13 Jan, 2018 (2 commits)
9. 11 Jan, 2018 (4 commits)
    • bpf, array: fix overflow in max_entries and undefined behavior in index_mask · bbeb6e43
      Committed by Daniel Borkmann
      syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
      and thus unprivileged. With the recently added logic in b2157399
      ("bpf: prevent out-of-bounds speculation") we round this up to the next
      power of two value for max_entries for unprivileged such that we can
      apply proper masking into potentially zeroed out map slots.
      
      However, this will generate an index_mask of 0xffffffff, and therefore
      a + 1 will let this overflow into new max_entries of 0. This will pass
      allocation, etc, and later on map access we still enforce on the original
      attr->max_entries value which was 0xfffffffd, therefore triggering GPF
      all over the place. Thus bail out on overflow in such case.
      
      Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
      since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
      space is undefined. Therefore, do this by hand in a 64 bit variable.
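
      Doing the rounding by hand in a 64-bit variable looks roughly like this
      (sketch):

        u32 index_mask, max_entries = attr->max_entries;
        u64 mask64;

        mask64 = fls_long(max_entries - 1);
        mask64 = 1ULL << mask64;        /* safe even when this is 1ULL << 32 */
        mask64 -= 1;

        index_mask = mask64;
        if (unpriv) {
                /* round up array size to the nearest power of 2 ... */
                max_entries = index_mask + 1;
                /* ... and bail if the + 1 overflowed back to 0 */
                if (max_entries < attr->max_entries)
                        return ERR_PTR(-E2BIG);
        }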
      
      This fixes all the issues triggered by syzkaller's reproducers.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
      Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
      Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
      Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
      Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: arsh is not supported in 32 bit alu thus reject it · 7891a87e
      Committed by Daniel Borkmann
      The following snippet was throwing an 'unknown opcode cc' warning
      in BPF interpreter:
      
        0: (18) r0 = 0x0
        2: (7b) *(u64 *)(r10 -16) = r0
        3: (cc) (u32) r0 s>>= (u32) r0
        4: (95) exit
      
      Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
      generation, not all of them do, and the interpreter does not either. We can
      leave the existing ones and implement it later in bpf-next for the
      remaining ones, but reject this properly in the verifier for the time
      being.
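
      The verifier-side rejection is a short check along these lines (sketch):

        if (opcode == BPF_ARSH && BPF_CLASS(insn->code) != BPF_ALU64) {
                verbose(env, "BPF_ARSH not supported for 32 bit ALU\n");
                return -EINVAL;
        }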
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix spelling mistake: "obusing" -> "abusing" · 40950343
      Committed by Colin Ian King
      Trivial fix to spelling mistake in error message text.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • cgroup: make cgroup.threads delegatable · 4f58424d
      Committed by Roman Gushchin
      Make cgroup.threads file delegatable.
      The behavior of cgroup.threads should follow the behavior of cgroup.procs.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
10. 10 Jan, 2018 (2 commits)
    • membarrier: Disable preemption when calling smp_call_function_many() · 54167607
      Committed by Mathieu Desnoyers
      smp_call_function_many() requires disabling preemption around the call.
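
      The fix is essentially to wrap the call (sketch; cpus and ipi_mb stand in
      for the target cpumask and the membarrier IPI callback):

        preempt_disable();
        smp_call_function_many(cpus, ipi_mb, NULL, 1);
        preempt_enable();
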
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: <stable@vger.kernel.org> # v4.14+
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: introduce BPF_JIT_ALWAYS_ON config · 290af866
      Committed by Alexei Starovoitov
      The BPF interpreter has been used as part of the Spectre 2 attack (CVE-2017-5715).
      
      A quote from the Google Project Zero blog:
      "At this point, it would normally be necessary to locate gadgets in
      the host kernel code that can be used to actually leak data by reading
      from an attacker-controlled location, shifting and masking the result
      appropriately and then using the result of that as offset to an
      attacker-controlled address for a load. But piecing gadgets together
      and figuring out which ones work in a speculation context seems annoying.
      So instead, we decided to use the eBPF interpreter, which is built into
      the host kernel - while there is no legitimate way to invoke it from inside
      a VM, the presence of the code in the host kernel's text section is sufficient
      to make it usable for the attack, just like with ordinary ROP gadgets."
      
      To make the attacker's job harder, introduce a BPF_JIT_ALWAYS_ON config
      option that removes the interpreter from the kernel in favor of JIT-only mode.
      So far the eBPF JIT is supported by:
      x64, arm64, arm32, sparc64, s390, powerpc64, mips64
      
      The start of the JITed program is randomized and the code page is marked as
      read-only. In addition, "constant blinding" can be turned on with
      net.core.bpf_jit_harden.
      
      v2->v3:
      - move __bpf_prog_ret0 under ifdef (Daniel)
      
      v1->v2:
      - fix init order, test_bpf and cBPF (Daniel's feedback)
      - fix offloaded bpf (Jakub's feedback)
      - add 'return 0' dummy in case something can invoke prog->bpf_func
      - retarget bpf tree. For bpf-next the patch would need one extra hunk.
        It will be sent when the trees are merged back to net-next
      
      Considered doing:
        int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
      but it seems better to land the patch as-is and in bpf-next remove
      bpf_jit_enable global variable from all JITs, consolidate in one place
      and remove this jit_init() function.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
11. 09 Jan, 2018 (2 commits)
    • bpf: prevent out-of-bounds speculation · b2157399
      Committed by Alexei Starovoitov
      Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
      memory accesses under a bounds check may be speculated even if the
      bounds check fails, providing a primitive for building a side channel.
      
      To avoid leaking kernel data round up array-based maps and mask the index
      after bounds check, so speculated load with out of bounds index will load
      either valid value from the array or zero from the padded area.
      
      Unconditionally mask index for all array types even when max_entries
      are not rounded to power of 2 for root user.
      When map is created by unpriv user generate a sequence of bpf insns
      that includes AND operation to make sure that JITed code includes
      the same 'index & index_mask' operation.
      
      If prog_array map is created by unpriv user replace
        bpf_tail_call(ctx, map, index);
      with
        if (index >= max_entries) {
          index &= map->index_mask;
          bpf_tail_call(ctx, map, index);
        }
      (along with roundup to power 2) to prevent out-of-bounds speculation.
      There is secondary redundant 'if (index >= max_entries)' in the interpreter
      and in all JITs, but they can be optimized later if necessary.
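
      For plain array map lookups the same idea reads roughly as (sketch):

        static void *array_map_lookup_elem(struct bpf_map *map, void *key)
        {
                struct bpf_array *array = container_of(map, struct bpf_array, map);
                u32 index = *(u32 *)key;

                if (unlikely(index >= array->map.max_entries))
                        return NULL;

                /* the AND keeps even a mispredicted load inside the array */
                return array->value +
                       array->elem_size * (index & array->index_mask);
        }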
      
      Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
      cannot be used by unpriv, so no changes there.
      
      That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
      all architectures with and without JIT.
      
      v2->v3:
      Daniel noticed that attack potentially can be crafted via syscall commands
      without loading the program, so add masking to those paths as well.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • locking/lockdep: Remove cross-release leftovers · 527187d2
      Committed by Ingo Molnar
      There are two cross-release leftover facilities:
      
       - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently)
       - the complete_release_commit() callback (NOP as well)
      
      Remove them.
      
      Cc: David Sterba <dsterba@suse.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
12. 07 Jan, 2018 (1 commit)
13. 06 Jan, 2018 (1 commit)
14. 05 Jan, 2018 (2 commits)
15. 30 Dec, 2017 (8 commits)
    • timers: Invoke timer_start_debug() where it makes sense · fd45bb77
      Committed by Thomas Gleixner
      The timer start debug function is called before the proper timer base is
      set. As a consequence the trace data contains the stale CPU and flags
      values.
      
      Call the debug function after setting the new base and flags.
      
      Fixes: 500462a9 ("timers: Switch to a non-cascading wheel")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: rt@linutronix.de
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Link: https://lkml.kernel.org/r/20171222145337.792907137@linutronix.de
    • nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick() · 5d62c183
      Committed by Thomas Gleixner
      The conditions in irq_exit() to invoke tick_nohz_irq_exit() which
      subsequently invokes tick_nohz_stop_sched_tick() are:
      
        if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))
      
      If need_resched() is not set, but a timer softirq is pending then this is
      an indication that the softirq code punted and delegated the execution to
      softirqd. need_resched() is not true because the current interrupted task
      takes precedence over softirqd.
      
      Invoking tick_nohz_irq_exit() in this case can cause an endless loop of
      timer interrupts because the timer wheel contains an expired timer, but
      softirqs are not yet executed. So it returns an immediate expiry request,
      which causes the timer to fire immediately again. Lather, rinse and
      repeat....
      
      Prevent that by adding a check for a pending timer soft interrupt to the
      conditions in tick_nohz_stop_sched_tick() which avoid calling
      get_next_timer_interrupt(). That keeps the tick sched timer on the tick and
      prevents a repetitive programming of an already expired timer.
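
      The pending-timer-softirq test that gets added to those bail-out
      conditions is essentially (sketch):

        static inline bool local_timer_softirq_pending(void)
        {
                /* TIMER softirq raised but not yet processed on this CPU? */
                return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
        }
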
      Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
    • timers: Reinitialize per cpu bases on hotplug · 26456f87
      Committed by Thomas Gleixner
      The timer wheel bases are not (re)initialized on CPU hotplug. That leaves
      them with a potentially stale clk and next_expiry value, which can cause
      trouble when the CPU is plugged.
      
      Add a prepare callback which forwards the clock, sets next_expiry to far in
      the future and resets the control flags to a known state.
      
      Set base->must_forward_clk so the first timer which is queued will try to
      forward the clock to current jiffies.
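
      A sketch of such a prepare callback (field names as described above):

        int timers_prepare_cpu(unsigned int cpu)
        {
                struct timer_base *base;
                int b;

                for (b = 0; b < NR_BASES; b++) {
                        base = per_cpu_ptr(&timer_bases[b], cpu);
                        base->clk = jiffies;          /* forward the stale clk */
                        base->next_expiry = base->clk + NEXT_TIMER_MAX_DELTA;
                        base->is_idle = false;
                        base->must_forward_clk = true;
                }
                return 0;
        }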
      
      Fixes: 500462a9 ("timers: Switch to a non-cascading wheel")
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
    • timers: Use deferrable base independent of base::nohz_active · ced6d5c1
      Committed by Anna-Maria Gleixner
      During boot and before base::nohz_active is set in the timer bases, deferrable
      timers are enqueued into the standard timer base. This works correctly as
      long as base::nohz_active is false.
      
      Once it base::nohz_active is set and a timer which was enqueued before that
      is accessed the lock selector code choses the lock of the deferred
      base. This causes unlocked access to the standard base and in case the
      timer is removed it does not clear the pending flag in the standard base
      bitmap which causes get_next_timer_interrupt() to return bogus values.
      
      To prevent that, the deferrable timers must be enqueued in the deferrable
      base, even when base::nohz_active is not set. Those deferrable timers also
      need to be expired unconditionally.
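
      The base selection then depends only on the timer's own flag (sketch):

        static inline struct timer_base *get_timer_this_cpu_base(u32 tflags)
        {
                struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);

                /* pick the deferrable base purely from the timer flag,
                 * no longer gated on base->nohz_active */
                if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && (tflags & TIMER_DEFERRABLE))
                        base = this_cpu_ptr(&timer_bases[BASE_DEF]);
                return base;
        }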
      
      Fixes: 500462a9 ("timers: Switch to a non-cascading wheel")
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: rt@linutronix.de
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Link: https://lkml.kernel.org/r/20171222145337.633328378@linutronix.de
    • genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI · bc976233
      Committed by Thomas Gleixner
      The new reservation mode for interrupts assigns a dummy vector when the
      interrupt is allocated and assigns a real vector when the interrupt is
      requested. The reservation mode prevents vector pressure when devices with
      a large amount of queues/interrupts are initialized, but only a minimal
      subset of those queues/interrupts is actually used.
      
      This mode has an issue with MSI interrupts which cannot be masked. If the
      driver is not careful or the hardware emits an interrupt before the device
      irq is requested by the driver, then the interrupt ends up on the dummy
      vector as a spurious interrupt which can cause malfunction of the device or
      in the worst case a lockup of the machine.
      
      Change the logic for the reservation mode so that the early activation of
      MSI interrupts checks whether:
      
       - the device is a PCI/MSI device
       - the reservation mode of the underlying irqdomain is activated
       - PCI/MSI masking is globally enabled
       - the PCI/MSI device uses either MSI-X, which supports masking, or
         MSI with the maskbit supported.
      
      If one of those conditions is false, then clear the reservation mode flag
      in the irq data of the interrupt and invoke irq_domain_activate_irq() with
      the reserve argument cleared. In the x86 vector code, clear the can_reserve
      flag in the vector allocation data so a subsequent free_irq() won't create
      the same situation again. The interrupt stays assigned to a real vector
      until pci_disable_msi() is invoked and all allocations are undone.
      
      Fixes: 4900be83 ("x86/vector/msi: Switch to global reservation mode")
      Reported-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>,
      Cc: linux-media@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
    • genirq/irqdomain: Rename early argument of irq_domain_activate_irq() · 702cb0a0
      Committed by Thomas Gleixner
      The 'early' argument of irq_domain_activate_irq() is actually used to
      denote reservation mode. To avoid confusion, rename it before abuse
      happens.
      
      No functional change.
      
      Fixes: 72491643 ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandru Chirvasitu <achirvasub@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>,
      Cc: linux-media@vger.kernel.org
    • genirq: Introduce IRQD_CAN_RESERVE flag · 69790ba9
      Committed by Thomas Gleixner
      Add a new flag to mark interrupts which can use reservation mode. This is
      going to be used in subsequent patches to disable reservation mode for a
      certain class of MSI devices.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>,
      Cc: linux-media@vger.kernel.org
    • genirq/msi: Handle reactivation only on success · da5dd9e8
      Committed by Thomas Gleixner
      When analyzing the fallout of the x86 vector allocation rework it turned
      out that the error handling in msi_domain_alloc_irqs() is broken.
      
      If MSI_FLAG_MUST_REACTIVATE is set for a MSI domain then it clears the
      activation flag for a successfully initialized msi descriptor. If a
      subsequent initialization fails then the error handling code path does not
      deactivate the interrupt because the activation flag got cleared.
      
      Move the clearing of the activation flag outside of the initialization loop
      so that an eventual failure can be cleaned up correctly.
      
      Fixes: 22d0b12f ("genirq/irqdomain: Add force reactivation flag to irq domains")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>,
      Cc: linux-media@vger.kernel.org
      
16. 28 Dec, 2017 (3 commits)