1. 01 Feb 2018, 1 commit
    • genirq: Make legacy autoprobing work again · 1beaeacd
      Thomas Gleixner authored
      Meelis reported the following warning on a quad P3 HP NetServer museum piece:
      
      WARNING: CPU: 3 PID: 258 at kernel/irq/chip.c:244 __irq_startup+0x80/0x100
      EIP: __irq_startup+0x80/0x100
      irq_startup+0x7e/0x170
      probe_irq_on+0x128/0x2b0
      parport_irq_probe.constprop.18+0x8d/0x1af [parport_pc]
      parport_pc_probe_port+0xf11/0x1260 [parport_pc]
      parport_pc_init+0x78a/0xf10 [parport_pc]
      parport_parse_param.constprop.16+0xf0/0xf0 [parport_pc]
      do_one_initcall+0x45/0x1e0
      
      This is caused by the rewrite of the irq activation/startup sequence,
      which missed converting a call site in the irq legacy auto probing code.

      To fix this, irq_activate_and_startup() needs to gain a return value so
      that the pending logic can work properly.
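      The shape of the fix, sketched as a diff (details may differ slightly
      from the actual patch):

        -void irq_activate_and_startup(struct irq_desc *desc, bool resend)
        +int irq_activate_and_startup(struct irq_desc *desc, bool resend)
         {
         	if (WARN_ON(irq_activate(desc)))
        -		return;
        -	irq_startup(desc, resend, IRQ_START_FORCE);
        +		return 0;
        +	return irq_startup(desc, resend, IRQ_START_FORCE);
         }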
      
      Fixes: c942cee4 ("genirq: Separate activation and startup")
      Reported-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Meelis Roos <mroos@linux.ee>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801301935410.1797@nanos
  2. 27 Jan 2018, 1 commit
    • hrtimer: Reset hrtimer cpu base proper on CPU hotplug · d5421ea4
      Thomas Gleixner authored
      The hrtimer interrupt code contains a hang detection and mitigation
      mechanism, which prevents a long-delayed hrtimer interrupt from causing
      continuous retriggering of interrupts that keep the system from making
      progress. If a hang is detected, the timer hardware is programmed with
      a certain delay into the future and a flag is set in the hrtimer cpu
      base which prevents newly enqueued timers from reprogramming the timer
      hardware prior to the chosen delay. The subsequent hrtimer interrupt
      after the delay clears the flag and resumes normal operation.
      
      If such a hang happens in the last hrtimer interrupt before a CPU is
      unplugged then the hang_detected flag is set and stays that way when the
      CPU is plugged in again. At that point the timer hardware is not armed and
      it cannot be armed because the hang_detected flag is still active, so
      nothing clears that flag. As a consequence the CPU does not receive hrtimer
      interrupts and no timers expire on that CPU which results in RCU stalls and
      other malfunctions.
      
      Clear the flag along with some other less critical members of the hrtimer
      cpu base to ensure starting from a clean state when a CPU is plugged in.
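      A minimal sketch of the reset in the CPU-up path, assuming the cpu base
      field names of that kernel era (the actual patch may reset a few more
      members):

        int hrtimers_prepare_cpu(unsigned int cpu)
        {
        	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);

        	/* ... per clock base initialization ... */

        	cpu_base->active_bases = 0;
        	cpu_base->hres_active = 0;
        	cpu_base->hang_detected = 0;	/* the stale flag in question */
        	cpu_base->next_timer = NULL;
        	return 0;
        }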
      
      Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
      root cause of that hard to reproduce heisenbug. Once understood it's
      trivial and certainly justifies a brown paperbag.
      
      Fixes: 41d2e494 ("hrtimer: Tune hrtimer_interrupt hang logic")
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Sewior <bigeasy@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
  3. 25 Jan 2018, 3 commits
    • perf/core: Fix ctx::mutex deadlock · 0c7296ca
      Peter Zijlstra authored
      Lockdep noticed the following 3-way lockup scenario:
      
      	sys_perf_event_open()
      	  perf_event_alloc()
      	    perf_try_init_event()
       #0	      ctx = perf_event_ctx_lock_nested(1)
      	      perf_swevent_init()
      		swevent_hlist_get()
       #1		  mutex_lock(&pmus_lock)
      
      	perf_event_init_cpu()
       #1	  mutex_lock(&pmus_lock)
       #2	  mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #2	   mutex_lock()
       #0	   mutex_lock_nested()
      
      And while we need that perf_event_ctx_lock_nested() for HW PMUs such
      that they can iterate the sibling list, trying to match it to the
      available counters, the software PMUs need do no such thing. Exclude
      them.
      
      In particular the swevent triggers the above inversion, while the
      tpevent PMU triggers a more elaborate one through their event_mutex.
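      A sketch of the exclusion in perf_try_init_event(), assuming software
      PMUs are identified via the perf_sw_context task context type:

        /* Only HW PMUs need the ctx lock to walk the sibling list. */
        if (event->group_leader != event &&
            pmu->task_ctx_nr != perf_sw_context) {
        	ctx = perf_event_ctx_lock_nested(event->group_leader,
        					 SINGLE_DEPTH_NESTING);
        	/* ... try the init, then unlock ... */
        }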
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix another perf,trace,cpuhp lock inversion · 43fa87f7
      Peter Zijlstra authored
      Lockdep noticed the following 3-way lockup race:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1              mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                  static_key_enable()
      
       #2	do_cpu_up()
      	  perf_event_init_cpu()
       #3	    mutex_lock(&pmus_lock)
       #4	    mutex_lock(&ctx->mutex)
      
      	perf_ioctl()
       #4	  ctx = perf_event_ctx_lock()
	  _perf_ioctl()
      	    ftrace_profile_set_filter()
       #0	      mutex_lock(&event_mutex)
      
      Fudge it for now by noting that the tracepoint state does not depend
      on the event <-> context relation. Ugly though :/
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix lock inversion between perf,trace,cpuhp · 82d94856
      Peter Zijlstra authored
      Lockdep gifted us with noticing the following 4-way lockup scenario:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1             mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                 static_key_enable()
      
       #2     do_cpu_up()
                perf_event_init_cpu()
       #3         mutex_lock(&pmus_lock)
       #4         mutex_lock(&ctx->mutex)
      
              perf_event_task_disable()
                mutex_lock(&current->perf_event_mutex)
       #4       ctx = perf_event_ctx_lock()
       #5       perf_event_for_each_child()
      
              do_exit()
                task_work_run()
                  __fput()
                    perf_release()
                      perf_event_release_kernel()
       #4               mutex_lock(&ctx->mutex)
       #5               mutex_lock(&event->child_mutex)
                        free_event()
                          _free_event()
                            event->destroy() := perf_trace_destroy
       #0                     mutex_lock(&event_mutex);
      
      Fix that by moving the free_event() out from under the locks.
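      A sketch of the reordering, with hypothetical local names: children are
      moved onto a local list while the locks are held and freed only after
      both ctx->mutex and event->child_mutex have been dropped:

        LIST_HEAD(free_list);
        struct perf_event *child, *tmp;

        mutex_lock(&ctx->mutex);
        mutex_lock(&event->child_mutex);
        list_for_each_entry_safe(child, tmp, &event->child_list, child_list)
        	list_move(&child->child_list, &free_list);
        mutex_unlock(&event->child_mutex);
        mutex_unlock(&ctx->mutex);

        list_for_each_entry_safe(child, tmp, &free_list, child_list) {
        	list_del(&child->child_list);
        	free_event(child);	/* may take event_mutex; safe now */
        }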
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 24 Jan 2018, 5 commits
  5. 19 Jan 2018, 2 commits
    • tracing: Fix converting enum's from the map in trace_event_eval_update() · 1ebe1eaf
      Steven Rostedt (VMware) authored
      Since enums do not get converted by the TRACE_EVENT macro into their
      values, the event format displays the enum name and not the value. This
      breaks tools like perf and trace-cmd that need to interpret the raw
      binary data. To solve this, an enum map was created to convert these
      enums into their actual numbers on boot up. This is done by
      TRACE_EVENTS() adding a TRACE_DEFINE_ENUM() macro.
      
      Some enums were not being converted. This was caused by an optimization
      that had a bug in it.
      
      All calls get checked against this enum map to see if they should be
      converted or not, and it compares the call's system to the system that
      the enum map was created under. If they match, then the call is
      processed.
      
      To cut down on the number of iterations needed to find the maps with a
      matching system, since calls and maps are grouped by system, when a match is
      made, the index into the map array is saved, so that the next call, if it
      belongs to the same system as the previous call, could start right at that
      array index and not have to scan all the previous arrays.
      
      The problem was that the saved index was used as the variable to know
      whether this is a call in a new system or not. If the index was zero,
      it was assumed that the call is in a new system and would keep
      incrementing the saved index until it found a matching system. The
      issue arises when the first matching system was at index zero. The next
      map, if it belonged to the same system, would then think it was the
      first match and increment the index to one. If the next call belonged
      to the same system, it would begin its search of the maps off by one,
      and miss the first enum that should be converted. This left a single
      enum not converted properly.
      
      Also add a comment to describe exactly what that index was for. It took me a
      bit too long to figure out what I was thinking when debugging this issue.
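      An illustrative sketch of the corrected bookkeeping (names are made up;
      the real code differs in detail): an explicit flag replaces the
      "saved index == 0" test, so a match at index zero is handled correctly:

        bool first = true;	/* explicit "first match" flag, not save_i == 0 */

        for (i = save_i; i < len; i++) {
        	if (strcmp(map[i]->system, call->class->system) != 0)
        		continue;
        	if (first) {
        		save_i = i;	/* correct even when i == 0 */
        		first = false;
        	}
        	update_event_printk(call, map[i]);
        }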
      
      Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0c564a53 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ring-buffer: Fix duplicate results in mapping context to bits in recursive lock · 0164e0d7
      Steven Rostedt (VMware) authored
      In bringing back the context checks, the code first checks whether it is
      in normal (non-interrupt) context, and then checks for NMI, then IRQ,
      then softirq. The final check is redundant: since the if branch is only
      hit if the context is one of NMI, IRQ, or SOFTIRQ, and it is not NMI or
      IRQ, there is no reason to check whether it is SOFTIRQ. The current code
      returns the same result even if it is not a SOFTIRQ, which is confusing.
      
        pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ
      
      Is redundant as RB_CTX_SOFTIRQ *is* 2!
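      After the fix the mapping reads, roughly:

        if (pc & NMI_MASK)
        	bit = RB_CTX_NMI;
        else if (pc & HARDIRQ_MASK)
        	bit = RB_CTX_IRQ;
        else
        	bit = RB_CTX_SOFTIRQ;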
      
      Fixes: a0e3a18f ("ring-buffer: Bring back context level recursive checks")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  6. 18 Jan 2018, 2 commits
    • irq/matrix: Spread interrupts on allocation · a0c9259d
      Thomas Gleixner authored
      Keith reported an issue with vector space exhaustion on a server machine
      which is caused by the i40e driver allocating 168 MSI interrupts when the
      driver is initialized, even when most of these interrupts are not used at
      all.
      
      The x86 vector allocation code tries to avoid the immediate allocation with
      the reservation mode, but the card uses MSI and does not support MSI entry
      masking, which prevents reservation mode and requires immediate vector
      allocation.
      
      The matrix allocator is a bit naive and prefers the first CPU in the
      cpumask which describes the possible target CPUs for an allocation. That
      results in allocating all 168 vectors on CPU0 which later causes vector
      space exhaustion when the NVMe driver tries to allocate managed interrupts
      on each CPU for the per CPU queues.
      
      Avoid this by finding the CPU which has the lowest vector allocation
      count to spread out the non-managed interrupts across the possible
      target CPUs.
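      A sketch of the selection logic, assuming a helper along the lines of
      matrix_find_best_cpu():

        static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
        					 const struct cpumask *msk)
        {
        	unsigned int cpu, best_cpu = UINT_MAX, maxavl = 0;
        	struct cpumap *cm;

        	for_each_cpu(cpu, msk) {
        		cm = per_cpu_ptr(m->maps, cpu);
        		/* prefer the CPU with the most vectors available */
        		if (!cm->online || cm->available <= maxavl)
        			continue;
        		best_cpu = cpu;
        		maxavl = cm->available;
        	}
        	return best_cpu;
        }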
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Keith Busch <keith.busch@intel.com>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos
    • bpf: mark dst unknown on inconsistent {s, u}bounds adjustments · 6f16101e
      Daniel Borkmann authored
      syzkaller generated a BPF proglet and triggered a warning with
      the following:
      
        0: (b7) r0 = 0
        1: (d5) if r0 s<= 0x0 goto pc+0
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        2: (1f) r0 -= r1
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        verifier internal error: known but bad sbounds
      
      What happens is that in the first insn, r0's min/max values are both 0
      due to the immediate assignment. Later, in the jsle test, the bounds
      are updated for the min value in the false path, meaning they yield
      smin_val = 1, smax_val = 0, and when the ctx pointer is subtracted
      from r0, the verifier bails out with the internal error and throws a
      WARN since smin_val != smax_val for the known constant.
      
      For the min_val > max_val scenario this means that reg_set_min_max()
      and reg_set_min_max_inv() (which both refine existing bounds)
      demonstrated that such a branch cannot be taken at runtime.
      
      In above scenario for the case where it will be taken, the
      existing [0, 0] bounds are kept intact. Meaning, the rejection
      is not due to a verifier internal error, and therefore the
      WARN() is not necessary either.
      
      We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
      when either known scalars have smin_val != smax_val or
      umin_val != umax_val or any scalar reg with bounds
      smin_val > smax_val or umin_val > umax_val. However, there
      may be a small risk of breakage of buggy programs, so handle
      this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
      just taint the dst reg as unknown scalar when we see ops with
      such kind of src reg.
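      A sketch of the graceful handling in adjust_scalar_min_max_vals():

        if (known && (smin_val != smax_val || umin_val != umax_val)) {
        	/* Bounds derived from a dead branch: taint dst, don't WARN */
        	__mark_reg_unknown(dst_reg);
        	return 0;
        }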
      
      Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  7. 17 Jan 2018, 1 commit
  8. 16 Jan 2018, 3 commits
    • delayacct: Account blkio completion on the correct task · c96f5471
      Josh Snyder authored
      Before commit:
      
        e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      
      delayacct_blkio_end() was called after context-switching into the task which
      completed I/O.
      
      This resulted in double counting: the task would account a delay both waiting
      for I/O and for time spent in the runqueue.
      
      With e33a9bba, delayacct_blkio_end() is called by try_to_wake_up().
      In ttwu, we have not yet context-switched. This is more correct, in that
      the delay accounting ends when the I/O is complete.
      
      But delayacct_blkio_end() relies on 'get_current()', and we have not yet
      context-switched into the task whose I/O completed. This results in the
      wrong task having its delay accounting statistics updated.
      
      Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
      so that it can update the statistics of the correct task.
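      A sketch of the call-site change in try_to_wake_up() (context lines
      approximate):

        	if (p->in_iowait) {
        -		delayacct_blkio_end();
        +		delayacct_blkio_end(p);
        		atomic_dec(&task_rq(p)->nr_iowait);
        	}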
      Signed-off-by: Josh Snyder <joshs@netflix.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-block@vger.kernel.org
      Fixes: e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing: Prevent PROFILE_ALL_BRANCHES when FORTIFY_SOURCE=y · 68e76e03
      Randy Dunlap authored
      I regularly get 50 MB - 60 MB files during kernel randconfig builds.
      These large files mostly contain many repeats (e.g., 124,594) of:
      
      In file included from ../include/linux/string.h:6:0,
                       from ../include/linux/uuid.h:20,
                       from ../include/linux/mod_devicetable.h:13,
                       from ../scripts/mod/devicetable-offsets.c:3:
      ../include/linux/compiler.h:64:4: warning: '______f' is static but declared in inline function 'strcpy' which is not static [enabled by default]
          ______f = {     \
          ^
      ../include/linux/compiler.h:56:23: note: in expansion of macro '__trace_if'
                             ^
      ../include/linux/string.h:425:2: note: in expansion of macro 'if'
        if (p_size == (size_t)-1 && q_size == (size_t)-1)
        ^
      
      This only happens when CONFIG_FORTIFY_SOURCE=y and
      CONFIG_PROFILE_ALL_BRANCHES=y, so prevent PROFILE_ALL_BRANCHES if
      FORTIFY_SOURCE=y.
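      The shape of the Kconfig change, as a sketch (the exact form in the
      tree may differ):

        config PROFILE_ALL_BRANCHES
        	bool "Profile all if conditionals" if !FORTIFY_SOURCE
        	select TRACE_BRANCH_PROFILING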
      
      Link: http://lkml.kernel.org/r/9199446b-a141-c0c3-9678-a3f9107f2750@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ring-buffer: Bring back context level recursive checks · a0e3a18f
      Steven Rostedt (VMware) authored
      Commit 1a149d7d ("ring-buffer: Rewrite trace_recursive_(un)lock() to be
      simpler") replaced the context level recursion checks with a simple counter.
      This would prevent the ring buffer code from recursively calling itself more
      than the max number of contexts that exist (Normal, softirq, irq, nmi). But
      this change caused a lockup in a specific case, which was during suspend and
      resume using a global clock. Adding a stack dump to see where this occurred,
      the issue was in the trace global clock itself:
      
        trace_buffer_lock_reserve+0x1c/0x50
        __trace_graph_entry+0x2d/0x90
        trace_graph_entry+0xe8/0x200
        prepare_ftrace_return+0x69/0xc0
        ftrace_graph_caller+0x78/0xa8
        queued_spin_lock_slowpath+0x5/0x1d0
        trace_clock_global+0xb0/0xc0
        ring_buffer_lock_reserve+0xf9/0x390
      
      The function graph tracer traced queued_spin_lock_slowpath, which was
      called by trace_clock_global. This pointed out that trace_clock_global()
      is not reentrant, as it takes a spin lock. It depended on the ring
      buffer recursive lock to keep that from happening.
      
      By removing the context detection and adding just a max number of allowable
      recursions, it allowed the trace_clock_global() to be entered again and try
      to retake the spinlock it already held, causing a deadlock.
      
      Fixes: 1a149d7d ("ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler")
      Reported-by: David Weinehall <david.weinehall@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  9. 15 Jan 2018, 5 commits
    • timers: Unconditionally check deferrable base · ed4bbf79
      Thomas Gleixner authored
      When the timer base is checked for expired timers then the deferrable base
      must be checked as well. This was missed when making the deferrable base
      independent of base::nohz_active.
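      A sketch of the change in the timer softirq path; the base::nohz_active
      guard is what got dropped:

        	base = this_cpu_ptr(&timer_bases[BASE_STD]);
        	__run_timers(base);
        -	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
        +	if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
        		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));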
      
      Fixes: ced6d5c1 ("timers: Use deferrable base independent of base::nohz_active")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Cc: rt@linutronix.de
    • bpf: fix 32-bit divide by zero · 68fda450
      Alexei Starovoitov authored
      Due to some JITs doing an if (src_reg == 0) check in 64-bit mode for
      div/mod operations, mask the upper 32 bits of the src register before
      doing the check.
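      A sketch of the verifier-side rewrite that forces the truncation:

        if (insn->code == (BPF_ALU | BPF_MOD | BPF_X) ||
            insn->code == (BPF_ALU | BPF_DIV | BPF_X)) {
        	/* Zero-extend src so the JIT's 64-bit zero test sees 0 */
        	struct bpf_insn mask_and_div[] = {
        		BPF_MOV32_REG(insn->src_reg, insn->src_reg),
        		*insn,
        	};
        	/* ... patch both insns into the program ... */
        }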
      
      Fixes: 62258278 ("net: filter: x86: internal BPF JIT")
      Fixes: 7a12b503 ("sparc64: Add eBPF JIT.")
      Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • futex: Prevent overflow by strengthen input validation · fbe0e839
      Li Jinyue authored
      UBSAN reports signed integer overflow in kernel/futex.c:
      
       UBSAN: Undefined behaviour in kernel/futex.c:2041:18
       signed integer overflow:
       0 - -2147483648 cannot be represented in type 'int'
      
      Add a sanity check to catch negative values of nr_wake and nr_requeue.
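      The added check, roughly:

        if (nr_wake < 0 || nr_requeue < 0)
        	return -EINVAL;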
      Signed-off-by: Li Jinyue <lijinyue@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Cc: dvhart@infradead.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com
    • futex: Avoid violating the 10th rule of futex · c1e2f0ea
      Peter Zijlstra authored
      Julia reported futex state corruption in the following scenario:
      
         waiter                                  waker                                            stealer (prio > waiter)
      
         futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
               timeout=[N ms])
            futex_wait_requeue_pi()
               futex_wait_queue_me()
                  freezable_schedule()
                  <scheduled out>
                                                 futex(LOCK_PI, uaddr2)
                                                 futex(CMP_REQUEUE_PI, uaddr,
                                                       uaddr2, 1, 0)
                                                    /* requeues waiter to uaddr2 */
                                                 futex(UNLOCK_PI, uaddr2)
                                                       wake_futex_pi()
                                                          cmp_futex_value_locked(uaddr2, waiter)
                                                          wake_up_q()
                 <woken by waker>
                 <hrtimer_wakeup() fires,
                  clears sleeper->task>
                                                                                                 futex(LOCK_PI, uaddr2)
                                                                                                    __rt_mutex_start_proxy_lock()
                                                                                                       try_to_take_rt_mutex() /* steals lock */
                                                                                                          rt_mutex_set_owner(lock, stealer)
                                                                                                    <preempted>
               <scheduled in>
               rt_mutex_wait_proxy_lock()
                  __rt_mutex_slowlock()
                     try_to_take_rt_mutex() /* fails, lock held by stealer */
                     if (timeout && !timeout->task)
                        return -ETIMEDOUT;
                  fixup_owner()
                     /* lock wasn't acquired, so,
                        fixup_pi_state_owner skipped */
      
         return -ETIMEDOUT;
      
         /* At this point, we've returned -ETIMEDOUT to userspace, but the
          * futex word shows waiter to be the owner, and the pi_mutex has
          * stealer as the owner */
      
         futex_lock(LOCK_PI, uaddr2)
           -> bails with EDEADLK, futex word says we're owner.
      
      And suggested that what commit:
      
        73d786bd ("futex: Rework inconsistent rt_mutex/futex_q state")
      
      removes from fixup_owner() looks to be just what is needed. And indeed
      it is -- I completely missed that requeue_pi could also result in this
      case. So we need to restore that, except that subsequent patches, like
      commit:
      
        16ffa12d ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")
      
      changed all the locking rules. Even without that, the sequence:
      
      -               if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
      -                       locked = 1;
      -                       goto out;
      -               }
      
      -               raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
      -               owner = rt_mutex_owner(&q->pi_state->pi_mutex);
      -               if (!owner)
      -                       owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
      -               raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
      -               ret = fixup_pi_state_owner(uaddr, q, owner);
      
      already suggests there were races; otherwise we'd never have to look
      at next_owner.
      
      So instead of doing 3 consecutive wait_lock sections with who knows
      what races, we do it all in a single section. Additionally, the usage
      of pi_state->owner in fixup_owner() was only safe because only the
      rt_mutex owner would modify it, which this additional case wrecks.
      
      Luckily the values can only change away from, and not to, the value
      we're testing; this means we can do a speculative test and double-check
      once we have the wait_lock.
      
      Fixes: 73d786bd ("futex: Rework inconsistent rt_mutex/futex_q state")
      Reported-by: Julia Cartwright <julia@ni.com>
      Reported-by: Gratian Crisan <gratian.crisan@ni.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Julia Cartwright <julia@ni.com>
      Tested-by: Gratian Crisan <gratian.crisan@ni.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net
    • bpf: fix divides by zero · c366287e
      Eric Dumazet authored
      Divides by zero are not nice, let's avoid them if possible.
      
      Also do_div() seems not needed when dealing with 32bit operands,
      but this seems a minor detail.
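      A sketch of the interpreter change for the 32-bit divide case (the
      64-bit and modulo cases are analogous, and do_div() goes away):

        ALU_DIV_X:
        	if (unlikely((u32)SRC == 0))
        		return 0;
        	DST = (u32) DST / (u32) SRC;
        	CONT;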
      
      Fixes: bd4cf0ed ("net: filter: rework/optimize internal BPF interpreter's instruction set")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  10. 14 Jan 2018, 1 commit
  11. 13 Jan 2018, 2 commits
  12. 11 Jan 2018, 4 commits
    • bpf, array: fix overflow in max_entries and undefined behavior in index_mask · bbeb6e43
      Daniel Borkmann authored
      syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
      and thus unprivileged. With the recently added logic in b2157399
      ("bpf: prevent out-of-bounds speculation") we round this up to the next
      power of two value for max_entries for unprivileged such that we can
      apply proper masking into potentially zeroed out map slots.
      
      However, this will generate an index_mask of 0xffffffff, and therefore
      adding 1 will let this overflow into a new max_entries of 0. This will
      pass allocation, etc, and later on map access we still enforce on the
      original attr->max_entries value which was 0xfffffffd, therefore
      triggering GPF all over the place. Thus bail out on overflow in such
      a case.
      
      Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
      since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
      space is undefined. Therefore, do this by hand in a 64 bit variable.
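      A sketch of the by-hand round-up and the overflow bail-out:

        u64 mask64 = fls_long(max_entries - 1);

        mask64 = 1ULL << mask64;	/* safe: 1UL << 32 would be UB on 32 bit */
        mask64 -= 1;

        index_mask = mask64;
        if (unpriv) {
        	/* round up array size to nearest power of 2 */
        	max_entries = index_mask + 1;
        	/* bail out on overflow (0xffffffff + 1 wraps to 0) */
        	if (max_entries < attr->max_entries)
        		return ERR_PTR(-E2BIG);
        }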
      
      This fixes all the issues triggered by syzkaller's reproducers.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
      Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
      Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
      Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
      Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: arsh is not supported in 32 bit alu thus reject it · 7891a87e
      Daniel Borkmann authored
      The following snippet was throwing an 'unknown opcode cc' warning in
      the BPF interpreter:
      
        0: (18) r0 = 0x0
        2: (7b) *(u64 *)(r10 -16) = r0
        3: (cc) (u32) r0 s>>= (u32) r0
        4: (95) exit
      
      Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
      generation, not all of them do, and the interpreter does not either. We
      can leave the existing ones and implement it later in bpf-next for the
      remaining ones, but reject this properly in the verifier for the time
      being.
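      A sketch of the verifier rejection:

        if (opcode == BPF_ARSH && BPF_CLASS(insn->code) != BPF_ALU64) {
        	verbose(env, "BPF_ARSH not supported for 32 bit ALU\n");
        	return -EINVAL;
        }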
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix spelling mistake: "obusing" -> "abusing" · 40950343
      Colin Ian King authored
      Trivial fix to spelling mistake in error message text.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • cgroup: make cgroup.threads delegatable · 4f58424d
      Roman Gushchin authored
      Make cgroup.threads file delegatable.
      The behavior of cgroup.threads should follow the behavior of cgroup.procs.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  13. 10 Jan 2018, 2 commits
    • membarrier: Disable preemption when calling smp_call_function_many() · 54167607
      Mathieu Desnoyers authored
      smp_call_function_many() requires disabling preemption around the call.
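      A sketch of the fix in the expedited membarrier path:

        -	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
        +	preempt_disable();
        +	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
        +	preempt_enable();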
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: <stable@vger.kernel.org> # v4.14+
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: introduce BPF_JIT_ALWAYS_ON config · 290af866
      Alexei Starovoitov authored
      The BPF interpreter has been used as part of the Spectre 2 attack
      CVE-2017-5715.
      
      A quote from the Google Project Zero blog:
      "At this point, it would normally be necessary to locate gadgets in
      the host kernel code that can be used to actually leak data by reading
      from an attacker-controlled location, shifting and masking the result
      appropriately and then using the result of that as offset to an
      attacker-controlled address for a load. But piecing gadgets together
      and figuring out which ones work in a speculation context seems annoying.
      So instead, we decided to use the eBPF interpreter, which is built into
      the host kernel - while there is no legitimate way to invoke it from inside
      a VM, the presence of the code in the host kernel's text section is sufficient
      to make it usable for the attack, just like with ordinary ROP gadgets."
      
      To make the attacker's job harder, introduce the BPF_JIT_ALWAYS_ON
      config option that removes the interpreter from the kernel in favor of
      JIT-only mode. So far the eBPF JIT is supported by:
      x64, arm64, arm32, sparc64, s390, powerpc64, mips64
      
      The start of JITed program is randomized and code page is marked as read-only.
      In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
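      The new option, roughly as it appears in the Kconfig:

        config BPF_JIT_ALWAYS_ON
        	bool "Permanently enable BPF JIT and remove BPF interpreter"
        	depends on BPF_SYSCALL && HAVE_EBPF_JIT && BPF_JIT
        	help
        	  Enables BPF JIT and removes BPF interpreter to avoid
        	  speculative execution of BPF instructions by the interpreter.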
      
      v2->v3:
      - move __bpf_prog_ret0 under ifdef (Daniel)
      
      v1->v2:
      - fix init order, test_bpf and cBPF (Daniel's feedback)
      - fix offloaded bpf (Jakub's feedback)
      - add 'return 0' dummy in case something can invoke prog->bpf_func
      - retarget bpf tree. For bpf-next the patch would need one extra hunk.
        It will be sent when the trees are merged back to net-next
      
      Considered doing:
        int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
      but it seems better to land the patch as-is and in bpf-next remove
      bpf_jit_enable global variable from all JITs, consolidate in one place
      and remove this jit_init() function.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  14. 09 Jan 2018, 2 commits
    • bpf: prevent out-of-bounds speculation · b2157399
      Alexei Starovoitov authored
      Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
      memory accesses under a bounds check may be speculated even if the
      bounds check fails, providing a primitive for building a side channel.
      
      To avoid leaking kernel data, round up array-based maps and mask the
      index after the bounds check, so a speculated load with an out-of-bounds
      index will load either a valid value from the array or zero from the
      padded area.
      
      Unconditionally mask index for all array types even when max_entries
      are not rounded to power of 2 for root user.
      When map is created by unpriv user generate a sequence of bpf insns
      that includes AND operation to make sure that JITed code includes
      the same 'index & index_mask' operation.
      
      If prog_array map is created by unpriv user replace
        bpf_tail_call(ctx, map, index);
      with
        if (index >= max_entries) {
          index &= map->index_mask;
          bpf_tail_call(ctx, map, index);
        }
      (along with roundup to power 2) to prevent out-of-bounds speculation.
      There is a secondary, redundant 'if (index >= max_entries)' check in the
      interpreter and in all JITs, but they can be optimized later if
      necessary.
      
      Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
      cannot be used by unpriv, so no changes there.
      
      That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
      all architectures with and without JIT.
      
      v2->v3:
      Daniel noticed that the attack can potentially be crafted via syscall
      commands without loading the program, so add masking to those paths as
      well.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • locking/lockdep: Remove cross-release leftovers · 527187d2
      Ingo Molnar authored
      There are two cross-release leftover facilities:
      
       - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently)
       - the complete_release_commit() callback (NOP as well)
      
      Remove them.
      
      Cc: David Sterba <dsterba@suse.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 07 Jan 2018, 1 commit
  16. 06 Jan 2018, 1 commit
  17. 05 Jan 2018, 2 commits
  18. 30 Dec 2017, 2 commits
    • timers: Invoke timer_start_debug() where it makes sense · fd45bb77
      Thomas Gleixner authored
      The timer start debug function is called before the proper timer base is
      set. As a consequence the trace data contains the stale CPU and flags
      values.
      
      Call the debug function after setting the new base and flags.
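      Illustrative ordering after the fix (simplified; surrounding context
      approximate):

        	/* base and timer->flags now reflect the final target CPU */
        	timer->flags = (timer->flags & ~TIMER_BASEMASK) | base->cpu;
        	debug_activate(timer, expires);	/* trace sees correct data */
        	internal_add_timer(base, timer);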
      
      Fixes: 500462a9 ("timers: Switch to a non-cascading wheel")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: rt@linutronix.de
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Link: https://lkml.kernel.org/r/20171222145337.792907137@linutronix.de
    • nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick() · 5d62c183
      Thomas Gleixner authored
      The conditions in irq_exit() to invoke tick_nohz_irq_exit() which
      subsequently invokes tick_nohz_stop_sched_tick() are:
      
        if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))
      
      If need_resched() is not set, but a timer softirq is pending then this is
      an indication that the softirq code punted and delegated the execution to
      softirqd. need_resched() is not true because the current interrupted task
      takes precedence over softirqd.
      
      Invoking tick_nohz_irq_exit() in this case can cause an endless loop of
      timer interrupts because the timer wheel contains an expired timer, but
      softirqs are not yet executed. So it returns an immediate expiry request,
      which causes the timer to fire immediately again. Lather, rinse and
      repeat....
      
      Prevent that by adding a check for a pending timer soft interrupt to the
      conditions in tick_nohz_stop_sched_tick() which avoid calling
      get_next_timer_interrupt(). That keeps the tick sched timer on the tick and
      prevents a repetitive programming of an already expired timer.
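      A sketch of the added check, assuming a small helper:

        static inline bool local_timer_softirq_pending(void)
        {
        	return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
        }

      which is then folded into the conditions that guard the call to
      get_next_timer_interrupt() in tick_nohz_stop_sched_tick().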
      Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos