1. 06 February 2018, 9 commits
    • sched/rt: Make update_curr_rt() more accurate · e7ad2031
      Submitted by Wen Yang
      rq->clock_task may be updated between the two calls to
      rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
      once makes the accounting more accurate and more efficient, using
      update_curr() as a reference.
      Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e7ad2031
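      A simplified sketch of the change in the commit above, assuming the
      surrounding update_curr_rt() structure (accounting details elided):
      the clock is read once and the same timestamp is reused.

        static void update_curr_rt(struct rq *rq)
        {
                struct task_struct *curr = rq->curr;
                u64 now = rq_clock_task(rq);            /* read rq->clock_task once */
                u64 delta_exec = now - curr->se.exec_start;

                if (unlikely((s64)delta_exec <= 0))
                        return;

                /* ... runtime accounting elided ... */

                curr->se.exec_start = now;              /* reuse the cached timestamp */
        }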
    • sched/rt: Up the root domain ref count when passing it around via IPIs · 364f5665
      Submitted by Steven Rostedt (VMware)
      When issuing an IPI RT push, where an IPI is sent to each CPU that has more
      than one RT task scheduled on it, the code references the root domain's rto_mask,
      which contains all the CPUs within the root domain that have more than one RT
      task in the runnable state. The problem is that after the IPIs are initiated, the
      rq->lock is released. This means the root domain associated with the run queue
      could be freed while the IPIs are still going around.
      
      Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
      the root domain's ref count respectively. This way when initiating the IPIs,
      the scheduler will up the root domain's ref count before releasing the
      rq->lock, ensuring that the root domain does not go away until the IPI round
      is complete.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      364f5665
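      A sketch of the two helpers the commit above describes, assuming the
      root_domain's existing atomic refcount and the RCU-deferred
      free_rootdomain() callback already used by the scheduler:

        void sched_get_rd(struct root_domain *rd)
        {
                atomic_inc(&rd->refcount);
        }

        void sched_put_rd(struct root_domain *rd)
        {
                if (!atomic_dec_and_test(&rd->refcount))
                        return;

                /* Last reference: defer the actual free until readers are done. */
                call_rcu_sched(&rd->rcu, free_rootdomain);
        }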
    • sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() · ad0f1d9d
      Submitted by Steven Rostedt (VMware)
      When the rto_push_irq_work_func() is called, it looks at the RT overloaded
      bitmask in the root domain via the runqueue (rq->rd). The problem is that
      during CPU up and down, nothing here stops rq->rd from changing between
      taking the rq->rd->rto_lock and releasing it. That means the lock that is
      released is not the same lock that was taken.
      
      Instead of using this_rq()->rd to get the root domain, as the irq work is
      part of the root domain, we can simply get the root domain from the irq work
      that is passed to the routine:
      
       container_of(work, struct root_domain, rto_push_work)
      
      This keeps the root domain consistent.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ad0f1d9d
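      A simplified excerpt of the pattern described above (the push loop
      itself is elided): the root domain is derived from the irq_work, so
      the rto_lock that is taken is guaranteed to be the one that is
      released.

        void rto_push_irq_work_func(struct irq_work *work)
        {
                struct root_domain *rd =
                        container_of(work, struct root_domain, rto_push_work);

                raw_spin_lock(&rd->rto_lock);
                /* ... pick the next CPU to push from, using rd->rto_mask ... */
                raw_spin_unlock(&rd->rto_lock);   /* same rd, same lock, even across hotplug */
        }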
    • sched/core: Optimize update_stats_*() · 2ed41a55
      Submitted by Peter Zijlstra
      These functions are already gated by schedstats_enabled(), there is no
      point in then issuing another static_branch for every individual
      update in them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2ed41a55
    • sched/core: Optimize ttwu_stat() · b85c8b71
      Submitted by Peter Zijlstra
      The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
      there is absolutely no point in then issuing another static_branch for
      every single schedstat_inc() in there.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b85c8b71
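      Both commits above apply the same pattern, sketched here under the
      assumption that the double-underscore schedstat helpers skip the
      per-call static_branch while schedstat_enabled() performs it once
      (field names illustrative):

        if (!schedstat_enabled())                 /* one static_branch check ... */
                return;

        __schedstat_inc(rq->ttwu_count);          /* ... then plain updates, no more branches */
        if (cpu == rq->cpu)
                __schedstat_inc(rq->ttwu_local);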
    • membarrier: Provide core serializing command, *_SYNC_CORE · 70216e18
      Submitted by Mathieu Desnoyers
      Provide core serializing membarrier command to support memory reclaim
      by JIT.
      
      Each architecture needs to explicitly opt into that support by
      documenting in their architecture code how they provide the core
      serializing instructions required when returning from the membarrier
      IPI, and after the scheduler has updated the curr->mm pointer (before
      going back to user-space). They should then select
      ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
      their architecture.
      
      Architectures selecting this feature need to either document that
      they issue core serializing instructions when returning to user-space,
      or implement their architecture-specific sync_core_before_usermode().
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      70216e18
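      A user-space usage sketch (not part of the commit) of the resulting
      command pair, assuming a kernel and uapi headers new enough to define
      the *_SYNC_CORE constants; glibc has no wrapper, so the raw syscall
      is used:

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>

        static int membarrier(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int main(void)
        {
                /* A JIT registers once per process ... */
                if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0)) {
                        perror("membarrier register");
                        return 1;
                }
                /* ... and issues this after rewriting code, before jumping into it:
                 * every thread of the process is core-serialized on return. */
                if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0)) {
                        perror("membarrier sync-core");
                        return 1;
                }
                return 0;
        }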
    • membarrier: Provide GLOBAL_EXPEDITED command · c5f58bd5
      Submitted by Mathieu Desnoyers
      Allow expedited membarrier to be used for data shared between processes
      through shared memory.
      
      Processes wishing to receive the membarriers register with
      MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue
      membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This allows an extremely simple kernel-level implementation: we have almost
      everything we need with the PRIVATE_EXPEDITED barrier code. All we need
      to do is add a flag in the mm_struct that is used to check
      whether we need to send the IPI to the current thread of each CPU.
      
      There is a slight downside to this approach compared to targeting
      specific shared memory users: when performing a membarrier operation,
      all registered "global" receivers will get the barrier, even if they
      don't share a memory mapping with the sender issuing
      MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This registration approach seems to fit the requirement of not
      disturbing processes that really deeply care about real-time: they
      simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
      
      In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
      command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
      MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
      compatibility.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c5f58bd5
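      A usage sketch for the command described above (same raw-syscall
      wrapper and uapi constants as in the previous example): receivers
      register once, the sender then issues the expedited barrier.

        /* In every receiver process, typically at startup: */
        membarrier(MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, 0);

        /* In the sender, e.g. after publishing data in shared memory: */
        membarrier(MEMBARRIER_CMD_GLOBAL_EXPEDITED, 0);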
    • membarrier: Document scheduler barrier requirements · 306e0604
      Submitted by Mathieu Desnoyers
      Document the membarrier requirement on having a full memory barrier in
      __schedule() after coming from user-space, before storing to rq->curr.
      It is provided by smp_mb__after_spinlock() in __schedule().
      
      Document that membarrier requires a full barrier on transition from
      kernel thread to userspace thread. We currently have an implicit barrier
      from atomic_dec_and_test() in mmdrop() that ensures this.
      
      The x86 switch_mm_irqs_off() full barrier is currently provided by many
      cpumask update operations as well as write_cr3(). Document that
      write_cr3() provides this barrier.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      306e0604
    • powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Submitted by Mathieu Desnoyers
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use expedited private.
      
      Threads targeting the same VM but which belong to different thread
      groups are a tricky case. It has a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3ccfebed
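      A simplified sketch of the arch hook this enables (flag and field
      names as introduced by the series; treat the details as assumptions):
      the full barrier is only paid when the incoming mm has registered for
      private expedited membarrier.

        static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
                                                     struct mm_struct *next,
                                                     struct task_struct *tsk)
        {
                /* Fast path: nobody registered, or no previous mm to order against. */
                if (likely(!(atomic_read(&next->membarrier_state) &
                             MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
                        return;

                /* Order the rq->curr / mm_cpumask updates against later loads. */
                smp_mb();
        }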
  2. 27 January 2018, 1 commit
    • hrtimer: Reset hrtimer cpu base proper on CPU hotplug · d5421ea4
      Submitted by Thomas Gleixner
      The hrtimer interrupt code contains a hang detection and mitigation
      mechanism, which prevents a long-delayed hrtimer interrupt from causing a
      continuous retriggering of interrupts that would keep the system from making
      progress. If a hang is detected, the timer hardware is programmed with
      a certain delay into the future and a flag is set in the hrtimer cpu base
      which prevents newly enqueued timers from reprogramming the timer hardware
      prior to the chosen delay. The subsequent hrtimer interrupt after the delay
      clears the flag and resumes normal operation.
      
      If such a hang happens in the last hrtimer interrupt before a CPU is
      unplugged then the hang_detected flag is set and stays that way when the
      CPU is plugged in again. At that point the timer hardware is not armed and
      it cannot be armed because the hang_detected flag is still active, so
      nothing clears that flag. As a consequence the CPU does not receive hrtimer
      interrupts and no timers expire on that CPU which results in RCU stalls and
      other malfunctions.
      
      Clear the flag along with some other less critical members of the hrtimer
      cpu base to ensure starting from a clean state when a CPU is plugged in.
      
      Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
      root cause of that hard to reproduce heisenbug. Once understood it's
      trivial and certainly justifies a brown paperbag.
      
      Fixes: 41d2e494 ("hrtimer: Tune hrtimer_interrupt hang logic")
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Sewior <bigeasy@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
      d5421ea4
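      A sketch of the fix, assuming the existing hrtimers_prepare_cpu()
      hotplug callback and the hrtimer_cpu_base fields named in the
      changelog:

        int hrtimers_prepare_cpu(unsigned int cpu)
        {
                struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);

                /* ... per clock base initialization elided ... */

                cpu_base->cpu = cpu;
                cpu_base->hang_detected = 0;        /* the stale flag that kept the CPU timer-less */
                cpu_base->next_timer = NULL;
                cpu_base->expires_next = KTIME_MAX; /* nothing armed on this CPU yet */
                return 0;
        }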
  3. 26 January 2018, 1 commit
    • module/retpoline: Warn about missing retpoline in module · caf7501a
      Submitted by Andi Kleen
      There's a risk that a kernel which has full retpoline mitigations becomes
      vulnerable when a module gets loaded that hasn't been compiled with the
      right compiler or the right option.
      
      To enable detection of that mismatch at module load time, add a module info
      string "retpoline" at build time when the module was compiled with
      retpoline support. This only covers compiled C source; assembler source
      and prebuilt object files are not checked.
      
      If a retpoline enabled kernel detects a non retpoline protected module at
      load time, print a warning and report it in the sysfs vulnerability file.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: gregkh@linuxfoundation.org
      Cc: torvalds@linux-foundation.org
      Cc: jeyu@kernel.org
      Cc: arjan@linux.intel.com
      Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org
      caf7501a
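      A sketch of the two halves of the mechanism, under the assumption that
      the tag is emitted via MODULE_INFO() for retpoline builds and read
      back with get_modinfo() while the module loader parses .modinfo (the
      sysfs reporting helper is omitted):

        /* Build side, pulled in by every module when the kernel uses retpolines: */
        #ifdef RETPOLINE
        MODULE_INFO(retpoline, "Y");
        #endif

        /* Load side, simplified check in the module loader: */
        if (!get_modinfo(info, "retpoline"))
                pr_warn("%s: loading module not compiled with retpoline compiler.\n",
                        mod->name);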
  4. 25 January 2018, 3 commits
    • perf/core: Fix ctx::mutex deadlock · 0c7296ca
      Submitted by Peter Zijlstra
      Lockdep noticed the following 3-way lockup scenario:
      
      	sys_perf_event_open()
      	  perf_event_alloc()
      	    perf_try_init_event()
       #0	      ctx = perf_event_ctx_lock_nested(1)
      	      perf_swevent_init()
      		swevent_hlist_get()
       #1		  mutex_lock(&pmus_lock)
      
      	perf_event_init_cpu()
       #1	  mutex_lock(&pmus_lock)
       #2	  mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #2	   mutex_lock()
       #0	   mutex_lock_nested()
      
      And while we need that perf_event_ctx_lock_nested() for HW PMUs such
      that they can iterate the sibling list, trying to match it to the
      available counters, the software PMUs need do no such thing. Exclude
      them.
      
      In particular the swevent PMU triggers the above inversion, while the
      tpevent PMU triggers a more elaborate one through its event_mutex.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0c7296ca
    • perf/core: Fix another perf,trace,cpuhp lock inversion · 43fa87f7
      Submitted by Peter Zijlstra
      Lockdep noticed the following 3-way lockup race:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1              mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                  static_key_enable()
      
       #2	do_cpu_up()
      	  perf_event_init_cpu()
       #3	    mutex_lock(&pmus_lock)
       #4	    mutex_lock(&ctx->mutex)
      
      	perf_ioctl()
       #4	  ctx = perf_event_ctx_lock()
	  _perf_ioctl()
      	    ftrace_profile_set_filter()
       #0	      mutex_lock(&event_mutex)
      
      Fudge it for now by noting that the tracepoint state does not depend
      on the event <-> context relation. Ugly though :/
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      43fa87f7
    • perf/core: Fix lock inversion between perf,trace,cpuhp · 82d94856
      Submitted by Peter Zijlstra
      Lockdep gifted us with noticing the following 4-way lockup scenario:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1             mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                 static_key_enable()
      
       #2     do_cpu_up()
                perf_event_init_cpu()
       #3         mutex_lock(&pmus_lock)
       #4         mutex_lock(&ctx->mutex)
      
              perf_event_task_disable()
                mutex_lock(&current->perf_event_mutex)
       #4       ctx = perf_event_ctx_lock()
       #5       perf_event_for_each_child()
      
              do_exit()
                task_work_run()
                  __fput()
                    perf_release()
                      perf_event_release_kernel()
       #4               mutex_lock(&ctx->mutex)
       #5               mutex_lock(&event->child_mutex)
                        free_event()
                          _free_event()
                            event->destroy() := perf_trace_destroy
       #0                     mutex_lock(&event_mutex);
      
      Fix that by moving the free_event() out from under the locks.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      82d94856
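      A generic, illustrative sketch of the "free outside the locks" pattern
      the fix applies (names are illustrative, not the actual perf code):
      detach under the locks, remember the object, destroy it only after
      everything is dropped.

        struct perf_event *free_ev = NULL;

        mutex_lock(&ctx->mutex);
        mutex_lock(&event->child_mutex);
        if (last_reference_gone(event)) {          /* illustrative predicate */
                list_del(&event->child_list);      /* detach while protected */
                free_ev = event;                   /* remember for later */
        }
        mutex_unlock(&event->child_mutex);
        mutex_unlock(&ctx->mutex);

        if (free_ev)
                free_event(free_ev);               /* may take event_mutex; no locks held now */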
  5. 24 January 2018, 6 commits
  6. 19 January 2018, 2 commits
    • tracing: Fix converting enums from the map in trace_event_eval_update() · 1ebe1eaf
      Submitted by Steven Rostedt (VMware)
      Since enums do not get converted by the TRACE_EVENT macro into their values,
      the event format displays the enum name and not the value. This breaks
      tools like perf and trace-cmd that need to interpret the raw binary data. To
      solve this, an enum map was created to convert these enums into their actual
      numbers on boot up. This is done by TRACE_EVENTS() adding a
      TRACE_DEFINE_ENUM() macro.
      
      Some enums were not being converted. This was caused by an optimization that
      had a bug in it.
      
      All calls get checked against this enum map to see if they should be converted
      or not, comparing the call's system to the system that the enum map
      was created under. If they match, then the call is processed.
      
      To cut down on the number of iterations needed to find the maps with a
      matching system, since calls and maps are grouped by system, when a match is
      made, the index into the map array is saved, so that the next call, if it
      belongs to the same system as the previous call, could start right at that
      array index and not have to scan all the previous arrays.
      
      The problem was, the saved index was used as the variable to know if this is
      a call in a new system or not. If the index was zero, it was assumed that
      the call is in a new system and would keep incrementing the saved index
      until it found a matching system. The issue arises when the first matching
      system was at index zero. The next map, if it belonged to the same system,
      would then think it was the first match and increment the index to one. If
      the next call belonged to the same system, it would begin its search of the
      maps off by one, and miss the first enum that should be converted. This left
      a single enum not converted properly.
      
      Also add a comment to describe exactly what that index was for. It took me a
      bit too long to figure out what I was thinking when debugging this issue.
      
      Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0c564a53 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1ebe1eaf
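      An illustrative, self-contained demo of the bug class (not the kernel
      code): using 0 both as "no saved position yet" and as a legitimate
      saved index misbehaves exactly when the first match sits at index 0;
      an explicit sentinel avoids it.

        #include <stdio.h>
        #include <string.h>

        static int saved = -1;  /* fix: -1 means "nothing saved yet"; 0 stays a valid index */

        static int first_map_for(const char *sys, const char *maps[], int n)
        {
                int i = (saved < 0) ? 0 : saved;   /* buggy variant effectively did: saved ? saved : 0 */

                for (; i < n; i++) {
                        if (strcmp(maps[i], sys) == 0) {
                                saved = i;         /* remembering index 0 is now harmless */
                                return i;
                        }
                }
                return -1;
        }

        int main(void)
        {
                const char *maps[] = { "sched", "sched", "irq" };

                printf("%d\n", first_map_for("sched", maps, 3));  /* 0 */
                printf("%d\n", first_map_for("sched", maps, 3));  /* still 0, not off by one */
                return 0;
        }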
    • ring-buffer: Fix duplicate results in mapping context to bits in recursive lock · 0164e0d7
      Submitted by Steven Rostedt (VMware)
      In bringing back the context checks, the code first checks whether it is in
      normal (non-interrupt) context, and then checks for NMI, then IRQ, then softirq.
      The final check is redundant: since the if branch is only hit if the context is
      one of NMI, IRQ, or SOFTIRQ, and it is not NMI or IRQ, there's no reason to
      check whether it is SOFTIRQ. The current code returns the same result even if
      it's not a SOFTIRQ, which is confusing.
      
        pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ
      
      Is redundant as RB_CTX_SOFTIRQ *is* 2!
      
      Fixes: a0e3a18f ("ring-buffer: Bring back context level recursive checks")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      0164e0d7
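      A sketch of the simplified context mapping, assuming the RB_CTX_*
      values implied by the changelog (RB_CTX_SOFTIRQ == 2):

        unsigned long pc = preempt_count();
        int bit;

        if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
                bit = RB_CTX_NORMAL;
        else
                bit = pc & NMI_MASK ? RB_CTX_NMI :
                        pc & HARDIRQ_MASK ? RB_CTX_IRQ :
                                RB_CTX_SOFTIRQ;  /* was: pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ */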
  7. 18 January 2018, 4 commits
    • lockdep: Make lockdep checking constant · 08f36ff6
      Submitted by Matthew Wilcox
      There are several places in the kernel which would like to pass a const
      pointer to lockdep_is_held().  Constify the entire path so nobody has to
      trick the compiler.
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/20180117151414.23686-3-willy@infradead.org
      08f36ff6
    • lockdep: Assign lock keys on registration · 64f29d1b
      Submitted by Matthew Wilcox
      Lockdep is assigning lock keys when a lock was looked up.  This is
      unnecessary; if the lock has never been registered then it is known that it
      is not locked.  It also complicates the calling convention.  Switch to
      assigning the lock key in register_lock_class().
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/20180117151414.23686-2-willy@infradead.org
      64f29d1b
    • irq/matrix: Spread interrupts on allocation · a0c9259d
      Submitted by Thomas Gleixner
      Keith reported an issue with vector space exhaustion on a server machine
      which is caused by the i40e driver allocating 168 MSI interrupts when the
      driver is initialized, even when most of these interrupts are not used at
      all.
      
      The x86 vector allocation code tries to avoid the immediate allocation with
      the reservation mode, but the card uses MSI and does not support MSI entry
      masking, which prevents reservation mode and requires immediate vector
      allocation.
      
      The matrix allocator is a bit naive and prefers the first CPU in the
      cpumask which describes the possible target CPUs for an allocation. That
      results in allocating all 168 vectors on CPU0 which later causes vector
      space exhaustion when the NVMe driver tries to allocate managed interrupts
      on each CPU for the per CPU queues.
      
      Avoid this by finding the CPU which has the lowest vector allocation count
      to spread out the non-managed interrupts across the possible target CPUs.
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Keith Busch <keith.busch@intel.com>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos
      a0c9259d
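      A simplified sketch of the spreading decision (the per-CPU map layout
      is assumed): instead of taking the first CPU in the mask, pick the
      online CPU with the most available vectors, i.e. the lowest allocation
      count.

        static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
                                                 const struct cpumask *msk)
        {
                unsigned int cpu, best_cpu = UINT_MAX, maxavl = 0;

                for_each_cpu(cpu, msk) {
                        struct cpumap *cm = per_cpu_ptr(m->maps, cpu);

                        if (!cm->online || cm->available <= maxavl)
                                continue;

                        best_cpu = cpu;
                        maxavl = cm->available;
                }
                return best_cpu;
        }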
    • bpf: mark dst unknown on inconsistent {s, u}bounds adjustments · 6f16101e
      Submitted by Daniel Borkmann
      syzkaller generated a BPF proglet and triggered a warning with
      the following:
      
        0: (b7) r0 = 0
        1: (d5) if r0 s<= 0x0 goto pc+0
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        2: (1f) r0 -= r1
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        verifier internal error: known but bad sbounds
      
      What happens is that in the first insn, r0's min/max values
      are both 0 due to the immediate assignment. Later, in the jsle
      test, the bounds are updated for the min value in the false
      path, meaning they yield smin_val = 1, smax_val = 0, and when
      the ctx pointer is subtracted from r0, the verifier bails out with the
      internal error and throws a WARN since smin_val != smax_val
      for the known constant.
      
      For min_val > max_val scenario it means that reg_set_min_max()
      and reg_set_min_max_inv() (which both refine existing bounds)
      demonstrated that such branch cannot be taken at runtime.
      
      In above scenario for the case where it will be taken, the
      existing [0, 0] bounds are kept intact. Meaning, the rejection
      is not due to a verifier internal error, and therefore the
      WARN() is not necessary either.
      
      We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
      when either known scalars have smin_val != smax_val or
      umin_val != umax_val or any scalar reg with bounds
      smin_val > smax_val or umin_val > umax_val. However, there
      may be a small risk of breakage of buggy programs, so handle
      this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
      just taint the dst reg as unknown scalar when we see ops with
      such kind of src reg.
      
      Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6f16101e
  8. 17 January 2018, 3 commits
  9. 16 January 2018, 11 commits
    • hrtimer: Implement SOFT/HARD clock base selection · 42f42da4
      Submitted by Anna-Maria Gleixner
      All prerequisites to handle hrtimers for expiry in either hard or soft
      interrupt context are in place.
      
      Add the missing bit in hrtimer_init() which associates the timer to the
      hard or the softirq clock base.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-30-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      42f42da4
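      A usage sketch from a timer user's point of view (callback body
      illustrative): initializing a hrtimer with one of the new *_SOFT modes
      makes its callback run in softirq context, removing the need for the
      hrtimer-plus-tasklet construct.

        static struct hrtimer my_timer;

        static enum hrtimer_restart my_timer_fn(struct hrtimer *t)
        {
                /* Runs in softirq context now; softirq locking rules apply. */
                return HRTIMER_NORESTART;
        }

        static void my_timer_setup(void)
        {
                hrtimer_init(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
                my_timer.function = my_timer_fn;
                hrtimer_start(&my_timer, ms_to_ktime(10), HRTIMER_MODE_REL_SOFT);
        }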
    • hrtimer: Implement support for softirq based hrtimers · 5da70160
      Submitted by Anna-Maria Gleixner
      hrtimer callbacks are always invoked in hard interrupt context. Several
      users in tree require soft interrupt context for their callbacks and
      achieve this by combining a hrtimer with a tasklet. The hrtimer schedules
      the tasklet in hard interrupt context and the tasklet callback gets invoked
      in softirq context later.
      
      That's suboptimal, and aside from that the real-time patch set moves most of the
      hrtimers into softirq context. So adding native support for hrtimers
      expiring in softirq context is a valuable extension for both mainline and
      the RT patch set.
      
      Each valid hrtimer clock id has two associated hrtimer clock bases: one for
      timers expiring in hardirq context and one for timers expiring in softirq
      context.
      
      Implement the functionality to associate a hrtimer with the hard or softirq
      related clock bases and update the relevant functions to take them into
      account when the next expiry time needs to be evaluated.
      
      Add a check into the hard interrupt context handler functions to check
      whether the first expiring softirq based timer has expired. If it's expired
      the softirq is raised and the accounting of softirq based timers to
      evaluate the next expiry time for programming the timer hardware is skipped
      until the softirq processing has finished. At the end of the softirq
      processing the regular processing is resumed.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-29-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5da70160
    • delayacct: Account blkio completion on the correct task · c96f5471
      Submitted by Josh Snyder
      Before commit:
      
        e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      
      delayacct_blkio_end() was called after context-switching into the task which
      completed I/O.
      
      This resulted in double counting: the task would account a delay both waiting
      for I/O and for time spent in the runqueue.
      
      With e33a9bba, delayacct_blkio_end() is called by try_to_wake_up().
      In ttwu, we have not yet context-switched. This is more correct, in that
      the delay accounting ends when the I/O is complete.
      
      But delayacct_blkio_end() relies on 'get_current()', and we have not yet
      context-switched into the task whose I/O completed. This results in the
      wrong task having its delay accounting statistics updated.
      
      Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
      so that it can update the statistics of the correct task.
      Signed-off-by: Josh Snyder <joshs@netflix.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-block@vger.kernel.org
      Fixes: e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c96f5471
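      A sketch of the interface change (delayacct internals elided): the
      helper now takes the task being woken, and the wake-up path passes the
      wakee instead of relying on current.

        /* Before: void delayacct_blkio_end(void);  After: */
        void delayacct_blkio_end(struct task_struct *p);

        /* try_to_wake_up(), simplified -- 'p' is the wakee, not 'current': */
        if (p->in_iowait) {
                delayacct_blkio_end(p);
                atomic_dec(&task_rq(p)->nr_iowait);
        }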
    • hrtimer: Prepare handling of hard and softirq based hrtimers · c458b1d1
      Submitted by Anna-Maria Gleixner
      The softirq based hrtimers can utilize most of the existing hrtimer
      functions, but need to operate on a different data set.
      
      Add an 'active_mask' parameter to various functions so the hard and soft bases
      can be selected. Fixup the existing callers and hand in the ACTIVE_HARD
      mask.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-28-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c458b1d1
    • hrtimer: Add clock bases and hrtimer mode for softirq context · 98ecadd4
      Submitted by Anna-Maria Gleixner
      Currently hrtimer callback functions are always executed in hard interrupt
      context. Users of hrtimers, which need their timer function to be executed
      in soft interrupt context, make use of tasklets to get the proper context.
      
      Add additional hrtimer clock bases for timers which must expire in softirq
      context, so the detour via the tasklet can be avoided. This is also
      required for RT, where the majority of hrtimer is moved into softirq
      hrtimer context.
      
      The selection of the expiry mode happens via a mode bit. Introduce
      HRTIMER_MODE_SOFT and the matching combinations with the ABS/REL/PINNED
      bits and update the decoding of hrtimer_mode in tracepoints.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-27-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      98ecadd4
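      A sketch of the resulting mode encoding (the bit values are an
      assumption consistent with the description above): the new SOFT bit
      combines with the existing ABS/REL/PINNED bits.

        enum hrtimer_mode {
                HRTIMER_MODE_ABS        = 0x00,
                HRTIMER_MODE_REL        = 0x01,
                HRTIMER_MODE_PINNED     = 0x02,
                HRTIMER_MODE_SOFT       = 0x04,

                HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
                HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,

                HRTIMER_MODE_ABS_SOFT   = HRTIMER_MODE_ABS | HRTIMER_MODE_SOFT,
                HRTIMER_MODE_REL_SOFT   = HRTIMER_MODE_REL | HRTIMER_MODE_SOFT,

                HRTIMER_MODE_ABS_PINNED_SOFT = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_SOFT,
                HRTIMER_MODE_REL_PINNED_SOFT = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_SOFT,
        };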
    • hrtimer: Use irqsave/irqrestore around __run_hrtimer() · dd934aa8
      Submitted by Anna-Maria Gleixner
      __run_hrtimer() is called with the hrtimer_cpu_base.lock held and
      interrupts disabled. Before invoking the timer callback the base lock is
      dropped, but interrupts stay disabled.
      
      The upcoming support for softirq based hrtimers requires that interrupts
      are enabled before the timer callback is invoked.
      
      To avoid code duplication, take hrtimer_cpu_base.lock with
      raw_spin_lock_irqsave(flags) at the call site and hand in the flags as
      a parameter. So raw_spin_unlock_irqrestore() before the callback invocation
      will either keep interrupts disabled in interrupt context or restore to
      interrupt enabled state when called from softirq context.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-26-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dd934aa8
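      A sketch of the new calling convention (simplified): the caller takes
      the base lock with irqsave and hands the flags down, so the unlock
      before the callback restores the caller's interrupt state.

        /* Call site (hard interrupt or, later, softirq expiry path): */
        unsigned long flags;

        raw_spin_lock_irqsave(&cpu_base->lock, flags);
        /* ... select the expired timer ... */
        __run_hrtimer(cpu_base, base, timer, &basenow, flags);
        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

        /* Inside __run_hrtimer(), around the callback invocation: */
        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);  /* irqs stay off for hardirq
                                                                callers, back on for softirq */
        restart = fn(timer);
        raw_spin_lock_irq(&cpu_base->lock);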
    • hrtimer: Factor out __hrtimer_next_event_base() · ad38f596
      Submitted by Anna-Maria Gleixner
      Preparatory patch for softirq based hrtimers to avoid code duplication.
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-25-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ad38f596
    • hrtimer: Factor out __hrtimer_start_range_ns() · 138a6b7a
      Submitted by Anna-Maria Gleixner
      Preparatory patch for softirq based hrtimers to avoid code duplication,
      factor out the __hrtimer_start_range_ns() function from hrtimer_start_range_ns().
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-24-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      138a6b7a
    • hrtimer: Remove the 'base' parameter from hrtimer_reprogram() · 3ec7a3ee
      Submitted by Anna-Maria Gleixner
      hrtimer_reprogram() must have access to the hrtimer_clock_base of the new
      first expiring timer to access hrtimer_clock_base.offset for adjusting the
      expiry time to CLOCK_MONOTONIC. This is required to evaluate whether the
      new left most timer in the hrtimer_clock_base is the first expiring timer
      of all clock bases in a hrtimer_cpu_base.
      
      The only user of hrtimer_reprogram() is hrtimer_start_range_ns(), which
      already has a pointer to the hrtimer_clock_base and hands it in as a parameter.
      But hrtimer_start_range_ns() will be split for the upcoming support for softirq
      based hrtimers to avoid code duplication, and will lose the direct access to
      the clock base pointer.
      
      Instead of handing in both the timer and timer->base as parameters, remove the
      base parameter from hrtimer_reprogram() and retrieve the clock base internally.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-23-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3ec7a3ee
    • hrtimer: Make remote enqueue decision less restrictive · 2ac2dccc
      Submitted by Anna-Maria Gleixner
      The current decision whether a timer can be queued on a remote CPU checks
      for timer->expiry <= remote_cpu_base.expires_next.
      
      This is too restrictive because a timer with the same expiry time as an
      existing timer will be enqueued on the right-hand side of the existing timer
      inside the rbtree, i.e. behind the first expiring timer.
      
      So it's safe to allow enqueuing timers with the same expiry time as the
      first expiring timer on a remote CPU base.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-22-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2ac2dccc
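      The relaxed check, sketched (hrtimer_check_target() returning nonzero
      means the timer must not be enqueued on that remote base):

        return expires < new_base->cpu_base->expires_next;   /* was: expires <= ... */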
    • hrtimer: Unify remote enqueue handling · 14c80341
      Submitted by Anna-Maria Gleixner
      hrtimer_reprogram() is conditionally invoked from hrtimer_start_range_ns()
      when hrtimer_cpu_base.hres_active is true.
      
      In the !hres_active case there is a special condition for the nohz_active
      case:
      
        If the newly enqueued timer expires before the first expiring timer on a
        remote CPU then the remote CPU needs to be notified and woken up from a
        NOHZ idle sleep to take the new first expiring timer into account.
      
      Previous changes have already established the prerequisites to make the
      remote enqueue behaviour the same whether high resolution mode is active or
      not:
      
        If the to be enqueued timer expires before the first expiring timer on a
        remote CPU, then it cannot be enqueued there.
      
      This was done for the high resolution mode because there is no way to
      access the remote CPU timer hardware. The same is true for NOHZ, but was
      handled differently by unconditionally enqueuing the timer and waking up
      the remote CPU so it can reprogram its timer. Again there is no compelling
      reason for this difference.
      
      hrtimer_check_target(), which makes the 'can remote enqueue' decision is
      already unconditional, but not yet functional because nothing updates
      hrtimer_cpu_base.expires_next in the !hres_active case.
      
      To unify this the following changes are required:
      
       1) Make the store of the new first expiry time unconditional in
          hrtimer_reprogram() and check __hrtimer_hres_active() before proceeding
          to the actual hardware access. This check also lets the compiler
          eliminate the rest of the function in case of CONFIG_HIGH_RES_TIMERS=n.
      
       2) Invoke hrtimer_reprogram() unconditionally from
          hrtimer_start_range_ns()
      
       3) Remove the remote wakeup special case for the !high_res && nohz_active
          case.
      
      Confine the timers_nohz_active static key to timer.c which is the only user
      now.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-21-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      14c80341