1. 06 Feb 2018, 1 commit
  2. 27 Jan 2018, 1 commit
    • hrtimer: Reset hrtimer cpu base proper on CPU hotplug · d5421ea4
      Thomas Gleixner authored
      The hrtimer interrupt code contains a hang detection and mitigation
      mechanism, which prevents a long delayed hrtimer interrupt from causing
      continuous retriggering of interrupts that keep the system from making
      progress. If a hang is detected then the timer hardware is programmed with
      a certain delay into the future and a flag is set in the hrtimer cpu base
      which prevents newly enqueued timers from reprogramming the timer hardware
      prior to the chosen delay. The subsequent hrtimer interrupt after the delay
      clears the flag and resumes normal operation.
      
      If such a hang happens in the last hrtimer interrupt before a CPU is
      unplugged then the hang_detected flag is set and stays that way when the
      CPU is plugged in again. At that point the timer hardware is not armed and
      it cannot be armed because the hang_detected flag is still active, so
      nothing clears that flag. As a consequence the CPU does not receive hrtimer
      interrupts and no timers expire on that CPU which results in RCU stalls and
      other malfunctions.
      
      Clear the flag along with some other less critical members of the hrtimer
      cpu base to ensure starting from a clean state when a CPU is plugged in.
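
      A minimal sketch of that reset in the CPU preparation path, assuming the
      hrtimers_prepare_cpu() callback and these hrtimer_cpu_base field names
      (elided details are not part of the claim):

        int hrtimers_prepare_cpu(unsigned int cpu)
        {
                struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);

                /* ... per clock base initialization ... */

                /* Start from a clean state: no stale hang mitigation state */
                cpu_base->hang_detected = 0;
                cpu_base->next_timer = NULL;
                cpu_base->expires_next = KTIME_MAX;
                return 0;
        }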
      
      Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
      root cause of that hard to reproduce heisenbug. Once understood it's
      trivial and certainly justifies a brown paperbag.
      
      Fixes: 41d2e494 ("hrtimer: Tune hrtimer_interrupt hang logic")
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Sewior <bigeasy@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
  3. 26 Jan 2018, 1 commit
    • module/retpoline: Warn about missing retpoline in module · caf7501a
      Andi Kleen authored
      There's a risk that a kernel which has full retpoline mitigations becomes
      vulnerable when a module gets loaded that hasn't been compiled with the
      right compiler or the right option.
      
      To enable detection of that mismatch at module load time, add a module info
      string "retpoline" at build time when the module was compiled with
      retpoline support. This only covers compiled C source; assembler source
      and prebuilt object files are not checked.
      
      If a retpoline enabled kernel detects a non retpoline protected module at
      load time, print a warning and report it in the sysfs vulnerability file.
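
      A sketch of the two halves of that mechanism, assuming the RETPOLINE
      build define and the loader helper names shown here:

        /* Build side (module header): tag modules built with a retpoline compiler */
        #ifdef RETPOLINE
        MODULE_INFO(retpoline, "Y");
        #endif

        /* Load side (module loader): warn when the tag is missing */
        static void check_modinfo_retpoline(struct module *mod, struct load_info *info)
        {
                if (retpoline_module_ok(get_modinfo(info, "retpoline")))
                        return;

                pr_warn("%s: loading module not compiled with retpoline compiler.\n",
                        mod->name);
        }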
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: gregkh@linuxfoundation.org
      Cc: torvalds@linux-foundation.org
      Cc: jeyu@kernel.org
      Cc: arjan@linux.intel.com
      Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org
  4. 25 Jan 2018, 3 commits
    • perf/core: Fix ctx::mutex deadlock · 0c7296ca
      Peter Zijlstra authored
      Lockdep noticed the following 3-way lockup scenario:
      
      	sys_perf_event_open()
      	  perf_event_alloc()
      	    perf_try_init_event()
       #0	      ctx = perf_event_ctx_lock_nested(1)
      	      perf_swevent_init()
      		swevent_hlist_get()
       #1		  mutex_lock(&pmus_lock)
      
      	perf_event_init_cpu()
       #1	  mutex_lock(&pmus_lock)
       #2	  mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #2	   mutex_lock()
       #0	   mutex_lock_nested()
      
      And while we need that perf_event_ctx_lock_nested() for HW PMUs such
      that they can iterate the sibling list, trying to match it to the
      available counters, the software PMUs need do no such thing. Exclude
      them.
      
      In particular the swevent triggers the above inversion, while the
      tpevent PMU triggers a more elaborate one through their event_mutex.
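
      A sketch of the exclusion in perf_try_init_event(), assuming
      perf_sw_context marks the software PMUs (simplified from the
      description above):

        static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
        {
                struct perf_event_context *ctx = NULL;
                int ret;

                /*
                 * HW PMUs iterate the sibling list to match counters, which
                 * requires ctx->mutex. Software PMUs do not, so skip the
                 * (deadlock prone) nested ctx lock for them.
                 */
                if (event->group_leader != event &&
                    pmu->task_ctx_nr != perf_sw_context) {
                        ctx = perf_event_ctx_lock_nested(event->group_leader,
                                                         SINGLE_DEPTH_NESTING);
                }

                event->pmu = pmu;
                ret = pmu->event_init(event);

                if (ctx)
                        perf_event_ctx_unlock(event->group_leader, ctx);

                return ret;
        }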
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix another perf,trace,cpuhp lock inversion · 43fa87f7
      Peter Zijlstra authored
      Lockdep noticed the following 3-way lockup race:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1              mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                  static_key_enable()
      
       #2	do_cpu_up()
      	  perf_event_init_cpu()
       #3	    mutex_lock(&pmus_lock)
       #4	    mutex_lock(&ctx->mutex)
      
      	perf_ioctl()
       #4	  ctx = perf_event_ctx_lock()
	  _perf_ioctl()
      	    ftrace_profile_set_filter()
       #0	      mutex_lock(&event_mutex)
      
      Fudge it for now by noting that the tracepoint state does not depend
      on the event <-> context relation. Ugly though :/
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix lock inversion between perf,trace,cpuhp · 82d94856
      Peter Zijlstra authored
      Lockdep gifted us with noticing the following 4-way lockup scenario:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1             mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                 static_key_enable()
      
       #2     do_cpu_up()
                perf_event_init_cpu()
       #3         mutex_lock(&pmus_lock)
       #4         mutex_lock(&ctx->mutex)
      
              perf_event_task_disable()
                mutex_lock(&current->perf_event_mutex)
       #4       ctx = perf_event_ctx_lock()
       #5       perf_event_for_each_child()
      
              do_exit()
                task_work_run()
                  __fput()
                    perf_release()
                      perf_event_release_kernel()
       #4               mutex_lock(&ctx->mutex)
       #5               mutex_lock(&event->child_mutex)
                        free_event()
                          _free_event()
                            event->destroy() := perf_trace_destroy
       #0                     mutex_lock(&event_mutex);
      
      Fix that by moving the free_event() out from under the locks.
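
      The pattern, roughly (a sketch assuming the surrounding
      perf_event_release_kernel() code; the free_list name follows the
      description, not a verified diff):

        struct perf_event *child, *tmp;
        LIST_HEAD(free_list);

        mutex_lock(&ctx->mutex);
        mutex_lock(&event->child_mutex);
        /* Instead of freeing the child here, queue it ... */
        list_move(&child->child_list, &free_list);
        mutex_unlock(&event->child_mutex);
        mutex_unlock(&ctx->mutex);

        /* ... and free it outside the lock chain, where event->destroy()
         * can safely take event_mutex. */
        list_for_each_entry_safe(child, tmp, &free_list, child_list) {
                list_del(&child->child_list);
                free_event(child);
        }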
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 24 Jan 2018, 6 commits
  6. 19 Jan 2018, 2 commits
    • tracing: Fix converting enum's from the map in trace_event_eval_update() · 1ebe1eaf
      Steven Rostedt (VMware) authored
      Since enums do not get converted by the TRACE_EVENT macro into their values,
      the event format displays the enum name and not the value. This breaks
      tools like perf and trace-cmd that need to interpret the raw binary data. To
      solve this, an enum map was created to convert these enums into their actual
      numbers on boot up. This is done by TRACE_EVENTS() adding a
      TRACE_DEFINE_ENUM() macro.
      
      Some enums were not being converted. This was caused by an optimization that
      had a bug in it.
      
      All calls get checked against this enum map to see if they should be
      converted or not, and the call's system is compared to the system that the
      enum map was created under. If they match, then the call is processed.
      
      To cut down on the number of iterations needed to find the maps with a
      matching system, since calls and maps are grouped by system, when a match is
      made, the index into the map array is saved, so that the next call, if it
      belongs to the same system as the previous call, could start right at that
      array index and not have to scan all the previous arrays.
      
      The problem was that the saved index was used as the variable to know if this is
      a call in a new system or not. If the index was zero, it was assumed that
      the call is in a new system and would keep incrementing the saved index
      until it found a matching system. The issue arises when the first matching
      system was at index zero. The next map, if it belonged to the same system,
      would then think it was the first match and increment the index to one. If
      the next call belonged to the same system, it would begin its search of the
      maps off by one, and miss the first enum that should be converted. This left
      a single enum not converted properly.
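
      In essence, the bug is using index zero as the "no match yet" sentinel.
      A sketch of the two variants (illustrative only, not the literal patch):

        /* Buggy: a first match at array index 0 looks like "no match yet",
         * so the next call's scan starts one map too late. */
        if (!last_i)
                last_i = i;

        /* Fixed: track "first match of this system" with an explicit flag
         * instead of overloading index 0. */
        if (first) {
                last_i = i;
                first = false;
        }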
      
      Also add a comment to describe exactly what that index was for. It took me a
      bit too long to figure out what I was thinking when debugging this issue.
      
      Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0c564a53 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ring-buffer: Fix duplicate results in mapping context to bits in recursive lock · 0164e0d7
      Steven Rostedt (VMware) authored
      In bringing back the context checks, the code checks first if it's normal
      (non-interrupt) context, and then for NMI then IRQ then softirq. The final
      check is redundant: since the if branch is only hit if the context is one of
      NMI, IRQ, or SOFTIRQ, if it's not NMI or IRQ there's no reason to check
      whether it is SOFTIRQ. The current code returns the same result even if it's
      not a SOFTIRQ, which is confusing:
      
        pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ
      
      The ternary is redundant, as RB_CTX_SOFTIRQ *is* 2!
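
      The resulting mapping reads roughly like this (a sketch based on the
      description above, not the verbatim diff):

        if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
                bit = RB_CTX_NORMAL;
        else
                /* in the interrupt branch, not-NMI and not-IRQ implies softirq */
                bit = pc & NMI_MASK ? RB_CTX_NMI :
                      pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ;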
      
      Fixes: a0e3a18f ("ring-buffer: Bring back context level recursive checks")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  7. 18 Jan 2018, 4 commits
    • lockdep: Make lockdep checking constant · 08f36ff6
      Matthew Wilcox authored
      There are several places in the kernel which would like to pass a const
      pointer to lockdep_is_held().  Constify the entire path so nobody has to
      trick the compiler.
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/20180117151414.23686-3-willy@infradead.org
    • lockdep: Assign lock keys on registration · 64f29d1b
      Matthew Wilcox authored
      Lockdep assigns lock keys when a lock is looked up.  This is
      unnecessary; if the lock has never been registered then it is known that it
      is not locked.  It also complicates the calling convention.  Switch to
      assigning the lock key in register_lock_class().
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/20180117151414.23686-2-willy@infradead.org
    • irq/matrix: Spread interrupts on allocation · a0c9259d
      Thomas Gleixner authored
      Keith reported an issue with vector space exhaustion on a server machine
      which is caused by the i40e driver allocating 168 MSI interrupts when the
      driver is initialized, even when most of these interrupts are not used at
      all.
      
      The x86 vector allocation code tries to avoid the immediate allocation with
      the reservation mode, but the card uses MSI and does not support MSI entry
      masking, which prevents reservation mode and requires immediate vector
      allocation.
      
      The matrix allocator is a bit naive and prefers the first CPU in the
      cpumask which describes the possible target CPUs for an allocation. That
      results in allocating all 168 vectors on CPU0 which later causes vector
      space exhaustion when the NVMe driver tries to allocate managed interrupts
      on each CPU for the per CPU queues.
      
      Avoid this by finding the CPU which has the lowest vector allocation count
      to spread out the non-managed interrupts across the possible target CPUs.
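
      A sketch of the selection, assuming a matrix_find_best_cpu() helper that
      scans the allowed mask for the CPU with the most available vectors
      (equivalent to the lowest allocation count):

        static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
                                                 const struct cpumask *msk)
        {
                unsigned int cpu, best_cpu = UINT_MAX, maxavl = 0;
                struct cpumap *cm;

                for_each_cpu(cpu, msk) {
                        cm = per_cpu_ptr(m->maps, cpu);

                        /* Skip offline CPUs and CPUs with fewer free vectors */
                        if (!cm->online || cm->available <= maxavl)
                                continue;

                        best_cpu = cpu;
                        maxavl = cm->available;
                }
                return best_cpu;
        }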
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Keith Busch <keith.busch@intel.com>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos
    • bpf: mark dst unknown on inconsistent {s, u}bounds adjustments · 6f16101e
      Daniel Borkmann authored
      syzkaller generated a BPF proglet and triggered a warning with
      the following:
      
        0: (b7) r0 = 0
        1: (d5) if r0 s<= 0x0 goto pc+0
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        2: (1f) r0 -= r1
         R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
        verifier internal error: known but bad sbounds
      
      What happens is that in the first insn, r0's min/max value
      are both 0 due to the immediate assignment, later in the jsle
      test the bounds are updated for the min value in the false
      path, meaning, they yield smin_val = 1, smax_val = 0, and when
      ctx pointer is subtracted from r0, verifier bails out with the
      internal error and throwing a WARN since smin_val != smax_val
      for the known constant.
      
      For the min_val > max_val scenario this means that reg_set_min_max()
      and reg_set_min_max_inv() (which both refine existing bounds) have
      demonstrated that such a branch cannot be taken at runtime.
      
      In above scenario for the case where it will be taken, the
      existing [0, 0] bounds are kept intact. Meaning, the rejection
      is not due to a verifier internal error, and therefore the
      WARN() is not necessary either.
      
      We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
      when either known scalars have smin_val != smax_val or
      umin_val != umax_val or any scalar reg with bounds
      smin_val > smax_val or umin_val > umax_val. However, there
      may be a small risk of breakage of buggy programs, so handle
      this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
      just taint the dst reg as unknown scalar when we see ops with
      such kind of src reg.
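
      A sketch of the taint check in adjust_{ptr,scalar}_min_max_vals(), with
      the condition reconstructed from the description above (treat the
      details as assumptions):

        if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
            smin_val > smax_val || umin_val > umax_val) {
                /* Inconsistent bounds can only come from branches the
                 * verifier proved dead: don't WARN, just make the
                 * destination register an unknown scalar. */
                __mark_reg_unknown(dst_reg);
                return 0;
        }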
      
      Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  8. 17 Jan 2018, 3 commits
  9. 16 Jan 2018, 19 commits
    • hrtimer: Implement SOFT/HARD clock base selection · 42f42da4
      Anna-Maria Gleixner authored
      All prerequisites to handle hrtimers for expiry in either hard or soft
      interrupt context are in place.
      
      Add the missing bit to hrtimer_init() which associates the timer with the
      hard or the softirq clock base.
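
      A sketch of the selection in __hrtimer_init(), assuming the soft clock
      bases are laid out in the upper half of the clock_base array:

        bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
        int base = softtimer ? HRTIMER_MAX_CLOCK_BASES / 2 : 0;

        /* ... clock_id fixups ... */
        base += hrtimer_clockid_to_base(clock_id);
        timer->is_soft = softtimer;
        timer->base = &cpu_base->clock_base[base];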
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-30-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Implement support for softirq based hrtimers · 5da70160
      Anna-Maria Gleixner authored
      hrtimer callbacks are always invoked in hard interrupt context. Several
      users in tree require soft interrupt context for their callbacks and
      achieve this by combining a hrtimer with a tasklet. The hrtimer schedules
      the tasklet in hard interrupt context and the tasklet callback gets invoked
      in softirq context later.
      
      That's suboptimal and, aside from that, the real-time patch moves most of the
      hrtimers into softirq context. So adding native support for hrtimers
      expiring in softirq context is a valuable extension for both mainline and
      the RT patch set.
      
      Each valid hrtimer clock id has two associated hrtimer clock bases: one for
      timers expiring in hardirq context and one for timers expiring in softirq
      context.
      
      Implement the functionality to associate a hrtimer with the hard or softirq
      related clock bases and update the relevant functions to take them into
      account when the next expiry time needs to be evaluated.
      
      Add a check to the hard interrupt context handler functions to see whether
      the first expiring softirq based timer has expired. If it has expired,
      the softirq is raised and the accounting of softirq based timers to
      evaluate the next expiry time for programming the timer hardware is skipped
      until the softirq processing has finished. At the end of the softirq
      processing the regular processing is resumed.
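
      A sketch of the hard interrupt side, assuming the softirq_expires_next
      and softirq_activated field names used by this series:

        /* In the hard interrupt expiry path: */
        if (!ktime_before(now, cpu_base->softirq_expires_next)) {
                /* Skip soft bases in expiry evaluation until the softirq ran */
                cpu_base->softirq_expires_next = KTIME_MAX;
                cpu_base->softirq_activated = 1;
                raise_softirq_irqoff(HRTIMER_SOFTIRQ);
        }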
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-29-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • delayacct: Account blkio completion on the correct task · c96f5471
      Josh Snyder authored
      Before commit:
      
        e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      
      delayacct_blkio_end() was called after context-switching into the task which
      completed I/O.
      
      This resulted in double counting: the task would account a delay both waiting
      for I/O and for time spent in the runqueue.
      
      With e33a9bba, delayacct_blkio_end() is called by try_to_wake_up().
      In ttwu, we have not yet context-switched. This is more correct, in that
      the delay accounting ends when the I/O is complete.
      
      But delayacct_blkio_end() relies on 'get_current()', and we have not yet
      context-switched into the task whose I/O completed. This results in the
      wrong task having its delay accounting statistics updated.
      
      Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
      so that it can update the statistics of the correct task.
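
      A sketch of the resulting interface (the signature follows the
      description; body details are assumed):

        /* try_to_wake_up() passes the task whose I/O completed: */
        if (p->in_iowait) {
                delayacct_blkio_end(p);
                atomic_dec(&task_rq(p)->nr_iowait);
        }

        /* ... which charges the delay to that task, not to current: */
        void __delayacct_blkio_end(struct task_struct *p)
        {
                /* account into p->delays instead of current->delays */
        }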
      Signed-off-by: Josh Snyder <joshs@netflix.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-block@vger.kernel.org
      Fixes: e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Prepare handling of hard and softirq based hrtimers · c458b1d1
      Anna-Maria Gleixner authored
      The softirq based hrtimer code can utilize most of the existing hrtimer
      functions, but needs to operate on a different data set.
      
      Add an 'active_mask' parameter to various functions so the hard and soft bases
      can be selected. Fix up the existing callers and hand in the ACTIVE_HARD
      mask.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-28-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Add clock bases and hrtimer mode for softirq context · 98ecadd4
      Anna-Maria Gleixner authored
      Currently hrtimer callback functions are always executed in hard interrupt
      context. Users of hrtimers, which need their timer function to be executed
      in soft interrupt context, make use of tasklets to get the proper context.
      
      Add additional hrtimer clock bases for timers which must expire in softirq
      context, so the detour via the tasklet can be avoided. This is also
      required for RT, where the majority of hrtimers are moved into softirq
      context.
      
      The selection of the expiry mode happens via a mode bit. Introduce
      HRTIMER_MODE_SOFT and the matching combinations with the ABS/REL/PINNED
      bits and update the decoding of hrtimer_mode in tracepoints.
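
      A sketch of the extended mode enum (the exact values are assumed for
      illustration):

        enum hrtimer_mode {
                HRTIMER_MODE_ABS      = 0x00,
                HRTIMER_MODE_REL      = 0x01,
                HRTIMER_MODE_PINNED   = 0x02,
                HRTIMER_MODE_SOFT     = 0x04,

                HRTIMER_MODE_ABS_SOFT = HRTIMER_MODE_ABS | HRTIMER_MODE_SOFT,
                HRTIMER_MODE_REL_SOFT = HRTIMER_MODE_REL | HRTIMER_MODE_SOFT,
                /* ... plus the PINNED combinations ... */
        };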
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-27-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Use irqsave/irqrestore around __run_hrtimer() · dd934aa8
      Anna-Maria Gleixner authored
      __run_hrtimer() is called with the hrtimer_cpu_base.lock held and
      interrupts disabled. Before invoking the timer callback the base lock is
      dropped, but interrupts stay disabled.
      
      The upcoming support for softirq based hrtimers requires that interrupts
      are enabled before the timer callback is invoked.
      
      To avoid code duplication, take hrtimer_cpu_base.lock with
      raw_spin_lock_irqsave(flags) at the call site and hand in the flags as
      a parameter. So raw_spin_unlock_irqrestore() before the callback invocation
      will either keep interrupts disabled in interrupt context or restore to
      interrupt enabled state when called from softirq context.
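
      A sketch of the reworked locking, with the flags handed through to
      __run_hrtimer() (simplified; surrounding code elided):

        raw_spin_lock_irqsave(&cpu_base->lock, flags);
        /* ... pick the expired timer ... */
        __run_hrtimer(cpu_base, base, timer, &basenow, flags);

        static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
                                  struct hrtimer_clock_base *base,
                                  struct hrtimer *timer, ktime_t *now,
                                  unsigned long flags)
        {
                /* Restores the caller's interrupt state: still disabled when
                 * called in hard interrupt context, re-enabled when called
                 * from the softirq expiry path. */
                raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
                restart = fn(timer);
                raw_spin_lock_irq(&cpu_base->lock);
        }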
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-26-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Factor out __hrtimer_next_event_base() · ad38f596
      Anna-Maria Gleixner authored
      Preparatory patch for softirq based hrtimers to avoid code duplication.
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-25-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Factor out __hrtimer_start_range_ns() · 138a6b7a
      Anna-Maria Gleixner authored
      Preparatory patch for softirq based hrtimers to avoid code duplication,
      factor out the __hrtimer_start_range_ns() function from hrtimer_start_range_ns().
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-24-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Remove the 'base' parameter from hrtimer_reprogram() · 3ec7a3ee
      Anna-Maria Gleixner authored
      hrtimer_reprogram() must have access to the hrtimer_clock_base of the new
      first expiring timer to access hrtimer_clock_base.offset for adjusting the
      expiry time to CLOCK_MONOTONIC. This is required to evaluate whether the
      new left most timer in the hrtimer_clock_base is the first expiring timer
      of all clock bases in a hrtimer_cpu_base.
      
      The only user of hrtimer_reprogram() is hrtimer_start_range_ns(), which
      already has a pointer to the hrtimer_clock_base and hands it in as a parameter. But
      hrtimer_start_range_ns() will be split for the upcoming support for softirq
      based hrtimers to avoid code duplication and will lose the direct access to
      the clock base pointer.
      
      Instead of handing in timer and timer->base as parameters, remove the base
      parameter from hrtimer_reprogram() and retrieve the clock base internally.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-23-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Make remote enqueue decision less restrictive · 2ac2dccc
      Anna-Maria Gleixner authored
      The current decision whether a timer can be queued on a remote CPU checks
      for timer->expiry <= remote_cpu_base.expires_next.
      
      This is too restrictive because a timer with the same expiry time as an
      existing timer will be enqueued on the right-hand side of the existing timer
      inside the rbtree, i.e. behind the first expiring timer.
      
      So it's safe to allow enqueuing timers with the same expiry time as the
      first expiring timer on a remote CPU base.
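
      The check then relaxes from <= to < (a sketch of hrtimer_check_target();
      details assumed):

        static inline int hrtimer_check_target(struct hrtimer *timer,
                                               struct hrtimer_clock_base *new_base)
        {
                ktime_t expires;

                expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);
                /* Was 'expires <= expires_next'; equal expiry is safe now */
                return expires < new_base->cpu_base->expires_next;
        }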
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-22-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Unify remote enqueue handling · 14c80341
      Anna-Maria Gleixner authored
      hrtimer_reprogram() is conditionally invoked from hrtimer_start_range_ns()
      when hrtimer_cpu_base.hres_active is true.
      
      In the !hres_active case there is a special condition for the nohz_active
      case:
      
        If the newly enqueued timer expires before the first expiring timer on a
        remote CPU then the remote CPU needs to be notified and woken up from a
        NOHZ idle sleep to take the new first expiring timer into account.
      
      Previous changes have already established the prerequisites to make the
      remote enqueue behaviour the same whether high resolution mode is active or
      not:
      
        If the to be enqueued timer expires before the first expiring timer on a
        remote CPU, then it cannot be enqueued there.
      
      This was done for the high resolution mode because there is no way to
      access the remote CPU timer hardware. The same is true for NOHZ, but was
      handled differently by unconditionally enqueuing the timer and waking up
      the remote CPU so it can reprogram its timer. Again there is no compelling
      reason for this difference.
      
      hrtimer_check_target(), which makes the 'can remote enqueue' decision, is
      already unconditional, but not yet functional because nothing updates
      hrtimer_cpu_base.expires_next in the !hres_active case.
      
      To unify this the following changes are required:
      
       1) Make the store of the new first expiry time unconditional in
          hrtimer_reprogram() and check __hrtimer_hres_active() before proceeding
          to the actual hardware access. This check also lets the compiler
          eliminate the rest of the function in case of CONFIG_HIGH_RES_TIMERS=n.
      
       2) Invoke hrtimer_reprogram() unconditionally from
          hrtimer_start_range_ns()
      
       3) Remove the remote wakeup special case for the !high_res && nohz_active
          case.
      
      Confine the timers_nohz_active static key to timer.c which is the only user
      now.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-21-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Unify hrtimer removal handling · 61bb4bcb
      Anna-Maria Gleixner authored
      When the first hrtimer on the current CPU is removed,
      hrtimer_force_reprogram() is invoked but only when
      CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active is set.
      
      hrtimer_force_reprogram() updates hrtimer_cpu_base.expires_next and
      reprograms the clock event device. When CONFIG_HIGH_RES_TIMERS=y and
      hrtimer_cpu_base.hres_active is set, a pointless hrtimer interrupt can be
      prevented.
      
      hrtimer_check_target() makes the 'can remote enqueue' decision. As soon as
      hrtimer_check_target() is unconditionally available and
      hrtimer_cpu_base.expires_next is updated by hrtimer_reprogram(),
      hrtimer_force_reprogram() needs to be available unconditionally as well to
      prevent the following scenario with CONFIG_HIGH_RES_TIMERS=n:
      
      - the first hrtimer on this CPU is removed and hrtimer_force_reprogram() is
        not executed
      
      - CPU goes idle (next timer is calculated and hrtimers are taken into
        account)
      
      - a hrtimer is enqueued remote on the idle CPU: hrtimer_check_target()
        compares expiry value and hrtimer_cpu_base.expires_next. The expiry value
        is after expires_next, so the hrtimer is enqueued. This timer will fire
        late, if it expires before the effective first hrtimer on this CPU and
        the comparison was with an outdated expires_next value.
      
      To prevent this scenario, make hrtimer_force_reprogram() unconditional
      except the effective reprogramming part, which gets eliminated by the
      compiler in the CONFIG_HIGH_RES_TIMERS=n case.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-20-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Make hrtimer_force_reprogramm() unconditionally available · ebba2c72
      Anna-Maria Gleixner authored
      hrtimer_force_reprogram() needs to be available unconditionally for softirq
      based hrtimers. Move the function and all required struct members out of
      the CONFIG_HIGH_RES_TIMERS #ifdef.
      
      There is no functional change because hrtimer_force_reprogram() is only
      invoked when hrtimer_cpu_base.hres_active is true and
      CONFIG_HIGH_RES_TIMERS=y.
      
      Making it unconditional increases the text size for the
      CONFIG_HIGH_RES_TIMERS=n case slightly, but avoids replication of that code
      for the upcoming softirq based hrtimers support. Most of the code gets
      eliminated in the CONFIG_HIGH_RES_TIMERS=n case by the compiler.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-19-anna-maria@linutronix.de
      [ Made it build on !CONFIG_HIGH_RES_TIMERS ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Make hrtimer_reprogramm() unconditional · 11a9fe06
      Anna-Maria Gleixner authored
      hrtimer_reprogram() needs to be available unconditionally for softirq based
      hrtimers. Move the function and all required struct members out of the
      CONFIG_HIGH_RES_TIMERS #ifdef.
      
      There is no functional change because hrtimer_reprogram() is only invoked
      when hrtimer_cpu_base.hres_active is true. Making it unconditional
      increases the text size for the CONFIG_HIGH_RES_TIMERS=n case, but avoids
      replication of that code for the upcoming softirq based hrtimers support.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-18-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Make hrtimer_cpu_base.next_timer handling unconditional · eb27926b
      Anna-Maria Gleixner authored
      hrtimer_cpu_base.next_timer stores the pointer to the next expiring timer
      in a CPU base.
      
      This pointer cannot be dereferenced and is solely used to check whether a
      hrtimer which is removed is the hrtimer which is the first to expire in the
      CPU base. If this is the case, then the timer hardware needs to be
      reprogrammed to avoid an extra interrupt for nothing.
      
      Again, this is conditional functionality, but there is no compelling reason
      to make this conditional. As a preparation, hrtimer_cpu_base.next_timer
      needs to be available unconditionally.
      
      Aside from that, the upcoming support for softirq based hrtimers requires
      access to this pointer unconditionally as well, so our motivation is not
      entirely simplicity based.
      
      Make the update of hrtimer_cpu_base.next_timer unconditional and remove the
      #ifdef cruft. The impact on CONFIG_HIGH_RES_TIMERS=n && CONFIG_NOHZ=n is
      marginal as it's just a store on an already dirtied cacheline.
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-17-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Make the remote enqueue check unconditional · 07a9a7ea
      Anna-Maria Gleixner authored
      hrtimer_cpu_base.expires_next is used to cache the next event armed in the
      timer hardware. The value is used to check whether an hrtimer can be
      enqueued remotely. If the new hrtimer is expiring before expires_next, then
      remote enqueue is not possible as the remote hrtimer hardware cannot be
      accessed for reprogramming to an earlier expiry time.
      
      The remote enqueue check is currently conditional on
      CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active. There is no
      compelling reason to make this conditional.
      
      Move hrtimer_cpu_base.expires_next out of the CONFIG_HIGH_RES_TIMERS=y
      guarded area and remove the conditionals in hrtimer_check_target().
      
      The check is currently a NOOP for the CONFIG_HIGH_RES_TIMERS=n and the
      !hrtimer_cpu_base.hres_active case because in these cases nothing updates
      hrtimer_cpu_base.expires_next yet. This will be changed with later patches
      which further reduce the #ifdef zoo in this code.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-16-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Use accesor functions instead of direct access · 851cff8c
      Anna-Maria Gleixner authored
      __hrtimer_hres_active() is now available unconditionally, so replace open
      coded direct accesses to hrtimer_cpu_base.hres_active.
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-15-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Make the hrtimer_cpu_base::hres_active field unconditional, to simplify the code · 28bfd18b
      Anna-Maria Gleixner authored
      The hrtimer_cpu_base::hres_active field depends on CONFIG_HIGH_RES_TIMERS=y
      currently, and all functions related to this member are conditional as well.
      
      To simplify the code make it unconditional and set it to zero during initialization.
      
      (This will also help with the upcoming softirq based hrtimers code.)
      
      The conditional code sections can be avoided by adding IS_ENABLED(HIGHRES)
      conditionals into common functions, which ensures dead code elimination.
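
      For example, the accessor can collapse to a constant when high resolution
      support is compiled out (a sketch; the config symbol is
      CONFIG_HIGH_RES_TIMERS):

        static inline int __hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
        {
                /* Constant-folds to 0 when high resolution timers are off */
                return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
                        cpu_base->hres_active : 0;
        }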
      
      There is no functional change.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-14-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • hrtimer: Store running timer in hrtimer_clock_base · 3f0b9e8e
      Anna-Maria Gleixner authored
      The pointer to the currently running timer is stored in hrtimer_cpu_base
      before the base lock is dropped and the callback is invoked.
      
      This results in two levels of indirections and the upcoming support for
      softirq based hrtimer requires splitting the "running" storage into soft
      and hard IRQ context expiry.
      
      Storing both in the cpu base would require conditionals in all code paths
      accessing that information.
      
      It's possible to have a per clock base sequence count and running pointer
      without changing the semantics of the related mechanisms because the timer
      base pointer cannot be changed while a timer is running the callback.
      
      Unfortunately this makes the cpu base larger than 32 bytes on 32-bit
      kernels. Instead of having huge gaps due to alignment, remove the alignment
      and let the compiler pack the cpu base for 32-bit kernels. The resulting cache access
      patterns are fortunately not really different from the current
      behaviour. On 64-bit kernels the 64-byte alignment stays and the behaviour is
      unchanged. This was determined by analyzing the resulting layout and
      looking at the number of cache lines involved for the frequently used
      clocks.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Link: http://lkml.kernel.org/r/20171221104205.7269-12-anna-maria@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>