1. 09 April 2018 (3 commits)
    • nohz: Gather tick_sched booleans under a common flag field · 2bc629a6
      Frederic Weisbecker committed
      Optimize the space and leave plenty of room for further flags.
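
      A minimal userspace sketch of the packing described above; the field
      names follow the commit message (inidle, tick_stopped, got_idle_tick),
      but the struct is an illustration, not the kernel's tick_sched
      definition:

        #include <stdio.h>

        /* Illustrative stand-in for struct tick_sched: the former ints
         * become one-bit fields sharing a single word, leaving room
         * for further flags. */
        struct ts_model {
                unsigned int inidle        : 1;
                unsigned int tick_stopped  : 1;
                unsigned int idle_active   : 1;
                unsigned int got_idle_tick : 1; /* avoids overloading inidle */
        };

        int main(void)
        {
                struct ts_model ts = { 0 };

                ts.inidle = 1;          /* enter idle       */
                ts.tick_stopped = 1;    /* tick now stopped */
                if (ts.tick_stopped)
                        printf("tick stopped; flags use %zu bytes\n",
                               sizeof(ts));
                return 0;
        }
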
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      [ rjw: Do not use __this_cpu_read() to access tick_stopped and add
             got_idle_tick to avoid overloading inidle ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpuidle: menu: Refine idle state selection for running tick · 296bb1e5
      Rafael J. Wysocki committed
      If the tick isn't stopped, the target residency of the state selected
      by the menu governor may be greater than the actual time to the next
      tick and that means lost energy.
      
      To avoid that, make tick_nohz_get_sleep_length() return the current
      time to the next event (before stopping the tick) in addition to the
      estimated one via an extra pointer argument and make menu_select()
      use that value to refine the state selection when necessary.
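
      A sketch of that refinement with illustrative names and units (the
      real logic lives in the menu governor, drivers/cpuidle/governors/menu.c):

        #include <stdint.h>

        struct state_model { uint64_t target_residency_us; };

        /* With the tick still running, step back to a shallower state
         * when the chosen state's target residency exceeds the actual
         * time to the next tick, so its entry cost is not wasted. */
        static int refine_selection(const struct state_model *states,
                                    int chosen, int tick_stopped,
                                    uint64_t delta_tick_us)
        {
                while (!tick_stopped && chosen > 0 &&
                       states[chosen].target_residency_us > delta_tick_us)
                        chosen--;
                return chosen;
        }
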
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • sched: idle: Select idle state before stopping the tick · 554c8aa8
      Rafael J. Wysocki committed
      In order to address the issue with short idle duration predictions
      by the idle governor after the scheduler tick has been stopped,
      reorder the code in cpuidle_idle_call() so that the governor idle
      state selection runs before tick_nohz_idle_stop_tick() and use the
      "nohz" hint returned by cpuidle_select() to decide whether or not
      to stop the tick.
      
      This isn't straightforward, because menu_select() invokes
      tick_nohz_get_sleep_length() to get the time to the next timer
      event and the number returned by the latter comes from
      __tick_nohz_idle_stop_tick().  Fortunately, however, it is possible
      to compute that number without actually stopping the tick and with
      the help of the existing code.
      
      Namely, tick_nohz_get_sleep_length() can be made to call
      tick_nohz_next_event(), introduced earlier, to get the time to the
      next non-highres timer event.  If that happens, tick_nohz_next_event()
      need not be called by __tick_nohz_idle_stop_tick() again.
      
      If it turns out that the scheduler tick cannot be stopped going
      forward or the next timer event is too close for the tick to be
      stopped, tick_nohz_get_sleep_length() can simply return the time to
      the next event currently programmed into the corresponding clock
      event device.
      
      In addition to knowing the return value of tick_nohz_next_event(),
      however, tick_nohz_get_sleep_length() needs to know the time to the
      next highres timer event, but with the scheduler tick timer excluded,
      which can be computed with the help of hrtimer_get_next_event().
      
      The minimum of that number and the tick_nohz_next_event() return
      value is the total time to the next timer event with the assumption
      that the tick will be stopped.  It can be returned to the idle
      governor which can use it for predicting idle duration (under the
      assumption that the tick will be stopped) and deciding whether or
      not it makes sense to stop the tick before putting the CPU into the
      selected idle state.
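
      A compact model of that computation (illustrative names; the real
      code is in kernel/time/tick-sched.c):

        #include <stdint.h>

        /* Sleep length under the "tick will be stopped" assumption:
         * the minimum of the next non-highres timer event
         * (tick_nohz_next_event() above) and the next highres timer
         * event with the tick timer excluded (hrtimer_get_next_event()). */
        static uint64_t sleep_length_ns(uint64_t next_timer,
                                        uint64_t next_hrtimer,
                                        uint64_t now)
        {
                uint64_t next = next_timer < next_hrtimer ? next_timer
                                                          : next_hrtimer;
                return next - now;
        }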
      
      With the above, the sleep_length field in struct tick_sched is not
      necessary any more, so drop it.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
  2. 08 April 2018 (1 commit)
    • time: tick-sched: Split tick_nohz_stop_sched_tick() · 23a8d888
      Rafael J. Wysocki committed
      In order to address the issue with short idle duration predictions
      by the idle governor after the scheduler tick has been stopped, split
      tick_nohz_stop_sched_tick() into two separate routines, one computing
      the time to the next timer event and the other simply stopping the
      tick when the time to the next timer event is known.
      
      Prepare these two routines to be called separately, as one of them
      will be called by the idle governor in the cpuidle_select() code
      path after subsequent changes.
      
      Update the former callers of tick_nohz_stop_sched_tick() to use
      the new routines, tick_nohz_next_event() and tick_nohz_stop_tick(),
      instead of it and move the updates of the sleep_length field in
      struct tick_sched into __tick_nohz_idle_stop_tick() as it doesn't
      need to be updated anywhere else.
      
      There should be no intentional visible changes in functionality
      resulting from this change.
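
      The shape of the split, modeled in plain C (illustrative types and
      bodies; the real routines are tick_nohz_next_event() and
      tick_nohz_stop_tick()):

        #include <stdbool.h>
        #include <stdint.h>

        struct tick_model { uint64_t next_event; bool stopped; };

        /* Pure computation: no side effects on the tick state. */
        static uint64_t model_next_event(const struct tick_model *t)
        {
                return t->next_event;
        }

        /* Commit step: called only once the expiry time is known. */
        static void model_stop_tick(struct tick_model *t)
        {
                t->stopped = true;
        }

        static void model_idle_stop_tick(struct tick_model *t, uint64_t now)
        {
                uint64_t expires = model_next_event(t); /* 1: compute */
                if (expires > now)                      /* 2: decide  */
                        model_stop_tick(t);             /* 3: commit  */
        }
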
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
  3. 06 April 2018 (3 commits)
    • cpuidle: Return nohz hint from cpuidle_select() · 45f1ff59
      Rafael J. Wysocki committed
      Add a new pointer argument to cpuidle_select() and to the ->select
      cpuidle governor callback to allow a boolean value indicating
      whether or not the tick should be stopped before entering the
      selected state to be returned from there.
      
      Make the ladder governor ignore that pointer (to preserve its
      current behavior) and make the menu governor return 'false' through
      it (see the sketch after the list) if:
       (1) the idle exit latency is constrained at 0, or
       (2) the selected state is a polling one, or
       (3) the expected idle period duration is within the tick period
           range.
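
      A sketch of those three conditions as an out-parameter, with
      illustrative names and units (the real callback is menu_select()):

        #include <stdbool.h>
        #include <stdint.h>

        /* Report through *stop_tick whether stopping the tick is
         * worthwhile for the state that was chosen. */
        static void nohz_hint(uint64_t latency_req_ns, bool polling_state,
                              uint64_t predicted_us,
                              uint64_t tick_period_us, bool *stop_tick)
        {
                if (latency_req_ns == 0 ||         /* (1) latency pinned  */
                    polling_state ||               /* (2) polling state   */
                    predicted_us < tick_period_us) /* (3) within a tick   */
                        *stop_tick = false;
                else
                        *stop_tick = true;
        }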
      
      In addition to that, the correction factor computations in the menu
      governor need to take the possibility that the tick may not be
      stopped into account to avoid artificially small correction factor
      values.  To that end, add a mechanism to record tick wakeups, as
      suggested by Peter Zijlstra, and use it to modify the menu_update()
      behavior when a tick wakeup occurs.  Namely, if the CPU is woken up by
      the tick and the return value of tick_nohz_get_sleep_length() is not
      within the tick boundary, the predicted idle duration is likely too
      short, so make menu_update() try to compensate for that by updating
      the governor statistics as though the CPU was idle for a long time.
      
      Since the value returned through the new argument pointer of
      cpuidle_select() is not used by its caller yet, this change by
      itself is not expected to alter the functionality of the code.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • sched: idle: Do not stop the tick upfront in the idle loop · 2aaf709a
      Rafael J. Wysocki committed
      Push the decision whether or not to stop the tick somewhat deeper
      into the idle loop.
      
      Stopping the tick upfront leads to unpleasant outcomes in case the
      idle governor doesn't agree with the nohz code on the duration of the
      upcoming idle period.  Specifically, if the tick has been stopped and
      the idle governor predicts short idle, the situation is bad regardless
      of whether or not the prediction is accurate.  If it is accurate, the
      tick has been stopped unnecessarily which means excessive overhead.
      If it is not accurate, the CPU is likely to spend too much time in
      the (shallow, because short idle has been predicted) idle state
      selected by the governor [1].
      
      As the first step towards addressing this problem, change the code
      to make the tick stopping decision inside of the loop in do_idle().
      In particular, do not stop the tick in the cpu_idle_poll() code path.
      Also don't do that in tick_nohz_irq_exit() which doesn't really have
      enough information on whether or not to stop the tick.
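
      A heavily simplified model of the reordered loop (illustrative
      callbacks; the real loop is do_idle() in kernel/sched/idle.c):

        #include <stdbool.h>

        /* The tick-stop decision now happens per iteration, inside
         * the loop, and the polling path never stops the tick. */
        static void do_idle_model(bool (*need_resched)(void),
                                  bool (*want_poll)(void),
                                  void (*stop_tick)(void),
                                  void (*enter_idle_state)(void))
        {
                while (!need_resched()) {
                        if (want_poll()) {
                                enter_idle_state(); /* tick keeps running */
                                continue;
                        }
                        stop_tick();                /* decided here, late */
                        enter_idle_state();
                }
        }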
      
      Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
      Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
      Suggested-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • time: tick-sched: Reorganize idle tick management code · 0e776768
      Rafael J. Wysocki committed
      Prepare the scheduler tick code for reworking the idle loop to
      avoid stopping the tick in some cases.
      
      The idea is to split the nohz idle entry call to decouple the idle
      time stats accounting and preparatory work from the actual tick stop
      code, in order to later be able to delay the tick stop once we reach
      more power-knowledgeable callers.
      
      Move away the tick_nohz_start_idle() invocation from
      __tick_nohz_idle_enter(), rename the latter to
      __tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
      as a wrapper around it for calling it from the outside.
      
      Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
      of calling the entire __tick_nohz_idle_enter(), add another wrapper
      disabling and enabling interrupts around tick_nohz_idle_stop_tick()
      and make the current callers of tick_nohz_idle_enter() call it too
      to retain their current functionality.
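
      The resulting call structure, modeled in plain C (every name here is
      an illustrative stand-in for the tick-sched internals):

        static void model_start_idle(void)  { /* idle time stats  */ }
        static void model_stop_tick(void)   { /* actual tick stop */ }
        static void model_irq_disable(void) { }
        static void model_irq_enable(void)  { }

        /* Entry now does bookkeeping only; the tick stop is a
         * separate, IRQ-protected step the caller invokes later. */
        static void model_idle_enter(void)
        {
                model_start_idle();
        }

        static void model_idle_stop_tick_protected(void)
        {
                model_irq_disable();
                model_stop_tick();
                model_irq_enable();
        }
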
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  4. 09 March 2018 (1 commit)
    • sched/nohz: Clean up nohz enter/exit · 00357f5e
      Peter Zijlstra committed
      The primary observation is that nohz enter/exit is always from the
      current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be
      an atomic.
      
      Secondary is that we appear to have 2 nearly identical hooks in the
      nohz enter code, set_cpu_sd_state_idle() and
      nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into
      nohz_balance_{enter,exit}_idle.
      
      Removes an atomic op from both enter and exit paths.
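
      A sketch of the contrast (C11, illustrative types): the old flag
      needed atomic read-modify-write ops, while the new one is a plain
      per-CPU bit, safe because only its own CPU ever writes it:

        #include <stdatomic.h>

        #define NOHZ_TICK_STOPPED_BIT 1u

        /* Before: cross-CPU-safe, but every set/clear is an atomic RMW. */
        struct nohz_old { atomic_uint flags; };

        /* After: a plain bit, written only from its owning CPU. */
        struct nohz_new { unsigned int tick_stopped : 1; };

        static void old_enter(struct nohz_old *n)
        {
                atomic_fetch_or(&n->flags, NOHZ_TICK_STOPPED_BIT);
        }

        static void new_enter(struct nohz_new *n)
        {
                n->tick_stopped = 1;    /* no atomic op needed */
        }
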
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 21 February 2018 (3 commits)
  6. 16 February 2018 (1 commit)
    • sched/isolation: Eliminate NO_HZ_FULL_ALL · a7c8655b
      Paul E. McKenney committed
      Commit 6f1982fe ("sched/isolation: Handle the nohz_full= parameter")
      broke CONFIG_NO_HZ_FULL_ALL=y kernels.  This breakage is due to the code
      under CONFIG_NO_HZ_FULL_ALL failing to invoke the shiny new housekeeping
      functions.  This means that rcutorture scenario TREE04 now emits RCU CPU
      stall warnings due to the RCU grace-period kthreads not being awakened
      at a time of their choosing, or perhaps even not at all:
      
      [   27.731422] rcu_bh kthread starved for 21001 jiffies! g18446744073709551369 c18446744073709551368 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=3
      [   27.731423] rcu_bh          I14936     9      2 0x80080000
      [   27.731435] Call Trace:
      [   27.731440]  __schedule+0x31a/0x6d0
      [   27.731442]  schedule+0x31/0x80
      [   27.731446]  schedule_timeout+0x15a/0x320
      [   27.731453]  ? call_timer_fn+0x130/0x130
      [   27.731457]  rcu_gp_kthread+0x66c/0xea0
      [   27.731458]  ? rcu_gp_kthread+0x66c/0xea0
      
      Because no one has complained about CONFIG_NO_HZ_FULL_ALL=y being broken,
      I hypothesize that no one is in fact using it, other than rcutorture.
      This commit therefore eliminates CONFIG_NO_HZ_FULL_ALL and updates
      rcutorture's config files to instead use the nohz_full= kernel parameter
      to put the desired CPUs into nohz_full mode.
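
      For reference, the replacement is the documented nohz_full= boot
      parameter; e.g., on an 8-CPU machine, keeping CPU 0 as a
      housekeeping CPU (the CPU list is illustrative):

        nohz_full=1-7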
      
      Fixes: 6f1982fe ("sched/isolation: Handle the nohz_full= parameter")
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
  7. 16 January 2018 (1 commit)
  8. 30 December 2017 (1 commit)
    • nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick() · 5d62c183
      Thomas Gleixner committed
      The conditions in irq_exit() to invoke tick_nohz_irq_exit() which
      subsequently invokes tick_nohz_stop_sched_tick() are:
      
        if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))
      
      If need_resched() is not set, but a timer softirq is pending, then this
      is an indication that the softirq code punted and delegated the
      execution to ksoftirqd. need_resched() is not true because the current
      interrupted task takes precedence over ksoftirqd.
      
      Invoking tick_nohz_irq_exit() in this case can cause an endless loop of
      timer interrupts because the timer wheel contains an expired timer, but
      softirqs are not yet executed. So it returns an immediate expiry request,
      which causes the timer to fire immediately again. Lather, rinse and
      repeat....
      
      Prevent that by adding a check for a pending timer soft interrupt to the
      conditions in tick_nohz_stop_sched_tick() which avoid calling
      get_next_timer_interrupt(). That keeps the tick sched timer on the tick and
      prevents a repetitive programming of an already expired timer.
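
      A sketch of that guard (the helper name and bit position are
      illustrative; in the kernel the pending mask comes from
      local_softirq_pending()):

        #include <stdbool.h>

        #define TIMER_SOFTIRQ_BIT (1u << 1)  /* illustrative position */

        /* A pending timer softirq means expired timers have not been
         * processed yet; keep the tick sched timer on the tick rather
         * than asking the timer wheel for an expiry already in the
         * past. */
        static bool timer_softirq_pending(unsigned int pending_mask)
        {
                return pending_mask & TIMER_SOFTIRQ_BIT;
        }
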
      Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
  9. 28 December 2017 (1 commit)
  10. 08 November 2017 (1 commit)
  11. 27 October 2017 (2 commits)
  12. 10 October 2017 (1 commit)
  13. 22 June 2017 (2 commits)
  14. 13 June 2017 (1 commit)
    • nohz: Fix spurious warning when hrtimer and clockevent get out of sync · d4af6d93
      Frederic Weisbecker committed
      The sanity check ensuring that the tick expiry cache (ts->next_tick)
      is actually in sync with the hardware clock (dev->next_event) makes the
      wrong assumption that the clock can't be programmed later than the
      hrtimer deadline.
      
      In fact the clock hardware can be programmed later on some conditions
      such as:
      
          * The hrtimer deadline is already in the past.
          * The hrtimer deadline is earlier than the minimum delay supported
            by the hardware.
      
      Such conditions can be met when we program the tick. For example, if
      the last jiffies update hasn't been seen by the current CPU yet, we
      may program the hrtimer to a deadline that is earlier than ktime_get(),
      because last_jiffies_update is our timestamp base for computing the
      next tick.
      
      As a result, we can randomly observe such warning:
      
      	WARNING: CPU: 5 PID: 0 at kernel/time/tick-sched.c:794 tick_nohz_stop_sched_tick kernel/time/tick-sched.c:791 [inline]
      	Call Trace:
      	 tick_nohz_irq_exit
      	 tick_irq_exit
      	 irq_exit
      	 exiting_irq
      	 smp_call_function_interrupt
      	 smp_call_function_single_interrupt
      	 call_function_single_interrupt
      
      Therefore, let's rather make sure that the tick expiry cache is sync'ed
      with the tick hrtimer deadline, against which it is not supposed to
      drift away. The clock hardware instead has its own will and can't be
      used as a reliable comparison point.
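
      The corrected check, modeled with illustrative names: compare the
      cache against the tick hrtimer's deadline, never against the
      clockevent hardware:

        #include <assert.h>
        #include <stdint.h>

        static void check_next_tick_cache(uint64_t cached_next_tick,
                                          uint64_t tick_hrtimer_deadline)
        {
                /* The cache tracks the hrtimer; the hardware may
                 * lawfully be programmed later (deadline in the past,
                 * min-delay clamping). */
                assert(cached_next_tick == tick_hrtimer_deadline);
        }
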
      Reported-and-tested-by: Sasha Levin <alexander.levin@verizon.com>
      Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: James Hartsock <hartsjc@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Wright <tim@binbash.co.uk>
      Link: http://lkml.kernel.org/r/1497326654-14122-1-git-send-email-fweisbec@gmail.com
      [ Minor readability edit. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 05 June 2017 (1 commit)
    • nohz: Fix buggy tick delay on IRQ storms · f99973e1
      Frederic Weisbecker committed
      When the tick is stopped and we reach the dynticks evaluation code on
      IRQ exit, we perform a soft tick restart if we observe an expired timer
      from there. It means we program the nearest possible tick but we stay in
      dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick
      again after that expired timer is handled.
      
      Now this solution works most of the time but if we suffer an IRQ storm
      and those interrupts trigger faster than the hardware clockevents min
      delay, our tick won't fire until that IRQ storm is finished.
      
      Here is the problem: on IRQ exit we reprogram the timer to at least
      NOW() + min_clockevents_delay. Another IRQ fires before the tick, so we
      reschedule again to NOW() + min_clockevents_delay, etc... The tick
      is eternally rescheduled min_clockevents_delay ahead.
      
      A solution is to simply remove this soft tick restart. After all,
      the normal dynticks evaluation path can handle a 0 delay just fine. And
      by doing that we benefit from the optimization branch which avoids
      clock reprogramming if the clockevents deadline hasn't changed since
      the last reprogramming. This fixes our issue because we no longer do
      repetitive clock reprogramming that always adds the hardware min delay.
      
      As a side effect it should even optimize the 0 delay path in general.
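
      The failure mode can be shown numerically; a toy model with
      illustrative values (100us hardware minimum delay, one IRQ every
      60us):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                const uint64_t min_delay = 100; /* hw min delay, us */
                const uint64_t irq_gap   = 60;  /* storm period, us */
                uint64_t now = 0;

                for (int i = 0; i < 5; i++) {
                        uint64_t tick = now + min_delay; /* reprog on exit */
                        printf("IRQ at %3llu us -> tick set for %3llu us\n",
                               (unsigned long long)now,
                               (unsigned long long)tick);
                        now += irq_gap;         /* next IRQ fires first */
                }
                /* The tick deadline recedes forever during the storm. */
                return 0;
        }
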
      Reported-and-tested-by: Octavian Purdila <octavian.purdila@nxp.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1496328429-13317-1-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 31 May 2017 (1 commit)
  17. 17 May 2017 (2 commits)
    • nohz: Fix collision between tick and other hrtimers, again · 411fe24e
      Frederic Weisbecker committed
      This restores commit:
      
        24b91e36: ("nohz: Fix collision between tick and other hrtimers")
      
      ... which got reverted by commit:
      
        558e8e27: ('Revert "nohz: Fix collision between tick and other hrtimers"')
      
      ... due to a regression where CPUs spuriously stopped ticking.
      
      The bug happened when a tick fired too early, before its expected
      expiration: on IRQ exit the tick was scheduled again to the same
      deadline but skipped reprogramming because ts->next_tick still cached
      that deadline. This has been fixed now by resetting ts->next_tick
      from the tick itself. Extra care has also been taken to prevent
      obsolete values from surviving CPU hotplug operations.
      
      When the tick is stopped and an interrupt occurs afterward, we check on
      that interrupt exit if the next tick needs to be rescheduled. If it
      doesn't need any update, we don't want to do anything.
      
      In order to check if the tick needs an update, we compare it against the
      clockevent device deadline. Now that's a problem because the clockevent
      device is at a lower level than the tick itself if it is implemented
      on top of hrtimer.
      
      Every hrtimer shares this clockevent device. So comparing the next tick
      deadline against the clockevent device deadline is wrong because the
      device may be programmed for another hrtimer whose deadline collides
      with the tick. As a result we may end up not reprogramming the tick
      accidentally.
      
      In a worst case scenario under full dynticks mode, the tick stops
      firing, although it is supposed to keep firing at least once per
      second (1 Hz), leaving /proc/stat stalled:
      
            Task in a full dynticks CPU
            ----------------------------
      
            * hrtimer A is queued 2 seconds ahead
            * the tick is stopped, scheduled 1 second ahead
            * tick fires 1 second later
      * on tick exit, nohz schedules the tick 1 second ahead but sees
        the clockevent device is already programmed to that deadline;
        fooled by hrtimer A, it doesn't reschedule the tick.
            * hrtimer A is cancelled before its deadline
            * tick never fires again until an interrupt happens...
      
      In order to fix this, store the next tick deadline to the tick_sched
      local structure and reuse that value later to check whether we need to
      reprogram the clock after an interrupt.
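
      A model of the fix (illustrative names; the real field is
      ts->next_tick):

        #include <stdbool.h>
        #include <stdint.h>

        struct tick_cache { uint64_t next_tick; };

        /* Compare against our own cached deadline, not the shared
         * clockevent device that other hrtimers also program. */
        static bool tick_needs_reprogram(struct tick_cache *ts,
                                         uint64_t deadline)
        {
                if (ts->next_tick == deadline)
                        return false;      /* already ours: skip     */
                ts->next_tick = deadline;  /* remember what we set   */
                return true;               /* program the clockevent */
        }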
      
      On the other hand, ts->sleep_length still wants to know about the next
      clock event and not just the tick, so we want to improve the related
      comment to avoid confusion.
      Reported-and-tested-by: Tim Wright <tim@binbash.co.uk>
      Reported-and-tested-by: Pavel Machek <pavel@ucw.cz>
      Reported-by: James Hartsock <hartsjc@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1492783255-5051-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • nohz: Add hrtimer sanity check · ce6cf9a1
      Frederic Weisbecker committed
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 15 May 2017 (1 commit)
  19. 23 March 2017 (1 commit)
    • cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely · b7eaf1aa
      Rafael J. Wysocki committed
      The way the schedutil governor uses the PELT metric causes it to
      underestimate the CPU utilization in some cases.
      
      That can be easily demonstrated by running kernel compilation on
      a Sandy Bridge Intel processor, running turbostat in parallel with
      it and looking at the values written to the MSR_IA32_PERF_CTL
      register.  Namely, the expected result would be that when all CPUs
      were 100% busy, all of them would be requested to run in the maximum
      P-state, but observation shows that this clearly isn't the case.
      The CPUs run in the maximum P-state for a while and then are
      requested to run slower and go back to the maximum P-state after
      a while again.  That causes the actual frequency of the processor to
      visibly oscillate below the sustainable maximum in a jittery fashion
      which clearly is not desirable.
      
      That has been attributed to CPU utilization metric updates on task
      migration that cause the total utilization value for the CPU to be
      reduced by the utilization of the migrated task.  If that happens,
      the schedutil governor may see a CPU utilization reduction and will
      attempt to reduce the CPU frequency accordingly right away.  That
      may be premature, though, for example if the system is generally
      busy and there are other runnable tasks waiting to be run on that
      CPU already.
      
      This is unlikely to be an issue on systems where cpufreq policies are
      shared between multiple CPUs, because in those cases the policy
      utilization is computed as the maximum of the CPU utilization values
      over the whole policy and if that turns out to be low, reducing the
      frequency for the policy most likely is a good idea anyway.  On
      systems with one CPU per policy, however, it may affect performance
      adversely and even lead to increased energy consumption in some cases.
      
      On those systems it may be addressed by taking another utilization
      metric into consideration, like whether or not the CPU whose
      frequency is about to be reduced has been idle recently, because if
      that's not the case, the CPU is likely to be busy in the near future
      and its frequency should not be reduced.
      
      To that end, use the counter of idle calls in the timekeeping code.
      Namely, make the schedutil governor look at that counter for the
      current CPU every time before its frequency is about to be reduced.
      If the counter has not changed since the previous iteration of the
      governor computations for that CPU, the CPU has been busy for all
      that time and its frequency should not be decreased, so if the new
      frequency would be lower than the one set previously, the governor
      will skip the frequency update.
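
      A model of the heuristic (illustrative names; in the kernel the
      counter is read from the timekeeping code and the check lives in
      the schedutil governor):

        #include <stdbool.h>
        #include <stdint.h>

        struct sugov_cpu_model { uint64_t saved_idle_calls; };

        /* If the CPU never even tried to enter idle since the previous
         * governor run, treat it as busy and skip frequency cuts. */
        static bool cpu_is_busy(struct sugov_cpu_model *sg,
                                uint64_t idle_calls_now)
        {
                bool busy = sg->saved_idle_calls == idle_calls_now;

                sg->saved_idle_calls = idle_calls_now;
                return busy; /* caller: if busy, skip lowering freq */
        }
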
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: Joel Fernandes <joelaf@google.com>
  20. 02 March 2017 (5 commits)
  21. 17 February 2017 (1 commit)
    • Revert "nohz: Fix collision between tick and other hrtimers" · 558e8e27
      Linus Torvalds committed
      This reverts commit 24b91e36 and commit
      7bdb59f1 ("tick/nohz: Fix possible missing clock reprog after tick
      soft restart") that depends on it.
      
      Pavel reports that it causes occasional boot hangs for him that seem to
      depend on just how the machine was booted.  In particular, his machine
      hangs at around the PCI fixups of the EHCI USB host controller, but only
      hangs from cold boot, not from a warm boot.
      
      Thomas Gleixner suspects it's a CPU hotplug interaction, particularly
      since Pavel also saw suspend/resume issues that seem to be related.
      We're reverting for now while trying to figure out the root cause.
      Reported-bisected-and-tested-by: Pavel Machek <pavel@ucw.cz>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org  # reverted commits were marked for stable
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 10 February 2017 (1 commit)
    • tick/nohz: Fix possible missing clock reprog after tick soft restart · 7bdb59f1
      Frederic Weisbecker committed
      ts->next_tick keeps track of the next tick deadline in order to optimize
      clock programming on IRQ exit and avoid redundant clock device writes.
      
      Now if ts->next_tick misses an update, we may spuriously skip a clock
      reprogramming later, as the nohz code is fooled by an obsolete next_tick
      value.
      
      This is what happens here on a specific path: when we observe an
      expired timer from the nohz update code on irq exit, we perform a soft
      tick restart which simply fires the closest possible tick without
      actually exiting the nohz mode and restoring a periodic state. But we
      forget to update ts->next_tick accordingly.
      
      As a result, after the next tick resulting from such soft tick restart,
      the nohz code sees a stale value on ts->next_tick which doesn't match
      the clock deadline that just expired. If that obsolete ts->next_tick
      value happens to collide with the actual next tick deadline to be
      scheduled, we may spuriously bypass the clock reprogramming. In the
      worst case, the tick may never fire again.
      
      Fix this with a ts->next_tick reset on soft tick restart.
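
      The essence of the fix, sketched (illustrative helpers; the real
      field is ts->next_tick):

        #include <stdint.h>

        struct tick_cache { uint64_t next_tick; };

        static void program_clock(uint64_t deadline) { (void)deadline; }

        /* A soft restart reprograms the clock, so the cached deadline
         * must be refreshed too, or later comparisons go stale. */
        static void soft_tick_restart(struct tick_cache *ts, uint64_t now)
        {
                program_clock(now);   /* fire closest possible tick   */
                ts->next_tick = now;  /* keep cache in sync (the fix) */
        }
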
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1486485894-29173-1-git-send-email-fweisbec@gmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  23. 11 January 2017 (1 commit)
    • nohz: Fix collision between tick and other hrtimers · 24b91e36
      Frederic Weisbecker committed
      When the tick is stopped and an interrupt occurs afterward, we check on
      that interrupt exit if the next tick needs to be rescheduled. If it
      doesn't need any update, we don't want to do anything.
      
      In order to check if the tick needs an update, we compare it against the
      clockevent device deadline. Now that's a problem because the clockevent
      device is at a lower level than the tick itself if it is implemented
      on top of hrtimer.
      
      Every hrtimer shares this clockevent device. So comparing the next tick
      deadline against the clockevent device deadline is wrong because the
      device may be programmed for another hrtimer whose deadline collides
      with the tick. As a result we may end up not reprogramming the tick
      accidentally.
      
      In a worst case scenario under full dynticks mode, the tick stops
      firing, although it is supposed to keep firing at least once per
      second (1 Hz), leaving /proc/stat stalled:
      
            Task in a full dynticks CPU
            ----------------------------
      
            * hrtimer A is queued 2 seconds ahead
            * the tick is stopped, scheduled 1 second ahead
            * tick fires 1 second later
      * on tick exit, nohz schedules the tick 1 second ahead but sees
        the clockevent device is already programmed to that deadline;
        fooled by hrtimer A, it doesn't reschedule the tick.
            * hrtimer A is cancelled before its deadline
            * tick never fires again until an interrupt happens...
      
      In order to fix this, store the next tick deadline to the tick_sched
      local structure and reuse that value later to check whether we need to
      reprogram the clock after an interrupt.
      
      On the other hand, ts->sleep_length still wants to know about the next
      clock event and not just the tick, so we want to improve the related
      comment to avoid confusion.
      Reported-by: James Hartsock <hartsjc@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  24. 26 December 2016 (1 commit)
    • ktime: Get rid of the union · 2456e855
      Thomas Gleixner committed
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64-bit machines and in an endianness-optimized
      timespec variant on 32-bit machines. The Y2038 cleanup removed the
      timespec variant and switched everything to scalar nanoseconds. The
      union remained, but became completely pointless.
      
      Get rid of the union and just keep ktime_t as a simple typedef of type s64.
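
      Before and after, sketched (the single-member union matches the
      shape described above; the kernel's s64 is modeled with a standard
      type):

        #include <stdint.h>

        typedef int64_t s64;       /* stand-in for the kernel's s64 */

        /* Before: a union with one member, kept for historical
         * 32-bit reasons that no longer exist. */
        union ktime_old { s64 tv64; };

        /* After: plain scalar nanoseconds. */
        typedef s64 ktime_t;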
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  25. 23 November 2016 (1 commit)
  26. 13 September 2016 (1 commit)
  27. 02 September 2016 (1 commit)
    • tick/nohz: Fix softlockup on scheduler stalls in kvm guest · 08d07259
      Wanpeng Li committed
      Since commit 1f3b0f82 ("tick/nohz: Optimize nohz idle enter"),
      tick_nohz_start_idle() is no longer called if the idle tick can't be
      stopped. As a result, after suspending and resuming the host machine,
      a full dynticks KVM guest will soft lock up:
      
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
       Call Trace:
        default_idle+0x31/0x1a0
        arch_cpu_idle+0xf/0x20
        default_idle_call+0x2a/0x50
        cpu_startup_entry+0x39b/0x4d0
        rest_init+0x138/0x140
        ? rest_init+0x5/0x140
        start_kernel+0x4c1/0x4ce
        ? set_init_arg+0x55/0x55
        ? early_idt_handler_array+0x120/0x120
        x86_64_start_reservations+0x24/0x26
        x86_64_start_kernel+0x142/0x14f
      
      In addition, cat /proc/stat | grep cpu in the guest or host shows:
      
      cpu  398 16 5049 15754 5490 0 1 46 0 0
      cpu0 206 5 450 0 0 0 1 14 0 0
      cpu1 81 0 3937 3149 1514 0 0 9 0 0
      cpu2 45 6 332 6052 2243 0 0 11 0 0
      cpu3 65 2 328 6552 1732 0 0 11 0 0
      
      The idle and iowait counts are stuck at 0 for cpu0 (the housekeeping CPU).
      
      The bug is present in both guest and host kernels; both show the
      cpu0 idle/iowait accounting issue. However, the host kernel's
      suspend/resume path (among others) touches the watchdog, which
      avoids the softlockup there.
      
      - The watchdog is not touched in the tick_nohz_stop_idle() path
        (it needs to be, since the scheduler stall is expected) if the
        idle_active flag is not set.
      - The idle and iowait times are not accounted when exiting the
        idle loop (on resched or interrupt) if the idle start time and
        the idle_active flag are not set.
      
      This patch fixes it by reverting commit 1f3b0f82: being unable to
      stop the idle tick doesn't mean the CPU can't be idle.
      
      Fixes: 1f3b0f82 ("tick/nohz: Optimize nohz idle enter")
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Sanjeev Yadav <sanjeev.yadav@spreadtrum.com>
      Cc: Gaurav Jindal <gaurav.jindal@spreadtrum.com>
      Cc: stable@vger.kernel.org
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>