提交 · 61c66d6efa23759f1061d80ced668977fd28337d · openeuler / Kernel

06 3月, 2014 4 次提交

cpuidle: Do not substract exit latency from assumed sleep length · 61c66d6e

由 tuukka.tikkanen@linaro.org 提交于 2月 24, 2014

The menu governor statistics update function tries to determine the
amount of time between entry to low power state and the occurrence
of the wakeup event. However, the time measured by the framework
includes exit latency on top of the desired value. This exit latency
is substracted from the measured value to obtain the desired value.

When measured value is not available, the menu governor assumes
the wakeup was caused by the timer and the time is equal to remaining
timer length. No exit latency should be substracted from this value.

This patch prevents the erroneous substraction and clarifies the
associated comment. It also removes one intermediate variable that
serves no purpose.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

61c66d6e

cpuidle: Ensure menu coefficients stay within domain · 7ac26436

由 tuukka.tikkanen@linaro.org 提交于 2月 24, 2014

The menu governor uses coefficients as one method of actual idle
period length estimation. The coefficients are, as detailed below,
multipliers giving expected idle period length from time until next
timer expiry. The multipliers are supposed to have domain of (0..1].

The coefficients are fractions where only the numerators are stored
and denominators are a shared constant RESOLUTION*DECAY. Since the
value of the coefficient should always be greater than 0 and less
than or equal to 1, the numerator must have a value greater than
0 and less than or equal to RESOLUTION*DECAY.

If the coefficients are updated with measured idle durations exceeding
timer length, the multiplier may reach values exceeding unity (i.e.
the stored numerator exceeds RESOLUTION*DECAY). This patch ensures that
the multipliers are updated with durations capped to timer length.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: NNicolas Pitre <nico@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

7ac26436

cpuidle: Use actual state latency in menu governor · 22695ab6

由 tuukka.tikkanen@linaro.org 提交于 2月 24, 2014

Currently menu governor records the exit latency of the state it has
chosen for the idle period. The stored latency value is then later
used to calculate the actual length of the idle period. This value
may however be incorrect, as the entered state may not be the one
chosen by the governor. The entered state information is available,
so we can use that to obtain the real exit latency.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

22695ab6

cpuidle: rename expected_us to next_timer_us in menu governor · 5dc2f5a3

由 tuukka.tikkanen@linaro.org 提交于 2月 24, 2014

The field expected_us is used to store the time remaining until next
timer expiry. The name is inaccurate, as we really do not expect all
wakeups to be caused by timers. In addition, another field with a very
similar name (predicted_us) is used to store the predicted time
remaining until any wakeup source being active.

This patch renames expected_us to next_timer_us in order to better
reflect the contained information.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: NNicolas Pitre <nico@linaro.org>
Acked-by: NLen Brown <len.brown@intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

5dc2f5a3

23 8月, 2013 8 次提交

cpuidle: Change struct menu_device field types · 51f245b8

由 Tuukka Tikkanen 提交于 8月 14, 2013

Field predicted_us value can never exceed expected_us value, but it has
a potentially larger type. As there is no need for additional 32 bits of
zeroes on 32 bit plaforms, change the type of predicted_us to match the
type of expected_us.

Field correction_factor is used to store a value that cannot exceed the
product of RESOLUTION and DECAY (default 1024*8 = 8192). The constants
cannot in practice be incremented to such values, that they'd overflow
unsigned int even on 32 bit systems, so the type is changed to avoid
unnecessary 64 bit arithmetic on 32 bit systems.

One multiplication of (now) 32 bit values needs an added cast to avoid
truncation of the result and has been added.

In order to avoid another multiplication from 32 bit domain to 64 bit
domain, the new correction_factor calculation has been changed from
new = old * (DECAY-1) / DECAY
to
new = old - old / DECAY,
which with infinite precision would yeild exactly the same result, but
now changes the direction of rounding. The impact is not significant as
the maximum accumulated difference cannot exceed the value of DECAY,
which is relatively small compared to product of RESOLUTION and DECAY
(8 / 8192).
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

51f245b8

cpuidle: Add a comment warning about possible overflow · decd51bb

由 Tuukka Tikkanen 提交于 8月 14, 2013

The menu governor has a number of tunable constants that may be changed
in the source. If certain combination of values are chosen, an overflow
is possible when the correction_factor is being recalculated.

This patch adds a warning regarding this possibility and describes the
change needed for fixing the issue. The change should not be permanently
enabled, as it will hurt performance when it is not needed.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

decd51bb

cpuidle: Fix variable domains in get_typical_interval() · 0e96d5ad

由 Tuukka Tikkanen 提交于 8月 14, 2013

The menu governor uses a static function get_typical_interval() to
try to detect a repeating pattern of wakeups. The previous interval
durations are stored as an array of unsigned ints, but the arithmetic
in the function is performed exclusively as 64 bit values, even when
the value stored in a variable is known not to exceed unsigned int,
which may be smaller and more efficient on some platforms.

This patch changes the types of varibles used to store some
intermediates, the maximum and and the cutoff threshold to unsigned
ints. Average and standard deviation are still treated as 64 bit values,
even when the values are known to be within the domain of unsigned int,
to avoid casts to ensure correct integer promotion for arithmetic
operations.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

0e96d5ad

cpuidle: Fix menu_device->intervals type · 939e33b7

由 Tuukka Tikkanen 提交于 8月 14, 2013

Struct menu_device member intervals is declared as u32, but the value
stored is (unsigned) int. The type is changed to match the value being
stored.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

939e33b7

cpuidle: CodingStyle: Break up multiple assignments on single line · 4cd46bca

由 Tuukka Tikkanen 提交于 8月 14, 2013

The function get_typical_interval() initializes a number of variables
that are immediately after declarations assigned constant values.
In addition, there are multiple assignments on a single line, which
is explicitly forbidden by Documentation/CodingStyle.

This patch removes redundant initial values for the variables and
breaks up the multiple assignment line.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

4cd46bca

cpuidle: Check called function parameter in get_typical_interval() · 0d6a7ffa

由 Tuukka Tikkanen 提交于 8月 14, 2013

get_typical_interval() uses int_sqrt() in calculation of standard
deviation. The formal parameter of int_sqrt() is unsigned long, which
may on some platforms be smaller than the 64 bit unsigned integer used
as the actual parameter. The overflow can occur frequently when actual
idle period lengths are in hundreds of milliseconds.

This patch adds a check for such overflow and rejects the candidate
average when an overflow would occur.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

0d6a7ffa

cpuidle: Rearrange code and comments in get_typical_interval() · 017099e2

由 Tuukka Tikkanen 提交于 8月 14, 2013

This patch rearranges a if-return-elsif-goto-fi-return sequence into
if-return-fi-if-return-fi-goto sequence. The functionality remains the
same. Also, a lengthy comment that did not describe the functionality
in the order it occurs is split into half and top half is moved closer
to actual implementation it describes.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

017099e2

cpuidle: Ignore interval prediction result when timer is shorter · 330647a9

由 Tuukka Tikkanen 提交于 8月 14, 2013

This patch prevents cpuidle menu governor from using repeating interval
prediction result if the idle period predicted is longer than the one
allowed by shortest running timer.
Signed-off-by: NTuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

330647a9

29 7月, 2013 2 次提交

Revert "cpuidle: Quickly notice prediction failure for repeat mode" · 14851912

由 Rafael J. Wysocki 提交于 7月 27, 2013

Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for
repeat mode), because it has been identified as the source of a
significant performance regression in v3.8 and later as explained by
Jeremy Eder:

  We believe we've identified a particular commit to the cpuidle code
  that seems to be impacting performance of variety of workloads.
  The simplest way to reproduce is using netperf TCP_RR test, so
  we're using that, on a pair of Sandy Bridge based servers.  We also
  have data from a large database setup where performance is also
  measurably/positively impacted, though that test data isn't easily
  share-able.

  Included below are test results from 3 test kernels:

  kernel       reverts
  -----------------------------------------------------------
  1) vanilla   upstream (no reverts)

  2) perfteam2 reverts e11538d1

  3) test      reverts 69a37bea
                       e11538d1

  In summary, netperf TCP_RR numbers improve by approximately 4%
  after reverting 69a37bea.  When
  69a37bea is included, C0 residency
  never seems to get above 40%.  Taking that patch out gets C0 near
  100% quite often, and performance increases.

  The below data are histograms representing the %c0 residency @
  1-second sample rates (using turbostat), while under netperf test.

  - If you look at the first 4 histograms, you can see %c0 residency
    almost entirely in the 30,40% bin.
  - The last pair, which reverts 69a37bea,
    shows %c0 in the 80,90,100% bins.

  Below each kernel name are netperf TCP_RR trans/s numbers for the
  particular kernel that can be disclosed publicly, comparing the 3
  test kernels.  We ran a 4th test with the vanilla kernel where
  we've also set /dev/cpu_dma_latency=0 to show overall impact
  boosting single-threaded TCP_RR performance over 11% above
  baseline.

  3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
  TCP_RR trans/s 54323.78

  -----------------------------------------------------------
  3.10-rc2 vanilla RX (no reverts)
  TCP_RR trans/s 48192.47

  Receiver %c0
      0.0000 -    10.0000 [     1]: *
     10.0000 -    20.0000 [     0]:
     20.0000 -    30.0000 [     0]:
     30.0000 -    40.0000 [    59]:
  ***********************************************************
     40.0000 -    50.0000 [     1]: *
     50.0000 -    60.0000 [     0]:
     60.0000 -    70.0000 [     0]:
     70.0000 -    80.0000 [     0]:
     80.0000 -    90.0000 [     0]:
     90.0000 -   100.0000 [     0]:

  Sender %c0
      0.0000 -    10.0000 [     1]: *
     10.0000 -    20.0000 [     0]:
     20.0000 -    30.0000 [     0]:
     30.0000 -    40.0000 [    11]: ***********
     40.0000 -    50.0000 [    49]:
  *************************************************
     50.0000 -    60.0000 [     0]:
     60.0000 -    70.0000 [     0]:
     70.0000 -    80.0000 [     0]:
     80.0000 -    90.0000 [     0]:
     90.0000 -   100.0000 [     0]:

  -----------------------------------------------------------
  3.10-rc2 perfteam2 RX (reverts commit
  e11538d1)
  TCP_RR trans/s 49698.69

  Receiver %c0
      0.0000 -    10.0000 [     1]: *
     10.0000 -    20.0000 [     1]: *
     20.0000 -    30.0000 [     0]:
     30.0000 -    40.0000 [    59]:
  ***********************************************************
     40.0000 -    50.0000 [     0]:
     50.0000 -    60.0000 [     0]:
     60.0000 -    70.0000 [     0]:
     70.0000 -    80.0000 [     0]:
     80.0000 -    90.0000 [     0]:
     90.0000 -   100.0000 [     0]:

  Sender %c0
      0.0000 -    10.0000 [     1]: *
     10.0000 -    20.0000 [     0]:
     20.0000 -    30.0000 [     0]:
     30.0000 -    40.0000 [     2]: **
     40.0000 -    50.0000 [    58]:
  **********************************************************
     50.0000 -    60.0000 [     0]:
     60.0000 -    70.0000 [     0]:
     70.0000 -    80.0000 [     0]:
     80.0000 -    90.0000 [     0]:
     90.0000 -   100.0000 [     0]:

  -----------------------------------------------------------
  3.10-rc2 test RX (reverts 69a37bea
  and e11538d1)
  TCP_RR trans/s 47766.95

  Receiver %c0
      0.0000 -    10.0000 [     1]: *
     10.0000 -    20.0000 [     1]: *
     20.0000 -    30.0000 [     0]:
     30.0000 -    40.0000 [    27]: ***************************
     40.0000 -    50.0000 [     2]: **
     50.0000 -    60.0000 [     0]:
     60.0000 -    70.0000 [     2]: **
     70.0000 -    80.0000 [     0]:
     80.0000 -    90.0000 [     0]:
     90.0000 -   100.0000 [    28]: ****************************

  Sender:
      0.0000 -    10.0000 [     1]: *
     10.0000 -    20.0000 [     0]:
     20.0000 -    30.0000 [     0]:
     30.0000 -    40.0000 [    11]: ***********
     40.0000 -    50.0000 [     0]:
     50.0000 -    60.0000 [     1]: *
     60.0000 -    70.0000 [     0]:
     70.0000 -    80.0000 [     3]: ***
     80.0000 -    90.0000 [     7]: *******
     90.0000 -   100.0000 [    38]: **************************************

  These results demonstrate gaining back the tendency of the CPU to
  stay in more responsive, performant C-states (and thus yield
  measurably better performance), by reverting commit
  69a37bea.
Requested-by: NJeremy Eder <jeder@redhat.com>
Tested-by: NLen Brown <len.brown@intel.com>
Cc: 3.8+ <stable@vger.kernel.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

14851912

Revert "cpuidle: Quickly notice prediction failure in general case" · 228b3023

由 Rafael J. Wysocki 提交于 7月 27, 2013

Revert commit e11538d1 (cpuidle: Quickly notice prediction failure in
general case), since it depends on commit 69a37bea (cpuidle: Quickly
notice prediction failure for repeat mode) that has been identified
as the source of a significant performance regression in v3.8 and
later.
Requested-by: NJeremy Eder <jeder@redhat.com>
Tested-by: NLen Brown <len.brown@intel.com>
Cc: 3.8+ <stable@vger.kernel.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

228b3023

15 7月, 2013 1 次提交

cpuidle: Make it clear that governors cannot be modules · 137b944e

由 Daniel Lezcano 提交于 6月 12, 2013

cpufreq governors are defined as modules in the code, but the Kconfig
options do not allow them to be built as modules.  This is not really
a problem, but the cpuidle init ordering is: the cpuidle init
functions (framework and driver) and then the governors.  That leads
to some weirdness in the cpuidle framework.

Namely,  cpuidle_register_device() calls cpuidle_enable_device() which
fails at the first attempt, because governors have not been registered
yet.  When a governor is registered, the framework calls
cpuidle_enable_device() again which runs __cpuidle_register_device()
only then.  Of course, for that to work, the cpuidle_enable_device()
return value has to be ignored by cpuidle_register_device().

Instead of having this cyclic call graph and relying on a positive
side effects of the hackish back and forth cpuidle_enable_device()
calls it is better to fix the cpuidle init ordering.

To that end, replace the module init code with postcore_initcall()
so we have:

 * cpuidle framework : core_initcall
 * cpuidle governors : postcore_initcall
 * cpuidle drivers   : device_initcall

and remove the corresponding module exit code as it is dead anyway
(governors can't be built as modules).

[rjw: Changelog]
Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

137b944e

15 1月, 2013 1 次提交

cpuidle: remove the power_specified field in the driver · 8aef33a7

由 Daniel Lezcano 提交于 1月 15, 2013

We realized that the power usage field is never filled and when it
is filled for tegra, the power_specified flag is not set causing all
of these values to be reset when the driver is initialized with
set_power_state().

However, the power_specified flag can be simply removed under the
assumption that the states are always backward sorted, which is the
case with the current code.

This change allows the menu governor select function and the
cpuidle_play_dead() to be simplified.  Moreover, the
set_power_states() function can removed as it does not make sense
any more.

Drop the power_specified flag from struct cpuidle_driver and make
the related changes as described above.

As a consequence, this also fixes the bug where on the dynamic
C-states system, the power fields are not initialized.

[rjw: Changelog]
References: https://bugzilla.kernel.org/show_bug.cgi?id=42870
References: https://bugzilla.kernel.org/show_bug.cgi?id=43349
References: https://lkml.org/lkml/2012/10/16/518Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

8aef33a7

03 1月, 2013 1 次提交

cpuidle: Fix finding state with min power_usage · 0e5537b3

由 Sivaram Nair 提交于 12月 18, 2012

Since cpuidle_state.power_usage is a signed value, use INT_MAX (instead
of -1) to init the local copies so that functions that tries to find
cpuidle states with minimum power usage works correctly even if they use
non-negative values.
Signed-off-by: NSivaram Nair <sivaramn@nvidia.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

0e5537b3

23 11月, 2012 1 次提交

cpuidle: fix a suspicious RCU usage in menu governor · a093b93e

由 Li Zhong 提交于 11月 23, 2012

I saw this suspicious RCU usage on the next tree of 11/15

[   67.123404] ===============================
[   67.123413] [ INFO: suspicious RCU usage. ]
[   67.123423] 3.7.0-rc5-next-20121115-dirty #1 Not tainted
[   67.123434] -------------------------------
[   67.123444] include/trace/events/timer.h:186 suspicious rcu_dereference_check() usage!
[   67.123458]
[   67.123458] other info that might help us debug this:
[   67.123458]
[   67.123474]
[   67.123474] RCU used illegally from idle CPU!
[   67.123474] rcu_scheduler_active = 1, debug_locks = 0
[   67.123493] RCU used illegally from extended quiescent state!
[   67.123507] 1 lock held by swapper/1/0:
[   67.123516]  #0:  (&cpu_base->lock){-.-...}, at: [<c0000000000979b0>] .__hrtimer_start_range_ns+0x28c/0x524
[   67.123555]
[   67.123555] stack backtrace:
[   67.123566] Call Trace:
[   67.123576] [c0000001e2ccb920] [c00000000001275c] .show_stack+0x78/0x184 (unreliable)
[   67.123599] [c0000001e2ccb9d0] [c0000000000c15a0] .lockdep_rcu_suspicious+0x120/0x148
[   67.123619] [c0000001e2ccba70] [c00000000009601c] .enqueue_hrtimer+0x1c0/0x1c8
[   67.123639] [c0000001e2ccbb00] [c000000000097aa0] .__hrtimer_start_range_ns+0x37c/0x524
[   67.123660] [c0000001e2ccbc20] [c0000000005c9698] .menu_select+0x508/0x5bc
[   67.123678] [c0000001e2ccbd20] [c0000000005c740c] .cpuidle_idle_call+0xa8/0x6e4
[   67.123699] [c0000001e2ccbdd0] [c0000000000459a0] .pSeries_idle+0x10/0x34
[   67.123717] [c0000001e2ccbe40] [c000000000014dc8] .cpu_idle+0x130/0x280
[   67.123738] [c0000001e2ccbee0] [c0000000006ffa8c] .start_secondary+0x378/0x384
[   67.123758] [c0000001e2ccbf90] [c00000000000936c] .start_secondary_prolog+0x10/0x14

hrtimer_start was added in 198fd638 and ae515197. The patch below tries
to use RCU_NONIDLE around it to avoid the above report.
Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

a093b93e

15 11月, 2012 3 次提交

cpuidle: Get typical recent sleep interval · c96ca4fb

由 Youquan Song 提交于 10月 26, 2012

The function detect_repeating_patterns was not very useful for
workloads with alternating long and short pauses, for example
virtual machines handling network requests for each other (say
a web and database server).

Instead, try to find a recent sleep interval that is somewhere
between the median and the mode sleep time, by discarding outliers
to the up side and recalculating the average and standard deviation
until that is no longer required.

This should do something sane with a sleep interval series like:

	200 180 210 10000 30 1000 170 200

The current code would simply discard such a series, while the
new code will guess a typical sleep interval just shy of 200.

The original patch come from Rik van Riel <riel@redhat.com>.
Signed-off-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

c96ca4fb

cpuidle: Quickly notice prediction failure in general case · e11538d1

由 Youquan Song 提交于 10月 26, 2012

The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.

The patch extends to general case that prediction logic get a small predicted
residency, so it choose a shallow C-state though the expected residency is large
. Once the prediction will be fail, the CPU will keep staying at shallow C-state
for a long time. Acutally, the CPU has change enter into deep C-state.
So when the expected residency is long enough but governor choose a shallow
C-state, an timer will be added in order to monitor if the prediction failure.

When C-state is waken up prior to the adding timer, the timer will be cancelled
initiatively. When the timer is triggered and menu governor will quickly notice
prediction failure and re-evaluates deeper C-states possibility.
Signed-off-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

e11538d1

cpuidle: Quickly notice prediction failure for repeat mode · 69a37bea

由 Youquan Song 提交于 10月 26, 2012

The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.

cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.

There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
 governor will predict it is repeat mode and there is another IPI wake up idle
 CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
 idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.

In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.

Below is another case which will clearly show the patch much benefit:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>

volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;

void usage(void)
{
	fprintf(stderr,
		"Usage: idle_predict [options]\n"
		"  --help	-h  Print this help\n"
		"  --thread	-n  Thread number\n"
		"  --loop     	-l  Loop times in shallow Cstate\n"
		"  --delay	-t  Sleep time (uS)in shallow Cstate\n");
}

void *simple_loop() {
	int idle_num = 1;
	while (!(*shutdown)) {
		*count = *count + 1;

		if (idle_num % loop)
			usleep(delay);
		else {
			/* sleep 1 second */
			usleep(1000000);
			idle_num = 0;
		}
		idle_num++;
	}

}

static void sighand(int sig)
{
	*shutdown = 1;
}

int main(int argc, char *argv[])
{
	sigset_t sigset;
	int signum = SIGALRM;
	int i, c, er = 0, thread_num = 8;
	pthread_t pt[1024];

	static char optstr[] = "n:l:t:h:";

	while ((c = getopt(argc, argv, optstr)) != EOF)
		switch (c) {
			case 'n':
				thread_num = atoi(optarg);
				break;
			case 'l':
				loop = atoi(optarg);
				break;
			case 't':
				delay = atoi(optarg);
				break;
			case 'h':
			default:
				usage();
				exit(1);
		}

	printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
	count = malloc(sizeof(long));
	shutdown = malloc(sizeof(int));
	*count = 0;
	*shutdown = 0;

	sigemptyset(&sigset);
	sigaddset(&sigset, signum);
	sigprocmask (SIG_BLOCK, &sigset, NULL);
	signal(SIGINT, sighand);
	signal(SIGTERM, sighand);

	for(i = 0; i < thread_num ; i++)
		pthread_create(&pt[i], NULL, simple_loop, NULL);

	for (i = 0; i < thread_num; i++)
		pthread_join(pt[i], NULL);

	exit(0);
}

Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop

We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.

While after patched the kernel, we find that deep C-state will keep >99.6%.
Signed-off-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

69a37bea

04 7月, 2012 2 次提交

PM / Domains: Add preliminary support for cpuidle, v2 · cbc9ef02

由 Rafael J. Wysocki 提交于 7月 03, 2012

On some systems there are CPU cores located in the same power
domains as I/O devices.  Then, power can only be removed from the
domain if all I/O devices in it are not in use and the CPU core
is idle.  Add preliminary support for that to the generic PM domains
framework.

First, the platform is expected to provide a cpuidle driver with one
extra state designated for use with the generic PM domains code.
This state should be initially disabled and its exit_latency value
should be set to whatever time is needed to bring up the CPU core
itself after restoring power to it, not including the domain's
power on latency.  Its .enter() callback should point to a procedure
that will remove power from the domain containing the CPU core at
the end of the CPU power transition.

The remaining characteristics of the extra cpuidle state, referred to
as the "domain" cpuidle state below, (e.g. power usage, target
residency) should be populated in accordance with the properties of
the hardware.

Next, the platform should execute genpd_attach_cpuidle() on the PM
domain containing the CPU core.  That will cause the generic PM
domains framework to treat that domain in a special way such that:

 * When all devices in the domain have been suspended and it is about
   to be turned off, the states of the devices will be saved, but
   power will not be removed from the domain.  Instead, the "domain"
   cpuidle state will be enabled so that power can be removed from
   the domain when the CPU core is idle and the state has been chosen
   as the target by the cpuidle governor.

 * When the first I/O device in the domain is resumed and
   __pm_genpd_poweron(() is called for the first time after
   power has been removed from the domain, the "domain" cpuidle
   state will be disabled to avoid subsequent surprise power removals
   via cpuidle.

The effective exit_latency value of the "domain" cpuidle state
depends on the time needed to bring up the CPU core itself after
restoring power to it as well as on the power on latency of the
domain containing the CPU core.  Thus the "domain" cpuidle state's
exit_latency has to be recomputed every time the domain's power on
latency is updated, which may happen every time power is restored
to the domain, if the measured power on latency is greater than
the latency stored in the corresponding generic_pm_domain structure.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: NKevin Hilman <khilman@ti.com>

cbc9ef02

cpuidle: move field disable from per-driver to per-cpu · dc7fd275

由 ShuoX Liu 提交于 7月 03, 2012

Andrew J.Schorr raises a question.  When he changes the disable setting on
a single CPU, it affects all the other CPUs.  Basically, currently, the
disable field is per-driver instead of per-cpu.  All the C states of the
same driver are shared by all CPU in the same machine.

The patch changes the `disable' field to per-cpu, so we could set this
separately for each cpu.
Signed-off-by: NShuoX Liu <shuox.liu@intel.com>
Reported-by: NAndrew J.Schorr <aschorr@telemetry-investments.com>
Reviewed-by: NYanmin Zhang <yanmin_zhang@intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

dc7fd275

30 3月, 2012 2 次提交

cpuidle: power_usage should be declared signed integer · 02401c06

由 Boris Ostrovsky 提交于 3月 13, 2012

power_usage is always assigned a negative value and should be declared
a signed integer
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: NLen Brown <len.brown@intel.com>

02401c06

cpuidle: add a sysfs entry to disable specific C state for debug purpose. · 3a53396b

由 ShuoX Liu 提交于 3月 28, 2012

Some C states of new CPU might be not good.  One reason is BIOS might
configure them incorrectly.  To help developers root cause it quickly, the
patch adds a new sysfs entry, so developers could disable specific C state
manually.

In addition, C state might have much impact on performance tuning, as it
takes much time to enter/exit C states, which might delay interrupt
processing.  With the new debug option, developers could check if a deep C
state could impact performance and how much impact it could cause.

Also add this option in Documentation/cpuidle/sysfs.txt.

[akpm@linux-foundation.org: check kstrtol return value]
Signed-off-by: NShuoX Liu <shuox.liu@intel.com>
Reviewed-by: NYanmin Zhang <yanmin_zhang@intel.com>
Reviewed-and-Tested-by: NDeepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLen Brown <len.brown@intel.com>

3a53396b

07 11月, 2011 3 次提交

cpuidle: Single/Global registration of idle states · 46bcfad7

由 Deepthi Dharwar 提交于 10月 28, 2011

This patch makes the cpuidle_states structure global (single copy)
instead of per-cpu. The statistics needed on per-cpu basis
by the governor are kept per-cpu. This simplifies the cpuidle
subsystem as state registration is done by single cpu only.
Having single copy of cpuidle_states saves memory. Rare case
of asymmetric C-states can be handled within the cpuidle driver
and architectures such as POWER do not have asymmetric C-states.

Having single/global registration of all the idle states,
dynamic C-state transitions on x86 are handled by
the boot cpu. Here, the boot cpu  would disable all the devices,
re-populate the states and later enable all the devices,
irrespective of the cpu that would receive the notification first.

Reference:
https://lkml.org/lkml/2011/4/25/83Signed-off-by: NDeepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: NTrinabh Gupta <g.trinabh@gmail.com>
Tested-by: NJean Pihet <j-pihet@ti.com>
Reviewed-by: NKevin Hilman <khilman@ti.com>
Acked-by: NArjan van de Ven <arjan@linux.intel.com>
Acked-by: NKevin Hilman <khilman@ti.com>
Signed-off-by: NLen Brown <len.brown@intel.com>

46bcfad7

cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare() · b25edc42

由 Deepthi Dharwar 提交于 10月 28, 2011

The cpuidle_device->prepare() mechanism causes updates to the
cpuidle_state[].flags, setting and clearing CPUIDLE_FLAG_IGNORE
to tell the governor not to chose a state on a per-cpu basis at
run-time. State demotion is now handled by the driver and it returns
the actual state entered. Hence, this mechanism is not required.
Also this removes per-cpu flags from cpuidle_state enabling
it to be made global.

Reference:
https://lkml.org/lkml/2011/3/25/52Signed-off-by: NDeepthi Dharwar <deepthi@linux.vnet.ibm>
Signed-off-by: NTrinabh Gupta <g.trinabh@gmail.com>
Tested-by: NJean Pihet <j-pihet@ti.com>
Acked-by: NArjan van de Ven <arjan@linux.intel.com>
Reviewed-by: NKevin Hilman <khilman@ti.com>
Signed-off-by: NLen Brown <len.brown@intel.com>

b25edc42

cpuidle: Move dev->last_residency update to driver enter routine; remove dev->last_state · e978aa7d

由 Deepthi Dharwar 提交于 10月 28, 2011

Cpuidle governor only suggests the state to enter using the
governor->select() interface, but allows the low level driver to
override the recommended state. The actual entered state
may be different because of software or hardware demotion. Software
demotion is done by the back-end cpuidle driver and can be accounted
correctly. Current cpuidle code uses last_state field to capture the
actual state entered and based on that updates the statistics for the
state entered.

Ideally the driver enter routine should update the counters,
and it should return the state actually entered rather than the time
spent there. The generic cpuidle code should simply handle where
the counters live in the sysfs namespace, not updating the counters.

Reference:
https://lkml.org/lkml/2011/3/25/52Signed-off-by: NDeepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: NTrinabh Gupta <g.trinabh@gmail.com>
Tested-by: NJean Pihet <j-pihet@ti.com>
Reviewed-by: NKevin Hilman <khilman@ti.com>
Acked-by: NArjan van de Ven <arjan@linux.intel.com>
Acked-by: NKevin Hilman <khilman@ti.com>
Signed-off-by: NLen Brown <len.brown@intel.com>

e978aa7d

01 11月, 2011 1 次提交
- P
  cpuidle: Add module.h to drivers/cpuidle files as required. · 884b17e1
  由 Paul Gortmaker 提交于 8月 29, 2011
```
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
```
  884b17e1
25 8月, 2011 1 次提交

PM QoS: Move and rename the implementation files · e8db0be1

由 Jean Pihet 提交于 8月 25, 2011

The PM QoS implementation files are better named
kernel/power/qos.c and include/linux/pm_qos.h.

The PM QoS support is compiled under the CONFIG_PM option.
Signed-off-by: NJean Pihet <j-pihet@ti.com>
Acked-by: Nmarkgross <markgross@thegnar.org>
Reviewed-by: NKevin Hilman <khilman@ti.com>
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

e8db0be1

29 5月, 2011 1 次提交

cpuidle: menu: fixed wrapping timers at 4.294 seconds · 7467571f

由 Tero Kristo 提交于 2月 24, 2011

Cpuidle menu governor is using u32 as a temporary datatype for storing
nanosecond values which wrap around at 4.294 seconds. This causes errors
in predicted sleep times resulting in higher than should be C state
selection and increased power consumption. This also breaks cpuidle
state residency statistics.

cc: stable@kernel.org # .32.x through .39.x
Signed-off-by: NTero Kristo <tero.kristo@nokia.com>
Signed-off-by: NLen Brown <len.brown@intel.com>

7467571f

29 9月, 2010 1 次提交
- L
  cpuidle: Fix typos · 20e3341b
  由 Lucas De Marchi 提交于 9月 07, 2010
```
Signed-off-by: NLen Brown <len.brown@intel.com>
```
  20e3341b
10 8月, 2010 1 次提交

cpuidle: extend cpuidle and menu governor to handle dynamic states · 71abbbf8

由 Ai Li 提交于 8月 09, 2010

On some SoC chips, HW resources may be in use during any particular idle
period. As a consequence, the cpuidle states that the SoC is safe to
enter can change from idle period to idle period. In addition, the
latency and threshold of each cpuidle state can vary, depending on the
operating condition when the CPU becomes idle, e.g. the current cpu
frequency, the current state of the HW blocks, etc.

cpuidle core and the menu governor, in the current form, are geared
towards cpuidle states that are static, i.e. the availabiltiy of the
states, their latencies, their thresholds are non-changing during run
time. cpuidle does not provide any hook that cpuidle drivers can use to
adjust those values on the fly for the current idle period before the menu
governor selects the target cpuidle state.

This patch extends cpuidle core and the menu governor to handle states
that are dynamic. There are three additions in the patch and the patch
maintains backwards-compatibility with existing cpuidle drivers.

1) add prepare() to struct cpuidle_device. A cpuidle driver can hook
into the callback and cpuidle will call prepare() before calling the
governor's select function. The callback gives the cpuidle driver a
chance to update the dynamic information of the cpuidle states for the
current idle period, e.g. state availability, latencies, thresholds,
power values, etc.

2) add CPUIDLE_FLAG_IGNORE as one of the state flags. In the prepare()
function, a cpuidle driver can set/clear the flag to indicate to the
menu governor whether a cpuidle state should be ignored, i.e. not
available, during the current idle period.

3) add power_specified bit to struct cpuidle_device. The menu governor
currently assumes that the cpuidle states are arranged in the order of
increasing latency, threshold, and power savings. This is true or can
be made true for static states. Once the state parameters are dynamic,
the latencies, thresholds, and power savings for the cpuidle states can
increase or decrease by different amounts from idle period to idle
period. So the assumption of increasing latency, threshold, and power
savings from Cn to C(n+1) can no longer be guaranteed.

It can be straightforward to calculate the power consumption of each
available state and to specify it in power_usage for the idle period.
Using the power_usage fields, the menu governor then selects the state
that has the lowest power consumption and that still satisfies all other
critieria. The power_specified bit defaults to 0. For existing cpuidle
drivers, cpuidle detects that power_specified is 0 and fills in a dummy
set of power_usage values.
Signed-off-by: NAi Li <aili@codeaurora.org>
Cc: Len Brown <len.brown@intel.com>
Acked-by: NArjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

71abbbf8

01 7月, 2010 1 次提交

sched: Cure nr_iowait_cpu() users · 8c215bd3

由 Peter Zijlstra 提交于 7月 01, 2010

Commit 0224cf4c (sched: Intoduce get_cpu_iowait_time_us())
broke things by not making sure preemption was indeed disabled
by the callers of nr_iowait_cpu() which took the iowait value of
the current cpu.

This resulted in a heap of preempt warnings. Cure this by making
nr_iowait_cpu() take a cpu number and fix up the callers to pass
in the right number.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: linux-pm@lists.linux-foundation.org
LKML-Reference: <1277968037.1868.120.camel@laptop>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

8c215bd3

25 5月, 2010 1 次提交

cpuidle: add a repeating pattern detector to the menu governor · 1f85f87d

由 Arjan van de Ven 提交于 5月 24, 2010

Currently, the menu governor uses the (corrected) next timer as key item
for predicting the idle duration.

It turns out that there are specific cases where this breaks down: There
are cases where we have a very repetitive pattern of idle durations, where
the idle period is pretty much the same, for reasons completely unrelated
to the next timer event.  Examples of such repeating patterns are network
loads with irq mitigation, the mouse moving but in theory also the wifi
beacons.

This patch adds a relatively simple detector for such repeating patterns,
where the standard deviation of the last 8 idle periods is compared to a
threshold.

With this extra predictor in place, measurements show that the DECAY
factor can now be increased (the decaying average will now decay slower)
to get an even more stable result.

[arjan@infradead.org: fix bug identified by Frank]
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Cc: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1f85f87d

11 5月, 2010 1 次提交

PM QOS update · ed77134b

由 Mark Gross 提交于 5月 06, 2010

This patch changes the string based list management to a handle base
implementation to help with the hot path use of pm-qos, it also renames
much of the API to use "request" as opposed to "requirement" that was
used in the initial implementation.  I did this because request more
accurately represents what it actually does.

Also, I added a string based ABI for users wanting to use a string
interface.  So if the user writes 0xDDDDDDDD formatted hex it will be
accepted by the interface.  (someone asked me for it and I don't think
it hurts anything.)

This patch updates some documentation input I got from Randy.
Signed-off-by: Nmarkgross <mgross@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

ed77134b

10 5月, 2010 1 次提交

cpuidle: Fix incorrect optimization · 1c6fe036

由 Arjan van de Ven 提交于 5月 08, 2010

commit 672917dc ("cpuidle: menu governor: reduce latency on exit")
added an optimization, where the analysis on the past idle period moved
from the end of idle, to the beginning of the new idle.

Unfortunately, this optimization had a bug where it zeroed one key
variable for new use, that is needed for the analysis.  The fix is
simple, zero the variable after doing the work from the previous idle.

During the audit of the code that found this issue, another issue was
also found; the ->measured_us data structure member is never set, a
local variable is always used instead.
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Cc: stable@kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1c6fe036

07 3月, 2010 1 次提交

cpuidle menu: remove 8 bytes of padding on 64 bit builds · 56e6943b

由 Richard Kennedy 提交于 3月 05, 2010

Reorder struct menu_device to remove 8 bytes of padding on 64 bit builds.
Size drops from 136 to 128 bytes, so possibly needing one fewer cache
lines.
Signed-off-by: NRichard Kennedy <richard@rsk.demon.co.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

56e6943b

12 1月, 2010 1 次提交

drivers/cpuidle/governors/menu.c: fix undefined reference to `__udivdi3' · 5787536e

由 Stephen Hemminger 提交于 1月 08, 2010

menu: use proper 64 bit math

The new menu governor is incorrectly doing a 64 bit divide.  Compile
tested only
Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5787536e

22 9月, 2009 1 次提交

cpuidle: menu governor: reduce latency on exit · 672917dc

由 Corrado Zoccolo 提交于 9月 21, 2009

Move the state residency accounting and statistics computation off the hot
exit path.

On exit, the need to recompute statistics is recorded, and new statistics
will be computed when menu_select is called again.

The expected effect is to reduce processor wakeup latency from sleep
(C-states).  We are speaking of few hundreds of cycles reduction out of a
several microseconds latency (determined by the hardware transition), so
it is difficult to measure.
Signed-off-by: NCorrado Zoccolo <czoccolo@gmail.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Adam Belay <abelay@novell.com
Acked-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

672917dc

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功