提交 · a240c4aa5d0f9c8bb0db380ab9c1ec1f68b923ad · openeuler / Kernel

06 3月, 2017 3 次提交

cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy · a240c4aa

由 Rafael J. Wysocki 提交于 3月 02, 2017

If the current P-state selection algorithm is set to "performance"
in intel_pstate_set_policy(), the limits may be initialized from
scratch, but only if no_turbo is not set and the maximum frequency
allowed for the given CPU (i.e. the policy object representing it)
is at least equal to the max frequency supported by the CPU.  In all
of the other cases, the limits will not be updated.

For example, the following can happen:

 # cat intel_pstate/status
 active
 # echo performance > cpufreq/policy0/scaling_governor
 # cat intel_pstate/min_perf_pct
 100
 # echo 94 > intel_pstate/min_perf_pct
 # cat intel_pstate/min_perf_pct
 100
 # cat cpufreq/policy0/scaling_max_freq
 3100000
 echo 3000000 > cpufreq/policy0/scaling_max_freq
 # cat intel_pstate/min_perf_pct
 94
 # echo 95 > intel_pstate/min_perf_pct
 # cat intel_pstate/min_perf_pct
 95

That is confusing for two reasons.  First, the initial attempt to
change min_perf_pct to 94 seems to have no effect, even though
setting the global limits should always work.  Second, after
changing scaling_max_freq for policy0 the global min_perf_pct
attribute shows 94, even though it should have not been affected
by that operation in principle.

Moreover, the final attempt to change min_perf_pct to 95 worked
as expected, because scaling_max_freq for the only policy with
scaling_governor equal to "performance" was different from the
maximum at that time.

To make all that confusion go away, modify intel_pstate_set_policy()
so that it doesn't reinitialize the limits at all.

At the same time, change intel_pstate_set_performance_limits() to
set min_sysfs_pct to 100 in the "performance" limits set so that
switching the P-state selection algorithm to "performance" causes
intel_pstate/min_perf_pct in sysfs to go to 100 (or whatever value
min_sysfs_pct in the "performance" limits is set to later).

That requires per-CPU limits to be initialized explicitly rather
than by copying the global limits to avoid setting min_sysfs_pct
in the per-CPU limits to 100.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

a240c4aa

cpufreq: intel_pstate: Fix intel_pstate_verify_policy() · d74b1992

由 Rafael J. Wysocki 提交于 3月 01, 2017

The code added to intel_pstate_verify_policy() by commit 1443ebba
(cpufreq: intel_pstate: Fix sysfs limits enforcement for performance
policy) should use perf_limits instead of limits, because otherwise
setting global limits via sysfs may affect policies inconsistently.

For example, in the sequence of shell commands below, the
scaling_min_freq attribute for policy1 and policy2 should be
affected in the same way, because scaling_governor is set in
the same way for both of them:

 # cat cpufreq/policy1/scaling_governor
 powersave
 # cat cpufreq/policy2/scaling_governor
 powersave
 # echo performance > cpufreq/policy0/scaling_governor
 # echo 94 > intel_pstate/min_perf_pct
 # cat cpufreq/policy0/scaling_min_freq
 2914000
 # cat cpufreq/policy1/scaling_min_freq
 2914000
 # cat cpufreq/policy2/scaling_min_freq
 800000

The are affected differently, because intel_pstate_verify_policy()
is invoked with limits set to &performance_limits (left behind by
policy0) for policy1 and with limits set to &powersave_limits (left
behind by policy1) for policy2.  Since perf_limits is set to the
set of limits matching the policy being updated, using it instead
of limits fixes the inconsistency.

Fixes: 1443ebba (cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

d74b1992

cpufreq: intel_pstate: Fix global settings in active mode · cd59b4be

由 Rafael J. Wysocki 提交于 3月 01, 2017

Commit 111b8b3f (cpufreq: intel_pstate: Always keep all
limits settings in sync) changed intel_pstate to invoke
cpufreq_update_policy() for every registered CPU on global sysfs
attributes updates, but that led to undesirable effects in the
active mode if the "performance" P-state selection algorithm is
configufred for one CPU and the "powersave" one is chosen for
all of the other CPUs.

Namely, in that case, the following is possible:

 # cd /sys/devices/system/cpu/
 # cat intel_pstate/max_perf_pct
 100
 # cat intel_pstate/min_perf_pct
 26
 # echo performance > cpufreq/policy0/scaling_governor
 # cat intel_pstate/max_perf_pct
 100
 # cat intel_pstate/min_perf_pct
 100
 # echo 94 > intel_pstate/min_perf_pct
 # cat intel_pstate/min_perf_pct
 26

The reason why this happens is because intel_pstate attempts to
maintain two sets of global limits in the active mode, one for
the "performance" P-state selection algorithm and one for the
"powersave"  P-state selection algorithm, but the P-state selection
algorithms are set per policy, so the global limits cannot reflect
all of them at the same time if they are different for different
policies.

In the particular situation above, the attempt to change
min_perf_pct to 94 caused cpufreq_update_policy() to be run
for a CPU with the "powersave"  P-state selection algorithm
and intel_pstate_set_policy() called by it silently switched the
global limits to the "powersave" set which finally was reflected
by the sysfs interface.

To prevent that from happening, modify intel_pstate_update_policies()
to always switch back to the set of limits that was used right before
it has been invoked.

Fixes: 111b8b3f (cpufreq: intel_pstate: Always keep all limits settings in sync)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

cd59b4be

04 3月, 2017 3 次提交

cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily · 64078299

由 Rafael J. Wysocki 提交于 3月 03, 2017

In the passive mode the cpu_frequency trace event is already
triggered by the cpufreq core or by scaling governors, so
intel_pstate should not trigger it once again for the same
P-state updates.

In addition to that, the frequency returned by
intel_cpufreq_fast_switch() and passed via freqs.new from
intel_cpufreq_target() to cpufreq_freq_transition_end() should
reflect the P-state actually set, so make that happen.

Fixes: 001c76f0 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

64078299

cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy() · 7f17326f

由 Rafael J. Wysocki 提交于 3月 02, 2017

The intel_pstate_update_perf_limits() called from
intel_cpufreq_verify_policy() may cause global P-state limits
to change which is generally confusing and unnecessary.

In the passive mode the global limits are only applied to the
frequency selected by the scaling governor (they are not taken
into account by governors when making decisions anyway), so making
them follow the per-policy limits serves no purpose and may go
against user expectations (as it generally causes the global
attributes in sysfs to change even though they have not been
written to in some cases).

Fix that by dropping the intel_pstate_update_perf_limits()
invocation from intel_cpufreq_verify_policy() (which also
reduces the code size by a few lines).

This change does not affect the per-CPU limits case, because those
limits allow any P-state to be set by default in the passive mode
and it removes the only piece of code updating them in that mode,
so the per-policy settings will be the only ones taken into account
in that case as expected.

Fixes: 001c76f0 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

7f17326f

cpufreq: intel_pstate: Do not use performance_limits in passive mode · 2bc756e7

由 Rafael J. Wysocki 提交于 3月 02, 2017

Using performance_limits in the passive mode doesn't make
sense, because in that mode the global limits are applied to the
frequency selected by the scaling governor.

The maximum and minimum P-state limits in performance_limits are both
set to 100 percent which will put all CPUs into the turbo range
regardless of what governor is used and what frequencies are
selected by it (that is particularly undesirable on CPUs with the
generic powersave governor attached).

For this reason, make intel_pstate_register_driver() always point
limits to powersave_limits in the passive mode.

Fixes: 001c76f0 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

2bc756e7

28 2月, 2017 1 次提交

cpufreq: intel_pstate: Fix limits issue with operation mode switching · c3a49c89

由 Rafael J. Wysocki 提交于 2月 28, 2017

There is a problem with intel_pstate operation mode switching
introduced by commit fb1fe104 (cpufreq: intel_pstate: Operation
mode control from sysfs), because the global sysfs limits are
preserved across operation modes while per-policy limits are
reinitialized from scratch on a mode switch and both sets of limits
may get out of sync this way.

Fix that by always reinitializing the global limits upon the
registration of the driver.

Fixes: fb1fe104 (cpufreq: intel_pstate: Operation mode control from sysfs)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

c3a49c89

04 2月, 2017 6 次提交

cpufreq: intel_pstate: Disable energy efficiency optimization · 6e978b22

由 Srinivas Pandruvada 提交于 2月 03, 2017

Some Kabylake desktop processors may not reach max turbo when running in
HWP mode, even if running under sustained 100% utilization.

This occurs when the HWP.EPP (Energy Performance Preference) is set to
"balance_power" (0x80) -- the default on most systems.

It occurs because the platform BIOS may erroneously enable an
energy-efficiency setting -- MSR_IA32_POWER_CTL BIT-EE, which is not
recommended to be enabled on this SKU.

On the failing systems, this BIOS issue was not discovered when the
desktop motherboard was tested with Windows, because the BIOS also
neglects to provide the ACPI/CPPC table, that Windows requires to enable
HWP, and so Windows runs in legacy P-state mode, where this setting has
no effect.

Linux' intel_pstate driver does not require ACPI/CPPC to enable HWP, and
so it runs in HWP mode, exposing this incorrect BIOS configuration.

There are several ways to address this problem.

First, Linux can also run in legacy P-state mode on this system.
As intel_pstate is how Linux enables HWP, booting with
"intel_pstate=disable"
will run in acpi-cpufreq/ondemand legacy p-state mode.

Or second, the "performance" governor can be used with intel_pstate,
which will modify HWP.EPP to 0.

Or third, starting in 4.10, the
/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
attribute in can be updated from "balance_power" to "performance".

Or fourth, apply this patch, which fixes the erroneous setting of
MSR_IA32_POWER_CTL BIT_EE on this model, allowing the default
configuration to function as designed.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: NLen Brown <len.brown@intel.com>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

6e978b22

cpufreq: intel_pstate: Calculate guaranteed performance for HWP · 8fc7554a

由 Srinivas Pandruvada 提交于 1月 19, 2017

When HWP is active, turbo activation ratio is not used to calculate max
non turbo ratio. But on these systems the max non turbo ratio is decided
by config TDP settings.

This change removes usage of MSR_TURBO_ACTIVATION_RATIO for HWP systems,
instead directly use TDP ratios, when more than one TDPs are available.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

8fc7554a

cpufreq: intel_pstate: Make HWP limits compatible with legacy · 4e5d3f71

由 Srinivas Pandruvada 提交于 1月 18, 2017

Under HWP the performance limits are calculated using max_perf_pct
and min_perf_pct using possible performance, not available performance.
The available performance can be reduced by no_turbo setting. To make
compatible with legacy mode, use max/min performance percentage with
respect to available performance.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

4e5d3f71

cpufreq: intel_pstate: Lower frequency than expected under no_turbo · 7d9a8a9f

由 Srinivas Pandruvada 提交于 1月 18, 2017

When turbo is not disabled by BIOS, but user disabled from intel P-State
sysfs and changes max/min using cpufreq sysfs, the resultant frequency
is lower than what user requested.

The reason for this, when the perf limits are calculated in set_policy()
callback, they are with reference to max cpu frequency (turbo frequency
), but when enforced in the intel_pstate_get_min_max() they are with
reference to max available performance as documented in the intel_pstate
documentation (in this case max non turbo P-State).

This needs similar change as done in intel_cpufreq_verify_policy() for
passive mode. Set policy->cpuinfo.max_freq based on the turbo status.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

7d9a8a9f

cpufreq: intel_pstate: Operation mode control from sysfs · fb1fe104

由 Rafael J. Wysocki 提交于 1月 05, 2017

Make it possible to change the operation mode of intel_pstate with
the help of a new sysfs attribute called "status".

There are three possible configurations that can be selected using
this attribute:

 "off"     - The driver is not in use at this time.
 "active"  - The driver works as a P-state governor (default).
 "passive" - The driver works as a regular cpufreq one and collaborates
             with the generic cpufreq governors (it sets P-states as
             requested by those governors).  [This is the same mode
             the driver can be started in by passing intel_pstate=passive
             in the kernel command line.]

The current setting is returned by reads from this attribute.  Writing
one of the above strings to it changes the operation mode as indicated
by that string, if possible.

If HW-managed P-states (HWP) feature is enabled, it is not possible
to change the driver's operation mode and attempts to write to this
attribute will fail.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

fb1fe104

cpufreq: intel_pstate: Expose global sysfs attributes upfront · 0c30b65b

由 Rafael J. Wysocki 提交于 1月 11, 2017

Expose the intel_pstate's global sysfs attributes before registering
the driver to prepare for the addition of an attribute that also will
have to work if the driver is not registered.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

0c30b65b

20 1月, 2017 1 次提交

cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy · 1443ebba

由 Srinivas Pandruvada 提交于 1月 18, 2017

A side effect of keeping intel_pstate sysfs limits in sync with cpufreq
is that the now sysfs limits can't enforced under performance policy.

For example, if the max_perf_pct is changed from 100 to 80, this will call
intel_pstate_set_policy(), which will change the max_perf to 100 again for
performance policy. Same issue happens, when no_turbo is set.

This change calculates max and min frequency using sysfs performance
limits in intel_pstate_verify_policy() and adjusts policy limits by
calling cpufreq_verify_within_limits().

Also, it causes the setting of performance limits to be skipped if
no_turbo is set.

Fixes: 111b8b3f (cpufreq: intel_pstate: Always keep all limits settings in sync)
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

1443ebba

01 1月, 2017 3 次提交

cpufreq: intel_pstate: Always keep all limits settings in sync · 111b8b3f

由 Rafael J. Wysocki 提交于 12月 30, 2016

Make intel_pstate update per-logical-CPU limits when the global
settings are changed to ensure that they are always in sync and
users will not see confusing values in per-logical-CPU sysfs
attributes.

This also fixes the problem that setting the "no_turbo" global
attribute to 1 in the "passive" mode (ie. when intel_pstate acts
as a regular cpufreq driver) when scaling_governor is set to
"performance" has no effect.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

111b8b3f

cpufreq: intel_pstate: Use locking in intel_cpufreq_verify_policy() · cad30467

由 Rafael J. Wysocki 提交于 12月 30, 2016

Race conditions are possible if intel_cpufreq_verify_policy()
is executed in parallel with global limits updates from sysfs,
so the invocation of intel_pstate_update_perf_limits() in it
should be carried out under intel_pstate_limits_lock.

Make that happen.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cad30467

cpufreq: intel_pstate: Use locking in intel_pstate_resume() · aa439248

由 Rafael J. Wysocki 提交于 12月 30, 2016

Theoretically, intel_pstate_resume() may be executed in parallel
with intel_pstate_set_policy(), if the latter is invoked via
cpufreq_update_policy() as a result of a notification, so use
intel_pstate_limits_lock in there too to avoid race conditions.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

aa439248

27 12月, 2016 1 次提交

cpufreq: intel_pstate: Do not expose PID parameters in passive mode · 366430b5

由 Rafael J. Wysocki 提交于 12月 24, 2016

If intel_pstate works in the passive mode in which it acts as
a regular cpufreq driver and collaborates with generic cpufreq
governors, the PID parameters are not used, so do not expose
them via debugfs in that case.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

366430b5

08 12月, 2016 2 次提交

cpufreq: intel_pstate: Support for energy performance hints with HWP · 984edbdc

由 Srinivas Pandruvada 提交于 12月 06, 2016

It is possible to provide hints to the HWP algorithms in the processor
to be more performance centric to more energy centric. These hints are
provided by using HWP energy performance preference (EPP) or energy
performance bias (EPB) settings.

The scope of these settings is per logical processor, which means that
each of the logical processors in the package can be programmed with a
different value.

This change provides cpufreq sysfs interface to provide hint. For each
policy, two additional attributes will be available to check and provide
hint. These attributes will only be present when the intel_pstate driver
is using HWP mode.

These attributes are:
 - energy_performance_available_preferences
 - energy_performance_preference

To get list of supported hints:
$ cat energy_performance_available_preferences
default performance balance_performance balance_power power

The current preference can be read or changed via cpufreq sysfs
attribute "energy_performance_preference". Reading from this attribute
will display current effective setting changed via any method. User can
write any of the valid preference string to this attribute. User can
always restore to power-on default by writing "default".

Implementation
Since these hints can be provided by direct MSR write or using some tools
like x86_energy_perf_policy, the driver internally doesn't maintain any
state. The user operation will result in direct read/write of MSR: 0x774
(HWP_REQUEST_MSR). Also driver use read modify write to update other
fields in this MSR.

Summary of changes:
 - struct cpudata field epp_saved is renamed to epp_powersave, as this
   stores the value to restore once policy is switched from performance
   to powersave to restore original powersave EPP value.
 - A new struct cpudata field epp_saved is used to store the raw MSR
   EPP/EPB value when a CPU goes offline or on suspend and restore on
   online/resume. This ensures that EPP value is restored to correct
   value irrespective of the means used to set.
 - EPP/EPB value ranges are fixed for each preference, which can be
   set for the cpufreq sysfs, so user request is mapped to/from this
   range.
 - New attributes are only added when HWP is present.
 - Since EPP value of 0 is valid the fields are initialized to
   -EINVAL when not valid. The field epp_default is read only once
   after powerup to avoid reading on subsequent CPU online operation
 - New suspend callback to store epp on suspend operation
 - Don't invalidate old epp_saved field on resume and online as now
   we can restore last epp value on suspend and this field can still
   have old EPP value sampled during switch to performance from
   powersave.
 - While here optimized setting of cpu_data->epp_powersave = epp in
   intel_pstate_hwp_set() as this was done in both true and false
   paths.
 - epp/epb set function returns error to caller on failure to pass
   on to user space for display.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

984edbdc

cpufreq: intel_pstate: Add locking around HWP requests · b59fe540

由 Srinivas Pandruvada 提交于 12月 06, 2016

To avoid race conditions from multiple threads, increase the scope
of intel_pstate_limits_lock to include HWP requests also.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject ]
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

b59fe540

01 12月, 2016 1 次提交

cpufreq: intel_pstate: Add Knights Mill CPUID · 58bf4542

由 Piotr Luc 提交于 10月 13, 2016

Add Knights Mill (KNM) to the list of CPUIDs supported by intel_pstate.
Signed-off-by: NPiotr Luc <piotr.luc@intel.com>
Reviewed-by: NDave Hansen <dave.hansen@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

58bf4542

28 11月, 2016 2 次提交

cpufreq: intel_pstate: fix intel_pstate_exit_perf_limits() prototype · 7a3ba767

由 Arnd Bergmann 提交于 11月 25, 2016

The addition of the generic governor support marked the
intel_pstate_exit_perf_limits as inline(), which fixed a warning,
but it introduced another warning:

drivers/cpufreq/intel_pstate.c: In function ‘intel_pstate_exit_perf_limits’:
drivers/cpufreq/intel_pstate.c:483:1: error: no return statement in function returning non-void [-Werror=return-type]

This changes it back to a 'void' return type, and changes the
corresponding intel_pstate_init_acpi_perf_limits() function to
be inline as well for consistency.

Fixes: 001c76f0 (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

7a3ba767

cpufreq: intel_pstate: Set EPP/EPB to 0 in performance mode · 8442885f

由 Srinivas Pandruvada 提交于 11月 24, 2016

When user has selected performance policy, then set the EPP (Energy
Performance Preference) or EPB (Energy Performance Bias) to maximum
performance mode.

Also when user switch back to powersave, then restore EPP/EPB to last
EPP/EPB value before entering performance mode. If user has not changed
EPP/EPB manually then it will be power on default value.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

8442885f

25 11月, 2016 1 次提交

cpufreq/intel_pstate: Use CPPC to get max performance · 17669006

由 Rafael J. Wysocki 提交于 11月 22, 2016

Use the acpi cppc_lib interface to get CPPC performance limits and update
the per cpu priority for the ITMT scheduler. If the highest performance of
CPUs differs the ITMT feature is enabled.
Co-developed-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/0998b98943bcdec7d1ddd4ff27358da555ea8e92.1479844244.git.tim.c.chen@linux.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

17669006

22 11月, 2016 2 次提交

cpufreq: intel_pstate: increase precision of performance limits · d5dd33d9

由 Srinivas Pandruvada 提交于 11月 21, 2016

Even with round up of limits->min_perf and limits->max_perf, in some
cases resultant performance is 100 MHz less than the desired.

For example when the maximum frequency is 3.50 GHz, setting
scaling_min_frequency to 2.3 GHz always results in 2.2 GHz minimum.

Currently the fixed floating point operation uses 8 bit precision for
calculating limits->min_perf and limits->max_perf. For some operations
in this driver the 14 bit precision is used. Using the 14 bit precision
also for calculating limits->min_perf and limits->max_perf, addresses
this issue.

Introduced fp_ext_toint() equivalent to fp_toint() and int_ext_tofp()
equivalent to int_tofp() with 14 bit precision.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

d5dd33d9

cpufreq: intel_pstate: round up min_perf limits · 46992d6b

由 Srinivas Pandruvada 提交于 11月 21, 2016

In some use cases, user wants to enforce a minimum performance limit on
CPUs. But because of simple division the resultant performance is 100 MHz
less than the desired in some cases.

For example when the maximum frequency is 3.50 GHz, setting
scaling_min_frequency to 1.6 GHz always results in 1.5 GHz minimum. With
simple round up, the frequency can be set to 1.6 GHz to minimum in this
case. This round up is already done to max_policy_pct and max_perf, so do
the same for min_policy_pct and min_perf.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

46992d6b

21 11月, 2016 1 次提交

cpufreq: intel_pstate: Generic governors support · 001c76f0

由 Rafael J. Wysocki 提交于 11月 17, 2016

There may be reasons to use generic cpufreq governors (eg. schedutil)
on Intel platforms instead of the intel_pstate driver's internal
governor.  However, that currently can only be done by disabling
intel_pstate altogether and using the acpi-cpufreq driver instead
of it, which is subject to limitations.

First of all, acpi-cpufreq only works on systems where the _PSS
object is present in the ACPI tables for all logical CPUs.  Second,
on those systems acpi-cpufreq will only use frequencies listed by
_PSS which may be suboptimal.  In particular, by convention, the
whole turbo range is represented in _PSS as a single P-state and
the frequency assigned to it is greater by 1 MHz than the greatest
non-turbo frequency listed by _PSS.  That may confuse governors to
use turbo frequencies less frequently which may lead to suboptimal
performance.

For this reason, make it possible to use the intel_pstate driver
with generic cpufreq governors as a "normal" cpufreq driver.  That
mode is enforced by adding intel_pstate=passive to the kernel
command line and cannot be disabled at run time.  In that mode,
intel_pstate provides a cpufreq driver interface including
the ->target() and ->fast_switch() callbacks and is listed in
scaling_driver as "intel_cpufreq".
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NDoug Smythies <dsmythies@telus.net>

001c76f0

18 11月, 2016 1 次提交

cpufreq: intel_pstate: Request P-states control from SMM if needed · d0ea59e1

由 Rafael J. Wysocki 提交于 11月 17, 2016

Currently, intel_pstate is unable to control P-states on my
IvyBridge-based Acer Aspire S5, because they are controlled by SMM
on that machine by default and it is necessary to request OS control
of P-states from it via the SMI Command register exposed in the ACPI
FADT.  intel_pstate doesn't do that now, but acpi-cpufreq and other
cpufreq drivers for x86 platforms do.

Address this problem by making intel_pstate use the ACPI-defined
mechanism as well.  However, intel_pstate is not modular and it
doesn't need the module refcount tricks played by
acpi_processor_notify_smm(), so export the core of this function
to it as acpi_processor_pstate_control() and make it call that.
[The changes in processor_perflib.c related to this should not
make any functional difference for the acpi_processor_notify_smm()
users].

To be safe, only call acpi_processor_notify_smm() from intel_pstate
if ACPI _PPC support is enabled in it.
Suggested-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

d0ea59e1

15 11月, 2016 1 次提交

cpufreq: intel_pstate: Use CPU load based algorithm for PM_MOBILE · 7f7a516e

由 Srinivas Pandruvada 提交于 11月 14, 2016

Use get_target_pstate_use_cpu_load() to calculate target P-State for
devices, with the preferred power management profile in ACPI FADT
set to PM_MOBILE.

This may help in resolving some thermal issues caused by low sustained
cpu bound workloads. The current algorithm tend to over provision in this
case as it doesn't look at the CPU busyness.

Also included the fix from Arnd Bergmann <arnd@arndb.de> to solve compile
issue, when CONFIG_ACPI is not defined.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

7f7a516e

01 11月, 2016 3 次提交

cpufreq: intel_pstate: protect limits variable · a410c03d

由 Srinivas Pandruvada 提交于 10月 28, 2016

The limits variable gets modified from intel_pstate sysfs and also gets
modified from cpufreq sysfs. So protect with a mutex to keep data
integrity, when they are getting modified from multiple threads.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

a410c03d

cpufreq: intel_pstate: Reduce impact due to rounding error · 5879f877

由 Srinivas Pandruvada 提交于 10月 25, 2016

When policy->max and policy->min are same, in some cases they don't
result in the same frequency cap. The max_policy_pct is rounded up but
not min_perf_pct. So even when they are same, results in different
percentage or maximum and minimum.
Since minimum is a conservative value for power, a lower value without
rounding is better in most of the cases, unless user wants
policy->max = policy->min.
This change uses use the same policy percentage when policy->max and
policy->min are same.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

5879f877

cpufreq: intel_pstate: Per CPU P-State limits · eae48f04

由 Srinivas Pandruvada 提交于 10月 25, 2016

Intel P-State offers two interface to set performance limits:
- Intel P-State sysfs
	/sys/devices/system/cpu/intel_pstate/max_perf_pct
	/sys/devices/system/cpu/intel_pstate/min_perf_pct
- cpufreq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq

In the current implementation both of the above methods, change limits
to every CPU in the system. Moreover the limits placed using cpufreq
policy interface also presented in the Intel P-State sysfs via modified
max_perf_pct and min_per_pct during sysfs reads. This allows to check
percent of reduced/increased performance, irrespective of method used to
limit.

There are some new generations of processors, where it is possible to
have limits placed on individual CPU cores. Using cpufreq interface it
is possible to set limits on each CPU. But the current processing will
use last limits placed on all CPUs. So the per core limit feature of
CPUs can't be used.

This change brings in capability to set P-States limits for each CPU,
with some limitations. In this case what should be the read of
max_perf_pct and min_perf_pct? It can be most restrictive limits placed
on any CPU or max possible performance on any given CPU on which no
limits are placed. In either case someone will have issue.

So the consensus is, we can't have both sysfs controls present when user
wants to use limit per core limits.
- By default per-core-control feature is not enabled. So no one will
notice any difference.
- The way to enable is by kernel command line
intel_pstate=per_cpu_perf_limits
- When the per-core-controls are enabled there is no display of for both
read and write on
	/sys/devices/system/cpu/intel_pstate/max_perf_pct
	/sys/devices/system/cpu/intel_pstate/min_perf_pct
- User can change limits using
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
	/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- User can still observe turbo percent and number of P-States from
	/sys/devices/system/cpu/intel_pstate/turbo_pct
	/sys/devices/system/cpu/intel_pstate/num_pstates
- User can read write system wide turbo status
	/sys/devices/system/cpu/no_turbo

While changing this BUG_ON is changed to WARN_ON, as they are not fatal
errors for the system.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

eae48f04

25 10月, 2016 1 次提交

cpufreq: intel_pstate: Always set max P-state in performance mode · 2f1d407a

由 Rafael J. Wysocki 提交于 10月 24, 2016

The only times at which intel_pstate checks the policy set for
a given CPU is the initialization of that CPU and updates of its
policy settings from cpufreq when intel_pstate_set_policy() is
invoked.

That is insufficient, however, because intel_pstate uses the same
P-state selection function for all CPUs regardless of the policy
setting for each of them and the P-state limits are shared between
them.  Thus if the policy is set to "performance" for a particular
CPU, it may not behave as expected if the cpufreq settings are
changed subsequently for another CPU.

That can be easily demonstrated by writing "performance" to
scaling_governor for all CPUs and then switching it to "powersave"
for one of them in which case all of the CPUs will behave as though
their scaling_governor were all "powersave" (even though the policy
still appears to be "performance" for the remaining CPUs).

Fix this problem by modifying intel_pstate_adjust_busy_pstate() to
always set the P-state to the maximum allowed by the current limits
for all CPUs whose policy is set to "performance".

Note that it still is recommended to always change the policy setting
in the same way for all CPUs even with this fix applied to avoid
confusion.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

2f1d407a

22 10月, 2016 3 次提交

cpufreq: intel_pstate: Set P-state upfront in performance mode · a6c6ead1

由 Rafael J. Wysocki 提交于 10月 19, 2016

After commit a4675fbc (cpufreq: intel_pstate: Replace timers with
utilization update callbacks) the cpufreq governor callbacks may not
be invoked on NOHZ_FULL CPUs and, in particular, switching to the
"performance" policy via sysfs may not have any effect on them. That
is a problem, because it usually is desirable to squeeze the last
bit of performance out of those CPUs, so work around it by setting
the maximum P-state (within the limits) in intel_pstate_set_policy()
upfront when the policy is CPUFREQ_POLICY_PERFORMANCE.

Fixes: a4675fbc (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

a6c6ead1

cpufreq: intel_pstate: Remove PID debugfs when not used · 185d8245

由 Srinivas Pandruvada 提交于 10月 21, 2016

When target state is calculated using get_target_pstate_use_cpu_load(),
PID controller is not used, hence it has no effect on performance.
So don't present debugfs entries to tune PID controller.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

185d8245

cpufreq: intel_pstate: Drop boost_iowait flag · 1d29815e

由 Rafael J. Wysocki 提交于 10月 19, 2016

The "IOwait boosting" mechanism is only used by the
get_target_pstate_use_cpu_load() governor function and the
boost_iowait flag in pid_params is always set when that function
is in use (and it is never set otherwise).  This means that the
boost_iowait flag is in fact redundant and may be dropped.

For this reason, replace the boost_iowait flag check in
intel_pstate_update_util() with an equivalent check against
pstate_funcs.get_target_pstate and drop that flag.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

1d29815e

13 10月, 2016 2 次提交

cpufreq: intel_pstate: Fix struct pstate_adjust_policy kerneldoc · 3954517e

由 Rafael J. Wysocki 提交于 10月 11, 2016

It looks like the name of struct pstate_adjust_policy was updated
without updating its kerneldoc comment accordingly, so fix that
mistake.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

3954517e

cpufreq: intel_pstate: Proportional algorithm for Atom · 0843e83c

由 Rafael J. Wysocki 提交于 10月 06, 2016

The PID algorithm used by the intel_pstate driver tends to drive
performance to the minimum for workloads with utilization below the
setpoint, which is undesirable, so replace it with a modified
"proportional" algorithm on Atom.

The new algorithm will set the new P-state to be 1.25 times the
available maximum times the (frequency-invariant) utilization during
the previous sampling period except when the target P-state computed
this way is lower than the average P-state during the previous
sampling period. In the latter case, it will increase the target by
50% of the difference between it and the average P-state to prevent
performance from dropping down too fast in some cases.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

0843e83c

10 10月, 2016 2 次提交

cpufreq: intel_pstate: Clarify comment in get_target_pstate_use_performance() · f00593a4

由 Rafael J. Wysocki 提交于 9月 30, 2016

Make the comment explaining the meaning of the perf_scaled variable
in get_target_pstate_use_performance() more straightforward.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

f00593a4

cpufreq: intel_pstate: Fix unsafe HWP MSR access · f9f4872d

由 Srinivas Pandruvada 提交于 10月 08, 2016

This is a requirement that MSR MSR_PM_ENABLE must be set to 0x01 before
reading MSR_HWP_CAPABILITIES on a given CPU. If cpufreq init() is
scheduled on a CPU which is not same as policy->cpu or migrates to a
different CPU before calling msr read for MSR_HWP_CAPABILITIES, it
is possible that MSR_PM_ENABLE was not to set to 0x01 on that CPU.
This will cause GP fault. So like other places in this path
rdmsrl_on_cpu should be used instead of rdmsrl.

Moreover the scope of MSR_HWP_CAPABILITIES is on per thread basis, so it
should be read from the same CPU, for which MSR MSR_HWP_REQUEST is
getting set.

dmesg dump or warning:

[   22.014488] WARNING: CPU: 139 PID: 1 at arch/x86/mm/extable.c:50 ex_handler_rdmsr_unsafe+0x68/0x70
[   22.014492] unchecked MSR access error: RDMSR from 0x771
[   22.014493] Modules linked in:
[   22.014507] CPU: 139 PID: 1 Comm: swapper/0 Not tainted 4.7.5+ #1
...
...
[   22.014516] Call Trace:
[   22.014542]  [<ffffffff813d7dd1>] dump_stack+0x63/0x82
[   22.014558]  [<ffffffff8107bc8b>] __warn+0xcb/0xf0
[   22.014561]  [<ffffffff8107bcff>] warn_slowpath_fmt+0x4f/0x60
[   22.014563]  [<ffffffff810676f8>] ex_handler_rdmsr_unsafe+0x68/0x70
[   22.014564]  [<ffffffff810677d9>] fixup_exception+0x39/0x50
[   22.014604]  [<ffffffff8102e400>] do_general_protection+0x80/0x150
[   22.014610]  [<ffffffff817f9ec8>] general_protection+0x28/0x30
[   22.014635]  [<ffffffff81687940>] ? get_target_pstate_use_performance+0xb0/0xb0
[   22.014642]  [<ffffffff810600c7>] ? native_read_msr+0x7/0x40
[   22.014657]  [<ffffffff81688123>] intel_pstate_hwp_set+0x23/0x130
[   22.014660]  [<ffffffff81688406>] intel_pstate_set_policy+0x1b6/0x340
[   22.014662]  [<ffffffff816829bb>] cpufreq_set_policy+0xeb/0x2c0
[   22.014664]  [<ffffffff81682f39>] cpufreq_init_policy+0x79/0xe0
[   22.014666]  [<ffffffff81682cb0>] ? cpufreq_update_policy+0x120/0x120
[   22.014669]  [<ffffffff816833a6>] cpufreq_online+0x406/0x820
[   22.014671]  [<ffffffff8168381f>] cpufreq_add_dev+0x5f/0x90
[   22.014717]  [<ffffffff81530ac8>] subsys_interface_register+0xb8/0x100
[   22.014719]  [<ffffffff816821bc>] cpufreq_register_driver+0x14c/0x210
[   22.014749]  [<ffffffff81fe1d90>] intel_pstate_init+0x39d/0x4d5
[   22.014751]  [<ffffffff81fe13f2>] ? cpufreq_gov_dbs_init+0x12/0x12

Cc: 4.3+ <stable@vger.kernel.org> # 4.3+
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

f9f4872d

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功