Commits · c8b4accf860203fcb380f5d15b90a7646912d9c2 · openeuler / Kernel

01 Mar, 2023 1 commit

cpufreq: intel_pstate: remove MODULE_LICENSE in non-modules · 5bd289f6

Committed by Nick Alcock on Feb 24, 2023

Since commit 8b41fc44 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.

So remove it in the files in this commit, none of which can be built as
modules.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

24 Feb, 2023 1 commit

cpufreq: intel_pstate: Adjust balance_performance EPP for Sapphire Rapids · 60675225

Committed by Srinivas Pandruvada on Feb 21, 2023

While the majority of server OS distributions are deployed with the
"performance" governor as the default, some distributions like Ubuntu use
the "powersave" governor by default.

While using the "powersave" governor in its default configuration on
Sapphire Rapids systems leads to much lower power, the performance is
lower by more than 25% for several workloads relative to the
"performance" governor.

A 37% difference has been reported by www.Phoronix.com [1].

This is a consequence of using a relatively high EPP value in the
default configuration of the "powersave" governor and the performance
can be made much closer to the "performance" governor's level by
adjusting the default EPP value. Based on experiments, with EPP of 0x00,
0x10, 0x20, the performance delta between the "powersave" governor and
the "performance" one is around 12%. However, the EPP of 0x20 reduces
average power by 18% with respect to the lower EPP values.

[Note that raising min_perf_pct in sysfs as high as 50% in addition to
 adjusting EPP does not improve the performance any further.]

For this reason, change the EPP value corresponding to the default
balance_performance setting for Sapphire Rapids to 0x20, which is
straightforward, because analogous default EPP adjustment has been
applied to Alder Lake and there is a way to set the balance_performance
EPP value in intel_pstate based on the processor model already.

The goal here is to limit the mean performance delta between the
"powersave" governor in the default configuration and the "performance"
governor for a wide variety of server workloads to around 10-12%. For
some bursty workloads, this delta can still be large, as the frequency
ramp-up will still lag when the "powersave" governor is in use
irrespective of the EPP setting, because the performance governor always
requests the maximum possible frequency.

Link: https://www.phoronix.com/review/centos-clear-spr/6 # [1]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

31 Dec, 2022 1 commit

cpufreq: intel_pstate: Drop ACPI _PSS states table patching · e8a0e30b

Committed by Rafael J. Wysocki on Dec 28, 2022

After making acpi_processor_get_platform_limit() use the "no limit"
value for its frequency QoS request when _PPC returns 0, it is not
necessary to replace the frequency corresponding to the first _PSS
return package entry with the maximum turbo frequency of the given
CPU in intel_pstate_init_acpi_perf_limits() any more, so drop the
code doing that along with the comment explaining it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

01 Dec, 2022 1 commit

cpufreq: intel_pstate: Add Sapphire Rapids support in no-HWP mode · df51f287

Committed by Giovanni Gherdovich on Nov 21, 2022

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

See also the following past commits:

commit d8de7a44 ("cpufreq: intel_pstate: Add Skylake servers support")
commit 706c5328 ("cpufreq: intel_pstate: Add Cometlake support in
no-HWP mode")
commit fbdc21e9 ("cpufreq: intel_pstate: Add Icelake servers support in
no-HWP mode")
commit 71bb5c82 ("cpufreq: intel_pstate: Add Tigerlake support in
no-HWP mode")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

04 Nov, 2022 1 commit

cpufreq: intel_pstate: Allow EPP 0x80 setting by the firmware · 21cdb6c1

Committed by Srinivas Pandruvada on Oct 27, 2022

With commit 3d13058e ("cpufreq: intel_pstate: Use firmware default EPP"),
the firmware can set an EPP, which the driver will not overwrite. But the
driver has a valid range check for:
0x40 > firmware epp < 0x80.
Hence the firmware can't specify an EPP of 0x80.

If the firmware didn't specify an EPP in the valid range, the driver has a
hard-coded EPP of 102. But some Chrome hardware vendors don't want
this overwrite and want to boot with the chipset default EPP of 0x80, as
this improves battery life.

In this case they want to have the capability to specify an EPP of 0x80 via
the firmware. This requires the valid range to include 0x80 as well.
But here the valid range can't be simply extended to include 0x80 as
this is the chipset default EPP. Even without any firmware specifying
EPP, the chipset will always boot with EPP of 0x80.

To make sure that an EPP of 0x80 was specified by the firmware and is not
just the chipset default, an additional check is required to make sure HWP
was enabled by the firmware before boot. The only way the firmware can
update EPP is to enable HWP and update EPP via MSR_HWP_REQUEST.

The driver already checks whether HWP is enabled by the firmware.
Use the same flag and extend the valid range to include 0x80.
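
The check described in this message can be sketched roughly as follows. This is an illustrative Python model, not the driver's C code: the function name is hypothetical, and the exclusive lower bound of 0x00 is an assumption taken from the "more performance oriented but not 0" wording of commit 3d13058e rather than from this message.

```python
CHIPSET_DEFAULT_EPP = 0x80  # chipset power-up EPP, as stated in the message


def firmware_epp_usable(epp, hwp_enabled_by_firmware, lower_bound=0x00):
    """Return True if a power-up EPP can be treated as a firmware-chosen default.

    lower_bound is exclusive and is an assumption made for this sketch.
    """
    if hwp_enabled_by_firmware:
        # Firmware enabled HWP, so even 0x80 must have been written by it.
        return lower_bound < epp <= CHIPSET_DEFAULT_EPP
    # Otherwise 0x80 cannot be told apart from the chipset default.
    return lower_bound < epp < CHIPSET_DEFAULT_EPP


if __name__ == "__main__":
    for epp in (0x00, 0x60, 0x80):
        for forced in (False, True):
            print(hex(epp), forced, firmware_epp_usable(epp, forced))
```

The point of the flag is visible in the 0x80 rows: the value is accepted only when the firmware is known to have enabled HWP.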
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

25 Oct, 2022 2 commits

cpufreq: intel_pstate: hybrid: Use known scaling factor for P-cores · f5c8cf2a

Committed by Rafael J. Wysocki on Oct 24, 2022

Commit 46573fd6 ("cpufreq: intel_pstate: hybrid: Rework HWP
calibration") attempted to use the information from CPPC (the nominal
performance in particular) to obtain the scaling factor allowing the
frequency to be computed if the HWP performance level of the given CPU
is known or vice versa.

However, it turns out that on some platforms this doesn't work, because
the CPPC information on them does not align with the contents of the
MSR_HWP_CAPABILITIES registers.

This basically means that the only way to make intel_pstate work on all
of the hybrid platforms to date is to use the observation that on all
of them the scaling factor between the HWP performance levels and
frequency for P-cores is 78741 (approximately 100000/1.27). For
E-cores it is 100000, which is the same as for all of the non-hybrid
"core" platforms and does not require any changes.

Accordingly, make intel_pstate use 78741 as the scaling factor between
HWP performance levels and frequency for P-cores on all hybrid platforms
and drop the dependency of the HWP calibration code on CPPC.
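
As a rough illustration of what a fixed scaling factor means, the sketch below converts between HWP performance levels and frequency using the two constants quoted above. The constant and function names are hypothetical, and treating the result as kHz and rounding up when going from frequency to performance level are assumptions of this sketch.

```python
HYBRID_P_CORE_SCALING = 78741  # P-cores on hybrid platforms (from the message)
CORE_SCALING = 100000          # E-cores and non-hybrid "core" platforms


def hwp_perf_to_khz(perf_level, is_p_core):
    """Map an HWP performance level to a frequency, assumed to be in kHz."""
    scaling = HYBRID_P_CORE_SCALING if is_p_core else CORE_SCALING
    return perf_level * scaling


def khz_to_hwp_perf(freq_khz, is_p_core):
    """Map a frequency back to the smallest HWP level that reaches it."""
    scaling = HYBRID_P_CORE_SCALING if is_p_core else CORE_SCALING
    return -(-freq_khz // scaling)  # integer division rounding up


if __name__ == "__main__":
    print(hwp_perf_to_khz(50, is_p_core=True))
    print(hwp_perf_to_khz(40, is_p_core=False))
    print(khz_to_hwp_perf(4000000, is_p_core=True))
```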

Fixes: 46573fd6 ("cpufreq: intel_pstate: hybrid: Rework HWP calibration")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.15+ <stable@vger.kernel.org> # 5.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Read all MSRs on the target CPU · 8dbab94d

Committed by Rafael J. Wysocki on Oct 24, 2022

Some of the MSR accesses in intel_pstate are carried out on the CPU
that is running the code, but the values coming from them are used
for the performance scaling of the other CPUs.

This is problematic, for example, on hybrid platforms where
MSR_TURBO_RATIO_LIMIT for P-cores and E-cores is different, so the
values read from it on a P-core are generally not applicable to E-cores
and the other way around.

For this reason, make the driver access all MSRs on the target CPU on
platforms using the "core" pstate_funcs callbacks which is the case for
all of the hybrid platforms released to date.  For this purpose, pass
a CPU argument to the ->get_max(), ->get_max_physical(), ->get_min()
and ->get_turbo() pstate_funcs callbacks and from there pass it to
rdmsrl_on_cpu() or rdmsrl_safe_on_cpu() to access the MSR on the target
CPU.
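
The idea of passing a CPU argument down to the register read can be modelled as below. This is only a toy sketch: the per-CPU register contents are made up, rdmsr_on_cpu() here is a Python stand-in rather than the kernel helper, and taking the low byte of MSR_TURBO_RATIO_LIMIT as the maximum turbo ratio is a simplification.

```python
MSR_TURBO_RATIO_LIMIT = 0x1AD

# Hypothetical register contents: CPU 0 is a P-core, CPU 8 is an E-core.
FAKE_MSRS = {
    0: {MSR_TURBO_RATIO_LIMIT: 0x2E},
    8: {MSR_TURBO_RATIO_LIMIT: 0x22},
}


def rdmsr_on_cpu(cpu, msr):
    """Stand-in for reading an MSR on a specific CPU."""
    return FAKE_MSRS[cpu][msr]


def get_turbo(cpu):
    """Return the turbo ratio of `cpu`, read from that CPU's own register."""
    return rdmsr_on_cpu(cpu, MSR_TURBO_RATIO_LIMIT) & 0xFF


if __name__ == "__main__":
    # Each CPU gets its own value regardless of where this code runs.
    print(get_turbo(0), get_turbo(8))
```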

Fixes: 46573fd6 ("cpufreq: intel_pstate: hybrid: Rework HWP calibration")
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.15+ <stable@vger.kernel.org> # 5.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

11 Sep, 2022 1 commit

cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode · 71bb5c82

Committed by Doug Smythies on Sep 06, 2022

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

Add TIGERLAKE to the list of CPUs that can register intel_pstate while not
advertising the HWP capability. Without this change, a TIGERLAKE in no-HWP
mode could only use the acpi_cpufreq frequency scaling driver.
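
The effect of the explicit model list can be pictured with a small sketch. It is a simplified model, assuming the driver choice depends only on the HWP capability and on list membership; the model strings and function name are hypothetical labels, not the kernel's identifiers.

```python
# Models named in this message as able to register intel_pstate without HWP.
NO_HWP_MODELS = {"SKYLAKE_X", "ICELAKE_X", "COMETLAKE", "TIGERLAKE"}


def scaling_driver_for(model, hwp_advertised):
    """Pick the frequency scaling driver under the simplified rule above."""
    if hwp_advertised or model in NO_HWP_MODELS:
        return "intel_pstate"
    # Not explicitly supported and no HWP: fall back to the ACPI driver.
    return "acpi_cpufreq"


if __name__ == "__main__":
    print(scaling_driver_for("TIGERLAKE", hwp_advertised=False))
    print(scaling_driver_for("SOME_OTHER_MODEL", hwp_advertised=False))
```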

See also commits:
d8de7a44: cpufreq: intel_pstate: Add Skylake servers support
fbdc21e9: cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode
706c5328: cpufreq: intel_pstate: Add Cometlake support in no-HWP mode

Reported-by: M. Cargi Ari <cagriari@pm.me>
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

12 May, 2022 1 commit

cpufreq: intel_pstate: Support Sapphire Rapids OOB mode · bbd67f1b

Committed by Srinivas Pandruvada on May 02, 2022

Prevent intel_pstate from loading when OOB (Out Of Band) P-states mode is
enabled on Sapphire Rapids. The OOB identifying bits are the same as on
prior generation CPUs like Ice Lake servers. So, also add Sapphire
Rapids to the intel_pstate_cpu_oob_ids list.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

13 Apr, 2022 1 commit

cpufreq: intel_pstate: Handle no_turbo in frequency invariance · addca285

Committed by Chen Yu on Apr 08, 2022

Problem statement:

Once the user has disabled turbo frequency by

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

the cfs_rq's util_avg becomes quite small when compared with
CPU capacity.

Steps to reproduce:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# ./x86_cpuload --count 1 --start 3 --timeout 100 --busy 99

would launch 1 thread and bind it to CPU3, lasting for 100 seconds,
with a CPU utilization of 99%. [1]

top result:
%Cpu3  : 98.4 us,  0.0 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

check util_avg:
cat /sys/kernel/debug/sched/debug | grep "cfs_rq\[3\]" -A 20 | grep util_avg
  .util_avg                      : 611

So the util_avg/cpu capacity is 611/1024, which is much smaller than
98.4% shown in the top result.

This might impact some logic in the scheduler. For example,
group_is_overloaded() would compare the group_capacity and group_util
in the sched group, to check if this sched group is overloaded or not.
With this gap, even when there is a nearly 100% workload, the sched
group will not be regarded as overloaded. Besides group_is_overloaded(),
there are also other victims. There is ongoing work that aims to
optimize the task wakeup in a LLC domain. The main idea is to stop
searching idle CPUs if the sched domain is overloaded[2]. This proposal
also relies on the util_avg/CPU capacity to decide whether the LLC
domain is overloaded.

Analysis:

CPU frequency invariance has caused this difference. In summary,
the util_sum of cfs rq would decay quite fast when the CPU is in
idle, when the CPU frequency invariance is enabled.

The details are as follows:

As depicted in update_rq_clock_pelt(), when the frequency invariance
is enabled, there would be two clock variables on each rq, clock_task
and clock_pelt:

   The clock_pelt scales the time to reflect the effective amount of
   computation done during the running delta time but then syncs back to
   clock_task when rq is idle.

   absolute time    | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
   @ max frequency  ------******---------------******---------------
   @ half frequency ------************---------************---------
   clock pelt       | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16

The fast decay of util_sum during idle is due to:

 1. rq->clock_pelt is always behind rq->clock_task
 2. rq->last_update is updated to rq->clock_pelt' after invoking
    ___update_load_sum()
 3. Then the CPU becomes idle, the rq->clock_pelt' would be suddenly
    increased a lot to rq->clock_task
 4. Enters ___update_load_sum() again, the idle period is calculated by
    rq->clock_task - rq->last_update, AKA, rq->clock_task - rq->clock_pelt'.
    The lower the CPU frequency is, the larger the delta =
    rq->clock_task - rq->clock_pelt' will be. Since the idle period will be
    used to decay the util_sum only, the util_sum drops significantly during
    idle period.
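
The timeline in the diagram above can be reproduced with a toy simulation. This is a sketch in Python, not the scheduler code: it assumes one time unit per slot, advances clock_pelt by the frequency ratio while the CPU is busy, and syncs it to clock_task whenever the CPU is idle. The function and variable names are illustrative.

```python
def simulate_clock_pelt(busy_slots, freq_ratio):
    """Return clock_pelt after each time slot.

    busy_slots: one boolean per unit of absolute time (True = CPU running).
    freq_ratio: current frequency divided by the maximum frequency.
    """
    clock_task = 0.0
    clock_pelt = 0.0
    trace = []
    for busy in busy_slots:
        clock_task += 1.0
        if busy:
            clock_pelt += freq_ratio  # scaled progress while running
        else:
            clock_pelt = clock_task   # sync back when the rq is idle
        trace.append(clock_pelt)
    return trace


# "@ half frequency" row: idle 2, busy 4, idle 3, busy 4, idle 3 slots.
HALF_FREQ_TIMELINE = (
    [False] * 2 + [True] * 4 + [False] * 3 + [True] * 4 + [False] * 3
)

if __name__ == "__main__":
    print(simulate_clock_pelt(HALF_FREQ_TIMELINE, 0.5))
```

In this model the jump at the first idle slot is larger than the one unit of wall time that elapsed, which is the extra "idle" period that step 4 describes as decaying util_sum.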

Proposal:

This symptom is not only caused by disabling turbo frequency, but it
would also appear if the user limits the max frequency at runtime.

Because, if the frequency is always lower than the max frequency,
CPU frequency invariance would decay the util_sum quite fast during
idle.

As some end users would disable turbo after boot up, this patch aims to
present this symptom and deal with turbo scenarios for now.

It might be ideal if CPU frequency invariance is aware of the max CPU
frequency (user specified) at runtime in the future.

Link: https://github.com/yu-chen-surf/x86_cpuload.git #1
Link: https://lore.kernel.org/lkml/20220310005228.11737-1-yu.c.chen@intel.com/ #2
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

17 Mar, 2022 1 commit

cpufreq: intel_pstate: Use firmware default EPP · 3d13058e

Committed by Srinivas Pandruvada on Mar 10, 2022

For some specific platforms (e.g. AlderLake) the balance performance
EPP is updated from the hard-coded value in the driver. This acts as
the default and balance_performance EPP. The purpose of this EPP
update is to reach maximum 1 core turbo frequency (when possible) out
of the box.

Although we can achieve the objective by using a hard-coded value in the
driver, there can be other EPP values which are better in terms of power.
But that will be very subjective, depending on the platform and use cases.
It is not practical to have a per-platform default hard coded
in the driver.

If a platform wants to specify default EPP, it can be set in the firmware.
If this EPP is not the chipset default of 0x80 (balance_perf_epp unless
driver changed it) and more performance oriented but not 0, the driver
can use this as the default and balanced_perf EPP. In this case no driver
update is required every time there is some new platform and default EPP.

If the firmware didn't update the EPP from the chipset default then
the hard coded value is used as per existing implementation.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

23 Dec, 2021 1 commit

cpufreq: intel_pstate: Update cpuinfo.max_freq on HWP_CAP changes · dfeeedc1

Committed by Rafael J. Wysocki on Dec 17, 2021

With HWP enabled, when the turbo range of performance levels is
disabled by the platform firmware, the CPU capacity is given by
the "guaranteed performance" field in MSR_HWP_CAPABILITIES which
is generally dynamic. When it changes, the kernel receives an HWP
notification interrupt handled by notify_hwp_interrupt().

When the "guaranteed performance" value changes in the above
configuration, the CPU performance scaling needs to be adjusted so
as to use the new CPU capacity in computations, which means that
the cpuinfo.max_freq value needs to be updated for that CPU.

Accordingly, modify intel_pstate_notify_work() to read
MSR_HWP_CAPABILITIES and update cpuinfo.max_freq to reflect the
new configuration (this update can be carried out even if the
configuration doesn't actually change, because it simply doesn't
matter then and it takes less time to update it than to do extra
checks to decide whether or not a change has really occurred).
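
To make the relationship between MSR_HWP_CAPABILITIES and cpuinfo.max_freq concrete, here is a small sketch. The field layout (highest, guaranteed, most efficient and lowest performance in successive bytes) follows the architectural definition of the register; the function names, the sample register value and the scaling factor of 100000 kHz per level are assumptions of this sketch, not the driver's code.

```python
def decode_hwp_capabilities(value):
    """Split a raw MSR_HWP_CAPABILITIES value into its four 8-bit fields."""
    return {
        "highest": value & 0xFF,
        "guaranteed": (value >> 8) & 0xFF,
        "most_efficient": (value >> 16) & 0xFF,
        "lowest": (value >> 24) & 0xFF,
    }


def cpuinfo_max_freq_khz(hwp_cap, turbo_disabled, scaling_khz=100000):
    """Recompute the maximum frequency from the current capabilities."""
    caps = decode_hwp_capabilities(hwp_cap)
    # With the turbo range disabled, capacity is the guaranteed level.
    perf = caps["guaranteed"] if turbo_disabled else caps["highest"]
    return perf * scaling_khz


if __name__ == "__main__":
    sample = 0x01081E2D  # hypothetical register value
    print(decode_hwp_capabilities(sample))
    print(cpuinfo_max_freq_khz(sample, turbo_disabled=True))
```

Recomputing unconditionally, as the message suggests, simply means calling the second function on every notification without first comparing against the previous value.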
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

17 Dec, 2021 2 commits

cpufreq: intel_pstate: Update EPP for AlderLake mobile · b6e6f8be

Committed by Srinivas Pandruvada on Dec 16, 2021

There is an expectation from users that they can get the frequency specified
by cpufreq/cpuinfo_max_freq when conditions permit. But with AlderLake
mobile it may not be possible. It is possible that the frequency is clipped
based on the system power-up EPP value. In this case users can update
cpufreq/energy_performance_preference to some performance oriented EPP to
limit clipping of frequencies.

To get the same out-of-box behavior as the prior generations of CPUs, update
EPP for AlderLake mobile CPUs on boot. On prior generations of CPUs EPP = 128
was enough to get maximum frequency, but with AlderLake mobile the
equivalent EPP is 102. Since EPP is model specific, it is possible that
EPP values have a different meaning on each generation of CPU.

The current EPP string "balance_performance" corresponds to EPP = 128.
Change the EPP corresponding to "balance_performance" to 102 for only
AlderLake mobile CPUs and update this on each CPU during boot.

To implement this, reuse the epp_values[] array and update the modified EPP
at the index for BALANCE_PERFORMANCE. Add a dummy EPP_INDEX_DEFAULT to
epp_values[] to match indexes in the energy_perf_strings[].

After HWP PM is enabled also update EPP when "balance_performance" is
redefined for the very first time after the boot on each CPU. On
subsequent suspend/resume or offline/online the old EPP is restored,
so no specific action is needed.
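
The index-aligned tables described above might look like the following sketch. Only the values 128 and 102 for "balance_performance" come from this message; the remaining strings and EPP values, the model label and the function names are assumptions used for illustration.

```python
ENERGY_PERF_STRINGS = [
    "default", "performance", "balance_performance", "balance_power", "power",
]

# Indexes shared by both tables; the first entry is the dummy default slot.
EPP_INDEX_DEFAULT = 0
EPP_INDEX_BALANCE_PERFORMANCE = 2


def epp_values_for(model):
    """Return the EPP table, with balance_performance redefined per model."""
    values = [0, 0x00, 128, 0xC0, 0xFF]  # assumed apart from 128
    if model == "ALDERLAKE_MOBILE":      # hypothetical model label
        values[EPP_INDEX_BALANCE_PERFORMANCE] = 102
    return values


def epp_for_string(name, model):
    """Translate a preference string to the EPP value for this model."""
    return epp_values_for(model)[ENERGY_PERF_STRINGS.index(name)]


if __name__ == "__main__":
    print(epp_for_string("balance_performance", "ALDERLAKE_MOBILE"))
    print(epp_for_string("balance_performance", "OTHER"))
```

Keeping a dummy entry at index 0 lets one index address both tables, so no offset arithmetic is needed when a string is looked up.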
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop redundant intel_pstate_get_hwp_cap() call · 458b03f8

Committed by Rafael J. Wysocki on Dec 10, 2021

It is not necessary to call intel_pstate_get_hwp_cap() from
intel_pstate_update_perf_limits(), because it gets called from
intel_pstate_verify_cpu_policy() which is either invoked directly
right before intel_pstate_update_perf_limits(), in
intel_cpufreq_verify_policy() in the passive mode, or called
from driver callbacks in a sequence that causes it to be followed
by an immediate intel_pstate_update_perf_limits().

Namely, in the active mode intel_cpufreq_verify_policy() is called
by intel_pstate_verify_policy() which is the ->verify() callback
routine of intel_pstate and gets called by the cpufreq core right
before intel_pstate_set_policy(), which is the driver's ->setpolicy()
callback routine, where intel_pstate_update_perf_limits() is called.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

23 Nov, 2021 3 commits

cpufreq: intel_pstate: ITMT support for overclocked system · 03c83982

Committed by Srinivas Pandruvada on Nov 18, 2021

On systems with overclocking enabled, CPPC Highest Performance can be
hard coded to 0xff. In this case even if we have cores with different
highest performance, ITMT can't be enabled as the current implementation
depends on CPPC Highest Performance.

On such systems we can use MSR_HWP_CAPABILITIES maximum performance field
when CPPC.Highest Performance is 0xff.

Due to legacy reasons, we can't solely depend on MSR_HWP_CAPABILITIES as
in some older systems CPPC Highest Performance is the only way to identify
different performing cores.
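
The fallback described above can be summarised in a few lines. This is a sketch with hypothetical names: it assumes the "maximum performance field" is the lowest byte of MSR_HWP_CAPABILITIES, and the sample values are made up.

```python
CPPC_OVERCLOCKED_MARKER = 0xFF  # value CPPC Highest Performance is pinned to


def core_priority(cppc_highest_perf, hwp_capabilities):
    """Pick the per-core performance value used to rank cores."""
    if cppc_highest_perf == CPPC_OVERCLOCKED_MARKER:
        # CPPC carries no per-core information here, so use the MSR instead.
        return hwp_capabilities & 0xFF
    # Keep preferring CPPC, which older systems rely on.
    return cppc_highest_perf


if __name__ == "__main__":
    print(core_priority(0xFF, 0x01081E2D))
    print(core_priority(0x30, 0x01081E2D))
```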
Reported-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix active mode offline/online EPP handling · ed38eb49

Committed by Rafael J. Wysocki on Nov 17, 2021

After commit 4adcf2e5 ("cpufreq: intel_pstate: Add ->offline and
->online callbacks") the EPP value set by the "performance" scaling
algorithm in the active mode is not restored after an offline/online
cycle which replaces it with the saved EPP value coming from user
space.

Address this issue by forcing intel_pstate_hwp_set() to set a new
EPP value when it runs for the first time after online.

Fixes: 4adcf2e5 ("cpufreq: intel_pstate: Add ->offline and ->online callbacks")
Link: https://lore.kernel.org/linux-pm/adc7132c8655bd4d1c8b6129578e931a14fe1db2.camel@linux.intel.com/
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs · cd23f02f

Committed by Adamos Ttofari on Nov 12, 2021

Commit fbdc21e9 ("cpufreq: intel_pstate: Add Icelake servers
support in no-HWP mode") enabled the use of Intel P-State driver
for Ice Lake servers.

But it doesn't cover the case when OS can't control P-States.

Therefore, for Ice Lake server, if MSR_MISC_PWR_MGMT bits 8 or 18
are enabled, then the Intel P-State driver should exit as OS can't
control P-States.
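
The bit test described above amounts to the following sketch. The bit positions come from this message; the function name and the sample register values are illustrative assumptions.

```python
MSR_MISC_PWR_MGMT = 0x1AA

# Bits 8 and 18 indicate that the OS cannot control P-states.
OOB_BITS = (1 << 8) | (1 << 18)


def os_can_control_pstates(misc_pwr_mgmt):
    """Return False if either out-of-band bit is set in the register value."""
    return (misc_pwr_mgmt & OOB_BITS) == 0


if __name__ == "__main__":
    for value in (0x0, 1 << 8, 1 << 18, 1 << 3):
        print(hex(value), os_can_control_pstates(value))
```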

Fixes: fbdc21e9 ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

05 Nov, 2021 3 commits

cpufreq: intel_pstate: Clear HWP Status during HWP Interrupt enable · 074d0cdf

Committed by Srinivas Pandruvada on Nov 04, 2021

It is possible that some performance excursions happened before the OS
booted or enabled HWP interrupts. So clear the MSR_HWP_STATUS bits when we
enable the HWP interrupt. This way the next excursion will result in an HWP
interrupt.

The status bits of MSR_HWP_STATUS must be cleared (0) by software so
that a new status condition change will cause the hardware to set the
bit again and issue the notification.

Fixes: 57577c99 ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix unchecked MSR 0x773 access · 55210556

Committed by Srinivas Pandruvada on Nov 03, 2021

It is possible that on some platforms HWP interrupts are disabled. In
that case accessing MSR 0x773 will result in a warning.

So check the X86_FEATURE_HWP_NOTIFY feature before accessing MSR 0x773. The
other places in the code where this MSR is accessed already check this
feature, except for the disable path called during the cpufreq offline and
suspend callbacks.

Fixes: 57577c99 ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Clear HWP desired on suspend/shutdown and offline · dbea75fe

Committed by Rafael J. Wysocki on Nov 03, 2021

Commit a365ab6b ("cpufreq: intel_pstate: Implement the
->adjust_perf() callback") caused intel_pstate to use nonzero HWP
desired values in certain usage scenarios, but it did not prevent
them from being leaked into the configurations in which HWP desired
is expected to be 0.

The failing scenarios are switching the driver from the passive
mode to the active mode and starting a new kernel via kexec() while
intel_pstate is running in the passive mode.

To address this issue, ensure that HWP desired will be cleared on
offline and suspend/shutdown.

Fixes: a365ab6b ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

26 Oct, 2021 1 commit

cpufreq: intel_pstate: Fix cpu->pstate.turbo_freq initialization · c72bcf0a

Committed by Zhang Rui on Oct 26, 2021

Fix a problem in active mode that cpu->pstate.turbo_freq is initialized
only if HWP-to-frequency scaling factor is refined.

In passive mode, this problem is not exposed, because
cpu->pstate.turbo_freq is set again, later in
intel_cpufreq_cpu_init()->intel_pstate_get_hwp_cap().

Fixes: eb3693f0 ("cpufreq: intel_pstate: hybrid: CPU-specific scaling factor")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

05 Oct, 2021 1 commit

cpufreq: intel_pstate: Process HWP Guaranteed change notification · 57577c99

Committed by Srinivas Pandruvada on Sep 28, 2021

It is possible that HWP guaranteed ratio is changed in response to
change in power and thermal limits. For example when Intel Speed Select
performance profile is changed or there is change in TDP, hardware can
send notifications. It is possible that the guaranteed ratio is
increased. This creates an issue when turbo is disabled, as the old
limits set in MSR_HWP_REQUEST are still lower and hardware will clip
to older limits.

Although the scope of IA32_HWP_INTERRUPT is per logical CPU, on some
platforms the interrupt is generated on all CPUs. This is particularly a
problem during initialization, when the driver hasn't allocated
data for other CPUs yet. So this change uses a cpumask of enabled CPUs and
processes interrupts on those CPUs only.

When the cpufreq offline() or suspend() callback is called, the HWP
interrupt is disabled on those CPUs and any pending work item is cancelled.

A spin lock is used to protect data and processing shared with the interrupt
handler. Here the READ_ONCE() and WRITE_ONCE() macros are used to designate
shared data, even though the spin lock acts as an optimization barrier here.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: pablomh@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

14 Sep, 2021 1 commit

cpufreq: intel_pstate: Override parameters if HWP forced by BIOS · d9a7e9df

Committed by Doug Smythies on Sep 12, 2021

If HWP has already been enabled by the BIOS, it may be
necessary to override some kernel command line parameters.
Once it has been enabled, it requires a reset to be disabled.
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

08 Sep, 2021 1 commit

cpufreq: intel_pstate: hybrid: Rework HWP calibration · 46573fd6

Committed by Rafael J. Wysocki on Sep 04, 2021

The current HWP calibration for hybrid processors in intel_pstate is
fragile, because it depends too much on the information provided by
the platform firmware via CPPC which may not be reliable enough.  It
also need not be so complicated.

In order to improve that mechanism and make it more resistant to
platform firmware issues, make it only use the CPPC nominal_perf
values to compute the HWP-to-frequency scaling factors for all
CPUs and possibly use the HWP_CAP highest_perf values to recompute
them if the ones derived from the CPPC nominal_perf values alone
appear to be too high.

Namely, fetch CPC.nominal_perf for all CPUs present in the system,
find the minimum one and use it as a reference for computing all of
the CPUs' scaling factors (using the observation that for the CPUs
having the minimum CPC.nominal_perf the HWP range of available
performance levels should be the same as the range of available
"legacy" P-states and so the HWP-to-frequency scaling factor for
them should be the same as the corresponding scaling factor used
for representing the P-state values in kHz).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>

07 Sep, 2021 1 commit

Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification" · dd7c46d6

Committed by Rafael J. Wysocki on Sep 07, 2021

Revert commit d0e936ad ("cpufreq: intel_pstate: Process HWP
Guaranteed change notification"), because it causes a NULL pointer
dereference to occur on Lenovo X1 gen9 laptops due to an HWP
guaranteed performance change interrupt arriving prematurely.

This feature will be revisited in the next cycle.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

26 Aug, 2021 1 commit

cpufreq: intel_pstate: Process HWP Guaranteed change notification · d0e936ad

Committed by Srinivas Pandruvada on Aug 19, 2021

This change enables the HWP interrupt and processes HWP interrupts. When
the guaranteed performance changes, cpufreq_update_policy() is called so
that the driver callbacks are invoked to update to the new HWP limits. This
is done from delayed work with a 10 ms delay to avoid frequent updates.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

05 Aug, 2021 1 commit

cpufreq: Replace deprecated CPU-hotplug functions · 09681a07

Committed by Sebastian Andrzej Siewior on Aug 03, 2021

The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

01 Jul, 2021 1 commit

cpufreq: intel_pstate: Combine ->stop_cpu() and ->offline() · 49d6feef

Committed by Rafael J. Wysocki on Jun 30, 2021

Combine the ->stop_cpu() and ->offline() callback routines for
intel_pstate in the active mode so as to avoid setting the
->stop_cpu callback pointer which is going to be dropped from
the framework.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

07 Jun, 2021 1 commit

cpufreq: intel_pstate: hybrid: Fix build with CONFIG_ACPI unset · 8df71a7d

Committed by Rafael J. Wysocki on May 26, 2021

One of the previous commits introducing hybrid processor support to
intel_pstate broke build with CONFIG_ACPI unset.

Fix that and while at it make empty stubs of two functions related
to ACPI CPPC static inline and fix a spelling mistake in the name of
one of them.

Fixes: eb3693f0 ("cpufreq: intel_pstate: hybrid: CPU-specific scaling factor")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested

22 May, 2021 4 commits

cpufreq: intel_pstate: Add Cometlake support in no-HWP mode · 706c5328

Committed by Giovanni Gherdovich on May 18, 2021

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

See also commit d8de7a44 ("cpufreq: intel_pstate: Add Skylake servers
support").
Suggested-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode · fbdc21e9

Committed by Giovanni Gherdovich on May 18, 2021

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

Add ICELAKE_X to the list of CPUs that can register intel_pstate while not
advertising the HWP capability. Without this change, an ICELAKE_X in no-HWP
mode could only use the acpi_cpufreq frequency scaling driver.

See also commit d8de7a44 ("cpufreq: intel_pstate: Add Skylake servers
support").
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

fbdc21e9

cpufreq: intel_pstate: hybrid: CPU-specific scaling factor · eb3693f0

Committed by Rafael J. Wysocki on May 12, 2021

The scaling factor between HWP performance levels and CPU frequency
may be different for different types of CPUs in a hybrid processor
and in general the HWP performance levels need not correspond to
"P-states" representing values that would be written to
MSR_IA32_PERF_CTL if HWP was disabled.

However, the policy limits control in cpufreq is defined in terms
of CPU frequency, so it is necessary to map the frequency limits set
through that interface to HWP performance levels with reasonable
accuracy and the behavior of that interface on hybrid processors
has to be compatible with its behavior on non-hybrid ones.

To address this problem, use the observations that (1) on hybrid
processors the sysfs interface can operate by mapping frequency
to "P-states" and translating those "P-states" to specific HWP
performance levels of the given CPU and (2) the scaling factor
between the MSR_IA32_PERF_CTL "P-states" and CPU frequency can be
regarded as a known value.  Moreover, the mapping between the
HWP performance levels and CPU frequency can be assumed to be
linear and such that HWP performance level 0 corresponds to the
frequency value of 0, so it is only necessary to know the
frequency corresponding to one specific HWP performance level
to compute the scaling factor applicable to all of them.

One possibility is to take the nominal performance value from CPPC,
if available, and use cpu_khz as the corresponding frequency.  If
the CPPC capabilities interface is not there or the nominal
performance value provided by it is out of range, though, something
else needs to be done.

Namely, the guaranteed performance level either from CPPC or from
MSR_HWP_CAPABILITIES can be used instead, but the corresponding
frequency needs to be determined.  That can be done by computing the
product of the (known) scaling factor between the MSR_IA32_PERF_CTL
P-states and CPU frequency (the PERF_CTL scaling factor) and the
P-state value referred to as the "TDP ratio".

If the HWP-to-frequency scaling factor value obtained in one of the
ways above turns out to be equal to the PERF_CTL scaling factor, it
can be assumed that the number of HWP performance levels is equal to
the number of P-states and the given CPU can be handled as though
this was not a hybrid processor.

Otherwise, one more adjustment may still need to be made, because the
HWP-to-frequency scaling factor computed so far may not be accurate
enough (e.g. because the CPPC information does not match the exact
behavior of the processor).  Specifically, in that case the frequency
corresponding to the highest HWP performance value from
MSR_HWP_CAPABILITIES (computed as the product of that value and the
HWP-to-frequency scaling factor) cannot exceed the frequency that
corresponds to the maximum 1-core turbo P-state value from
MSR_TURBO_RATIO_LIMIT (computed as the product of that value and the
PERF_CTL scaling factor) and the HWP-to-frequency scaling factor may
need to be adjusted accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

eb3693f0
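
The scaling-factor derivation described in the commit message above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the driver's code: the function names are hypothetical, the PERF_CTL scaling factor is assumed to be 100000 kHz per P-state, and the rounding used by the real driver may differ.

```c
#include <assert.h>

/* Assumed PERF_CTL scaling factor: kHz per MSR_IA32_PERF_CTL P-state. */
#define PERF_CTL_SCALING_KHZ 100000

/*
 * The mapping is assumed to be linear with level 0 at 0 kHz, so one known
 * (frequency, level) pair fixes the factor for all levels.
 */
static int hwp_scaling_from_point(int freq_khz, int perf_level)
{
	assert(perf_level > 0);
	return freq_khz / perf_level;
}

/* Frequency of the guaranteed level: PERF_CTL scaling times the TDP ratio. */
static int guaranteed_freq_khz(int tdp_ratio)
{
	return tdp_ratio * PERF_CTL_SCALING_KHZ;
}

/*
 * Lower the factor if the highest HWP level would otherwise map above
 * the maximum 1-core turbo frequency.
 */
static int hwp_scaling_cap_to_turbo(int scaling, int hwp_highest, int turbo_pstate)
{
	int turbo_khz = turbo_pstate * PERF_CTL_SCALING_KHZ;

	assert(hwp_highest > 0);
	if (hwp_highest * scaling > turbo_khz)
		scaling = turbo_khz / hwp_highest;
	return scaling;
}
```

With made-up numbers, a TDP ratio of 24 and a guaranteed HWP level of 30 give 2400000 / 30 = 80000 kHz per HWP level. If the highest HWP level is 60 and the 1-core turbo P-state is 46, the cap lowers the factor to 4600000 / 60 = 76666 (integer division). A result equal to PERF_CTL_SCALING_KHZ corresponds to the case the message treats as a non-hybrid processor.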

cpufreq: intel_pstate: hybrid: Avoid exposing two global attributes · c3d175e4

Committed by Rafael J. Wysocki on May 12, 2021

The turbo_pct and num_pstates sysfs attributes represent CPU
properties that may be different for different types of CPUs in
a hybrid processor, so avoid exposing them in that case.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

c3d175e4

10 May, 2021 1 commit

cpufreq: intel_pstate: Use HWP if enabled by platform firmware · e5af36b2

Committed by Rafael J. Wysocki on Apr 21, 2021

It turns out that there are systems where HWP is enabled during
initialization by the platform firmware (BIOS), but HWP EPP support
is not advertised.

After commit 7aa10312 ("cpufreq: intel_pstate: Avoid enabling HWP
if EPP is not supported") intel_pstate refuses to use HWP on those
systems, but the fallback PERF_CTL interface does not work on them
either because of enabled HWP, and once enabled, HWP cannot be
disabled.  Consequently, the users of those systems cannot control
CPU performance scaling.

Address this issue by making intel_pstate use HWP unconditionally if
it is enabled already when the driver starts.

Fixes: 7aa10312 ("cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+

e5af36b2

09 Apr, 2021 1 commit

cpufreq: intel_pstate: Simplify intel_pstate_update_perf_limits() · b989bc0f

Committed by Rafael J. Wysocki on Apr 07, 2021

Because pstate.max_freq is always equal to the product of
pstate.max_pstate and pstate.scaling and, analogously,
pstate.turbo_freq is always equal to the product of
pstate.turbo_pstate and pstate.scaling, the result of the
max_policy_perf computation in intel_pstate_update_perf_limits() is
always equal to the quotient of policy_max and pstate.scaling,
regardless of whether or not turbo is disabled.  Analogously, the
result of min_policy_perf in intel_pstate_update_perf_limits() is
always equal to the quotient of policy_min and pstate.scaling.

Accordingly, intel_pstate_update_perf_limits() need not check
whether or not turbo is enabled at all and in order to compute
max_policy_perf and min_policy_perf it can always divide policy_max
and policy_min, respectively, by pstate.scaling.  Make it do so.

While at it, move the definition and initialization of the
turbo_max local variable to the code branch using it.

No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>

b989bc0f
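
The arithmetic behind this simplification can be checked with a small C sketch. It assumes, as the commit message states, that the frequency is always the product of the P-state and the scaling factor. The function names and the exact shape of the older computation are illustrative assumptions, not code taken from the driver.

```c
#include <assert.h>

/*
 * Indirect form: multiply by a P-state, then divide by the frequency that
 * corresponds to that same P-state (pstate * scaling).
 */
static int policy_perf_via_pstate(int pstate, int scaling, int policy_khz)
{
	long long freq_khz = (long long)pstate * scaling;

	assert(freq_khz > 0);
	return (int)((long long)pstate * policy_khz / freq_khz);
}

/* Direct form: the P-state cancels out, leaving policy / scaling. */
static int policy_perf_direct(int scaling, int policy_khz)
{
	assert(scaling > 0);
	return policy_khz / scaling;
}
```

For a policy limit of 2000000 kHz and a scaling factor of 100000, both forms give 20 whether the P-state used is a hypothetical non-turbo maximum of 24 or a turbo maximum of 46, which is why the turbo check is not needed for this computation.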

24 Mar, 2021 1 commit

cpufreq: intel_pstate: Clean up frequency computations · de5bcf40

Committed by Rafael J. Wysocki on Mar 16, 2021

Notice that some computations related to frequency in intel_pstate
can be simplified if (a) intel_pstate_get_hwp_max() updates the
relevant members of struct cpudata by itself and (b) the "turbo
disabled" check is moved from it to its callers, so modify the code
accordingly and while at it rename intel_pstate_get_hwp_max() to
intel_pstate_get_hwp_cap() which better reflects its purpose and
provide a simplified variant of it, __intel_pstate_get_hwp_cap(),
suitable for the initialization path.

No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>

de5bcf40

23 Jan, 2021 1 commit

cpufreq: intel_pstate: Remove repeated word · 75a8d877

Committed by Nigel Christian on Jan 16, 2021

In the comment for trace in passive mode there is an
unnecessary "the". Eradicate it.
Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

75a8d877

13 Jan, 2021 3 commits

cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available · 6f67e060

Committed by Chen Yu on Jan 12, 2021

Currently, when turbo is disabled (either by BIOS or by the user),
the intel_pstate driver reads the max non-turbo frequency from the
package-wide MSR_PLATFORM_INFO(0xce) register.

However, on asymmetric platforms it is possible in theory that small
and big cores with HWP enabled might have different max non-turbo CPU
frequencies, because MSR_HWP_CAPABILITIES has per-CPU scope according
to the Intel Software Developer's Manual.

The turbo max freq is already per-CPU in the current code, so make a
similar change to the max non-turbo frequency as well.
Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Subject and changelog edits ]
Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: a45ee4d4: cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

6f67e060
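
As a rough illustration of the per-CPU approach described above, the sketch below decodes the highest and guaranteed performance levels from an MSR_HWP_CAPABILITIES value (bits 7:0 and 15:8, respectively) and turns one of them into a frequency. The helper max_freq_khz() and the scaling argument are hypothetical; the driver's actual handling of the turbo-disabled case may differ.

```c
#include <assert.h>
#include <stdint.h>

/* MSR_HWP_CAPABILITIES fields: one byte per performance level. */
static unsigned int hwp_highest_perf(uint64_t cap)
{
	return cap & 0xff;		/* bits 7:0 */
}

static unsigned int hwp_guaranteed_perf(uint64_t cap)
{
	return (cap >> 8) & 0xff;	/* bits 15:8 */
}

/*
 * Hypothetical helper: choose the per-CPU level that bounds the frequency.
 * With turbo disabled the guaranteed level is used, otherwise the highest.
 */
static unsigned int max_freq_khz(uint64_t cap, int turbo_disabled,
				 unsigned int scaling_khz)
{
	unsigned int perf = turbo_disabled ? hwp_guaranteed_perf(cap)
					   : hwp_highest_perf(cap);

	assert(scaling_khz > 0);
	return perf * scaling_khz;
}
```

For a made-up register value of 0x01081C2E (highest level 46, guaranteed level 28) and a scaling factor of 100000 kHz, this gives 2800000 kHz with turbo disabled and 4600000 kHz otherwise. Because the value is read per CPU, two cores of different types can report different limits.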

cpufreq: intel_pstate: Rename two functions · 597ffbc8

Committed by Rafael J. Wysocki on Jan 07, 2021

Rename intel_cpufreq_adjust_hwp() and intel_cpufreq_adjust_perf_ctl()
to intel_cpufreq_hwp_update() and intel_cpufreq_perf_ctl_update(),
respectively, to avoid possible confusion with the ->adjust_perf()
callback function, intel_cpufreq_adjust_perf().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>

597ffbc8

cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument · a45ee4d4

Committed by Rafael J. Wysocki on Jan 07, 2021

All of the callers of intel_pstate_get_hwp_max() access the struct
cpudata object that corresponds to the given CPU already and the
function itself needs to access that object (in order to update
hwp_cap_cached), so modify the code to pass a struct cpudata pointer
to it instead of the CPU number.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>

a45ee4d4

openeuler / Kernel synced successfully 12 months ago
