Commits · 33068b61f8c041e67b554ddcb44c25ca748d38f6 · openanolis / cloud-kernel

20 Mar, 2016 1 commit

intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts · fdfdb2b1

Authored by Rafael J. Wysocki on Mar 18, 2016

After commit a4675fbc (cpufreq: intel_pstate: Replace timers with
utilization update callbacks) wrmsrl_on_cpu() cannot be called in the
intel_pstate_adjust_busy_pstate() path as that is executed with
disabled interrupts.  However, atom_set_pstate() called from there
via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the
IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in
smp_call_function_single().

The reason why wrmsrl_on_cpu() is used by atom_set_pstate() is
because intel_pstate_set_pstate() calling it is also invoked during
the initialization and cleanup of the driver and in those cases it is
not guaranteed to be run on the CPU that is being updated.  However,
in the case when intel_pstate_set_pstate() is called by
intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update
the register safely.  Moreover, intel_pstate_set_pstate() already
contains code that only is executed if the function is called by
intel_pstate_adjust_busy_pstate() and there is a special argument
passed to it because of that.

To fix the problem at hand, rearrange the code taking the above
observations into account.

First, replace the ->set() callback in struct pstate_funcs with a
->get_val() one that will return the value to be written to the
IA32_PERF_CTL MSR without updating the register.

Second, split intel_pstate_set_pstate() into two functions,
intel_pstate_update_pstate() to be called by
intel_pstate_adjust_busy_pstate() that will contain all of the
intel_pstate_set_pstate() code which only needs to be executed in
that case and will use wrmsrl() to update the MSR (after obtaining
the value to write to it from the ->get_val() callback), and
intel_pstate_set_min_pstate() to be invoked during the
initialization and cleanup that will set the P-state to the
minimum one and will update the MSR using wrmsrl_on_cpu().

Finally, move the code shared between intel_pstate_update_pstate()
and intel_pstate_set_min_pstate() to a new static inline function
intel_pstate_record_pstate() and make them both call it.

Of course, that unifies the handling of the IA32_PERF_CTL MSR writes
between Atom and Core.
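
The split described above can be modelled in user space roughly as follows. This is only a sketch: the `model_*` helpers, the `fake_perf_ctl` array, `current_cpu` and the reduced `struct cpudata` are hypothetical stand-ins invented for illustration and are not the driver's code. The `pstate << 8` encoding is borrowed from the Coverity-related commit further down this page.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4 /* hypothetical CPU count for this model */

static uint64_t fake_perf_ctl[NR_CPUS]; /* stands in for IA32_PERF_CTL, one per CPU */
static int current_cpu;                 /* CPU the caller is "running" on */

/* Model of wrmsrl(): only touches the register of the local CPU. */
static void model_wrmsrl(uint64_t val)
{
	fake_perf_ctl[current_cpu] = val;
}

/* Model of wrmsrl_on_cpu(): may target any CPU (a cross-CPU call in the kernel). */
static void model_wrmsrl_on_cpu(int cpu, uint64_t val)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	fake_perf_ctl[cpu] = val;
}

/* Reduced, hypothetical version of the per-CPU driver data. */
struct cpudata {
	int cpu;
	int min_pstate;
	int current_pstate;
};

/* ->get_val() analogue: compute the register value without writing it. */
static uint64_t model_get_val(const struct cpudata *c, int pstate)
{
	(void)c;
	return (uint64_t)pstate << 8;
}

/* Shared bookkeeping, analogous to intel_pstate_record_pstate(). */
static void model_record_pstate(struct cpudata *c, int pstate)
{
	c->current_pstate = pstate;
}

/* Sampling path: runs on the target CPU, so a local write is enough. */
static void model_update_pstate(struct cpudata *c, int pstate)
{
	model_record_pstate(c, pstate);
	model_wrmsrl(model_get_val(c, pstate));
}

/* Init/cleanup path: may run on another CPU, so use the cross-CPU write. */
static void model_set_min_pstate(struct cpudata *c)
{
	model_record_pstate(c, c->min_pstate);
	model_wrmsrl_on_cpu(c->cpu, model_get_val(c, c->min_pstate));
}
```

In this model only `model_set_min_pstate()` needs the cross-CPU write, which mirrors why the interrupt-disabled sampling path can avoid it.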

Fixes: a4675fbc (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

11 Mar, 2016 5 commits

intel_pstate: Do not skip samples partially · 4fec7ad5

Authored by Rafael J. Wysocki on Mar 10, 2016

If the current value of MPERF or the current value of TSC is the
same as the previous one, respectively, intel_pstate_sample() bails
out early and skips the sample.

However, intel_pstate_adjust_busy_pstate() is still called in that
case which is not correct, so modify intel_pstate_sample() to
return a bool value indicating whether or not the sample has been
taken and use it to decide whether or not to call
intel_pstate_adjust_busy_pstate().

While at it, remove redundant parentheses from the MPERF/TSC
check in intel_pstate_sample().
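
A minimal user-space sketch of the control flow described above, assuming a simplified per-CPU state. `struct sample_state`, `model_sample()` and `model_update()` are hypothetical names used only for illustration, and the counter stands in for the call to intel_pstate_adjust_busy_pstate().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's per-CPU sample state. */
struct sample_state {
	uint64_t prev_mperf;
	uint64_t prev_tsc;
	unsigned int adjustments; /* counts how often the adjust step ran */
};

/* Returns true only if a new sample was actually taken. */
static bool model_sample(struct sample_state *s, uint64_t mperf, uint64_t tsc)
{
	/* Either counter unchanged: skip the sample entirely. */
	if (s->prev_mperf == mperf || s->prev_tsc == tsc)
		return false;

	s->prev_mperf = mperf;
	s->prev_tsc = tsc;
	return true;
}

/* Caller: run the adjust step only when the sample is valid. */
static void model_update(struct sample_state *s, uint64_t mperf, uint64_t tsc)
{
	if (model_sample(s, mperf, tsc))
		s->adjustments++; /* stands in for intel_pstate_adjust_busy_pstate() */
}
```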
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

intel_pstate: Remove freq calculation from intel_pstate_calc_busy() · 8fa520af

Authored by Philippe Longepe on Mar 06, 2016

Use a helper function to compute the average pstate and call it only
where it is needed (only when tracing or in intel_pstate_get).
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance() · 7349ec04

Authored by Philippe Longepe on Mar 06, 2016

The cpu_load algorithm doesn't need to invoke intel_pstate_calc_busy(),
so move that call from intel_pstate_sample() to
get_target_pstate_use_performance().
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Optimize calculation for max/min_perf_adj · a158bed5

Authored by Philippe Longepe on Mar 06, 2016

mul_fp(int_tofp(A), B) expands to:
((A << FRAC_BITS) * B) >> FRAC_BITS, so the same result can be obtained
via simple multiplication A * B.  Apply this observation to
max_perf * limits->max_perf and max_perf * limits->min_perf in
intel_pstate_get_min_max().
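
The identity above can be checked with a small stand-alone sketch. It assumes FRAC_BITS is 8 and re-implements int_tofp() and mul_fp() as plain 64-bit helpers for non-negative inputs. `scaled_long()` and `scaled_short()` are hypothetical names for the two forms being compared.

```c
#include <stdint.h>

#define FRAC_BITS 8 /* assumed fixed-point precision for this sketch */

/* Convert a non-negative integer to fixed point. */
static int64_t int_tofp(int64_t x)
{
	return x << FRAC_BITS;
}

/* Multiply two fixed-point values and rescale the product. */
static int64_t mul_fp(int64_t x, int64_t y)
{
	return (x * y) >> FRAC_BITS;
}

/* Long form: ((A << FRAC_BITS) * B) >> FRAC_BITS. */
static int64_t scaled_long(int64_t a, int64_t b_fp)
{
	return mul_fp(int_tofp(a), b_fp);
}

/* Short form: plain multiplication A * B. */
static int64_t scaled_short(int64_t a, int64_t b_fp)
{
	return a * b_fp;
}
```

Because the shifted-in low bits of `a << FRAC_BITS` are zero, the final right shift discards nothing, so both forms agree as long as the intermediate product does not overflow.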
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Remove extra conversions in pid calculation · b54a0dfd

Authored by Philippe Longepe on Mar 08, 2016

pid->setpoint and pid->deadband can be initialized in fixed point, so we
can avoid the int_tofp in pid_calc.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

09 Mar, 2016 2 commits

cpufreq: Reduce cpufreq_update_util() overhead a bit · 08f511fd

Authored by Rafael J. Wysocki on Mar 04, 2016

Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete.  Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.

In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.

Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks
somewhat less heavy-weight.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

cpufreq: intel_pstate: Replace timers with utilization update callbacks · a4675fbc

Authored by Rafael J. Wysocki on Feb 05, 2016

Instead of using a per-CPU deferrable timer for utilization sampling
and P-states adjustments, register a utilization update callback that
will be invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the deferrable
timers, so the functional impact of this patch should not be significant.

Based on an earlier patch from Srinivas Pandruvada.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

27 Feb, 2016 2 commits

cpufreq: intel_pstate: disable HWP notifications · f05c9665

Authored by Srinivas Pandruvada on Feb 25, 2016

Disable HWP interrupt notifications before enabling HWP. Since there is
no HWP interrupt handling for possible performance interrupts, there
is little point in enabling HWP interrupts.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Enable HWP by default · 7791e4aa

Authored by Srinivas Pandruvada on Feb 25, 2016

If the processor supports HWP, enable it by default without checking
for the cpu model. This allows HWP to be enabled on all supported
processors without driver changes.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

23 Feb, 2016 1 commit

intel_pstate: Update frequencies of policy->cpus only from ->set_policy() · 41cfd64c

Authored by Viresh Kumar on Feb 22, 2016

The intel-pstate driver is using intel_pstate_hwp_set() from two
separate paths, i.e. ->set_policy() callback and sysfs update path for
the files present in /sys/devices/system/cpu/intel_pstate/ directory.

While an update to the sysfs path applies to all the CPUs being managed
by the driver (which essentially means all the online CPUs), the update
via the ->set_policy() callback applies to a smaller group of CPUs
managed by the policy for which ->set_policy() is called.

And so, intel_pstate_hwp_set() should update frequencies of only the
CPUs that are part of policy->cpus mask, while it is called from
->set_policy() callback.

In order to do that, add a parameter (cpumask) to intel_pstate_hwp_set()
and apply the frequency changes only to the concerned CPUs.

For ->set_policy() path, we are only concerned about policy->cpus, and
so policy->rwsem lock taken by the core prior to calling ->set_policy()
is enough to take care of any races. The larger lock acquired by
get_online_cpus() is required only for the updates to sysfs files.

Add another routine, intel_pstate_hwp_set_online_cpus(), and call it
from the sysfs update paths.

This also fixes a lockdep reported recently, where policy->rwsem and
get_online_cpus() could have been acquired in any order causing an ABBA
deadlock. The sequence of events leading to that was:

intel_pstate_init(...)
	...cpufreq_online(...)
		down_write(&policy->rwsem); // Locks policy->rwsem
		...
		cpufreq_init_policy(policy);
			...intel_pstate_hwp_set();
				get_online_cpus(); // Temporarily locks cpu_hotplug.lock
		...
		up_write(&policy->rwsem);

pm_suspend(...)
	...disable_nonboot_cpus()
		_cpu_down()
			cpu_hotplug_begin(); // Locks cpu_hotplug.lock
			__cpu_notify(CPU_DOWN_PREPARE, ...);
				...cpufreq_offline_prepare();
					down_write(&policy->rwsem); // Locks policy->rwsem
Reported-and-tested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

30 Jan, 2016 1 commit

x86/cpufeature: Replace the old static_cpu_has() with safe variant · bc696ca0

Authored by Borislav Petkov on Jan 26, 2016

So the old one didn't work properly before alternatives had run.
And it was supposed to provide an optimized JMP because the
assumption was that the offset it is jumping to is within a
signed byte and thus a two-byte JMP.

So I did an x86_64 allyesconfig build and dumped all possible
sites where static_cpu_has() was used. The optimization amounted
to all in all 12(!) places where static_cpu_has() had generated
a 2-byte JMP. Which has saved us a whopping 36 bytes!

This clearly is not worth the trouble so we can remove it. The
only place where the optimization might count - in __switch_to()
- we will handle differently. But that's not subject of this
patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>

12 Dec, 2015 1 commit

cpufreq: intel_pstate: Minor cleanup for FRAC_BITS · 88b7b7c0

Authored by Prarit Bhargava on Dec 08, 2015

785ee278 ("cpufreq: intel_pstate: Fix limits->max_perf rounding error")
hardcodes the value of FRAC_BITS. This patch fixes that minor issue.

Fixes: 785ee278 (cpufreq: intel_pstate: Fix limits->max_perf rounding error)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

10 Dec, 2015 3 commits

cpufreq: intel_pstate: Account for IO wait time · 63d1d656

Authored by Philippe Longepe on Dec 04, 2015

In cases where we have many IOs, the global load becomes low and the
load algorithm will decrease the requested P-State. Because of that,
the IO overhead will increase and hurt IO performance.

To improve IO bound work, we can count the io-wait time as busy time
in calculating CPU busy.

This change uses get_cpu_iowait_time_us() to obtain the IO wait time value
and converts time into number of cycles spent waiting on IO at the TSC
rate. At the moment, this trick is only used for Atom.
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Account for non C0 time · e70eed2b

Authored by Philippe Longepe on Dec 04, 2015

The current function to calculate cpu utilization uses the average P-state
ratio (APerf/Mperf) scaled by the ratio of the current P-state to the
max available non-turbo one. This leads to an overestimation of
utilization which causes higher-performance P-states to be selected more
often and that leads to increased energy consumption.

This is a problem for low-power systems, so it is better to use a
different utilization calculation algorithm for them.

Namely, the Percent Busy value (or load) can be estimated as the ratio of the
MPERF counter that runs at a constant rate only during active periods (C0) to
the time stamp counter (TSC) that also runs (at the same rate) during idle.
That is:

Percent Busy = 100 * (delta_mperf / delta_tsc)

Use this algorithm for platforms with SoCs based on the Airmont and Silvermont
Atom cores.
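
The formula above can be written as a small integer helper. This is a sketch rather than the driver's code: `percent_busy()` is a hypothetical name, and the guard against a zero TSC delta is an assumption added so the example is safe to call.

```c
#include <stdint.h>

/*
 * Percent Busy = 100 * (delta_mperf / delta_tsc), computed in integer
 * arithmetic by multiplying before dividing.
 */
static uint64_t percent_busy(uint64_t delta_mperf, uint64_t delta_tsc)
{
	/* Assumption for this sketch: report 0 if no time has elapsed. */
	if (delta_tsc == 0)
		return 0;

	return (100 * delta_mperf) / delta_tsc;
}
```

For instance, 250 MPERF ticks over 1000 TSC ticks gives 25, meaning the CPU spent a quarter of the interval in C0.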
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Configurable algorithm to get target pstate · 157386b6

Authored by Philippe Longepe on Dec 04, 2015

Target systems using different cpus have different power and performance
requirements. They may use different algorithms to get the next P-state
based on their power or performance preference.

For example, power-constrained systems may not want to use
high-performance P-states as aggressively as a full-size desktop or a
server platform. A server platform may want to run close to the max to
achieve better performance, while laptop-like systems may prefer
sacrificing performance for longer battery life.

For the above reasons, modify intel_pstate to allow the target P-state
selection algorithm to depend on the CPU ID.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

26 Nov, 2015 1 commit

intel_pstate: Fix "performance" mode behavior with HWP enabled · 584ee3dc

Authored by Alexandra Yates on Nov 18, 2015

If hardware-driven P-state selection (HWP) is enabled, the
"performance" mode of intel_pstate should only allow the processor
to use the highest-performance P-state available.  That is not
the case currently, so make it actually happen.
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Alexandra Yates <alexandra.yates@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

24 Nov, 2015 2 commits

cpufreq: intel_pstate: Fix limits->max_perf rounding error · 785ee278

Authored by Prarit Bhargava on Nov 20, 2015

A rounding error was found in the calculation of limits->max_perf
in intel_pstate_set_policy(), which is used to calculate the max and min
pstate values in intel_pstate_get_min_max().  In that code,
limits->max_perf is truncated to 2 hex digits such that, for example,
0x169 was incorrectly calculated to 0x16 instead of 0x17.  This resulted in
the pstate being set one level too low.  This patch rounds the value of
limits->max_perf up instead of down so that the correct max pstate can
be reached.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix limits->max_policy_pct rounding error · 8478f539

Authored by Prarit Bhargava on Nov 20, 2015

I have an Intel (6,63) processor with a "marketing" frequency (from
/proc/cpuinfo) of 2100MHz, and a max turbo frequency of 2600MHz.  I
can execute

cpupower frequency-set -g powersave --min 1200MHz --max 2100MHz

and the max_freq_pct is set to 80.  When adding load to the system I noticed
that the cpu frequency only reached 2000MHz and not 2100MHz as expected.

This is because limits->max_policy_pct is calculated as 2100 * 100 /2600 = 80.7
and is rounded down to 80 when it should be rounded up to 81.  This patch
adds a DIV_ROUND_UP() which will return the correct value.
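
The arithmetic above can be reproduced with a short sketch. `DIV_ROUND_UP` is defined locally here in its usual form. `pct_truncated()` and `pct_rounded_up()` are hypothetical helper names, not functions from the driver.

```c
/* Integer division that rounds up instead of truncating. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Old behavior: the fractional part of the percentage is dropped. */
static unsigned int pct_truncated(unsigned int max_mhz, unsigned int turbo_mhz)
{
	return max_mhz * 100 / turbo_mhz;
}

/* Fixed behavior: round the percentage up to the next integer. */
static unsigned int pct_rounded_up(unsigned int max_mhz, unsigned int turbo_mhz)
{
	return DIV_ROUND_UP(max_mhz * 100, turbo_mhz);
}
```

With the values from the report, `pct_truncated(2100, 2600)` yields 80 while `pct_rounded_up(2100, 2600)` yields 81.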
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

19 Nov, 2015 4 commits

cpufreq: intel_pstate: Add separate support for Airmont cores · 1421df63

Authored by Philippe Longepe on Nov 09, 2015

There are two flavors of Atom cores to be supported by intel_pstate,
Silvermont and Airmont, so make the driver distinguish between them by
adding separate frequency tables.

Separate the CPU defaults params for each of them and match the CPU IDs
against them as appropriate.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Replace BYT with ATOM · 938d21a2

Authored by Philippe Longepe on Nov 09, 2015

Rename symbol and function names starting with "BYT" or "byt" to
start with "ATOM" or "atom", respectively, so as to make it clear
that they may apply to Atom in general and not just to Baytrail
(the goal is to support several Atom architectures eventually).

This should not lead to any functional changes.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Revert "cpufreq: intel_pstate: Use ACPI perf configuration" · 6ee11e41

Authored by Rafael J. Wysocki on Nov 19, 2015

Revert commit 37afb000 (cpufreq: intel_pstate: Use ACPI perf
configuration) that is reported to cause a regression to happen
on a system where invalid data are returned by the ACPI _PSS object.

Since that commit makes assumptions regarding the _PSS output
correctness that may turn out to be overly optimistic in general,
there is a concern that it may introduce regression on more
systems, so it's better to revert it now and we'll revisit the
underlying issue in the next cycle with a more robust solution.

Conflicts:
        drivers/cpufreq/intel_pstate.c

Fixes: 37afb000 (cpufreq: intel_pstate: Use ACPI perf configuration)
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Revert "cpufreq: intel_pstate: Avoid calculation for max/min" · 799281a3

Authored by Rafael J. Wysocki on Nov 18, 2015

Revert commit 4ef45148 (cpufreq: intel_pstate: Avoid calculation for
max/min) as it depends on commit 37afb000 (cpufreq: intel_pstate: Use
ACPI perf configuration) that causes problems to happen and needs to be
reverted.

Conflicts:
	drivers/cpufreq/intel_pstate.c
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

02 Nov, 2015 1 commit

intel_pstate: decrease number of "HWP enabled" messages · 539342f6

Authored by Prarit Bhargava on Oct 22, 2015

When booting an HWP enabled system the kernel displays one "HWP enabled"
message for each cpu.  The messages are superfluous since HWP is globally
enabled across all CPUs. This patch also adds an informational message
when HWP is disabled via intel_pstate=no_hwp.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

17 Oct, 2015 1 commit

cpufreq: intel_pstate: Fix intel_pstate powersave min_perf_pct value · 51443fbf

Authored by Prarit Bhargava on Oct 15, 2015

Systems that initialize the intel_pstate driver with the performance
governor and then switch to the powersave governor will not transition to
lower cpu frequencies until /sys/devices/system/cpu/intel_pstate/min_perf_pct
is set to a low value.

The behavior of governor switching changed after commit a0475992
("[cpufreq] intel_pstate: honor user space min_perf_pct override on
 resume").  The commit introduced tracking of performance percentage
changes via sysfs in order to restore userspace changes during
suspend/resume.  The problem occurs because the global values of the newly
introduced max_sysfs_pct and min_sysfs_pct are not lowered on the governor
change and this causes the powersave governor to inherit the performance
governor's settings.

A simple change would have been to reset max_sysfs_pct to 100 and
min_sysfs_pct to 0 on a governor change, which fixes the problem with
governor switching.  However, since we cannot break userspace[1] the fix
is now to give each governor its own limits storage area so that governor
specific changes are tracked.

I successfully tested this by booting with both the performance governor
and the powersave governor by default, and switching between the two
governors (while monitoring /sys/devices/system/cpu/intel_pstate/ values,
and looking at the output of cpupower frequency-info).  Suspend/Resume
testing was performed by Doug Smythies.

[1] Systems which suspend/resume using the unmaintained pm-utils package
will always transition to the performance governor before the suspend and
after the resume.  This means a system using the powersave governor will
go from powersave to performance, then suspend/resume, performance to
powersave.  The simple change during governor changes would have been
overwritten when the governor changed before and after the suspend/resume.
I have submitted https://bugzilla.redhat.com/show_bug.cgi?id=1271225
against Fedora to remove the 94cpufreq file that causes the problem.  It
should be noted that pm-utils is obsoleted with newer versions of systemd.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

16 Oct, 2015 1 commit

cpufreq: intel_pstate: Fix divide by zero on Knights Landing (KNL) · 8e601a9f

Authored by Srinivas Pandruvada on Oct 15, 2015

This is a workaround for the KNL platform, where in some cases the MPERF
counter will not have an updated value before the next read of
MSR_IA32_MPERF. In this case a divide by zero will occur. This change
ignores the current sample for busy calculation in this case.

Fixes: b34ef932 (intel_pstate: Knights Landing support)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: 4.1+ <stable@vger.kernel.org> # 4.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

15 Oct, 2015 4 commits

cpufreq: intel_pstate: Avoid calculation for max/min · 4ef45148

Authored by Srinivas Pandruvada on Oct 14, 2015

When requested from cpufreq to set policy, look into _PSS and get
control values, instead of using max/min perf calculations. These
calculations miss the next control state in boundary conditions.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Use ACPI perf configuration · 37afb000

Authored by Srinivas Pandruvada on Oct 14, 2015

Use ACPI _PSS to limit the Intel P State turbo, max and min ratios.
This driver uses acpi processor perf lib calls to register performance.
The following logic is used to adjust Intel P state driver limits:
- If there is no turbo entry in _PSS, then disable Intel P state turbo
  and limit to non turbo max
- If the non turbo max ratio is more than _PSS max non turbo value, then
  set the max non turbo ratio to _PSS non turbo max
- If the min ratio is less than _PSS min then change the min ratio
  matching _PSS min
- Scale the _PSS turbo frequency to max turbo frequency based on control
  value.
This feature can be disabled by using the kernel parameter:
intel_pstate=no_acpi
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel-pstate: Use separate max pstate for scaling · 3bcc6fa9

Authored by Srinivas Pandruvada on Oct 14, 2015

Systems with configurable TDP have multiple max non turbo P states. Intel
P state uses the max non turbo P state for scaling. But using the real max
non turbo P state causes underestimation of the next P state. So keep using
the physical max non turbo P state for scaling, as before.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: get P1 from TAR when available · 6a35fc2d

Authored by Srinivas Pandruvada on Oct 14, 2015

After Ivybridge, the max non turbo ratio obtained from the platform info MSR
is not always the guaranteed P1 on client platforms. The max non turbo
activation ratio (TAR) determines the max for the current level of TDP.
The ratio in platform info is the physical max. The TAR MSR can be locked,
so updating this value is not possible on all platforms.
This change gets this ratio from MSR TURBO_ACTIVATION_RATIO if available,
but also does some sanity checking to make sure that this value is correct.
The sanity check involves reading the TDP ratio for the current TDP
control value when the platform has configurable TDP present and matching
TAR with this.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

10 Sep, 2015 2 commits

intel_pstate: fix PCT_TO_HWP macro · 74da56ce

Authored by Kristen Carlson Accardi on Sep 09, 2015

PCT_TO_HWP does not take the actual range of pstates exported
by HWP_CAPABILITIES into account, and is broken on most platforms.
Remove the macro and set the min and max pstate for hwp by
determining the range and adjusting by the min and max percent
limits values.
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Fix user input of min/max to legal policy region · 43717aad

Authored by Chen Yu on Sep 09, 2015

In current code, max_perf_pct might be smaller than min_perf_pct
by improper user input:

$ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:100

$ echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

$ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct
/sys/devices/system/cpu/intel_pstate/max_perf_pct:80
/sys/devices/system/cpu/intel_pstate/min_perf_pct:100

Fix this problem in two steps:
 1. Normalize the user input to [min_policy, max_policy].
 2. Make sure max_perf_pct>=min_perf_pct, suggested by Seiichi Ikarashi.
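
The two steps can be sketched in user space as follows. `struct perf_limits`, `clamp_int()` and `model_store_max_perf_pct()` are hypothetical names for illustration. Resolving a conflict by raising max_perf_pct to min_perf_pct is an assumption of this sketch, not something the commit message spells out.

```c
/* Hypothetical, reduced view of the limits tracked by the driver. */
struct perf_limits {
	int min_policy_pct;
	int max_policy_pct;
	int min_perf_pct;
	int max_perf_pct;
};

/* Clamp v into the closed interval [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* Model of a write to max_perf_pct via sysfs. */
static void model_store_max_perf_pct(struct perf_limits *l, int input)
{
	/* Step 1: normalize the user input to [min_policy, max_policy]. */
	l->max_perf_pct = clamp_int(input, l->min_policy_pct, l->max_policy_pct);

	/* Step 2: never let max_perf_pct drop below min_perf_pct. */
	if (l->max_perf_pct < l->min_perf_pct)
		l->max_perf_pct = l->min_perf_pct;
}
```

In the scenario shown above (min_perf_pct at 100, user writes 80), this model leaves max_perf_pct at 100 instead of producing an inverted range.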
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

07 Aug, 2015 2 commits

intel_pstate: append more Oracle OEM table id to vendor bypass list · 5aecc3c8

Authored by Ethan Zhao on Aug 05, 2015

Append more Oracle X86 servers that have their own power management:

SUN FIRE X4275 M3
SUN FIRE X4170 M3
and
SUN FIRE X6-2
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

intel_pstate: Add SKY-S support · 1c939123

Authored by Kristen Carlson Accardi on Aug 05, 2015

Whitelist the SKL-S processor
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

01 Aug, 2015 1 commit

intel_pstate: Fix possible overflow complained by Coverity · 144c8e17

Authored by Chen Yu on Jul 29, 2015

Coverity scanning performed on intel_pstate.c shows a possible
overflow when doing a left shift:
val = pstate << 8;
Since pstate is of type integer while val is a u64, left shifting
pstate might lead to loss of the upper bits. Say, if pstate equals
0x40000000, after pstate << 8 we will get zero assigned to val.
Although pstate is unlikely to be that big, this patch casts the left
operand to u64 before performing the left shift, to silence the
Coverity complaint.
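
The effect of the cast can be illustrated with a short sketch. It uses uint32_t instead of a signed int so that the loss of the upper bits is well defined in C. `shift_narrow()` and `shift_wide()` are hypothetical names for the two variants.

```c
#include <stdint.h>

/* Shift performed in 32 bits: bits moved past bit 31 are lost. */
static uint64_t shift_narrow(uint32_t pstate)
{
	uint32_t narrow = pstate << 8; /* truncated to 32 bits here */

	return narrow;
}

/* Widen first, then shift: the upper bits survive in the 64-bit result. */
static uint64_t shift_wide(uint32_t pstate)
{
	return (uint64_t)pstate << 8;
}
```

For the value from the commit message, `shift_narrow(0x40000000)` returns 0 while `shift_wide(0x40000000)` returns 0x4000000000. For small P-state values both variants agree.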
Reported-by: Coquard, Christophe <christophe.coquard@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

27 Jul, 2015 1 commit

intel_pstate: Add get_scaling cpu_defaults param to Knights Landing · 69cefc27

Authored by Lukasz Anaczkowski on Jul 21, 2015

Scaling for Knights Landing is the same as the default scaling (100000).
When Knights Landing support was added to the pstate driver, this
parameter was omitted, resulting in a kernel panic during boot.

Fixes: b34ef932 (intel_pstate: Knights Landing support)
Reported-by: Yasuaki Ishimatsu <yishimat@redhat.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: 4.1+ <stable@vger.kernel.org> # 4.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

17 Jul, 2015 1 commit

intel_pstate: enable HWP per CPU · ba88d433

Authored by Kristen Carlson Accardi on Jul 14, 2015

HWP previously was only enabled at driver load time, on the boot
CPU. However, HWP must be enabled per package. Move the code to
enable HWP to the cpufreq driver init path so that it will be
called per CPU.
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Tested-by: David Zhuang <david.zhuang@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

06 Jul, 2015 1 commit

x86/asm/tsc: Rename native_read_tsc() to rdtsc() · 4ea1636b

Authored by Andy Lutomirski on Jun 25, 2015

Now that there is no paravirt TSC, the "native" is
inappropriate. The function does RDTSC, so give it the obvious
name: rdtsc().
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/fd43e16281991f096c1e4d21574d9e1402c62d39.1434501121.git.luto@kernel.org
[ Ported it to v4.2-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

4ea1636b

17 Jun, 2015 1 commit

intel_pstate: Fix overflow in busy_scaled due to long delay · 7180dddf

Authored by Prarit Bhargava on Jun 15, 2015

The kernel may delay interrupts for a long time which can result in timers
being delayed. If this occurs the intel_pstate driver will crash with a
divide by zero error:

divide error: 0000 [#1] SMP
Modules linked in: btrfs zlib_deflate raid6_pq xor msdos ext4 mbcache jbd2 binfmt_misc arc4 md4 nls_utf8 cifs dns_resolver tcp_lp bnep bluetooth rfkill fuse dm_service_time iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ftp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables intel_powerclamp coretemp vfat fat kvm_intel iTCO_wdt iTCO_vendor_support ipmi_devintf sr_mod kvm crct10dif_pclmul
 crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel cdc_ether lrw usbnet cdrom mii gf128mul glue_helper ablk_helper cryptd lpc_ich mfd_core pcspkr sb_edac edac_core ipmi_si ipmi_msghandler ioatdma wmi shpchp acpi_pad nfsd auth_rpcgss nfs_acl lockd uinput dm_multipath sunrpc xfs libcrc32c usb_storage sd_mod crc_t10dif crct10dif_common ixgbe mgag200 syscopyarea sysfillrect sysimgblt mdio drm_kms_helper ttm igb drm ptp pps_core dca i2c_algo_bit megaraid_sas i2c_core dm_mirror dm_region_hash dm_log dm_mod
CPU: 113 PID: 0 Comm: swapper/113 Tainted: G        W   --------------   3.10.0-229.1.2.el7.x86_64 #1
Hardware name: IBM x3950 X6 -[3837AC2]-/00FN827, BIOS -[A8E112BUS-1.00]- 08/27/2014
task: ffff880fe8abe660 ti: ffff880fe8ae4000 task.ti: ffff880fe8ae4000
RIP: 0010:[<ffffffff814a9279>]  [<ffffffff814a9279>] intel_pstate_timer_func+0x179/0x3d0
RSP: 0018:ffff883fff4e3db8  EFLAGS: 00010206
RAX: 0000000027100000 RBX: ffff883fe6965100 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000010 RDI: 000000002e53632d
RBP: ffff883fff4e3e20 R08: 000e6f69a5a125c0 R09: ffff883fe84ec001
R10: 0000000000000002 R11: 0000000000000005 R12: 00000000000049f5
R13: 0000000000271000 R14: 00000000000049f5 R15: 0000000000000246
FS:  0000000000000000(0000) GS:ffff883fff4e0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7668601000 CR3: 000000000190a000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff883fff4e3e58 ffffffff81099dc1 0000000000000086 0000000000000071
 ffff883fff4f3680 0000000000000071 fbdc8a965e33afee ffffffff810b69dd
 ffff883fe84ec000 ffff883fe6965108 0000000000000100 ffffffff814a9100
Call Trace:
 <IRQ>

 [<ffffffff81099dc1>] ? run_posix_cpu_timers+0x51/0x840
 [<ffffffff810b69dd>] ? trigger_load_balance+0x5d/0x200
 [<ffffffff814a9100>] ? pid_param_set+0x130/0x130
 [<ffffffff8107df56>] call_timer_fn+0x36/0x110
 [<ffffffff814a9100>] ? pid_param_set+0x130/0x130
 [<ffffffff8107fdcf>] run_timer_softirq+0x21f/0x320
 [<ffffffff81077b2f>] __do_softirq+0xef/0x280
 [<ffffffff816156dc>] call_softirq+0x1c/0x30
 [<ffffffff81015d95>] do_softirq+0x65/0xa0
 [<ffffffff81077ec5>] irq_exit+0x115/0x120
 [<ffffffff81616355>] smp_apic_timer_interrupt+0x45/0x60
 [<ffffffff81614a1d>] apic_timer_interrupt+0x6d/0x80
 <EOI>

 [<ffffffff814a9c32>] ? cpuidle_enter_state+0x52/0xc0
 [<ffffffff814a9c28>] ? cpuidle_enter_state+0x48/0xc0
 [<ffffffff814a9d65>] cpuidle_idle_call+0xc5/0x200
 [<ffffffff8101d14e>] arch_cpu_idle+0xe/0x30
 [<ffffffff810c67c1>] cpu_startup_entry+0xf1/0x290
 [<ffffffff8104228a>] start_secondary+0x1ba/0x230
Code: 42 0f 00 45 89 e6 48 01 c2 43 8d 44 6d 00 39 d0 73 26 49 c1 e5 08 89 d2 4d 63 f4 49 63 c5 48 c1 e2 08 48 c1 e0 08 48 63 ca 48 99 <48> f7 f9 48 98 4c 0f af f0 49 c1 ee 08 8b 43 78 c1 e0 08 44 29
RIP  [<ffffffff814a9279>] intel_pstate_timer_func+0x179/0x3d0
 RSP <ffff883fff4e3db8>

The kernel values for cpudata for CPU 113 were:

struct cpudata {
  cpu = 113,
  timer = {
    entry = {
      next = 0x0,
      prev = 0xdead000000200200
    },
    expires = 8357799745,
    base = 0xffff883fe84ec001,
    function = 0xffffffff814a9100 <intel_pstate_timer_func>,
    data = 18446612406765768960,
<snip>
    i_gain = 0,
    d_gain = 0,
    deadband = 0,
    last_err = 22489
  },
  last_sample_time = {
    tv64 = 4063132438017305
  },
  prev_aperf = 287326796397463,
  prev_mperf = 251427432090198,
  sample = {
    core_pct_busy = 23081,
    aperf = 2937407,
    mperf = 3257884,
    freq = 2524484,
    time = {
      tv64 = 4063149215234118
    }
  }
}

which results in the time between samples = sample.time - last_sample_time
= 4063149215234118 - 4063132438017305 = 16777216813 ns, which is 16.777 seconds.

The duration between reads of the APERF and MPERF registers overflowed an
s32-sized integer in intel_pstate_get_scaled_busy()'s call to div_fp().  The
result is that int_tofp(duration_us) == 0, and the kernel attempts to divide by 0.

While the kernel shouldn't be delaying for a long time, it can and does
happen and the intel_pstate driver should not panic in this situation.  This
patch changes the div_fp() function to use div64_s64() to allow for "long"
division.  This will avoid the overflow condition on long delays.
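The overflow described above can be reproduced with a small userspace sketch. It is not the driver's code: the helper names are made up for illustration, and the 8 fractional bits (FRAC_BITS) are an assumption about the fixed-point format that this message does not state. It only shows why a 16.777 second gap, once converted to fixed point and squeezed through a 32-bit signed value, becomes 0, and why 64-bit operands avoid that.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8   /* assumed number of fractional bits */

/* Convert an integer to fixed point, keeping the full 64-bit result. */
int64_t to_fp(int64_t x)
{
    return x * (1LL << FRAC_BITS);
}

/* Model passing a 64-bit value through a 32-bit signed parameter:
 * only the low 32 bits survive. */
int32_t truncate_to_s32(int64_t v)
{
    return (int32_t)(uint32_t)v;
}

/* Fixed-point divide with 64-bit operands, standing in for a
 * div64_s64()-based div_fp(). A zero divisor is still a caller bug. */
int64_t div_fp_wide(int64_t x, int64_t y)
{
    assert(y != 0);
    return (x * (1LL << FRAC_BITS)) / y;
}

/* Microseconds between the two timestamps quoted above (given in ns). */
int64_t sample_gap_us(void)
{
    return (4063149215234118LL - 4063132438017305LL) / 1000;
}
```

With these inputs, sample_gap_us() is 16777216 (2^24), to_fp() of that value is 2^32, and truncate_to_s32() of 2^32 is 0, which is the zero divisor seen in the trace. Kept in 64 bits, the same value is a valid, non-zero divisor for div_fp_wide().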

[v2]: use div64_s64() in div_fp()
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

7180dddf

10 Jun, 2015 1 commit

intel_pstate: Force setting target pstate when required · 6c1e4591

Authored by Doug Smythies on Jun 01, 2015

During initialization and exit it is possible that the target pstate
might not actually be set. Furthermore, the result can be that the
driver and the processor are out of synch and, under some conditions,
the driver might never send the processor the proper target pstate.

This patch adds a bypass or do_checks flag to the call to
intel_pstate_set_pstate(). With bypass, the clamp checks and the
"do not send if unchanged since last time" check are both skipped.
With do_checks, the current policy clamp checks are applied as before,
and nothing is sent if the new target is the same as the old one.
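The flag behaviour described above can be sketched in userspace as follows. This is a simplified model, not the driver's function: struct cpu_state, its fields, and set_pstate() are hypothetical names, and the hardware write is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the driver's per-CPU state. */
struct cpu_state {
    int min_pstate;      /* lowest P-state allowed by the current policy */
    int max_pstate;      /* highest P-state allowed by the current policy */
    int current_pstate;  /* last value the driver believes it sent */
    int writes;          /* number of simulated hardware writes */
};

/* Returns true if a (simulated) hardware write happened. */
bool set_pstate(struct cpu_state *cpu, int pstate, bool do_checks)
{
    assert(cpu);  /* sketch only: a null pointer is a caller bug */

    if (do_checks) {
        /* Clamp the request to the policy limits. */
        if (pstate < cpu->min_pstate)
            pstate = cpu->min_pstate;
        if (pstate > cpu->max_pstate)
            pstate = cpu->max_pstate;
        /* Skip the write if nothing would change. */
        if (pstate == cpu->current_pstate)
            return false;
    }

    /* Bypass path, or a checked request that differs: always send. */
    cpu->current_pstate = pstate;
    cpu->writes++;
    return true;
}
```

In this model, a call with do_checks set to false always performs the write, even when the requested value equals current_pstate or lies outside the policy limits. That is the property the initialization and exit paths rely on to get the driver and the processor back in step.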
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Marien Zwart <marien.zwart@gmail.com>
Reported-by: Alex Lochmann <alexander.lochmann@tu-dortmund.de>
Reported-by: Piotr Kołaczkowski <pkolaczk@gmail.com>
Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
Tested-by: Marien Zwart <marien.zwart@gmail.com>
Tested-by: Doug Smythies <dsmythies@telus.net>
[ rjw: Dropped pointless symbol definitions, rebased ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

6c1e4591
