1. 28 Apr, 2016 3 commits
    • S
      cpufreq: intel_pstate: Enable PPC enforcement for servers · 2b3ec765
      Srinivas Pandruvada committed
      For platforms which are controlled via a remote node manager, enable _PPC by
      default. These platforms are mostly categorized as enterprise servers or
      performance servers. These platforms need to go through some
      certification tests, which test control via _PPC.
      The relative risk of enabling by default is low, as it is less likely
      that these systems have a broken _PSS table.
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • S
      cpufreq: intel_pstate: Adjust policy->max · 3be9200d
      Srinivas Pandruvada committed
      When policy->max is changed via _PPC or sysfs and is more than the max non
      turbo frequency, it does not really change the resulting performance in some
      processors. When policy->max results in a P-State ratio more than the
      turbo activation ratio, the processor can choose any P-State up to max
      turbo. So the user or _PPC setting has no value, but this can cause
      undesirable side effects like:
      - Showing reduced max percentage in Intel P-State sysfs
      - It can cause reduced max performance under certain boundary conditions:
      The requested max scaling frequency, either via _PPC or via cpufreq-sysfs,
      will be converted into a fixed-point max percent scale. In the
      majority of cases this will result in the correct max, but not 100% of the
      time. If the _PPC is requested at a point where the calculation leads to a
      lower max, this can result in a lower P-State than expected and it will
      impact performance.
      Example of this condition using a Broadwell laptop with config TDP.
      
      ACPI _PSS table from a Broadwell laptop
      2301000 2300000 2200000 2000000 1900000 1800000 1700000 1500000 1400000
      1300000 1100000 1000000 900000 800000 600000 500000
      
      The actual results with config TDP disabled, so that we can get what is
      requested at or below 2300000KHz:
      
      scaling_max_freq   Max Requested P-State (hex)   Resultant scaling max
      ------------------------------------------------------------------------
      2400000            18                            2900000 (max turbo)
      2300000            17                            2300000 (max physical non turbo)
      2200000            15                            2100000
      2100000            15                            2100000
      2000000            13                            1900000
      1900000            13                            1900000
      1800000            12                            1800000
      1700000            11                            1700000
      1600000            10                            1600000
      1500000            f                             1500000
      1400000            e                             1400000
      1300000            d                             1300000
      1200000            c                             1200000
      1100000            a                             1000000
      1000000            a                             1000000
      900000             9                              900000
      800000             8                              800000
      700000             7                              700000
      600000             6                              600000
      500000             5                              500000
      ------------------------------------------------------------------------
      
      Now set the config TDP level 1 ratio as 0x0b (equivalent to 1100000KHz)
      in BIOS (not every system will let you adjust this).
      The turbo activation ratio will be set to one less than that, which will
      be 0x0a (So any request above 1000000KHz should result in turbo region
      assuming no thermal limits).
      Here _PPC will request a max of 1100000KHz (which should still result in
      turbo, up to the max allowable turbo frequency, as this is more than the
      turbo activation ratio), but the actual calculation resulted in a max
      ceiling P-State of 0x0a. So under any load condition, this driver
      will not request turbo P-States. This will be a huge performance hit.
      
      When config TDP feature is ON, if the _PPC points to a frequency above
      turbo activation ratio, the performance can still reach max turbo. In this
      case we don't need to treat this as the reduced frequency in set_policy
      callback.
      
      In this change when config TDP is active (by checking if the physical max
      non turbo ratio is more than the current max non turbo ratio), any request
      above current max non turbo is treated as full performance.
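
The rounding loss described in this commit message can be reproduced with a small user-space model. This is only a sketch, not the driver code: it assumes a fixed-point format with 8 fractional bits, a percentage rounded up against a 2900000 kHz maximum, and a max turbo ratio of 29 (read off the 2900000 row of the table). The names `div_round_up` and `max_pstate` are hypothetical.

```python
FRAC_BITS = 8  # assumption: 8 fractional bits in the fixed-point format


def div_round_up(a, b):
    """Integer division that rounds up."""
    return (a + b - 1) // b


def max_pstate(policy_max_khz, cpuinfo_max_khz=2_900_000, turbo_ratio=29):
    """Model the ceiling P-state derived from a requested max frequency."""
    # Requested max as a whole percentage of the max turbo frequency, rounded up.
    pct = div_round_up(policy_max_khz * 100, cpuinfo_max_khz)
    # pct / 100 as a fixed-point fraction (truncating division).
    max_perf_fp = ((pct << FRAC_BITS) << FRAC_BITS) // (100 << FRAC_BITS)
    # Scale the turbo ratio by the fraction and drop the fractional bits.
    return (turbo_ratio * max_perf_fp) >> FRAC_BITS


for khz in (2_300_000, 2_200_000, 1_100_000):
    print(khz, hex(max_pstate(khz)))
```

Under these assumptions a request for 1100000 kHz yields 0xa, one step below the 0x0b ratio that frequency corresponds to, which is the situation the commit message describes.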
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      [ rjw : Minor cleanups ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • S
      cpufreq: intel_pstate: Enforce _PPC limits · 9522a2ff
      Srinivas Pandruvada committed
      Use ACPI _PPC notifications to limit the max P-state the driver will request.
      An ACPI _PPC change notification is sent by the BIOS to limit the max P-state
      in several cases:
      - Reduce the impact of a platform thermal condition
      - When the Config TDP feature is used, a changed _PPC is sent to
      follow the TDP change
      - Remote node managers in servers want to control platform power
      via the baseboard management controller (BMC)
      
      This change registers with the ACPI processor performance lib so that
      _PPC changes are notified to the cpufreq core, which in turn will
      result in a call to the .setpolicy() callback. Also, the way the _PSS
      table identifies a turbo frequency is not compatible with the max turbo
      frequency in intel_pstate, so the very first entry in _PSS needs
      to be adjusted.
      
      This feature can be turned on by using kernel parameters:
      intel_pstate=support_acpi_ppc
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      [ rjw: Minor cleanups ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  2. 25 Apr, 2016 1 commit
    • P
      cpufreq: intel_pstate: Use average P-State instead of current P-State · bdcaa23f
      Philippe Longepe committed
      The result returned by pid_calc() is subtracted from current_pstate
      (which is the P-State requested during the last period) in order to
      obtain the target P-State for the current iteration.
      
      However, current_pstate may not reflect the real current P-State of
      the CPU. In particular, that P-State may be higher because of the
      frequency sharing per module.
      
      The theory is:
       - The load is the percentage of time spent in C0 and is related to
         the average P-State during the same period.
       - The last requested P-State can be completely different than the
         average P-State (because of frequency sharing or throttling).
       - The P-State shift computed by the pid_calc is based on the load
         computed at average P-State, so the shift must be relative to
         this average P-State.
      
      Using the average P-State instead of the current P-State improves power
      without a significant performance penalty in cases when a task migrates
      from one core to another core sharing frequency and voltage.
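
The difference between the two base values can be illustrated with a small model. It is a sketch under stated assumptions, not the driver code: the average P-state is assumed to be estimated as the maximum P-state scaled by the APERF/MPERF delta ratio, the controller output is a made-up stand-in for the pid_calc() result, and both function names are hypothetical.

```python
def average_pstate(max_pstate, aperf_delta, mperf_delta):
    """Estimate the average P-state over a sample from APERF/MPERF deltas."""
    return max_pstate * aperf_delta // mperf_delta


def target_pstate(base_pstate, pid_shift, min_pstate, max_pstate):
    """Subtract the controller output from the base and clamp to the valid range."""
    return max(min_pstate, min(max_pstate, base_pstate - pid_shift))


# Hypothetical sample: P-state 10 was requested last time, but the core ran
# at P-state 12 on average because it shares its frequency with another core.
last_requested = 10
avg = average_pstate(max_pstate=16, aperf_delta=1_500, mperf_delta=2_000)
shift = 3  # stand-in for the value returned by pid_calc()

old_target = target_pstate(last_requested, shift, min_pstate=4, max_pstate=16)
new_target = target_pstate(avg, shift, min_pstate=4, max_pstate=16)
print(avg, old_target, new_target)
```

The same shift gives different targets depending on the base it is applied to, which is why the base has to match the P-state at which the load was actually measured.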
      
      Performance and power comparison with this patch on Cherry Trail
      platform using Android:
      
      Benchmark               ΔPerf    ΔPower
      FishTank                10.45%    3.1%
      SmartBench-Gaming       -0.1%   -10.4%
      SmartBench-Productivity -0.8%   -10.4%
      CandyCrush                n/a   -17.4%
      AngryBirds                n/a    -5.9%
      videoPlayback             n/a   -13.9%
      audioPlayback             n/a    -4.9%
      IcyRocks-20-50           0.0%   -38.4%
      iozone RR               -0.16%  -1.3%
      iozone RW                0.74%  -1.3%
      Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  3. 10 Apr, 2016 1 commit
    • R
      intel_pstate: Avoid getting stuck in high P-states when idle · ffb81056
      Rafael J. Wysocki committed
      Jörg Otte reports that commit a4675fbc (cpufreq: intel_pstate:
      Replace timers with utilization update callbacks) caused the CPUs in
      his Haswell-based system to stay in the very high frequency region
      even if the system is completely idle.
      
      That turns out to be an existing problem in the intel_pstate driver's
      P-state selection algorithm for Core processors.  Namely, all
      decisions made by that algorithm are based on the average frequency
      of the CPU between sampling events and on the P-state requested on
      the last invocation, so it may get stuck at a very high frequency
      even if the utilization of the CPU is very low (in fact, it may get
      stuck in an inadequate P-state regardless of the CPU utilization).
      The only way to kick it out of that limbo is a sufficiently long idle
      period (3 times longer than the prescribed sampling interval), but if
      that doesn't happen often enough (e.g. due to a timing change like
      after the above commit), the P-state of the CPU may be inadequate
      pretty much all the time.
      
      To address the most egregious manifestations of that issue, reset the
      core_busy value used to determine the next P-state to request if the
      utilization of the CPU, determined with the help of the MPERF
      feedback register and the TSC, is below 1%.
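
A minimal sketch of the check described above, assuming utilization is taken as the ratio of the MPERF delta to the TSC delta over one sample. The function name and the sample numbers are hypothetical; only the 1% threshold comes from the commit message.

```python
def busy_for_next_pstate(core_busy, mperf_delta, tsc_delta):
    """Return core_busy, or 0 when utilization over the sample is below 1%."""
    # MPERF only advances while the CPU is in C0, the TSC always advances,
    # so their ratio is the fraction of the sample the CPU was not idle.
    if mperf_delta * 100 < tsc_delta:
        return 0
    return core_busy


print(busy_for_next_pstate(95, mperf_delta=5, tsc_delta=1_000))    # nearly idle
print(busy_for_next_pstate(95, mperf_delta=500, tsc_delta=1_000))  # half busy
```

Comparing `mperf_delta * 100` with `tsc_delta` keeps the test in integer arithmetic, which avoids a division in a path that runs on every sample.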
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=115771
      Reported-and-tested-by: Jörg Otte <jrg.otte@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  4. 09 Apr, 2016 2 commits
  5. 05 Apr, 2016 2 commits
  6. 02 Apr, 2016 2 commits
    • R
      cpufreq: sched: Helpers to add and remove update_util hooks · 0bed612b
      Rafael J. Wysocki committed
      Replace the single helper for adding and removing cpufreq utilization
      update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
      cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
      and modify the users of cpufreq_set_update_util_data() accordingly.
      
      With the new helpers, the code using them doesn't need to worry
      about the internals of struct update_util_data and in particular
      it doesn't need to worry about populating the func field in it
      properly upfront.
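
The shape of the new helper pair can be modelled in a few lines. This is a user-space sketch only: the dictionary stands in for the per-CPU pointer, the class stands in for struct update_util_data, and the argument checks are an assumption rather than a description of the kernel implementation.

```python
class UpdateUtilData:
    """Stand-in for struct update_util_data; callers never set func themselves."""

    def __init__(self):
        self.func = None


_hooks = {}  # models the per-CPU pointer, keyed by CPU number


def add_update_util_hook(cpu, data, func):
    """Install a hook for one CPU; the helper populates data.func."""
    if data is None or func is None:
        raise ValueError("data and func are both required")
    data.func = func
    _hooks[cpu] = data


def remove_update_util_hook(cpu):
    """Clear the hook for one CPU."""
    _hooks.pop(cpu, None)


def update_util(cpu, time):
    """Invoke the hook for a CPU if one is installed."""
    data = _hooks.get(cpu)
    if data is None:
        return None
    return data.func(data, time)


data = UpdateUtilData()
add_update_util_hook(0, data, lambda d, t: t * 2)
print(update_util(0, 21))
remove_update_util_hook(0)
print(update_util(0, 21))
```

Because the callback is passed to the add helper, a caller cannot install a record whose func field was left unset.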
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • R
      intel_pstate: Avoid extra invocation of intel_pstate_sample() · febce40f
      Rafael J. Wysocki committed
      The initialization of intel_pstate for a given CPU involves populating
      the fields of its struct cpudata that represent the previous sample,
      but currently that is done in a problematic way.
      
      Namely, intel_pstate_init_cpu() makes an extra call to
      intel_pstate_sample() so it reads the current register values that
      will be used to populate the "previous sample" record during the
      next invocation of intel_pstate_sample().  However, after commit
      a4675fbc (cpufreq: intel_pstate: Replace timers with utilization
      update callbacks) that doesn't work for last_sample_time, because
      the time value is passed to intel_pstate_sample() as an argument now.
      Passing 0 to it from intel_pstate_init_cpu() is problematic, because
      that causes cpu->last_sample_time == 0 to be visible in
      get_target_pstate_use_performance() (and hence the extra
      cpu->last_sample_time > 0 check in there) and effectively allows
      the first invocation of intel_pstate_sample() from
      intel_pstate_update_util() to happen immediately after the
      initialization which may lead to a significant "turn on"
      effect in the governor algorithm.
      
      To mitigate that issue, rework the initialization to avoid the
      extra intel_pstate_sample() call from intel_pstate_init_cpu().
      Instead, make intel_pstate_sample() return false if it has been
      called with cpu->sample.time equal to zero, which will make
      intel_pstate_update_util() skip the sample in that case, and
      reset cpu->sample.time from intel_pstate_set_update_util_hook()
      to make the algorithm start properly every time the hook is set.
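
The reworked start-up behaviour can be sketched as follows. This is a simplified model, not the driver code: a single counter stands in for the sampled registers, and the class and function names are hypothetical.

```python
class CpuData:
    """Minimal per-CPU state: the previous sample's time and counter value."""

    def __init__(self):
        self.sample_time = 0  # 0 means "no previous sample yet"
        self.prev_counter = 0
        self.delta = 0


def take_sample(cpu, now, counter):
    """Record a sample; return False if there was no previous sample to diff against."""
    have_previous = cpu.sample_time != 0
    cpu.delta = counter - cpu.prev_counter if have_previous else 0
    cpu.sample_time = now
    cpu.prev_counter = counter
    return have_previous


def set_update_util_hook(cpu):
    """Restart the algorithm: the next sample only seeds the 'previous' record."""
    cpu.sample_time = 0


cpu = CpuData()
set_update_util_hook(cpu)
print(take_sample(cpu, now=100, counter=50))  # first call only seeds the record
print(take_sample(cpu, now=200, counter=80), cpu.delta)
```

The first call after the hook is set returns False, so the caller can skip the P-state adjustment until a real delta is available.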
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  7. 31 Mar, 2016 1 commit
    • R
      intel_pstate: Do not set utilization update hook too early · bb6ab52f
      Rafael J. Wysocki committed
      The utilization update hook in the intel_pstate driver is set too
      early, as it only should be set after the policy has been fully
      initialized by the core.  That may cause intel_pstate_update_util()
      to use incorrect data and put the CPUs into incorrect P-states as
      a result.
      
      To prevent that from happening, make intel_pstate_set_policy() set
      the utilization update hook instead of intel_pstate_init_cpu() so
      intel_pstate_update_util() only runs when all things have been
      initialized as appropriate.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  8. 20 Mar, 2016 1 commit
    • R
      intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts · fdfdb2b1
      Rafael J. Wysocki committed
      After commit a4675fbc (cpufreq: intel_pstate: Replace timers with
      utilization update callbacks) wrmsrl_on_cpu() cannot be called in the
      intel_pstate_adjust_busy_pstate() path as that is executed with
      disabled interrupts.  However, atom_set_pstate() called from there
      via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the
      IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in
      smp_call_function_single().
      
      The reason why wrmsrl_on_cpu() is used by atom_set_pstate() is
      because intel_pstate_set_pstate() calling it is also invoked during
      the initialization and cleanup of the driver and in those cases it is
      not guaranteed to be run on the CPU that is being updated.  However,
      in the case when intel_pstate_set_pstate() is called by
      intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update
      the register safely.  Moreover, intel_pstate_set_pstate() already
      contains code that only is executed if the function is called by
      intel_pstate_adjust_busy_pstate() and there is a special argument
      passed to it because of that.
      
      To fix the problem at hand, rearrange the code taking the above
      observations into account.
      
      First, replace the ->set() callback in struct pstate_funcs with a
      ->get_val() one that will return the value to be written to the
      IA32_PERF_CTL MSR without updating the register.
      
      Second, split intel_pstate_set_pstate() into two functions,
      intel_pstate_update_pstate() to be called by
      intel_pstate_adjust_busy_pstate() that will contain all of the
      intel_pstate_set_pstate() code which only needs to be executed in
      that case and will use wrmsrl() to update the MSR (after obtaining
      the value to write to it from the ->get_val() callback), and
      intel_pstate_set_min_pstate() to be invoked during the
      initialization and cleanup that will set the P-state to the
      minimum one and will update the MSR using wrmsrl_on_cpu().
      
      Finally, move the code shared between intel_pstate_update_pstate()
      and intel_pstate_set_min_pstate() to a new static inline function
      intel_pstate_record_pstate() and make them both call it.
      
      Of course, that unifies the handling of the IA32_PERF_CTL MSR writes
      between Atom and Core.
      
      Fixes: a4675fbc (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
      Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  9. 11 Mar, 2016 5 commits
  10. 09 Mar, 2016 2 commits
    • R
      cpufreq: Reduce cpufreq_update_util() overhead a bit · 08f511fd
      Rafael J. Wysocki committed
      Use the observation that cpufreq_update_util() is only called
      by the scheduler with rq->lock held, so the callers of
      cpufreq_set_update_util_data() can use synchronize_sched()
      instead of synchronize_rcu() to wait for cpufreq_update_util()
      to complete.  Moreover, if they are updated to do that,
      rcu_read_(un)lock() calls in cpufreq_update_util() might be
      replaced with rcu_read_(un)lock_sched(), respectively, but
      those aren't really necessary, because the scheduler calls
      that function from RCU-sched read-side critical sections
      already.
      
      In addition to that, if cpufreq_set_update_util_data() checks
      the func field in the struct update_util_data before setting
      the per-CPU pointer to it, the data->func check may be dropped
      from cpufreq_update_util() as well.
      
      Make the above changes to reduce the overhead from
      cpufreq_update_util() in the scheduler paths invoking it
      and to make the cleanup after removing its callbacks
      somewhat less heavy-weight.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • R
      cpufreq: intel_pstate: Replace timers with utilization update callbacks · a4675fbc
      Rafael J. Wysocki committed
      Instead of using a per-CPU deferrable timer for utilization sampling
      and P-states adjustments, register a utilization update callback that
      will be invoked from the scheduler on utilization changes.
      
      The sampling rate is still the same as what was used for the deferrable
      timers, so the functional impact of this patch should not be significant.
      
      Based on an earlier patch from Srinivas Pandruvada.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
  11. 27 Feb, 2016 2 commits
  12. 23 Feb, 2016 1 commit
    • V
      intel_pstate: Update frequencies of policy->cpus only from ->set_policy() · 41cfd64c
      Viresh Kumar committed
      The intel-pstate driver is using intel_pstate_hwp_set() from two
      separate paths, i.e. ->set_policy() callback and sysfs update path for
      the files present in /sys/devices/system/cpu/intel_pstate/ directory.
      
      While an update to the sysfs path applies to all the CPUs being managed
      by the driver (which essentially means all the online CPUs), the update
      via the ->set_policy() callback applies to a smaller group of CPUs
      managed by the policy for which ->set_policy() is called.
      
      And so, intel_pstate_hwp_set() should update frequencies of only the
      CPUs that are part of policy->cpus mask, while it is called from
      ->set_policy() callback.
      
      In order to do that, add a parameter (cpumask) to intel_pstate_hwp_set()
      and apply the frequency changes only to the concerned CPUs.
      
      For ->set_policy() path, we are only concerned about policy->cpus, and
      so policy->rwsem lock taken by the core prior to calling ->set_policy()
      is enough to take care of any races. The larger lock acquired by
      get_online_cpus() is required only for the updates to sysfs files.
      
      Add another routine, intel_pstate_hwp_set_online_cpus(), and call it
      from the sysfs update paths.
      
      This also fixes a lockdep warning reported recently, where policy->rwsem and
      get_online_cpus() could have been acquired in either order, causing an ABBA
      deadlock. The sequence of events leading to that was:
      
      intel_pstate_init(...)
      	...cpufreq_online(...)
      		down_write(&policy->rwsem); // Locks policy->rwsem
      		...
      		cpufreq_init_policy(policy);
      			...intel_pstate_hwp_set();
      				get_online_cpus(); // Temporarily locks cpu_hotplug.lock
      		...
      		up_write(&policy->rwsem);
      
      pm_suspend(...)
      	...disable_nonboot_cpus()
      		_cpu_down()
      			cpu_hotplug_begin(); // Locks cpu_hotplug.lock
      			__cpu_notify(CPU_DOWN_PREPARE, ...);
      				...cpufreq_offline_prepare();
      					down_write(&policy->rwsem); // Locks policy->rwsem
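
The inversion in the two traces above can be spotted mechanically by recording the order in which each path takes its locks. The checker below is a toy illustration in the spirit of lock-order validation, not how lockdep is implemented; the function name is hypothetical and the lock names are copied from the traces.

```python
def find_inversions(paths):
    """Return pairs of locks that different paths acquire in opposite orders."""
    order = set()
    for held_sequence in paths:
        for i, first in enumerate(held_sequence):
            for second in held_sequence[i + 1:]:
                order.add((first, second))
    return sorted((a, b) for a, b in order if a < b and (b, a) in order)


before_fix = [
    ["policy->rwsem", "cpu_hotplug.lock"],  # cpufreq_online() path
    ["cpu_hotplug.lock", "policy->rwsem"],  # _cpu_down() path
]
after_fix = [
    ["policy->rwsem"],                      # ->set_policy() no longer takes the hotplug lock
    ["cpu_hotplug.lock", "policy->rwsem"],
]

print(find_inversions(before_fix))
print(find_inversions(after_fix))
```

With the hotplug lock dropped from the ->set_policy() path, only one ordering of the two locks remains and the checker reports nothing.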
      Reported-and-tested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  13. 30 Jan, 2016 1 commit
    • B
      x86/cpufeature: Replace the old static_cpu_has() with safe variant · bc696ca0
      Borislav Petkov committed
      So the old one didn't work properly before alternatives had run.
      And it was supposed to provide an optimized JMP because the
      assumption was that the offset it is jumping to is within a
      signed byte and thus a two-byte JMP.
      
      So I did an x86_64 allyesconfig build and dumped all possible
      sites where static_cpu_has() was used. The optimization amounted
      to all in all 12(!) places where static_cpu_has() had generated
      a 2-byte JMP. Which has saved us a whopping 36 bytes!
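
The 36-byte figure follows from simple arithmetic, assuming the alternative to the 2-byte short JMP is the 5-byte near JMP with a 32-bit displacement. That encoding assumption is not stated in the commit message; the site count is.

```python
SHORT_JMP_BYTES = 2  # opcode plus 8-bit displacement
NEAR_JMP_BYTES = 5   # opcode plus 32-bit displacement (assumed alternative)
sites_with_short_jmp = 12  # count quoted in the commit message

saved = sites_with_short_jmp * (NEAR_JMP_BYTES - SHORT_JMP_BYTES)
print(saved)
```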
      
      This clearly is not worth the trouble so we can remove it. The
      only place where the optimization might count - in __switch_to()
      - we will handle differently. But that's not subject of this
      patch.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 12 Dec, 2015 1 commit
  15. 10 Dec, 2015 3 commits
  16. 26 Nov, 2015 1 commit
  17. 24 Nov, 2015 2 commits
  18. 19 Nov, 2015 4 commits
  19. 02 Nov, 2015 1 commit
  20. 17 Oct, 2015 1 commit
    • P
      cpufreq: intel_pstate: Fix intel_pstate powersave min_perf_pct value · 51443fbf
      Prarit Bhargava committed
      Systems that initialize the intel_pstate driver with the performance
      governor and then switch to the powersave governor will not transition to
      lower CPU frequencies until /sys/devices/system/cpu/intel_pstate/min_perf_pct
      is set to a low value.
      
      The behavior of governor switching changed after commit a0475992
      ("[cpufreq] intel_pstate: honor user space min_perf_pct override on
       resume").  The commit introduced tracking of performance percentage
      changes via sysfs in order to restore userspace changes during
      suspend/resume.  The problem occurs because the global values of the newly
      introduced max_sysfs_pct and min_sysfs_pct are not lowered on the governor
      change and this causes the powersave governor to inherit the performance
      governor's settings.
      
      A simple change would have been to reset max_sysfs_pct to 100 and
      min_sysfs_pct to 0 on a governor change, which fixes the problem with
      governor switching.  However, since we cannot break userspace[1] the fix
      is now to give each governor its own limits storage area so that governor
      specific changes are tracked.
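
The per-governor storage idea can be sketched with two independent records. This is a model only: the class names, the field names, and the default percentages are illustrative assumptions, not the driver's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Limits:
    max_perf_pct: int = 100
    min_perf_pct: int = 0


@dataclass
class Driver:
    # One limits record per governor, so a change made under one governor
    # is not inherited by the other.
    limits: dict = field(default_factory=lambda: {
        "performance": Limits(min_perf_pct=100),
        "powersave": Limits(),
    })
    governor: str = "performance"

    @property
    def active(self):
        return self.limits[self.governor]


drv = Driver()
drv.governor = "powersave"
drv.active.min_perf_pct = 30    # user change made while powersave is active
drv.governor = "performance"    # e.g. a suspend script switching governors
drv.governor = "powersave"
print(drv.active.min_perf_pct)  # the powersave setting is still in place
```

Keeping a record per governor lets a switch to performance and back leave the powersave values untouched, which is the case footnote [1] is concerned with.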
      
      I successfully tested this by booting with both the performance governor
      and the powersave governor by default, and switching between the two
      governors (while monitoring /sys/devices/system/cpu/intel_pstate/ values,
      and looking at the output of cpupower frequency-info).  Suspend/Resume
      testing was performed by Doug Smythies.
      
      [1] Systems which suspend/resume using the unmaintained pm-utils package
      will always transition to the performance governor before the suspend and
      after the resume.  This means a system using the powersave governor will
      go from powersave to performance, then suspend/resume, performance to
      powersave.  The simple change during governor changes would have been
      overwritten when the governor changed before and after the suspend/resume.
      I have submitted https://bugzilla.redhat.com/show_bug.cgi?id=1271225
      against Fedora to remove the 94cpufreq file that causes the problem.  It
      should be noted that pm-utils is obsoleted with newer versions of systemd.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  21. 16 Oct, 2015 1 commit
  22. 15 Oct, 2015 2 commits