1. 10 8月, 2017 2 次提交
  2. 04 8月, 2017 1 次提交
    • S
      cpufreq: intel_pstate: Improve IO performance with per-core P-states · 7bde2d50
      Srinivas Pandruvada 提交于
      In the current implementation, the response latency between seeing
      SCHED_CPUFREQ_IOWAIT set and the actual P-state adjustment can be up
      to 10ms.  It can be reduced by bumping up the P-state to the max at
      the time SCHED_CPUFREQ_IOWAIT is passed to intel_pstate_update_util().
      With this change, the IO performance improves significantly.
      
      For a simple "grep -r . linux" (Here linux is the kernel source
      folder) with caches dropped every time on a Broadwell Xeon workstation
      with per-core P-states, the user and system time is shorter by as much
      as 30% - 40%.
      
      The same performance difference was not observed on clients that don't
      support per-core P-state.
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      [ rjw: Changelog ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      7bde2d50
  3. 01 8月, 2017 1 次提交
  4. 28 7月, 2017 1 次提交
    • R
      cpufreq: intel_pstate: Drop ->get from intel_pstate structure · 22baebd4
      Rafael J. Wysocki 提交于
      The ->get callback in the intel_pstate structure was mostly there
      for the scaling_cur_freq sysfs attribute to work, but after commit
      f8475cef (x86: use common aperfmperf_khz_on_cpu() to calculate
      KHz using APERF/MPERF) that attribute uses arch_freq_get_on_cpu()
      provided by the x86 arch code on all processors supported by
      intel_pstate, so it doesn't need the ->get callback from the
      driver any more.
      
      Moreover, the very presence of the ->get callback in the intel_pstate
      structure causes the cpuinfo_cur_freq attribute to be present when
      intel_pstate operates in the active mode, which is bogus, because
      the role of that attribute is to return the current CPU frequency
      as seen by the hardware.  For intel_pstate, though, this is just an
      average frequency and not really current, but computed for the
      previous sampling interval (the actual current frequency may be
      way different at the point this value is obtained by reading from
      cpuinfo_cur_freq), and after commit 82b4e03e (intel_pstate: skip
      scheduler hook when in "performance" mode) the value in
      cpuinfo_cur_freq may be stale or just 0, depending on the driver's
      operation mode.  In fact, however, on the hardware supported by
      intel_pstate there is no way to read the current CPU frequency
      from it, so the cpuinfo_cur_freq attribute should not be present
      at all when this driver is in use.
      
      For this reason, drop intel_pstate_get() and clear the ->get
      callback pointer pointing to it, so that the cpuinfo_cur_freq is
      not present for intel_pstate in the active mode any more.
      
      Fixes: 82b4e03e (intel_pstate: skip scheduler hook when in "performance" mode)
      Reported-by: NHuaisheng Ye <yehs1@lenovo.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      22baebd4
  5. 27 7月, 2017 2 次提交
    • R
      cpufreq: intel_pstate: Drop ->update_util from pstate_funcs · c4f3f70c
      Rafael J. Wysocki 提交于
      All systems use the same P-state selection "powersave" algorithm
      in the active mode if HWP is not used, so there's no need to provide
      a pointer for it in struct pstate_funcs any more.
      
      Drop ->update_util from struct pstate_funcs and make
      intel_pstate_set_update_util_hook() use intel_pstate_update_util()
      directly.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c4f3f70c
    • R
      cpufreq: intel_pstate: Do not use PID-based P-state selection · 9d0ef7af
      Rafael J. Wysocki 提交于
      All systems with a defined ACPI preferred profile that are not
      "servers" have been using the load-based P-state selection algorithm
      in intel_pstate since 4.12-rc1 (mobile systems and laptops have been
      using it since 4.10-rc1) and no problems with it have been reported
      to date.  In particular, no regressions with respect to the PID-based
      P-state selection have been reported.  Also testing indicates that
      the P-state selection algorithm based on CPU load is generally on par
      with the PID-based algorithm performance-wise, and for some workloads
      it turns out to be better than the other one, while being more
      straightforward and easier to understand at the same time.
      
      Moreover, the PID-based P-state selection algorithm in intel_pstate
      is known to be unstable in some situation and generally problematic,
      the issues with it are hard to address and it has become a
      significant maintenance burden.
      
      For these reasons, make intel_pstate use the "powersave" P-state
      selection algorithm based on CPU load in the active mode on all
      systems and drop the PID-based P-state selection code along with
      all things related to it from the driver.  Also update the
      documentation accordingly.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      9d0ef7af
  6. 14 7月, 2017 1 次提交
    • S
      cpufreq: intel_pstate: Correct the busy calculation for KNL · 6e34e1f2
      Srinivas Pandruvada 提交于
      The busy percent calculated for the Knights Landing (KNL) platform
      is 1024 times smaller than the correct busy value.  This causes
      performance to get stuck at the lowest ratio.
      
      The scaling algorithm used for KNL is performance-based, but it still
      looks at the CPU load to set the scaled busy factor to 0 when the
      load is less than 1 percent.  In this case, since the computed load
      is 1024x smaller than it should be, the scaled busy factor will
      always be 0, irrespective of CPU business.
      
      This needs a fix similar to the turbostat one in commit b2b34dfe
      (tools/power turbostat: KNL workaround for %Busy and Avg_MHz).
      
      For this reason, add one more callback to processor-specific
      callbacks to specify an MPERF multiplier represented by a number of
      bit positions to shift the value of that register to the left to
      copmensate for its rate difference with respect to the TSC.  This
      shift value is used during CPU busy calculations.
      
      Fixes: ffb81056 (intel_pstate: Avoid getting stuck in high P-states when idle)
      Reported-and-tested-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
      [ rjw: Changelog ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      6e34e1f2
  7. 12 7月, 2017 1 次提交
    • S
      cpufreq: intel_pstate: Fix ratio setting for min_perf_pct · d4436c0d
      Srinivas Pandruvada 提交于
      When the minimum performance limit percentage is set to the power-up
      default, it is possible that minimum performance ratio is off by one.
      
      In the set_policy() callback the minimum ratio is calculated by
      applying global.min_perf_pct to turbo_ratio and rounding up, but the
      power-up default global.min_perf_pct is already rounded up to the
      next percent in min_perf_pct_min().  That results in two round up
      operations, so for the default min_perf_pct one of them is not
      required.
      
      It is better to remove rounding up in min_perf_pct_min() as this
      matches the displayed min_perf_pct prior to commit c5a2ee7d
      (cpufreq: intel_pstate: Active mode P-state limits rework) in 4.12.
      
      For example on a platform with max turbo ratio of 37 and minimum
      ratio of 10, the min_perf_pct resulted in 28 with the above commit.
      Before this commit it was 27 and it will be the same after this
      change.
      
      Fixes: 1a4fe38a (cpufreq: intel_pstate: Remove max/min fractions to limit performance)
      Reported-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      d4436c0d
  8. 05 7月, 2017 1 次提交
  9. 30 6月, 2017 1 次提交
  10. 27 6月, 2017 2 次提交
  11. 24 6月, 2017 1 次提交
    • S
      cpufreq: intel_pstate: Remove max/min fractions to limit performance · 1a4fe38a
      Srinivas Pandruvada 提交于
      In the current model the max/min perf limits are a fraction of current
      user space limits to the allowed max_freq or 100% for global limits.
      This results in wrong ratio limits calculation because of rounding
      issues for some user space limits.
      
      Initially we tried to solve this issue by issue by having more shift
      bits to increase precision. Still there are isolated cases where we still
      have error.
      
      This can be avoided by using ratios all together. Since the way we get
      cpuinfo.max_freq is by multiplying scaling factor to max ratio, we can
      easily keep the max/min ratios in terms of ratios and not fractions.
      
      For example:
      if the max ratio = 36
      cpuinfo.max_freq = 36 * 100000 = 3600000
      
      Suppose user space sets a limit of 1200000, then we can calculate
      max ratio limit as
      = 36 * 1200000 / 3600000
      = 12
      This will be correct for any user limits.
      
      The other advantage is that, we don't need to do any calculation in the
      fast path as ratio limit is already calculated via set_policy() callback.
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      1a4fe38a
  12. 05 6月, 2017 1 次提交
  13. 12 5月, 2017 1 次提交
    • L
      intel_pstate: use updated msr-index.h HWP.EPP values · 3cedbc5a
      Len Brown 提交于
      intel_pstate exports sysfs attributes for setting and observing HWP.EPP.
      These attributes use strings to describe 4 operating states, and
      inside the driver, these strings are mapped to numerical register
      values.
      
      The authorative mapping between the strings and numerical HWP.EPP values
      are now globally defined in msr-index.h, replacing the out-dated
      mapping that were open-coded into intel_pstate.c
      
      new old string
      --- --- ------
        0   0 performance
      128  64 balance_performance
      192 128 balance_power
      255 192 power
      
      Note that the HW and BIOS default value on most system is 128,
      which intel_pstate will now call "balance_performance"
      while it used to call it "balance_power".
      Signed-off-by: NLen Brown <len.brown@intel.com>
      3cedbc5a
  14. 18 4月, 2017 1 次提交
  15. 30 3月, 2017 1 次提交
  16. 29 3月, 2017 16 次提交
  17. 24 3月, 2017 4 次提交
    • R
      cpufreq: intel_pstate: Avoid transient updates of cpuinfo.max_freq · 80b120ca
      Rafael J. Wysocki 提交于
      Both intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
      set policy->cpuinfo.max_freq depending on the turbo status, but the
      updates made by them are discarded by the core, because the policy
      object passed to them by the core is temporary and cpuinfo.max_freq
      from that object is not copied to the final policy object in
      cpufreq_set_policy().
      
      However, cpufreq_set_policy() passes the temporary policy object
      to the ->setpolicy callback of the driver, so intel_pstate_set_policy()
      actually sees the policy->cpuinfo.max_freq value updated by
      intel_pstate_verify_policy() and not the final one.  It also
      updates policy->max sometimes which basically has no effect after
      it returns, because the core discards that update.
      
      To avoid confusion, eliminate policy->cpuinfo.max_freq updates from
      intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
      entirely and check the maximum frequency explicitly in
      intel_pstate_update_perf_limits() instead of relying on the
      transiently updated policy->cpuinfo.max_freq value.
      
      Moreover, move the max->policy adjustment carried out in
      intel_pstate_set_policy() to a separate function and call that
      function from the ->verify driver callbacks to ensure that it will
      actually be effective.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      80b120ca
    • R
      cpufreq: intel_pstate: Active mode P-state limits rework · c5a2ee7d
      Rafael J. Wysocki 提交于
      The coordination of P-state limits used by intel_pstate in the active
      mode (ie. by default) is problematic, because it synchronizes all of
      the limits (ie. the global ones and the per-policy ones) so as to use
      one common pair of P-state limits (min and max) across all CPUs in
      the system.  The drawbacks of that are as follows:
      
       - If P-states are coordinated in hardware, it is not necessary
         to coordinate them in software on top of that, so in that case
         all of the above activity is in vain.
      
       - If P-states are not coordinated in hardware, then the processor
         is actually capable of setting different P-states for different
         CPUs and coordinating them at the software level simply doesn't
         allow that capability to be utilized.
      
       - The coordination works in such a way that setting a per-policy
         limit (eg. scaling_max_freq) for one CPU causes the common
         effective limit to change (and it will affect all of the other
         CPUs too), but subsequent reads from the corresponding sysfs
         attributes for the other CPUs will return stale values (which
         is confusing).
      
       - Reads from the global P-state limit attributes, min_perf_pct and
         max_perf_pct, return the effective common values and not the last
         values set through these attributes.  However, the last values
         set through these attributes become hard limits that cannot be
         exceeded by writes to scaling_min_freq and scaling_max_freq,
         respectively, and they are not exposed, so essentially users
         have to remember what they are.
      
      All of that is painful enough to warrant a change of the management
      of P-state limits in the active mode.
      
      To that end, redesign the active mode P-state limits management in
      intel_pstate in accordance with the following rules:
      
       (1) All CPUs are affected by the global limits (that is, none of
           them can be requested to run faster than the global max and
           none of them can be requested to run slower than the global
           min).
      
       (2) Each individual CPU is affected by its own per-policy limits
           (that is, it cannot be requested to run faster than its own
           per-policy max and it cannot be requested to run slower than
           its own per-policy min).
      
       (3) The global and per-policy limits can be set independently.
      
      Also, the global maximum and minimum P-state limits will be always
      expressed as percentages of the maximum supported turbo P-state.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c5a2ee7d
    • R
      cpufreq: intel_pstate: Use load-based P-state selection more widely · 55395345
      Rafael J. Wysocki 提交于
      Extend the set of systems for which intel_pstate will use the
      "powersave" P-state selection algorithm based on CPU load in the
      active mode by systems with ACPI preferred profile set to "tablet",
      "appliance PC", "desktop", or "workstation" (ie. everything with a
      specified preferred profile that is not a "server").
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      55395345
    • R
      cpufreq: intel_pstate: Support HWP processors in all operation modes · eb5139d1
      Rafael J. Wysocki 提交于
      Currently, some processors supporting HWP are only supported by
      intel_pstate if HWP is actually going to be used and not supported
      otherwise which is confusing.
      
      Specifically, they are not supported if "intel_pstate=no_hwp" is
      passed to the kernel in the command line or if the driver is started
      in the passive mode ("intel_pstate=passive").
      
      There is no real reason for that, because everything about those
      processor is known anyway and the driver can work with them in all
      modes, so make that happen, but use the load-based P-state selection
      algorithm for the active mode "powersave" policy with them.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      eb5139d1
  18. 22 3月, 2017 1 次提交
    • R
      cpufreq: intel_pstate: Fix policy data management in passive mode · 64897b20
      Rafael J. Wysocki 提交于
      The policy->cpuinfo.max_freq and policy->max updates in
      intel_cpufreq_turbo_update() are excessive as they are done for no
      good reason and may lead to problems in principle, so they should be
      dropped.  However, after dropping them intel_cpufreq_turbo_update()
      becomes almost entirely pointless, because the check made by it is
      made again down the road in intel_pstate_prepare_request().  The
      only thing in it that still needs to be done is the call to
      update_turbo_state(), so drop intel_cpufreq_turbo_update() altogether
      and make its callers invoke update_turbo_state() directly instead of
      it.
      
      In addition to that, fix intel_cpufreq_verify_policy() so that it
      checks global.no_turbo in addition to global.turbo_disabled when
      updating policy->cpuinfo.max_freq to make it consistent with
      intel_pstate_verify_policy().
      
      Fixes: 001c76f0 (cpufreq: intel_pstate: Generic governors support)
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      64897b20
  19. 18 3月, 2017 1 次提交
    • R
      cpufreq: intel_pstate: One set of global limits in active mode · 7de32556
      Rafael J. Wysocki 提交于
      In the active mode intel_pstate currently uses two sets of global
      limits, each associated with one of the possible scaling_governor
      settings in that mode: "powersave" or "performance".
      
      The driver switches over from one of those sets to the other
      depending on the scaling_governor setting for the last CPU whose
      per-policy cpufreq interface in sysfs was last used to change
      parameters exposed in there.  That obviously leads to no end of
      issues when the scaling_governor settings differ between CPUs.
      
      The most recent issue was introduced by commit a240c4aa (cpufreq:
      intel_pstate: Do not reinit performance limits in ->setpolicy)
      that eliminated the reinitialization of "performance" limits in
      intel_pstate_set_policy() preventing the max limit from being set
      to anything below 100, among other things.
      
      Namely, an undesirable side effect of commit a240c4aa is that
      now, after setting scaling_governor to "performance" in the active
      mode, the per-policy limits for the CPU in question go to the highest
      level and stay there even when it is switched back to "powersave"
      later.
      
      As it turns out, some distributions set scaling_governor to
      "performance" temporarily for all CPUs to speed-up system
      initialization, so that change causes them to misbehave later.
      
      To fix that, get rid of the performance/powersave global limits
      split and use just one set of global limits for everything.
      
      From the user's persepctive, after this modification, when
      scaling_governor is switched from "performance" to "powersave"
      or the other way around on one CPU, the limits settings (ie. the
      global max/min_perf_pct and per-policy scaling_max/min_freq for
      any CPUs) will not change.  Still, switching from "performance"
      to "powersave" or the other way around changes the way in which
      P-states are selected and in particular "performance" causes the
      driver to always request the highest P-state it is allowed to ask
      for for the given CPU.
      
      Fixes: a240c4aa (cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy)
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      7de32556