1. 20 3月, 2016 1 次提交
    • R
      intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts · fdfdb2b1
      Rafael J. Wysocki 提交于
      After commit a4675fbc (cpufreq: intel_pstate: Replace timers with
      utilization update callbacks) wrmsrl_on_cpu() cannot be called in the
      intel_pstate_adjust_busy_pstate() path as that is executed with
      disabled interrupts.  However, atom_set_pstate() called from there
      via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the
      IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in
      smp_call_function_single().
      
      The reason why wrmsrl_on_cpu() is used by atom_set_pstate() is
      because intel_pstate_set_pstate() calling it is also invoked during
      the initialization and cleanup of the driver and in those cases it is
      not guaranteed to be run on the CPU that is being updated.  However,
      in the case when intel_pstate_set_pstate() is called by
      intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update
      the register safely.  Moreover, intel_pstate_set_pstate() already
      contains code that only is executed if the function is called by
      intel_pstate_adjust_busy_pstate() and there is a special argument
      passed to it because of that.
      
      To fix the problem at hand, rearrange the code taking the above
      observations into account.
      
      First, replace the ->set() callback in struct pstate_funcs with a
      ->get_val() one that will return the value to be written to the
      IA32_PERF_CTL MSR without updating the register.
      
      Second, split intel_pstate_set_pstate() into two functions,
      intel_pstate_update_pstate() to be called by
      intel_pstate_adjust_busy_pstate() that will contain all of the
      intel_pstate_set_pstate() code which only needs to be executed in
      that case and will use wrmsrl() to update the MSR (after obtaining
      the value to write to it from the ->get_val() callback), and
      intel_pstate_set_min_pstate() to be invoked during the
      initialization and cleanup that will set the P-state to the
      minimum one and will update the MSR using wrmsrl_on_cpu().
      
      Finally, move the code shared between intel_pstate_update_pstate()
      and intel_pstate_set_min_pstate() to a new static inline function
      intel_pstate_record_pstate() and make them both call it.
      
      Of course, that unifies the handling of the IA32_PERF_CTL MSR writes
      between Atom and Core.
      
      Fixes: a4675fbc (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
      Reported-and-tested-by: NJosh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      fdfdb2b1
  2. 11 3月, 2016 5 次提交
  3. 09 3月, 2016 2 次提交
    • R
      cpufreq: Reduce cpufreq_update_util() overhead a bit · 08f511fd
      Rafael J. Wysocki 提交于
      Use the observation that cpufreq_update_util() is only called
      by the scheduler with rq->lock held, so the callers of
      cpufreq_set_update_util_data() can use synchronize_sched()
      instead of synchronize_rcu() to wait for cpufreq_update_util()
      to complete.  Moreover, if they are updated to do that,
      rcu_read_(un)lock() calls in cpufreq_update_util() might be
      replaced with rcu_read_(un)lock_sched(), respectively, but
      those aren't really necessary, because the scheduler calls
      that function from RCU-sched read-side critical sections
      already.
      
      In addition to that, if cpufreq_set_update_util_data() checks
      the func field in the struct update_util_data before setting
      the per-CPU pointer to it, the data->func check may be dropped
      from cpufreq_update_util() as well.
      
      Make the above changes to reduce the overhead from
      cpufreq_update_util() in the scheduler paths invoking it
      and to make the cleanup after removing its callbacks less
      heavy-weight somewhat.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      08f511fd
    • R
      cpufreq: intel_pstate: Replace timers with utilization update callbacks · a4675fbc
      Rafael J. Wysocki 提交于
      Instead of using a per-CPU deferrable timer for utilization sampling
      and P-states adjustments, register a utilization update callback that
      will be invoked from the scheduler on utilization changes.
      
      The sampling rate is still the same as what was used for the deferrable
      timers, so the functional impact of this patch should not be significant.
      
      Based on an earlier patch from Srinivas Pandruvada.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      a4675fbc
  4. 27 2月, 2016 2 次提交
  5. 23 2月, 2016 1 次提交
    • V
      intel_pstate: Update frequencies of policy->cpus only from ->set_policy() · 41cfd64c
      Viresh Kumar 提交于
      The intel-pstate driver is using intel_pstate_hwp_set() from two
      separate paths, i.e. ->set_policy() callback and sysfs update path for
      the files present in /sys/devices/system/cpu/intel_pstate/ directory.
      
      While an update to the sysfs path applies to all the CPUs being managed
      by the driver (which essentially means all the online CPUs), the update
      via the ->set_policy() callback applies to a smaller group of CPUs
      managed by the policy for which ->set_policy() is called.
      
      And so, intel_pstate_hwp_set() should update frequencies of only the
      CPUs that are part of policy->cpus mask, while it is called from
      ->set_policy() callback.
      
      In order to do that, add a parameter (cpumask) to intel_pstate_hwp_set()
      and apply the frequency changes only to the concerned CPUs.
      
      For ->set_policy() path, we are only concerned about policy->cpus, and
      so policy->rwsem lock taken by the core prior to calling ->set_policy()
      is enough to take care of any races. The larger lock acquired by
      get_online_cpus() is required only for the updates to sysfs files.
      
      Add another routine, intel_pstate_hwp_set_online_cpus(), and call it
      from the sysfs update paths.
      
      This also fixes a lockdep reported recently, where policy->rwsem and
      get_online_cpus() could have been acquired in any order causing an ABBA
      deadlock. The sequence of events leading to that was:
      
      intel_pstate_init(...)
      	...cpufreq_online(...)
      		down_write(&policy->rwsem); // Locks policy->rwsem
      		...
      		cpufreq_init_policy(policy);
      			...intel_pstate_hwp_set();
      				get_online_cpus(); // Temporarily locks cpu_hotplug.lock
      		...
      		up_write(&policy->rwsem);
      
      pm_suspend(...)
      	...disable_nonboot_cpus()
      		_cpu_down()
      			cpu_hotplug_begin(); // Locks cpu_hotplug.lock
      			__cpu_notify(CPU_DOWN_PREPARE, ...);
      				...cpufreq_offline_prepare();
      					down_write(&policy->rwsem); // Locks policy->rwsem
      Reported-and-tested-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      41cfd64c
  6. 30 1月, 2016 1 次提交
    • B
      x86/cpufeature: Replace the old static_cpu_has() with safe variant · bc696ca0
      Borislav Petkov 提交于
      So the old one didn't work properly before alternatives had run.
      And it was supposed to provide an optimized JMP because the
      assumption was that the offset it is jumping to is within a
      signed byte and thus a two-byte JMP.
      
      So I did an x86_64 allyesconfig build and dumped all possible
      sites where static_cpu_has() was used. The optimization amounted
      to all in all 12(!) places where static_cpu_has() had generated
      a 2-byte JMP. Which has saved us a whopping 36 bytes!
      
      This clearly is not worth the trouble so we can remove it. The
      only place where the optimization might count - in __switch_to()
      - we will handle differently. But that's not subject of this
      patch.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      bc696ca0
  7. 12 12月, 2015 1 次提交
  8. 10 12月, 2015 3 次提交
  9. 26 11月, 2015 1 次提交
  10. 24 11月, 2015 2 次提交
  11. 19 11月, 2015 4 次提交
  12. 02 11月, 2015 1 次提交
  13. 17 10月, 2015 1 次提交
    • P
      cpufreq: intel_pstate: Fix intel_pstate powersave min_perf_pct value · 51443fbf
      Prarit Bhargava 提交于
      On systems that initialize the intel_pstate driver with the performance
      governor, and then switch to the powersave governor will not transition to
      lower cpu frequencies until /sys/devices/system/cpu/intel_pstate/min_perf_pct
      is set to a low value.
      
      The behavior of governor switching changed after commit a0475992
      ("[cpufreq] intel_pstate: honor user space min_perf_pct override on
       resume").  The commit introduced tracking of performance percentage
      changes via sysfs in order to restore userspace changes during
      suspend/resume.  The problem occurs because the global values of the newly
      introduced max_sysfs_pct and min_sysfs_pct are not lowered on the governor
      change and this causes the powersave governor to inherit the performance
      governor's settings.
      
      A simple change would have been to reset max_sysfs_pct to 100 and
      min_sysfs_pct to 0 on a governor change, which fixes the problem with
      governor switching.  However, since we cannot break userspace[1] the fix
      is now to give each governor its own limits storage area so that governor
      specific changes are tracked.
      
      I successfully tested this by booting with both the performance governor
      and the powersave governor by default, and switching between the two
      governors (while monitoring /sys/devices/system/cpu/intel_pstate/ values,
      and looking at the output of cpupower frequency-info).  Suspend/Resume
      testing was performed by Doug Smythies.
      
      [1] Systems which suspend/resume using the unmaintained pm-utils package
      will always transition to the performance governor before the suspend and
      after the resume.  This means a system using the powersave governor will
      go from powersave to performance, then suspend/resume, performance to
      powersave.  The simple change during governor changes would have been
      overwritten when the governor changed before and after the suspend/resume.
      I have submitted https://bugzilla.redhat.com/show_bug.cgi?id=1271225
      against Fedora to remove the 94cpufreq file that causes the problem.  It
      should be noted that pm-utils is obsoleted with newer versions of systemd.
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Acked-by: NKristen Carlson Accardi <kristen@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      51443fbf
  14. 16 10月, 2015 1 次提交
  15. 15 10月, 2015 4 次提交
  16. 10 9月, 2015 2 次提交
  17. 07 8月, 2015 2 次提交
  18. 01 8月, 2015 1 次提交
  19. 27 7月, 2015 1 次提交
  20. 17 7月, 2015 1 次提交
  21. 06 7月, 2015 1 次提交
  22. 17 6月, 2015 1 次提交
    • P
      intel_pstate: Fix overflow in busy_scaled due to long delay · 7180dddf
      Prarit Bhargava 提交于
      The kernel may delay interrupts for a long time which can result in timers
      being delayed. If this occurs the intel_pstate driver will crash with a
      divide by zero error:
      
      divide error: 0000 [#1] SMP
      Modules linked in: btrfs zlib_deflate raid6_pq xor msdos ext4 mbcache jbd2 binfmt_misc arc4 md4 nls_utf8 cifs dns_resolver tcp_lp bnep bluetooth rfkill fuse dm_service_time iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ftp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables intel_powerclamp coretemp vfat fat kvm_intel iTCO_wdt iTCO_vendor_support ipmi_devintf sr_mod kvm crct10dif_pclmul
       crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel cdc_ether lrw usbnet cdrom mii gf128mul glue_helper ablk_helper cryptd lpc_ich mfd_core pcspkr sb_edac edac_core ipmi_si ipmi_msghandler ioatdma wmi shpchp acpi_pad nfsd auth_rpcgss nfs_acl lockd uinput dm_multipath sunrpc xfs libcrc32c usb_storage sd_mod crc_t10dif crct10dif_common ixgbe mgag200 syscopyarea sysfillrect sysimgblt mdio drm_kms_helper ttm igb drm ptp pps_core dca i2c_algo_bit megaraid_sas i2c_core dm_mirror dm_region_hash dm_log dm_mod
      CPU: 113 PID: 0 Comm: swapper/113 Tainted: G        W   --------------   3.10.0-229.1.2.el7.x86_64 #1
      Hardware name: IBM x3950 X6 -[3837AC2]-/00FN827, BIOS -[A8E112BUS-1.00]- 08/27/2014
      task: ffff880fe8abe660 ti: ffff880fe8ae4000 task.ti: ffff880fe8ae4000
      RIP: 0010:[<ffffffff814a9279>]  [<ffffffff814a9279>] intel_pstate_timer_func+0x179/0x3d0
      RSP: 0018:ffff883fff4e3db8  EFLAGS: 00010206
      RAX: 0000000027100000 RBX: ffff883fe6965100 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000010 RDI: 000000002e53632d
      RBP: ffff883fff4e3e20 R08: 000e6f69a5a125c0 R09: ffff883fe84ec001
      R10: 0000000000000002 R11: 0000000000000005 R12: 00000000000049f5
      R13: 0000000000271000 R14: 00000000000049f5 R15: 0000000000000246
      FS:  0000000000000000(0000) GS:ffff883fff4e0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f7668601000 CR3: 000000000190a000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       ffff883fff4e3e58 ffffffff81099dc1 0000000000000086 0000000000000071
       ffff883fff4f3680 0000000000000071 fbdc8a965e33afee ffffffff810b69dd
       ffff883fe84ec000 ffff883fe6965108 0000000000000100 ffffffff814a9100
      Call Trace:
       <IRQ>
      
       [<ffffffff81099dc1>] ? run_posix_cpu_timers+0x51/0x840
       [<ffffffff810b69dd>] ? trigger_load_balance+0x5d/0x200
       [<ffffffff814a9100>] ? pid_param_set+0x130/0x130
       [<ffffffff8107df56>] call_timer_fn+0x36/0x110
       [<ffffffff814a9100>] ? pid_param_set+0x130/0x130
       [<ffffffff8107fdcf>] run_timer_softirq+0x21f/0x320
       [<ffffffff81077b2f>] __do_softirq+0xef/0x280
       [<ffffffff816156dc>] call_softirq+0x1c/0x30
       [<ffffffff81015d95>] do_softirq+0x65/0xa0
       [<ffffffff81077ec5>] irq_exit+0x115/0x120
       [<ffffffff81616355>] smp_apic_timer_interrupt+0x45/0x60
       [<ffffffff81614a1d>] apic_timer_interrupt+0x6d/0x80
       <EOI>
      
       [<ffffffff814a9c32>] ? cpuidle_enter_state+0x52/0xc0
       [<ffffffff814a9c28>] ? cpuidle_enter_state+0x48/0xc0
       [<ffffffff814a9d65>] cpuidle_idle_call+0xc5/0x200
       [<ffffffff8101d14e>] arch_cpu_idle+0xe/0x30
       [<ffffffff810c67c1>] cpu_startup_entry+0xf1/0x290
       [<ffffffff8104228a>] start_secondary+0x1ba/0x230
      Code: 42 0f 00 45 89 e6 48 01 c2 43 8d 44 6d 00 39 d0 73 26 49 c1 e5 08 89 d2 4d 63 f4 49 63 c5 48 c1 e2 08 48 c1 e0 08 48 63 ca 48 99 <48> f7 f9 48 98 4c 0f af f0 49 c1 ee 08 8b 43 78 c1 e0 08 44 29
      RIP  [<ffffffff814a9279>] intel_pstate_timer_func+0x179/0x3d0
       RSP <ffff883fff4e3db8>
      
      The kernel values for cpudata for CPU 113 were:
      
      struct cpudata {
        cpu = 113,
        timer = {
          entry = {
            next = 0x0,
            prev = 0xdead000000200200
          },
          expires = 8357799745,
          base = 0xffff883fe84ec001,
          function = 0xffffffff814a9100 <intel_pstate_timer_func>,
          data = 18446612406765768960,
      <snip>
          i_gain = 0,
          d_gain = 0,
          deadband = 0,
          last_err = 22489
        },
        last_sample_time = {
          tv64 = 4063132438017305
        },
        prev_aperf = 287326796397463,
        prev_mperf = 251427432090198,
        sample = {
          core_pct_busy = 23081,
          aperf = 2937407,
          mperf = 3257884,
          freq = 2524484,
          time = {
            tv64 = 4063149215234118
          }
        }
      }
      
      which results in the time between samples = last_sample_time - sample.time
      = 4063149215234118 - 4063132438017305 = 16777216813 which is 16.777 seconds.
      
      The duration between reads of the APERF and MPERF registers overflowed a s32
      sized integer in intel_pstate_get_scaled_busy()'s call to div_fp().  The result
      is that int_tofp(duration_us) == 0, and the kernel attempts to divide by 0.
      
      While the kernel shouldn't be delaying for a long time, it can and does
      happen and the intel_pstate driver should not panic in this situation.  This
      patch changes the div_fp() function to use div64_s64() to allow for "long"
      division.  This will avoid the overflow condition on long delays.
      
      [v2]: use div64_s64() in div_fp()
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      7180dddf
  23. 10 6月, 2015 1 次提交