1. 09 3月, 2016 6 次提交
  2. 28 1月, 2016 1 次提交
    • V
      cpufreq: Fix NULL reference crash while accessing policy->governor_data · e4b133cc
      Viresh Kumar 提交于
      There is a race discovered by Juri, where we are able to:
      - create and read a sysfs file before policy->governor_data is being set
        to a non NULL value.
        OR
      - set policy->governor_data to NULL, and reading a file before being
        destroyed.
      
      And so such a crash is reported:
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000c
      pgd = edfc8000
      [0000000c] *pgd=bfc8c835
      Internal error: Oops: 17 [#1] SMP ARM
      Modules linked in:
      CPU: 4 PID: 1730 Comm: cat Not tainted 4.5.0-rc1+ #463
      Hardware name: ARM-Versatile Express
      task: ee8e8480 ti: ee930000 task.ti: ee930000
      PC is at show_ignore_nice_load_gov_pol+0x24/0x34
      LR is at show+0x4c/0x60
      pc : [<c058f1bc>]    lr : [<c058ae88>]    psr: a0070013
      sp : ee931dd0  ip : ee931de0  fp : ee931ddc
      r10: ee4bc290  r9 : 00001000  r8 : ef2cb000
      r7 : ee4bc200  r6 : ef2cb000  r5 : c0af57b0  r4 : ee4bc2e0
      r3 : 00000000  r2 : 00000000  r1 : c0928df4  r0 : ef2cb000
      Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
      Control: 10c5387d  Table: adfc806a  DAC: 00000051
      Process cat (pid: 1730, stack limit = 0xee930210)
      Stack: (0xee931dd0 to 0xee932000)
      1dc0:                                     ee931dfc ee931de0 c058ae88 c058f1a4
      1de0: edce3bc0 c07bfca4 edce3ac0 00001000 ee931e24 ee931e00 c01fcb90 c058ae48
      1e00: 00000001 edce3bc0 00000000 00000001 ee931e50 ee8ff480 ee931e34 ee931e28
      1e20: c01fb33c c01fcb0c ee931e8c ee931e38 c01a5210 c01fb314 ee931e9c ee931e48
      1e40: 00000000 edce3bf0 befe4a00 ee931f78 00000000 00000000 000001e4 00000000
      1e60: c00545a8 edce3ac0 00001000 00001000 befe4a00 ee931f78 00000000 00001000
      1e80: ee931ed4 ee931e90 c01fbed8 c01a5038 ed085a58 00020000 00000000 00000000
      1ea0: c0ad72e4 ee931f78 ee8ff488 ee8ff480 c077f3fc 00001000 befe4a00 ee931f78
      1ec0: 00000000 00001000 ee931f44 ee931ed8 c017c328 c01fbdc4 00001000 00000000
      1ee0: ee8ff480 00001000 ee931f44 ee931ef8 c017c65c c03deb10 ee931fac ee931f08
      1f00: c0009270 c001f290 c0a8d968 ef2cb000 ef2cb000 ee8ff480 00000020 ee8ff480
      1f20: ee8ff480 befe4a00 00001000 ee931f78 00000000 00000000 ee931f74 ee931f48
      1f40: c017d1ec c017c2f8 c019c724 c019c684 ee8ff480 ee8ff480 00001000 befe4a00
      1f60: 00000000 00000000 ee931fa4 ee931f78 c017d2a8 c017d160 00000000 00000000
      1f80: 000a9f20 00001000 befe4a00 00000003 c000ffe4 ee930000 00000000 ee931fa8
      1fa0: c000fe40 c017d264 000a9f20 00001000 00000003 befe4a00 00001000 00000000
      Unable to handle kernel NULL pointer dereference at virtual address 0000000c
      1fc0: 000a9f20 00001000 befe4a00 00000003 00000000 00000000 00000003 00000001
      pgd = edfc4000
      [0000000c] *pgd=bfcac835
      1fe0: 00000000 befe49dc 000197f8 b6e35dfc 60070010 00000003 3065b49d 134ac2c9
      
      [<c058f1bc>] (show_ignore_nice_load_gov_pol) from [<c058ae88>] (show+0x4c/0x60)
      [<c058ae88>] (show) from [<c01fcb90>] (sysfs_kf_seq_show+0x90/0xfc)
      [<c01fcb90>] (sysfs_kf_seq_show) from [<c01fb33c>] (kernfs_seq_show+0x34/0x38)
      [<c01fb33c>] (kernfs_seq_show) from [<c01a5210>] (seq_read+0x1e4/0x4e4)
      [<c01a5210>] (seq_read) from [<c01fbed8>] (kernfs_fop_read+0x120/0x1a0)
      [<c01fbed8>] (kernfs_fop_read) from [<c017c328>] (__vfs_read+0x3c/0xe0)
      [<c017c328>] (__vfs_read) from [<c017d1ec>] (vfs_read+0x98/0x104)
      [<c017d1ec>] (vfs_read) from [<c017d2a8>] (SyS_read+0x50/0x90)
      [<c017d2a8>] (SyS_read) from [<c000fe40>] (ret_fast_syscall+0x0/0x1c)
      Code: e5903044 e1a00001 e3081df4 e34c1092 (e593300c)
      ---[ end trace 5994b9a5111f35ee ]---
      
      Fix that by making sure, policy->governor_data is updated at the right
      places only.
      
      Cc: 4.2+ <stable@vger.kernel.org> # 4.2+
      Reported-and-tested-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      e4b133cc
  3. 05 1月, 2016 1 次提交
    • C
      cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC · 0df35026
      Chen Yu 提交于
      It is reported that, with CONFIG_HZ_PERIODIC=y cpu stays at the
      lowest frequency even if the usage goes to 100%, neither ondemand
      nor conservative governor works, however performance and
      userspace work as expected. If set with CONFIG_NO_HZ_FULL=y,
      everything goes well.
      
      This problem is caused by improper calculation of the idle_time
      when the load is extremely high(near 100%). Firstly, cpufreq_governor
      uses get_cpu_idle_time to get the total idle time for specific cpu, then:
      
      1.If the system is configured with CONFIG_NO_HZ_FULL, the idle time is
        returned by ktime_get, which is always increasing, it's OK.
      2.However, if the system is configured with CONFIG_HZ_PERIODIC,
        get_cpu_idle_time might not guarantee to be always increasing,
        because it will leverage get_cpu_idle_time_jiffy to calculate the
        idle_time, consider the following scenario:
      
      At T1:
      idle_tick_1 = total_tick_1 - user_tick_1
      
      sample period(80ms)...
      
      At T2: ( T2 = T1 + 80ms):
      idle_tick_2 = total_tick_2 - user_tick_2
      
      Currently the algorithm is using (idle_tick_2 - idle_tick_1) to
      get the delta idle_time during the past sample period, however
      it CAN NOT guarantee that idle_tick_2 >= idle_tick_1, especially
      when cpu load is high.
      (Yes, total_tick_2 >= total_tick_1, and user_tick_2 >= user_tick_1,
      but how about idle_tick_2 and idle_tick_1? No guarantee.)
      So governor might get a negative value of idle_time during the past
      sample period, which might mislead the system that the idle time is
      very big(converted to unsigned int), and the busy time is nearly zero,
      which causes the governor to always choose the lowest cpufreq,
      then cause this problem.
      
      In theory there are two solutions:
      
      1.The logic should not rely on the idle tick during every sample period,
        but be based on the busy tick directly, as this is how 'top' is
        implemented.
      
      2.Or the logic must make sure that the idle_time is strictly increasing
        during each sample period, then there would be no negative idle_time
        anymore. This solution requires minimum modification to current code
        and this patch uses method 2.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821Reported-by: NJan Fikar <j.fikar@gmail.com>
      Signed-off-by: NChen Yu <yu.c.chen@intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      0df35026
  4. 10 12月, 2015 2 次提交
    • R
      cpufreq: governor: Use lockless timer function · 2dd3e724
      Rafael J. Wysocki 提交于
      It is possible to get rid of the timer_lock spinlock used by the
      governor timer function for synchronization, but a couple of races
      need to be avoided.
      
      The first race is between multiple dbs_timer_handler() instances
      that may be running in parallel with each other on different
      CPUs.  Namely, one of them has to queue up the work item, but it
      cannot be queued up more than once.  To achieve that,
      atomic_inc_return() can be used on the skip_work field of
      struct cpu_common_dbs_info.
      
      The second race is between an already running dbs_timer_handler()
      and gov_cancel_work().  In that case the dbs_timer_handler() might
      not notice the skip_work incrementation in gov_cancel_work() and
      it might queue up its work item after gov_cancel_work() had
      returned (and that work item would corrupt skip_work going
      forward).  To prevent that from happening, gov_cancel_work()
      can be made wait for the timer function to complete (on all CPUs)
      right after skip_work has been incremented.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      2dd3e724
    • V
      cpufreq: governor: replace per-CPU delayed work with timers · 70f43e5e
      Viresh Kumar 提交于
      cpufreq governors evaluate load at sampling rate and based on that they
      update frequency for a group of CPUs belonging to the same cpufreq
      policy.
      
      This is required to be done in a single thread for all policy->cpus, but
      because we don't want to wakeup idle CPUs to do just that, we use
      deferrable work for this. If we would have used a single delayed
      deferrable work for the entire policy, there were chances that the CPU
      required to run the handler can be in idle and we might end up not
      changing the frequency for the entire group with load variations.
      
      And so we were forced to keep per-cpu works, and only the one that
      expires first need to do the real work and others are rescheduled for
      next sampling time.
      
      We have been using the more complex solution until now, where we used a
      delayed deferrable work for this, which is a combination of a timer and
      a work.
      
      This could be made lightweight by keeping per-cpu deferred timers with a
      single work item, which is scheduled by the first timer that expires.
      
      This patch does just that and here are important changes:
      - The timer handler will run in irq context and so we need to use a
        spin_lock instead of the timer_mutex. And so a separate timer_lock is
        created. This also makes the use of the mutex and lock quite clear, as
        we know what exactly they are protecting.
      - A new field 'skip_work' is added to track when the timer handlers can
        queue a work. More comments present in code.
      Suggested-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: NAshwin Chaugule <ashwin.chaugule@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      70f43e5e
  5. 07 12月, 2015 2 次提交
  6. 02 11月, 2015 1 次提交
    • V
      cpufreq: governor: Quit work-handlers early if governor is stopped · 3a91b069
      Viresh Kumar 提交于
      gov_queue_work() acquires cpufreq_governor_lock to allow
      cpufreq_governor_stop() to drain delayed work items possibly scheduled
      on CPUs that share the policy with a CPU being taken offline.
      
      However, the same goal may be achieved in a more straightforward way if
      the policy pointer in the struct cpu_dbs_info matching the policy CPU is
      reset upfront by cpufreq_governor_stop() under the timer_mutex belonging
      to it and checked against NULL, under the same lock, at the beginning of
      dbs_timer().
      
      In that case every instance of dbs_timer() run for a struct cpu_dbs_info
      sharing the policy pointer in question after cpufreq_governor_stop() has
      started will notice that that pointer is NULL and bail out immediately
      without queuing up any new work items.  In turn, gov_cancel_work()
      called by cpufreq_governor_stop() before destroying timer_mutex will
      wait for all of the delayed work items currently running on the CPUs
      sharing the policy to drop the mutex, so it may be destroyed safely.
      
      Make cpufreq_governor_stop() and dbs_timer() work as described and
      modify gov_queue_work() so it does not acquire cpufreq_governor_lock any
      more.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      3a91b069
  7. 28 10月, 2015 1 次提交
  8. 26 9月, 2015 1 次提交
  9. 21 7月, 2015 4 次提交
  10. 18 7月, 2015 5 次提交
  11. 15 6月, 2015 3 次提交
    • V
      cpufreq: governor: Serialize governor callbacks · 732b6d61
      Viresh Kumar 提交于
      There are several races reported in cpufreq core around governors (only
      ondemand and conservative) by different people.
      
      There are at least two race scenarios present in governor code:
       (a) Concurrent access/updates of governor internal structures.
      
       It is possible that fields such as 'dbs_data->usage_count', etc.  are
       accessed simultaneously for different policies using same governor
       structure (i.e. CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag unset). And
       because of this we can dereference bad pointers.
      
       For example consider a system with two CPUs with separate 'struct
       cpufreq_policy' instances. CPU0 governor: ondemand and CPU1: powersave.
       CPU0 switching to powersave and CPU1 to ondemand:
      	CPU0				CPU1
      
      	store*				store*
      
      	cpufreq_governor_exit()		cpufreq_governor_init()
      					dbs_data = cdata->gdbs_data;
      
      	if (!--dbs_data->usage_count)
      		kfree(dbs_data);
      
      					dbs_data->usage_count++;
      					*Bad pointer dereference*
      
       There are other races possible between EXIT and START/STOP/LIMIT as
       well. Its really complicated.
      
       (b) Switching governor state in bad sequence:
      
       For example trying to switch a governor to START state, when the
       governor is in EXIT state. There are some checks present in
       __cpufreq_governor() but they aren't sufficient as they compare events
       against 'policy->governor_enabled', where as we need to take governor's
       state into account, which can be used by multiple policies.
      
      These two issues need to be solved separately and the responsibility
      should be properly divided between cpufreq and governor core.
      
      The first problem is more about the governor core, as it needs to
      protect its structures properly. And the second problem should be fixed
      in cpufreq core instead of governor, as its all about sequence of
      events.
      
      This patch is trying to solve only the first problem.
      
      There are two types of data we need to protect,
      - 'struct common_dbs_data': No matter what, there is going to be a
        single copy of this per governor.
      - 'struct dbs_data': With CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag set, we
        will have per-policy copy of this data, otherwise a single copy.
      
      Because of such complexities, the mutex present in 'struct dbs_data' is
      insufficient to solve our problem. For example we need to protect
      fetching of 'dbs_data' from different structures at the beginning of
      cpufreq_governor_dbs(), to make sure it isn't currently being updated.
      
      This can be fixed if we can guarantee serialization of event parsing
      code for an individual governor. This is best solved with a mutex per
      governor, and the placeholder for that is 'struct common_dbs_data'.
      
      And so this patch moves the mutex from 'struct dbs_data' to 'struct
      common_dbs_data' and takes it at the beginning and drops it at the end
      of cpufreq_governor_dbs().
      
      Tested with and without following configuration options:
      
      CONFIG_LOCKDEP_SUPPORT=y
      CONFIG_DEBUG_RT_MUTEXES=y
      CONFIG_DEBUG_PI_LIST=y
      CONFIG_DEBUG_SPINLOCK=y
      CONFIG_DEBUG_MUTEXES=y
      CONFIG_DEBUG_LOCK_ALLOC=y
      CONFIG_PROVE_LOCKING=y
      CONFIG_LOCKDEP=y
      CONFIG_DEBUG_ATOMIC_SLEEP=y
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      732b6d61
    • V
      cpufreq: governor: split cpufreq_governor_dbs() · 714a2d9c
      Viresh Kumar 提交于
      cpufreq_governor_dbs() is hardly readable, it is just too big and
      complicated. Lets make it more readable by splitting out event specific
      routines.
      
      Order of statements is changed at few places, but that shouldn't bring
      any functional change.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      714a2d9c
    • V
      cpufreq: governor: register notifier from cs_init() · 8e0484d2
      Viresh Kumar 提交于
      Notifiers are required only for conservative governor and the common
      governor code is unnecessarily polluted with that. Handle that from
      cs_init/exit() instead of cpufreq_governor_dbs().
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      8e0484d2
  12. 09 6月, 2014 1 次提交
    • V
      cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info' · c8ae481b
      Viresh Kumar 提交于
      'copy_prev_load' was recently added by commit: 18b46abd (cpufreq: governor: Be
      friendly towards latency-sensitive bursty workloads).
      
      It actually is a bit redundant as we also have 'prev_load' which can store any
      integer value and can be used instead of 'copy_prev_load' by setting it zero.
      
      True load can also turn out to be zero during long idle intervals (and hence the
      actual value of 'prev_load' and the overloaded value can clash). However this is
      not a problem because, if the true load was really zero in the previous
      interval, it makes sense to evaluate the load afresh for the current interval
      rather than copying the previous load.
      
      So, drop 'copy_prev_load' and use 'prev_load' instead.
      
      Update comments as well to make it more clear.
      
      There is another change here which was probably missed by Srivatsa during the
      last version of updates he made. The unlikely in the 'if' statement was covering
      only half of the condition and the whole line should actually come under it.
      
      Also checkpatch is made more silent as it was reporting this (--strict option):
      
      CHECK: Alignment should match open parenthesis
      +		if (unlikely(wall_time > (2 * sampling_rate) &&
      +						j_cdbs->prev_load)) {
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c8ae481b
  13. 08 6月, 2014 1 次提交
    • S
      cpufreq: governor: Be friendly towards latency-sensitive bursty workloads · 18b46abd
      Srivatsa S. Bhat 提交于
      Cpufreq governors like the ondemand governor calculate the load on the CPU
      periodically by employing deferrable timers. A deferrable timer won't fire
      if the CPU is completely idle (and there are no other timers to be run), in
      order to avoid unnecessary wakeups and thus save CPU power.
      
      However, the load calculation logic is agnostic to all this, and this can
      lead to the problem described below.
      
      Time (ms)               CPU 1
      
      100                Task-A running
      
      110                Governor's timer fires, finds load as 100% in the last
                         10ms interval and increases the CPU frequency.
      
      110.5              Task-A running
      
      120		   Governor's timer fires, finds load as 100% in the last
      		   10ms interval and increases the CPU frequency.
      
      125		   Task-A went to sleep. With nothing else to do, CPU 1
      		   went completely idle.
      
      200		   Task-A woke up and started running again.
      
      200.5		   Governor's deferred timer (which was originally programmed
      		   to fire at time 130) fires now. It calculates load for the
      		   time period 120 to 200.5, and finds the load is almost zero.
      		   Hence it decreases the CPU frequency to the minimum.
      
      210		   Governor's timer fires, finds load as 100% in the last
      		   10ms interval and increases the CPU frequency.
      
      So, after the workload woke up and started running, the frequency was suddenly
      dropped to absolute minimum, and after that, there was an unnecessary delay of
      10ms (sampling period) to increase the CPU frequency back to a reasonable value.
      And this pattern repeats for every wake-up-from-cpu-idle for that workload.
      This can be quite undesirable for latency- or response-time sensitive bursty
      workloads. So we need to fix the governor's logic to detect such wake-up-from-
      cpu-idle scenarios and start the workload at a reasonably high CPU frequency.
      
      One extreme solution would be to fake a load of 100% in such scenarios. But
      that might lead to undesirable side-effects such as frequency spikes (which
      might also need voltage changes) especially if the previous frequency happened
      to be very low.
      
      We just want to avoid the stupidity of dropping down the frequency to a minimum
      and then enduring a needless (and long) delay before ramping it up back again.
      So, let us simply carry forward the previous load - that is, let us just pretend
      that the 'load' for the current time-window is the same as the load for the
      previous window. That way, the frequency and voltage will continue to be set
      to whatever values they were set at previously. This means that bursty workloads
      will get a chance to influence the CPU frequency at which they wake up from
      cpu-idle, based on their past execution history. Thus, they might be able to
      avoid suffering from slow wakeups and long response-times.
      
      However, we should take care not to over-do this. For example, such a "copy
      previous load" logic will benefit cases like this: (where # represents busy
      and . represents idle)
      
      ##########.........#########.........###########...........##########........
      
      but it will be detrimental in cases like the one shown below, because it will
      retain the high frequency (copied from the previous interval) even in a mostly
      idle system:
      
      ##########.........#.................#.....................#...............
      
      (i.e., the workload finished and the remaining tasks are such that their busy
      periods are smaller than the sampling interval, which causes the timer to
      always get deferred. So, this will make the copy-previous-load logic copy
      the initial high load to subsequent idle periods over and over again, thus
      keeping the frequency high unnecessarily).
      
      So, we modify this copy-previous-load logic such that it is used only once
      upon every wakeup-from-idle. Thus if we have 2 consecutive idle periods, the
      previous load won't get blindly copied over; cpufreq will freshly evaluate the
      load in the second idle interval, thus ensuring that the system comes back to
      its normal state.
      
      [ The right way to solve this whole problem is to teach the CPU frequency
      governors to also track load on a per-task basis, not just a per-CPU basis,
      and then use both the data sources intelligently to set the appropriate
      frequency on the CPUs. But that involves redesigning the cpufreq subsystem,
      so this patch should make the situation bearable until then. ]
      
      Experimental results:
      +-------------------+
      
      I ran a modified version of ebizzy (called 'sleeping-ebizzy') that sleeps in
      between its execution such that its total utilization can be a user-defined
      value, say 10% or 20% (higher the utilization specified, lesser the amount of
      sleeps injected). This ebizzy was run with a single-thread, tied to CPU 8.
      
      Behavior observed with tracing (sample taken from 40% utilization runs):
      ------------------------------------------------------------------------
      
      Without patch:
      ~~~~~~~~~~~~~~
      kworker/8:2-12137  416.335742: cpu_frequency: state=2061000 cpu_id=8
      kworker/8:2-12137  416.335744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40753  416.345741: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      kworker/8:2-12137  416.345744: cpu_frequency: state=4123000 cpu_id=8
      kworker/8:2-12137  416.345746: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40753  416.355738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      <snip>  ---------------------------------------------------------------------  <snip>
            <...>-40753  416.402202: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
           <idle>-0      416.502130: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
            <...>-40753  416.505738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      kworker/8:2-12137  416.505739: cpu_frequency: state=2061000 cpu_id=8
      kworker/8:2-12137  416.505741: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40753  416.515739: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      kworker/8:2-12137  416.515742: cpu_frequency: state=4123000 cpu_id=8
      kworker/8:2-12137  416.515744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
      
      Observation: Ebizzy went idle at 416.402202, and started running again at
      416.502130. But cpufreq noticed the long idle period, and dropped the frequency
      at 416.505739, only to increase it back again at 416.515742, realizing that the
      workload is in-fact CPU bound. Thus ebizzy needlessly ran at the lowest frequency
      for almost 13 milliseconds (almost 1 full sample period), and this pattern
      repeats on every sleep-wakeup. This could hurt latency-sensitive workloads quite
      a lot.
      
      With patch:
      ~~~~~~~~~~~
      
      kworker/8:2-29802  464.832535: cpu_frequency: state=2061000 cpu_id=8
      <snip>  ---------------------------------------------------------------------  <snip>
      kworker/8:2-29802  464.962538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40738  464.972533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      kworker/8:2-29802  464.972536: cpu_frequency: state=4123000 cpu_id=8
      kworker/8:2-29802  464.972538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40738  464.982531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      <snip>  ---------------------------------------------------------------------  <snip>
      kworker/8:2-29802  465.022533: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40738  465.032531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      kworker/8:2-29802  465.032532: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40738  465.035797: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8
           <idle>-0      465.240178: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy
            <...>-40738  465.242533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      kworker/8:2-29802  465.242535: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy
            <...>-40738  465.252531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2
      
      Observation: Ebizzy went idle at 465.035797, and started running again at
      465.240178. Since ebizzy was the only real workload running on this CPU,
      cpufreq retained the frequency at 4.1Ghz throughout the run of ebizzy, no
      matter how many times ebizzy slept and woke-up in-between. Thus, ebizzy
      got the 10ms worth of 4.1 Ghz benefit during every sleep-wakeup (as compared
      to the run without the patch) and this boost gave a modest improvement in total
      throughput, as shown below.
      
      Sleeping-ebizzy records-per-second:
      -----------------------------------
      
      Utilization  Without patch  With patch  Difference (Absolute and % values)
          10%         274767        277046        +  2279 (+0.829%)
          20%         543429        553484        + 10055 (+1.850%)
          40%        1090744       1107959        + 17215 (+1.578%)
          60%        1634908       1662018        + 27110 (+1.658%)
      
      A rudimentary and somewhat approximately latency-sensitive workload such as
      sleeping-ebizzy itself showed a consistent, noticeable performance improvement
      with this patch. Hence, workloads that are truly latency-sensitive will benefit
      quite a bit from this change. Moreover, this is an overall win-win since this
      patch does not hurt power-savings at all (because, this patch does not reduce
      the idle time or idle residency; and the high frequency of the CPU when it goes
      to cpu-idle does not affect/hurt the power-savings of deep idle states).
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      18b46abd
  14. 20 5月, 2014 1 次提交
    • B
      cpufreq: remove race while accessing cur_policy · c5450db8
      Bibek Basu 提交于
      While accessing cur_policy during executing events
      CPUFREQ_GOV_START, CPUFREQ_GOV_STOP, CPUFREQ_GOV_LIMITS,
      same mutex lock is not taken, dbs_data->mutex, which leads
      to race and data corruption while running continious suspend
      resume test. This is seen with ondemand governor with suspend
      resume test using rtcwake.
      
       Unable to handle kernel NULL pointer dereference at virtual address 00000028
       pgd = ed610000
       [00000028] *pgd=adf11831, *pte=00000000, *ppte=00000000
       Internal error: Oops: 17 [#1] PREEMPT SMP ARM
       Modules linked in: nvhost_vi
       CPU: 1 PID: 3243 Comm: rtcwake Not tainted 3.10.24-gf5cf9e5 #1
       task: ee708040 ti: ed61c000 task.ti: ed61c000
       PC is at cpufreq_governor_dbs+0x400/0x634
       LR is at cpufreq_governor_dbs+0x3f8/0x634
       pc : [<c05652b8>] lr : [<c05652b0>] psr: 600f0013
       sp : ed61dcb0 ip : 000493e0 fp : c1cc14f0
       r10: 00000000 r9 : 00000000 r8 : 00000000
       r7 : eb725280 r6 : c1cc1560 r5 : eb575200 r4 : ebad7740
       r3 : ee708040 r2 : ed61dca8 r1 : 001ebd24 r0 : 00000000
       Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
       Control: 10c5387d Table: ad61006a DAC: 00000015
       [<c05652b8>] (cpufreq_governor_dbs+0x400/0x634) from [<c055f700>] (__cpufreq_governor+0x98/0x1b4)
       [<c055f700>] (__cpufreq_governor+0x98/0x1b4) from [<c0560770>] (__cpufreq_set_policy+0x250/0x320)
       [<c0560770>] (__cpufreq_set_policy+0x250/0x320) from [<c0561dcc>] (cpufreq_update_policy+0xcc/0x168)
       [<c0561dcc>] (cpufreq_update_policy+0xcc/0x168) from [<c0561ed0>] (cpu_freq_notify+0x68/0xdc)
       [<c0561ed0>] (cpu_freq_notify+0x68/0xdc) from [<c008eff8>] (notifier_call_chain+0x4c/0x8c)
       [<c008eff8>] (notifier_call_chain+0x4c/0x8c) from [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68)
       [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) from [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28)
       [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) from [<c00aac6c>] (pm_qos_update_bounded_target+0xd8/0x310)
       [<c00aac6c>] (pm_qos_update_bounded_target+0xd8/0x310) from [<c00ab3b0>] (__pm_qos_update_request+0x64/0x70)
       [<c00ab3b0>] (__pm_qos_update_request+0x64/0x70) from [<c004b4b8>] (tegra_pm_notify+0x114/0x134)
       [<c004b4b8>] (tegra_pm_notify+0x114/0x134) from [<c008eff8>] (notifier_call_chain+0x4c/0x8c)
       [<c008eff8>] (notifier_call_chain+0x4c/0x8c) from [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68)
       [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) from [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28)
       [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) from [<c00ac228>] (pm_notifier_call_chain+0x1c/0x34)
       [<c00ac228>] (pm_notifier_call_chain+0x1c/0x34) from [<c00ad38c>] (enter_state+0xec/0x128)
       [<c00ad38c>] (enter_state+0xec/0x128) from [<c00ad400>] (pm_suspend+0x38/0xa4)
       [<c00ad400>] (pm_suspend+0x38/0xa4) from [<c00ac114>] (state_store+0x70/0xc0)
       [<c00ac114>] (state_store+0x70/0xc0) from [<c027b1e8>] (kobj_attr_store+0x14/0x20)
       [<c027b1e8>] (kobj_attr_store+0x14/0x20) from [<c019cd9c>] (sysfs_write_file+0x104/0x184)
       [<c019cd9c>] (sysfs_write_file+0x104/0x184) from [<c0143038>] (vfs_write+0xd0/0x19c)
       [<c0143038>] (vfs_write+0xd0/0x19c) from [<c0143414>] (SyS_write+0x4c/0x78)
       [<c0143414>] (SyS_write+0x4c/0x78) from [<c000f080>] (ret_fast_syscall+0x0/0x30)
       Code: e1a00006 eb084346 e59b0020 e5951024 (e5903028)
       ---[ end trace 0488523c8f6b0f9d ]---
      Signed-off-by: NBibek Basu <bbasu@nvidia.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: 3.11+ <stable@vger.kernel.org> # 3.11+
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c5450db8
  15. 06 1月, 2014 1 次提交
    • J
      cpufreq: Fix timer/workqueue corruption by protecting reading governor_enabled · 6f1e4efd
      Jane Li 提交于
      When a CPU is hot removed we'll cancel all the delayed work items via
      gov_cancel_work(). Sometimes the delayed work function determines that
      it should adjust the delay for all other CPUs that the policy is
      managing. If this scenario occurs, the canceling CPU will cancel its own
      work but queue up the other CPUs works to run.
      
      Commit 3617f2 (cpufreq: Fix timer/workqueue corruption due to double
      queueing) has tried to fix this, but reading governor_enabled is not
      protected by cpufreq_governor_lock. Even though od_dbs_timer() checks
      governor_enabled before gov_queue_work(), this scenario may occur. For
      example:
      
       CPU0                                        CPU1
       ----                                        ----
       cpu_down()
        ...                                        <work runs>
        __cpufreq_remove_dev()                     od_dbs_timer()
         __cpufreq_governor()                       policy->governor_enabled
          policy->governor_enabled = false;
          cpufreq_governor_dbs()
           case CPUFREQ_GOV_STOP:
            gov_cancel_work(dbs_data, policy);
             cpu0 work is canceled
              timer is canceled
              cpu1 work is canceled
              <waits for cpu1>
                                                    gov_queue_work(*, *, true);
                                                     cpu0 work queued
                                                     cpu1 work queued
                                                     cpu2 work queued
                                                     ...
              cpu1 work is canceled
              cpu2 work is canceled
              ...
      
      At the end of the GOV_STOP case cpu0 still has a work queued to
      run although the code is expecting all of the works to be
      canceled. __cpufreq_remove_dev() will then proceed to
      re-initialize all the other CPUs works except for the CPU that is
      going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs()
      will trample over the queued work and debugobjects will spit out
      a warning:
      
      WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc()
      ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x14
      Modules linked in:
      CPU: 1 PID: 1205 Comm: sh Tainted: G        W    3.10.0 #200
      [<c01144f0>] (unwind_backtrace+0x0/0xf8) from [<c0111d98>] (show_stack+0x10/0x14)
      [<c0111d98>] (show_stack+0x10/0x14) from [<c01272cc>] (warn_slowpath_common+0x4c/0x68)
      [<c01272cc>] (warn_slowpath_common+0x4c/0x68) from [<c012737c>] (warn_slowpath_fmt+0x30/0x40)
      [<c012737c>] (warn_slowpath_fmt+0x30/0x40) from [<c034c640>] (debug_print_object+0x94/0xbc)
      [<c034c640>] (debug_print_object+0x94/0xbc) from [<c034c7f8>] (__debug_object_init+0xc8/0x3c0)
      [<c034c7f8>] (__debug_object_init+0xc8/0x3c0) from [<c01360e0>] (init_timer_key+0x20/0x104)
      [<c01360e0>] (init_timer_key+0x20/0x104) from [<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c)
      [<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c) from [<c04833a8>] (__cpufreq_governor+0x80/0x1b0)
      [<c04833a8>] (__cpufreq_governor+0x80/0x1b0) from [<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380)
      [<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380) from [<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c)
      [<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c) from [<c014fb40>] (notifier_call_chain+0x44/0x84)
      [<c014fb40>] (notifier_call_chain+0x44/0x84) from [<c012ae44>] (__cpu_notify+0x2c/0x48)
      [<c012ae44>] (__cpu_notify+0x2c/0x48) from [<c068dd40>] (_cpu_down+0x80/0x258)
      [<c068dd40>] (_cpu_down+0x80/0x258) from [<c068df40>] (cpu_down+0x28/0x3c)
      [<c068df40>] (cpu_down+0x28/0x3c) from [<c068e4c0>] (store_online+0x30/0x74)
      [<c068e4c0>] (store_online+0x30/0x74) from [<c03a7308>] (dev_attr_store+0x18/0x24)
      [<c03a7308>] (dev_attr_store+0x18/0x24) from [<c0256fe0>] (sysfs_write_file+0x100/0x180)
      [<c0256fe0>] (sysfs_write_file+0x100/0x180) from [<c01fec9c>] (vfs_write+0xbc/0x184)
      [<c01fec9c>] (vfs_write+0xbc/0x184) from [<c01ff034>] (SyS_write+0x40/0x68)
      [<c01ff034>] (SyS_write+0x40/0x68) from [<c010e200>] (ret_fast_syscall+0x0/0x48)
      
      In gov_queue_work(), lock cpufreq_governor_lock before gov_queue_work,
      and unlock it after __gov_queue_work(). In this way, governor_enabled
      is guaranteed not changed in gov_queue_work().
      Signed-off-by: NJane Li <jiel@marvell.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      6f1e4efd
  16. 16 11月, 2013 1 次提交
  17. 30 8月, 2013 1 次提交
    • S
      cpufreq: Don't use smp_processor_id() in preemptible context · 69320783
      Stephen Boyd 提交于
      Workqueues are preemptible even if works are queued on them with
      queue_work_on(). Let's use raw_smp_processor_id() here to silence
      the warning.
      
      BUG: using smp_processor_id() in preemptible [00000000] code: kworker/3:2/674
      caller is gov_queue_work+0x28/0xb0
      CPU: 0 PID: 674 Comm: kworker/3:2 Tainted: G        W    3.10.0 #30
      Workqueue: events od_dbs_timer
      [<c010c178>] (unwind_backtrace+0x0/0x11c) from [<c0109dec>] (show_stack+0x10/0x14)
      [<c0109dec>] (show_stack+0x10/0x14) from [<c03885a4>] (debug_smp_processor_id+0xbc/0xf0)
      [<c03885a4>] (debug_smp_processor_id+0xbc/0xf0) from [<c0635864>] (gov_queue_work+0x28/0xb0)
      [<c0635864>] (gov_queue_work+0x28/0xb0) from [<c0635618>] (od_dbs_timer+0x108/0x134)
      [<c0635618>] (od_dbs_timer+0x108/0x134) from [<c01aa8f8>] (process_one_work+0x25c/0x444)
      [<c01aa8f8>] (process_one_work+0x25c/0x444) from [<c01aaf88>] (worker_thread+0x200/0x344)
      [<c01aaf88>] (worker_thread+0x200/0x344) from [<c01b03bc>] (kthread+0xa0/0xb0)
      [<c01b03bc>] (kthread+0xa0/0xb0) from [<c01061b8>] (ret_from_fork+0x14/0x3c)
      Signed-off-by: NStephen Boyd <sboyd@codeaurora.org>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      69320783
  18. 29 8月, 2013 2 次提交
    • S
      cpufreq: governor: Fix typos in comments · c4afc410
      Stratos Karafotis 提交于
       - 'Governer' should be 'Governor'.
       - 'S' is used for Siemens (electrical conductance) in SI units,
         so use small 's' for seconds.
      Signed-off-by: NStratos Karafotis <stratosk@semaphore.gr>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      c4afc410
    • S
      cpufreq: Fix timer/workqueue corruption due to double queueing · 3617f2ca
      Stephen Boyd 提交于
      When a CPU is hot removed we'll cancel all the delayed work items
      via gov_cancel_work(). Normally this will just cancels a delayed
      timer on each CPU that the policy is managing and the work won't
      run, but if the work is already running the workqueue code will
      wait for the work to finish before continuing to prevent the
      work items from re-queuing themselves like they normally do. This
      scheme will work most of the time, except for the case where the
      work function determines that it should adjust the delay for all
      other CPUs that the policy is managing. If this scenario occurs,
      the canceling CPU will cancel its own work but queue up the other
      CPUs works to run. For example:
      
       CPU0                                        CPU1
       ----                                        ----
       cpu_down()
        ...
        __cpufreq_remove_dev()
         cpufreq_governor_dbs()
          case CPUFREQ_GOV_STOP:
           gov_cancel_work(dbs_data, policy);
            cpu0 work is canceled
             timer is canceled
             cpu1 work is canceled                    <work runs>
             <waits for cpu1>                         od_dbs_timer()
                                                       gov_queue_work(*, *, true);
       						  cpu0 work queued
       						  cpu1 work queued
      						  cpu2 work queued
      						  ...
             cpu1 work is canceled
             cpu2 work is canceled
             ...
      
      At the end of the GOV_STOP case cpu0 still has a work queued to
      run although the code is expecting all of the works to be
      canceled. __cpufreq_remove_dev() will then proceed to
      re-initialize all the other CPUs works except for the CPU that is
      going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs()
      will trample over the queued work and debugobjects will spit out
      a warning:
      
      WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc()
      ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x10
      Modules linked in:
      CPU: 0 PID: 1491 Comm: sh Tainted: G        W    3.10.0 #19
      [<c010c178>] (unwind_backtrace+0x0/0x11c) from [<c0109dec>] (show_stack+0x10/0x14)
      [<c0109dec>] (show_stack+0x10/0x14) from [<c01904cc>] (warn_slowpath_common+0x4c/0x6c)
      [<c01904cc>] (warn_slowpath_common+0x4c/0x6c) from [<c019056c>] (warn_slowpath_fmt+0x2c/0x3c)
      [<c019056c>] (warn_slowpath_fmt+0x2c/0x3c) from [<c0388a7c>] (debug_print_object+0x94/0xbc)
      [<c0388a7c>] (debug_print_object+0x94/0xbc) from [<c0388e34>] (__debug_object_init+0x2d0/0x340)
      [<c0388e34>] (__debug_object_init+0x2d0/0x340) from [<c019e3b0>] (init_timer_key+0x14/0xb0)
      [<c019e3b0>] (init_timer_key+0x14/0xb0) from [<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8)
      [<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8) from [<c06325a0>] (__cpufreq_governor+0xdc/0x1a4)
      [<c06325a0>] (__cpufreq_governor+0xdc/0x1a4) from [<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434)
      [<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434) from [<c08989f4>] (cpufreq_cpu_callback+0x60/0x80)
      [<c08989f4>] (cpufreq_cpu_callback+0x60/0x80) from [<c08a43c0>] (notifier_call_chain+0x38/0x68)
      [<c08a43c0>] (notifier_call_chain+0x38/0x68) from [<c01938e0>] (__cpu_notify+0x28/0x40)
      [<c01938e0>] (__cpu_notify+0x28/0x40) from [<c0892ad4>] (_cpu_down+0x7c/0x2c0)
      [<c0892ad4>] (_cpu_down+0x7c/0x2c0) from [<c0892d3c>] (cpu_down+0x24/0x40)
      [<c0892d3c>] (cpu_down+0x24/0x40) from [<c0893ea8>] (store_online+0x2c/0x74)
      [<c0893ea8>] (store_online+0x2c/0x74) from [<c04519d8>] (dev_attr_store+0x18/0x24)
      [<c04519d8>] (dev_attr_store+0x18/0x24) from [<c02a69d4>] (sysfs_write_file+0x100/0x148)
      [<c02a69d4>] (sysfs_write_file+0x100/0x148) from [<c0255c18>] (vfs_write+0xcc/0x174)
      [<c0255c18>] (vfs_write+0xcc/0x174) from [<c0255f70>] (SyS_write+0x38/0x64)
      [<c0255f70>] (SyS_write+0x38/0x64) from [<c0106120>] (ret_fast_syscall+0x0/0x30)
      Signed-off-by: NStephen Boyd <sboyd@codeaurora.org>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      3617f2ca
  19. 08 8月, 2013 2 次提交
  20. 26 7月, 2013 1 次提交
    • S
      cpufreq: ondemand: Change the calculation of target frequency · dfa5bb62
      Stratos Karafotis 提交于
      The ondemand governor calculates load in terms of frequency and
      increases it only if load_freq is greater than up_threshold
      multiplied by the current or average frequency.  This appears to
      produce oscillations of frequency between min and max because,
      for example, a relatively small load can easily saturate minimum
      frequency and lead the CPU to the max.  Then, it will decrease
      back to the min due to small load_freq.
      
      Change the calculation method of load and target frequency on the
      basis of the following two observations:
      
       - Load computation should not depend on the current or average
         measured frequency.  For example, absolute load of 80% at 100MHz
         is not necessarily equivalent to 8% at 1000MHz in the next
         sampling interval.
      
       - It should be possible to increase the target frequency to any
         value present in the frequency table proportional to the absolute
         load, rather than to the max only, so that:
      
         Target frequency = C * load
      
         where we take C = policy->cpuinfo.max_freq / 100.
      
      Tested on Intel i7-3770 CPU @ 3.40GHz and on Quad core 1500MHz Krait.
      Phoronix benchmark of Linux Kernel Compilation 3.1 test shows an
      increase ~1.5% in performance. cpufreq_stats (time_in_state) shows
      that middle frequencies are used more, with this patch.  Highest
      and lowest frequencies were used less by ~9%.
      
      [rjw: We have run multiple other tests on kernels with this
       change applied and in the vast majority of cases it turns out
       that the resulting performance improvement also leads to reduced
       consumption of energy.  The change is additionally justified by
       the overall simplification of the code in question.]
      Signed-off-by: NStratos Karafotis <stratosk@semaphore.gr>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      dfa5bb62
  21. 17 7月, 2013 1 次提交
    • S
      cpufreq: Revert commit 2f7021a8 to fix CPU hotplug regression · e8d05276
      Srivatsa S. Bhat 提交于
      commit 2f7021a8 "cpufreq: protect 'policy->cpus' from offlining
      during __gov_queue_work()" caused a regression in CPU hotplug,
      because it lead to a deadlock between cpufreq governor worker thread
      and the CPU hotplug writer task.
      
      Lockdep splat corresponding to this deadlock is shown below:
      
      [   60.277396] ======================================================
      [   60.277400] [ INFO: possible circular locking dependency detected ]
      [   60.277407] 3.10.0-rc7-dbg-01385-g241fd04-dirty #1744 Not tainted
      [   60.277411] -------------------------------------------------------
      [   60.277417] bash/2225 is trying to acquire lock:
      [   60.277422]  ((&(&j_cdbs->work)->work)){+.+...}, at: [<ffffffff810621b5>] flush_work+0x5/0x280
      [   60.277444] but task is already holding lock:
      [   60.277449]  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81042d8b>] cpu_hotplug_begin+0x2b/0x60
      [   60.277465] which lock already depends on the new lock.
      
      [   60.277472] the existing dependency chain (in reverse order) is:
      [   60.277477] -> #2 (cpu_hotplug.lock){+.+.+.}:
      [   60.277490]        [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200
      [   60.277503]        [<ffffffff815b6157>] mutex_lock_nested+0x67/0x410
      [   60.277514]        [<ffffffff81042cbc>] get_online_cpus+0x3c/0x60
      [   60.277522]        [<ffffffff814b842a>] gov_queue_work+0x2a/0xb0
      [   60.277532]        [<ffffffff814b7891>] cs_dbs_timer+0xc1/0xe0
      [   60.277543]        [<ffffffff8106302d>] process_one_work+0x1cd/0x6a0
      [   60.277552]        [<ffffffff81063d31>] worker_thread+0x121/0x3a0
      [   60.277560]        [<ffffffff8106ae2b>] kthread+0xdb/0xe0
      [   60.277569]        [<ffffffff815bb96c>] ret_from_fork+0x7c/0xb0
      [   60.277580] -> #1 (&j_cdbs->timer_mutex){+.+...}:
      [   60.277592]        [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200
      [   60.277600]        [<ffffffff815b6157>] mutex_lock_nested+0x67/0x410
      [   60.277608]        [<ffffffff814b785d>] cs_dbs_timer+0x8d/0xe0
      [   60.277616]        [<ffffffff8106302d>] process_one_work+0x1cd/0x6a0
      [   60.277624]        [<ffffffff81063d31>] worker_thread+0x121/0x3a0
      [   60.277633]        [<ffffffff8106ae2b>] kthread+0xdb/0xe0
      [   60.277640]        [<ffffffff815bb96c>] ret_from_fork+0x7c/0xb0
      [   60.277649] -> #0 ((&(&j_cdbs->work)->work)){+.+...}:
      [   60.277661]        [<ffffffff810ab826>] __lock_acquire+0x1766/0x1d30
      [   60.277669]        [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200
      [   60.277677]        [<ffffffff810621ed>] flush_work+0x3d/0x280
      [   60.277685]        [<ffffffff81062d8a>] __cancel_work_timer+0x8a/0x120
      [   60.277693]        [<ffffffff81062e53>] cancel_delayed_work_sync+0x13/0x20
      [   60.277701]        [<ffffffff814b89d9>] cpufreq_governor_dbs+0x529/0x6f0
      [   60.277709]        [<ffffffff814b76a7>] cs_cpufreq_governor_dbs+0x17/0x20
      [   60.277719]        [<ffffffff814b5df8>] __cpufreq_governor+0x48/0x100
      [   60.277728]        [<ffffffff814b6b80>] __cpufreq_remove_dev.isra.14+0x80/0x3c0
      [   60.277737]        [<ffffffff815adc0d>] cpufreq_cpu_callback+0x38/0x4c
      [   60.277747]        [<ffffffff81071a4d>] notifier_call_chain+0x5d/0x110
      [   60.277759]        [<ffffffff81071b0e>] __raw_notifier_call_chain+0xe/0x10
      [   60.277768]        [<ffffffff815a0a68>] _cpu_down+0x88/0x330
      [   60.277779]        [<ffffffff815a0d46>] cpu_down+0x36/0x50
      [   60.277788]        [<ffffffff815a2748>] store_online+0x98/0xd0
      [   60.277796]        [<ffffffff81452a28>] dev_attr_store+0x18/0x30
      [   60.277806]        [<ffffffff811d9edb>] sysfs_write_file+0xdb/0x150
      [   60.277818]        [<ffffffff8116806d>] vfs_write+0xbd/0x1f0
      [   60.277826]        [<ffffffff811686fc>] SyS_write+0x4c/0xa0
      [   60.277834]        [<ffffffff815bbbbe>] tracesys+0xd0/0xd5
      [   60.277842] other info that might help us debug this:
      
      [   60.277848] Chain exists of:
        (&(&j_cdbs->work)->work) --> &j_cdbs->timer_mutex --> cpu_hotplug.lock
      
      [   60.277864]  Possible unsafe locking scenario:
      
      [   60.277869]        CPU0                    CPU1
      [   60.277873]        ----                    ----
      [   60.277877]   lock(cpu_hotplug.lock);
      [   60.277885]                                lock(&j_cdbs->timer_mutex);
      [   60.277892]                                lock(cpu_hotplug.lock);
      [   60.277900]   lock((&(&j_cdbs->work)->work));
      [   60.277907]  *** DEADLOCK ***
      
      [   60.277915] 6 locks held by bash/2225:
      [   60.277919]  #0:  (sb_writers#6){.+.+.+}, at: [<ffffffff81168173>] vfs_write+0x1c3/0x1f0
      [   60.277937]  #1:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811d9e3c>] sysfs_write_file+0x3c/0x150
      [   60.277954]  #2:  (s_active#61){.+.+.+}, at: [<ffffffff811d9ec3>] sysfs_write_file+0xc3/0x150
      [   60.277972]  #3:  (x86_cpu_hotplug_driver_mutex){+.+...}, at: [<ffffffff81024cf7>] cpu_hotplug_driver_lock+0x17/0x20
      [   60.277990]  #4:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff815a0d32>] cpu_down+0x22/0x50
      [   60.278007]  #5:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81042d8b>] cpu_hotplug_begin+0x2b/0x60
      [   60.278023] stack backtrace:
      [   60.278031] CPU: 3 PID: 2225 Comm: bash Not tainted 3.10.0-rc7-dbg-01385-g241fd04-dirty #1744
      [   60.278037] Hardware name: Acer             Aspire 5741G    /Aspire 5741G    , BIOS V1.20 02/08/2011
      [   60.278042]  ffffffff8204e110 ffff88014df6b9f8 ffffffff815b3d90 ffff88014df6ba38
      [   60.278055]  ffffffff815b0a8d ffff880150ed3f60 ffff880150ed4770 3871c4002c8980b2
      [   60.278068]  ffff880150ed4748 ffff880150ed4770 ffff880150ed3f60 ffff88014df6bb00
      [   60.278081] Call Trace:
      [   60.278091]  [<ffffffff815b3d90>] dump_stack+0x19/0x1b
      [   60.278101]  [<ffffffff815b0a8d>] print_circular_bug+0x2b6/0x2c5
      [   60.278111]  [<ffffffff810ab826>] __lock_acquire+0x1766/0x1d30
      [   60.278123]  [<ffffffff81067e08>] ? __kernel_text_address+0x58/0x80
      [   60.278134]  [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200
      [   60.278142]  [<ffffffff810621b5>] ? flush_work+0x5/0x280
      [   60.278151]  [<ffffffff810621ed>] flush_work+0x3d/0x280
      [   60.278159]  [<ffffffff810621b5>] ? flush_work+0x5/0x280
      [   60.278169]  [<ffffffff810a9b14>] ? mark_held_locks+0x94/0x140
      [   60.278178]  [<ffffffff81062d77>] ? __cancel_work_timer+0x77/0x120
      [   60.278188]  [<ffffffff810a9cbd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      [   60.278196]  [<ffffffff81062d8a>] __cancel_work_timer+0x8a/0x120
      [   60.278206]  [<ffffffff81062e53>] cancel_delayed_work_sync+0x13/0x20
      [   60.278214]  [<ffffffff814b89d9>] cpufreq_governor_dbs+0x529/0x6f0
      [   60.278225]  [<ffffffff814b76a7>] cs_cpufreq_governor_dbs+0x17/0x20
      [   60.278234]  [<ffffffff814b5df8>] __cpufreq_governor+0x48/0x100
      [   60.278244]  [<ffffffff814b6b80>] __cpufreq_remove_dev.isra.14+0x80/0x3c0
      [   60.278255]  [<ffffffff815adc0d>] cpufreq_cpu_callback+0x38/0x4c
      [   60.278265]  [<ffffffff81071a4d>] notifier_call_chain+0x5d/0x110
      [   60.278275]  [<ffffffff81071b0e>] __raw_notifier_call_chain+0xe/0x10
      [   60.278284]  [<ffffffff815a0a68>] _cpu_down+0x88/0x330
      [   60.278292]  [<ffffffff81024cf7>] ? cpu_hotplug_driver_lock+0x17/0x20
      [   60.278302]  [<ffffffff815a0d46>] cpu_down+0x36/0x50
      [   60.278311]  [<ffffffff815a2748>] store_online+0x98/0xd0
      [   60.278320]  [<ffffffff81452a28>] dev_attr_store+0x18/0x30
      [   60.278329]  [<ffffffff811d9edb>] sysfs_write_file+0xdb/0x150
      [   60.278337]  [<ffffffff8116806d>] vfs_write+0xbd/0x1f0
      [   60.278347]  [<ffffffff81185950>] ? fget_light+0x320/0x4b0
      [   60.278355]  [<ffffffff811686fc>] SyS_write+0x4c/0xa0
      [   60.278364]  [<ffffffff815bbbbe>] tracesys+0xd0/0xd5
      [   60.280582] smpboot: CPU 1 is now offline
      
      The intention of that commit was to avoid warnings during CPU
      hotplug, which indicated that offline CPUs were getting IPIs from the
      cpufreq governor's work items.  But the real root-cause of that
      problem was commit a66b2e50 (cpufreq: Preserve sysfs files across
      suspend/resume) because it totally skipped all the cpufreq callbacks
      during CPU hotplug in the suspend/resume path, and hence it never
      actually shut down the cpufreq governor's worker threads during CPU
      offline in the suspend/resume path.
      
      Reflecting back, the reason why we never suspected that commit as the
      root-cause earlier, was that the original issue was reported with
      just the halt command and nobody had brought in suspend/resume to the
      equation.
      
      The reason for _that_ in turn, as it turns out, is that earlier
      halt/shutdown was being done by disabling non-boot CPUs while tasks
      were frozen, just like suspend/resume....  but commit cf7df378
      (reboot: migrate shutdown/reboot to boot cpu) which came somewhere
      along that very same time changed that logic: shutdown/halt no longer
      takes CPUs offline.  Thus, the test-cases for reproducing the bug
      were vastly different and thus we went totally off the trail.
      
      Overall, it was one hell of a confusion with so many commits
      affecting each other and also affecting the symptoms of the problems
      in subtle ways.  Finally, now since the original problematic commit
      (a66b2e50) has been completely reverted, revert this intermediate fix
      too (2f7021a8), to fix the CPU hotplug deadlock.  Phew!
      Reported-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Tested-by: NPeter Wu <lekensteyn@gmail.com>
      Cc: 3.10+ <stable@vger.kernel.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      e8d05276
  22. 28 6月, 2013 1 次提交