1. 18 Oct, 2013 1 commit
  2. 01 Sep, 2013 1 commit
    • nohz_full: Add full-system-idle state machine · 0edd1b17
      Paul E. McKenney committed
      This commit adds the state machine that takes the per-CPU idle data
      as input and produces a full-system-idle indication as output.  This
      state machine is driven out of RCU's quiescent-state-forcing
      mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
      idle state and then rcu_sysidle_report() to drive the state machine.
      
      The full-system-idle state is sampled using rcu_sys_is_idle(), which
      also drives the state machine if RCU is idle (and does so by forcing
      RCU to become non-idle).  This function returns true if all but the
      timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
      enough to avoid memory contention on the full_sysidle_state state
      variable.  The rcu_sysidle_force_exit() may be called externally
      to reset the state machine back into non-idle state.
      
      For large systems the state machine is driven out of RCU's
      force-quiescent-state logic, which provides good scalability at the price
      of millisecond-scale latencies on the transition to full-system-idle
      state.  This is not so good for battery-powered systems, which are usually
      small enough that they don't need to care about scalability, but which
      do care deeply about energy efficiency.  Small systems therefore drive
      the state machine directly out of the idle-entry code.  The number of
      CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
      Kconfig parameter, which defaults to 8.  Note that this is a build-time
      definition.
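The scan-driven progression described above can be pictured with a small user-space sketch. It is not the kernel's implementation: the state names, the `sysidle_*` helpers, and the one-step-per-scan policy are assumptions made for illustration. Only two ideas come from the commit message: every CPU except the timekeeping CPU must be idle, and a forced exit resets the machine.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states: the longer the system stays idle, the further we advance. */
enum sysidle_state { SYSIDLE_NOT, SYSIDLE_SHORT, SYSIDLE_LONG, SYSIDLE_FULL };

struct sysidle_machine {
	enum sysidle_state state;
};

/* True when every CPU other than the timekeeping CPU reports idle. */
bool sysidle_all_others_idle(const bool *cpu_idle, size_t ncpus, size_t timekeeper)
{
	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		if (cpu != timekeeper && !cpu_idle[cpu])
			return false;
	}
	return true;
}

/* One scan: advance a single state if all others are idle, otherwise reset. */
enum sysidle_state sysidle_scan(struct sysidle_machine *m, const bool *cpu_idle,
				size_t ncpus, size_t timekeeper)
{
	if (!sysidle_all_others_idle(cpu_idle, ncpus, timekeeper))
		m->state = SYSIDLE_NOT;
	else if (m->state != SYSIDLE_FULL)
		m->state = (enum sysidle_state)(m->state + 1);
	return m->state;
}

/* Sample the full-system-idle indication. */
bool sysidle_is_idle(const struct sysidle_machine *m)
{
	return m->state == SYSIDLE_FULL;
}

/* External reset back to the non-idle state. */
void sysidle_force_exit(struct sysidle_machine *m)
{
	m->state = SYSIDLE_NOT;
}
```

Requiring several consecutive idle scans before reporting full idle is one way to model "idle long enough"; the real code may use time thresholds instead.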
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      [ paulmck: Simplify logic and provide better comments for memory barriers,
        based on review comments and questions by Lai Jiangshan. ]
  3. 29 Aug, 2013 1 commit
  4. 19 Aug, 2013 1 commit
  5. 16 Aug, 2013 1 commit
  6. 14 Aug, 2013 3 commits
    • nohz: Optimize full dynticks's sched hooks with static keys · d13508f9
      Frederic Weisbecker committed
      Scheduler IPIs and task context switches are serious fast paths.
      Let's use static keys to hide, as much as we can, the overhead of
      the off case of the full dynticks APIs that are called from these
      sites.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • nohz: Optimize full dynticks state checks with static keys · 460775df
      Frederic Weisbecker committed
      These APIs are frequently accessed and priority is given
      to optimize the full dynticks off-case in order to let
      distros enable this feature without suffering from
      significant performance regressions.
      
      Let's inline these APIs and optimize them with static keys.
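As a rough user-space analogy for the inlined check, the sketch below guards an out-of-line slow path behind an inline test of a global flag. Real static keys go further by patching the branch in the instruction stream, so the off case does not even load a variable. Here `full_dynticks_enabled`, `full_dynticks_check()`, and `full_dynticks_slow_check()` are hypothetical names, and `__builtin_expect` is a GCC/Clang builtin used only as a branch hint.

```c
#include <stdbool.h>

/* Stand-in for a static key: false unless full dynticks is in use. */
bool full_dynticks_enabled;

/* Counts slow-path invocations so the effect is observable. */
int slow_path_calls;

/* Out-of-line slow path, reached only when the feature is on. */
void full_dynticks_slow_check(int cpu)
{
	(void)cpu;
	slow_path_calls++;
}

/* Inlined fast path: the off case costs one predicted-not-taken branch. */
static inline void full_dynticks_check(int cpu)
{
	if (__builtin_expect(full_dynticks_enabled, 0))
		full_dynticks_slow_check(cpu);
}
```

The point of the split is that callers on hot paths pay almost nothing when the feature is disabled, which is the case distros care about.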
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • nohz: Rename a few state variables · 73867dcd
      Frederic Weisbecker committed
      Rename the full dynticks's cpumask and cpumask state variables
      to some more exportable names.
      
      These will be used later from global headers to optimize
      the main full dynticks APIs in conjunction with static keys.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
  7. 13 Aug, 2013 2 commits
    • context_tracking: Remove full dynticks' hacky dependency on wide context tracking · d84d27a4
      Frederic Weisbecker committed
      Now that the full dynticks subsystem only enables the context tracking
      on full dynticks CPUs, let's remove the dependency on CONTEXT_TRACKING_FORCE.
      
      This dependency was a hack to enable the context tracking widely for the
      full dynticks subsystem until the latter became able to enable it in a
      more fine-grained, per-CPU fashion.
      
      Now CONTEXT_TRACKING_FORCE only stands for testing on archs that
      work on support for the context tracking while full dynticks can't be
      used yet due to unmet dependencies. It simulates a system where all CPUs
      are full dynticks so that RCU user extended quiescent states and dynticks
      cputime accounting can be tested on the given arch.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • nohz: Only enable context tracking on full dynticks CPUs · 2e709338
      Frederic Weisbecker committed
      The context tracking subsystem has the ability to selectively
      enable the tracking on any defined subset of CPUs. This means that
      we can define a CPU range that doesn't run the context tracking
      and another range that does.
      
      Now what we want in practice is to enable the tracking on full
      dynticks CPUs only. In order to perform this, we just need to pass
      our full dynticks CPU range selection from the full dynticks
      subsystem to the context tracking.
      
      This way we can spare the overhead of RCU user extended quiescent
      state and vtime maintenance on the CPUs that are outside the
      full dynticks range. Just keep in mind the raw context tracking
      itself is still necessary everywhere.
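The per-CPU selection can be sketched in user space as a mask handed from one subsystem to the other. All names here (`ctx_tracking`, `ctx_tracking_set_range()`, `SKETCH_MAX_CPUS`) are hypothetical, and a 64-bit integer stands in for a cpumask; the kernel uses its own cpumask and per-CPU types.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_CPUS 64

/* Per-CPU flag: is the expensive tracking (RCU user EQS, vtime) active? */
struct ctx_tracking {
	bool active[SKETCH_MAX_CPUS];
};

/* Pass the full dynticks CPU range to the context tracking. */
void ctx_tracking_set_range(struct ctx_tracking *ct, uint64_t full_dynticks_mask)
{
	for (int cpu = 0; cpu < SKETCH_MAX_CPUS; cpu++)
		ct->active[cpu] = (full_dynticks_mask >> cpu) & 1u;
}

/* CPUs outside the range skip the extra work. */
bool ctx_tracking_needed(const struct ctx_tracking *ct, int cpu)
{
	return cpu >= 0 && cpu < SKETCH_MAX_CPUS && ct->active[cpu];
}
```

With a mask of `0x0C`, only CPUs 2 and 3 would pay for the extended tracking; the others keep just the raw tracking mentioned in the message.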
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
  8. 29 Jul, 2013 1 commit
    • Revert "cpuidle: Quickly notice prediction failure for repeat mode" · 14851912
      Rafael J. Wysocki committed
      Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for
      repeat mode), because it has been identified as the source of a
      significant performance regression in v3.8 and later as explained by
      Jeremy Eder:
      
        We believe we've identified a particular commit to the cpuidle code
        that seems to be impacting performance of variety of workloads.
        The simplest way to reproduce is using netperf TCP_RR test, so
        we're using that, on a pair of Sandy Bridge based servers.  We also
        have data from a large database setup where performance is also
        measurably/positively impacted, though that test data isn't easily
        share-able.
      
        Included below are test results from 3 test kernels:
      
        kernel       reverts
        -----------------------------------------------------------
        1) vanilla   upstream (no reverts)
      
        2) perfteam2 reverts e11538d1
      
        3) test      reverts 69a37bea
                             e11538d1
      
        In summary, netperf TCP_RR numbers improve by approximately 4%
        after reverting 69a37bea.  When
        69a37bea is included, C0 residency
        never seems to get above 40%.  Taking that patch out gets C0 near
        100% quite often, and performance increases.
      
        The below data are histograms representing the %c0 residency @
        1-second sample rates (using turbostat), while under netperf test.
      
        - If you look at the first 4 histograms, you can see %c0 residency
          almost entirely in the 30,40% bin.
        - The last pair, which reverts 69a37bea,
          shows %c0 in the 80,90,100% bins.
      
        Below each kernel name are netperf TCP_RR trans/s numbers for the
        particular kernel that can be disclosed publicly, comparing the 3
        test kernels.  We ran a 4th test with the vanilla kernel where
        we've also set /dev/cpu_dma_latency=0 to show overall impact
        boosting single-threaded TCP_RR performance over 11% above
        baseline.
      
        3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
        TCP_RR trans/s 54323.78
      
        -----------------------------------------------------------
        3.10-rc2 vanilla RX (no reverts)
        TCP_RR trans/s 48192.47
      
        Receiver %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    59]:
        ***********************************************************
           40.0000 -    50.0000 [     1]: *
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        Sender %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    11]: ***********
           40.0000 -    50.0000 [    49]:
        *************************************************
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        -----------------------------------------------------------
        3.10-rc2 perfteam2 RX (reverts commit
        e11538d1)
        TCP_RR trans/s 49698.69
      
        Receiver %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     1]: *
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    59]:
        ***********************************************************
           40.0000 -    50.0000 [     0]:
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        Sender %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [     2]: **
           40.0000 -    50.0000 [    58]:
        **********************************************************
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        -----------------------------------------------------------
        3.10-rc2 test RX (reverts 69a37bea
        and e11538d1)
        TCP_RR trans/s 47766.95
      
        Receiver %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     1]: *
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    27]: ***************************
           40.0000 -    50.0000 [     2]: **
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     2]: **
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [    28]: ****************************
      
        Sender:
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    11]: ***********
           40.0000 -    50.0000 [     0]:
           50.0000 -    60.0000 [     1]: *
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     3]: ***
           80.0000 -    90.0000 [     7]: *******
           90.0000 -   100.0000 [    38]: **************************************
      
        These results demonstrate gaining back the tendency of the CPU to
        stay in more responsive, performant C-states (and thus yield
        measurably better performance), by reverting commit
        69a37bea.
      Requested-by: Jeremy Eder <jeder@redhat.com>
      Tested-by: Len Brown <len.brown@intel.com>
      Cc: 3.8+ <stable@vger.kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  9. 25 Jul, 2013 2 commits
    • nohz: fix compile warning in tick_nohz_init() · ca06416b
      Li Zhong committed
      cpu is not used after commit 5b8621a6
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • nohz: Do not warn about unstable tsc unless user uses nohz_full · 543487c7
      Steven Rostedt committed
      If the user enables CONFIG_NO_HZ_FULL and runs the kernel on a machine
      with an unstable TSC, it will produce a WARN_ON dump as well as taint
      the kernel. This is a bit extreme for a kernel that just enables a
      feature but doesn't use it.
      
      The warning should only happen if the user tries to use the feature by
      either adding nohz_full to the kernel command line, or by enabling
      CONFIG_NO_HZ_FULL_ALL that makes nohz used on all CPUs at boot up. Note,
      this second feature should not (yet) be used by distros or anyone that
      doesn't care if NO_HZ is used or not.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  10. 23 Jul, 2013 1 commit
  11. 15 Jul, 2013 1 commit
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Paul Gortmaker committed
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  12. 12 Jul, 2013 1 commit
    • tick: broadcast: Check broadcast mode on CPU hotplug · a272dcca
      Stephen Boyd committed
      On ARM systems the dummy clockevent is registered with the cpu
      hotplug notifier chain before any other per-cpu clockevent. This
      has the side-effect of causing the dummy clockevent to be
      registered first in every hotplug sequence. Because the dummy is
      first, we'll try to turn the broadcast source on but the code in
      tick_device_uses_broadcast() assumes the broadcast source is in
      periodic mode and calls tick_broadcast_start_periodic()
      unconditionally.
      
      On boot this isn't a problem because we typically haven't
      switched into oneshot mode yet (if at all). During hotplug, if
      the broadcast source isn't in periodic mode we'll replace the
      broadcast oneshot handler with the broadcast periodic handler and
      start emulating oneshot mode when we shouldn't. Due to the way
      the broadcast oneshot handler programs the next_event it's
      possible for it to contain KTIME_MAX and cause us to hang the
      system when the periodic handler tries to program the next tick.
      Fix this by using the appropriate function to start the broadcast
      source.
      Reported-by: Stephen Warren <swarren@nvidia.com>
      Tested-by: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Mark Rutland <Mark.Rutland@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: ARM kernel mailing list <linux-arm-kernel@lists.infradead.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joseph Lo <josephl@nvidia.com>
      Link: http://lkml.kernel.org/r/20130711140059.GA27430@codeaurora.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 05 Jul, 2013 1 commit
    • clocksource: Reselect clocksource when watchdog validated high-res capability · 332962f2
      Thomas Gleixner committed
      Up to commit 5d33b883 (clocksource: Always verify highres capability)
      we had no sanity check when selecting a clocksource that would prevent
      a non-highres-capable clocksource from being used when the system had
      already switched to highres/nohz mode.
      
      The new sanity check works as Alex and Tim found out. It prevents the
      TSC from being used. This happens because on x86 the boot process
      looks like this:
      
       tsc_start_freqency_validation(TSC);
       clocksource_register(HPET);
       clocksource_done_booting();
      	clocksource_select()
      		Selects HPET which is valid for high-res
      
       switch_to_highres();
      
       clocksource_register(TSC);
       	TSC is not selected, because it is not yet
      	flagged as VALID_HIGH_RES
      
       clocksource_watchdog()
      	Validates TSC for highres, but that does not make TSC
      	the current clocksource.
      
      Before the sanity check was added, we installed TSC unvalidated which
      worked most of the time. If the TSC was really detected as unstable,
      then the unstable logic removed it and installed HPET again.
      
      The sanity check is correct and needed. So the watchdog needs to kick
      a reselection of the clocksource, when it qualifies TSC as a valid
      high res clocksource.
      
      To solve this, we mark the clocksource which got the flag
      CLOCK_SOURCE_VALID_FOR_HRES set by the watchdog with a new flag
      CLOCK_SOURCE_RESELECT and trigger the watchdog thread. The watchdog
      thread evaluates the flag and invokes clocksource_select() when set.
      
      To avoid that the clocksource_done_booting() code, which is about to
      install the first real clocksource anyway, needs to go through
      clocksource_select and tick_oneshot_notify() pointlessly, split out
      the clocksource_watchdog_kthread() list walk code and invoke the
      select/notify only when called from clocksource_watchdog_kthread().
      
      So clocksource_done_booting() can utilize the same splitout code
      without the select/notify invocation and the clocksource_mutex
      unlock/relock dance.
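The flag handshake between the watchdog and its thread can be sketched in user space as follows. This is a simplified model, not the kernel code: the `cs_*` helpers, the `clocksource_sketch` type, and the `CS_*` flag values are hypothetical, and locking and the notify step are left out.

```c
#include <stddef.h>

#define CS_VALID_FOR_HRES 0x1u	/* source may be used in high-res mode */
#define CS_RESELECT       0x2u	/* watchdog asks for a new selection */

struct clocksource_sketch {
	const char *name;
	int rating;
	unsigned int flags;
};

/* Pick the highest-rated source that is valid for high-res mode. */
const struct clocksource_sketch *cs_select(const struct clocksource_sketch *list, size_t n)
{
	const struct clocksource_sketch *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!(list[i].flags & CS_VALID_FOR_HRES))
			continue;
		if (!best || list[i].rating > best->rating)
			best = &list[i];
	}
	return best;
}

/* Watchdog: validate a source and request a reselection. */
void cs_watchdog_validate(struct clocksource_sketch *cs)
{
	cs->flags |= CS_VALID_FOR_HRES | CS_RESELECT;
}

/* Watchdog thread: consume the request flags, reselect only if one was set. */
const struct clocksource_sketch *cs_watchdog_thread(struct clocksource_sketch *list, size_t n,
						    const struct clocksource_sketch *cur)
{
	int reselect = 0;

	for (size_t i = 0; i < n; i++) {
		if (list[i].flags & CS_RESELECT) {
			list[i].flags &= ~CS_RESELECT;
			reselect = 1;
		}
	}
	return reselect ? cs_select(list, n) : cur;
}
```

With an already valid source and a higher-rated one that starts unvalidated, the first selection picks the valid one; once the watchdog validates the other, the next thread pass switches over.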
      Reported-and-tested-by: Alex Shi <alex.shi@intel.com>
      Cc: Hans Peter Anvin <hpa@linux.intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307042239150.11637@ionos.tec.linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 02 Jul, 2013 3 commits
    • tick: Sanitize broadcast control logic · 07bd1172
      Thomas Gleixner committed
      The recent implementation of a generic dummy timer resulted in a
      different registration order of per cpu local timers which made the
      broadcast control logic go belly up.
      
      If the dummy timer is the first clock event device which is registered
      for a CPU, then it is installed, the broadcast timer is initialized
      and the CPU is marked as broadcast target.
      
      If a real clock event device is installed after that, we can fail to
      take the CPU out of the broadcast mask. In the worst case we end up
      with two periodic timer events firing for the same CPU. One from the
      per cpu hardware device and one from the broadcast.
      
      Now the problem is that we have no way to distinguish whether the
      system is in a state which makes broadcasting necessary or the
      broadcast bit was set due to the nonfunctional dummy timer
      installment.
      
      To solve this we need to keep track of the system state separately and
      provide a more detailed decision logic whether we keep the CPU in
      broadcast mode or not.
      
      The old decision logic only clears the broadcast mode, if the newly
      installed clock event device is not affected by power states.
      
      The new logic clears the broadcast mode if one of the following is
      true:
      
        - The new device is not affected by power states.
      
        - The system is not in a power state affected mode
      
        - The system has switched to oneshot mode. The oneshot broadcast is
          controlled from the deep idle state. The CPU is not in idle at
          this point, so it's safe to remove it from the mask.
      
      If we clear the broadcast bit for the CPU when a new device is
      installed, we also shut down the broadcast device when this was the
      last CPU in the broadcast mask.
      
      If the broadcast bit is kept, then we leave the new device in shutdown
      state and rely on the broadcast to deliver the timer interrupts via
      the broadcast ipis.
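The old and new decision rules listed above reduce to two small predicates. The sketch below is only a restatement of those rules in C; the struct and function names are hypothetical and do not correspond to kernel symbols.

```c
#include <stdbool.h>

/* Inputs to the decision, one flag per condition named in the text. */
struct bc_inputs {
	bool dev_affected_by_power_states;	/* new device stops in deep idle */
	bool system_in_power_state_mode;	/* system state needs broadcasting */
	bool system_in_oneshot_mode;		/* system has switched to oneshot */
};

/* Old rule: clear only if the new device is unaffected by power states. */
bool old_clears_broadcast(const struct bc_inputs *in)
{
	return !in->dev_affected_by_power_states;
}

/* New rule: clear if any one of the three listed conditions holds. */
bool new_clears_broadcast(const struct bc_inputs *in)
{
	return !in->dev_affected_by_power_states ||
	       !in->system_in_power_state_mode ||
	       in->system_in_oneshot_mode;
}
```

The case the old rule missed is a power-state-affected device installed while the system is not in a power-state-affected mode: the old predicate keeps the CPU in broadcast mode, the new one releases it.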
      Reported-and-tested-by: Stehle Vincent-B46079 <B46079@freescale.com>
      Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307012153060.4013@ionos.tec.linutronix.de
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • tick: Prevent uncontrolled switch to oneshot mode · 1f73a980
      Thomas Gleixner committed
      When the system switches from periodic to oneshot mode, the broadcast
      logic causes a possibility that a CPU which has not yet switched to
      oneshot mode puts its own clock event device into oneshot mode without
      updating the state and the timer handler.
      
      CPU0				CPU1
      				per cpu tickdev is in periodic mode
      				and switched to broadcast
      
      Switch to oneshot mode
       tick_broadcast_switch_to_oneshot()
        cpumask_copy(tick_oneshot_broacast_mask,
      	       tick_broadcast_mask);
      
        broadcast device mode = oneshot
      
      				Timer interrupt
      						
      				irq_enter()
      				 tick_check_oneshot_broadcast()
      				  dev->set_mode(ONESHOT);
      
      				tick_handle_periodic()
      				 if (dev->mode == ONESHOT)
      				   dev->next_event += period;
      				   FAIL.
      
      We fail, because dev->next_event contains KTIME_MAX, if the device was
      in periodic mode before the uncontrolled switch to oneshot happened.
      
      We must copy the broadcast bits over to the oneshot mask, because
      otherwise a CPU which relies on the broadcast would not be woken up
      anymore after the broadcast device switched to oneshot mode.
      
      So we need to verify in tick_check_oneshot_broadcast() whether the CPU
      has already switched to oneshot mode. If not, leave the device
      untouched and let the CPU switch controlled into oneshot mode.
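A minimal user-space sketch of that guard follows. It models a CPU's own tick mode separately from the mode programmed into its device; the `cpu_tick` struct, its fields, and `check_oneshot_broadcast()` are hypothetical stand-ins rather than the kernel's data structures.

```c
#include <stdbool.h>

enum tick_mode { MODE_PERIODIC, MODE_ONESHOT };

struct cpu_tick {
	enum tick_mode cpu_mode;	/* mode the CPU's tick code has switched to */
	enum tick_mode dev_mode;	/* mode programmed into the clock event device */
	bool in_oneshot_broadcast_mask;
};

/* Returns true if the device was put into oneshot mode. */
bool check_oneshot_broadcast(struct cpu_tick *t)
{
	if (!t->in_oneshot_broadcast_mask)
		return false;

	/* The CPU has not switched yet: leave its device untouched. */
	if (t->cpu_mode != MODE_ONESHOT)
		return false;

	t->dev_mode = MODE_ONESHOT;
	return true;
}
```

Without the `cpu_mode` test, a CPU still running the periodic handler could find its device in oneshot mode, which is the failure shown in the diagram.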
      
      This is a long standing bug, which was never noticed, because the main
      user of the broadcast, x86, cannot run into that scenario, AFAICT. The
      nonarchitected timer mess of ARM creates a gazillion of differently
      broken abominations which trigger the shortcomings of that broadcast
      code, which better had never been necessary in the first place.
      Reported-and-tested-by: Stehle Vincent-B46079 <B46079@freescale.com>
      Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307012153060.4013@ionos.tec.linutronix.de
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • tick: Make oneshot broadcast robust vs. CPU offlining · c9b5a266
      Thomas Gleixner committed
      In periodic mode we remove offline cpus from the broadcast propagation
      mask. In oneshot mode we fail to do so. This was not a problem so far,
      but the recent changes to the broadcast propagation introduced a
      constellation which can result in a NULL pointer dereference.
      
      What happens is:
      
      CPU0			CPU1
      			idle()
      			  arch_idle()
      			    tick_broadcast_oneshot_control(OFF);
      			      set cpu1 in tick_broadcast_force_mask
      			  if (cpu_offline())
      			     arch_cpu_dead()
      
      cpu_dead_cleanup(cpu1)
       cpu1 tickdevice pointer = NULL
      
      broadcast interrupt
        dereference cpu1 tickdevice pointer -> OOPS
      
      We dereference the pointer because cpu1 is still set in
      tick_broadcast_force_mask and tick_do_broadcast() expects a valid
      cpumask and therefore lacks any further checks.
      
      Remove the cpu from the tick_broadcast_force_mask before we set the
      tick device pointer to NULL. Also add a sanity check to the oneshot
      broadcast function, so we can detect such issues w/o crashing the
      machine.
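The two parts of the fix, ordering the cleanup and adding a sanity check, can be sketched in user space as below. The `bc` struct, `bc_cpu_dead()`, and `bc_broadcast()` are hypothetical; a 32-bit integer stands in for the force mask and the sketch ignores concurrency entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 8

struct tick_dev {
	int fired;	/* how many broadcasts this device has received */
};

struct bc {
	uint32_t force_mask;			/* CPUs that must get the broadcast */
	struct tick_dev *dev[SKETCH_NR_CPUS];	/* per-CPU tick device pointers */
};

/* Offline cleanup: clear the mask bit first, then drop the pointer. */
void bc_cpu_dead(struct bc *b, int cpu)
{
	b->force_mask &= ~(UINT32_C(1) << cpu);
	b->dev[cpu] = NULL;
}

/* Broadcast with a sanity check; returns how many CPUs had no device. */
int bc_broadcast(struct bc *b)
{
	int skipped = 0;

	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (!(b->force_mask & (UINT32_C(1) << cpu)))
			continue;
		if (!b->dev[cpu]) {	/* detect the problem instead of crashing */
			skipped++;
			continue;
		}
		b->dev[cpu]->fired++;
	}
	return skipped;
}
```

A nonzero return value corresponds to the situation the sanity check is meant to flag: a mask bit still set for a CPU whose device pointer is already gone.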
      Reported-by: Prarit Bhargava <prarit@redhat.com>
      Cc: athorlton@sgi.com
      Cc: CAI Qian <caiqian@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1306261303260.4013@ionos.tec.linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  15. 29 Jun, 2013 2 commits
  16. 25 Jun, 2013 1 commit
    • clockevents: Prefer CPU local devices over global devices · 70e5975d
      Stephen Boyd committed
      On an SMP system with only one global clockevent and a dummy
      clockevent per CPU we run into problems. We want the dummy
      clockevents to be registered as the per CPU tick devices, but
      we can only achieve that if we register the dummy clockevents
      before the global clockevent or if we artificially inflate the
      rating of the dummy clockevents to be higher than the rating
      of the global clockevent. Failure to do so leads to boot
      hangs when the dummy timers are registered on all other CPUs
      besides the CPU that accepted the global clockevent as its tick
      device and there is no broadcast timer to poke the dummy
      devices.
      
      If we're registering multiple clockevents and one clockevent is
      global and the other is local to a particular CPU we should
      choose to use the local clockevent regardless of the rating of
      the device. This way, if the clockevent is a dummy it will take
      the tick device duty as long as there isn't a higher rated tick
      device and any global clockevent will be bumped out into
      broadcast mode, fixing the problem described above.
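The preference rule can be expressed as a small comparison function. This is a sketch under assumptions: `clockevent_sketch` and `prefer_new_device()` are hypothetical names, the ratings in any usage are made up, and the real selection code checks more than locality and rating.

```c
#include <stdbool.h>
#include <stddef.h>

struct clockevent_sketch {
	int rating;
	bool cpu_local;	/* true for a per-CPU device, false for a global one */
};

/* Returns true if newdev should replace curdev as the CPU's tick device. */
bool prefer_new_device(const struct clockevent_sketch *curdev,
		       const struct clockevent_sketch *newdev)
{
	if (!curdev)
		return true;

	/* Locality wins regardless of rating. */
	if (newdev->cpu_local != curdev->cpu_local)
		return newdev->cpu_local;

	/* Same locality: fall back to comparing ratings. */
	return newdev->rating > curdev->rating;
}
```

Under this rule a low-rated per-CPU dummy displaces a high-rated global device, which then becomes available for broadcast duty, while a better per-CPU device can still displace the dummy later.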
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Tested-by: soren.brinkmann@xilinx.com
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/20130613183950.GA32061@codeaurora.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  17. 21 Jun, 2013 1 commit
    • tick: Fix tick_broadcast_pending_mask not cleared · ea8deb8d
      Daniel Lezcano committed
      The recent modification in the cpuidle framework consolidated the
      timer broadcast code across the different drivers by setting a new
      flag in the idle state. It tells the cpuidle core code to enter/exit
      the broadcast mode for the cpu when entering a deep idle state. The
      broadcast timer enter/exit is no longer handled by the back-end
      driver.
      
      This change caused local interrupts to be enabled *before* calling
      CLOCK_EVENT_NOTIFY_EXIT.
      
      On a tegra114, a four-core system, the following warning appeared
      once the flag had been introduced in the driver:
      
      WARNING: at kernel/time/tick-broadcast.c:578 tick_broadcast_oneshot_control
      CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-rc3-next-20130529+ #15
      [<c00667f8>] (tick_broadcast_oneshot_control+0x1a4/0x1d0) from [<c0065cd0>] (tick_notify+0x240/0x40c)
      [<c0065cd0>] (tick_notify+0x240/0x40c) from [<c0044724>] (notifier_call_chain+0x44/0x84)
      [<c0044724>] (notifier_call_chain+0x44/0x84) from [<c0044828>] (raw_notifier_call_chain+0x18/0x20)
      [<c0044828>] (raw_notifier_call_chain+0x18/0x20) from [<c00650cc>] (clockevents_notify+0x28/0x170)
      [<c00650cc>] (clockevents_notify+0x28/0x170) from [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168)
      [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168) from [<c000ea94>] (arch_cpu_idle+0x8/0x38)
      [<c000ea94>] (arch_cpu_idle+0x8/0x38) from [<c005ea80>] (cpu_startup_entry+0x60/0x134)
      [<c005ea80>] (cpu_startup_entry+0x60/0x134) from [<804fe9a4>] (0x804fe9a4)
      
      I don't have the hardware, so I wasn't able to reproduce the warning
      but after looking a while at the code, I deduced the following:
      
       1. CPU2 enters a deep idle state and sets the broadcast timer
      
       2. the timer expires, the tick_handle_oneshot_broadcast function is
          called, setting the tick_broadcast_pending_mask and waking up the
          idle CPU2
      
       3. CPU2 exits idle, handles the interrupt and then invokes
          tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_EXIT, which
          runs the following code:
      
          [...]
          if (dev->next_event.tv64 == KTIME_MAX)
                  goto out;
      
          if (cpumask_test_and_clear_cpu(cpu,
                                       tick_broadcast_pending_mask))
                  goto out;
          [...]
      
          So if there is no next event scheduled for CPU2, we fulfil the
          first condition and jump out without clearing the
          tick_broadcast_pending_mask.
      
       4. CPU2 goes to deep idle again and calls
          tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_ENTER, but
          with the tick_broadcast_pending_mask still set for CPU2, triggering
          the warning.
      
      The issue only surfaced due to the modifications of the cpuidle
      framework, which resulted in interrupts being enabled before the call
      to the clockevents code. If the call happens before interrupts have
      been enabled, the warning cannot trigger, because there is still the
      event pending which caused the broadcast timer expiry.
      
      Move the check for the next event below the check for the pending bit,
      so the pending bit gets cleared whether an event is scheduled on the
      cpu or not.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Reported-and-tested-by: Joseph Lo <josephl@nvidia.com>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linaro-kernel@lists.linaro.org
      Link: http://lkml.kernel.org/r/1371485735-31249-1-git-send-email-daniel.lezcano@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ea8deb8d
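The reordering described in the commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct cpu_model`, `NO_NEXT_EVENT` and the three helper functions are hypothetical stand-ins for one CPU's bit in tick_broadcast_pending_mask, KTIME_MAX and the relevant part of tick_broadcast_oneshot_control; none of it is the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_NEXT_EVENT INT64_MAX /* stands in for KTIME_MAX (assumption) */

/* Hypothetical per-CPU model: one pending bit and one next-event time. */
struct cpu_model {
    bool pending;       /* models this CPU's bit in tick_broadcast_pending_mask */
    int64_t next_event; /* models dev->next_event.tv64 */
};

/* Pre-fix order: the early exit skips the clear when nothing is scheduled. */
static void exit_broadcast_old(struct cpu_model *c)
{
    if (c->next_event == NO_NEXT_EVENT)
        return;
    if (c->pending) {
        c->pending = false;
        return;
    }
    /* reprogramming of the local timer would happen here */
}

/* Post-fix order: test-and-clear the pending bit first. */
static void exit_broadcast_new(struct cpu_model *c)
{
    if (c->pending) {
        c->pending = false;
        return;
    }
    if (c->next_event == NO_NEXT_EVENT)
        return;
    /* reprogramming of the local timer would happen here */
}

/* Models the condition that triggers the warning on the next idle entry. */
static bool enter_broadcast_would_warn(const struct cpu_model *c)
{
    return c->pending;
}
```

With a CPU that was woken by the broadcast (pending bit set) and has no next event, the old order leaves the bit set and the next idle entry would warn; the new order clears it.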
  18. 20 Jun, 2013 2 commits
    • F
      nohz: Remove obsolete check for full dynticks CPUs to be RCU nocbs · 5b8621a6
      Frederic Weisbecker committed
      Building full dynticks now implies that all CPUs are forced
      into RCU nocb mode through CONFIG_RCU_NOCB_CPU_ALL.
      
      The dynamic check has become useless.
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      5b8621a6
    • S
      nohz: Warn if the machine can not perform nohz_full · e12d0271
      Steven Rostedt committed
      If the user configures NO_HZ_FULL and defines nohz_full=XXX on the
      kernel command line, or enables NO_HZ_FULL_ALL, but nohz fails
      due to the machine having an unstable clock, warn about it.
      
      We do not want users thinking that they are getting the benefit
      of nohz when their machine can not support it.
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      e12d0271
  19. 18 Jun, 2013 1 commit
    • S
      ARM: sched_clock: Load cycle count after epoch stabilizes · 336ae118
      Stephen Boyd committed
      There is a small race between when the cycle count is read from
      the hardware and when the epoch stabilizes. Consider this
      scenario:
      
       CPU0                           CPU1
       ----                           ----
       cyc = read_sched_clock()
       cyc_to_sched_clock()
                                       update_sched_clock()
                                        ...
                                        cd.epoch_cyc = cyc;
        epoch_cyc = cd.epoch_cyc;
        ...
        epoch_ns + cyc_to_ns(cyc - epoch_cyc)
      
      The cyc on CPU0 was read before the epoch changed, but we
      calculate the nanoseconds based on the new epoch by subtracting
      the new epoch from the old cycle count. Since the new epoch is most
      likely larger than the old cycle count, the subtraction yields a large
      number that is converted to nanoseconds and added to epoch_ns, causing
      time to jump forward too much.
      
      Fix this problem by reading the hardware after the epoch has
      stabilized.
      
      Cc: Russell King <linux@arm.linux.org.uk>
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      336ae118
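The arithmetic behind the race in the commit above can be worked through in user space. The following is a sketch, not the ARM sched_clock code: it assumes a 32-bit counter and exactly 1 ns per cycle so that no mult/shift conversion is needed, and `struct clock_data` here is a hypothetical two-field snapshot named after the changelog's cd.epoch_cyc and epoch_ns.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical epoch snapshot, named after the changelog's cd.epoch_* fields. */
struct clock_data {
    uint32_t epoch_cyc; /* counter value at the last update */
    uint64_t epoch_ns;  /* nanoseconds at the last update */
};

/*
 * Convert a counter reading to nanoseconds relative to the epoch.
 * Assumes a 32-bit counter ticking at 1 ns per cycle (illustration only).
 */
static uint64_t cyc_to_sched_clock(const struct clock_data *cd, uint32_t cyc)
{
    uint32_t delta = cyc - cd->epoch_cyc; /* wraps modulo 2^32 */

    return cd->epoch_ns + delta;
}
```

If the counter is read as 100 and the epoch is then updated to 150 before the conversion runs, the wrapped difference is close to 2^32 and the result jumps forward by roughly 4.29 seconds under these assumptions. Reading the counter only after the epoch has been sampled (for example 160 against an epoch of 150) gives the expected small delta.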
  20. 13 Jun, 2013 2 commits
  21. 31 May, 2013 2 commits
  22. 30 May, 2013 2 commits
    • C
      power: Add option to log time spent in suspend · 5c83545f
      Colin Cross committed
      Below is a patch from the Android kernel that maintains a histogram of
      suspend times. Please review and provide feedback.
      
      Statistics on the time spent in suspend are kept in
      /sys/kernel/debug/sleep_time.
      
      Cc: Android Kernel Team <kernel-team@android.com>
      Cc: Colin Cross <ccross@android.com>
      Cc: Todd Poynor <toddpoynor@google.com>
      Cc: San Mehat <san@google.com>
      Cc: Benoit Goby <benoit@android.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Colin Cross <ccross@android.com>
      Signed-off-by: Todd Poynor <toddpoynor@google.com>
      [zoran.markovic@linaro.org: Re-formatted suspend time table to better
      fit expected values. Moved accounting of suspend time into timekeeping
      core. Removed CONFIG_SUSPEND_TIME flag and made the feature conditional
      on CONFIG_DEBUG_FS. Changed the file name to sleep_time to better fit
      terminology in timekeeping core. Changed seq_printf to seq_puts. Tweaked
      commit message]
      Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      5c83545f
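The changelog above does not say how the histogram is binned. One plausible scheme, shown here purely as an assumption, is power-of-two buckets keyed on whole seconds of suspend time. `NUM_BINS`, `highest_bit` and `account_sleep_time` are hypothetical names for this sketch and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BINS 32 /* assumed bucket count */

static unsigned int sleep_time_bin[NUM_BINS];

/* Position of the highest set bit, counting from 1; returns 0 for x == 0. */
static unsigned int highest_bit(uint32_t x)
{
    unsigned int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Count one suspend of the given length in its power-of-two bucket. */
static void account_sleep_time(uint32_t seconds)
{
    unsigned int bin = highest_bit(seconds);

    if (bin > NUM_BINS - 1)
        bin = NUM_BINS - 1; /* clamp very long suspends into the last bucket */
    sleep_time_bin[bin]++;
}
```

Under this scheme bucket 0 holds suspends shorter than one second, bucket 1 holds 1 s, bucket 2 holds 2 to 3 s, bucket 3 holds 4 to 7 s, and so on, which keeps the table small while covering a wide range of durations.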
    • T
      alarmtimer: Add functions for timerfd support · 6cffe00f
      Todd Poynor committed
      Add functions needed for hooking up alarmtimer to timerfd:
      
      * alarm_restart: Similar to hrtimer_restart, restart an alarmtimer after
        the expires time has already been updated (as with alarm_forward).
      
      * alarm_forward_now: Similar to hrtimer_forward_now, move the expires
        time forward to an interval from the current time of the associated clock.
      
      * alarm_start_relative: Start an alarmtimer with an expires time relative to
        the current time of the associated clock.
      
      * alarm_expires_remaining: Similar to hrtimer_expires_remaining, return the
        amount of time remaining until alarm expiry.
      
      Signed-off-by: Todd Poynor <toddpoynor@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      6cffe00f
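The "forward" and "expires remaining" helpers listed in the commit above follow the hrtimer naming, so their intent can be sketched with a user-space model of hrtimer-style catch-up. This is an illustration only: `struct alarm_model`, `forward` and `expires_remaining` are hypothetical names, times are plain nanosecond integers, and the real alarmtimer functions may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an alarm: only the expiry time, in nanoseconds. */
struct alarm_model {
    int64_t expires;
};

/*
 * Push the expiry past 'now' in whole intervals and return how many
 * intervals were added (0 if the alarm has not expired yet).
 */
static uint64_t forward(struct alarm_model *a, int64_t now, int64_t interval)
{
    uint64_t overrun = 1;
    int64_t delta = now - a->expires;

    assert(interval > 0);
    if (delta < 0)
        return 0; /* not yet expired, nothing to do */

    if (delta >= interval) {
        overrun = (uint64_t)(delta / interval);
        a->expires += interval * (int64_t)overrun;
        if (a->expires > now)
            return overrun;
        overrun++; /* one more interval is needed to get past 'now' */
    }
    a->expires += interval;
    return overrun;
}

/* Time left until the alarm fires; negative if it is already overdue. */
static int64_t expires_remaining(const struct alarm_model *a, int64_t now)
{
    return a->expires - now;
}
```

A "forward now" variant would call `forward` with the current time of the alarm's clock, and a "restart" step would then re-queue the alarm at the updated expiry.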
  23. 29 May, 2013 2 commits
    • Z
      timekeeping: Correct run-time detection of persistent_clock. · 0d6bd995
      Zoran Markovic committed
      Since commit 31ade306, timekeeping_init()
      checks for the presence of a persistent clock by attempting to read a
      non-zero time value. This is an issue on platforms where persistent_clock
      is implemented as a free-running counter (instead of an RTC) starting
      from zero on each boot and running during suspend. Examples are some ARM
      platforms (e.g. PandaBoard).
      
      An attempt to read such a clock during timekeeping_init() may return a
      zero value, so the persistent clock is falsely declared missing.
      Additionally, in that case suspend times may be accounted twice (once from
      timekeeping_resume() and once from rtc_resume()), resulting in a gradual
      drift of system time.
      
      This patch does a run-time correction of the issue by doing the same check
      during timekeeping_suspend().
      
      A better long-term solution would be to return an error when trying to
      read a non-existent clock and zero when trying to read an uninitialized
      clock, but that would require changing all persistent_clock implementations.
      
      This patch addresses the immediate breakage, for now.
      
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
      [jstultz: Tweaked commit message and subject]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      0d6bd995
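The run-time correction described in the commit above can be modeled in user space as follows. This is a sketch under assumptions: the `_model` functions, `struct ts_model`, `fake_counter_sec` and the `persistent_clock_exist` flag are hypothetical names standing in for the kernel's persistent-clock read, its timespec, a free-running counter that starts at zero on boot, and whatever flag timekeeping uses to remember the detection result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timespec stand-in. */
struct ts_model {
    int64_t tv_sec;
    long tv_nsec;
};

static int64_t fake_counter_sec;    /* free-running counter, zero at boot */
static bool persistent_clock_exist; /* detection result (name is an assumption) */

/* Models reading a persistent clock backed by the free-running counter. */
static void read_persistent_clock_model(struct ts_model *ts)
{
    ts->tv_sec = fake_counter_sec;
    ts->tv_nsec = 0;
}

static bool is_nonzero(const struct ts_model *ts)
{
    return ts->tv_sec || ts->tv_nsec;
}

/* Boot-time detection: a zero reading looks like "no persistent clock". */
static void timekeeping_init_model(void)
{
    struct ts_model now;

    read_persistent_clock_model(&now);
    persistent_clock_exist = is_nonzero(&now);
}

/* Suspend-time re-check: a non-zero reading proves the clock exists. */
static void timekeeping_suspend_model(void)
{
    struct ts_model now;

    read_persistent_clock_model(&now);
    if (is_nonzero(&now))
        persistent_clock_exist = true;
}
```

At boot the counter reads zero, so the init step concludes that no persistent clock exists. By the time of the first suspend the counter has advanced, and the re-check corrects the flag so that suspend time is accounted only once.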
    • G
      ntp: Remove unused variable flags in __hardpps · aa848233
      Geert Uytterhoeven committed
      kernel/time/ntp.c: In function ‘__hardpps’:
      kernel/time/ntp.c:877: warning: unused variable ‘flags’
      
      commit a076b214 ("ntp: Remove ntp_lock,
      using the timekeeping locks to protect ntp state") removed its users,
      but not the actual variable.
      
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      aa848233
  24. 28 May, 2013 3 commits
  25. 16 May, 2013 2 commits