1. 25 Apr, 2013 1 commit
    • clockevents: Set dummy handler on CPU_DEAD shutdown · 6f7a05d7
      Thomas Gleixner committed
      Vitaliy reported that a per cpu HPET timer interrupt crashes the
      system during hibernation. What happens is that the per cpu HPET timer
      gets shut down when the nonboot cpus are stopped. When the nonboot
      cpus are onlined again the HPET code sets up the MSI interrupt which
      fires before the clock event device is registered. The event handler
      is still set to hrtimer_interrupt, which then crashes the machine due
      to highres mode not being active.
      
      See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333
      
      There is no really good way to avoid that in the HPET code. The HPET
      code already has a mechanism to detect spurious interrupts when event
      handler == NULL for a similar reason.
      
      We can handle that in the clockevent/tick layer and replace the
      previous functional handler with a dummy handler like we do in
      tick_setup_new_device().
      
      The original clockevents code did this in clockevents_exchange_device(),
      but that got removed by commit 7c1e7689 (clockevents: prevent
      clockevent event_handler ending up handler_noop) which forgot to fix
      it up in tick_shutdown(). Same issue with the broadcast device.
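
      The handler swap described above can be sketched in plain C. This is
      only an illustration, not the kernel code: the struct and every
      demo_* name are hypothetical stand-ins, and demo_real_handler merely
      plays the role of a functional handler such as hrtimer_interrupt.

```c
/* Hypothetical stand-in for a clock event device; not the kernel's struct. */
struct demo_clock_dev {
    void (*event_handler)(struct demo_clock_dev *dev);
    int events_handled;
};

/* Plays the role of the functional handler (assumption for illustration). */
static void demo_real_handler(struct demo_clock_dev *dev)
{
    dev->events_handled++;
}

/* Dummy handler: absorbs a late or spurious interrupt without touching state. */
static void demo_noop_handler(struct demo_clock_dev *dev)
{
    (void)dev;
}

/* On shutdown, swap in the dummy so that a stale interrupt stays harmless. */
static void demo_shutdown(struct demo_clock_dev *dev)
{
    dev->event_handler = demo_noop_handler;
}
```

      With this shape, an interrupt that fires after demo_shutdown() still
      finds a valid function pointer, but the call no longer reaches the
      functional handler.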
      Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: stable@vger.kernel.org
      Cc: 700333@bugs.debian.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 23 Apr, 2013 1 commit
  3. 19 Apr, 2013 1 commit
  4. 18 Apr, 2013 10 commits
  5. 15 Apr, 2013 5 commits
  6. 13 Apr, 2013 2 commits
  7. 12 Apr, 2013 2 commits
    • kthread: Prevent unpark race which puts threads on the wrong cpu · f2530dc7
      Thomas Gleixner committed
      The smpboot threads rely on the park/unpark mechanism which binds per
      cpu threads on a particular core. However, the mechanism is racy:
      
      CPU0                    CPU1                                  CPU2
      unpark(T)                                                     wake_up_process(T)
        clear(SHOULD_PARK)    T runs
                              leave parkme() due to !SHOULD_PARK
        bind_to(CPU2)         BUG_ON(wrong CPU)
      
      We cannot let the tasks move themselves to the target CPU as one of
      those tasks is actually the migration thread itself, which requires
      that it starts running on the target cpu right away.
      
      The solution to this problem is to prevent wakeups in park mode which
      are not from unpark(). That way we can guarantee that the association
      of the task to the target cpu is working correctly.
      
      Add a new task state (TASK_PARKED) which prevents other wakeups and
      use this state explicitly for the unpark wakeup.
      
      Peter noticed: Also, since the task state is visible to userspace and
      all the parked tasks are still in the PID space, it's a good hint in ps
      and friends that these tasks aren't really there for the moment.
      
      The migration thread has another related issue.
      
      CPU0                    CPU1
      Bring up CPU2
      create_thread(T)
      park(T)
       wait_for_completion()
                              parkme()
                              complete()
      sched_set_stop_task()
                              schedule(TASK_PARKED)
      
      The sched_set_stop_task() call is issued while the task is on the
      runqueue of CPU1 and that confuses the hell out of the stop_task class
      on that cpu. So we need the same synchronization before
      sched_set_stop_task().
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-and-tested-by: Dave Hansen <dave@sr71.net>
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Acked-by: Peter Ziljstra <peterz@infradead.org>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: dhillf@gmail.com
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf: Fix error return code · c4814202
      Wei Yongjun committed
      Fix to return -ENOMEM in the allocation error case instead of 0
      (if pmu_bus_running == 1), as done elsewhere in this function.
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: a.p.zijlstra@chello.nl
      Cc: paulus@samba.org
      Cc: acme@ghostprotocols.net
      Link: http://lkml.kernel.org/r/CAPgLHd8j_fWcgqe%3DKLWjpBj%2B%3Do0Pw6Z-SEq%3DNTPU08c2w1tngQ@mail.gmail.com
      [ Tweaked the error code setting placement and the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 11 Apr, 2013 1 commit
  9. 10 Apr, 2013 1 commit
  10. 09 Apr, 2013 6 commits
    • hrtimer: Fix ktime_add_ns() overflow on 32bit architectures · 51fd36f3
      David Engraf committed
      One can trigger an overflow when using ktime_add_ns() on a 32bit
      architecture not supporting CONFIG_KTIME_SCALAR.
      
      When passing a very high value for u64 nsec, e.g. 7881299347898368000,
      the do_div() function converts this value to seconds (7881299347), which
      is still too high to pass to the ktime_set() function as a long. The
      result is a negative value.
      
      The problem on my system occurs in tick-sched.c, in
      tick_nohz_stop_sched_tick(), when time_delta is set to
      timekeeping_max_deferment(). The check time_delta < KTIME_MAX passes,
      thus ktime_add_ns() is called with a too large value, resulting in
      a negative expire value. This leads to an endless loop in the ticker code:
      
      time_delta: 7881299347898368000
      expires = ktime_add_ns(last_update, time_delta)
      expires: negative value
      
      This fix caps the value to KTIME_MAX.
      
      This error does not occur on 64bit architectures or on architectures
      supporting CONFIG_KTIME_SCALAR (e.g. ARM, x86-32).
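
      The capping idea can be sketched in userspace C as follows. This is a
      minimal sketch, not the kernel patch: the struct, the demo_* names and
      the use of int32_t to model a 32bit long seconds field are assumptions
      made for illustration. The example value from above, 7881299347
      seconds, exceeds INT32_MAX (2147483647), so the sketch saturates
      instead of letting the conversion wrap.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Assumption: a 32bit 'long' seconds field is modelled with int32_t. */
struct demo_split_time {
    int32_t sec;
    int32_t nsec;
};

/* Split nanoseconds into (sec, nsec), saturating rather than wrapping. */
static struct demo_split_time demo_ns_to_split(uint64_t nsec)
{
    struct demo_split_time t;
    uint64_t sec = nsec / DEMO_NSEC_PER_SEC;

    if (sec > (uint64_t)INT32_MAX) {
        /* Seconds do not fit: cap to the largest representable time. */
        t.sec = INT32_MAX;
        t.nsec = (int32_t)(DEMO_NSEC_PER_SEC - 1);
        return t;
    }
    t.sec = (int32_t)sec;
    t.nsec = (int32_t)(nsec % DEMO_NSEC_PER_SEC);
    return t;
}
```

      The comparison is done in 64bit before any narrowing, so the result
      can never turn negative.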
      
      Cc: stable@vger.kernel.org
      Signed-off-by: David Engraf <david.engraf@sysgo.com>
      [jstultz: Minor tweaks to commit message & header]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • hrtimer: Add expiry time overflow check in hrtimer_interrupt · 8f294b5a
      Prarit Bhargava committed
      The settimeofday01 test in the LTP testsuite effectively does
      
              gettimeofday(current time);
              settimeofday(Jan 1, 1970 + 100 seconds);
              settimeofday(current time);
      
      This test causes a stack trace to be displayed on the console during the
      setting of timeofday to Jan 1, 1970 + 100 seconds:
      
      [  131.066751] ------------[ cut here ]------------
      [  131.096448] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x135/0x140()
      [  131.104935] Hardware name: Dinar
      [  131.108150] Modules linked in: sg nfsv3 nfs_acl nfsv4 auth_rpcgss nfs dns_resolver fscache lockd sunrpc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables kvm_amd kvm sp5100_tco bnx2 i2c_piix4 crc32c_intel k10temp fam15h_power ghash_clmulni_intel amd64_edac_mod pcspkr serio_raw edac_mce_amd edac_core microcode xfs libcrc32c sr_mod sd_mod cdrom ata_generic crc_t10dif pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ahci pata_atiixp libahci libata usb_storage i2c_core dm_mirror dm_region_hash dm_log dm_mod
      [  131.176784] Pid: 0, comm: swapper/28 Not tainted 3.8.0+ #6
      [  131.182248] Call Trace:
      [  131.184684]  <IRQ>  [<ffffffff810612af>] warn_slowpath_common+0x7f/0xc0
      [  131.191312]  [<ffffffff8106130a>] warn_slowpath_null+0x1a/0x20
      [  131.197131]  [<ffffffff810b9fd5>] clockevents_program_event+0x135/0x140
      [  131.203721]  [<ffffffff810bb584>] tick_program_event+0x24/0x30
      [  131.209534]  [<ffffffff81089ab1>] hrtimer_interrupt+0x131/0x230
      [  131.215437]  [<ffffffff814b9600>] ? cpufreq_p4_target+0x130/0x130
      [  131.221509]  [<ffffffff81619119>] smp_apic_timer_interrupt+0x69/0x99
      [  131.227839]  [<ffffffff8161805d>] apic_timer_interrupt+0x6d/0x80
      [  131.233816]  <EOI>  [<ffffffff81099745>] ? sched_clock_cpu+0xc5/0x120
      [  131.240267]  [<ffffffff814b9ff0>] ? cpuidle_wrap_enter+0x50/0xa0
      [  131.246252]  [<ffffffff814b9fe9>] ? cpuidle_wrap_enter+0x49/0xa0
      [  131.252238]  [<ffffffff814ba050>] cpuidle_enter_tk+0x10/0x20
      [  131.257877]  [<ffffffff814b9c89>] cpuidle_idle_call+0xa9/0x260
      [  131.263692]  [<ffffffff8101c42f>] cpu_idle+0xaf/0x120
      [  131.268727]  [<ffffffff815f8971>] start_secondary+0x255/0x257
      [  131.274449] ---[ end trace 1151a50552231615 ]---
      
      When we change the system time to a low value like this, the value of
      timekeeper->offs_real will be a negative value.
      
      It seems that the WARN occurs because an hrtimer has been started in the time
      between the releasing of the timekeeper lock and the IPI call (via a call to
      on_each_cpu) in clock_was_set() in the do_settimeofday() code.  The end result
      is that a REALTIME_CLOCK timer has been added with softexpires = expires =
      KTIME_MAX.  The hrtimer_interrupt() fires/is called and the loop at
      kernel/hrtimer.c:1289 is executed.  In this loop the code subtracts the
      clock base's offset (which was set to timekeeper->offs_real in
      do_settimeofday()) from the current hrtimer_cpu_base->expiry value (which
      was KTIME_MAX):
      
      	KTIME_MAX - (a negative value) = overflow
      
      A simple check for an overflow can resolve this problem.  Using KTIME_MAX
      instead of the overflow value will result in the hrtimer function being run,
      and the reprogramming of the timer after that.
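
      A saturating form of this subtraction can be sketched as below. It is
      an illustration under stated assumptions, not the kernel change: times
      are modelled as signed 64bit nanoseconds, DEMO_KTIME_MAX stands in for
      KTIME_MAX, the demo_* names are hypothetical, and expires is assumed
      to be non-negative. The check is done before subtracting so the sketch
      never relies on signed wrap-around.

```c
#include <stdint.h>

/* Assumption: stands in for KTIME_MAX as signed 64bit nanoseconds. */
#define DEMO_KTIME_MAX INT64_MAX

/*
 * Compute expires - offset, capping at DEMO_KTIME_MAX.
 * Assumes expires >= 0. Only a negative offset can push the result past
 * the maximum: expires - offset > MAX  <=>  expires > MAX + offset.
 */
static int64_t demo_expiry_sub_offset(int64_t expires, int64_t offset)
{
    if (offset < 0 && expires > DEMO_KTIME_MAX + offset)
        return DEMO_KTIME_MAX;
    return expires - offset;
}
```

      With expires at the maximum and a negative offset, the function
      returns the maximum again, which matches the behaviour described
      above: the timer function runs and the timer is reprogrammed.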
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      [jstultz: Tweaked commit subject]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • PM / reboot: call syscore_shutdown() after disable_nonboot_cpus() · 6f389a8f
      Huacai Chen committed
      As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core
      subsystems PM) says, syscore_ops operations should be carried out with
      one CPU on-line and interrupts disabled. However, after commit f96972f2
      (kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()),
      syscore_shutdown() is called before disable_nonboot_cpus(), which
      breaks that rule. We have a MIPS machine with an 8259A PIC and an
      external timer (HPET) connected to the 8259A. Since the 8259A is shut
      down too early (by syscore_shutdown()), disable_nonboot_cpus() runs
      without a timer interrupt, so it hangs and the reboot fails. This patch
      calls syscore_shutdown() a little later (after disable_nonboot_cpus())
      to avoid the reboot failure, which is the same order poweroff uses.
      
      For consistency, add disable_nonboot_cpus() to kernel_halt().
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • ftrace: Do not call stub functions in control loop · 395b97a3
      Steven Rostedt (Red Hat) committed
      The function tracing control loop used by perf spits out a warning
      if the called function is not a control function. This is because
      the control function references a per cpu allocated data structure
      on struct ftrace_ops that is not allocated for other types of
      functions.
      
      Commit 0a016409 "ftrace: Optimize the function tracer list loop"
      optimized all function tracing loops for the case of a single
      registered ops. Unfortunately, this allows for a slight race when
      tracing starts or ends, where the stub function might be called after
      the currently registered ops is removed. In this case we get the
      following dump:
      
      root# perf stat -e ftrace:function sleep 1
      [   74.339105] WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0()
      [   74.349522] Hardware name: PRIMERGY RX200 S6
      [   74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspkr microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm libahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
      [   74.446233] Pid: 1377, comm: perf Tainted: G        W    3.9.0-rc1 #1
      [   74.453458] Call Trace:
      [   74.456233]  [<ffffffff81062e3f>] warn_slowpath_common+0x7f/0xc0
      [   74.462997]  [<ffffffff810fbc60>] ? rcu_note_context_switch+0xa0/0xa0
      [   74.470272]  [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
      [   74.478117]  [<ffffffff81062e9a>] warn_slowpath_null+0x1a/0x20
      [   74.484681]  [<ffffffff81102ede>] ftrace_ops_control_func+0xde/0xf0
      [   74.491760]  [<ffffffff8162f400>] ftrace_call+0x5/0x2f
      [   74.497511]  [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
      [   74.503486]  [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
      [   74.509500]  [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
      [   74.516088]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
      [   74.522268]  [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
      [   74.528837]  [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
      [   74.536696]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
      [   74.542878]  [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
      [   74.548869]  [<ffffffff81105c67>] unregister_ftrace_function+0x27/0x50
      [   74.556243]  [<ffffffff8111eadf>] perf_ftrace_event_register+0x9f/0x140
      [   74.563709]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
      [   74.569887]  [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
      [   74.575898]  [<ffffffff8111e94e>] perf_trace_destroy+0x2e/0x50
      [   74.582505]  [<ffffffff81127ba9>] tp_perf_event_destroy+0x9/0x10
      [   74.589298]  [<ffffffff811295d0>] free_event+0x70/0x1a0
      [   74.595208]  [<ffffffff8112a579>] perf_event_release_kernel+0x69/0xa0
      [   74.602460]  [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
      [   74.608667]  [<ffffffff8112a640>] put_event+0x90/0xc0
      [   74.614373]  [<ffffffff8112a740>] perf_release+0x10/0x20
      [   74.620367]  [<ffffffff811a3044>] __fput+0xf4/0x280
      [   74.625894]  [<ffffffff811a31de>] ____fput+0xe/0x10
      [   74.631387]  [<ffffffff81083697>] task_work_run+0xa7/0xe0
      [   74.637452]  [<ffffffff81014981>] do_notify_resume+0x71/0xb0
      [   74.643843]  [<ffffffff8162fa92>] int_signal+0x12/0x17
      
      To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end
      ftrace_ops stub as just that, a stub. This flag is now checked in the
      control loop and the function is not called if the flag is set.
      
      Thanks to Jovi for not just reporting the bug, but also pointing out
      where the bug was in the code.
      
      Link: http://lkml.kernel.org/r/514A8855.7090402@redhat.com
      Link: http://lkml.kernel.org/r/1364377499-1900-15-git-send-email-jovi.zhangwei@huawei.com
      Tested-by: WANG Chao <chaowang@redhat.com>
      Reported-by: WANG Chao <chaowang@redhat.com>
      Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ftrace: Consistently restore trace function on sysctl enabling · 5000c418
      Jan Kiszka committed
      If we reenable ftrace via sysctl, we currently set ftrace_trace_function
      based on the previous simplistic algorithm. This is inconsistent with
      what update_ftrace_function does. So better call that helper instead.
      
      Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Fix race with update_max_tr_single and changing tracers · 2930e04d
      Steven Rostedt (Red Hat) committed
      The commit 34600f0e "tracing: Fix race with max_tr and changing tracers"
      fixed the updating of the main buffers with the race of changing
      tracers, but left out the fix to the updating of just a per cpu buffer.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  11. 08 Apr, 2013 6 commits
    • sched/cputime: Fix accounting on multi-threaded processes · e614b333
      Stanislaw Gruszka committed
      Recent commit 6fac4829 ("cputime: Use accessors to read task
      cputime stats") introduced a bug, where we account many times
      the cputime of the first thread, instead of cputimes of all
      the different threads.
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20130404085740.GA2495@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • ftrace: Fix strncpy() use, use strlcpy() instead of strncpy() · 75761cc1
      Chen Gang committed
      For a NUL-terminated string we always need to set '\0' at the end.
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: rostedt@goodmis.org
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/516243B7.9020405@asianux.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix strncpy() use, use strlcpy() instead of strncpy() · 67012ab1
      Chen Gang committed
      For a NUL-terminated string we always need to set '\0' at the end.
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: rostedt@goodmis.org
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix strncpy() use, always make sure it's NUL terminated · c97847d2
      Chen Gang committed
      For a NUL-terminated string, always make sure that there's a '\0' at the end.
      
      In our case we need a return value, so still use strncpy() and
      fix up the tail explicitly.
      
      (strlcpy() returns the size, not the pointer)
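
      The pattern described here, keeping strncpy() for its pointer return
      value and terminating the buffer by hand, can be sketched as follows.
      The helper name demo_copy_name is hypothetical and only illustrates
      the idea; it is not the function touched by this commit.

```c
#include <string.h>

/*
 * Copy src into dst (capacity 'size' bytes) and return dst.
 * strncpy() does not write a terminator when src fills the buffer,
 * so the last byte is set to '\0' explicitly.
 */
static char *demo_copy_name(char *dst, const char *src, size_t size)
{
    if (size == 0)
        return dst;
    strncpy(dst, src, size - 1);
    dst[size - 1] = '\0';
    return dst;
}
```

      Copying "abcdefg" into a 4 byte buffer this way leaves "abc" plus the
      terminator, and the caller still gets the destination pointer back.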
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl>
      Cc: paulus@samba.org <paulus@samba.org>
      Cc: acme@ghostprotocols.net <acme@ghostprotocols.net>
      Link: http://lkml.kernel.org/r/51623E0B.7070101@asianux.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Fix sd->*_idx limit range avoiding overflow · fd9b86d3
      libin committed
      Commit 201c373e ("sched/debug: Limit sd->*_idx range on
      sysctl") was an incomplete bug fix.
      
      This patch fixes sd->*_idx limit range to [0 ~ CPU_LOAD_IDX_MAX-1]
      avoiding array overflow caused by setting sd->*_idx to CPU_LOAD_IDX_MAX
      on sysctl.
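
      The off-by-one can be sketched with a small range check. The value 5
      for DEMO_CPU_LOAD_IDX_MAX and the helper name are assumptions for
      illustration only; the point is that an index equal to the array
      length must be rejected, not accepted.

```c
/* Assumption: illustrative array length standing in for CPU_LOAD_IDX_MAX. */
#define DEMO_CPU_LOAD_IDX_MAX 5

/* Valid indices are 0 .. DEMO_CPU_LOAD_IDX_MAX - 1, inclusive. */
static int demo_idx_in_range(int idx)
{
    return idx >= 0 && idx <= DEMO_CPU_LOAD_IDX_MAX - 1;
}
```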
      Signed-off-by: Libin <huawei.libin@huawei.com>
      Cc: <jiang.liu@huawei.com>
      Cc: <guohanjun@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/51626610.2040607@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched_clock: Prevent 64bit inatomicity on 32bit systems · a1cbcaa9
      Thomas Gleixner committed
      The sched_clock_remote() implementation has the following inatomicity
      problem on 32bit systems when accessing the remote scd->clock, which
      is a 64bit value.
      
      CPU0			CPU1
      
      sched_clock_local()	sched_clock_remote(CPU0)
      ...
      			remote_clock = scd[CPU0]->clock
      			    read_low32bit(scd[CPU0]->clock)
      cmpxchg64(scd->clock,...)
      			    read_high32bit(scd[CPU0]->clock)
      
      While the update of scd->clock is using an atomic64 mechanism, the
      readout on the remote cpu is not, which can cause completely bogus
      readouts.
      
      It is a quite rare problem, because it requires the update to hit the
      narrow race window between the low/high readout and the update must go
      across the 32bit boundary.
      
      The resulting misbehaviour is that CPU1 will see the sched_clock on
      CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
      to this bogus timestamp. This stays that way due to the clamping
      implementation for about 4 seconds until the synchronization with
      CLOCK_MONOTONIC undoes the problem.
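
      Where the ~4 seconds come from can be shown with a deterministic
      sketch of the torn read. The helper below is hypothetical and does not
      race; it simply combines the low word from before the update with the
      high word from after it, which is what the interleaving above
      produces. Since the clock counts nanoseconds, a carry into the high
      word is worth 2^32 ns, roughly 4.29 seconds.

```c
#include <stdint.h>

/*
 * Model a non-atomic 64bit read on a 32bit machine: the low word is
 * read before the writer's update, the high word after it.
 */
static uint64_t demo_torn_read(uint64_t before, uint64_t after)
{
    uint32_t low = (uint32_t)before;          /* stale low half  */
    uint32_t high = (uint32_t)(after >> 32);  /* fresh high half */

    return ((uint64_t)high << 32) | low;
}
```

      For example, if the clock moves from 0x1FFFFFFFF to 0x200000000, the
      torn value is 0x2FFFFFFFF, which is 0xFFFFFFFF ns ahead of the real
      clock.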
      
      The issue is hard to observe, because it might only result in a less
      accurate SCHED_OTHER timeslicing behaviour. To create observable
      damage on realtime scheduling classes, it is necessary that the bogus
      update of CPU1 sched_clock happens in the context of a realtime
      thread, which then gets charged 4 seconds of RT runtime, which causes
      the RT throttler mechanism to trigger and prevent scheduling of RT
      tasks for a little less than 4 seconds. So this is quite unlikely as
      well.
      
      The issue was quite hard to decode as the reproduction time is between
      2 days and 3 weeks and intrusive tracing makes it less likely, but the
      following trace recorded with trace_clock=global, which uses
      sched_clock_local(), gave the final hint:
      
        <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
        <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
      irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
        <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
      
      What happens is that CPU0 goes idle and invokes
      sched_clock_idle_sleep_event() which invokes sched_clock_local() and
      CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
      sched_clock_remote(). The time jump gets propagated to CPU0 via
      sched_clock_remote() and stays stale on both cores for ~4 seconds.
      
      There are only two other possibilities, which could cause a stale
      sched clock:
      
      1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
         wrong value.
      
      2) sched_clock() which reads the TSC returns a sporadic wrong value.
      
      #1 can be excluded because sched_clock would continue to increase for
         one jiffy and then go stale.
      
      #2 can be excluded because it would not make the clock jump
         forward. It would just result in a stale sched_clock for one jiffy.
      
      After quite some brain twisting and finding the same pattern on other
      traces, sched_clock_remote() remained the only place which could cause
      such a problem and as explained above it's indeed racy on 32bit
      systems.
      
      So while on 64bit systems the readout is atomic, we need to verify the
      remote readout on 32bit machines. We need to protect the local->clock
      readout in sched_clock_remote() on 32bit as well because an NMI could
      hit between the low and the high readout, call sched_clock_local() and
      modify local->clock.
      
      Thanks to Siegfried Wulsch for bearing with my debug requests and
      going through the tedious tasks of running a bunch of reproducer
      systems to generate the debug information which let me decode the
      issue.
      Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
  12. 05 Apr, 2013 4 commits