1. 22 Nov 2014, 5 commits
    • time: Provide y2038 safe timekeeping_inject_sleeptime() replacement · 04d90890
      Committed by pang.xunlei
      As part of addressing the "y2038 problem" for in-kernel uses, this
      patch adds timekeeping_inject_sleeptime64() using timespec64.
      
      After this patch, timekeeping_inject_sleeptime() is deprecated
      and all its call sites will be converted to the new interface;
      after that it can be removed.
      
      NOTE: timekeeping_inject_sleeptime() is actually y2038-safe, but we
      want to eliminate timespec eventually, hence this patch.
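
      A minimal sketch of how the deprecated 32-bit entry point can forward
      to the new interface (illustrative only; timespec_to_timespec64() is
      the conversion helper assumed here, not part of this patch):

          void timekeeping_inject_sleeptime64(struct timespec64 *delta);

          static inline void timekeeping_inject_sleeptime(struct timespec *delta)
          {
                  struct timespec64 delta64 = timespec_to_timespec64(*delta);

                  timekeeping_inject_sleeptime64(&delta64);
          }
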
      Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • time: Provide y2038 safe do_settimeofday() replacement · 21f7eca5
      Committed by pang.xunlei
      The kernel uses a 32-bit signed value (time_t) for seconds elapsed
      since 1970-01-01 00:00:00, so on 32-bit systems it will overflow at
      2038-01-19 03:14:08. This is widely known as the y2038 problem.
      
      As part of addressing the "y2038 problem" for in-kernel uses, this patch
      adds a y2038-safe do_settimeofday64() using timespec64.
      
      After this patch, do_settimeofday() is deprecated and all its call
      sites will be converted to do_settimeofday64(); after that it can be
      removed.
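
      A small user-space illustration of the 32-bit wraparound described
      above (a sketch, not kernel code):

          #include <stdint.h>
          #include <stdio.h>
          #include <time.h>

          int main(void)
          {
                  time_t last = (time_t)INT32_MAX; /* last second a 32-bit time_t can hold */
                  printf("%s", asctime(gmtime(&last))); /* Tue Jan 19 03:14:07 2038 */
                  /* One second later a 32-bit time_t wraps to -2^31,
                   * i.e. back to 1901-12-13. */
                  return 0;
          }
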
      Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • time: Complete NTP adjustment threshold judging conditions · 659bc17b
      Committed by pang.xunlei
      The clocksource mult-adjustment range is [mult - maxadj, mult + maxadj].
      timekeeping_adjust() only checks the upper bound and misses the
      lower one.
      
      This patch adds the missing lower-bound check.
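
      A hedged sketch of the bounds check (names follow the surrounding
      timekeeping code; not the exact diff):

          /* Refuse an adjustment that would leave the allowed range */
          if (tk->tkr.mult + mult_adj > tk->tkr.clock->mult + tk->tkr.clock->maxadj ||
              tk->tkr.mult + mult_adj < tk->tkr.clock->mult - tk->tkr.clock->maxadj)
                  return;
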
      Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
      [jstultz: Minor fix for > 80 char line]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • time: Avoid possible NTP adjustment mult overflow. · 6067dc5a
      Committed by pang.xunlei
      Ideally, __clocksource_updatefreq_scale() selects the largest shift
      value possible for a clocksource. This results in the mult member of
      struct clocksource being particularly large, although not so large
      that NTP would adjust the clock to cause it to overflow.
      
      That said, nothing actually prohibits an overflow from occurring; it's
      just that it "shouldn't" occur.
      
      So while very unlikely, and so far never observed, the value of
      (cs->mult + cs->maxadj) may come very close to 0xFFFFFFFF, so there is
      a chance it may overflow when doing a positive NTP adjustment.
      
      See the following detail: when NTP slews the clock, the kernel goes
      through update_wall_time()->...->timekeeping_apply_adjustment():
      	tk->tkr.mult += mult_adj;
      
      Since there is no guard against it, it's possible that tk->tkr.mult
      may overflow during this operation.
      
      This patch avoids any possible mult overflow by checking for the
      overflow case before adding mult_adj to mult, and adds a WARNING
      message when such a case is caught.
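
      A hedged sketch of the guard (illustrative; the real change lives in
      timekeeping_apply_adjustment()):

          /* Detect unsigned wraparound before it happens */
          if (mult_adj > 0 && tk->tkr.mult + mult_adj < mult_adj) {
                  WARN_ON_ONCE(1);
                  return;
          }
          tk->tkr.mult += mult_adj;
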
      Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
      [jstultz: Reworded commit message]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • time: Rename udelay_test.c to test_udelay.c · fd866e2b
      Committed by John Stultz
      Kees requested that this test module be renamed for consistency's sake,
      so this patch renames the udelay_test.c file (recently added to
      tip/timers/core for 3.17) to test_udelay.c.
      
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Linux-Next <linux-next@vger.kernel.org>
      Cc: David Riley <davidriley@chromium.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  2. 09 Oct 2014, 1 commit
  3. 19 Sep 2014, 1 commit
    • sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule() · f139caf2
      Committed by Kirill Tkhai
      schedule(), io_schedule() and schedule_timeout() always return
      with the TASK_RUNNING state set, so setting it once more afterwards
      is unnecessary.
      
      (All places in the patch are visibly fine; the only exception is
       kiblnd_scheduler() from:
      
            drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
      
       whose schedule() is one line above the standard 3 lines of unified diff.)
      
      There are no places where set_current_state() is used as a memory
      barrier (mb()).
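
      An illustrative before/after of the treewide cleanup:

          /* before */
          schedule();
          set_current_state(TASK_RUNNING);        /* redundant */

          /* after */
          schedule();     /* already returns with current in TASK_RUNNING */
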
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Anil Belur <askb23@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry Eremin <dmitry.eremin@intel.com>
      Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Isaac Huang <he.huang@intel.com>
      Cc: James E.J. Bottomley <JBottomley@parallels.com>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Liang Zhen <liang.zhen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masaru Nomura <massa.nomura@gmail.com>
      Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleg Drokin <green@linuxhacker.ru>
      Cc: Peng Tao <bergwolf@gmail.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Robert Love <robert.w.love@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Ursula Braun <ursula.braun@de.ibm.com>
      Cc: Zi Shen Lim <zlim.lnx@gmail.com>
      Cc: devel@driverdev.osuosl.org
      Cc: dm-devel@redhat.com
      Cc: dri-devel@lists.freedesktop.org
      Cc: fcoe-devel@open-fcoe.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux390@de.ibm.com
      Cc: linux-afs@lists.infradead.org
      Cc: linux-cris-kernel@axis.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Cc: qla2xxx-upstream@qlogic.com
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Cc: user-mode-linux-user@lists.sourceforge.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 14 Sep 2014, 4 commits
    • nohz: nohz full depends on irq work self IPI support · 9b01f5bf
      Committed by Frederic Weisbecker
      The nohz full functionality depends on IRQ work to trigger its own
      interrupts. As it's used to restart the tick, we can't rely on the tick
      fallback for irq work callbacks, i.e. we can't use the tick to restart
      the tick itself.
      
      Let's reject the full dynticks initialization if that arch support isn't
      available.
      
      As a side effect, this makes sure that the nohz kick is never called
      from the tick, which would otherwise result in illegal hrtimer
      self-cancellation and a lockup.
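
      A hedged sketch of the init-time rejection (assumed shape, following
      the commit text):

          void __init tick_nohz_init(void)
          {
                  if (!arch_irq_work_has_interrupt()) {
                          pr_warn("NO_HZ: full dynticks needs irq work self-IPI support, disabling\n");
                          return;
                  }
                  /* ... proceed with full dynticks initialization ... */
          }
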
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • nohz: Consolidate nohz full init code · 4327b15f
      Committed by Frederic Weisbecker
      The support for CONFIG_NO_HZ_FULL_ALL=y and the nohz_full= kernel
      parameter both have their own way of doing the same thing: allocate
      the full dynticks cpumasks, fill them, and initialize some state
      variables.
      
      Let's consolidate all of that in the same place.
      
      While at it, convert some regular printk messages to warnings when
      fundamental allocations fail.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • irq_work: Force raised irq work to run on irq work interrupt · 76a33061
      Committed by Frederic Weisbecker
      The nohz full kick, which restarts the tick when any resource depends
      on it, can't be executed just anywhere, given the operations it
      performs on timers. If it is called from the scheduler or timer code,
      chances are that we run into a deadlock.
      
      This is why we run the nohz full kick from an irq work: that way we
      make sure that the kick runs in a pristine context.
      
      However, while that holds when irq work runs from its own dedicated
      self-IPI, things are different on the many archs that don't support
      self-triggered irq work. To support them, irq works are also handled
      from the timer interrupt as a fallback.
      
      Now when irq works run from the timer interrupt, the context isn't
      blank. More precisely, they can run in the context of the hrtimer that
      drives the tick. But the nohz kick cancels and restarts this hrtimer,
      and cancelling an hrtimer from within itself isn't allowed. This is why
      we end up in an endless loop:
      
      	Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
      	CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
      	Workqueue: btrfs-endio-write normal_work_helper [btrfs]
      	 ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
      	 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
      	 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
      	Call Trace:
      	 <NMI[<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
      	 [<ffffffff8a7ef928>] panic+0xd4/0x207
      	 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
      	 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
      	 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
      	 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
      	 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20
      	 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
      	 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
      	 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390
      	 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
      	 [<ffffffff8a008268>] do_nmi+0xb8/0x100
      	 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 <<EOE><IRQ[<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
      	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
      	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
      	 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
      	 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
      	 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
      	 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
      	 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
      	 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
      	 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70
      	 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
      	 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
      	 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
      	 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
      	 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
      	 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
      	 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
      	 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80
      
      To fix this we force non-lazy irq works to run on the irq work
      self-IPI when it is available. Whether the arch can trigger irq work
      self-IPIs is reported by arch_irq_work_has_interrupt().
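
      A hedged sketch of the resulting fallback behaviour (assumed shape):

          /* Called from the timer interrupt */
          void irq_work_tick(void)
          {
                  struct llist_head *raised = this_cpu_ptr(&raised_list);

                  /* Run raised (non-lazy) works here only when the arch
                   * cannot deliver the dedicated self-IPI. */
                  if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
                          irq_work_run_list(raised);
                  irq_work_run_list(this_cpu_ptr(&lazy_list));
          }
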
      Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • nohz: Move nohz full init call to tick init · a80e49e2
      Committed by Frederic Weisbecker
      This way we unbloat main.c a bit and, more importantly, we initialize
      nohz full after init_IRQ(). This dependency will be needed in further
      patches because nohz full needs irq work to raise its own IRQ.
      Information about support for this ability on ARM64 is obtained in
      init_IRQ(), which initializes the pointer to __smp_call_function.
      
      Since tick_init() is called right after init_IRQ(), this is a good
      place to call tick_nohz_init() and prepare for that dependency.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  5. 13 Sep 2014, 4 commits
    • alarmtimer: Lock k_itimer during timer callback · 474e941b
      Committed by Richard Larocque
      Locks the k_itimer's it_lock member when handling the alarm timer's
      expiry callback.
      
      The regular posix timers defined in posix-timers.c have this lock held
      during timeout processing because their callbacks are routed through
      posix_timer_fn().  The alarm timers follow a different path, so they
      ought to grab the lock somewhere else.
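
      A hedged sketch of the fix (assumed shape, mirroring posix_timer_fn()):

          static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
                                                            ktime_t now)
          {
                  struct k_itimer *ptr = container_of(alarm, struct k_itimer,
                                                      it.alarm.alarmtimer);
                  unsigned long flags;

                  spin_lock_irqsave(&ptr->it_lock, flags);
                  /* ... expiry handling / signal delivery ... */
                  spin_unlock_irqrestore(&ptr->it_lock, flags);
                  return ALARMTIMER_NORESTART;
          }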
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • alarmtimer: Do not signal SIGEV_NONE timers · 265b81d2
      Committed by Richard Larocque
      Avoids sending a signal to alarm timers created with sigev_notify set to
      SIGEV_NONE by checking for that special case in the timeout callback.
      
      The regular posix timers avoid sending signals to SIGEV_NONE timers by
      not scheduling any callbacks for them in the first place.  Although it
      would be possible to do something similar for alarm timers, it's simpler
      to handle this as a special case in the timeout.
      
      Prior to this patch, the alarm timer would ignore the sigev_notify
      value and try to deliver signals to the process anyway.  Even worse,
      the sanity check for the value of sigev_signo is skipped when
      SIGEV_NONE is specified, so the signal number could be bogus.  If
      sigev_signo is an uninitialized value (as it often would be when
      SIGEV_NONE is used), then it's hard to predict which signal will be
      sent.
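
      A hedged sketch of the special case in the expiry callback (assumed
      shape):

          spin_lock_irqsave(&ptr->it_lock, flags);
          if ((ptr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) {
                  if (posix_timer_event(ptr, 0) != 0)
                          ptr->it_overrun++;
          }
          spin_unlock_irqrestore(&ptr->it_lock, flags);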
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • alarmtimer: Return relative times in timer_gettime · e86fea76
      Committed by Richard Larocque
      Return the time remaining for an alarm timer, rather than the time at
      which it is scheduled to expire.  If the timer has already expired or
      is not currently scheduled, the it_value members are set to zero.
      
      This new behavior matches that of the other posix-timers and the POSIX
      specifications.
      
      This is a change in user-visible behavior, and may break existing
      applications.  Hopefully, few users rely on the old incorrect behavior.
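
      A hedged sketch of the idea (hypothetical variable names):

          ktime_t remaining = ktime_sub(alarm_expiry, base_now);

          if (ktime_to_ns(remaining) > 0) {
                  cur_setting->it_value = ktime_to_timespec(remaining);
          } else {
                  cur_setting->it_value.tv_sec = 0;
                  cur_setting->it_value.tv_nsec = 0;
          }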
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      [jstultz: minor style tweak]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • jiffies: Fix timeval conversion to jiffies · d78c9300
      Committed by Andrew Hunter
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly when repeatedly stopping and starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation x of
      NSEC_PER_USEC/TICK_NSEC with denominator 2^USEC_JIFFIE_SC (i.e.
      NSEC_PER_USEC/TICK_NSEC ~= x/(2^USEC_JIFFIE_SC)), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
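
      A hedged sketch of the corrected conversion (assumed shape):

          unsigned long timeval_to_jiffies(const struct timeval *value)
          {
                  /* Convert usec to nsec and reuse the nsec-based
                   * conversion, which rounds up correctly. */
                  return __timespec_to_jiffies(value->tv_sec,
                                               value->tv_usec * NSEC_PER_USEC);
          }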
      
      Tested: the following program:
      
      #include <stdio.h>
      #include <stddef.h>
      #include <sys/time.h>

      int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                 prev.it_value.tv_sec, prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
      }
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Reported-by: Aaron Jacobs <jacobsa@google.com>
      Signed-off-by: Andrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  6. 08 Sep 2014, 1 commit
    • time, signal: Protect resource use statistics with seqlock · e78c3496
      Committed by Rik van Riel
      Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
      issues on large systems, due to both functions being serialized with a
      lock.
      
      The lock protects against reporting a wrong value, due to a thread in the
      task group exiting, its statistics reporting up to the signal struct, and
      that exited task's statistics being counted twice (or not at all).
      
      Protecting that with a lock results in times() and clock_gettime() being
      completely serialized on large systems.
      
      This can be fixed by using a seqlock around the events that gather and
      propagate statistics. As an additional benefit, the protection code can
      be moved into thread_group_cputime(), slightly simplifying the calling
      functions.
      
      In the case of posix_cpu_clock_get_task() things can be simplified a
      lot, because the calling function already ensures that the task sticks
      around, and the rest is now taken care of in thread_group_cputime().
      
      This way the statistics reporting code can run lockless.
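
      A hedged sketch of the pattern (assumed shape; field names
      illustrative):

          /* writer side, e.g. when a thread exits: */
          write_seqlock_irqsave(&sig->stats_lock, flags);
          sig->utime += utime;
          sig->stime += stime;
          write_sequnlock_irqrestore(&sig->stats_lock, flags);

          /* reader side, in thread_group_cputime(): */
          do {
                  seq = read_seqbegin(&sig->stats_lock);
                  utime = sig->utime;
                  stime = sig->stime;
          } while (read_seqretry(&sig->stats_lock, seq));
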
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 06 Sep 2014, 1 commit
  8. 05 Sep 2014, 1 commit
    • nohz: Restore NMI safe local irq work for local nohz kick · 40bea039
      Committed by Frederic Weisbecker
      The local nohz kick is currently used by perf, which needs it to be
      NMI-safe. A recent commit, though (7d1311b9),
      changed its implementation to fire the local kick using the remote
      kick API. That was convenient to make the code more generic, but the
      remote kick isn't NMI-safe.
      
      As a result:
      
      	WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
      	CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
      	0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
      	0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
      	0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
      	Call Trace:
      	<NMI>  [<ffffffff9a7f1e37>] dump_stack+0x4e/0x7a
      	[<ffffffff9a0791dd>] warn_slowpath_common+0x7d/0xa0
      	[<ffffffff9a07930a>] warn_slowpath_null+0x1a/0x20
      	[<ffffffff9a17ca1e>] irq_work_queue_on+0x11e/0x140
      	[<ffffffff9a10a2c7>] tick_nohz_full_kick_cpu+0x57/0x90
      	[<ffffffff9a186cd5>] __perf_event_overflow+0x275/0x350
      	[<ffffffff9a184f80>] ? perf_event_task_disable+0xa0/0xa0
      	[<ffffffff9a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
      	[<ffffffff9a187934>] perf_event_overflow+0x14/0x20
      	[<ffffffff9a020386>] intel_pmu_handle_irq+0x206/0x410
      	[<ffffffff9a0b54d3>] ? arch_vtime_task_switch+0x63/0x130
      	[<ffffffff9a01937b>] perf_event_nmi_handler+0x2b/0x50
      	[<ffffffff9a007b72>] nmi_handle+0xd2/0x390
      	[<ffffffff9a007aa5>] ? nmi_handle+0x5/0x390
      	[<ffffffff9a0d131b>] ? lock_release+0xab/0x330
      	[<ffffffff9a008062>] default_do_nmi+0x72/0x1c0
      	[<ffffffff9a0c925f>] ? cpuacct_account_field+0xcf/0x200
      	[<ffffffff9a008268>] do_nmi+0xb8/0x100
      
      Let's fix this by restoring the use of local irq work for the local
      nohz kick.
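
      A hedged sketch of the restored local kick (assumed shape):

          void tick_nohz_full_kick(void)
          {
                  if (!tick_nohz_full_cpu(smp_processor_id()))
                          return;

                  /* irq_work_queue() on the local CPU is NMI-safe */
                  irq_work_queue(this_cpu_ptr(&nohz_full_kick_work));
          }
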
      Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  9. 27 Aug 2014, 3 commits
  10. 23 Aug 2014, 2 commits
    • nohz: Avoid tick's double reprogramming in highres mode · 2a16fc93
      Committed by Viresh Kumar
      In highres mode, the tick reschedules itself unconditionally to the
      next jiffy.
      
      While this clock reprogramming is relevant when the tick is in periodic
      mode, it's not that interesting when we run in dynticks mode, because
      the irq exit path is likely going to overwrite the next tick with some
      deferred expiry anyway.
      
      So let's just get rid of this tick self-rescheduling in dynticks mode.
      This way we can avoid some double clockevents writes in favourable
      scenarios, such as when we stop the tick completely in idle while no
      other hrtimer is pending.
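
      A hedged sketch of the highres-path change (assumed shape):

          static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
          {
                  struct tick_sched *ts =
                          container_of(timer, struct tick_sched, sched_timer);

                  /* ... regular tick work ... */

                  /* No need to reprogram if we are in idle or full dynticks mode */
                  if (unlikely(ts->tick_stopped))
                          return HRTIMER_NORESTART;

                  hrtimer_forward(timer, ktime_get(), tick_period);
                  return HRTIMER_RESTART;
          }
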
      Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • nohz: Fix spurious periodic tick behaviour in low-res dynticks mode · b5e995e6
      Committed by Viresh Kumar
      When we reach the end of the tick handler, we unconditionally reschedule
      the next tick to the next jiffy. Then on irq exit, the nohz code
      overrides that setting if needed and defers the next tick as far away in
      the future as possible.
      
      Now in the best dynticks case, when we actually don't need any tick in
      the future (ie: expires == KTIME_MAX), low-res and high-res behave
      differently. What we want in this case is to cancel the next tick
      programmed by the previous one. That's what we do in high-res mode. OTOH
      we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't
      do anything in this case and the next tick remains scheduled to jiffies + 1.
      
      As a result, in low-res mode, when the dynticks code determines that no
      tick is needed in the future, we can recursively get a spurious tick
      every jiffy, because the next tick is always reprogrammed from the
      tick handler and never cancelled. This can happen indefinitely
      until some subsystem actually needs a precise tick in the future, and
      only then do we eventually overwrite the previous tick handler's
      setting to defer the next tick.
      
      We are fixing this by introducing the ONESHOT_STOPPED mode which will
      let us pause a clockevent when no further interrupt is needed. Meanwhile
      we can't expect all drivers to support this new mode.
      
      So let's reduce much of the symptoms by skipping the nohz-blind tick
      rescheduling from the tick handler when the CPU is in dynticks mode.
      That tick rescheduling wrongly assumed periodicity, and the low-res
      dynticks code can't cancel such a decision. This breaks the recursive
      (and thus the worst) part of the problem. In the worst case now, we'll
      get only one extra tick, due to an uncancelled tick scheduled before
      we entered dynticks mode.
      
      This also removes a needless clockevent write on idle ticks. Since
      such clock writes are usually considered slow, it's a general win.
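
      A hedged sketch of the low-res-path change (assumed shape):

          static void tick_nohz_handler(struct clock_event_device *dev)
          {
                  struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
                  ktime_t now = ktime_get();

                  /* ... regular tick work ... */

                  /* No need to reprogram if we run tickless */
                  if (unlikely(ts->tick_stopped))
                          return;

                  hrtimer_forward(&ts->sched_timer, now, tick_period);
                  tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
          }
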
      Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  11. 15 Aug 2014, 1 commit
  12. 01 Aug 2014, 1 commit
    • timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks · 504d5874
      Committed by Jan Kara
      clockevents_increase_min_delta() calls printk() from under
      hrtimer_bases.lock. That causes lock inversion on scheduler locks because
      printk() can call into the scheduler. Lockdep puts it as:
      
      ======================================================
      [ INFO: possible circular locking dependency detected ]
      3.15.0-rc8-06195-g939f04be #2 Not tainted
      -------------------------------------------------------
      trinity-main/74 is trying to acquire lock:
       (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c
      
      but task is already holding lock:
       (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (hrtimer_bases.lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
             [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
             [<81080792>] task_clock_event_start+0x3a/0x3f
             [<810807a4>] task_clock_event_add+0xd/0x14
             [<8108259a>] event_sched_in+0xb6/0x17a
             [<810826a2>] group_sched_in+0x44/0x122
             [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
             [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
             [<81082bf6>] __perf_install_in_context+0x8b/0xa3
             [<8107eb8e>] remote_function+0x12/0x2a
             [<8105f5af>] smp_call_function_single+0x2d/0x53
             [<8107e17d>] task_function_call+0x30/0x36
             [<8107fb82>] perf_install_in_context+0x87/0xbb
             [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
             [<810856f9>] SyS_perf_event_open+0x17/0x19
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #4 (&ctx->lock){......}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      -> #3 (&rq->lock){-.-.-.}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81040873>] __task_rq_lock+0x33/0x3a
             [<8104184c>] wake_up_new_task+0x25/0xc2
             [<8102474b>] do_fork+0x15c/0x2a0
             [<810248a9>] kernel_thread+0x1a/0x1f
             [<814232a2>] rest_init+0x1a/0x10e
             [<817af949>] start_kernel+0x303/0x308
             [<817af2ab>] i386_start_kernel+0x79/0x7d
      
      -> #2 (&p->pi_lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<810413dd>] try_to_wake_up+0x1d/0xd6
             [<810414cd>] default_wake_function+0xb/0xd
             [<810461f3>] __wake_up_common+0x39/0x59
             [<81046346>] __wake_up+0x29/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #1 (&tty->write_wait){-.....}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<81046332>] __wake_up+0x15/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #0 (&port_lock_key){-.....}:
             [<8104a62d>] __lock_acquire+0x9ea/0xc6d
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<811c60be>] serial8250_console_write+0x8c/0x10c
             [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
             [<8104f5d5>] console_unlock+0x1d7/0x398
             [<8104fb70>] vprintk_emit+0x3da/0x3e4
             [<81425f76>] printk+0x17/0x19
             [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
             [<8105c548>] clockevents_program_event+0xe7/0xf3
             [<8105cc1c>] tick_program_event+0x1e/0x23
             [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
             [<8103c49e>] __remove_hrtimer+0x5b/0x79
             [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
             [<8103cb4b>] hrtimer_cancel+0xd/0x18
             [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
             [<81080705>] task_clock_event_stop+0x20/0x64
             [<81080756>] task_clock_event_del+0xd/0xf
             [<81081350>] event_sched_out+0xab/0x11e
             [<810813e0>] group_sched_out+0x1d/0x66
             [<81081682>] ctx_sched_out+0xaf/0xbf
             [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &port_lock_key --> &ctx->lock --> hrtimer_bases.lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(hrtimer_bases.lock);
                                     lock(&ctx->lock);
                                     lock(hrtimer_bases.lock);
        lock(&port_lock_key);
      
       *** DEADLOCK ***
      
      4 locks held by trinity-main/74:
       #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
       #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
       #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4
      
      stack backtrace:
      CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04be #2
       00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
       8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
       8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
      Call Trace:
       [<81426f69>] dump_stack+0x16/0x18
       [<81425a99>] print_circular_bug+0x18f/0x19c
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104af87>] ? lock_release+0x191/0x223
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8104416d>] ? __dequeue_entity+0x23/0x27
       [<81044505>] ? pick_next_task_fair+0xb1/0x120
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
       [<810475b0>] ? trace_hardirqs_off+0xb/0xd
       [<81056346>] ? rcu_irq_exit+0x64/0x77
      
      Fix the problem by using printk_deferred() which does not call into the
      scheduler.
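
      A hedged sketch of the change (message text illustrative):

          /* inside clockevents_increase_min_delta(), which runs under
           * hrtimer_bases.lock: defer the printk instead of calling
           * into the console (and thus the scheduler) directly.
           */
          printk_deferred(KERN_WARNING
                          "CE: %s increased min_delta_ns to %llu nsec\n",
                          dev->name ? dev->name : "?",
                          (unsigned long long) dev->min_delta_ns);
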
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 24 Jul 2014, 15 commits
    • sched_clock: Avoid corrupting hrtimer tree during suspend · f723aa18
      Committed by Stephen Boyd
      During suspend we call sched_clock_poll() to update the epoch and
      accumulated time and reprogram the sched_clock_timer to fire
      before the next wrap-around time. Unfortunately,
      sched_clock_poll() doesn't restart the timer; instead it relies
      on the hrtimer layer to do that, and during suspend we aren't
      calling that function from the hrtimer layer. Instead, we're
      reprogramming the expires time while the hrtimer is enqueued,
      which can cause the hrtimer tree to be corrupted. Furthermore, we
      restart the timer during suspend but we update the epoch during
      resume, which seems counter-intuitive.
      
      Let's fix this by saving the accumulated state and canceling the
      timer during suspend. On resume we can update the epoch and
      restart the timer similar to what we would do if we were starting
      the clock for the first time.
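
      A hedged sketch of the suspend/resume pair (assumed shape):

          static int sched_clock_suspend(void)
          {
                  update_sched_clock();           /* accumulate up to now */
                  hrtimer_cancel(&sched_clock_timer);
                  cd.suspended = true;
                  return 0;
          }

          static void sched_clock_resume(void)
          {
                  cd.epoch_cyc = read_sched_clock();
                  hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
                  cd.suspended = false;
          }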
      
      Fixes: a08ca5d1 "sched_clock: Use an hrtimer instead of timer"
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/1406174630-23458-1-git-send-email-john.stultz@linaro.org
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • timekeeping: Use cached ntp_tick_length when accumulating error · 375f45b5
      Committed by John Stultz
      By caching the ntp_tick_length() when we correct the frequency error,
      and then using that cached value to accumulate error, we avoid large
      initial errors when the tick length is changed.
      
      This makes convergence happen much faster in the simulator, since the
      initial error doesn't have to be slowly whittled away.
      
      This initially seems like an accounting error, but Miroslav pointed out
      that ntp_tick_length() can change mid-tick, so when we apply it in the
      error accumulation, we are applying any recent change to the entire tick.
      
      This approach chooses to apply changes to ntp_tick_length() only to
      the next tick, which allows us to calculate the freq correction before
      using the new tick length, and thus avoids accumulating error.
      
      Credit to Miroslav for pointing this out and providing the original
      patch this functionality has been pulled out from, along with the
      rationale.
      
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Rework frequency adjustments to work better w/ nohz · dc491596
      Committed by John Stultz
      The existing timekeeping_adjust logic has always been complicated
      to understand. Further, since it was developed prior to NOHZ becoming
      common, it's not surprising that it performs poorly when NOHZ is
      enabled.
      
      Since Miroslav pointed out the problematic nature of the existing code
      in the NOHZ case, I've tried to refactor the code to perform better.
      
      The problem with the previous approach was that it tried to adjust
      for the total cumulative error using a scaled dampening factor. This
      resulted in large errors being corrected slowly, while small errors
      were corrected quickly. With NOHZ the timekeeping code doesn't know
      how far out the next tick will be, so this results in bad
      over-correction of small errors, and insufficient correction of large
      errors.
      
      Inspired by Miroslav's patch, I've refactored the code to try to
      address the correction in two steps.
      
      1) Check the future freq error for the next tick, and if the frequency
      error is large, try to make sure we correct it so it doesn't cause
      much accumulated error.
      
      2) Then make a small single unit adjustment to correct any cumulative
      error that has collected over time.
      
      This method performs fairly well in the simulator Miroslav created.
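
      A hedged pseudocode sketch of the two steps above (function names
      follow the surrounding commits, but the exact arguments are
      illustrative):

          static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
          {
                  /* 1) Correct the frequency error expected for the next tick */
                  timekeeping_freqadjust(tk, offset);

                  /* 2) Nudge mult by a single unit against the leftover
                   *    cumulative error
                   */
                  if (tk->ntp_error > 0)
                          timekeeping_apply_adjustment(tk, offset, /* +1 unit */ 0);
                  else if (tk->ntp_error < 0)
                          timekeeping_apply_adjustment(tk, offset, /* -1 unit */ 1);
          }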
      
      Major credit to Miroslav for pointing out the issue, providing the
      original patch to resolve this, a simulator for testing, as well as
      helping debug and resolve issues in my implementation so that it
      performed closer to his original implementation.
      
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Minor fixup for timespec64->timespec assignment · e2dff1ec
      Committed by John Stultz
      In the GENERIC_TIME_VSYSCALL_OLD update_vsyscall implementation,
      we take the tk_xtime() value, which returns a timespec64, and
      store it in a timespec.
      
      This luckily is ok, since the only architectures that use
      GENERIC_TIME_VSYSCALL_OLD are ia64 and ppc64, which are both
      64 bit systems where timespec64 is the same as a timespec.
      
      Even so, for cleanliness reasons, use the conversion function
      to assign the proper type.
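
      A minimal illustration of the change (assumed shape):

          /* explicit conversion instead of relying on identical layout */
          xt = timespec64_to_timespec(tk_xtime(tk));
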
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC · 4396e058
      Committed by Thomas Gleixner
      Tracers want a correlated time between the kernel instrumentation and
      user space. We really do not want to export sched_clock() to user
      space, so we need to provide something sensible for this.
      
      Using separate data structures with a non-blocking sequence-count-based
      update mechanism allows us to do that. The data structure
      required for the readout has a sequence counter and two copies of the
      timekeeping data.
      
      On the update side:
      
        smp_wmb();
        tkf->seq++;
        smp_wmb();
        update(tkf->base[0], tk);
        smp_wmb();
        tkf->seq++;
        smp_wmb();
        update(tkf->base[1], tk);
      
      On the reader side:
      
        do {
           seq = tkf->seq;
           smp_rmb();
           idx = seq & 0x01;
           now = now(tkf->base[idx]);
           smp_rmb();
        } while (seq != tkf->seq)
      
      So if an NMI hits the update of base[0], it will use base[1], which is
      still consistent, but this timestamp is not guaranteed to be monotonic
      across an update.
      
      The timestamp is calculated by:
      
      	now = base_mono + clock_delta * slope
      
      So if the update lowers the slope, readers who are forced to the
      not yet updated second array are still using the old steeper slope.
      
       tmono
       ^
       |    o  n
       |   o n
       |  u
       | o
       |o
       |12345678---> reader order
      
       o = old slope
       u = update
       n = new slope
      
      So reader 6 will observe time going backwards versus reader 5.
      
      While other CPUs are likely to be able to observe that, the only way
      for a CPU-local observation is when an NMI hits in the middle of
      the update. Timestamps taken from that NMI context might be ahead
      of the following timestamps. Callers need to be aware of that and
      deal with it.
      
      V2: Got rid of clock monotonic raw and reorganized the data
          structures. Folded in the barrier fix from Mathieu.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Use tk_read_base as argument for timekeeping_get_ns() · 0e5ac3a8
      Committed by Thomas Gleixner
      All the function needs is in the tk_read_base struct. No functional
      change for the current code; this is just a preparatory patch for the
      NMI-safe accessor to clock monotonic, which will use struct
      tk_read_base as well.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Create struct tk_read_base and use it in struct timekeeper · d28ede83
      Committed by Thomas Gleixner
      The members of the new struct are the ones required for the new
      NMI-safe accessor to clock monotonic. In order to reuse the existing
      timekeeping code and to make the update of the fast NMI-safe
      timekeepers a simple memcpy, use the struct for the timekeeper as well
      and convert all users.
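
      A hedged sketch of the struct (assumed shape, following the commit):

          struct tk_read_base {
                  struct clocksource      *clock;
                  cycle_t                 (*read)(struct clocksource *cs);
                  cycle_t                 mask;
                  cycle_t                 cycle_last;
                  u32                     mult;
                  u32                     shift;
          };
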
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Restructure the timekeeper some more · 6d3aadf3
      Committed by Thomas Gleixner
      Access to time requires touching two cachelines at minimum:
      
         1) The timekeeper data structure
      
         2) The clocksource data structure
      
      The access to the clocksource data structure can be avoided as almost
      all clocksource implementations ignore the argument to the read
      callback, which is a pointer to the clocksource.
      
      But the core needs to touch it to access the members @read and @mask.
      
      So we are better off by copying the @read function pointer and the
      @mask from the clocksource to the core data structure itself.
      
      For the most used ktime_get() access all required data including the
      @read and @mask copies fits together with the sequence counter into a
      single 64 byte cacheline.
      
      For the other time access functions we touch three cachelines in the
      worst case with the current code. But with the clocksource data copies
      we can reduce that to two adjacent cachelines, which is more efficient
      than disjoint cachelines.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • clocksource: Get rid of cycle_last · 4a0e6377
      Committed by Thomas Gleixner
      cycle_last was added to the clocksource to support the TSC
      validation. We moved that to the core code, so we can get rid of the
      extra copy.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • clocksource: Move cycle_last validation to core code · 09ec5442
      Committed by Thomas Gleixner
      The only user of the cycle_last validation is the x86 TSC. In order to
      provide NMI safe accessor functions for clock monotonic and
      monotonic_raw we need to do that in the core.
      
      We can't do the TSC specific
      
          if (now < cycle_last)
                  now = cycle_last;
      
      for the other wrapping-around clocksources. But TSC has
      CLOCKSOURCE_MASK(64), which actually does not mask out anything, so if
      now is less than cycle_last the subtraction will give a negative
      result. We can therefore check for that in clocksource_delta() and
      return 0 for that case.
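
      A hedged sketch of the helper (assumed shape):

          static inline cycle_t clocksource_delta(cycle_t now, cycle_t last,
                                                  cycle_t mask)
          {
                  cycle_t ret = (now - last) & mask;

                  /* With a full 64-bit mask, a "negative" delta shows up
                   * with the top bit set; report 0 instead. */
                  return (s64) ret > 0 ? ret : 0;
          }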
      
      Implement and enable it for x86.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • clocksource: Make delta calculation a function · 3a978377
      Committed by Thomas Gleixner
      We want to move the TSC sanity check into core code to make NMI-safe
      accessors to clock monotonic[_raw] possible. For this we need to
      sanity-check the delta calculation. Create a helper function and
      convert all call sites to use it.
      
      [ Build fix from jstultz ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Provide ktime_get_raw() · f519b1a2
      Committed by Thomas Gleixner
      Provide a ktime_t based interface for raw monotonic time.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Simplify timekeeping_clocktai() · 61edec81
      Committed by Thomas Gleixner
      timekeeping_clocktai() is not used in fast paths, so the extra
      timespec conversion is not problematic.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Remove timekeeper.total_sleep_time · 47da70d3
      Committed by Thomas Gleixner
      No more users. Remove it.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Simplify getboottime() · 02cba159
      Committed by Thomas Gleixner
      Subtracting plain nsec values and converting to a timespec is simpler
      than the whole timespec math. This is not really fastpath code, so the
      division is not an issue.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>