1. 05 10月, 2009 1 次提交
    • J
      time: Implement logarithmic time accumulation · a092ff0f
      john stultz 提交于
      Accumulating one tick at a time works well unless we're using NOHZ.
      Then it can be an issue, since we may have to run through the loop
      a few thousand times, which can increase timer interrupt caused
      latency.
      
      The current solution was to accumulate in half-second intervals
      with NOHZ. This kept the number of loops down, however it did
      slightly change how we make NTP adjustments. While not an issue
      with NTPd users, as NTPd makes adjustments over a longer period of
      time, other adjtimex() users have noticed the half-second
      granularity with which we can apply frequency changes to the clock.
      
      For instance, if a application tries to apply a 100ppm frequency
      correction for 20ms to correct a 2us offset, with NOHZ they either
      get no correction, or a 50us correction.
      
      Now, there will always be some granularity error for applying
      frequency corrections. However with users sensitive to this error
      have seen a 50-500x increase with NOHZ compared to running without
      NOHZ.
      
      So I figured I'd try another approach then just simply increasing
      the interval. My approach is to consume the time interval
      logarithmically. This reduces the number of times through the loop
      needed keeping latency down, while still preserving the original
      granularity error for adjtimex() changes.
      
      Further, this change allows us to remove the xtime_cache code
      (patch to follow), as xtime is always within one tick of the
      current time, instead of the half-second updates it saw before.
      
      An earlier version of this patch has been shipping to x86 users in
      the RedHat MRG releases for awhile without issue, but I've reworked
      this version to be even more careful about avoiding possible
      overflows if the shift value gets too large.
      Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJohn Kacur <jkacur@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1254525473.7741.88.camel@localhost.localdomain>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a092ff0f
  2. 25 8月, 2009 1 次提交
  3. 22 8月, 2009 1 次提交
    • J
      time: Introduce CLOCK_REALTIME_COARSE · da15cfda
      john stultz 提交于
      After talking with some application writers who want very fast, but not
      fine-grained timestamps, I decided to try to implement new clock_ids
      to clock_gettime(): CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE
      which returns the time at the last tick. This is very fast as we don't
      have to access any hardware (which can be very painful if you're using
      something like the acpi_pm clocksource), and we can even use the vdso
      clock_gettime() method to avoid the syscall. The only trade off is you
      only get low-res tick grained time resolution.
      
      This isn't a new idea, I know Ingo has a patch in the -rt tree that made
      the vsyscall gettimeofday() return coarse grained time when the
      vsyscall64 sysctrl was set to 2. However this affects all applications
      on a system.
      
      With this method, applications can choose the proper speed/granularity
      trade-off for themselves.
      Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: nikolag@ca.ibm.com
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: arjan@infradead.org
      Cc: jonathan@jonmasters.org
      LKML-Reference: <1250734414.6897.5.camel@localhost.localdomain>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      da15cfda
  4. 15 8月, 2009 11 次提交
  5. 07 7月, 2009 2 次提交
    • T
      timekeeping: Move ktime_get() functions to timekeeping.c · a40f262c
      Thomas Gleixner 提交于
      The ktime_get() functions for GENERIC_TIME=n are still located in
      hrtimer.c. Move them to time/timekeeping.c where they belong.
      
      LKML-Reference: <new-submission>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      a40f262c
    • M
      timekeeping: optimized ktime_get[_ts] for GENERIC_TIME=y · 951ed4d3
      Martin Schwidefsky 提交于
      The generic ktime_get function defined in kernel/hrtimer.c is suboptimial
      for GENERIC_TIME=y:
      
       0)               |  ktime_get() {
       0)               |    ktime_get_ts() {
       0)               |      getnstimeofday() {
       0)               |        read_tod_clock() {
       0)   0.601 us    |        }
       0)   1.938 us    |      }
       0)               |      set_normalized_timespec() {
       0)   0.602 us    |      }
       0)   4.375 us    |    }
       0)   5.523 us    |  }
      
      Overall there are two read_seqbegin/read_seqretry loops and a lot of
      unnecessary struct timespec calculations. ktime_get returns a nano second
      value which is the sum of xtime, wall_to_monotonic and the nano second
      delta from the clock source.
      
      ktime_get can be optimized for GENERIC_TIME=y. The new version only calls
      clocksource_read:
      
       0)               |  ktime_get() {
       0)               |    read_tod_clock() {
       0)   0.610 us    |    }
       0)   1.977 us    |  }
      
      It uses a single read_seqbegin/readseqretry loop and just adds everthing
      to a nano second value.
      
      ktime_get_ts is optimized in a similar fashion.
      
      [ tglx: added WARN_ON(timekeeping_suspended) as in getnstimeofday() ]
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: Njohn stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090707112728.3005244d@skybase>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      951ed4d3
  6. 15 5月, 2009 1 次提交
    • T
      sched, timers: move calc_load() to scheduler · dce48a84
      Thomas Gleixner 提交于
      Dimitri Sivanich noticed that xtime_lock is held write locked across
      calc_load() which iterates over all online CPUs. That can cause long
      latencies for xtime_lock readers on large SMP systems. 
      
      The load average calculation is an rough estimate anyway so there is
      no real need to protect the readers vs. the update. It's not a problem
      when the avenrun array is updated while a reader copies the values.
      
      Instead of iterating over all online CPUs let the scheduler_tick code
      update the number of active tasks shortly before the avenrun update
      happens. The avenrun update itself is handled by the CPU which calls
      do_timer().
      
      [ Impact: reduce xtime_lock write locked section ]
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      dce48a84
  7. 02 5月, 2009 1 次提交
  8. 22 4月, 2009 1 次提交
  9. 31 12月, 2008 1 次提交
    • T
      sched_clock: prevent scd->clock from moving backwards, take #2 · 1c5745aa
      Thomas Gleixner 提交于
      Redo:
      
        5b7dba4f: sched_clock: prevent scd->clock from moving backwards
      
      which had to be reverted due to s2ram hangs:
      
        ca7e716c: Revert "sched_clock: prevent scd->clock from moving backwards"
      
      ... this time with resume restoring GTOD later in the sequence
      taken into account as well.
      
      The "timekeeping_suspended" flag is not very nice but we cannot call into
      GTOD before it has been properly resumed and the scheduler will run very
      early in the resume sequence.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1c5745aa
  10. 04 12月, 2008 1 次提交
    • J
      time: catch xtime_nsec underflows and fix them · 6c9bacb4
      john stultz 提交于
      Impact: fix time warp bug
      
      Alex Shi, along with Yanmin Zhang have been noticing occasional time
      inconsistencies recently. Through their great diagnosis, they found that
      the xtime_nsec value used in update_wall_time was occasionally going
      negative. After looking through the code for awhile, I realized we have
      the possibility for an underflow when three conditions are met in
      update_wall_time():
      
        1) We have accumulated a second's worth of nanoseconds, so we
           incremented xtime.tv_sec and appropriately decrement xtime_nsec.
           (This doesn't cause xtime_nsec to go negative, but it can cause it
            to be small).
      
        2) The remaining offset value is large, but just slightly less then
           cycle_interval.
      
        3) clocksource_adjust() is speeding up the clock, causing a
           corrective amount (compensating for the increase in the multiplier
           being multiplied against the unaccumulated offset value) to be
           subtracted from xtime_nsec.
      
      This can cause xtime_nsec to underflow.
      
      Unfortunately, since we notify the NTP subsystem via second_overflow()
      whenever we accumulate a full second, and this effects the error
      accumulation that has already occured, we cannot simply revert the
      accumulated second from xtime nor move the second accumulation to after
      the clocksource_adjust call without a change in behavior.
      
      This leaves us with (at least) two options:
      
      1) Simply return from clocksource_adjust() without making a change if we
         notice the adjustment would cause xtime_nsec to go negative.
      
      This would work, but I'm concerned that if a large adjustment was needed
      (due to the error being large), it may be possible to get stuck with an
      ever increasing error that becomes too large to correct (since it may
      always force xtime_nsec negative). This may just be paranoia on my part.
      
      2) Catch xtime_nsec if it is negative, then add back the amount its
         negative to both xtime_nsec and the error.
      
      This second method is consistent with how we've handled earlier rounding
      issues, and also has the benefit that the error being added is always in
      the oposite direction also always equal or smaller then the correction
      being applied. So the risk of a corner case where things get out of
      control is lessened.
      
      This patch fixes bug 11970, as tested by Yanmin Zhang
      http://bugzilla.kernel.org/show_bug.cgi?id=11970
      
      Reported-by: alex.shi@intel.com
      Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
      Acked-by: N"Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Tested-by: N"Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6c9bacb4
  11. 24 9月, 2008 1 次提交
    • R
      timekeeping: fix rounding problem during clock update · 5cd1c9c5
      Roman Zippel 提交于
      Due to a rounding problem during a clock update it's possible for readers
      to observe the clock jumping back by 1nsec.  The following simplified
      example demonstrates the problem:
      
      cycle	xtime
      0	0
      1000	999999.6
      2000	1999999.2
      3000	2999998.8
      ...
      
      1500 =	1499999.4
      =	0.0 + 1499999.4
      =	999999.6 + 499999.8
      
      When reading the clock only the full nanosecond part is used, while
      timekeeping internally keeps nanosecond fractions.  If the clock is now
      updated at cycle 1500 here, a nanosecond is missing due to the truncation.
      
      The simple fix is to round up the xtime value during the update, this also
      changes the distance to the reference time, but the adjustment will
      automatically take care that it stays under control.
      Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
      Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      5cd1c9c5
  12. 21 8月, 2008 2 次提交
    • J
      clocksource: introduce CLOCK_MONOTONIC_RAW · 2d42244a
      John Stultz 提交于
      In talking with Josip Loncaric, and his work on clock synchronization (see
      btime.sf.net), he mentioned that for really close synchronization, it is
      useful to have access to "hardware time", that is a notion of time that is
      not in any way adjusted by the clock slewing done to keep close time sync.
      
      Part of the issue is if we are using the kernel's ntp adjusted
      representation of time in order to measure how we should correct time, we
      can run into what Paul McKenney aptly described as "Painting a road using
      the lines we're painting as the guide".
      
      I had been thinking of a similar problem, and was trying to come up with a
      way to give users access to a purely hardware based time representation
      that avoided users having to know the underlying frequency and mask values
      needed to deal with the wide variety of possible underlying hardware
      counters.
      
      My solution is to introduce CLOCK_MONOTONIC_RAW.  This exposes a
      nanosecond based time value, that increments starting at bootup and has no
      frequency adjustments made to it what so ever.
      
      The time is accessed from userspace via the posix_clock_gettime() syscall,
      passing CLOCK_MONOTONIC_RAW as the clock_id.
      Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
      Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2d42244a
    • R
      clocksource: introduce clocksource_forward_now() · 9a055117
      Roman Zippel 提交于
      To keep the raw monotonic patch simple first introduce
      clocksource_forward_now(), which takes care of the offset since the last
      update_wall_time() call and adds it to the clock, so there is no need
      anymore to deal with it explicitly at various places, which need to make
      significant changes to the clock.
      
      This is also gets rid of the timekeeping_suspend_nsecs, instead of
      waiting until resume, the value is accumulated during suspend. In the end
      there is only a single user of __get_nsec_offset() left, so I integrated
      it back to getnstimeofday().
      Signed-off-by: NRoman Zippel <zippel@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9a055117
  13. 01 5月, 2008 3 次提交
  14. 20 4月, 2008 1 次提交
    • T
      x86: tsc prevent time going backwards · d8bb6f4c
      Thomas Gleixner 提交于
      We already catch most of the TSC problems by sanity checks, but there
      is a subtle bug which has been in the code forever. This can cause
      time jumps in the range of hours.
      
      This was reported in:
           http://lkml.org/lkml/2007/8/23/96
      and
           http://lkml.org/lkml/2008/3/31/23
      
      I was able to reproduce the problem with a gettimeofday loop test on a
      dual core and a quad core machine which both have sychronized
      TSCs. The TSCs seems not to be perfectly in sync though, but the
      kernel is not able to detect the slight delta in the sync check. Still
      there exists an extremly small window where this delta can be observed
      with a real big time jump. So far I was only able to reproduce this
      with the vsyscall gettimeofday implementation, but in theory this
      might be observable with the syscall based version as well.
      
      CPU 0 updates the clock source variables under xtime/vyscall lock and
      CPU1, where the TSC is slighty behind CPU0, is reading the time right
      after the seqlock was unlocked.
      
      The clocksource reference data was updated with the TSC from CPU0 and
      the value which is read from TSC on CPU1 is less than the reference
      data. This results in a huge delta value due to the unsigned
      subtraction of the TSC value and the reference value. This algorithm
      can not be changed due to the support of wrapping clock sources like
      pm timer.
      
      The huge delta is converted to nanoseconds and added to xtime, which
      is then observable by the caller. The next gettimeofday call on CPU1
      will show the correct time again as now the TSC has advanced above the
      reference value.
      
      To prevent this TSC specific wreckage we need to compare the TSC value
      against the reference value and return the latter when it is larger
      than the actual TSC value.
      
      I pondered to mark the TSC unstable when the readout is smaller than
      the reference value, but this would render an otherwise good and fast
      clocksource unusable without a real good reason.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d8bb6f4c
  15. 25 3月, 2008 1 次提交
  16. 09 3月, 2008 1 次提交
  17. 09 2月, 2008 2 次提交
  18. 02 2月, 2008 1 次提交
  19. 30 1月, 2008 2 次提交
    • J
      NTP: correct inconsistent ntp interval/tick_length usage · bbe4d18a
      john stultz 提交于
      I recently noticed on one of my boxes that when synched with an NTP
      server, the drift value reported for the system was ~283ppm. While in
      some cases, clock hardware can be that bad, it struck me as unusual as
      the system was using the acpi_pm clocksource, which is one of the more
      trustworthy and accurate clocksources on x86 hardware.
      
      I brought up another system and let it sync to the same NTP server, and
      I noticed a similar 280some ppm drift.
      
      In looking at the code, I found that the acpi_pm's constant frequency
      was being computed correctly at boot-up, however once the system was up,
      even without the ntp daemon running, the clocksource's frequency was
      being modified by the clocksource_adjust() function.
      
      Digging deeper, I realized that in the code that keeps track of how much
      the clocksource is skewing from the ntp desired time, we were using
      different lengths to establish how long an time interval was.
      
      The clocksource was being setup with the following interval:
      	NTP_INTERVAL_LENGTH = NSEC_PER_SEC/NTP_INTERVAL_FREQ
      
      While the ntp code was using the tick_length_base value:
      	tick_length_base ~= (tick_usec * NSEC_PER_USEC * USER_HZ)
      					/NTP_INTERVAL_FREQ
      
      The subtle difference is:
      	(tick_usec * NSEC_PER_USEC * USER_HZ) != NSEC_PER_SEC
      
      This difference in calculation was causing the clocksource correction
      code to apply a correction factor to the clocksource so the two
      intervals were the same, however this results in the actual frequency of
      the clocksource to be made incorrect. I believe this difference would
      affect all clocksources, although to differing degrees depending on the
      clocksource resolution.
      
      The issue was introduced when my HZ free ntp patch landed in 2.6.21-rc1,
      so my apologies for the mistake, and for not noticing it until now.
      
      The following patch, corrects the clocksource's initialization code so
      it uses the same interval length as the code in ntp.c. After applying
      this patch, the drift value for the same system went from ~283ppm to
      only 2.635ppm.
      
      I believe this patch to be good, however it does affect all arches and
      I've only tested on x86, so some caution is advised. I do think it would
      be a likely candidate for a stable 2.6.24.x release.
      
      Any thoughts or feedback would be appreciated.
      Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      bbe4d18a
    • G
      time: fold __get_realtime_clock_ts() into getnstimeofday() · efd9ac86
      Geert Uytterhoeven 提交于
        - getnstimeofday() was just a wrapper around __get_realtime_clock_ts()
        - Replace calls to __get_realtime_clock_ts() by calls to getnstimeofday()
        - Fix bogus reference to get_realtime_clock_ts(), which never existed
      Signed-off-by: NGeert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      efd9ac86
  20. 25 1月, 2008 1 次提交
  21. 17 10月, 2007 2 次提交
    • A
      kernel/time/timekeeping.c: cleanups · ba2a631b
      Adrian Bunk 提交于
      - remove the no longer required __attribute__((weak)) of xtime_lock
      - remove the following no longer used EXPORT_SYMBOL's:
        - xtime
        - xtime_lock
      Signed-off-by: NAdrian Bunk <bunk@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba2a631b
    • I
      time: introduce xtime_seconds · f20bf612
      Ingo Molnar 提交于
      improve performance of sys_time(). sys_time() returns time in seconds,
      but it does so by calling do_gettimeofday() and then returning the
      tv_sec portion of the GTOD time. But the data structure "xtime", which
      is updated by every timer/scheduler tick, already offers HZ granularity
      time.
      
      the patch improves the sysbench oltp macrobenchmark by 4-5% on an AMD
      dual-core system:
      
      v2.6.23:
      
      #threads
      
         1:     transactions:                        4073   (407.23 per sec.)
         2:     transactions:                        8530   (852.81 per sec.)
         3:     transactions:                        8321   (831.88 per sec.)
         4:     transactions:                        8407   (840.58 per sec.)
         5:     transactions:                        8070   (806.74 per sec.)
      
      v2.6.23 + sys_time-speedup.patch:
      
         1:     transactions:                        4281   (428.09 per sec.)
         2:     transactions:                        8910   (890.85 per sec.)
         3:     transactions:                        8659   (865.79 per sec.)
         4:     transactions:                        8676   (867.34 per sec.)
         5:     transactions:                        8532   (852.91 per sec.)
      
      and by 4-5% on an Intel dual-core system too:
      
      2.6.23:
      
        1:     transactions:                        4560   (455.94 per sec.)
        2:     transactions:                        10094  (1009.30 per sec.)
        3:     transactions:                        9755   (975.36 per sec.)
        4:     transactions:                        9859   (985.78 per sec.)
        5:     transactions:                        9701   (969.72 per sec.)
      
      2.6.23 + sys_time-speedup.patch:
      
        1:     transactions:                        4779   (477.84 per sec.)
        2:     transactions:                        10103  (1010.14 per sec.)
        3:     transactions:                        10141  (1013.93 per sec.)
        4:     transactions:                        10371  (1036.89 per sec.)
        5:     transactions:                        10178  (1017.50 per sec.)
      
      (the more CPUs the system has, the more speedup this patch gives for
      this particular workload.)
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f20bf612
  22. 16 9月, 2007 2 次提交
    • T
      timekeeping: Prevent time going backwards on resume · 6a669ee8
      Thomas Gleixner 提交于
      Timekeeping resume adjusts xtime by adding the slept time in seconds and
      resets the reference value of the clock source (clock->cycle_last).
      clock->cycle last is used to calculate the delta between the last xtime
      update and the readout of the clock source in __get_nsec_offset(). xtime
      plus the offset is the current time. The resume code ignores the delta
      which had already elapsed between the last xtime update and the actual
      time of suspend. If the suspend time is short, then we can see time
      going backwards on resume.
      
      Suspend:
      offs_s = clock->read() - clock->cycle_last;
      now = xtime + offs_s;
      timekeeping_suspend_time = read_rtc();
      
      Resume:
      sleep_time = read_rtc() - timekeeping_suspend_time;
      xtime.tv_sec += sleep_time;
      clock->cycle_last = clock->read();
      offs_r = clock->read() - clock->cycle_last;
      now = xtime + offs_r;
      
      if sleep_time_seconds == 0 and offs_r < offs_s, then time goes
      backwards.
      
      Fix this by storing the offset from the last xtime update and add it to
      xtime during resume, when we reset clock->cycle_last:
      
      sleep_time = read_rtc() - timekeeping_suspend_time;
      xtime.tv_sec += sleep_time;
      xtime += offs_s;	/* Fixup xtime offset at suspend time */
      clock->cycle_last = clock->read();
      offs_r = clock->read() - clock->cycle_last;
      now = xtime + offs_r;
      
      Thanks to Marcelo for tracking this down on the OLPC and providing the
      necessary details to analyze the root cause.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <johnstul@us.ibm.com>
      Cc: Tosatti <marcelo@kvack.org>
      6a669ee8
    • T
      timekeeping: access rtc outside of xtime lock · 3be90950
      Thomas Gleixner 提交于
      Lockdep complains about the access of rtc in timekeeping_suspend
      inside the interrupt disabled region of the write locked xtime lock.
      Move the access outside.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <johnstul@us.ibm.com>
      3be90950