1. 20 6月, 2017 2 次提交
    • J
      time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting · 3d88d56c
      John Stultz 提交于
      Due to how the MONOTONIC_RAW accumulation logic was handled,
      there is the potential for a 1ns discontinuity when we do
      accumulations. This small discontinuity has for the most part
      gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
      in their vDSO clock_gettime implementation, we've seen failures
      with the inconsistency-check test in kselftest.
      
      This patch addresses the issue by using the same sub-ns
      accumulation handling that CLOCK_MONOTONIC uses, which avoids
      the issue for in-kernel users.
      
      Since the ARM64 vDSO implementation has its own clock_gettime
      calculation logic, this patch reduces the frequency of errors,
      but failures are still seen. The ARM64 vDSO will need to be
      updated to include the sub-nanosecond xtime_nsec values in its
      calculation for this issue to be completely fixed.
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Tested-by: NDaniel Mentz <danielmentz@google.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "stable #4 . 8+" <stable@vger.kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      3d88d56c
    • J
      time: Fix clock->read(clock) race around clocksource changes · ceea5e37
      John Stultz 提交于
      In tests, which excercise switching of clocksources, a NULL
      pointer dereference can be observed on AMR64 platforms in the
      clocksource read() function:
      
      u64 clocksource_mmio_readl_down(struct clocksource *c)
      {
      	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
      }
      
      This is called from the core timekeeping code via:
      
      	cycle_now = tkr->read(tkr->clock);
      
      tkr->read is the cached tkr->clock->read() function pointer.
      When the clocksource is changed then tkr->clock and tkr->read
      are updated sequentially. The code above results in a sequential
      load operation of tkr->read and tkr->clock as well.
      
      If the store to tkr->clock hits between the loads of tkr->read
      and tkr->clock, then the old read() function is called with the
      new clock pointer. As a consequence the read() function
      dereferences a different data structure and the resulting 'reg'
      pointer can point anywhere including NULL.
      
      This problem was introduced when the timekeeping code was
      switched over to use struct tk_read_base. Before that, it was
      theoretically possible as well when the compiler decided to
      reload clock in the code sequence:
      
           now = tk->clock->read(tk->clock);
      
      Add a helper function which avoids the issue by reading
      tk_read_base->clock once into a local variable clk and then issue
      the read function via clk->read(clk). This guarantees that the
      read() function always gets the proper clocksource pointer handed
      in.
      
      Since there is now no use for the tkr.read pointer, this patch
      also removes it, and to address stopping the fast timekeeper
      during suspend/resume, it introduces a dummy clocksource to use
      rather then just a dummy read function.
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: stable <stable@vger.kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      ceea5e37
  2. 31 3月, 2017 1 次提交
  3. 02 3月, 2017 2 次提交
  4. 07 1月, 2017 1 次提交
  5. 26 12月, 2016 1 次提交
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  6. 25 12月, 2016 1 次提交
  7. 09 12月, 2016 4 次提交
    • T
      timekeeping: Use mul_u64_u32_shr() instead of open coding it · c029a2be
      Thomas Gleixner 提交于
      The resume code must deal with a clocksource delta which is potentially big
      enough to overflow the 64bit mult.
      
      Replace the open coded handling with the proper function.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Parit Bhargava <prarit@redhat.com>
      Cc: Laurent Vivier <lvivier@redhat.com>
      Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Liav Rehana <liavr@mellanox.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/20161208204228.921674404@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c029a2be
    • T
      timekeeping: Get rid of pointless typecasts · cbd99e3b
      Thomas Gleixner 提交于
      cycle_t is defined as u64, so casting it to u64 is a pointless and
      confusing exercise. cycle_t should simply go away and be replaced with a
      plain u64 to avoid further confusion.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Parit Bhargava <prarit@redhat.com>
      Cc: Laurent Vivier <lvivier@redhat.com>
      Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Liav Rehana <liavr@mellanox.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/20161208204228.844699737@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      cbd99e3b
    • T
      timekeeping: Make the conversion call chain consistently unsigned · acc89612
      Thomas Gleixner 提交于
      Propagating a unsigned value through signed variables and functions makes
      absolutely no sense and is just prone to (re)introduce subtle signed
      vs. unsigned issues as happened recently.
      
      Clean it up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Parit Bhargava <prarit@redhat.com>
      Cc: Laurent Vivier <lvivier@redhat.com>
      Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Liav Rehana <liavr@mellanox.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/20161208204228.765843099@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      acc89612
    • T
      timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion · 9c164572
      Thomas Gleixner 提交于
      The clocksource delta to nanoseconds conversion is using signed math, but
      the delta is unsigned. This makes the conversion space smaller than
      necessary and in case of a multiplication overflow the conversion can
      become negative. The conversion is done with scaled math:
      
          s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;
      
      Shifting a signed integer right obvioulsy preserves the sign, which has
      interesting consequences:
       
       - Time jumps backwards
       
       - __iter_div_u64_rem() which is used in one of the calling code pathes
         will take forever to piecewise calculate the seconds/nanoseconds part.
      
      This has been reported by several people with different scenarios:
      
      David observed that when stopping a VM with a debugger:
      
       "It was essentially the stopped by debugger case.  I forget exactly why,
        but the guest was being explicitly stopped from outside, it wasn't just
        scheduling lag.  I think it was something in the vicinity of 10 minutes
        stopped."
      
       When lifting the stop the machine went dead.
      
      The stopped by debugger case is not really interesting, but nevertheless it
      would be a good thing not to die completely.
      
      But this was also observed on a live system by Liav:
      
       "When the OS is too overloaded, delta will get a high enough value for the
        msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
        after the shift the nsec variable will gain a value similar to
        0xffffffffff000000."
      
      Unfortunately this has been reintroduced recently with commit 6bd58f09
      ("time: Add cycles to nanoseconds translation"). It had been fixed a year
      ago already in commit 35a4933a ("time: Avoid signed overflow in
      timekeeping_get_ns()").
      
      Though it's not surprising that the issue has been reintroduced because the
      function itself and the whole call chain uses s64 for the result and the
      propagation of it. The change in this recent commit is subtle:
      
         s64 nsec;
      
      -  nsec = (d * m + n) >> s:
      +  nsec = d * m + n;
      +  nsec >>= s;
      
      d being type of cycle_t adds another level of obfuscation.
      
      This wouldn't have happened if the previous change to unsigned computation
      would have made the 'nsec' variable u64 right away and a follow up patch
      had cleaned up the whole call chain.
      
      There have been patches submitted which basically did a revert of the above
      patch leaving everything else unchanged as signed. Back to square one. This
      spawned a admittedly pointless discussion about potential users which rely
      on the unsigned behaviour until someone pointed out that it had been fixed
      before. The changelogs of said patches added further confusion as they made
      finally false claims about the consequences for eventual users which expect
      signed results.
      
      Despite delta being cycle_t, aka. u64, it's very well possible to hand in
      a signed negative value and the signed computation will happily return the
      correct result. But nobody actually sat down and analyzed the code which
      was added as user after the propably unintended signed conversion.
      
      Though in sensitive code like this it's better to analyze it proper and
      make sure that nothing relies on this than hunting the subtle wreckage half
      a year later. After analyzing all call chains it stands that no caller can
      hand in a negative value (which actually would work due to the s64 cast)
      and rely on the signed math to do the right thing.
      
      Change the conversion function to unsigned math. The conversion of all call
      chains is done in a follow up patch.
      
      This solves the starvation issue, which was caused by the negative result,
      but it does not solve the underlying problem. It merily procrastinates
      it. When the timekeeper update is deferred long enough that the unsigned
      multiplication overflows, then time going backwards is observable again.
      
      It does neither solve the issue of clocksources with a small counter width
      which will wrap around possibly several times and cause random time stamps
      to be generated. But those are usually not found on systems used for
      virtualization, so this is likely a non issue.
      
      I took the liberty to claim authorship for this simply because
      analyzing all callsites and writing the changelog took substantially
      more time than just making the simple s/s64/u64/ change and ignore the
      rest.
      
      Fixes: 6bd58f09 ("time: Add cycles to nanoseconds translation")
      Reported-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reported-by: NLiav Rehana <liavr@mellanox.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Parit Bhargava <prarit@redhat.com>
      Cc: Laurent Vivier <lvivier@redhat.com>
      Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      9c164572
  8. 30 11月, 2016 1 次提交
    • J
      timekeeping: Add a fast and NMI safe boot clock · 948a5312
      Joel Fernandes 提交于
      This boot clock can be used as a tracing clock and will account for
      suspend time.
      
      To keep it NMI safe since we're accessing from tracing, we're not using a
      separate timekeeper with updates to monotonic clock and boot offset
      protected with seqlocks. This has the following minor side effects:
      
      (1) Its possible that a timestamp be taken after the boot offset is updated
      but before the timekeeper is updated. If this happens, the new boot offset
      is added to the old timekeeping making the clock appear to update slightly
      earlier:
         CPU 0                                        CPU 1
         timekeeping_inject_sleeptime64()
         __timekeeping_inject_sleeptime(tk, delta);
                                                      timestamp();
         timekeeping_update(tk, TK_CLEAR_NTP...);
      
      (2) On 32-bit systems, the 64-bit boot offset (tk->offs_boot) may be
      partially updated.  Since the tk->offs_boot update is a rare event, this
      should be a rare occurrence which postprocessing should be able to handle.
      Signed-off-by: NJoel Fernandes <joelaf@google.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1480372524-15181-6-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      948a5312
  9. 05 10月, 2016 1 次提交
    • J
      timekeeping: Fix __ktime_get_fast_ns() regression · 58bfea95
      John Stultz 提交于
      In commit 27727df2 ("Avoid taking lock in NMI path with
      CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
      the timekeeping_get_ns() function, but I forgot to include
      the unit conversion from cycles to nanoseconds, breaking the
      function's output, which impacts users like perf.
      
      This results in bogus perf timestamps like:
       swapper     0 [000]   253.427536:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426573:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426687:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426800:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.426905:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427022:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427127:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427239:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427346:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   254.427463:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]   255.426572:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      
      Instead of more reasonable expected timestamps like:
       swapper     0 [000]    39.953768:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.064839:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.175956:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.287103:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.398217:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.509324:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.620437:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.731546:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.842654:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    40.953772:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
       swapper     0 [000]    41.064881:  111111111 cpu-clock:  ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
      
      Add the proper use of timekeeping_delta_to_ns() to convert
      the cycle delta to nanoseconds as needed.
      
      Thanks to Brendan and Alexei for finding this quickly after
      the v4.8 release. Unfortunately the problematic commit has
      landed in some -stable trees so they'll need this fix as
      well.
      
      Many apologies for this mistake. I'll be looking to add a
      perf-clock sanity test to the kselftest timers tests soon.
      
      Fixes: 27727df2 "timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"
      Reported-by: NBrendan Gregg <bgregg@netflix.com>
      Reported-by: NAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Tested-and-reviewed-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable <stable@vger.kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      58bfea95
  10. 24 8月, 2016 1 次提交
  11. 01 7月, 2016 1 次提交
  12. 21 6月, 2016 1 次提交
  13. 08 3月, 2016 1 次提交
    • I
      time/timekeeping: Work around false positive GCC warning · 6436257b
      Ingo Molnar 提交于
      Newer GCC versions trigger the following warning:
      
        kernel/time/timekeeping.c: In function ‘get_device_system_crosststamp’:
        kernel/time/timekeeping.c:987:5: warning: ‘clock_was_set_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
          if (discontinuity) {
           ^
        kernel/time/timekeeping.c:1045:15: note: ‘clock_was_set_seq’ was declared here
          unsigned int clock_was_set_seq;
                       ^
      
      GCC clearly is unable to recognize that the 'do_interp' boolean tracks
      the initialization status of 'clock_was_set_seq'.
      
      The GCC version used was:
      
        gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)
      
      Work it around by initializing clock_was_set_seq to 0. Compilers that
      are able to recognize the code flow will eliminate the unnecessary
      initialization.
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6436257b
  14. 03 3月, 2016 5 次提交
    • C
      time: Add history to cross timestamp interface supporting slower devices · 2c756feb
      Christopher S. Hall 提交于
      Another representative use case of time sync and the correlated
      clocksource (in addition to PTP noted above) is PTP synchronized
      audio.
      
      In a streaming application, as an example, samples will be sent and/or
      received by multiple devices with a presentation time that is in terms
      of the PTP master clock. Synchronizing the audio output on these
      devices requires correlating the audio clock with the PTP master
      clock. The more precise this correlation is, the better the audio
      quality (i.e. out of sync audio sounds bad).
      
      From an application standpoint, to correlate the PTP master clock with
      the audio device clock, the system clock is used as a intermediate
      timebase. The transforms such an application would perform are:
      
          System Clock <-> Audio clock
          System Clock <-> Network Device Clock [<-> PTP Master Clock]
      
      Modern Intel platforms can perform a more accurate cross timestamp in
      hardware (ART,audio device clock).  The audio driver requires
      ART->system time transforms -- the same as required for the network
      driver. These platforms offload audio processing (including
      cross-timestamps) to a DSP which to ensure uninterrupted audio
      processing, communicates and response to the host only once every
      millsecond. As a result is takes up to a millisecond for the DSP to
      receive a request, the request is processed by the DSP, the audio
      output hardware is polled for completion, the result is copied into
      shared memory, and the host is notified. All of these operation occur
      on a millisecond cadence.  This transaction requires about 2 ms, but
      under heavier workloads it may take up to 4 ms.
      
      Adding a history allows these slow devices the option of providing an
      ART value outside of the current interval. In this case, the callback
      provided is an accessor function for the previously obtained counter
      value. If get_system_device_crosststamp() receives a counter value
      previous to cycle_last, it consults the history provided as an
      argument in history_ref and interpolates the realtime and monotonic
      raw system time using the provided counter value. If there are any
      clock discontinuities, e.g. from calling settimeofday(), the monotonic
      raw time is interpolated in the usual way, but the realtime clock time
      is adjusted by scaling the monotonic raw adjustment.
      
      When an accessor function is used a history argument *must* be
      provided. The history is initialized using ktime_get_snapshot() and
      must be called before the counter values are read.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NChristopher S. Hall <christopher.s.hall@intel.com>
      [jstultz: Fixed up cycles_t/cycle_t type confusion]
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      2c756feb
    • C
      time: Add driver cross timestamp interface for higher precision time synchronization · 8006c245
      Christopher S. Hall 提交于
      ACKNOWLEDGMENT: cross timestamp code was developed by Thomas Gleixner
      <tglx@linutronix.de>. It has changed considerably and any mistakes are
      mine.
      
      The precision with which events on multiple networked systems can be
      synchronized using, as an example, PTP (IEEE 1588, 802.1AS) is limited
      by the precision of the cross timestamps between the system clock and
      the device (timestamp) clock. Precision here is the degree of
      simultaneity when capturing the cross timestamp.
      
      Currently the PTP cross timestamp is captured in software using the
      PTP device driver ioctl PTP_SYS_OFFSET. Reads of the device clock are
      interleaved with reads of the realtime clock. At best, the precision
      of this cross timestamp is on the order of several microseconds due to
      software latencies. Sub-microsecond precision is required for
      industrial control and some media applications. To achieve this level
      of precision hardware supported cross timestamping is needed.
      
      The function get_device_system_crosstimestamp() allows device drivers
      to return a cross timestamp with system time properly scaled to
      nanoseconds.  The realtime value is needed to discipline that clock
      using PTP and the monotonic raw value is used for applications that
      don't require a "real" time, but need an unadjusted clock time.  The
      get_device_system_crosstimestamp() code calls back into the driver to
      ensure that the system counter is within the current timekeeping
      update interval.
      
      Modern Intel hardware provides an Always Running Timer (ART) which is
      exactly related to TSC through a known frequency ratio. The ART is
      routed to devices on the system and is used to precisely and
      simultaneously capture the device clock with the ART.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NChristopher S. Hall <christopher.s.hall@intel.com>
      [jstultz: Reworked to remove extra structures and simplify calling]
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      8006c245
    • C
      time: Remove duplicated code in ktime_get_raw_and_real() · ba26621e
      Christopher S. Hall 提交于
      The code in ktime_get_snapshot() is a superset of the code in
      ktime_get_raw_and_real() code. Further, ktime_get_raw_and_real() is
      called only by the PPS code, pps_get_ts(). Consolidate the
      pps_get_ts() code into a single function calling ktime_get_snapshot()
      and eliminate ktime_get_raw_and_real(). A side effect of this is that
      the raw and real results of pps_get_ts() correspond to exactly the
      same clock cycle. Previously these values represented separate reads
      of the system clock.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NChristopher S. Hall <christopher.s.hall@intel.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      ba26621e
    • C
      time: Add timekeeping snapshot code capturing system time and counter · 9da0f49c
      Christopher S. Hall 提交于
      In the current timekeeping code there isn't any interface to
      atomically capture the current relationship between the system counter
      and system time. ktime_get_snapshot() returns this triple (counter,
      monotonic raw, realtime) in the system_time_snapshot struct.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NChristopher S. Hall <christopher.s.hall@intel.com>
      [jstultz: Moved structure definitions around to clean things up,
       fixed cycles_t/cycle_t confusion.]
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      9da0f49c
    • C
      time: Add cycles to nanoseconds translation · 6bd58f09
      Christopher S. Hall 提交于
      The timekeeping code does not currently provide a way to translate
      externally provided clocksource cycles to system time. The cycle count
      is always provided by the result clocksource read() method internal to
      the timekeeping code. The added function timekeeping_cycles_to_ns()
      calculated a nanosecond value from a cycle count that can be added to
      tk_read_base.base value yielding the current system time. This allows
      clocksource cycle values external to the timekeeping code to provide a
      cycle count that can be transformed to system time.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: kevin.b.stanton@intel.com
      Cc: kevin.j.clarke@intel.com
      Cc: hpa@zytor.com
      Cc: jeffrey.t.kirsher@intel.com
      Cc: netdev@vger.kernel.org
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NChristopher S. Hall <christopher.s.hall@intel.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      6bd58f09
  15. 15 2月, 2016 1 次提交
  16. 17 12月, 2015 2 次提交
    • J
      timekeeping: Cap adjustments so they don't exceed the maxadj value · ec02b076
      John Stultz 提交于
      Thus its been occasionally noted that users have seen
      confusing warnings like:
      
          Adjusting tsc more than 11% (5941981 vs 7759439)
      
      We try to limit the maximum total adjustment to 11% (10% tick
      adjustment + 0.5% frequency adjustment). But this is done by
      bounding the requested adjustment values, and the internal
      steering that is done by tracking the error from what was
      requested and what was applied, does not have any such limits.
      
      This is usually not problematic, but in some cases has a risk
      that an adjustment could cause the clocksource mult value to
      overflow, so its an indication things are outside of what is
      expected.
      
      It ends up most of the reports of this 11% warning are on systems
      using chrony, which utilizes the adjtimex() ADJ_TICK interface
      (which allows a +-10% adjustment). The original rational for
      ADJ_TICK unclear to me but my assumption it was originally added
      to allow broken systems to get a big constant correction at boot
      (see adjtimex userspace package for an example) which would allow
      the system to work w/ ntpd's 0.5% adjustment limit.
      
      Chrony uses ADJ_TICK to make very aggressive short term corrections
      (usually right at startup). Which push us close enough to the max
      bound that a few late ticks can cause the internal steering to push
      past the max adjust value (tripping the warning).
      
      Thus this patch adds some extra logic to enforce the max adjustment
      cap in the internal steering.
      
      Note: This has the potential to slow corrections when the ADJ_TICK
      value is furthest away from the default value. So it would be good to
      get some testing from folks using chrony, to make sure we don't
      cause any troubles there.
      
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Tested-by: NMiroslav Lichvar <mlichvar@redhat.com>
      Reported-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      ec02b076
    • D
      timekeeping: Provide internal function __ktime_get_real_seconds · dee36654
      DengChao 提交于
      In order to fix Y2038 issues in the ntp code we will need replace
      get_seconds() with ktime_get_real_seconds() but as the ntp code uses
      the timekeeping lock which is also used by ktime_get_real_seconds(),
      we need a version without locking.
      Add a new function __ktime_get_real_seconds() in timekeeping to
      do this.
      Reviewed-by: NJohn Stultz <john.stultz@linaro.org>
      Signed-off-by: NDengChao <chao.deng@linaro.org>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      dee36654
  17. 11 12月, 2015 1 次提交
    • J
      time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow · 37cf4dc3
      John Stultz 提交于
      For adjtimex()'s ADJ_SETOFFSET, make sure the tv_usec value is
      sane. We might multiply them later which can cause an overflow
      and undefined behavior.
      
      This patch introduces new helper functions to simplify the
      checking code and adds comments to clarify
      
      Orginally this patch was by Sasha Levin, but I've basically
      rewritten it, so he should get credit for finding the issue
      and I should get the blame for any mistakes made since.
      
      Also, credit to Richard Cochran for the phrasing used in the
      comment for what is considered valid here.
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      37cf4dc3
  18. 08 12月, 2015 1 次提交
    • D
      time: Avoid signed overflow in timekeeping_get_ns() · 35a4933a
      David Gibson 提交于
      1e75fa8b "time: Condense timekeeper.xtime into xtime_sec" replaced a call to
      clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version
      of the same logic to avoid keeping a semi-redundant struct timespec
      in struct timekeeper.
      
      However, the commit also introduced a subtle semantic change - where
      clocksource_cyc2ns() uses purely unsigned math, the new version introduces
      a signed temporary, meaning that if (delta * tk->mult) has a 63-bit
      overflow the following shift will still give a negative result.  The
      choice of 'maxsec' in __clocksource_updatefreq_scale() means this will
      generally happen if there's a ~10 minute pause in examining the
      clocksource.
      
      This can be triggered on a powerpc KVM guest by stopping it from qemu for
      a bit over 10 minutes.  After resuming time has jumped backwards several
      minutes causing numerous problems (jiffies does not advance, msleep()s can
      be extended by minutes..).  It doesn't happen on x86 KVM guests, because
      the guest TSC is effectively frozen while the guest is stopped, which is
      not the case for the powerpc timebase.
      
      Obviously an unsigned (64 bit) overflow will only take twice as long as a
      signed, 63-bit overflow.  I don't know the time code well enough to know
      if that will still cause incorrect calculations, or if a 64-bit overflow
      is avoided elsewhere.
      
      Still, an incorrect forwards clock adjustment will cause less trouble than
      time going backwards.  So, this patch removes the potential for
      intermediate signed overflow.
      
      Cc: stable@vger.kernel.org  (3.7+)
      Suggested-by: NLaurent Vivier <lvivier@redhat.com>
      Tested-by: NLaurent Vivier <lvivier@redhat.com>
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      35a4933a
  19. 10 11月, 2015 1 次提交
    • A
      remove abs64() · 79211c8e
      Andrew Morton 提交于
      Switch everything to the new and more capable implementation of abs().
      Mainly to give the new abs() a bit of a workout.
      
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79211c8e
  20. 16 10月, 2015 1 次提交
  21. 02 10月, 2015 2 次提交
  22. 22 9月, 2015 1 次提交
  23. 13 9月, 2015 1 次提交
    • J
      time: Fix timekeeping_freqadjust()'s incorrect use of abs() instead of abs64() · 2619d7e9
      John Stultz 提交于
      The internal clocksteering done for fine-grained error
      correction uses a logarithmic approximation, so any time
      adjtimex() adjusts the clock steering, timekeeping_freqadjust()
      quickly approximates the correct clock frequency over a series
      of ticks.
      
      Unfortunately, the logic in timekeeping_freqadjust(), introduced
      in commit:
      
        dc491596 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
      
      used the abs() function with a s64 error value to calculate the
      size of the approximated adjustment to be made.
      
      Per include/linux/kernel.h:
      
        "abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".
      
      Thus on 32-bit platforms, this resulted in the clocksteering to
      take a quite dampended random walk trying to converge on the
      proper frequency, which caused the adjustments to be made much
      slower then intended (most easily observed when large
      adjustments are made).
      
      This patch fixes the issue by using abs64() instead.
      Reported-by: NNuno Gonçalves <nunojpg@gmail.com>
      Tested-by: NNuno Goncalves <nunojpg@gmail.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: <stable@vger.kernel.org> # v3.17+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2619d7e9
  24. 18 8月, 2015 2 次提交
    • B
      time: Introduce current_kernel_time64() · 8758a240
      Baolin Wang 提交于
      The current_kernel_time() is not year 2038 safe on 32bit systems
      since it returns a timespec value. Introduce current_kernel_time64()
      which returns a timespec64 value.
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NBaolin Wang <baolin.wang@linaro.org>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      8758a240
    • W
      time: Always make sure wall_to_monotonic isn't positive · e1d7ba87
      Wang YanQing 提交于
      Two issues were found on an IMX6 development board without an
      enabled RTC device(resulting in the boot time and monotonic
      time being initialized to 0).
      
      Issue 1:exportfs -a generate:
             "exportfs: /opt/nfs/arm does not support NFS export"
      Issue 2:cat /proc/stat:
             "btime 4294967236"
      
      The same issues can be reproduced on x86 after running the
      following code:
      	int main(void)
      	{
      	    struct timeval val;
      	    int ret;
      
      	    val.tv_sec = 0;
      	    val.tv_usec = 0;
      	    ret = settimeofday(&val, NULL);
      	    return 0;
      	}
      
      Two issues are different symptoms of same problem:
      The reason is a positive wall_to_monotonic pushes boot time back
      to the time before Epoch, and getboottime will return negative
      value.
      
      In symptom 1:
                negative boot time cause get_expiry() to overflow time_t
                when input expire time is 2147483647, then cache_flush()
                always clears entries just added in ip_map_parse.
      In symptom 2:
                show_stat() uses "unsigned long" to print negative btime
                value returned by getboottime.
      
      This patch fix the problem by prohibiting time from being set to a value which
      would cause a negative boot time. As a result one can't set the CLOCK_REALTIME
      time prior to (1970 + system uptime).
      
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NWang YanQing <udknight@gmail.com>
      [jstultz: reworded commit message]
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      e1d7ba87
  25. 18 6月, 2015 1 次提交
    • J
      timekeeping: Copy the shadow-timekeeper over the real timekeeper last · 906c5557
      John Stultz 提交于
      The fix in d1518326 (time: Move clock_was_set_seq update
      before updating shadow-timekeeper) was unfortunately incomplete.
      
      The main gist of that change was to do the shadow-copy update
      last, so that any state changes were properly duplicated, and
      we wouldn't accidentally have stale data in the shadow.
      
      Unfortunately in the main update_wall_time() logic, we update
      use the shadow-timekeeper to calculate the next update values,
      then while holding the lock, copy the shadow-timekeeper over,
      then call timekeeping_update() to do some additional
      bookkeeping, (skipping the shadow mirror). The bug with this is
      the additional bookkeeping isn't all read-only, and some
      changes timkeeper state. Thus we might then overwrite this state
      change on the next update.
      
      To avoid this problem, do the timekeeping_update() on the
      shadow-timekeeper prior to copying the full state over to
      the real-timekeeper.
      
      This avoids problems with both the clock_was_set_seq and
      next_leap_ktime being overwritten and possibly the
      fast-timekeepers as well.
      
      Many thanks to Prarit for his rigorous testing, which discovered
      this problem, along with Prarit and Daniel's work validating this
      fix.
      Reported-by: NPrarit Bhargava <prarit@redhat.com>
      Tested-by: NPrarit Bhargava <prarit@redhat.com>
      Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: http://lkml.kernel.org/r/1434560753-7441-1-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      906c5557
  26. 12 6月, 2015 2 次提交
    • J
      time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge · 833f32d7
      John Stultz 提交于
      Currently, leapsecond adjustments are done at tick time. As a result,
      the leapsecond was applied at the first timer tick *after* the
      leapsecond (~1-10ms late depending on HZ), rather then exactly on the
      second edge.
      
      This was in part historical from back when we were always tick based,
      but correcting this since has been avoided since it adds extra
      conditional checks in the gettime fastpath, which has performance
      overhead.
      
      However, it was recently pointed out that ABS_TIME CLOCK_REALTIME
      timers set for right after the leapsecond could fire a second early,
      since some timers may be expired before we trigger the timekeeping
      timer, which then applies the leapsecond.
      
      This isn't quite as bad as it sounds, since behaviorally it is similar
      to what is possible w/ ntpd made leapsecond adjustments done w/o using
      the kernel discipline. Where due to latencies, timers may fire just
      prior to the settimeofday call. (Also, one should note that all
      applications using CLOCK_REALTIME timers should always be careful,
      since they are prone to quirks from settimeofday() disturbances.)
      
      However, the purpose of having the kernel do the leap adjustment is to
      avoid such latencies, so I think this is worth fixing.
      
      So in order to properly keep those timers from firing a second early,
      this patch modifies the ntp and timekeeping logic so that we keep
      enough state so that the update_base_offsets_now accessor, which
      provides the hrtimer core the current time, can check and apply the
      leapsecond adjustment on the second edge. This prevents the hrtimer
      core from expiring timers too early.
      
      This patch does not modify any other time read path, so no additional
      overhead is incurred. However, this also means that the leap-second
      continues to be applied at tick time for all other read-paths.
      
      Apologies to Richard Cochran, who pushed for similar changes years
      ago, which I resisted due to the concerns about the performance
      overhead.
      
      While I suspect this isn't extremely critical, folks who care about
      strict leap-second correctness will likely want to watch
      this. Potentially a -stable candidate eventually.
      Originally-suggested-by: NRichard Cochran <richardcochran@gmail.com>
      Reported-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Reported-by: NPrarit Bhargava <prarit@redhat.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: http://lkml.kernel.org/r/1434063297-28657-4-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      833f32d7
    • J
      time: Move clock_was_set_seq update before updating shadow-timekeeper · d1518326
      John Stultz 提交于
      It was reported that 868a3e91 (hrtimer: Make offset
      update smarter) was causing timer problems after suspend/resume.
      
      The problem with that change is the modification to
      clock_was_set_seq in timekeeping_update is done prior to
      mirroring the time state to the shadow-timekeeper. Thus the
      next time we do update_wall_time() the updated sequence is
      overwritten by whats in the shadow copy.
      
      This patch moves the shadow-timekeeper mirroring to the end
      of the function, after all updates have been made, so all data
      is kept in sync.
      
      (This patch also affects the update_fast_timekeeper calls which
      were also problematically done prior to the mirroring).
      Reported-and-tested-by: NJeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/1434063297-28657-2-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d1518326
  27. 28 5月, 2015 1 次提交