1. 14 Nov 2009, 4 commits
    • nohz: Prevent clocksource wrapping during idle · 98962465
      Committed by Jon Hunter
      The dynamic tick allows the kernel to sleep for periods longer than a
      single tick, but it currently does not limit the sleep time. In the
      worst case the kernel could sleep longer than the wrap-around time of
      the timekeeping clock source, which would result in losing track of
      time.

      Prevent this by limiting the sleep time to the safe maximum of the
      current timekeeping clock source. The value is calculated when the
      clock source is registered.
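
      The limit follows from the clocksource's counter width (mask) and its
      cycle-to-nanosecond conversion factors. The helper below is a hedged
      userspace sketch of that calculation, not the kernel's exact code; the
      function name and the 1/32 safety margin are illustrative.

      #include <stdint.h>
      #include <stdio.h>

      /* Hedged sketch: derive a safe maximum idle sleep time (in ns) from
       * a clocksource's counter mask and its mult/shift factors. */
      static uint64_t max_idle_ns(uint64_t mask, uint32_t mult, uint32_t shift)
      {
              /* Largest cycle count whose product with mult fits in 64 bits. */
              uint64_t max_cycles = UINT64_MAX / mult;

              if (max_cycles > mask)  /* also bounded by the counter width */
                      max_cycles = mask;

              uint64_t max_nsecs = (max_cycles * mult) >> shift;

              /* Keep a margin so we wake up well before the wrap. */
              return max_nsecs - (max_nsecs >> 5);
      }

      int main(void)
      {
              /* Example: 32-bit counter at 1 GHz (mult/shift give 1 ns/cycle). */
              printf("%llu ns\n",
                     (unsigned long long)max_idle_ns(0xffffffffULL, 1 << 21, 21));
              return 0;
      }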
      
      [ tglx: simplified the code a bit and massaged the commit msg ]
      Signed-off-by: Jon Hunter <jon-hunter@ti.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <1250617512-23567-2-git-send-email-jon-hunter@ti.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • nohz: Type cast printk argument · 529eaccd
      Committed by Thomas Gleixner
      On some archs local_softirq_pending() has a data type of unsigned long;
      on others it is unsigned int. Cast it to (unsigned int) in the
      printk to avoid the compiler warning.
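
      The fix itself is one cast; a hedged sketch of its shape (the exact
      message text is assumed, not quoted from the patch):

      printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
             (unsigned int) local_softirq_pending());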
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <new-submission>
    • clocksource: Provide a generic mult/shift factor calculation · 7d2f944a
      Committed by Thomas Gleixner
      MIPS has two functions to calculate the mult/shift factors for clock
      sources and clock events at run time. ARM needs such functions as
      well.

      Implement a function which calculates the mult/shift factors from the
      frequencies which are converted from and to. The function also has a
      parameter to specify the minimum conversion range in seconds. This
      range is guaranteed not to produce a 64-bit overflow when a value is
      multiplied with the calculated mult factor. The larger the conversion
      range, the lower the conversion accuracy.
      
      Provide two inline wrappers which handle clock events and clock
      sources. For clock events the "from" frequency is nanoseconds per
      second, which corresponds to 1GHz, and "to" is the device frequency.
      For clock sources "from" is the device frequency and "to" is
      nanoseconds per second.
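
      The algorithm can be exercised as a standalone userspace model. The
      sketch below follows the description above (find the largest shift
      whose resulting mult neither overflows u32 nor overflows 64 bits
      within maxsec seconds); names and the rounding step are assumptions,
      not quotes from the patch.

      #include <stdint.h>
      #include <stdio.h>

      static void calc_mult_shift(uint32_t *mult, uint32_t *shift,
                                  uint32_t from, uint32_t to, uint32_t maxsec)
      {
              uint64_t tmp;
              uint32_t sft, sftacc = 32;

              /* Shrink the allowed mult width so that maxsec seconds worth
               * of 'from' cycles times mult cannot overflow 64 bits. */
              tmp = ((uint64_t)maxsec * from) >> 32;
              while (tmp) {
                      tmp >>= 1;
                      sftacc--;
              }

              /* Pick the largest shift whose mult still fits. */
              for (sft = 32; sft > 0; sft--) {
                      tmp = (uint64_t)to << sft;
                      tmp += from / 2;        /* round to nearest */
                      tmp /= from;
                      if ((tmp >> sftacc) == 0)
                              break;
              }
              *mult = (uint32_t)tmp;
              *shift = sft;
      }

      int main(void)
      {
              uint32_t mult, shift;

              /* Clock source case: 13.5 MHz device clock -> nanoseconds. */
              calc_mult_shift(&mult, &shift, 13500000, 1000000000, 600);
              printf("mult=%u shift=%u\n", mult, shift);
              return 0;
      }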
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Mikael Pettersson <mikpe@it.uu.se>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Linus Walleij <linus.walleij@stericsson.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20091111134229.766673305@linutronix.de>
    • clockevents: Use u32 for mult and shift factors · 23af368e
      Committed by Thomas Gleixner
      The mult and shift factors of clock events differ in their data type
      from those of clock sources for no reason. u32 is sufficient for
      both: shift is always <= 32, and mult is limited to 2^32-1 to avoid
      64-bit multiplication overflows in the conversion.
      
      Preparatory patch for a generic mult/shift factor calculation
      function.
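
      The constraint can be stated directly in code; a hedged illustration
      of the conversion both subsystems perform (names illustrative):

      #include <stdint.h>

      /* With mult <= 2^32-1 and a 32-bit input value, the product fits in
       * 64 bits, so ns->cycles (clock events) and cycles->ns (clock
       * sources) reduce to one 64-bit multiply and a shift. */
      static inline uint64_t convert(uint32_t val, uint32_t mult, uint32_t shift)
      {
              return ((uint64_t)val * mult) >> shift;
      }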
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Mikael Pettersson <mikpe@it.uu.se>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Linus Walleij <linus.walleij@stericsson.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20091111134229.725664788@linutronix.de>
  2. 05 Nov 2009, 2 commits
    • nohz: Introduce arch_needs_cpu · 3c5d92a0
      Committed by Martin Schwidefsky
      Allow the architecture to request a normal jiffy tick when the system
      goes idle and tick_nohz_stop_sched_tick is called. On s390 the hook is
      used to prevent the system from going fully idle if there has been an
      interrupt other than a clock comparator interrupt since the last wakeup.
      
      On s390 the HiperSockets response time for 1 connection ping-pong goes
      down from 42 to 34 microseconds. The CPU cost decreases by 27%.
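
      A hedged sketch of the hook's shape: a no-op default that an
      architecture can override, consulted before the tick is stopped (the
      call site in tick_nohz_stop_sched_tick is simplified here):

      /* Default: no architecture veto; arch code may define its own. */
      #ifndef arch_needs_cpu
      #define arch_needs_cpu(cpu) (0)
      #endif

      /* In tick_nohz_stop_sched_tick(), simplified: */
      if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) || arch_needs_cpu(cpu))
              delta_jiffies = 1;      /* keep a normal jiffy tick */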
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      LKML-Reference: <20090929122533.402715150@de.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • nohz: Reuse ktime in sub-functions of tick_check_idle. · eed3b9cf
      Committed by Martin Schwidefsky
      On a system with NOHZ=y tick_check_idle calls tick_nohz_stop_idle and
      tick_nohz_update_jiffies. Given the right conditions (ts->idle_active
      and/or ts->tick_stopped) both functions get a time stamp with ktime_get.
      The same time stamp can be reused if both functions require one.
      
      On s390 this change has the additional benefit that gcc inlines the
      tick_nohz_stop_idle function into tick_check_idle. The number of
      instructions to execute tick_check_idle drops from 225 to 144
      (without the ktime_get optimization it is 367 vs 215 instructions).
      
      before:
      
       0)               |  tick_check_idle() {
       0)               |    tick_nohz_stop_idle() {
       0)               |      ktime_get() {
       0)               |        read_tod_clock() {
       0)   0.601 us    |        }
       0)   1.765 us    |      }
       0)   3.047 us    |    }
       0)               |    ktime_get() {
       0)               |      read_tod_clock() {
       0)   0.570 us    |      }
       0)   1.727 us    |    }
       0)               |    tick_do_update_jiffies64() {
       0)   0.609 us    |    }
       0)   8.055 us    |  }
      
      after:
      
       0)               |  tick_check_idle() {
       0)               |    ktime_get() {
       0)               |      read_tod_clock() {
       0)   0.617 us    |      }
       0)   1.773 us    |    }
       0)               |    tick_do_update_jiffies64() {
       0)   0.593 us    |    }
       0)   4.477 us    |  }
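
      A hedged sketch of the refactoring's shape: the time stamp is taken
      once in tick_check_idle and handed to the sub-functions instead of
      each calling ktime_get itself (signatures and surrounding logic
      simplified):

      void tick_check_idle(int cpu)
      {
              ktime_t now = ktime_get();      /* single clock read */

              tick_nohz_stop_idle(cpu, now);  /* reuses 'now' ... */
              tick_nohz_update_jiffies(now);  /* ... instead of ktime_get() */
      }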
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090929122533.206589318@de.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  3. 05 Oct 2009, 2 commits
    • time: Remove xtime_cache · 7bc7d637
      Committed by john stultz
      With the prior logarithmic time accumulation patch, xtime will now
      always be within one "tick" of the current time, instead of
      possibly half a second off.
      
      This removes the need for the xtime_cache value, which always
      stored the time at the last interrupt, so this patch cleans that up,
      removing the xtime_cache related code.
      
      This is a bit simpler, but still could use some wider testing.
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: John Kacur <jkacur@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1254525855.7741.95.camel@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • time: Implement logarithmic time accumulation · a092ff0f
      Committed by john stultz
      Accumulating one tick at a time works well unless we're using NOHZ.
      Then it can be an issue, since we may have to run through the loop
      a few thousand times, which can increase latency caused by the
      timer interrupt.
      
      The current solution was to accumulate in half-second intervals
      with NOHZ. This kept the number of loops down; however, it slightly
      changed how we make NTP adjustments. While not an issue for NTPd
      users, as NTPd makes adjustments over a longer period of time, other
      adjtimex() users have noticed the half-second granularity with which
      we can apply frequency changes to the clock.
      
      For instance, if an application tries to apply a 100ppm frequency
      correction for 20ms to correct a 2us offset, with NOHZ it either
      gets no correction, or a 50us correction.
      
      Now, there will always be some granularity error when applying
      frequency corrections. However, users sensitive to this error
      have seen a 50-500x increase with NOHZ compared to running without
      NOHZ.
      
      So I figured I'd try an approach other than simply increasing
      the interval. My approach is to consume the time interval
      logarithmically. This reduces the number of times through the loop,
      keeping latency down, while still preserving the original
      granularity error for adjtimex() changes.
      
      Further, this change allows us to remove the xtime_cache code
      (patch to follow), as xtime is always within one tick of the
      current time, instead of the half-second updates it saw before.
      
      An earlier version of this patch has been shipping to x86 users in
      the RedHat MRG releases for a while without issue, but I've reworked
      this version to be even more careful about avoiding possible
      overflows if the shift value gets too large.
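
      The loop can be modeled in userspace. Below is a hedged sketch of the
      logarithmic scheme: pick a starting power-of-two chunk of ticks from
      the size of the pending offset, then halve the chunk whenever the
      remainder no longer covers it; names and the tick length are
      illustrative.

      #include <stdint.h>
      #include <stdio.h>

      #define TICK_NS 1000000ULL              /* example: 1 ms tick */

      /* Consume 'TICK_NS << shift' per pass instead of a single tick; the
       * real code also advances xtime and the NTP error scaled by shift. */
      static uint64_t accumulate(uint64_t offset, int shift)
      {
              return offset - (TICK_NS << shift);
      }

      int main(void)
      {
              uint64_t offset = 123456789012ULL;  /* long NOHZ sleep, ns */
              int shift = 0, passes = 0;

              /* Largest power-of-two number of ticks fitting the offset. */
              while ((TICK_NS << (shift + 1)) <= offset)
                      shift++;

              while (offset >= TICK_NS) {
                      offset = accumulate(offset, shift);
                      passes++;
                      while (shift > 0 && offset < (TICK_NS << shift))
                              shift--;
              }
              printf("%d passes, %llu ns left over\n",
                     passes, (unsigned long long)offset);
              return 0;
      }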
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: John Kacur <jkacur@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1254525473.7741.88.camel@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 02 Oct 2009, 1 commit
  5. 25 Sep 2009, 1 commit
    • clocksource: Resume clocksource without taking the clocksource mutex · 89133f93
      Committed by Martin Schwidefsky
      git commit 75c5158f converted the clocksource spinlock to a
      mutex. This causes the following BUG:
      
      BUG: sleeping function called from invalid context at kernel/mutex.c:280
      in_atomic(): 0, irqs_disabled(): 1, pid: 2473, name: pm-suspend
      2 locks held by pm-suspend/2473:
       #0:  (&buffer->mutex){......}, at: [<ffffffff8115ab13>] sysfs_write_file+0x3c/0x137
       #1:  (pm_mutex){......}, at: [<ffffffff810865b5>] enter_state+0x39/0x130
      Pid: 2473, comm: pm-suspend Not tainted 2.6.31 #1
      Call Trace:
       [<ffffffff810792f0>] ? __debug_show_held_locks+0x22/0x24
       [<ffffffff8104a2ef>] __might_sleep+0x107/0x10b
       [<ffffffff8141fca9>] mutex_lock_nested+0x25/0x43
       [<ffffffff81073537>] clocksource_resume+0x1c/0x60
       [<ffffffff81072902>] timekeeping_resume+0x1e/0x1c8
       [<ffffffff812aee62>] __sysdev_resume+0x25/0xcf
       [<ffffffff812aef79>] sysdev_resume+0x6d/0xae
       [<ffffffff810864f8>] suspend_devices_and_enter+0x12b/0x1af
       [<ffffffff8108665b>] enter_state+0xdf/0x130
       [<ffffffff81085dc3>] state_store+0xb6/0xd3
       [<ffffffff81204c73>] kobj_attr_store+0x17/0x19
       [<ffffffff8115abd2>] sysfs_write_file+0xfb/0x137
       [<ffffffff811057d2>] vfs_write+0xae/0x10b
       [<ffffffff81208392>] ? __up_read+0x1a/0x7f
       [<ffffffff811058ef>] sys_write+0x4a/0x6e
       [<ffffffff81011b82>] system_call_fastpath+0x16/0x1b
      
      clocksource_resume is called early in the resume process, when there
      is only one cpu, no processes are running, and interrupts are
      disabled. It is therefore possible to resume the clocksources
      without taking the clocksource mutex.
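
      A hedged sketch of the resulting function: walk the clocksource list
      and invoke the resume callbacks without the mutex, which is safe in
      this single-cpu, irqs-off context (details simplified from the
      clocksource code):

      void clocksource_resume(void)
      {
              struct clocksource *cs;

              /* No clocksource_mutex: resume runs on one cpu, irqs off. */
              list_for_each_entry(cs, &clocksource_list, list)
                      if (cs->resume)
                              cs->resume();

              clocksource_resume_watchdog();
      }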
      Reported-by: Xiaotian Feng <xtfeng@gmail.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Xiaotian Feng <xtfeng@gmail.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090924172952.49697825@mschwide.boeblingen.de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 24 Sep 2009, 1 commit
    • time: add function to convert between calendar time and broken-down time for universal use · 57f1f087
      Committed by Zhaolei
      There is a lot of similar code in the kernel for one purpose: converting
      between calendar time and broken-down time.
      
      Here is some source I found:
        fs/ncpfs/dir.c
        fs/smbfs/proc.c
        fs/fat/misc.c
        fs/udf/udftime.c
        fs/cifs/netmisc.c
        net/netfilter/xt_time.c
        drivers/scsi/ips.c
        drivers/input/misc/hp_sdc_rtc.c
        drivers/rtc/rtc-lib.c
        arch/ia64/hp/sim/boot/fw-emu.c
        arch/m68k/mac/misc.c
        arch/powerpc/kernel/time.c
        arch/parisc/include/asm/rtc.h
        ...
      
      We can make a common function for this type of conversion. At the very
      least we get the following benefits:

      1: The kernel becomes simpler and more unified.
      2: Bugs in the conversion code are easier to fix.
      3: Future clones of this code are avoided.
         For example, I'm trying to make ftrace display wall time;
         this patch makes that easy.
      
      This code is based on code from glibc-2.6
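
      A standalone model of the conversion being centralized; the algorithm
      follows glibc's offtime, as the commit notes, but the struct-free
      interface and helper names below are illustrative, not the kernel's.

      #include <stdint.h>
      #include <stdio.h>

      #define LEAPS_THRU_END_OF(y)  ((y) / 4 - (y) / 100 + (y) / 400)

      static int is_leap(long y)
      {
              return y % 4 == 0 && (y % 100 != 0 || y % 400 == 0);
      }

      /* Days before the start of each month: non-leap, leap years. */
      static const short mon_yday[2][12] = {
              { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334 },
              { 0, 31, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335 },
      };

      /* Convert seconds since the epoch to broken-down UTC time. */
      static void secs_to_tm(int64_t totalsecs, int *year, int *mon,
                             int *mday, int *hour, int *min, int *sec)
      {
              int64_t days = totalsecs / 86400;
              int rem = (int)(totalsecs % 86400);
              long y = 1970;
              int m;

              if (rem < 0) { rem += 86400; days--; }
              *hour = rem / 3600; rem %= 3600;
              *min = rem / 60;
              *sec = rem % 60;

              while (days < 0 || days >= (is_leap(y) ? 366 : 365)) {
                      /* Guess a corrected year, then adjust. */
                      long yg = y + days / 365 - (days % 365 < 0);
                      days -= (yg - y) * 365 + LEAPS_THRU_END_OF(yg - 1)
                              - LEAPS_THRU_END_OF(y - 1);
                      y = yg;
              }
              for (m = 11; days < mon_yday[is_leap(y)][m]; m--)
                      ;
              days -= mon_yday[is_leap(y)][m];
              *year = (int)y; *mon = m + 1; *mday = (int)days + 1;
      }

      int main(void)
      {
              int Y, M, D, h, m, s;

              secs_to_tm(1254525473, &Y, &M, &D, &h, &m, &s);
              printf("%04d-%02d-%02d %02d:%02d:%02d UTC\n", Y, M, D, h, m, s);
              return 0;
      }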
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 15 Sep 2009, 2 commits
  8. 12 Sep 2009, 1 commit
    • clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash · f79e0258
      Committed by Martin Schwidefsky
      The watchdog timer is started after the watchdog clocksource
      and at least one watched clocksource have been registered. The
      clocksource work element watchdog_work is initialized just
      before the clocksource timer is started. This is too late for
      the clocksource_mark_unstable call from native_cpu_up. To fix
      this, use a static initializer for watchdog_work.
      
      This resolves a boot crash reported by multiple people.
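
      A hedged sketch of the fix's shape, using the standard static
      initializer so the work item is valid from the start of boot:

      /* Statically initialized: usable before the watchdog timer starts,
       * so an early clocksource_mark_unstable() can safely queue it. */
      static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);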
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090911153305.3fe9a361@skybase>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 29 Aug 2009, 1 commit
    • clocksource: Resolve cpu hotplug dead lock with TSC unstable · 7285dd7f
      Committed by Thomas Gleixner
      Martin Schwidefsky analyzed it:
      To register a clocksource the clocksource_mutex is acquired and, if
      necessary, timekeeping_notify is called to install the clocksource as
      the timekeeper clock. timekeeping_notify uses stop_machine, which needs
      to take the cpu_add_remove_lock mutex.
      Starting a new cpu is done with the cpu_add_remove_lock mutex held.
      native_cpu_up checks the tsc of the new cpu, and if the tsc is no good,
      clocksource_change_rating is called, which needs the clocksource_mutex,
      and the deadlock is complete.
      
      The solution is to replace the TSC via the clocksource watchdog
      mechanism. Mark the TSC as unstable and schedule the watchdog work so
      it gets removed in the watchdog thread context.
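
      A hedged sketch of the mechanism (locking and helper details
      simplified from the clocksource watchdog code):

      void clocksource_mark_unstable(struct clocksource *cs)
      {
              unsigned long flags;

              spin_lock_irqsave(&watchdog_lock, flags);
              if (!(cs->flags & CLOCK_SOURCE_UNSTABLE)) {
                      cs->flags |= CLOCK_SOURCE_UNSTABLE;
                      /* Rating change happens later, in watchdog context. */
                      schedule_work(&watchdog_work);
              }
              spin_unlock_irqrestore(&watchdog_lock, flags);
      }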
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <new-submission>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
  10. 25 Aug 2009, 1 commit
  11. 22 Aug 2009, 1 commit
    • time: Introduce CLOCK_REALTIME_COARSE · da15cfda
      Committed by john stultz
      After talking with some application writers who want very fast, but not
      fine-grained timestamps, I decided to try to implement new clock_ids
      for clock_gettime(): CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE,
      which return the time at the last tick. This is very fast as we don't
      have to access any hardware (which can be very painful if you're using
      something like the acpi_pm clocksource), and we can even use the vdso
      clock_gettime() method to avoid the syscall. The only trade-off is that
      you get only low-res, tick-grained time resolution.
      
      This isn't a new idea; I know Ingo has a patch in the -rt tree that made
      the vsyscall gettimeofday() return coarse-grained time when the
      vsyscall64 sysctl was set to 2. However, that affects all applications
      on a system.
      
      With this method, applications can choose the proper speed/granularity
      trade-off for themselves.
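
      From userspace the new clock id is used like any other; a minimal
      example (on older glibc, link with -lrt):

      #include <stdio.h>
      #include <time.h>

      int main(void)
      {
              struct timespec ts;

              /* Tick-granularity time stamp, no hardware clock access. */
              clock_gettime(CLOCK_REALTIME_COARSE, &ts);
              printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
              return 0;
      }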
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: nikolag@ca.ibm.com
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: arjan@infradead.org
      Cc: jonathan@jonmasters.org
      LKML-Reference: <1250734414.6897.5.camel@localhost.localdomain>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  12. 20 Aug 2009, 1 commit
    • clockevent: Prevent dead lock on clockevents_lock · f833bab8
      Committed by Suresh Siddha
      Currently clockevents_notify() is called with interrupts enabled at
      some places and interrupts disabled at other places.

      This results in a deadlock in the following scenario:

      cpu A holds clockevents_lock in clockevents_notify() with irqs enabled
      cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled
      cpu C does set_mtrr(), which tries to rendezvous all the cpus.

      C and A reach the rendezvous point and wait for B. B is stuck forever
      waiting for the spinlock, never reaching the rendezvous point.
      
      Fix the clockevents code so that clockevents_lock is taken with
      interrupts disabled and thus avoid the above deadlock.
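
      A hedged sketch of the resulting locking pattern in
      clockevents_notify() (body abbreviated):

      void clockevents_notify(unsigned long reason, void *arg)
      {
              unsigned long flags;

              /* Always disable irqs; callers may have them on or off. */
              spin_lock_irqsave(&clockevents_lock, flags);
              clockevents_do_notify(reason, arg);
              /* ... */
              spin_unlock_irqrestore(&clockevents_lock, flags);
      }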
      
      Also call lapic_timer_propagate_broadcast() on the destination cpu so
      that we avoid calling smp_call_function() in the clockevents notifier
      chain.
      
      This issue left us wondering if we need to change the MTRR rendezvous
      logic to use stop machine logic (instead of smp_call_function) or add
      a check in the spinlock debug code for spinlocks which get taken under
      both interrupts-enabled and interrupts-disabled conditions.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com>
      Cc: "Brown Len" <len.brown@intel.com>
      LKML-Reference: <1250544899.2709.210.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 19 Aug 2009, 2 commits
    • clocksource: Avoid clocksource watchdog circular locking dependency · 01548f4d
      Committed by Martin Schwidefsky
      Calling stop_machine from a multithreaded workqueue is not allowed
      because of a circular locking dependency between cpu_down and the
      workqueue execution. Use a kernel thread to do the clocksource
      downgrade instead.
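
      A hedged sketch of the shape of the fix: the work function only
      spawns a short-lived kernel thread, which can call stop_machine
      safely (names as in the clocksource code, details simplified):

      static void clocksource_watchdog_work(struct work_struct *work)
      {
              /* stop_machine must not run from the multithreaded
               * workqueue; hand the downgrade off to a kernel thread. */
              kthread_run(clocksource_watchdog_kthread, NULL, "kwatchdog");
      }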
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: john stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090818170942.3ab80c91@skybase>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • clocksource: Protect the watchdog rating changes with clocksource_mutex · d0981a1b
      Committed by Thomas Gleixner
      Martin pointed out that commit 6ea41d2529 (clocksource: Call
      clocksource_change_rating() outside of watchdog_lock) has a
      theoretical reference count problem. The calls to
      clocksource_change_rating() are now done outside of the clocksource
      mutex and outside of the watchdog lock. A concurrent
      clocksource_unregister() could remove the clock.
      
      Split out the code which changes the rating from
      clocksource_change_rating() into __clocksource_change_rating().
      
      Protect the clocksource_watchdog_work() code sequence with the
      clocksource_mutex and call __clocksource_change_rating().
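
      A hedged sketch of the resulting split (list helpers simplified):

      /* Caller must hold clocksource_mutex. */
      static void __clocksource_change_rating(struct clocksource *cs,
                                              int rating)
      {
              list_del(&cs->list);
              cs->rating = rating;
              clocksource_enqueue(cs);
      }

      void clocksource_change_rating(struct clocksource *cs, int rating)
      {
              mutex_lock(&clocksource_mutex);
              __clocksource_change_rating(cs, rating);
              mutex_unlock(&clocksource_mutex);
      }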
      
      LKML-Reference: <alpine.LFD.2.00.0908171038420.2782@localhost.localdomain>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
  14. 17 Aug 2009, 1 commit
    • timers: Drop write permission on /proc/timer_list · de809347
      Committed by Amerigo Wang
      /proc/timer_list and /proc/slabinfo are not supposed to be
      written to, so there should be no write permissions on them.
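
      The fix amounts to dropping the write bit when the proc entry is
      created; a hedged sketch:

      /* 0444 instead of 0644: read-only for everyone. */
      pe = proc_create("timer_list", 0444, NULL, &timer_list_fops);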
      Signed-off-by: WANG Cong <amwang@redhat.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: linux-mm@kvack.org
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      LKML-Reference: <20090817094525.6355.88682.sendpatchset@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  15. 15 Aug 2009, 16 commits
  16. 19 Jul 2009, 1 commit
  17. 10 Jul 2009, 1 commit
    • hrtimer: Fix migration expiry check · 6ff7041d
      Committed by Thomas Gleixner
      The timer migration expiry check should prevent the migration of a
      timer to another CPU when the timer expires before the next event is
      scheduled on the other CPU. Migrating the timer might delay it because
      we cannot reprogram the clock event device on the other CPU. But the
      code implementing that check has two flaws:
      
      - for !HIGHRES the check compares the expiry value with the clock
        events device expiry value, which is wrong for CLOCK_REALTIME based
        timers.
      
      - the check is racy. It holds the hrtimer base lock of the target CPU,
        but the clock event device expiry value can nevertheless be modified,
        e.g. by a timer interrupt firing.
      
      The !HIGHRES case is easy to fix as we can enqueue the timer on the
      cpu which was selected by the load balancer. It runs the idle
      balancing code once per jiffy anyway. So the maximum delay for the
      timer is the same as when we keep the tick on the current cpu going.
      
      In the HIGHRES case we can get the next expiry value from the hrtimer
      cpu_base of the target CPU and serialize the update with the cpu_base
      lock. This moves the lock section in hrtimer_interrupt() so we can set
      next_event to KTIME_MAX while we are handling the expired timers and
      set it to the next expiry value after we handled the timers under the
      base lock. While the expired timers are processed timer migration is
      blocked because the expiry time of the timer is always <= KTIME_MAX.
      
      Also remove the now useless clockevents_get_next_event() function.
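
      A hedged sketch of the HIGHRES part of the fix: compare the timer's
      expiry against the target CPU's next event, serialized by the
      target's base lock (helper shown in the spirit of the hrtimer code,
      details simplified):

      /* Nonzero if the timer would expire before the target CPU's next
       * programmed event, i.e. migrating it could delay its expiry. */
      static int hrtimer_check_target(struct hrtimer *timer,
                                      struct hrtimer_clock_base *new_base)
      {
              ktime_t expires;

              if (!new_base->cpu_base->hres_active)
                      return 0;

              expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);
              return expires.tv64 <= new_base->cpu_base->expires_next.tv64;
      }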
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  18. 07 Jul 2009, 1 commit