1. 21 Jun 2015, 1 commit
  2. 20 Jun 2015, 2 commits
  3. 19 Jun 2015, 25 commits
    • timer: Minimize nohz off overhead · 683be13a
      Thomas Gleixner committed
      If nohz is disabled on the kernel command line the [hr]timer code
      still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
      pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
      avoid the overhead.
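
      A minimal sketch of the caching idea (the struct and helper below are
      simplified stand-ins, not the exact 4.2 code):

        struct tvec_base {
                spinlock_t      lock;
                unsigned long   timer_jiffies;
                bool            nohz_active;    /* set once when NOHZ turns on */
        };

        static void add_timer_wakeup_sketch(struct tvec_base *base, int cpu)
        {
                /*
                 * Previously wake_up_nohz_cpu() was called unconditionally,
                 * even with nohz=off on the command line. The cached flag
                 * costs one read of an already-hot cache line.
                 */
                if (base->nohz_active)
                        wake_up_nohz_cpu(cpu);
        }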
      
      Before:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.73%  hog       [.] main
        15.36%  [kernel]  [k] _raw_spin_lock_irqsave
         9.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.61%  [kernel]  [k] lock_timer_base.isra.38
         6.42%  [kernel]  [k] mod_timer
         3.90%  [kernel]  [k] detach_if_pending
         3.76%  [kernel]  [k] del_timer
         2.41%  [kernel]  [k] internal_add_timer
         1.39%  [kernel]  [k] __internal_add_timer
         0.76%  [kernel]  [k] timerfn
      
      We probably should have a cached value for nohz full in the per cpu
      bases as well to avoid the cpumask check. The base cache line is hot
      already, the cpumask not necessarily.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      683be13a
    • timer: Reduce timer migration overhead if disabled · bc7a34b8
      Thomas Gleixner committed
      Eric reported that the timer_migration sysctl is not really nice
      performance-wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further, the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered to use a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The
      user-visible sysctl is still set to enabled. If the kernel switches to
      NOHZ, migration is enabled, provided the user did not disable it via
      the sysctl prior to the switch. If nohz=off is on the kernel command
      line, migration stays disabled no matter what.
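
      A rough sketch of where the cached state lives and how base selection
      consumes it (names simplified; the sysctl write path would have to
      update the per-cpu copies):

        struct tvec_base {
                spinlock_t      lock;
                bool            migration_enabled;      /* cached sysctl state */
                bool            nohz_active;            /* cached NOHZ state */
        };

        /*
         * Only consult the NOHZ target when migration is actually enabled:
         * no extra function call, no cold cache line.
         */
        static struct tvec_base *get_target_base(struct tvec_base *base,
                                                 int pinned)
        {
                if (!pinned && base->migration_enabled && base->nohz_active)
                        return per_cpu_ptr(&tvec_bases, get_nohz_timer_target());
                return base;
        }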
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bc7a34b8
    • timer: Stats: Simplify the flags handling · c74441a1
      Thomas Gleixner committed
      Simplify the handling of the flag storage for the timer statistics. No
      intermediate storage anymore. Just hand over the flags field.
      
      I left the printout of 'deferrable' for now because changing this
      would be an ABI update and I have no idea how strongly people feel
      about that. OTOH, I wonder whether we should kill the whole timer stats
      stuff because all of that information can be retrieved via ftrace/perf
      as well.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c74441a1
    • timer: Replace timer base by a cpu index · 0eeda71b
      Thomas Gleixner committed
      Instead of storing a pointer to the per cpu tvec_base we can simply
      cache a CPU index in the timer_list and use that to get hold of the
      correct per cpu tvec_base. This is only used in lock_timer_base() and
      the slightly larger code is peanuts versus the spinlock operation and
      the d-cache foot print of the timer wheel.
      
      Aside from that, this allows us to get rid of the following nuisances:
      
       - boot_tvec_base
      
         That statically allocated 4k bss data is just kept around so the
         timer has a home when it gets statically initialized. It serves no
         other purpose.
      
         With the CPU index we assign the timer to CPU0 at static
         initialization time and therefore can avoid the whole boot_tvec_base
         dance. That also simplifies the init code, which can just use the
         per cpu base.
      
         Before:
           text	   data	    bss	    dec	    hex	filename
          17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
         After:
           text	   data	    bss	    dec	    hex	filename
          17440	   9193	      0	  26633	   6809	../build/kernel/time/timer.o
      
       - Overloading the base pointer with various flags
      
         The CPU index has enough space to hold the flags (deferrable,
         irqsafe) so we can get rid of the extra masking and bit fiddling
         with the base pointer.
      
      As a benefit we reduce the size of struct timer_list on 64-bit
      machines by 4 to 8 bytes, a size reduction of up to 15% per struct
      timer_list, which is a real win as we have tons of them embedded in
      other structs.
      
      This also changes the newly added deferrable printout of the timer
      start trace point to capture and print all timer->flags, which allows
      us to decode the target cpu of the timer as well.
      
      We might have used bitfields for this, but that would change the
      static initializers and the init function for no benefit, just to
      accommodate big-endian bitfields.
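
      A sketch of the resulting encoding (the bit positions are
      illustrative, not necessarily the merged values): the CPU index
      occupies the low bits of timer->flags, the deferrable/irqsafe bits
      sit above it, and lock_timer_base() resolves the base from the index:

        #define TIMER_CPUMASK           0x0003FFFF      /* target CPU index */
        #define TIMER_DEFERRABLE        0x00040000
        #define TIMER_IRQSAFE           0x00080000

        static struct tvec_base *get_timer_base(u32 tflags)
        {
                return per_cpu_ptr(&tvec_bases, tflags & TIMER_CPUMASK);
        }

      Static initializers simply leave the CPU bits at 0, pinning the timer
      to CPU0, which is what makes boot_tvec_base unnecessary.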
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Badhri Jagan Sridharan <Badhri@google.com>
      Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      0eeda71b
    • timer: Use hlist for the timer wheel hash buckets · 1dabbcec
      Thomas Gleixner committed
      This reduces the size of struct tvec_base by 50% and results in
      slightly smaller code as well.
      
      Before:
         struct tvec_base: size: 8256, cachelines: 129
      
         text	   data	    bss	    dec	    hex	filename
        17698	  13297	   8256	  39251	   9953	../build/kernel/time/timer.o
      
      After:
         struct tvec_base: size: 4160, cachelines: 65
      
         text	   data	    bss	    dec	    hex	filename
        17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
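
      The 50% falls out of the head sizes: a list_head is two pointers, an
      hlist_head is one, and a tvec_base is mostly an array of bucket heads
      (256 root buckets plus 4 x 64 cascade buckets = 512). A small
      userspace-C illustration of the arithmetic:

        #include <stdio.h>

        struct list_head  { struct list_head *next, *prev; };  /* 16 bytes on 64-bit */
        struct hlist_node { struct hlist_node *next, **pprev; };
        struct hlist_head { struct hlist_node *first; };       /* 8 bytes on 64-bit */

        int main(void)
        {
                unsigned int buckets = 256 + 4 * 64;    /* tv1 + tv2..tv5 */

                printf("list_head buckets:  %zu bytes\n",
                       buckets * sizeof(struct list_head));     /* 8192 */
                printf("hlist_head buckets: %zu bytes\n",
                       buckets * sizeof(struct hlist_head));    /* 4096 */
                return 0;
        }

      The remaining 64 bytes of each struct size are the non-bucket fields.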
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224511.854731214@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1dabbcec
    • timer: Remove FIFO "guarantee" · 1bd04bf6
      Thomas Gleixner committed
      The FIFO guarantee only holds if two timers are queued into the same
      bucket at the same jiffy on the same cpu:
      
       - The slack value depends on the delta between expiry and enqueue
         time, so the resulting expiry time can be different for timers
         which are queued in different jiffies.
      
       - Timers which are queued into the secondary array end up after a
         later queued timer which was queued into the primary array due to
         cascading.
      
       - Timers can end up on different cpus due to the NOHZ target moving
         around. Obviously there is no guarantee of expiry ordering between
         cpus.
      
      So anything which relies on FIFO behaviour of the timer wheel is
      broken already.
      
      This is a preparatory patch for converting the timer wheel to hlist
      which reduces the memory foot print of the wheel by 50%.
      
      It's a separate patch so any (unlikely to happen) regression caused by
      this can be identified clearly.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Cc: George Spelvin <linux@horizon.com>
      Link: http://lkml.kernel.org/r/20150526224511.757520403@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1bd04bf6
    • timers: Sanitize catchup_timer_jiffies() usage · 3bb475a3
      Thomas Gleixner committed
      catchup_timer_jiffies() has been applied blindly to several functions
      without looking for possible better ways to do it.
      
      1) internal_add_timer()
      
         Move the update to base->all_timers before we actually insert the
         timer into the wheel.
      
      2) detach_if_pending()
      
         Again, the update to base->all_timers allows us to explicitly do
         the timer_jiffies update in place, if this was the last timer
         which got removed.
      
      3) __run_timers()
      
         We only check on entry, which is silly, because base->timer_jiffies
         can be behind - especially on NOHZ kernels - and if there is a
         single deferrable timer somewhere between base->timer_jiffies and
         jiffies we expire it and then loop until base->timer_jiffies ==
         jiffies.
      
         Move it into the loop.
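
      A sketch of change 3 (control flow only, not the verbatim kernel
      function):

        static void __run_timers_sketch(struct tvec_base *base)
        {
                spin_lock_irq(&base->lock);
                while (time_after_eq(jiffies, base->timer_jiffies)) {
                        /*
                         * Checking inside the loop lets a base with no
                         * queued timers fast-forward immediately instead
                         * of walking every jiffy after expiring a single
                         * deferrable timer.
                         */
                        if (catchup_timer_jiffies(base))
                                break;
                        /* ... cascade and expire the bucket at
                         *     base->timer_jiffies++ ... */
                }
                spin_unlock_irq(&base->lock);
        }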
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224511.662994644@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3bb475a3
    • sched/deadline: Remove needless parameter in dl_runtime_exceeded() · 6fab5410
      Zhiqiang Zhang committed
      Since commit 269ad801 ("sched/deadline: Avoid double-accounting in
      case of missed deadlines"), parameter 'rq' is no longer used, so
      remove it.
      Signed-off-by: Zhiqiang Zhang <zhangzhiqiang.zhang@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <juri.lelli@gmail.com>
      Cc: <luca.abeni@unitn.it>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1434338120-43773-1-git-send-email-zhangzhiqiang.zhang@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6fab5410
    • sched: Remove superfluous resetting of the p->dl_throttled flag · 6713c3aa
      Wanpeng Li committed
      Resetting the p->dl_throttled flag in rt_mutex_setprio() (for a task that is going
      to be boosted) is superfluous, as the natural place to do so is in
      replenish_dl_entity().
      
      If the task was on the runqueue and is boosted by a DL task, it will
      be enqueued back with the ENQUEUE_REPLENISH flag set, which guarantees
      that dl_throttled is reset in replenish_dl_entity().
      
      This patch drops the resetting of throttled status in function rt_mutex_setprio().
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-6-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6713c3aa
    • sched/deadline: Drop duplicate init_sched_dl_class() declaration · 178a4d23
      Wanpeng Li committed
      There are two init_sched_dl_class() declarations, this patch drops
      the duplicate.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-5-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      178a4d23
    • sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target · 9d514262
      Wanpeng Li committed
      This patch adds a check that prevents futile attempts to move DL tasks
      to a CPU with active tasks of equal or earlier deadline. This mirrors
      the behavior of commit 80e3d87b ("sched/rt: Reduce rq lock contention
      by eliminating locking of non-feasible target") for the rt class.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-3-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9d514262
    • sched/deadline: Make init_sched_dl_class() __init · a6c0e746
      Wanpeng Li committed
      It's a bootstrap function, make init_sched_dl_class() __init.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-2-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a6c0e746
    • sched/deadline: Optimize pull_dl_task() · 8b5e770e
      Wanpeng Li committed
      pull_dl_task() uses pick_next_earliest_dl_task() to select a migration
      candidate; this is sub-optimal since the next earliest task -- as per
      the regular runqueue -- might not be migratable at all. This could
      result in iterating the entire runqueue looking for a task.
      
      Instead iterate the pushable queue -- this queue only contains tasks
      that have at least 2 cpus set in their cpus_allowed mask.
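
      A sketch of the replacement selection, close in shape to the helper
      the patch adds but simplified here: walk the pushable-tasks rb-tree,
      whose members are migratable by construction.

        static struct task_struct *
        pick_earliest_pushable_dl_task_sketch(struct rq *rq, int cpu)
        {
                struct rb_node *node = rb_first(&rq->dl.pushable_dl_tasks_root);
                struct task_struct *p;

                while (node) {
                        p = rb_entry(node, struct task_struct,
                                     pushable_dl_tasks);
                        /* every queued task has nr_cpus_allowed > 1 */
                        if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
                                return p;
                        node = rb_next(node);
                }
                return NULL;
        }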
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [ Improved the changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-1-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8b5e770e
    • sched/preempt: Add static_key() to preempt_notifiers · 1cde2930
      Peter Zijlstra committed
      Avoid touching the curr->preempt_notifier cacheline when not needed.
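
      A sketch of the guard, spelled with the later static-branch API for
      brevity (the original patch used struct static_key and
      static_key_false()): the key is enabled only while notifiers are
      registered, so the common case is a patched-out branch that never
      touches curr->preempt_notifiers.

        static DEFINE_STATIC_KEY_FALSE(preempt_notifier_key);

        void preempt_notifier_register(struct preempt_notifier *notifier)
        {
                static_branch_inc(&preempt_notifier_key);
                hlist_add_head(&notifier->link, &current->preempt_notifiers);
        }

        static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr)
        {
                /* patched-out branch: no load of curr->preempt_notifiers */
                if (static_branch_unlikely(&preempt_notifier_key))
                        __fire_sched_in_preempt_notifiers(curr);
        }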
      
      Provides a small improvement on pipe-bench:
      
        taskset 01 perf stat --repeat 10 -- perf bench sched pipe
      
      before:
      
       Performance counter stats for 'perf bench sched pipe' (10 runs):
      
            12385.016204      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.34% )
               2,000,023      context-switches          #    0.161 M/sec                    ( +-  0.00% )
                       0      cpu-migrations            #    0.000 K/sec
                     175      page-faults               #    0.014 K/sec                    ( +-  0.26% )
          41,376,162,250      cycles                    #    3.341 GHz                      ( +-  0.11% )
          17,389,139,321      stalled-cycles-frontend   #   42.03% frontend cycles idle     ( +-  0.25% )
         <not supported>      stalled-cycles-backend
          68,788,588,003      instructions              #    1.66  insns per cycle
                                                        #    0.25  stalled cycles per insn  ( +-  0.02% )
          13,449,387,620      branches                  # 1085.940 M/sec                    ( +-  0.02% )
              20,880,690      branch-misses             #    0.16% of all branches          ( +-  0.98% )
      
            12.372646094 seconds time elapsed                                          ( +-  0.34% )
      
      after:
      
       Performance counter stats for 'perf bench sched pipe' (10 runs):
      
            12180.936528      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.33% )
               2,000,077      context-switches          #    0.164 M/sec                    ( +-  0.00% )
                       0      cpu-migrations            #    0.000 K/sec
                     174      page-faults               #    0.014 K/sec                    ( +-  0.27% )
          40,691,545,577      cycles                    #    3.341 GHz                      ( +-  0.06% )
          16,446,333,371      stalled-cycles-frontend   #   40.42% frontend cycles idle     ( +-  0.18% )
         <not supported>      stalled-cycles-backend
          68,570,100,387      instructions              #    1.69  insns per cycle
                                                        #    0.24  stalled cycles per insn  ( +-  0.01% )
          13,389,740,014      branches                  # 1099.237 M/sec                    ( +-  0.01% )
              20,175,440      branch-misses             #    0.15% of all branches          ( +-  0.52% )
      
            12.169253010 seconds time elapsed                                          ( +-  0.33% )
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1cde2930
    • sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration · d84525a8
      Mathieu Desnoyers committed
      preempt_notifier_unregister() documents:
      
        "This is safe to call from within a preemption notifier."
      
      However, both fire_sched_in_preempt_notifiers() and
      fire_sched_out_preempt_notifiers() are using hlist_for_each_entry(),
      which is not safe against entry removal during iteration.
      
      Inspection of the KVM code does not reveal any use of
      preempt_notifier_unregister() within the preempt notifiers.
      
      Therefore, fix the comment.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431881590-1456-1-git-send-email-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d84525a8
    • sched/stop_machine: Fix deadlock between multiple stop_two_cpus() · b17718d0
      Peter Zijlstra committed
      Jiri reported a machine stuck in multi_cpu_stop() with
      migrate_swap_stop() as function and with the following src,dst cpu
      pairs: {11,  4} {13, 11} { 4, 13}
      
                              4       11      13
      
      cpuM: queue(4 ,13)
                              *Ma
      cpuN: queue(13,11)
                                      *N      Na
                              *M              Mb
      cpuO: queue(11, 4)
                              *O      Oa
                                      *Nb
                              *Ob
      
      Where *X denotes the cpu running the queueing of cpu-X and X[ab] denotes
      the first/second queued work.
      
      You'll observe the top of the workqueue for each cpu (4, 11, 13) to be
      work from cpus M, O, N respectively. IOW, deadlock.
      
      Do away with the queueing trickery and introduce lg_double_lock() to
      lock both CPUs and fully serialize the stop_two_cpus() callers instead
      of the partial (and buggy) serialization we have now.
      Reported-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150605153023.GH19282@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b17718d0
    • sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched · 82a0d276
      Srikar Dronamraju committed
      When CONFIG_SCHEDSTATS is enabled, /proc/<pid>/sched prints almost all
      sched statistics except sum_sleep_runtime. Since sum_sleep_runtime is
      good information to collect, add it to /proc/<pid>/sched.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433751041-11724-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      82a0d276
    • sched/debug: Replace vruntime with wait_sum in /proc/sched_debug · c5f3ab1c
      Srikar Dronamraju committed
      Within runnable tasks in /proc/sched_debug, vruntime is printed twice,
      once as tree-key and again as exec-runtime.
      
      Since exec-runtime isn't populated in !CONFIG_SCHEDSTATS, use this field
      to print wait_sum.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433751041-11724-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c5f3ab1c
    • sched/debug: Properly format runnable tasks in /proc/sched_debug · 33d6176e
      Srikar Dronamraju committed
      With !CONFIG_SCHEDSTATS, the runnable-tasks listing in
      /proc/sched_debug has more columns than required. Fix this by
      printing only the appropriate columns.
      
      While at it, print sum_exec_runtime, since this information is
      available even in the !CONFIG_SCHEDSTATS case.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433751041-11724-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      33d6176e
    • locking/qrwlock: Don't contend with readers when setting _QW_WAITING · 405963b6
      Waiman Long committed
      The current cmpxchg() loop in setting the _QW_WAITING flag for writers
      in queue_write_lock_slowpath() will contend with incoming readers
      causing possibly extra cmpxchg() operations that are wasteful. This
      patch changes the code to do a byte cmpxchg() to eliminate contention
      with new readers.
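
      A sketch of the trick (the byte layout matches the little-endian
      case; simplified): overlay a byte view on the lock word so the writer
      can claim the wait flag with a byte-sized cmpxchg that reader-count
      increments never disturb.

        struct __qrwlock {
                union {
                        atomic_t cnts;
                        struct {
                                u8 wmode;       /* writer mode/wait flag */
                                u8 rcnts[3];    /* reader counts */
                        };
                };
                arch_spinlock_t lock;
        };

        /*
         * Old: cmpxchg on the full lock word, which fails every time a
         * reader increments the count. New: byte-sized cmpxchg on wmode,
         * which the readers never write.
         */
        static void set_qw_waiting(struct __qrwlock *l)
        {
                while (cmpxchg(&l->wmode, 0, _QW_WAITING) != 0)
                        cpu_relax();
        }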
      
      A multithreaded microbenchmark running a 5M read_lock/write_lock loop
      on an 8-socket 80-core Westmere-EX machine running a 4.0 based kernel
      with the qspinlock patch has the following execution times (in ms)
      with and without the patch:
      
      With R:W ratio = 5:1
      
      	Threads	   w/o patch	with patch	% change
      	-------	   ---------	----------	--------
      	   2	     990	    895		  -9.6%
      	   3	    2136	   1912		 -10.5%
      	   4	    3166	   2830		 -10.6%
      	   5	    3953	   3629		  -8.2%
      	   6	    4628	   4405		  -4.8%
      	   7	    5344	   5197		  -2.8%
      	   8	    6065	   6004		  -1.0%
      	   9	    6826	   6811		  -0.2%
      	  10	    7599	   7599		   0.0%
      	  15	    9757	   9766		  +0.1%
      	  20	   13767	  13817		  +0.4%
      
      With a small number of contending threads, this patch can improve
      locking performance by up to 10%. With more contending threads,
      however, the gain diminishes.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433863153-30722-3-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      405963b6
    • perf: Fix ring_buffer_attach() RCU sync, again · 2f993cf0
      Oleg Nesterov committed
      While looking for other users of get_state/cond_sync, I found
      ring_buffer_attach(), and it looks obviously buggy.

      Don't we need to ensure that we have "synchronize" _between_
      list_del() and list_add()?

      IOW, suppose that ring_buffer_attach() is preempted right after
      get_state_synchronize_rcu() and the grace period completes before
      spin_lock().

      In this case cond_synchronize_rcu() does nothing and we reuse
      ->rb_entry without waiting for a grace period in between.
      
      It also moves the ->rcu_pending check under "if (rb)", to make it
      more readable imo.
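
      The shape of the fix, abridged from the patch: sample the
      grace-period state after the list_del_rcu(), so cond_synchronize_rcu()
      waits for a grace period that began after the detach.

        static void ring_buffer_attach(struct perf_event *event,
                                       struct ring_buffer *rb)
        {
                unsigned long flags;

                if (event->rb) {
                        spin_lock_irqsave(&event->rb->event_lock, flags);
                        list_del_rcu(&event->rb_entry);
                        spin_unlock_irqrestore(&event->rb->event_lock, flags);

                        /* sample the GP state *after* the removal */
                        event->rcu_batches = get_state_synchronize_rcu();
                        event->rcu_pending = 1;
                }

                if (rb) {
                        if (event->rcu_pending) {
                                cond_synchronize_rcu(event->rcu_batches);
                                event->rcu_pending = 0;
                        }
                        spin_lock_irqsave(&rb->event_lock, flags);
                        list_add_rcu(&event->rb_entry, &rb->event_list);
                        spin_unlock_irqrestore(&rb->event_lock, flags);
                }
                /* ... */
        }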
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: der.herr@hofr.at
      Cc: josh@joshtriplett.org
      Cc: tj@kernel.org
      Fixes: b69cf536 ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
      Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2f993cf0
    • hrtimer: Allow hrtimer::function() to free the timer · 887d9dc9
      Peter Zijlstra committed
      Currently an hrtimer callback function cannot free its own timer
      because __run_hrtimer() still needs to clear HRTIMER_STATE_CALLBACK
      after it. Freeing the timer would result in a clear use-after-free.
      
      Solve this by using a scheme similar to regular timers; track the
      current running timer in hrtimer_clock_base::running.
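
      A heavily abridged sketch of the resulting expiry path (helper names
      approximate): the CALLBACK bit in timer->state is replaced by a
      base-side record, so nothing dereferences the timer after its handler
      returns.

        static void __run_hrtimer_sketch(struct hrtimer_clock_base *base,
                                         struct hrtimer *timer)
        {
                enum hrtimer_restart (*fn)(struct hrtimer *) = timer->function;
                enum hrtimer_restart restart;

                base->running = timer;  /* hrtimer_active() checks this */
                __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);

                restart = fn(timer);    /* may free @timer ... */

                /*
                 * ... so only touch it again if the handler asked for a
                 * restart; a handler that frees its timer must return
                 * HRTIMER_NORESTART.
                 */
                if (restart != HRTIMER_NORESTART)
                        enqueue_hrtimer(timer, base);

                base->running = NULL;
        }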
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: wanpeng.li@linux.intel.com
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124743.471563047@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      887d9dc9
    • hrtimer: Fix hrtimer_is_queued() hole · 8edfb036
      Peter Zijlstra committed
      A queued hrtimer that gets restarted (hrtimer_start*() while
      hrtimer_is_queued()) will briefly appear as unqueued/inactive, even
      though the timer has always been active; we just moved it.
      
      Close this hole by preserving timer->state in
      hrtimer_start_range_ns()'s remove_hrtimer() call.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: wanpeng.li@linux.intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124743.175989138@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8edfb036
    • hrtimer: Remove HRTIMER_STATE_MIGRATE · c04dca02
      Oleg Nesterov committed
      I do not understand HRTIMER_STATE_MIGRATE. Unless I am totally
      confused it looks buggy and simply unneeded.
      
      migrate_hrtimer_list() sets it to keep hrtimer_active() == T, but this
      is not enough: this can fool, say, hrtimer_is_queued() in
      dequeue_signal().
      
      Can't migrate_hrtimer_list() simply use HRTIMER_STATE_ENQUEUED?
      This fixes the race and we can kill STATE_MIGRATE.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: wanpeng.li@linux.intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124743.072387650@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c04dca02
    • locking/rtmutex: Implement lockless top-waiter wakeup · 45ab4eff
      Davidlohr Bueso committed
      Mark the task for later wakeup after the wait_lock has been released.
      This way, once the next task is awoken, it will have a better chance
      of finding the wait_lock free when continuing executing in
      __rt_mutex_slowlock() when trying to acquire the rtmutex, calling
      try_to_take_rt_mutex(). Upon contended scenarios, other tasks
      attempting to take the lock may acquire it first, right after the
      wait_lock is released, but (a) this can also occur with the current
      code, as it relies on spinlock fairness, and (b) we are dealing with
      the top-waiter anyway, so it will always take the lock next.
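
      A sketch of the deferred wakeup, using the modern DEFINE_WAKE_Q
      spelling (the original series introduced an equivalent queue): mark
      the top waiter under wait_lock, drop the lock, then perform the
      wakeup.

        static void rt_mutex_slowunlock_tail_sketch(struct rt_mutex *lock)
        {
                DEFINE_WAKE_Q(wake_q);
                struct rt_mutex_waiter *waiter;

                raw_spin_lock(&lock->wait_lock);
                waiter = rt_mutex_top_waiter(lock);
                wake_q_add(&wake_q, waiter->task);      /* defer the wakeup */
                raw_spin_unlock(&lock->wait_lock);

                wake_up_q(&wake_q);     /* woken task finds wait_lock free */
        }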
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1432056298-18738-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      45ab4eff
  4. 18 Jun 2015, 3 commits
  5. 17 Jun 2015, 1 commit
    • tracing: Have filter check for balanced ops · 2cf30dc1
      Steven Rostedt committed
      When the following filter is used it causes a warning to trigger:
      
       # cd /sys/kernel/debug/tracing
       # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
      -bash: echo: write error: Invalid argument
       # cat events/ext4/ext4_truncate_exit/filter
      ((dev==1)blocks==2)
      ^
      parse_error: No error
      
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
       Modules linked in: bnep lockd grace bluetooth  ...
       CPU: 3 PID: 1223 Comm: bash Tainted: G        W       4.1.0-rc3-test+ #450
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
        0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
        0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
        ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
       Call Trace:
        [<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
        [<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
        [<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
        [<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff81159065>] replace_preds+0x3c5/0x990
        [<ffffffff811596b2>] create_filter+0x82/0xb0
        [<ffffffff81159944>] apply_event_filter+0xd4/0x180
        [<ffffffff81152bbf>] event_filter_write+0x8f/0x120
        [<ffffffff811db2a8>] __vfs_write+0x28/0xe0
        [<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
        [<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
        [<ffffffff811dc408>] vfs_write+0xb8/0x1b0
        [<ffffffff811dc72f>] SyS_write+0x4f/0xb0
        [<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
       ---[ end trace e11028bd95818dcd ]---
      
      Worse yet, reading the error message (the filter again), it says that
      there was no error, when there clearly was. The issue is that the
      code that checks the input does not check for balanced ops. That is,
      it does not require an op between a closed parenthesis and the next
      token.

      This would only cause a warning, and fail out before doing any real
      harm, but it should still not cause a warning, and the reported error
      should work:
      
       # cd /sys/kernel/debug/tracing
       # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
      -bash: echo: write error: Invalid argument
       # cat events/ext4/ext4_truncate_exit/filter
      ((dev==1)blocks==2)
      ^
      parse_error: Meaningless filter expression
      
      And give no kernel warning.
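
      An illustrative version of the missing validation; the token names
      are hypothetical and do not match the kernel parser:

        /* Hypothetical token kinds; the kernel parser's names differ: */
        enum tok { TOK_OPERAND, TOK_OP, TOK_OPEN_PAREN, TOK_CLOSE_PAREN };

        static int check_balanced_op(enum tok prev, enum tok cur)
        {
                /*
                 * "((dev==1)blocks==2)": an operand directly follows ')'
                 * with no binary op joining the two predicates.
                 */
                if (prev == TOK_CLOSE_PAREN &&
                    (cur == TOK_OPERAND || cur == TOK_OPEN_PAREN))
                        return -EINVAL; /* "Meaningless filter expression" */
                return 0;
        }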
      
      Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: stable@vger.kernel.org # 2.6.31+
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      2cf30dc1
  6. 16 Jun 2015, 2 commits
  7. 12 Jun 2015, 6 commits
    • genirq: Introduce helper function irq_data_get_node() · 6783011b
      Jiang Liu committed
      Introduce helper function irq_data_get_node() and variants thereof to
      hide struct irq_data implementation details.
      
      Convert the core code to use them.
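
      The helpers have this shape (per the patch; the node field later moved
      into irq_common_data, at which point only these accessors needed
      updating):

        static inline int irq_data_get_node(struct irq_data *d)
        {
                return d->node;
        }

        static inline int irq_desc_get_node(struct irq_desc *desc)
        {
                return irq_data_get_node(&desc->irq_data);
        }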
      Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Kevin Cernekee <cernekee@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: http://lkml.kernel.org/r/1433145945-789-5-git-send-email-jiang.liu@linux.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      6783011b
    • genirq: Introduce struct irq_common_data to host shared irq data · 0d0b4c86
      Jiang Liu committed
      With the introduction of hierarchical irqdomains, struct irq_data
      becomes per-chip instead of per-irq and there may be multiple
      irq_datas associated with the same irq. Some per-irq data stored in
      struct irq_data may now get duplicated into multiple irq_datas,
      causing an inconsistent view.

      So introduce struct irq_common_data to host per-irq common data and
      to achieve a consistent view among irq_chips.
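
      A sketch of the split (the field selection is illustrative; the
      series migrates fields over several patches):

        /* data shared by all irq_datas of one interrupt */
        struct irq_common_data {
                unsigned int            state_use_accessors;
        };

        struct irq_data {
                u32                     mask;
                unsigned int            irq;
                unsigned long           hwirq;
                unsigned int            node;
                struct irq_common_data  *common;        /* one per irq */
                struct irq_chip         *chip;
                struct irq_domain       *domain;
        #ifdef  CONFIG_IRQ_DOMAIN_HIERARCHY
                struct irq_data         *parent_data;
        #endif
                void                    *chip_data;
        };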
      Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Kevin Cernekee <cernekee@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/1433145945-789-4-git-send-email-jiang.liu@linux.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      0d0b4c86
    • genirq: Prevent crash in irq_move_irq() · 77ed42f1
      Jiang Liu committed
      The functions irq_move_irq() and irq_move_masked_irq() expect that the
      caller passes the top-level irq_data to them when hierarchical
      irqdomains are enabled. But that's not true when called from
      apic_ack_edge(), which results in a null pointer dereference by
      idata->chip->irq_mask(idata).
      
      Instead of fixing the callers to pass the top-level irq_data, rather
      change irq_move_irq()/irq_move_masked_irq() to accept any irq_data.
      Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: http://lkml.kernel.org/r/1433145945-789-3-git-send-email-jiang.liu@linux.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      77ed42f1
    • genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain · 7bbf1dd2
      Jiang Liu committed
      For irqs associated with hierarchical irqdomains, there will be
      multiple irq_datas for one irq_desc. So enhance irq_data_to_desc()
      to support hierarchical irqdomains. Also export irq_data_to_desc()
      as an inline function for later reuse.
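
      The enhanced lookup has this shape (per the patch): resolve through
      the virq number when hierarchy support is enabled, since
      container_of() is only valid for the top-level irq_data embedded in
      the descriptor.

        static inline struct irq_desc *irq_data_to_desc(struct irq_data *data)
        {
        #ifdef  CONFIG_IRQ_DOMAIN_HIERARCHY
                return irq_to_desc(data->irq);
        #else
                return container_of(data, struct irq_desc, irq_data);
        #endif
        }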
      Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/1433145945-789-2-git-send-email-jiang.liu@linux.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7bbf1dd2
    • ntp: Do leapsecond adjustment in adjtimex read path · 96efdcf2
      John Stultz committed
      Since the leapsecond is applied at tick-time, this means there is a
      small window of time at the start of a leap-second where we cross into
      the next second before applying the leap.
      
      This patch modifies adjtimex so that the leap second is applied on the
      second edge, providing more correct leapsecond behavior.
      
      This does make it so that adjtimex()'s returned time values can be
      inconsistent with time values read from gettimeofday() or
      clock_gettime(CLOCK_REALTIME,...)  for a brief period of one tick at
      the leapsecond.  However, those other interfaces do not provide the
      TIME_OOP time_state return that adjtimex() provides, which allows the
      leapsecond to be properly represented. They instead only see a time
      discontinuity, and cannot tell the first 23:59:59 from the repeated
      23:59:59 leap second.
      
      This seems like a reasonable tradeoff given clock_gettime() /
      gettimeofday() cannot properly represent a leapsecond, and users
      likely care more about performance, while folks who are using
      adjtimex() more likely care about leap-second correctness.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: http://lkml.kernel.org/r/1434063297-28657-5-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      96efdcf2
    • time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge · 833f32d7
      John Stultz committed
      Currently, leapsecond adjustments are done at tick time. As a result,
      the leapsecond was applied at the first timer tick *after* the
      leapsecond (~1-10ms late depending on HZ), rather than exactly on the
      second edge.
      
      This was in part historical from back when we were always tick based,
      but correcting it has since been avoided because it adds extra
      conditional checks in the gettime fastpath, which has performance
      overhead.
      
      However, it was recently pointed out that ABS_TIME CLOCK_REALTIME
      timers set for right after the leapsecond could fire a second early,
      since some timers may be expired before we trigger the timekeeping
      timer, which then applies the leapsecond.
      
      This isn't quite as bad as it sounds, since behaviorally it is similar
      to what is possible with ntpd-made leapsecond adjustments done without
      using the kernel discipline, where, due to latencies, timers may fire
      just prior to the settimeofday() call. (Also, one should note that all
      applications using CLOCK_REALTIME timers should always be careful,
      since they are prone to quirks from settimeofday() disturbances.)
      
      However, the purpose of having the kernel do the leap adjustment is to
      avoid such latencies, so I think this is worth fixing.
      
      So in order to properly keep those timers from firing a second early,
      this patch modifies the ntp and timekeeping logic so that we keep
      enough state so that the update_base_offsets_now accessor, which
      provides the hrtimer core the current time, can check and apply the
      leapsecond adjustment on the second edge. This prevents the hrtimer
      core from expiring timers too early.
      
      This patch does not modify any other time read path, so no additional
      overhead is incurred. However, this also means that the leap-second
      continues to be applied at tick time for all other read-paths.
      
      Apologies to Richard Cochran, who pushed for similar changes years
      ago, which I resisted due to the concerns about the performance
      overhead.
      
      While I suspect this isn't extremely critical, folks who care about
      strict leap-second correctness will likely want to watch
      this. Potentially a -stable candidate eventually.
      Originally-suggested-by: Richard Cochran <richardcochran@gmail.com>
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Reported-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: http://lkml.kernel.org/r/1434063297-28657-4-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      833f32d7