1. 24 Jan, 2015 1 commit
    • N
      ktime: Optimize ktime_divns for constant divisors · 8b618628
      Nicolas Pitre committed
      At least on ARM, do_div() is optimized to turn constant divisors into
      an inline multiplication by the reciprocal value at compile time.
      However this optimization is missed entirely whenever ktime_divns() is
      used and the slow out-of-line division code is used all the time.
      
      Let ktime_divns() use do_div() inline whenever the divisor is constant
      and small enough.  This will make things like ktime_to_us() and
      ktime_to_ms() much faster.
      
      Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Nicolas Pitre <nico@linaro.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      8b618628
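The dispatch described in 8b618628 can be sketched in user-space C as follows. This is a minimal illustration, not the kernel code: `sketch_divns` and `sketch_divns_slow` are hypothetical names, a plain `/` stands in for do_div(), and `__builtin_constant_p` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Out-of-line slow path; a hypothetical stand-in for the generic helper. */
static __attribute__((noinline)) int64_t sketch_divns_slow(int64_t ns, int64_t div)
{
    assert(div > 0);
    return ns / div;
}

/*
 * Constant divisors that are positive and fit in 32 bits are divided
 * inline, where the compiler can turn the division into a multiplication
 * by the reciprocal. Everything else takes the out-of-line path.
 */
#define sketch_divns(ns, div)                                              \
    ((__builtin_constant_p(div) && (div) > 0 && !((int64_t)(div) >> 32))   \
         ? (int64_t)(ns) / (int64_t)(div)                                  \
         : sketch_divns_slow((ns), (div)))
```

With a literal divisor such as 1000 the division stays inline; a run-time divisor goes through the helper. The macro evaluates its arguments more than once, so arguments with side effects should be avoided.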
  2. 19 Dec, 2014 1 commit
    • T
      tick/powerclamp: Remove tick_nohz_idle abuse · a5fd9733
      Thomas Gleixner committed
      Commit 4dbd2771 "tick: export nohz tick idle symbols for module
      use" was merged via the thermal tree without an explicit ack from the
      relevant maintainers.
      
      The exports are abused by the intel powerclamp driver which implements
      a fake idle state from a sched FIFO task. This causes all kinds of
      wreckage in the NOHZ core code which rightfully assumes that
      tick_nohz_idle_enter/exit() are only called from the idle task itself.
      
      Recent changes in the NOHZ core led to a failure of the powerclamp
      driver and now people try to hack completely broken and backwards
      workarounds into the NOHZ core code. This is completely unacceptable
      and just papers over the real problem. There are way more subtle
      issues lurking around the corner.
      
      The real solution is to fix the powerclamp driver by rewriting it with
      a sane concept, but that's beyond the scope of this.
      
      So the only solution for now is to remove the calls into the core NOHZ
      code from the powerclamp trainwreck along with the exports. 
      
      Fixes: d6d71ee4 "PM: Introduce Intel PowerClamp Driver"
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Pan Jacob jun <jacob.jun.pan@intel.com>
      Cc: LKP <lkp@01.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a5fd9733
  3. 05 Dec, 2014 1 commit
  4. 25 Nov, 2014 1 commit
  5. 22 Nov, 2014 9 commits
  6. 16 Nov, 2014 1 commit
  7. 04 Nov, 2014 2 commits
  8. 29 Oct, 2014 4 commits
  9. 25 Oct, 2014 2 commits
  10. 09 Oct, 2014 1 commit
  11. 19 Sep, 2014 1 commit
    • K
      sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule() · f139caf2
      Kirill Tkhai committed
      schedule(), io_schedule() and schedule_timeout() always return
      with TASK_RUNNING state set, so one more setting is unnecessary.
      
      (All places in the patch are visibly correct from the diff context;
       the only exception is kiblnd_scheduler() from:

            drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

       Its schedule() is one line above the standard 3 lines of unified diff context.)

      There are no places where set_current_state() is used for its mb().
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Anil Belur <askb23@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry Eremin <dmitry.eremin@intel.com>
      Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Isaac Huang <he.huang@intel.com>
      Cc: James E.J. Bottomley <JBottomley@parallels.com>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Liang Zhen <liang.zhen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masaru Nomura <massa.nomura@gmail.com>
      Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleg Drokin <green@linuxhacker.ru>
      Cc: Peng Tao <bergwolf@gmail.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Robert Love <robert.w.love@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Ursula Braun <ursula.braun@de.ibm.com>
      Cc: Zi Shen Lim <zlim.lnx@gmail.com>
      Cc: devel@driverdev.osuosl.org
      Cc: dm-devel@redhat.com
      Cc: dri-devel@lists.freedesktop.org
      Cc: fcoe-devel@open-fcoe.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux390@de.ibm.com
      Cc: linux-afs@lists.infradead.org
      Cc: linux-cris-kernel@axis.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Cc: qla2xxx-upstream@qlogic.com
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Cc: user-mode-linux-user@lists.sourceforge.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f139caf2
  12. 14 Sep, 2014 4 commits
    • F
      nohz: nohz full depends on irq work self IPI support · 9b01f5bf
      Frederic Weisbecker committed
      The nohz full functionality depends on IRQ work to trigger its own
      interrupts. As it's used to restart the tick, we can't rely on the tick
      fallback for irq work callbacks, ie: we can't use the tick to restart
      the tick itself.
      
      Let's reject the full dynticks initialization if that arch support isn't
      available.
      
      As a side effect, this makes sure that nohz kick is never called from
      the tick. That otherwise would result in illegal hrtimer self-cancellation
      and lockup.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      9b01f5bf
    • F
      nohz: Consolidate nohz full init code · 4327b15f
      Frederic Weisbecker committed
      The support for CONFIG_NO_HZ_FULL_ALL=y and the nohz_full= kernel
      parameter both have their own way to do the same thing: allocate
      full dynticks cpumasks, fill them and initialize some state variables.
      
      Let's consolidate that all in the same place.
      
      While at it, convert some regular printk messages to warnings when
      fundamental allocations fail.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      4327b15f
    • F
      irq_work: Force raised irq work to run on irq work interrupt · 76a33061
      Frederic Weisbecker committed
      The nohz full kick, which restarts the tick when any resource depends
      on it, can't be executed just anywhere given the operation it does on timers.
      If it is called from the scheduler or timers code, chances are that
      we run into a deadlock.
      
      This is why we run the nohz full kick from an irq work. That way we make
      sure that the kick runs on a virgin context.
      
      That holds when irq work runs in its own dedicated self-IPI, but
      things are different for the many archs that don't support
      self-triggered interrupts. In order to support them, irq works are
      also handled by the timer interrupt as a fallback.
      
      Now when irq works run on the timer interrupt, the context isn't blank.
      More precisely, they can run in the context of the hrtimer that runs the
      tick. But the nohz kick cancels and restarts this hrtimer and cancelling
      an hrtimer from itself isn't allowed. This is why we run in an endless
      loop:
      
      	Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
      	CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
      	Workqueue: btrfs-endio-write normal_work_helper [btrfs]
      	 ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
      	 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
      	 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
      	Call Trace:
      	 <NMI>  [<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
      	 [<ffffffff8a7ef928>] panic+0xd4/0x207
      	 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
      	 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
      	 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
      	 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
      	 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20
      	 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
      	 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
      	 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390
      	 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
      	 [<ffffffff8a008268>] do_nmi+0xb8/0x100
      	 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 <<EOE>>  <IRQ>  [<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
      	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
      	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
      	 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
      	 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
      	 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
      	 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
      	 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
      	 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
      	 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70
      	 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
      	 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
      	 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
      	 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
      	 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
      	 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
      	 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
      	 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80
      
      To fix this we force non-lazy irq works to run on irq work self-IPIs
      when available. That ability of the arch to trigger irq work self IPIs
      is available with arch_irq_work_has_interrupt().
      Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      76a33061
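The rule that 76a33061 introduces can be modelled as a small decision function. This is only a sketch of the policy as described in the commit message, not the irq_work implementation: the enum, the function name, and the `arch_has_self_ipi` parameter (standing in for the result of arch_irq_work_has_interrupt()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_work_kind { SKETCH_WORK_RAISED, SKETCH_WORK_LAZY };

/*
 * Decide whether the timer tick may run pending works of the given kind.
 * arch_has_self_ipi models what arch_irq_work_has_interrupt() reports.
 */
static bool sketch_tick_may_run(enum sketch_work_kind kind, bool arch_has_self_ipi)
{
    assert(kind == SKETCH_WORK_RAISED || kind == SKETCH_WORK_LAZY);

    if (kind == SKETCH_WORK_LAZY)
        return true;            /* lazy works may always wait for the tick */

    /* Raised works use the tick only as a fallback when no self-IPI exists. */
    return !arch_has_self_ipi;
}
```

On an arch with self-IPI support, a raised work such as the nohz full kick is therefore never run from the tick's own hrtimer context, which is the situation the trace above shows.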
    • F
      nohz: Move nohz full init call to tick init · a80e49e2
      Frederic Weisbecker committed
      This way we unbloat main.c a bit and, more importantly, we initialize
      nohz full after init_IRQ(). This dependency will be needed in further
      patches because nohz full needs irq work to raise its own IRQ.
      Information about the support for this ability on ARM64 is obtained in
      init_IRQ(), which initializes the pointer to __smp_call_function.
      
      Since tick_init() is called right after init_IRQ(), this is a good place
      to call tick_nohz_init() and prepare for that dependency.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      a80e49e2
  13. 13 Sep, 2014 4 commits
    • R
      alarmtimer: Lock k_itimer during timer callback · 474e941b
      Richard Larocque committed
      Locks the k_itimer's it_lock member when handling the alarm timer's
      expiry callback.
      
      The regular posix timers defined in posix-timers.c have this lock held
      during timeout processing because their callbacks are routed through
      posix_timer_fn().  The alarm timers follow a different path, so they
      ought to grab the lock somewhere else.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      474e941b
    • R
      alarmtimer: Do not signal SIGEV_NONE timers · 265b81d2
      Richard Larocque committed
      Avoids sending a signal to alarm timers created with sigev_notify set to
      SIGEV_NONE by checking for that special case in the timeout callback.
      
      The regular posix timers avoid sending signals to SIGEV_NONE timers by
      not scheduling any callbacks for them in the first place.  Although it
      would be possible to do something similar for alarm timers, it's simpler
      to handle this as a special case in the timeout.
      
      Prior to this patch, the alarm timer would ignore the sigev_notify value
      and try to deliver signals to the process anyway.  Even worse, the
      sanity check for the value of sigev_signo is skipped when SIGEV_NONE was
      specified, so the signal number could be bogus.  If sigev_signo was an
      uninitialized value (as it often would be if SIGEV_NONE is used), then
      it's hard to predict which signal will be sent.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      265b81d2
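The special case added by 265b81d2 can be sketched as follows. This is a user-space model under assumptions, not the alarmtimer code: `struct sketch_alarm_timer` and `sketch_alarm_expired` are hypothetical, and a counter replaces real signal delivery.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose SIGEV_* in <signal.h> */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

struct sketch_alarm_timer {
    int notify;        /* SIGEV_NONE, SIGEV_SIGNAL, ... */
    int signals_sent;  /* counts deliveries instead of really signalling */
};

/* Expiry callback: skip delivery entirely for SIGEV_NONE timers. */
static void sketch_alarm_expired(struct sketch_alarm_timer *t)
{
    assert(t != NULL);

    if (t->notify == SIGEV_NONE)
        return;             /* the signal number may be bogus, so never use it */

    t->signals_sent++;
}
```

The check sits in the timeout path, mirroring the commit's choice to handle SIGEV_NONE there instead of avoiding the callback altogether.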
    • R
      alarmtimer: Return relative times in timer_gettime · e86fea76
      Richard Larocque committed
      Returns the time remaining for an alarm timer, rather than the time at
      which it is scheduled to expire.  If the timer has already expired or it
      is not currently scheduled, the it_value's members are set to zero.
      
      This new behavior matches that of the other posix-timers and the POSIX
      specifications.
      
      This is a change in user-visible behavior, and may break existing
      applications.  Hopefully, few users rely on the old incorrect behavior.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      [jstultz: minor style tweak]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      e86fea76
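The behaviour described in e86fea76 amounts to reporting a clamped difference instead of an absolute expiry. A minimal sketch, assuming times are plain nanosecond counts and using a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time left until expiry, in nanoseconds. Returns 0 when the timer is
 * not armed or has already expired, matching the zeroed it_value case.
 */
static int64_t sketch_alarm_remaining_ns(int64_t expires_ns, int64_t now_ns, int armed)
{
    assert(now_ns >= 0);

    if (!armed || expires_ns <= now_ns)
        return 0;

    return expires_ns - now_ns;
}
```

A caller that previously received the absolute expiry time would now see the remaining interval, which is the user-visible change the commit message warns about.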
    • A
      jiffies: Fix timeval conversion to jiffies · d78c9300
      Andrew Hunter committed
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly when repeatedly stopping/starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation of
      NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
      x/(2^USEC_JIFFIE_SC), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
      
      Tested: the following program:
      
      #include <stddef.h>
      #include <stdio.h>
      #include <sys/time.h>

      int main(void) {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 (long)prev.it_interval.tv_sec, (long)prev.it_interval.tv_usec,
                 (long)prev.it_value.tv_sec, (long)prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
      }
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Reported-by: Aaron Jacobs <jacobsa@google.com>
      Signed-off-by: Andrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      d78c9300
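The rounding error described in d78c9300 can be reproduced with plain integer arithmetic. The sketch below is an illustration under stated assumptions, not the kernel source: it assumes HZ=1000, TICK_NSEC = 1,000,000 and USEC_JIFFIE_SC = 40, ignores the seconds part, and replaces the scaled timespec_to_jiffies() conversion with an exact round-up division. All `sketch_` and `SKETCH_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TICK_NSEC     1000000ULL  /* assumed: HZ=1000 */
#define SKETCH_NSEC_PER_USEC 1000ULL
#define SKETCH_USEC_SC       40          /* assumed USEC_JIFFIE_SC for HZ=1000 */

/* Old scheme: scaled multiply, rounded up in the intermediate form. */
static uint64_t sketch_usec_to_jiffies_old(uint64_t usec)
{
    /* x approximates NSEC_PER_USEC/TICK_NSEC * 2^SC, rounded up. */
    const uint64_t x = ((SKETCH_NSEC_PER_USEC << SKETCH_USEC_SC) +
                        SKETCH_TICK_NSEC - 1) / SKETCH_TICK_NSEC;
    const uint64_t round = (1ULL << SKETCH_USEC_SC) - 1;

    assert(usec < 1000000);  /* sub-second part only; keeps the math in 64 bits */
    return (usec * x + round) >> SKETCH_USEC_SC;
}

/* New scheme: convert to nanoseconds first, then round up to whole ticks. */
static uint64_t sketch_usec_to_jiffies_new(uint64_t usec)
{
    assert(usec < 1000000);
    const uint64_t nsec = usec * SKETCH_NSEC_PER_USEC;
    return (nsec + SKETCH_TICK_NSEC - 1) / SKETCH_TICK_NSEC;
}
```

Under these assumptions sketch_usec_to_jiffies_old(10000) yields 11, matching the figure in the commit message, while sketch_usec_to_jiffies_new(10000) yields 10. Values that are not exact multiples of a tick, such as 10001 usec, still round up to 11 in the new scheme.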
  14. 08 Sep, 2014 1 commit
    • R
      time, signal: Protect resource use statistics with seqlock · e78c3496
      Rik van Riel committed
      Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
      issues on large systems, due to both functions being serialized with a
      lock.
      
      The lock protects against reporting a wrong value, due to a thread in the
      task group exiting, its statistics reporting up to the signal struct, and
      that exited task's statistics being counted twice (or not at all).
      
      Protecting that with a lock results in times() and clock_gettime() being
      completely serialized on large systems.
      
      This can be fixed by using a seqlock around the events that gather and
      propagate statistics. As an additional benefit, the protection code can
      be moved into thread_group_cputime(), slightly simplifying the calling
      functions.
      
      In the case of posix_cpu_clock_get_task() things can be simplified a
      lot, because the calling function already ensures that the task sticks
      around, and the rest is now taken care of in thread_group_cputime().
      
      This way the statistics reporting code can run lockless.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e78c3496
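The lockless read path that e78c3496 relies on follows the usual sequence-counter pattern: readers retry when a writer was active. Below is a minimal user-space sketch with C11 atomics, not the kernel's seqlock_t: it assumes a single writer (the kernel type also serializes writers), and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct sketch_stats {
    atomic_uint seq;            /* even: stable, odd: update in progress */
    _Atomic uint64_t utime;
    _Atomic uint64_t stime;
};

static void sketch_stats_init(struct sketch_stats *s)
{
    atomic_init(&s->seq, 0);
    atomic_init(&s->utime, 0);
    atomic_init(&s->stime, 0);
}

/* Writer side: e.g. an exiting thread folding its times into the totals. */
static void sketch_stats_add(struct sketch_stats *s, uint64_t utime, uint64_t stime)
{
    atomic_fetch_add(&s->seq, 1);           /* becomes odd */
    atomic_fetch_add(&s->utime, utime);
    atomic_fetch_add(&s->stime, stime);
    atomic_fetch_add(&s->seq, 1);           /* becomes even again */
}

/* Reader side: takes no lock, retries if it raced with a writer. */
static void sketch_stats_read(struct sketch_stats *s, uint64_t *utime, uint64_t *stime)
{
    unsigned int begin;

    assert(utime && stime);
    do {
        begin = atomic_load(&s->seq);
        *utime = atomic_load(&s->utime);
        *stime = atomic_load(&s->stime);
    } while ((begin & 1u) || atomic_load(&s->seq) != begin);
}
```

Readers never block each other, which is the scalability property the commit is after; the cost is an occasional retry when an update is in flight.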
  15. 06 Sep, 2014 1 commit
  16. 05 Sep, 2014 1 commit
    • F
      nohz: Restore NMI safe local irq work for local nohz kick · 40bea039
      Frederic Weisbecker committed
      The local nohz kick is currently used by perf, which needs it to be
      NMI-safe. However, a recent commit (7d1311b9) changed its
      implementation to fire the local kick using the remote kick API. That
      was convenient for making the code more generic, but the remote kick
      isn't NMI-safe.
      
      As a result:
      
      	WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
      	CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
      	0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
      	0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
      	0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
      	Call Trace:
      	<NMI>  [<ffffffff9a7f1e37>] dump_stack+0x4e/0x7a
      	[<ffffffff9a0791dd>] warn_slowpath_common+0x7d/0xa0
      	[<ffffffff9a07930a>] warn_slowpath_null+0x1a/0x20
      	[<ffffffff9a17ca1e>] irq_work_queue_on+0x11e/0x140
      	[<ffffffff9a10a2c7>] tick_nohz_full_kick_cpu+0x57/0x90
      	[<ffffffff9a186cd5>] __perf_event_overflow+0x275/0x350
      	[<ffffffff9a184f80>] ? perf_event_task_disable+0xa0/0xa0
      	[<ffffffff9a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
      	[<ffffffff9a187934>] perf_event_overflow+0x14/0x20
      	[<ffffffff9a020386>] intel_pmu_handle_irq+0x206/0x410
      	[<ffffffff9a0b54d3>] ? arch_vtime_task_switch+0x63/0x130
      	[<ffffffff9a01937b>] perf_event_nmi_handler+0x2b/0x50
      	[<ffffffff9a007b72>] nmi_handle+0xd2/0x390
      	[<ffffffff9a007aa5>] ? nmi_handle+0x5/0x390
      	[<ffffffff9a0d131b>] ? lock_release+0xab/0x330
      	[<ffffffff9a008062>] default_do_nmi+0x72/0x1c0
      	[<ffffffff9a0c925f>] ? cpuacct_account_field+0xcf/0x200
      	[<ffffffff9a008268>] do_nmi+0xb8/0x100
      
      Let's fix this by restoring the use of local irq work for the nohz local
      kick.
      Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      40bea039
  17. 27 Aug, 2014 3 commits
  18. 23 Aug, 2014 2 commits
    • V
      nohz: Avoid tick's double reprogramming in highres mode · 2a16fc93
      Viresh Kumar committed
      In highres mode, the tick reschedules itself unconditionally to the
      next jiffy.
      
      However while this clock reprogramming is relevant when the tick is
      in periodic mode, it's not that interesting when we run in dynticks mode
      because irq exit is likely going to overwrite the next tick to some
      randomly deferred future.
      
      So let's just get rid of this tick self-rescheduling in dynticks mode.
      This way we can avoid some clockevents double write in favourable
      scenarios like when we stop the tick completely in idle while no other
      hrtimer is pending.
      Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      2a16fc93
    • V
      nohz: Fix spurious periodic tick behaviour in low-res dynticks mode · b5e995e6
      Viresh Kumar committed
      When we reach the end of the tick handler, we unconditionally reschedule
      the next tick to the next jiffy. Then on irq exit, the nohz code
      overrides that setting if needed and defers the next tick as far away in
      the future as possible.
      
      Now in the best dynticks case, when we actually don't need any tick in
      the future (ie: expires == KTIME_MAX), low-res and high-res behave
      differently. What we want in this case is to cancel the next tick
      programmed by the previous one. That's what we do in high-res mode. OTOH
      we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't
      do anything in this case and the next tick remains scheduled to jiffies + 1.
      
      As a result, in low-res mode, when the dynticks code determines that no
      tick is needed in the future, we can recursively get a spurious tick
      every jiffy because then the next tick is always reprogrammed from the
      tick handler and is never cancelled. And this can happen indefinitely
      until some subsystem actually needs a precise tick in the future and only
      then we eventually overwrite the previous tick handler setting to defer
      the next tick.
      
      We are fixing this by introducing the ONESHOT_STOPPED mode which will
      let us pause a clockevent when no further interrupt is needed. Meanwhile
      we can't expect all drivers to support this new mode.
      
      So let's reduce most of the symptoms by skipping the nohz-blind tick
      rescheduling from the tick-handler when the CPU is in dynticks mode.
      That tick rescheduling wrongly assumed periodicity and the low-res
      dynticks code can't cancel such decision. This breaks the recursive (and
      thus the worst) part of the problem. In the worst case now, we'll get
      only one extra tick due to uncancelled tick scheduled before we entered
      dynticks mode.
      
      This also removes a needless clockevent write on idle ticks. Since those
      clock writes are usually considered to be slow, it's a general win.
      Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      b5e995e6
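The effect of b5e995e6 in low-res mode can be illustrated with a toy simulation. This is a model of the behaviour described in the commit message, not the tick code: it assumes the CPU has entered dynticks mode with one tick still programmed, that nothing else needs a tick, and that low-res mode cannot cancel a programmed tick. The function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count tick interrupts over `jiffies` periods while the tick is
 * logically stopped. When skip_reprogram_when_stopped is false, the
 * handler blindly reprograms the next tick every time (old behaviour).
 */
static int sketch_ticks_while_stopped(int jiffies, bool skip_reprogram_when_stopped)
{
    bool tick_programmed = true;  /* left over from before entering dynticks */
    int fired = 0;

    assert(jiffies >= 0);
    for (int j = 0; j < jiffies; j++) {
        if (!tick_programmed)
            continue;             /* nothing pending: the CPU stays quiet */
        fired++;
        /* Only the handler decides here, since nothing can cancel the tick. */
        tick_programmed = !skip_reprogram_when_stopped;
    }
    return fired;
}
```

In this model the old behaviour fires once per jiffy for as long as the CPU stays in dynticks mode, while the new behaviour fires at most the single leftover tick, which corresponds to the "only one extra tick" worst case stated above.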