1. 30 Dec, 2017 2 commits
  2. 22 Nov, 2017 3 commits
  3. 12 Nov, 2017 1 commit
    • D
      timers: Add a function to start/reduce a timer · b24591e2
      David Howells committed
      Add a function, similar to mod_timer(), that will start a timer if it isn't
      running and will modify it if it is running and has an expiry time longer
      than the new time.  If the timer is running with an expiry time that's the
      same or sooner, no change is made.
      
      The function looks like:
      
      	int timer_reduce(struct timer_list *timer, unsigned long expires);
      
      This can be used by code such as networking code to make it easier to share
      a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
      the rxrpc_call struct will maintain a number of timeouts:
      
      	unsigned long	ack_at;
      	unsigned long	resend_at;
      	unsigned long	ping_at;
      	unsigned long	expect_rx_by;
      	unsigned long	expect_req_by;
      	unsigned long	expect_term_by;
      
      each of which is set independently of the others.  With timer reduction
      available, when the code needs to set one of the timeouts, it only needs to
      look at that timeout and then call timer_reduce() to modify the timer,
      starting it or bringing it forward if necessary.  There is no need to refer
      to the other timeouts to see which is earliest and no need to take any lock
      other than, potentially, the timer lock inside timer_reduce().
      
      Note that this does not protect against concurrent invocations of any of
      the timer functions.
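
The start-or-reduce rule described above can be modelled in user space. The
following is a minimal sketch, not the kernel implementation: struct
model_timer, model_time_before() and model_timer_reduce() are hypothetical
names, there is no locking, and the return convention (1 if the timer was
already pending, else 0) is an assumption made for the sketch.

```c
#include <stdbool.h>

/* Wraparound-safe "a is earlier than b", in the style of the kernel's time_before(). */
#define model_time_before(a, b) ((long)((a) - (b)) < 0)

/* Hypothetical stand-in for struct timer_list; only the fields this sketch needs. */
struct model_timer {
	bool pending;
	unsigned long expires;
};

/*
 * Start the timer if it is idle; if it is already pending, only ever pull
 * its expiry earlier. Returns 1 if the timer was pending on entry, else 0
 * (a convention assumed for this sketch).
 */
static int model_timer_reduce(struct model_timer *t, unsigned long expires)
{
	if (!t->pending) {
		t->pending = true;
		t->expires = expires;
		return 0;
	}
	if (model_time_before(expires, t->expires))
		t->expires = expires;
	return 1;
}
```

A later or equal expiry leaves a pending timer untouched, which is what lets
each timeout be set without looking at the others.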
      
      As an example, the expect_rx_by timeout above, which terminates a call if
      we don't get a packet from the server within a certain time window, would
      be set something like this:
      
      	unsigned long now = jiffies;
      	unsigned long expect_rx_by = now + packet_receive_timeout;
      	WRITE_ONCE(call->expect_rx_by, expect_rx_by);
      	timer_reduce(&call->timer, expect_rx_by);
      
      The timer service code (which might, say, be in a work function) would then
      check all the timeouts to see which, if any, had triggered, and deal with
      those:
      
      	t = READ_ONCE(call->ack_at);
      	if (time_after_eq(now, t)) {
      		cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
      		set_bit(RXRPC_CALL_EV_ACK, &call->events);
      	}
      
      and then restart the timer if necessary by finding the soonest timeout that
      hasn't yet passed and then calling timer_reduce().
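
The "find the soonest timeout that hasn't yet passed" step might look roughly
like the sketch below. It is a user-space model under stated assumptions: the
timeouts are passed as a plain array, and soonest_future_timeout() and the
model_time_*() macros are hypothetical names, not AF_RXRPC code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe comparisons in the style of the kernel's jiffies helpers. */
#define model_time_after(a, b)  ((long)((b) - (a)) < 0)
#define model_time_before(a, b) ((long)((a) - (b)) < 0)

/*
 * Scan all timeouts and report the earliest one that is still in the
 * future relative to 'now'. Returns false if every timeout has passed.
 */
static bool soonest_future_timeout(const unsigned long *timeouts, size_t n,
				   unsigned long now, unsigned long *soonest)
{
	bool found = false;

	for (size_t i = 0; i < n; i++) {
		if (!model_time_after(timeouts[i], now))
			continue;	/* already due; handled by the event checks */
		if (!found || model_time_before(timeouts[i], *soonest)) {
			*soonest = timeouts[i];
			found = true;
		}
	}
	return found;
}
```

If the function reports a value, the caller would pass it to timer_reduce()
to re-arm the shared timer.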
      
      The disadvantage of doing things this way rather than comparing the timers
      each time and calling mod_timer() is that you *will* take timer events
      unless you can finish what you're doing and delete the timer in time.
      
      The advantage of doing things this way is that you don't need to use a lock
      to work out when the next timer should be set, other than the timer's own
      lock - which you might not have to take.
      
      [ tglx: Fixed weird formatting and adapted it to pending changes ]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keyrings@vger.kernel.org
      Cc: linux-afs@lists.infradead.org
      Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
      b24591e2
  4. 18 Oct, 2017 2 commits
  5. 05 Oct, 2017 1 commit
    • K
      timer: Convert schedule_timeout() to use from_timer() · 58e1177b
      Kees Cook committed
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new from_timer() helper and passing
      the timer pointer explicitly. Since this special timer is on the stack, it
      needs to have a wrapper structure to carry state once .data is eliminated.
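
The wrapper-structure idea can be illustrated outside the kernel. This is a
sketch under assumptions: the model_* names are hypothetical, the macro takes
the containing type explicitly (unlike the kernel's from_timer() helper), and
an int flag stands in for the task that would be woken.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct timer_list with a pointer-taking callback. */
struct model_timer_list {
	unsigned long expires;
	void (*function)(struct model_timer_list *);
};

/* Recover the containing structure from a pointer to its embedded timer. */
#define model_from_timer(type, timer_ptr, member) \
	((type *)((char *)(timer_ptr) - offsetof(type, member)))

/* Wrapper that carries the state formerly passed through .data. */
struct model_process_timer {
	struct model_timer_list timer;
	int *woken;	/* stands in for the task to wake */
};

/* Callback: receives only the timer pointer and derives its state from it. */
static void model_process_timeout(struct model_timer_list *t)
{
	struct model_process_timer *timeout =
		model_from_timer(struct model_process_timer, t, timer);

	*timeout->woken = 1;
}
```

Because the timer is embedded in the wrapper, an on-stack instance carries
everything the callback needs without a separate data argument.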
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mips@linux-mips.org
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Sebastian Reichel <sre@kernel.org>
      Cc: Kalle Valo <kvalo@qca.qualcomm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: linux1394-devel@lists.sourceforge.net
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: linux-s390@vger.kernel.org
      Cc: linux-wireless@vger.kernel.org
      Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
      Cc: Wim Van Sebroeck <wim@iguana.be>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Harish Patil <harish.patil@cavium.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Manish Chopra <manish.chopra@cavium.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-pm@vger.kernel.org
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: linux-watchdog@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Cc: Michael Reed <mdr@sgi.com>
      Cc: netdev@vger.kernel.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Link: https://lkml.kernel.org/r/1507159627-127660-2-git-send-email-keescook@chromium.org
      58e1177b
  6. 24 Aug, 2017 1 commit
    • N
      timers: Fix excessive granularity of new timers after a nohz idle · 2fe59f50
      Nicholas Piggin committed
      When a timer base is idle, it is forwarded when a new timer is added
      to ensure that granularity does not become excessive. When not idle,
      the timer tick is expected to increment the base.
      
      However there are several problems:
      
      - If an existing timer is modified, the base is forwarded only after
        the index is calculated.
      
      - The base is not forwarded by add_timer_on.
      
      - There is a window after a timer is restarted from a nohz idle, after
        it is marked not-idle and before the timer tick on this CPU, where a
        timer may be added but the ancient base does not get forwarded.
      
      These result in excessive granularity (a 1 jiffy timeout can blow out
      to 100s of jiffies), which causes the RCU lockup detector to trigger,
      among other things.
      
      Fix this by keeping track of whether the timer base has been idle
      since it was last run or forwarded, and if so then forward it before
      adding a new timer.
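
A rough model of the fix, assuming a single flag that records whether the base
has been idle since it was last run or forwarded. The names (struct
model_wheel_base, was_idle, model_enqueue_delta) are hypothetical, and the
sketch ignores locking and the next-expiry limit that real forwarding must
respect.

```c
#include <stdbool.h>

#define model_time_after(a, b) ((long)((b) - (a)) < 0)

/* Hypothetical stand-in for a timer base: wheel clock plus an idle marker. */
struct model_wheel_base {
	unsigned long clk;
	bool was_idle;	/* set when the CPU goes idle, cleared on forward */
};

static void model_mark_idle(struct model_wheel_base *b)
{
	b->was_idle = true;
}

/* Forward the stale clock once, but only if the base has been idle. */
static void model_forward_if_needed(struct model_wheel_base *b, unsigned long now)
{
	if (!b->was_idle)
		return;
	b->was_idle = false;
	if (model_time_after(now, b->clk))
		b->clk = now;
}

/* Distance from the wheel clock to the expiry, as seen at enqueue time. */
static unsigned long model_enqueue_delta(struct model_wheel_base *b,
					 unsigned long now, unsigned long expires)
{
	model_forward_if_needed(b, now);
	return expires - b->clk;
}
```

With a stale clock the delta for a 1 jiffy timeout is huge, which is what
pushes the timer into a coarse outer level; after forwarding the delta is 1.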
      
      There is still a case where mod_timer optimises the case of a pending
      timer mod with the same expiry time, where the timer can see excessive
      granularity relative to the new, shorter interval. A comment is added,
      but it's not changed because it is an important fastpath for
      networking.
      
      This has been tested and found to fix the RCU softlockup messages.
      
      Testing was also done with tracing to measure requested versus
      achieved wakeup latencies for all non-deferrable timers in an idle
      system (with no lockup watchdogs running). Wakeup latency relative to
      absolute latency is calculated (note this suffers from round-up skew
      at low absolute times) and analysed:
      
                   max     avg      std
      upstream   506.0    1.20     4.68
      patched      2.0    1.08     0.15
      
      The bug was noticed due to the lockup detector Kconfig changes
      dropping it out of people's .configs and resulting in larger base
      clk skew. When the lockup detectors are enabled, no CPU can go idle for
      longer than 4 seconds, which limits the granularity errors.
      Sub-optimal timer behaviour is observable on a smaller scale in that
      case:
      
                   max     avg      std
      upstream     9.0    1.05     0.19
      patched      2.0    1.04     0.11
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: David Miller <davem@davemloft.net>
      Cc: dzickus@redhat.com
      Cc: sfr@canb.auug.org.au
      Cc: mpe@ellerman.id.au
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: linuxarm@huawei.com
      Cc: abdhalee@linux.vnet.ibm.com
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: akpm@linux-foundation.org
      Cc: paulmck@linux.vnet.ibm.com
      Cc: torvalds@linux-foundation.org
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
      2fe59f50
  7. 01 Aug, 2017 1 commit
  8. 29 Jun, 2017 1 commit
  9. 21 Jun, 2017 1 commit
  10. 20 Apr, 2017 1 commit
  11. 24 Mar, 2017 1 commit
  12. 02 Mar, 2017 3 commits
  13. 10 Feb, 2017 1 commit
    • K
      time: Remove CONFIG_TIMER_STATS · dfb4357d
      Kees Cook committed
      Currently CONFIG_TIMER_STATS exposes process information across namespaces:
      
      kernel/time/timer_list.c print_timer():
      
              SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);
      
      /proc/timer_list:
      
       #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570
      
      Given that the tracer can give the same information, this patch entirely
      removes CONFIG_TIMER_STATS.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: linux-doc@vger.kernel.org
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Xing Gao <xgao01@email.wm.edu>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jessica Frazelle <me@jessfraz.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-api@vger.kernel.org
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      dfb4357d
  14. 25 Dec, 2016 1 commit
  15. 16 Nov, 2016 2 commits
  16. 26 Oct, 2016 2 commits
    • D
      timers: Fix documentation for schedule_timeout() and similar · 4b7e9cf9
      Douglas Anderson committed
      The documentation for schedule_timeout(), schedule_hrtimeout(), and
      schedule_hrtimeout_range() all claim that the routines couldn't possibly
      return early if the task state was TASK_UNINTERRUPTIBLE. This is simply
      not true since wake_up_process() will cause those routines to exit early.
      
      We cannot make schedule_[hr]timeout() loop until the timeout expires if the
      task state is uninterruptible because we have users which rely on the
      existing and designed behaviour.
      
      Make the documentation match the (correct) implementation.
      
      schedule_hrtimeout() returns -EINTR even when an uninterruptible task was
      woken up. This might look strange, but making the return code depend on the
      state is too much of an effort as it would affect all the call sites. There
      is no value in doing so, but we spell it out clearly in the documentation.
      Suggested-by: Daniel Kurtz <djkurtz@chromium.org>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Cc: huangtao@rock-chips.com
      Cc: heiko@sntech.de
      Cc: broonie@kernel.org
      Cc: briannorris@chromium.org
      Cc: Andreas Mohr <andi@lisas.de>
      Cc: linux-rockchip@lists.infradead.org
      Cc: tony.xie@rock-chips.com
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: linux@roeck-us.net
      Cc: tskd08@gmail.com
      Link: http://lkml.kernel.org/r/1477065531-30342-2-git-send-email-dianders@chromium.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      4b7e9cf9
    • D
      timers: Fix usleep_range() in the context of wake_up_process() · 6c5e9059
      Douglas Anderson committed
      Users of usleep_range() expect that it will _never_ return in less time
      than the minimum passed parameter. However, nothing in the code ensures
      this when the sleeping task is woken by wake_up_process() or any other
      mechanism which can wake a task from uninterruptible state.
      
      Neither usleep_range() nor schedule_hrtimeout_range*() has any protection
      against wakeups. schedule_hrtimeout_range*() is designed this way despite
      the fact that the API documentation does not mention it.
      
      msleep() already has code to handle this case since it will loop as long
      as there is still time left.  usleep_range() has no such loop, so add one.
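
The looping idea can be shown with a deterministic user-space model. This is a
sketch under assumptions, not the kernel code: struct model_clock,
model_sleep_until() and model_usleep_min() are hypothetical, and early wakeups
are simulated by a counter rather than by wake_up_process().

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated clock plus a budget of early wakeups still to be delivered. */
struct model_clock {
	uint64_t now_us;
	unsigned early_wakeups;
};

/*
 * Simulated sleep primitive: normally advances time to the deadline and
 * returns false; while early wakeups remain it stops halfway and returns
 * true, like a sleep that was cut short by a wakeup.
 */
static bool model_sleep_until(struct model_clock *c, uint64_t deadline_us)
{
	if (c->early_wakeups > 0 && c->now_us < deadline_us) {
		c->early_wakeups--;
		c->now_us += (deadline_us - c->now_us) / 2;
		return true;
	}
	if (c->now_us < deadline_us)
		c->now_us = deadline_us;
	return false;
}

/* Keep sleeping against an absolute deadline until it has really passed. */
static void model_usleep_min(struct model_clock *c, uint64_t min_us)
{
	uint64_t deadline = c->now_us + min_us;

	while (model_sleep_until(c, deadline))
		;	/* woken early: go back to sleep for the remainder */
}
```

Using an absolute deadline means repeated early wakeups cannot stretch or
shorten the total sleep; each retry only covers the time still remaining.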
      
      Presumably this problem was not detected before because usleep_range() is
      only used in a few places and the function is mostly used in contexts which
      are not exposed to wakeups of any form.
      
      An effort was made to look for users relying on the old behavior by
      looking for usleep_range() in the same file as wake_up_process().
      No problems were found by this search, though it is conceivable that
      someone could have put the sleep and wakeup in two different files.
      
      An effort was made to ask several upstream maintainers if they were aware
      of people relying on wake_up_process() to wake up usleep_range(). No
      maintainers were aware of that but they were aware of many people relying
      on usleep_range() never returning before the minimum.
      Reported-by: Tao Huang <huangtao@rock-chips.com>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Cc: heiko@sntech.de
      Cc: broonie@kernel.org
      Cc: briannorris@chromium.org
      Cc: Andreas Mohr <andi@lisas.de>
      Cc: linux-rockchip@lists.infradead.org
      Cc: tony.xie@rock-chips.com
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: djkurtz@chromium.org
      Cc: linux@roeck-us.net
      Cc: tskd08@gmail.com
      Link: http://lkml.kernel.org/r/1477065531-30342-1-git-send-email-dianders@chromium.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      6c5e9059
  17. 25 Oct, 2016 4 commits
    • T
      timers: Prevent base clock corruption when forwarding · 6bad6bcc
      Thomas Gleixner committed
      When a timer is enqueued we try to forward the timer base clock. This
      mechanism has two issues:
      
      1) Forwarding a remote base unlocked
      
      The forwarding function is called from get_target_base() with the current
      timer base lock held. But if the new target base is a different base than
      the current base (can happen with NOHZ, sigh!) then the forwarding is done
      on an unlocked base. This can lead to corruption of base->clk.
      
      Solution is simple: Invoke the forwarding after the target base is locked.
      
      2) Possible corruption due to jiffies advancing
      
      This is similar to the issue in get_next_timer_interrupt() which was fixed
      in the previous patch. jiffies can advance between check and assignment
      and therefore advance base->clk beyond the next expiry value.
      
      So we need to read jiffies into a local variable once and do the checks and
      assignment with the local copy.
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Reported-by: Ashton Holmes <scoopta@gmail.com>
      Reported-by: Michael Thayer <michael.thayer@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Necasek <michal.necasek@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: knut.osmundsen@oracle.com
      Cc: stable@vger.kernel.org
      Cc: stern@rowland.harvard.edu
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      6bad6bcc
    • T
      timers: Prevent base clock rewind when forwarding clock · 041ad7bc
      Thomas Gleixner committed
      Ashton and Michael reported that kernel versions 4.8 and later suffer from
      USB timeouts which are caused by the timer wheel rework.
      
      This is caused by a bug in the base clock forwarding mechanism, which leads
      to timers expiring early. The scenario which leads to this is:
      
      run_timers()
        while (jiffies >= base->clk) {
          collect_expired_timers();
          base->clk++;
          expire_timers();
        }          
      
      So base->clk = jiffies + 1. Now the cpu goes idle:
      
      idle()
        get_next_timer_interrupt()
          nextevt = __next_timer_interrupt();
          if (time_after(nextevt, base->clk))
             base->clk = jiffies;
      
      jiffies has not advanced since run_timers(), so this assignment effectively
      decrements base->clk by one.
      
      base->clk is the index into the timer wheel arrays. So let's assume the
      following state after the base->clk increment in run_timers():
      
       jiffies = 0
       base->clk = 1
      
      A timer gets enqueued with an expiry delta of 63 ticks (which is the case
      with the USB timeout and HZ=250) so the resulting bucket index is:
      
        base->clk + delta = 1 + 63 = 64
      
      The timer goes into the first wheel level. The array size is 64 so it ends
      up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
      to index into bucket 0 again.
      
      If the cpu goes idle before jiffies advance, then the bug in the forwarding
      mechanism sets base->clk back to 0, so the next invocation of run_timers()
      at the next tick will index into bucket 0 and therefore expire the timer 62
      ticks too early.
      
      Instead of blindly setting base->clk to jiffies we must make the forwarding
      conditional on jiffies > base->clk, but we cannot use jiffies for this as
      we might run into the following issue:
      
        if (time_after(jiffies, base->clk)) {
          if (time_after(nextevt, base->clk))
             base->clk = jiffies;
        }
      
      jiffies can increment between the check and the assignment far enough to
      advance beyond nextevt. So we need to use a stable value for checking.
      
      get_next_timer_interrupt() has the basej argument which is the jiffies
      value snapshot taken in the calling code. So we can just use that.
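
The arithmetic and the conditional forwarding described above can be checked
with a small model. This is a sketch under assumptions: it models only the
first wheel level with 64 buckets, the model_* names are hypothetical, and the
forwarding rule is one reading of "forward only when the jiffies snapshot is
ahead of base->clk, and never past the next event".

```c
#define MODEL_LVL_SIZE 64UL

#define model_time_after(a, b) ((long)((b) - (a)) < 0)

/* Bucket in the first wheel level for a timer queued 'delta' ticks ahead. */
static unsigned long model_bucket(unsigned long clk, unsigned long delta)
{
	return (clk + delta) & (MODEL_LVL_SIZE - 1);
}

/*
 * Forward the wheel clock using a stable jiffies snapshot (basej).
 * The clock never moves backwards and never moves past nextevt.
 */
static unsigned long model_forward_clk(unsigned long clk, unsigned long basej,
				       unsigned long nextevt)
{
	if (model_time_after(basej, clk)) {
		if (model_time_after(nextevt, basej))
			clk = basej;
		else if (model_time_after(nextevt, clk))
			clk = nextevt;
	}
	return clk;
}
```

With clk = 1 and a delta of 63 the timer lands in bucket 0, and a snapshot of
basej = 0 leaves clk at 1 instead of rewinding it to 0.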
      
      Thanks to Ashton for bisecting and providing trace data!
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Reported-by: Ashton Holmes <scoopta@gmail.com>
      Reported-by: Michael Thayer <michael.thayer@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Necasek <michal.necasek@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: knut.osmundsen@oracle.com
      Cc: stable@vger.kernel.org
      Cc: stern@rowland.harvard.edu
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      041ad7bc
    • T
      timers: Lock base for same bucket optimization · 4da9152a
      Thomas Gleixner committed
      Linus stumbled over the unlocked modification of the timer expiry value in
      mod_timer() which is an optimization for timers which stay in the same
      bucket - due to the bucket granularity - despite their expiry time getting
      updated.
      
      The optimization itself still makes sense even if we take the lock, because
      in case that the bucket stays the same, we avoid the pointless
      dequeue/enqueue dance.
      
      Make the check and the modification of timer->expires protected by the base
      lock and shuffle the remaining code around so we can keep the lock held
      when we actually have to requeue the timer to a different bucket.
      
      Fixes: f00c0afd ("timers: Implement optimization for same expiry time in mod_timer()")
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      4da9152a
    • T
      timers: Plug locking race vs. timer migration · b831275a
      Thomas Gleixner committed
      Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
      timer flags. As a consequence the compiler is allowed to reload the flags
      between the initial check for TIMER_MIGRATION and the following timer base
      computation and the spin lock of the base.
      
      While this has not been observed (yet), we need to make sure that it never
      happens.
      
      Fixes: 0eeda71b ("timer: Replace timer base by a cpu index")
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      b831275a
  18. 11 Oct, 2016 1 commit
    • E
      latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy committed
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provide some level of
      latent entropy.
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      0766f788
  19. 09 Aug, 2016 1 commit
    • C
      timers: Fix get_next_timer_interrupt() computation · 46c8f0b0
      Chris Metcalf committed
      The tick_nohz_stop_sched_tick() routine is not properly
      canceling the sched timer when nothing is pending, because
      get_next_timer_interrupt() is no longer returning KTIME_MAX in
      that case.  This causes periodic interrupts when none are needed.
      
      When determining the next interrupt time, we first use
      __next_timer_interrupt() to get the first expiring timer in the
      timer wheel.  If no timer is found, we return the base clock value
      plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
      timer wheel.
      
      Back in get_next_timer_interrupt(), we set the "expires" value
      by converting the timer wheel expiry (in ticks) to a nsec value.
      But we don't want to do this if the timer wheel expiry value
      indicates no timer; we want to return KTIME_MAX.
      
      Prior to commit 500462a9 ("timers: Switch to a non-cascading
      wheel") we checked base->active_timers to see if any timers
      were active, and if not, we didn't touch the expiry value and so
      properly returned KTIME_MAX.  Now we don't have active_timers.
      
      To fix this, we now just check the timer wheel expiry value to
      see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
      try to compute a new value based on it, but instead simply let the
      KTIME_MAX value in expires remain.
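
The check described above can be sketched as follows. It is a model under
assumptions, not the kernel function: the MODEL_* constants (including a tick
length of 4,000,000 ns, i.e. HZ=250) and model_next_timer_ns() are chosen for
illustration only.

```c
#include <stdint.h>

#define MODEL_KTIME_MAX            INT64_MAX
#define MODEL_NEXT_TIMER_MAX_DELTA ((1UL << 30) - 1)
#define MODEL_TICK_NSEC            4000000ULL	/* assumed HZ=250 */

/*
 * Convert the wheel's next expiry (in ticks) to nanoseconds, unless the
 * wheel reported "no timer", in which case KTIME_MAX is left in place.
 * basej/basem are the tick and nanosecond snapshots of "now".
 */
static uint64_t model_next_timer_ns(unsigned long base_clk, unsigned long nextevt,
				    unsigned long basej, uint64_t basem)
{
	uint64_t expires = MODEL_KTIME_MAX;

	if (nextevt != base_clk + MODEL_NEXT_TIMER_MAX_DELTA)
		expires = basem + (uint64_t)(nextevt - basej) * MODEL_TICK_NSEC;
	return expires;
}
```

Returning the sentinel unchanged is what allows the caller to cancel the
periodic tick when nothing is pending.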
      
      Fixes: 500462a9 ("timers: Switch to a non-cascading wheel")
      Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      46c8f0b0
  20. 15 Jul, 2016 1 commit
  21. 07 Jul, 2016 9 commits
    • A
      timers: Implement optimization for same expiry time in mod_timer() · f00c0afd
      Anna-Maria Gleixner committed
      The existing optimization for same expiry time in mod_timer() checks whether
      the timer expiry time is the same as the new requested expiry time. In the old
      timer wheel implementation this does not take the slack batching into account,
      and the new implementation does not evaluate whether the new expiry time will
      requeue the timer to the same bucket.
      
      To optimize that, we can calculate the resulting bucket and check if the new
      expiry time is different from the current expiry time. This calculation
      happens outside the base lock held region. If the resulting bucket is the same
      we can avoid taking the base lock and requeueing the timer.
      
      If the timer needs to be requeued then we have to check under the base lock
      whether the base time has changed between the lockless calculation and taking
      the lock. If it has changed we need to recalculate under the lock.
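
To see why many expiry updates land in the same bucket, consider a simplified
wheel-index model. This is a sketch under assumptions: the level geometry (64
buckets per level, granularity growing by a factor of 8 per level, 4 levels)
and all model_* names are illustrative, and already-expired timers are not
handled.

```c
#define MODEL_LVL_CLK_SHIFT 3
#define MODEL_LVL_BITS      6
#define MODEL_LVL_SIZE      (1UL << MODEL_LVL_BITS)
#define MODEL_LVL_MASK      (MODEL_LVL_SIZE - 1)
#define MODEL_LVL_DEPTH     4	/* number of levels in this sketch */

/* Map an expiry to a global bucket index, given the current wheel clock. */
static unsigned model_calc_wheel_index(unsigned long expires, unsigned long clk)
{
	unsigned long delta = expires - clk;
	unsigned lvl = 0;

	/* Level n covers deltas below 63 << (n * 3) ticks. */
	while (lvl < MODEL_LVL_DEPTH - 1 &&
	       delta >= (MODEL_LVL_MASK << (lvl * MODEL_LVL_CLK_SHIFT)))
		lvl++;

	unsigned shift = lvl * MODEL_LVL_CLK_SHIFT;
	/* Round up by one granule so the timer never fires early. */
	unsigned long slot = (expires + (1UL << shift)) >> shift;

	return lvl * MODEL_LVL_SIZE + (slot & MODEL_LVL_MASK);
}
```

In level 1 each bucket spans 8 ticks, so moving an expiry from 100 to 101
keeps the same index and a requeue could be skipped, while moving it to 104
changes the index.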
      
      This optimization takes effect for timers which are enqueued into the less
      granular wheel levels (1 and above). With a simple test case the functionality
      has been verified:
      
                  Before        After
       Match:       5.5%        86.6%
       Requeue:    94.5%        13.4%
       Recalc:                  <0.01%
      
      In the non optimized case the timer is requeued in 94.5% of the cases. With
      the index optimization in place the requeue rate drops to 13.4%. The case
      where the lockless index calculation has to be redone is less than 0.01%.
      
      With a real world test case (networking) we observed the following changes:
      
                  Before        After
       Match:      97.8%        99.7%
       Requeue:     2.2%         0.3%
       Recalc:                  <0.001%
      
      That means two percent fewer lock/requeue/unlock operations done in one of
      the hot path use cases of timers.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f00c0afd
    • A
      timers: Split out index calculation · ffdf0477
      Anna-Maria Gleixner committed
      For further optimizations we need to separate index calculation
      from queueing. No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ffdf0477
    • T
      timers: Only wake softirq if necessary · 4e85876a
      Thomas Gleixner committed
      With the wheel forwarding in place and with the HZ=1000 4ms folding we can
      avoid running the softirq at all.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4e85876a
    • T
      timers: Forward the wheel clock whenever possible · a683f390
      Thomas Gleixner committed
      The wheel clock is stale when a CPU goes into a long idle sleep. This has the
      side effect that timers which are queued end up in the outer wheel levels.
      That results in coarser granularity.
      
      To solve this, we keep track of the idle state and forward the wheel clock
      whenever possible.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a683f390
    • A
      timers: Optimize collect_expired_timers() for NOHZ · 23696838
      Anna-Maria Gleixner committed
      After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
      There might be expired timers so the current code loops and checks the expired
      buckets for timers. This can take quite some time for long NOHZ idle periods.
      
      The pending bitmask in the timer base allows us to do a quick search for the
      next expiring timer and therefore a fast forward of the base time which
      prevents pointless long lasting loops.
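
The bitmap search can be modelled for a single 64-bucket level. This is a
sketch under assumptions: model_next_pending_bucket() is a hypothetical name,
the pending map is a single 64-bit word, and the portable bit loop stands in
for the word-wise bit search a real implementation would typically use.

```c
#include <stdint.h>

#define MODEL_LVL_SIZE 64

/*
 * Return how many buckets ahead of 'clk' the next pending bucket is,
 * wrapping around the level, or -1 if no bucket in the level is pending.
 */
static int model_next_pending_bucket(uint64_t pending_map, unsigned clk)
{
	unsigned start = clk % MODEL_LVL_SIZE;

	for (unsigned step = 0; step < MODEL_LVL_SIZE; step++) {
		unsigned pos = (start + step) % MODEL_LVL_SIZE;

		if (pending_map & (UINT64_C(1) << pos))
			return (int)step;
	}
	return -1;
}
```

Knowing the distance to the next pending bucket lets the base clock jump
straight there instead of visiting every empty bucket in between.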
      
      For a 3 seconds idle sleep this reduces the catchup time from ~1ms to 5us.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23696838
    • A
      timers: Move __run_timers() function · 73420fea
      Anna-Maria Gleixner committed
      Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
      in preparation for __run_timers() NOHZ optimization.
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      73420fea
    • T
      timers: Remove set_timer_slack() leftovers · 53bf837b
      Thomas Gleixner committed
      We now have implicit batching in the timer wheel. The slack API is no longer
      used, so remove it.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrew F. Davis <afd@ti.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jaehoon Chung <jh80.chung@samsung.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathias Nyman <mathias.nyman@intel.com>
      Cc: Pali Rohár <pali.rohar@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sebastian Reichel <sre@kernel.org>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mmc@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: linux-usb@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      53bf837b
    • T
      timers: Switch to a non-cascading wheel · 500462a9
      Thomas Gleixner committed
      The current timer wheel has some drawbacks:
      
      1) Cascading:
      
         Cascading can be an unbounded operation and is completely pointless in most
         cases because the vast majority of the timer wheel timers are canceled or
         rearmed before expiration. (They are used as timeout safeguards, not as
         real timers to measure time.)
      
      2) No fast lookup of the next expiring timer:
      
         In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
         must fast forward the base time to the current value of jiffies. As we
         have no fast way to find the next expiring timer, the code loops linearly,
         incrementing the base time one by one and checking for expired timers
         in each step. This causes unbounded overhead spikes at exactly the moment
         when we should wake up as fast as possible.
      
      After a thorough analysis of real world data gathered on laptops,
      workstations, webservers and other machines (thanks Chris!) I came to the
      conclusion that the current 'classic' timer wheel implementation can be
      modified to address the above issues.
      
      The vast majority of timer wheel timers are canceled or rearmed before
      expiry. Most of them are timeouts for networking and other I/O tasks. The
      nature of timeouts is to catch the exception from normal operation (TCP ack
      timed out, disk does not respond, etc.). For these kinds of timeouts the
      accuracy of the timeout is not really a concern. Timeouts are very often
      approximate worst-case values, and if the timeout fires, we have already
      waited for a long time and performance is already down the drain.
      
      The few timers which actually expire can be split into two categories:
      
       1) Short expiry times, which expect reasonably accurate expiry
      
       2) Long-term expiry times, which are already inaccurate today due to the
          batching which is done for NOHZ automatically and also via the
          set_timer_slack() API.
      
      So for long-term expiry timers we can avoid the cascading property and just
      leave them in the less granular outer wheels until expiry or
      cancelation. Timers which are armed with a timeout larger than the wheel
      capacity are no longer cascaded. We expire them with the longest possible
      timeout (6+ days). We have not observed such timeouts in our data collection,
      but at least we handle them, applying the rule of least surprise.
      
      To accommodate the longest observed timeouts (5 days in the network conntrack
      code) without extending the wheel levels for HZ=1000, we reduce the first
      level granularity on HZ=1000 to 4ms, which is effectively the same as the
      HZ=250 behaviour. From our data analysis there is nothing which relies on
      that 1ms granularity and as a side effect we get better batching and timer
      locality for the networking code as well.
      
      In contrast to the classic wheel, the granularity of the next wheel is not
      the capacity of the first wheel. In the currently chosen setting, the
      granularity of each wheel is 8 times the granularity of the previous wheel.
      
      So for HZ=250 we end up with the following granularity levels:
      
       Level Offset   Granularity                  Range
           0      0          4 ms                 0 ms -        252 ms
           1     64         32 ms               256 ms -       2044 ms (256ms - ~2s)
           2    128        256 ms              2048 ms -      16380 ms (~2s   - ~16s)
           3    192       2048 ms (~2s)       16384 ms -     131068 ms (~16s  - ~2m)
           4    256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m   - ~17m)
           5    320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m  - ~2h)
           6    384    1048576 ms (~17m)    8388608 ms -   67108863 ms (~2h   - ~18h)
           7    448    8388608 ms (~2h)    67108864 ms -  536870911 ms (~18h  - ~6d)
      
      That's a worst case inaccuracy of 12.5% for the timers which are queued at the
      beginning of a level.
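The table follows from three parameters that can be read off it: a 4 ms tick at HZ=250, 64 buckets per level (the offset grows by 64) and a factor of 8 between levels. The following is a minimal sketch, not the kernel's code, that recomputes the granularity and the start of the range for each level, as well as the 12.5% worst-case figure. All macro and function names are illustrative assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed parameters, read off the table above (names are illustrative). */
#define TICK_MS		4	/* HZ=250: one jiffy is 4 ms */
#define LEVELS		8	/* levels 0..7 */
#define BUCKETS		64	/* the offset grows by 64 per level */
#define GRAN_SHIFT	3	/* each level is 8x coarser than the previous */

/* Granularity of a level in milliseconds: 4 ms * 8^level. */
unsigned long long level_gran_ms(unsigned int level)
{
	assert(level < LEVELS);
	return (unsigned long long)TICK_MS << (GRAN_SHIFT * level);
}

/* Shortest timeout in milliseconds that is queued in this level. */
unsigned long long level_start_ms(unsigned int level)
{
	/* Level 0 starts at 0; level n starts where the 64 buckets of n-1 end. */
	return level ? BUCKETS * level_gran_ms(level - 1) : 0;
}

/* Worst-case inaccuracy in percent for a timer at the start of a level. */
double level_worst_case_pct(unsigned int level)
{
	assert(level > 0);
	return 100.0 * (double)level_gran_ms(level) /
	       (double)level_start_ms(level);
}

/* Print the Level, Offset, Granularity and range start columns. */
void print_levels(void)
{
	for (unsigned int l = 0; l < LEVELS; l++)
		printf("%5u %6u %10llu ms %12llu ms\n", l, l * BUCKETS,
		       level_gran_ms(l), level_start_ms(l));
}
```

Calling `print_levels()` from a `main()` reproduces the first four columns of the table. `level_worst_case_pct()` evaluates to 12.5 for every level from 1 to 7, because each level starts at 64 / 8 = 8 times its own granularity.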
      
      So the new wheel concept addresses the old issues:
      
      1) Cascading is avoided completely
      
      2) By keeping the timers in the bucket until expiry/cancelation we can track
         the buckets which have timers enqueued in a bucket bitmap and therefore can
         look up the next expiring timer very fast, in O(1).
      
      A further benefit of the concept is that the slack calculation which is done
      on every timer start is no longer necessary because the granularity levels
      provide natural batching already.
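To see why the levels batch timers without any explicit slack, consider how a timer could be mapped to a bucket. The sketch below is an illustration under stated assumptions, not the code from this patch: it assumes 8 levels of 64 buckets, a factor of 8 between levels, and an expiry that is rounded up to the granularity of the chosen level so that a timer never fires early. The names `wheel_level()` and `wheel_bucket()` are hypothetical, and timeouts beyond the wheel capacity are simply placed in the last level.

```c
#include <assert.h>

/* Assumed wheel geometry (illustrative names). */
#define CLK_SHIFT	3	/* each level is 8x coarser */
#define LEVEL_SIZE	64	/* buckets per level */
#define LEVEL_DEPTH	8	/* number of levels */

/* Pick the level from the distance to expiry, in ticks. */
unsigned int wheel_level(unsigned long delta)
{
	unsigned int lvl = 0;

	/* Level n holds deltas below 64 * 8^n ticks; the last level takes the rest. */
	while (lvl < LEVEL_DEPTH - 1 &&
	       (delta >> (CLK_SHIFT * lvl)) >= LEVEL_SIZE)
		lvl++;
	return lvl;
}

/* Map an absolute expiry to a bucket index in the range 0..511. */
unsigned int wheel_bucket(unsigned long expires, unsigned long clk)
{
	unsigned int lvl, shift;
	unsigned long slot;

	assert(expires >= clk);
	lvl = wheel_level(expires - clk);
	shift = CLK_SHIFT * lvl;
	/* Round up to the level granularity so the timer never fires early. */
	slot = (expires + (1UL << shift)) >> shift;
	return lvl * LEVEL_SIZE + (slot & (LEVEL_SIZE - 1));
}
```

With `clk` at 0, expiries of 100 and 103 ticks both land in bucket 77 of level 1 and expire together, while 104 ticks moves to bucket 78. The batching falls out of the coarser granularity of the outer levels.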
      
      Our extensive testing with various loads did not show any performance
      degradation vs. the current wheel implementation.
      
      This patch does not address the 'fast lookup' issue as we wanted to make sure
      that there is no regression introduced by the wheel redesign. The
      optimizations are in follow up patches.
      
      This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      500462a9
    • T
      timers: Give a few structs and members proper names · 494af3ed
      Thomas Gleixner committed
      Some of the names in the internal implementation of the timer code
      are no longer correct and others are simply too long to type.
      
      Clean it up before we switch the wheel implementation over to
      the new scheme.
      
      No functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      494af3ed