1. 26 Oct, 2016 · 1 commit
    • timers: Fix usleep_range() in the context of wake_up_process() · 6c5e9059
      Committed by Douglas Anderson
      Users of usleep_range() expect that it will _never_ return in less time
      than the minimum passed parameter. However, nothing in the code ensures
      this, when the sleeping task is woken by wake_up_process() or any other
      mechanism which can wake a task from uninterruptible state.
      
      Neither usleep_range() nor schedule_hrtimeout_range*() has any protection
      against wakeups. schedule_hrtimeout_range*() is designed this way despite
      the fact that the API documentation does not mention it.
      
      msleep() already handles this case, since it loops as long as there is
      still time left. usleep_range() has no such loop, so add one.
      
      Presumably this problem was not detected before because usleep_range() is
      only used in a few places and the function is mostly used in contexts which
      are not exposed to wakeups of any form.
      
      An effort was made to look for users relying on the old behavior by
      looking for usleep_range() in the same file as wake_up_process().
      No problems were found by this search, though it is conceivable that
      someone could have put the sleep and wakeup in two different files.
      
      An effort was made to ask several upstream maintainers if they were aware
      of people relying on wake_up_process() to wake up usleep_range(). No
      maintainers were aware of that but they were aware of many people relying
      on usleep_range() never returning before the minimum.
      Reported-by: Tao Huang <huangtao@rock-chips.com>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Cc: heiko@sntech.de
      Cc: broonie@kernel.org
      Cc: briannorris@chromium.org
      Cc: Andreas Mohr <andi@lisas.de>
      Cc: linux-rockchip@lists.infradead.org
      Cc: tony.xie@rock-chips.com
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: djkurtz@chromium.org
      Cc: linux@roeck-us.net
      Cc: tskd08@gmail.com
      Link: http://lkml.kernel.org/r/1477065531-30342-1-git-send-email-dianders@chromium.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
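The retry loop described in the usleep_range() commit above can be sketched in user-space C. This is a minimal model under stated assumptions, not the kernel code: `sleep_until_fn`, `sleep_at_least()` and `woken_every_30us()` are hypothetical names, and time is a plain microsecond counter rather than a real clock.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sleep primitive for this sketch: blocks until the deadline
 * or until something wakes the caller early, and returns the time at
 * which the caller resumed.
 */
typedef uint64_t (*sleep_until_fn)(uint64_t now_us, uint64_t deadline_us);

/* Keep sleeping until at least min_us has elapsed since start_us. */
uint64_t sleep_at_least(uint64_t start_us, uint64_t min_us,
                        sleep_until_fn sleep_until)
{
    /* Absolute deadline, so early wakeups do not extend the total sleep. */
    uint64_t deadline_us = start_us + min_us;
    uint64_t now_us = start_us;

    assert(sleep_until != 0);
    while (now_us < deadline_us)
        now_us = sleep_until(now_us, deadline_us);
    return now_us;
}

/* Test double: an unrelated wakeup hits the sleeper every 30 us. */
uint64_t woken_every_30us(uint64_t now_us, uint64_t deadline_us)
{
    uint64_t wake_us = now_us + 30;

    return wake_us < deadline_us ? wake_us : deadline_us;
}
```

With the early-waking double, `sleep_at_least(0, 100, woken_every_30us)` goes back to sleep three times and only returns once the 100 us minimum has passed.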
  2. 11 Oct, 2016 · 1 commit
    • latent_entropy: Mark functions with __latent_entropy · 0766f788
      Committed by Emese Revfy
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provides some level of
      latent entropy.
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: Kees Cook <keescook@chromium.org>
  3. 09 Aug, 2016 · 1 commit
    • timers: Fix get_next_timer_interrupt() computation · 46c8f0b0
      Committed by Chris Metcalf
      The tick_nohz_stop_sched_tick() routine is not properly
      canceling the sched timer when nothing is pending, because
      get_next_timer_interrupt() is no longer returning KTIME_MAX in
      that case.  This causes periodic interrupts when none are needed.
      
      When determining the next interrupt time, we first use
      __next_timer_interrupt() to get the first expiring timer in the
      timer wheel.  If no timer is found, we return the base clock value
      plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
      timer wheel.
      
      Back in get_next_timer_interrupt(), we set the "expires" value
      by converting the timer wheel expiry (in ticks) to a nsec value.
      But we don't want to do this if the timer wheel expiry value
      indicates no timer; we want to return KTIME_MAX.
      
      Prior to commit 500462a9 ("timers: Switch to a non-cascading
      wheel") we checked base->active_timers to see if any timers
      were active, and if not, we didn't touch the expiry value and so
      properly returned KTIME_MAX.  Now we don't have active_timers.
      
      To fix this, we now just check the timer wheel expiry value to
      see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
      try to compute a new value based on it, but instead simply let the
      KTIME_MAX value in expires remain.
      
      Fixes: 500462a9 ("timers: Switch to a non-cascading wheel")
      Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
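The check described in the get_next_timer_interrupt() commit above can be illustrated with a small user-space model. It is a sketch only: `next_event_ns()` is a hypothetical helper, the constant values are assumptions chosen for the example (including a 4 ms tick, i.e. HZ=250), and the wheel expiry is assumed never to lie before the base clock.

```c
#include <assert.h>
#include <stdint.h>

#define KTIME_MAX_NS          ((uint64_t)INT64_MAX)    /* "no event pending" marker */
#define NEXT_TIMER_MAX_DELTA  ((UINT64_C(1) << 30) - 1) /* assumed value */
#define TICK_NSEC_ASSUMED     UINT64_C(4000000)         /* assumes HZ=250 */

/*
 * base_clk:     wheel clock in ticks
 * now_ns:       current time in nanoseconds
 * wheel_expiry: first expiring wheel timer in ticks, or
 *               base_clk + NEXT_TIMER_MAX_DELTA when the wheel is empty
 */
uint64_t next_event_ns(uint64_t base_clk, uint64_t now_ns,
                       uint64_t wheel_expiry)
{
    uint64_t expires = KTIME_MAX_NS;

    assert(wheel_expiry >= base_clk);

    /* Only convert ticks to nanoseconds when a timer is really queued. */
    if (wheel_expiry != base_clk + NEXT_TIMER_MAX_DELTA)
        expires = now_ns + (wheel_expiry - base_clk) * TICK_NSEC_ASSUMED;

    return expires;
}
```

An empty wheel therefore reports `KTIME_MAX_NS` unchanged, which is what allows the caller to cancel the periodic tick.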
  4. 15 Jul, 2016 · 1 commit
  5. 07 Jul, 2016 · 11 commits
    • timers: Implement optimization for same expiry time in mod_timer() · f00c0afd
      Committed by Anna-Maria Gleixner
      The existing optimization for same expiry time in mod_timer() checks whether
      the timer expiry time is the same as the new requested expiry time. In the old
      timer wheel implementation this does not take the slack batching into account,
      neither does the new implementation evaluate whether the new expiry time will
      requeue the timer to the same bucket.
      
      To optimize that, we can calculate the resulting bucket and check if the new
      expiry time is different from the current expiry time. This calculation
      happens outside the base lock held region. If the resulting bucket is the same
      we can avoid taking the base lock and requeueing the timer.
      
      If the timer needs to be requeued then we have to check under the base lock
      whether the base time has changed between the lockless calculation and taking
      the lock. If it has changed we need to recalculate under the lock.
      
      This optimization takes effect for timers which are enqueued into the less
      granular wheel levels (1 and above). With a simple test case the functionality
      has been verified:
      
                  Before        After
       Match:       5.5%        86.6%
       Requeue:    94.5%        13.4%
       Recalc:                  <0.01%
      
      In the non optimized case the timer is requeued in 94.5% of the cases. With
      the index optimization in place the requeue rate drops to 13.4%. The case
      where the lockless index calculation has to be redone is less than 0.01%.
      
      With a real world test case (networking) we observed the following changes:
      
                  Before        After
       Match:      97.8%        99.7%
       Requeue:     2.2%         0.3%
       Recalc:                  <0.001%
      
      That means two percent fewer lock/requeue/unlock operations done in one of
      the hot path use cases of timers.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
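The same-bucket test from the mod_timer() commit above can be sketched as follows. This is a simplified user-space model, not the kernel implementation: the constants (8 levels of 64 buckets, each level 8 times coarser) follow the wheel description in this log, while `wheel_index()`, `needs_requeue()` and `lvl_start()` are hypothetical names. It ignores timers beyond the wheel capacity and clock wraparound.

```c
#include <assert.h>
#include <stdbool.h>

#define LVL_CLK_SHIFT 3u                /* each level is 8x coarser */
#define LVL_BITS      6u
#define LVL_SIZE      (1u << LVL_BITS)  /* 64 buckets per level */
#define LVL_MASK      (LVL_SIZE - 1u)
#define LVL_DEPTH     8u

/* Smallest delta (in ticks) that is queued into level lvl, for lvl >= 1. */
static unsigned long lvl_start(unsigned lvl)
{
    return (unsigned long)(LVL_SIZE - 1u) << ((lvl - 1u) * LVL_CLK_SHIFT);
}

/* Bucket index for a timer expiring at `expires` with wheel clock `clk`. */
unsigned wheel_index(unsigned long expires, unsigned long clk)
{
    unsigned long delta, slot;
    unsigned lvl = 0, shift;

    assert(expires >= clk);
    delta = expires - clk;

    while (lvl + 1u < LVL_DEPTH && delta >= lvl_start(lvl + 1u))
        lvl++;

    shift = lvl * LVL_CLK_SHIFT;
    /* Round up to the next bucket so the timer never fires early. */
    slot = (expires + (1UL << shift)) >> shift;
    return lvl * LVL_SIZE + (unsigned)(slot & LVL_MASK);
}

/* True when the new expiry lands in a different bucket than cur_idx. */
bool needs_requeue(unsigned cur_idx, unsigned long new_expires,
                   unsigned long clk)
{
    return wheel_index(new_expires, clk) != cur_idx;
}
```

In this model, expiries of 1000 and 1001 ticks share a level-2 bucket, so moving a timer between them needs no requeue, while 10 and 11 ticks sit in different level-0 buckets. The recheck under the base lock that the commit message describes is left out of the sketch.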
    • timers: Split out index calculation · ffdf0477
      Committed by Anna-Maria Gleixner
      For further optimizations we need to separate index calculation
      from queueing. No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers: Only wake softirq if necessary · 4e85876a
      Committed by Thomas Gleixner
      With the wheel forwarding in place and with the HZ=1000 4ms folding we can
      avoid running the softirq at all.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers: Forward the wheel clock whenever possible · a683f390
      Committed by Thomas Gleixner
      The wheel clock is stale when a CPU goes into a long idle sleep. This has the
      side effect that timers which are queued end up in the outer wheel levels.
      That results in coarser granularity.
      
      To solve this, we keep track of the idle state and forward the wheel clock
      whenever possible.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers: Optimize collect_expired_timers() for NOHZ · 23696838
      Committed by Anna-Maria Gleixner
      After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
      There might be expired timers so the current code loops and checks the expired
      buckets for timers. This can take quite some time for long NOHZ idle periods.
      
      The pending bitmask in the timer base allows us to do a quick search for the
      next expiring timer and therefore a fast forward of the base time which
      prevents pointless long lasting loops.
      
      For a 3 second idle sleep this reduces the catchup time from ~1ms to 5us.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
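The pending-bitmask lookup that the collect_expired_timers() commit above relies on can be modelled for a single 64-bucket level. This is a sketch under assumptions: `next_pending_distance()` is a hypothetical name, the bitmap is one `uint64_t` instead of the kernel's multi-level map, and a bounded scan stands in for a dedicated bit-search primitive.

```c
#include <stdint.h>

/*
 * Distance in buckets from position `clk` to the next bucket with a
 * pending timer, wrapping around the 64-bucket level.
 * Returns -1 when no bucket is pending.
 */
int next_pending_distance(uint64_t pending, unsigned clk)
{
    unsigned start = clk & 63u;
    /* Rotate so that bit 0 corresponds to the current bucket. */
    uint64_t rotated = start
        ? (pending >> start) | (pending << (64u - start))
        : pending;
    int i;

    for (i = 0; i < 64; i++) {
        if (rotated & (UINT64_C(1) << i))
            return i;
    }
    return -1;
}
```

The cost is bounded by the bitmap width rather than by the length of the idle period, which is the point of the optimization: the base time can jump straight to the next pending bucket instead of stepping one tick at a time.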
    • timers: Move __run_timers() function · 73420fea
      Committed by Anna-Maria Gleixner
      Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
      in preparation for __run_timers() NOHZ optimization.
      
      No functional change.
      Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers: Remove set_timer_slack() leftovers · 53bf837b
      Committed by Thomas Gleixner
      We now have implicit batching in the timer wheel. The slack API is no longer
      used, so remove it.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrew F. Davis <afd@ti.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jaehoon Chung <jh80.chung@samsung.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathias Nyman <mathias.nyman@intel.com>
      Cc: Pali Rohár <pali.rohar@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sebastian Reichel <sre@kernel.org>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mmc@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: linux-usb@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers: Switch to a non-cascading wheel · 500462a9
      Committed by Thomas Gleixner
      The current timer wheel has some drawbacks:
      
      1) Cascading:
      
         Cascading can be an unbound operation and is completely pointless in most
         cases because the vast majority of the timer wheel timers are canceled or
         rearmed before expiration. (They are used as timeout safeguards, not as
         real timers to measure time.)
      
      2) No fast lookup of the next expiring timer:
      
         In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
         must fast forward the base time to the current value of jiffies. As we
         have no way to find the next expiring timer fast, the code loops linearly
         and increments the base time one by one and checks for expired timers
         in each step. This causes unbound overhead spikes exactly in the moment
         when we should wake up as fast as possible.
      
      After a thorough analysis of real world data gathered on laptops,
      workstations, webservers and other machines (thanks Chris!) I came to the
      conclusion that the current 'classic' timer wheel implementation can be
      modified to address the above issues.
      
      The vast majority of timer wheel timers is canceled or rearmed before
      expiry. Most of them are timeouts for networking and other I/O tasks. The
      nature of timeouts is to catch the exception from normal operation (TCP ack
      timed out, disk does not respond, etc.). For these kinds of timeouts the
      accuracy of the timeout is not really a concern. Timeouts are very often
      approximate worst-case values and in case the timeout fires, we already
      waited for a long time and performance is down the drain already.
      
      The few timers which actually expire can be split into two categories:
      
       1) Short expiry times which expect reasonably accurate expiry
      
       2) Long term expiry times are inaccurate today already due to the
          batching which is done for NOHZ automatically and also via the
          set_timer_slack() API.
      
      So for long term expiry timers we can avoid the cascading property and just
      leave them in the less granular outer wheels until expiry or
      cancelation. Timers which are armed with a timeout larger than the wheel
      capacity are no longer cascaded. We expire them with the longest possible
      timeout (6+ days). We have not observed such timeouts in our data collection,
      but at least we handle them, applying the rule of the least surprise.
      
      To avoid extending the wheel levels for HZ=1000 so we can accommodate the
      longest observed timeouts (5 days in the network conntrack code) we reduce the
      first level granularity on HZ=1000 to 4ms, which effectively is the same as
      the HZ=250 behaviour. From our data analysis there is nothing which relies on
      that 1ms granularity and as a side effect we get better batching and timer
      locality for the networking code as well.
      
      Contrary to the classic wheel the granularity of the next wheel is not the
      capacity of the first wheel. The granularities of the wheels are in the
      currently chosen setting 8 times the granularity of the previous wheel.
      
      So for HZ=250 we end up with the following granularity levels:
      
       Level Offset   Granularity                  Range
           0      0          4 ms                 0 ms -        252 ms
           1     64         32 ms               256 ms -       2044 ms (256ms - ~2s)
           2    128        256 ms              2048 ms -      16380 ms (~2s   - ~16s)
           3    192       2048 ms (~2s)       16384 ms -     131068 ms (~16s  - ~2m)
           4    256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m   - ~17m)
           5    320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m  - ~2h)
           6    384    1048576 ms (~17m)    8388608 ms -   67108863 ms (~2h   - ~18h)
           7    448    8388608 ms (~2h)    67108864 ms -  536870911 ms (~18h  - ~6d)
      
      That's a worst case inaccuracy of 12.5% for the timers which are queued at the
      beginning of a level.
      
      So the new wheel concept addresses the old issues:
      
      1) Cascading is avoided completely
      
      2) By keeping the timers in the bucket until expiry/cancelation we can track
         the buckets which have timers enqueued in a bucket bitmap and therefore can
         look up the next expiring timer very fast and O(1).
      
      A further benefit of the concept is that the slack calculation which is done
      on every timer start is no longer necessary because the granularity levels
      provide natural batching already.
      
      Our extensive testing with various loads did not show any performance
      degradation vs. the current wheel implementation.
      
      This patch does not address the 'fast lookup' issue as we wanted to make sure
      that there is no regression introduced by the wheel redesign. The
      optimizations are in follow up patches.
      
      This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
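The HZ=250 granularity table in the non-cascading wheel commit above follows from two numbers in the message: a 4 ms first-level granularity and a factor of 8 between levels. The sketch below recomputes the "Granularity" column and the start of each "Range". The function names are hypothetical, and the 64-buckets-per-level figure is inferred from the "Offset" column.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_GRAN_MS   4u   /* level-0 granularity at HZ=250 */
#define LVL_CLK_SHIFT  3u   /* each level is 8x coarser than the previous */
#define LVL_SIZE       64u  /* buckets per level, from the Offset column */
#define LVL_DEPTH      8u

/* Granularity of a wheel level in milliseconds. */
uint64_t level_granularity_ms(unsigned level)
{
    assert(level < LVL_DEPTH);
    return (uint64_t)BASE_GRAN_MS << (LVL_CLK_SHIFT * level);
}

/* First timeout (in ms) that is queued into the given level. */
uint64_t level_range_start_ms(unsigned level)
{
    assert(level < LVL_DEPTH);
    if (level == 0)
        return 0;
    return (uint64_t)(LVL_SIZE * BASE_GRAN_MS)
           << (LVL_CLK_SHIFT * (level - 1u));
}

/* Worst-case relative inaccuracy at the start of a level (level >= 1). */
double level_worst_case_inaccuracy(unsigned level)
{
    assert(level >= 1 && level < LVL_DEPTH);
    return (double)level_granularity_ms(level)
           / (double)level_range_start_ms(level);
}
```

For every level from 1 upward the ratio is 32/256 = 0.125, which is the 12.5% worst case quoted for timers queued at the beginning of a level.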
    • timers: Give a few structs and members proper names · 494af3ed
      Committed by Thomas Gleixner
      Some of the names in the internal implementation of the timer code
      are no longer correct and others are simply too long to type.
      
      Clean it up before we switch the wheel implementation over to
      the new scheme.
      
      No functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers: Remove the deprecated mod_timer_pinned() API · 177ec0a0
      Committed by Thomas Gleixner
      We switched all users to initialize the timers as pinned and call
      mod_timer(). Remove the now unused timer API function.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers: Make 'pinned' a timer property · e675447b
      Committed by Thomas Gleixner
      We want to move the timer migration logic from a 'push' to a 'pull' model.
      
      Under the current 'push' model pinned timers are handled via
      a runtime API variant: mod_timer_pinned().
      
      The 'pull' model requires us to store the pinned attribute of a timer
      in the timer_list structure itself, as a new TIMER_PINNED bit in
      timer->flags.
      
      This flag must be set at initialization time and the timer APIs
      recognize the flag.
      
      This patch:
      
       - implements the new flag and associated new-style initialization
         methods,
      
       - makes mod_timer() recognize new-style pinned timers,
      
       - and adds some migration helper facility to allow
         step by step conversion of old-style to new-style
         pinned timers.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
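The idea of the 'pinned' commit above, storing the attribute in the timer at initialization instead of passing it through a separate API call, can be sketched as below. Everything here is hypothetical and for illustration only: the struct, the function names and the bit position of `SKETCH_TIMER_PINNED` are assumptions, not the kernel's definitions.

```c
#include <stdint.h>

#define SKETCH_TIMER_PINNED 0x00100000u /* assumed bit position */

struct sketch_timer {
    uint32_t flags; /* set once, at initialization time */
};

/* New-style initialization: the pinned property is part of the flags. */
struct sketch_timer sketch_timer_make(uint32_t flags)
{
    struct sketch_timer t = { flags };

    return t;
}

/*
 * CPU selection when (re)arming: a pinned timer stays on the arming CPU,
 * any other timer may be placed on the CPU preferred by migration logic.
 */
unsigned sketch_target_cpu(struct sketch_timer t, unsigned this_cpu,
                           unsigned preferred_cpu)
{
    return (t.flags & SKETCH_TIMER_PINNED) ? this_cpu : preferred_cpu;
}
```

Because the decision depends only on the stored flag, a single arming function can serve both kinds of timer, which is what makes a separate pinned variant of the arming call unnecessary.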
  6. 10 Jun, 2016 · 1 commit
  7. 20 May, 2016 · 2 commits
  8. 26 Mar, 2016 · 1 commit
  9. 18 Mar, 2016 · 1 commit
    • timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      Committed by John Stultz
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently an unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, it's much, much larger (~500
      years).
      
      This disparity could make application development a little confusing,
      so this patch converts the timer_slack_ns value (as well as the
      default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specifies
      the interface as an unsigned long, so we preserve that limitation on
      32bit systems, where SET_TIMERSLACK can only set the slack to an unsigned
      long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
      actually larger than what can be stored in an unsigned long.
      
      This patch also modifies hrtimer functions which specified the slack
      delta as an unsigned long.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
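The range arithmetic and the ULONG_MAX clamping described in the timer_slack_ns commit above can be checked with a short sketch. It models a 32-bit `unsigned long` as `uint32_t`; `get_timerslack_32bit()` and `max_slack_seconds()` are hypothetical helper names, not kernel or libc functions.

```c
#include <stdint.h>

/*
 * Value a 32-bit caller of the unsigned-long based interface would read
 * in this model: the internal 64-bit slack, clamped to what fits.
 */
uint32_t get_timerslack_32bit(uint64_t slack_ns)
{
    return slack_ns > UINT32_MAX ? UINT32_MAX : (uint32_t)slack_ns;
}

/* Largest slack, in whole seconds, for a given maximum nanosecond value. */
uint64_t max_slack_seconds(uint64_t max_ns)
{
    return max_ns / UINT64_C(1000000000);
}
```

`max_slack_seconds(UINT32_MAX)` is 4, matching the "just over 4 seconds" limit, and `max_slack_seconds(UINT64_MAX)` corresponds to roughly 584 years of 365 days, the same order of magnitude as the "~500 years" in the message. A 10 second slack set through the 64-bit interface reads back as `UINT32_MAX` through the clamped one.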
  10. 05 Nov, 2015 · 1 commit
    • timers: Use proper base migration in add_timer_on() · 22b886dd
      Committed by Tejun Heo
      Regardless of the previous CPU a timer was on, add_timer_on()
      currently simply sets timer->flags to the new CPU.  As the caller must
      be seeing the timer as idle, this is locally fine, but the timer
      leaving the old base while unlocked can lead to race conditions as
      follows.
      
      Let's say timer was on cpu 0.
      
        cpu 0					cpu 1
        -----------------------------------------------------------------------------
        del_timer(timer) succeeds
      					del_timer(timer)
      					  lock_timer_base(timer) locks cpu_0_base
        add_timer_on(timer, 1)
          spin_lock(&cpu_1_base->lock)
          timer->flags set to cpu_1_base
          operates on @timer			  operates on @timer
      
      This was triggered by mod_delayed_work_on(), which contains an
      "if (del_timer()) add_timer_on()" sequence, eventually leading to the
      following oops.
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
        ...
        Workqueue: wqthrash wqthrash_workfunc [wqthrash]
        task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
        RIP: 0010:[<ffffffff810ca6e9>]  [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
        ...
        Call Trace:
         [<ffffffff810cb0b4>] del_timer+0x44/0x60
         [<ffffffff8106e836>] try_to_grab_pending+0xb6/0x160
         [<ffffffff8106e913>] mod_delayed_work_on+0x33/0x80
         [<ffffffffa0000081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
         [<ffffffff8106dba8>] process_one_work+0x1e8/0x650
         [<ffffffff8106e05e>] worker_thread+0x4e/0x450
         [<ffffffff810746af>] kthread+0xef/0x110
         [<ffffffff8185980f>] ret_from_fork+0x3f/0x70
      
      Fix it by updating add_timer_on() to perform proper migration as
      __mod_timer() does.
      Reported-and-tested-by: Jeff Layton <jlayton@poochiereds.net>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Worley <chris.worley@primarydata.com>
      Cc: bfields@fieldses.org
      Cc: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: kernel-team@fb.com
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20151029103113.2f893924@tlielax.poochiereds.net
      Link: http://lkml.kernel.org/r/20151104171533.GI5749@mtj.duckdns.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  11. 12 Oct, 2015 · 1 commit
  12. 22 Sep, 2015 · 1 commit
  13. 18 Aug, 2015 · 1 commit
    • timer: Write timer->flags atomically · d0023a14
      Committed by Eric Dumazet
      lock_timer_base() cannot prevent the following:
      
      CPU1 (in __mod_timer())
      timer->flags |= TIMER_MIGRATING;
      spin_unlock(&base->lock);
      base = new_base;
      spin_lock(&base->lock);
      // The next line clears TIMER_MIGRATING
      timer->flags &= ~TIMER_BASEMASK;
                                        CPU2 (in lock_timer_base())
                                        see timer base is cpu0 base
                                        spin_lock_irqsave(&base->lock, *flags);
                                        if (timer->flags == tf)
                                             return base; // oops, wrong base
      timer->flags |= base->cpu // too late
      
      We must write timer->flags in one go, otherwise we can fool other cpus.
      
      Fixes: bc7a34b8 ("timer: Reduce timer migration overhead if disabled")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jon Christopherson <jon@jons.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: xen-devel@lists.xen.org
      Cc: david.vrabel@citrix.com
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Link: http://lkml.kernel.org/r/1439831928.32680.11.camel@edumazet-glaptop2.roam.corp.google.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  14. 27 Jun, 2015 1 commit
    • T
      timer: Fix hotplug regression · 24bfcb10
      Thomas Gleixner committed
      The recent timer wheel rework removed the get/put_cpu_var() pair in
      the hotplug migration code, which results in:
      
      BUG: using smp_processor_id() in preemptible [00000000] code: hib.sh/2845
      ...
      [<ffffffff810d4fa3>] timer_cpu_notify+0x53/0x12
      
      That hunk is a leftover from an earlier iteration and went unnoticed
      so far.
      
      Restore the previous code which was obviously correct.
      
      Fixes: 0eeda71b 'timer: Replace timer base by a cpu index'
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  15. 19 Jun, 2015 7 commits
    • T
      timer: Minimize nohz off overhead · 683be13a
      Thomas Gleixner committed
      If nohz is disabled on the kernel command line the [hr]timer code
      still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
      pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
      avoid the overhead.
      
      Before:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.73%  hog       [.] main
        15.36%  [kernel]  [k] _raw_spin_lock_irqsave
         9.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.61%  [kernel]  [k] lock_timer_base.isra.38
         6.42%  [kernel]  [k] mod_timer
         3.90%  [kernel]  [k] detach_if_pending
         3.76%  [kernel]  [k] del_timer
         2.41%  [kernel]  [k] internal_add_timer
         1.39%  [kernel]  [k] __internal_add_timer
         0.76%  [kernel]  [k] timerfn
      
      We probably should have a cached value for nohz full in the per cpu
      bases as well to avoid the cpumask check. The base cache line is hot
      already, the cpumask not necessarily.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      timer: Reduce timer migration overhead if disabled · bc7a34b8
      Thomas Gleixner committed
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered to use a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The user
      visible sysctl is still set to enabled. If the kernel switches to NOHZ,
      migration is enabled, unless the user disabled it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
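
The rule described above can be read as a cached boolean that is recomputed whenever one of its two inputs changes. The following is a minimal sketch under that reading; `struct base_sketch`, its field names, and `update_migration()` are hypothetical and do not claim to match the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU timer base; field names are illustrative only. */
struct base_sketch {
    bool nohz_active;        /* cached: has the kernel switched to NOHZ? */
    bool migration_enabled;  /* cached: may timers be placed on another CPU? */
};

/* Recompute the cached decision whenever the sysctl or the NOHZ state changes.
 * The timer insertion path then reads a single field from a cache line it
 * already touches, instead of calling out to check the sysctl. */
static inline void update_migration(struct base_sketch *b, bool sysctl_enabled)
{
    assert(b);
    b->migration_enabled = b->nohz_active && sysctl_enabled;
}
```

With `nohz=off`, `nohz_active` never becomes true in this sketch, so migration stays disabled regardless of the sysctl value, matching the last sentence of the description.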
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      timer: Stats: Simplify the flags handling · c74441a1
      Thomas Gleixner committed
      Simplify the handling of the flag storage for the timer statistics. No
      intermediate storage anymore. Just hand over the flags field.
      
      I left the printout of 'deferrable' for now because changing this
      would be an ABI update and I have no idea how strong people feel about
      that. OTOH, I wonder whether we should kill the whole timer stats
      stuff because all of that information can be retrieved via ftrace/perf
      as well.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      timer: Replace timer base by a cpu index · 0eeda71b
      Thomas Gleixner committed
      Instead of storing a pointer to the per cpu tvec_base we can simply
      cache a CPU index in the timer_list and use that to get hold of the
      correct per cpu tvec_base. This is only used in lock_timer_base() and
      the slightly larger code is peanuts versus the spinlock operation and
      the d-cache foot print of the timer wheel.
      
      Aside from that, this allows us to get rid of the following nuisances:
      
       - boot_tvec_base
      
         That statically allocated 4k bss data is just kept around so the
         timer has a home when it gets statically initialized. It serves no
         other purpose.
      
         With the CPU index we assign the timer to CPU0 at static
         initialization time and therefore can avoid the whole boot_tvec_base
         dance.  That also simplifies the init code, which just can use the
         per cpu base.
      
         Before:
           text	   data	    bss	    dec	    hex	filename
          17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
         After:
           text	   data	    bss	    dec	    hex	filename
          17440	   9193	      0	  26633	   6809	../build/kernel/time/timer.o
      
       - Overloading the base pointer with various flags
      
         The CPU index has enough space to hold the flags (deferrable,
         irqsafe) so we can get rid of the extra masking and bit fiddling
         with the base pointer.
      
      As a benefit we reduce the size of struct timer_list on 64 bit
      machines by 4 - 8 bytes, a size reduction of up to 15% per struct
      timer_list, which is a real win as we have tons of them embedded in
      other structs.
      
      This also changes the newly added deferrable printout of the timer
      start trace point to capture and print all timer->flags, which allows
      us to decode the target cpu of the timer as well.
      
      We might have used bitfields for this, but that would change the
      static initializers and the init function for no value to accommodate
      big endian bitfields.
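
Packing the CPU index and the flag bits into one word with explicit masks, rather than bitfields, could look roughly like the sketch below. The mask values and helper names are assumptions chosen for illustration; what matters is that a zero-initialized word decodes as CPU 0 with no flags set, so a static initializer needs no special handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout: CPU index in the low bits, flag bits above it. */
#define TIMER_CPUMASK    0x000FFFFFu
#define TIMER_DEFERRABLE 0x00200000u
#define TIMER_IRQSAFE    0x00400000u

/* Build a flags word from a CPU index and the two boolean properties. */
static inline uint32_t timer_flags_pack(uint32_t cpu, bool deferrable, bool irqsafe)
{
    assert(cpu <= TIMER_CPUMASK);
    return cpu
         | (deferrable ? TIMER_DEFERRABLE : 0u)
         | (irqsafe ? TIMER_IRQSAFE : 0u);
}

/* Decode helpers: plain masking, identical on little and big endian. */
static inline uint32_t timer_flags_cpu(uint32_t flags)
{
    return flags & TIMER_CPUMASK;
}

static inline bool timer_flags_deferrable(uint32_t flags)
{
    return (flags & TIMER_DEFERRABLE) != 0;
}

static inline bool timer_flags_irqsafe(uint32_t flags)
{
    return (flags & TIMER_IRQSAFE) != 0;
}
```

Because the encoding is defined by masks on an integer, printing the raw word in a trace point is enough to recover both the target CPU and the flags.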
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Badhri Jagan Sridharan <Badhri@google.com>
      Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      timer: Use hlist for the timer wheel hash buckets · 1dabbcec
      Thomas Gleixner committed
      This reduces the size of struct tvec_base by 50% and results in
      slightly smaller code as well.
      
      Before:
         struct tvec_base: size: 8256, cachelines: 129
      
         text	   data	    bss	    dec	    hex	filename
        17698	  13297	   8256	  39251	   9953	../build/kernel/time/timer.o
      
      After:
         struct tvec_base: size: 4160, cachelines: 65
      
         text	   data	    bss	    dec	    hex	filename
        17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
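
The halving follows from the bucket head shrinking from two pointers to one. The sketch below checks that arithmetic in userspace C; the `_sketch` types are stand-ins for the kernel's list and hlist heads, and the bucket count (four arrays of 64 plus one of 256) is an assumption about the wheel layout that happens to be consistent with the sizes quoted above on a 64-bit build.

```c
#include <assert.h>
#include <stddef.h>

/* Doubly linked list head: two pointers per bucket. */
struct list_head_sketch {
    struct list_head_sketch *next, *prev;
};

/* hlist head: a single pointer per bucket (nodes keep the back link). */
struct hlist_node_sketch;
struct hlist_head_sketch {
    struct hlist_node_sketch *first;
};

/* Assumed wheel layout: 4 secondary arrays of 64 buckets plus 1 primary of 256. */
enum { WHEEL_BUCKETS = 4 * 64 + 256 };

/* Bytes occupied by the bucket heads alone for a given head size. */
static inline size_t wheel_bytes(size_t bucket_size)
{
    assert(bucket_size > 0);
    return (size_t)WHEEL_BUCKETS * bucket_size;
}
```

With 8-byte pointers this gives 8192 bytes of list heads versus 4096 bytes of hlist heads. In both the Before and After figures the remaining 64 bytes would then be the non-bucket fields of the base, under the assumed layout.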
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224511.854731214@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      timer: Remove FIFO "guarantee" · 1bd04bf6
      Thomas Gleixner committed
      The FIFO guarantee is only there if two timers are queued into the
      same bucket at the same jiffie on the same cpu:
      
       - The slack value depends on the delta between expiry and enqueue
         time, so the resulting expiry time can be different for timers
         which are queued in different jiffies.
      
       - Timers which are queued into the secondary array end up after a
         later queued timer which was queued into the primary array due to
         cascading.
      
       - Timers can end up on different cpus due to the NOHZ target moving
         around. Obviously there is no guarantee of expiry ordering between
         cpus.
      
      So anything which relies on FIFO behaviour of the timer wheel is
      broken already.
      
      This is a preparatory patch for converting the timer wheel to hlist
      which reduces the memory foot print of the wheel by 50%.
      
      It's a separate patch so any (unlikely to happen) regression caused by
      this can be identified clearly.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Cc: George Spelvin <linux@horizon.com>
      Link: http://lkml.kernel.org/r/20150526224511.757520403@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      timers: Sanitize catchup_timer_jiffies() usage · 3bb475a3
      Thomas Gleixner committed
      catchup_timer_jiffies() has been applied blindly to several functions
      without looking for possible better ways to do it.
      
      1) internal_add_timer()
      
         Move the update to base->all_timers before we actually insert the
         timer into the wheel.
      
      2) detach_if_pending()
      
         Again the update to base->all_timers allows us to explicitly do
         the timer_jiffies update in place, if this was the last timer which
         got removed.
      
      3) __run_timers()
      
         We only check on entry, which is silly, because base->timer_jiffies
         can be behind - especially on NOHZ kernels - and if there is a
         single deferrable timer somewhere between base->timer_jiffies and
         jiffies we expire it and then loop until base->timer_jiffies ==
         jiffies.
      
         Move it into the loop.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224511.662994644@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  16. 23 May, 2015 1 commit
  17. 05 May, 2015 1 commit
  18. 22 Apr, 2015 4 commits
  19. 02 Apr, 2015 2 commits