1. 05 Sep 2016, 9 commits
  2. 18 Aug 2016, 11 commits
    • sched/cputime: Improve scalability by not accounting thread group tasks pending runtime · a1eb1411
      Committed by Stanislaw Gruszka
      Commit:
      
        d670ec13 ("posix-cpu-timers: Cure SMP wobbles")
      
      started accounting thread group tasks pending runtime in thread_group_cputime().
      
      Another commit:
      
        6e998916 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
      
      updated the scheduler runtime statistics (calling update_curr()) when reading a task's
      pending runtime. Together, those changes cause bad performance of the SYS_times() and
      SYS_clock_gettime(CLOCK_PROCESS_CPUTIME_ID) syscalls, especially on
      larger systems with many CPUs.
      
      While we would like to have cpuclock monotonicity kept i.e. have
      problems fixed by above commits stay fixed, we also would like to have
      good performance.
      
      However, the change from commit d670ec13 is no longer needed to solve
      the problem addressed by that commit, because of the change from the
      second commit 6e998916, which gives us room for optimization. Since we
      update the task while reading its pending runtime in task_sched_runtime(),
      clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will see updated values, and on
      the testcase from d670ec13 the process cpuclock will not be smaller
      than the thread cpuclock.
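
      A minimal userspace illustration of the property being preserved (an assumed
      test shape in the spirit of the d670ec13 testcase, not the original test):
      the process cpuclock read after a thread cpuclock read must never be smaller.

        /* build: gcc -O2 monotonic.c -o monotonic -lrt (older glibc needs -lrt) */
        #include <stdio.h>
        #include <time.h>

        int main(void)
        {
            struct timespec th, pr;
            unsigned long long t, p;
            int i;

            for (i = 0; i < 1000000; i++) {
                clock_gettime(CLOCK_THREAD_CPUTIME_ID, &th);
                clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &pr);
                t = th.tv_sec * 1000000000ULL + th.tv_nsec;
                p = pr.tv_sec * 1000000000ULL + pr.tv_nsec;
                if (p < t) {    /* process clock must include the thread clock */
                    printf("non-monotonic: process %llu < thread %llu\n", p, t);
                    return 1;
                }
            }
            return 0;
        }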
      
      I tested the patch on the testcases from commits d670ec13,
      6e998916 and some other cpuclock/cputimer testcases and
      did not find cpuclock monotonicity problems or other malfunctions.
      
      This patch has the drawback that thread group cputime is no longer kept
      up to date to the last moment. For example, when arming a cputime timer,
      we will arm it with possibly slightly outdated values and that timer will
      trigger earlier compared to the behaviour without the patch. However, that
      was the behaviour before the d670ec13 commit (kernel v3.1), so it's
      unlikely to affect applications.
      
      Patch improves related syscall performance, as measured by Giovanni's
      benchmarks described in commit:
      
        6075620b ("sched/cputime: Mitigate performance regression in times()/clock_gettime()")
      
      The benchmark results are:
      
      SYS_clock_gettime():
      
        threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                               (pre-6e998916)
        2          3.48        2.23 ( 35.68%)        3.06 ( 11.83%)        1.08 ( 68.81%)
        5          3.33        2.83 ( 14.84%)        3.25 (  2.40%)        0.71 ( 78.55%)
        8          3.37        2.84 ( 15.80%)        3.26 (  3.30%)        0.56 ( 83.49%)
        12         3.32        3.09 (  6.69%)        3.37 ( -1.60%)        0.42 ( 87.28%)
        21         4.01        3.14 ( 21.70%)        3.90 (  2.74%)        0.35 ( 91.35%)
        30         3.63        3.28 (  9.75%)        3.36 (  7.41%)        0.28 ( 92.23%)
        48         3.71        3.02 ( 18.69%)        3.11 ( 16.27%)        0.39 ( 89.39%)
        79         3.75        2.88 ( 23.23%)        3.16 ( 15.74%)        0.46 ( 87.76%)
        110        3.81        2.95 ( 22.62%)        3.25 ( 14.80%)        0.56 ( 85.41%)
        128        3.88        3.05 ( 21.28%)        3.31 ( 14.76%)        0.62 ( 84.10%)
      
      SYS_times():
      
        threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                               (pre-6e998916)
        2          3.65        2.27 ( 37.94%)        3.25 ( 11.03%)        1.62 ( 55.71%)
        5          3.45        2.78 ( 19.34%)        3.17 (  7.92%)        2.33 ( 32.28%)
        8          3.52        2.79 ( 20.66%)        3.22 (  8.69%)        2.06 ( 41.44%)
        12         3.29        3.02 (  8.33%)        3.36 ( -2.04%)        2.00 ( 39.18%)
        21         4.07        3.10 ( 23.86%)        3.92 (  3.78%)        2.07 ( 49.18%)
        30         3.87        3.33 ( 13.80%)        3.40 ( 12.17%)        1.89 ( 51.12%)
        48         3.79        2.96 ( 21.94%)        3.16 ( 16.61%)        1.69 ( 55.46%)
        79         3.88        2.88 ( 25.82%)        3.28 ( 15.42%)        1.60 ( 58.81%)
        110        3.90        2.98 ( 23.73%)        3.38 ( 13.35%)        1.73 ( 55.61%)
        128        4.00        3.10 ( 22.40%)        3.38 ( 15.45%)        1.66 ( 58.52%)
      Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a1eb1411
    • sched/fair: Let asymmetric CPU configurations balance at wake-up · 3273163c
      Committed by Morten Rasmussen
      Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
      SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
      configurations SD_WAKE_AFFINE is only desirable if the waking task's
      compute demand (utilization) is suitable for the waking CPU and the
      previous CPU, and all CPUs within their respective
      SD_SHARE_PKG_RESOURCES domains (sd_llc). If not, let wakeup balancing
      take over (find_idlest_{group, cpu}()).
      
      This patch makes affine wake-ups conditional on whether both the waker
      CPU and the previous CPU have sufficient capacity for the waking task,
      assuming that the CPU capacities within an SD_SHARE_PKG_RESOURCES
      domain (sd_llc) are homogeneous.
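
      A rough sketch of the capacity check being described (illustrative names and
      margin, not the kernel implementation): affine wake-up is only kept when both
      candidate CPUs can fit the task's utilization, otherwise wakeup balancing
      (find_idlest_{group,cpu}()) picks the target.

        /* Illustrative sketch only -- names and the ~25% margin are assumptions. */
        #define MARGIN_NUM 1280
        #define MARGIN_DEN 1024

        int task_fits(unsigned long task_util, unsigned long cpu_capacity)
        {
            return cpu_capacity * MARGIN_DEN >= task_util * MARGIN_NUM;
        }

        int affine_wakeup_allowed(unsigned long task_util,
                                  unsigned long waker_cpu_cap,
                                  unsigned long prev_cpu_cap)
        {
            /* If either CPU is too small for the task, fall back to
             * wakeup balancing instead of an affine placement. */
            return task_fits(task_util, waker_cpu_cap) &&
                   task_fits(task_util, prev_cpu_cap);
        }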
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-10-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3273163c
    • sched/core: Store maximum per-CPU capacity in root domain · cd92bfd3
      Committed by Dietmar Eggemann
      To be able to compare the capacity of the target CPU with the highest
      available CPU capacity, store the maximum per-CPU capacity in the root
      domain.
      
      The max per-CPU capacity should be 1024 for all systems except SMT,
      where the capacity is currently based on smt_gain and the number of
      hardware threads and is <1024. If SMT can be brought to work with a
      per-thread capacity of 1024, this patch can be dropped and replaced by a
      hard-coded max capacity of 1024 (=SCHED_CAPACITY_SCALE).
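
      As a sketch (illustrative, not the kernel code), the stored value is simply the
      maximum of the per-CPU capacities seen while the root domain is built:

        /* Illustrative sketch: compute the maximum per-CPU capacity for a root domain. */
        unsigned long max_cpu_capacity(const unsigned long *cpu_capacity, int nr_cpus)
        {
            unsigned long max_cap = 0;
            int cpu;

            for (cpu = 0; cpu < nr_cpus; cpu++)
                if (cpu_capacity[cpu] > max_cap)
                    max_cap = cpu_capacity[cpu];

            /* expected to be 1024 (SCHED_CAPACITY_SCALE) everywhere except SMT */
            return max_cap;
        }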
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/26c69258-9947-f830-a53e-0c54e7750646@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cd92bfd3
    • sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems · 9ee1cda5
      Committed by Morten Rasmussen
      A domain with the SD_ASYM_CPUCAPACITY flag set indicates that the
      sched_groups at this level and below do not include CPUs of all
      available capacities (e.g. a group containing little-only or big-only CPUs
      in big.LITTLE systems). It is therefore necessary to put more effort
      into finding an appropriate CPU at task wake-up by enabling balancing at
      wake-up (SD_BALANCE_WAKE) on all lower (child) levels.
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-8-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9ee1cda5
    • sched/core: Pass child domain into sd_init() · 3676b13e
      Committed by Morten Rasmussen
      If behavioural sched_domain flags depend on topology flags set at higher
      domain levels we need a way to update the child domain flags. Moving the
      child pointer assignment inside sd_init() should make that possible.
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-7-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3676b13e
    • sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag · 1f6e6c7c
      Committed by Morten Rasmussen
      Add a topology flag to the sched_domain hierarchy indicating the lowest
      domain level where the full range of CPU capacities is represented by
      the domain members for asymmetric capacity topologies (e.g. ARM
      big.LITTLE).
      
      The flag is intended to indicate that extra care should be taken when
      placing tasks on CPUs and this level spans all the different types of
      CPUs found in the system (no need to look further up the domain
      hierarchy). This information is currently only available through
      iterating through the capacities of all the CPUs at parent levels in the
      sched_domain hierarchy.
      
        SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY
      
        SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY
      
        CPU:         0      1      2      3
        capacity:  756    756   1024   1024
      
      If the topology in the example above is duplicated to create an eight-CPU
      example with a third sched_domain level on top (SD 3), this level should
      not have the flag set (!SD_ASYM_CPUCAPACITY) as its two groups would both
      have all CPU capacities represented within them.
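
      A sketch of what "full range of capacities represented" means for a candidate
      span (illustrative helper, not the kernel code):

        /* Illustrative sketch: does the set of CPUs in 'span' cover every distinct
         * capacity value present in the whole system?  The flag would be set at
         * the lowest level for which this holds. */
        int span_covers_all_capacities(const unsigned long *cpu_capacity,
                                       const int *span, int span_len,
                                       int nr_cpus_total)
        {
            int i, j;

            for (i = 0; i < nr_cpus_total; i++) {
                int found = 0;

                for (j = 0; j < span_len; j++)
                    if (cpu_capacity[span[j]] == cpu_capacity[i])
                        found = 1;

                if (!found)
                    return 0;   /* some capacity value is missing from this span */
            }
            return 1;
        }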
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1f6e6c7c
    • sched/core: Remove unnecessary NULL-pointer check · 0e6d2a67
      Committed by Morten Rasmussen
      Checking whether the sched_domain pointer returned by sd_init() is NULL
      seems pointless, as sd_init() neither checks whether it is valid to begin
      with nor sets it to NULL.
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-5-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0e6d2a67
    • sched/core: Clarify SD_flags comment · 94f438c8
      Committed by Peter Zijlstra
      The SD_flags comment is very terse and doesn't explain why PACKING is
      odd.
      
      IIRC the distinction is that the 'normal' ones only describe topology,
      while the ASYM_PACKING one also prescribes behaviour. It is odd in the
      way that it doesn't only describe things.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/20160815105459.GS6879@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      94f438c8
    • sched/cputime: Resync steal time when guest & host lose sync · 03cbc732
      Committed by Wanpeng Li
      Commit:
      
        57430218 ("sched/cputime: Count actually elapsed irq & softirq time")
      
      ... fixed a bug but also triggered a regression:
      
      On an i5 laptop with 4 pCPUs and 4 vCPUs for one full dynticks guest, with
      four CPU-hog processes (busy loops) running in the guest, I hot-unplug the
      pCPUs on the host one by one until only one is left, then observe CPU
      utilization via 'top' in the guest. It shows:
      
        100% st for cpu0(housekeeping)
         75% st for other CPUs (nohz full mode)
      
      However, w/o this commit it shows the correct 75% for all four CPUs.
      
      When a guest is interrupted for a longer amount of time, missed clock ticks
      are not redelivered later. Because of that, we should not limit the amount
      of steal time accounted to the amount of time that the calling functions
      think have passed.
      
      However, the interval returned by account_other_time() is NOT rounded down
      to the nearest jiffy, while the base interval it is subtracted from in
      get_vtime_delta() is, so the max cputime limit is required to avoid underflow.
      
      This patch fixes the regression by limiting the account_other_time() from
      get_vtime_delta() to avoid underflow, and lets the other three call sites
      (in account_other_time() and steal_account_process_time()) account however
      much steal time the host told us elapsed.
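
      A small standalone illustration of the underflow the clamp prevents (assumed
      numbers, not kernel code): the base delta is rounded down to whole jiffies
      while the reported steal time is not, so the steal time must be capped before
      it is subtracted.

        #include <stdio.h>

        int main(void)
        {
            unsigned long long delta = 4000000;  /* base interval, jiffy-rounded (ns) */
            unsigned long long steal = 4100000;  /* steal time reported by the host   */
            unsigned long long other = steal;

            if (other > delta)
                other = delta;                   /* the clamp kept in the vtime path  */

            /* without the clamp, delta - other would underflow */
            printf("accounted as steal: %llu, left for vtime: %llu\n",
                   other, delta - other);
            return 0;
        }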
      Suggested-by: Rik van Riel <riel@redhat.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
      [ Improved the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      03cbc732
    • sched: Remove struct rq::nohz_stamp · 1fc770d5
      Committed by Rik van Riel
      The nohz_stamp member of struct rq has been unused since 2010,
      when this commit removed the code that referenced it:
      
        396e894d ("sched: Revert nohz_ratelimit() for now")
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160815121410.5ea1c98f@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1fc770d5
    • sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression · 173be9a1
      Committed by Peter Zijlstra
      Mike reports:
      
       Roughly 10% of the time, ltp testcase getrusage04 fails:
       getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
       getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
       getrusage04    0  TINFO  :  utime:           0us; stime:         179us
       getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
       getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased > 5000us:
      
      And tracked it down to the case where the task simply doesn't get
      _any_ [us]time ticks.
      
      Update the code to assume all rtime is utime when we lack information,
      thus ensuring a task that elides the tick gets time accounted.
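
      A sketch of the adjustment being described (illustrative, not the exact kernel
      code): with no tick samples at all, attribute all of rtime to utime rather than
      leaving the task with zero accounted time.

        /* Illustrative sketch of the "assume all rtime is utime" fallback. */
        void adjust_cputime(unsigned long long rtime,
                            unsigned long long *utime, unsigned long long *stime)
        {
            if (*utime == 0 && *stime == 0) {
                /* no [us]time ticks were seen: assume it all ran in user mode */
                *utime = rtime;
                return;
            }
            /* otherwise utime/stime are scaled so that they sum to rtime (omitted) */
        }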
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: stable@vger.kernel.org # 4.3+
      Fixes: 9d7fb042 ("sched/cputime: Guarantee stime + utime == rtime")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      173be9a1
  3. 11 Aug 2016, 2 commits
  4. 10 Aug 2016, 13 commits
    • sched/debug: Add taint on "BUG: Sleeping function called from invalid context" · f0b22e39
      Committed by Vegard Nossum
      Seeing this, it occurs to me that we should probably add a taint here:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17160 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17198 ffffffff81158067 0000000000000de6
           ffff88011a3c4c80 ffffffff8390e07c 0000000000000184 0000000000000000
          Call Trace:
          [...]
      
          BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1309
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17e08 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17e40 ffffffff81158067 0000000000000000
           ffff88011a3c4c80 ffffffff83437b20 000000000000051d 0000000000000000
          Call Trace:
          [...]
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Link: http://lkml.kernel.org/r/1469216762-19626-1-git-send-email-vegard.nossum@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f0b22e39
    • sched/debug: Make the "Preemption disabled at ..." message more useful · d1c6d149
      Committed by Vegard Nossum
      This message is currently really useless since it always prints a value
      that comes from the printk() we just did, e.g.:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          BUG: sleeping function called from invalid context at include/linux/freezer.h:56
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
      Here, both down_trylock() and console_unlock() are somewhere in the
      printk() path.
      
      We should save the value before calling printk() and use the saved value
      instead. That immediately reveals the offending callsite:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
          Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150
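
      A minimal sketch of the idea described above (userspace analogy, illustrative
      names only): take the snapshot before printing, so the printing path cannot
      overwrite the value that gets reported.

        #include <stdio.h>

        unsigned long preempt_disable_ip; /* updated when preemption is disabled (assumed) */

        void report_sleeping_bug(const char *file, int line)
        {
            unsigned long saved_ip = preempt_disable_ip;   /* snapshot first */

            printf("BUG: sleeping function called from invalid context at %s:%d\n",
                   file, line);
            /* the printing above may itself disable preemption and clobber
             * preempt_disable_ip, so report the snapshot, not the live value */
            printf("Preemption disabled at: %#lx\n", saved_ip);
        }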
      
      Bug report:
      
        http://marc.info/?l=linux-netdev&m=146925979821849&w=2
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d1c6d149
    • sched/deadline: Remove useless parameter from setup_new_dl_entity() · 98b0a857
      Committed by Juri Lelli
      setup_new_dl_entity() takes two parameters, but it only actually uses
      one of them, under a different name, to set up a new dl_entity, after:
      
        2f9f3fdc928 "sched/deadline: Remove dl_new from struct sched_dl_entity"
      
      as we currently do:
      
        setup_new_dl_entity(&p->dl, &p->dl)
      
      However, before Luca's change we were doing:
      
        setup_new_dl_entity(dl_se, pi_se)
      
      in update_dl_entity() for a dl_se->new entity: we were using pi_se's
      parameters (the potential PI donor) for setting up a new entity.
      
      This change removes the useless second parameter of setup_new_dl_entity().
      
      While we are at it we also optimize things further by calling setup_new_dl_
      entity() only for already queued tasks, since (as pointed out by Xunlei)
      we already do the very same update at task wakeup time anyway. By doing
      so, we don't need to worry about a potential PI donor anymore, as
      rt_mutex_setprio() takes care of that already for us.
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@unitn.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xunlei Pang <xpang@redhat.com>
      Link: http://lkml.kernel.org/r/1470409675-20935-1-git-send-email-juri.lelli@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      98b0a857
    • sched/core: Add documentation for 'cookie' argument · 9279e0d2
      Committed by Luis de Bethencourt
      Add documentation for the cookie argument in try_to_wake_up_local().
      
      This caused the following warning when building documentation:
      
        kernel/sched/core.c:2088: warning: No description found for parameter 'cookie'
      Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Fixes: e7904a28 ("locking/lockdep, sched/core: Implement a better lock pinning scheme")
      Link: http://lkml.kernel.org/r/1468159226-17674-1-git-send-email-luisbg@osg.samsung.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9279e0d2
    • sched/fair: Optimize find_idlest_cpu() when there is no choice · eaecf41f
      Committed by Morten Rasmussen
      In the current find_idlest_group()/find_idlest_cpu() search we can end up
      calling find_idlest_cpu() on a sched_group containing only a single CPU.
      Checking idle-states becomes pointless when there is no
      alternative, so bail out instead.
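
      The early exit amounts to something like the following (illustrative sketch,
      not the kernel code):

        /* Illustrative sketch: with a single CPU in the group there is nothing
         * to compare idle states or load against, so return it immediately. */
        int find_idlest_cpu_in_group(const int *group_cpus, int nr_group_cpus)
        {
            if (nr_group_cpus == 1)
                return group_cpus[0];   /* no alternative: bail out early */

            /* ... otherwise walk the CPUs comparing idle state / load (omitted) ... */
            return group_cpus[0];
        }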
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: linux-kernel@vger.kernel.org
      Cc: mgalbraith@suse.de
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1466615004-3503-4-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eaecf41f
    • sched/fair: Make the use of prev_cpu consistent in the wakeup path · 772bd008
      Committed by Morten Rasmussen
      In commit:
      
        ac66f547 ("sched/numa: Introduce migrate_swap()")
      
      select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
      in special cases (NUMA task swapping).
      
      However, the select_task_rq_fair() helper functions: wake_affine() and
      select_idle_sibling(), still use task_cpu(p) directly to work out
      prev_cpu, which leads to inconsistencies.
      
      This patch passes prev_cpu (potentially overridden by NUMA code) into
      the helper functions to ensure prev_cpu is indeed the same CPU
      everywhere in the wakeup path.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: linux-kernel@vger.kernel.org
      Cc: mgalbraith@suse.de
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1466615004-3503-3-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      772bd008
    • sched/fair: Improve PELT stuff some more · 7c3edd2c
      Committed by Peter Zijlstra
      Vincent noted that the update_tg_load_avg() usage in commit:
      
        3d30544f ("sched/fair: Apply more PELT fixes")
      
      isn't entirely sufficient. We need to call this function every time
      cfs_rq->avg.load changes, this includes when update_cfs_rq_load_avg()
      returns true, but {attach,detach}_entity_load_avg() themselves also
      change it. This means we need to unconditionally call
      update_tg_load_avg().
      
      Also, add more comments.
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7c3edd2c
    • sched/core: Fix one typo · a1fd4656
      Committed by Leo Yan
      Fix one minor typo in the comment: s/targer/target/.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470378758-15066-1-git-send-email-leo.yan@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a1fd4656
    • sched/fair: Remove 'cpu_busy' parameter from update_next_balance() · 31851a98
      Committed by Leo Yan
      The update_next_balance() function is only used by idle balancing, so its
      'cpu_busy' parameter is always 0.
      
      Open code it instead of passing it around.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470378689-14892-1-git-send-email-leo.yan@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      31851a98
    • sched/deadline: Fix lock pinning warning during CPU hotplug · c0c8c9fa
      Committed by Wanpeng Li
      The following warning can be triggered by hot-unplugging the CPU
      on which an active SCHED_DEADLINE task is running:
      
        WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
        releasing a pinned lock
        Call Trace:
         dump_stack+0x99/0xd0
         __warn+0xd1/0xf0
         ? dl_task_timer+0x1a1/0x2b0
         warn_slowpath_fmt+0x4f/0x60
         ? sched_clock+0x13/0x20
         lock_release+0x690/0x6a0
         ? enqueue_pushable_dl_task+0x9b/0xa0
         ? enqueue_task_dl+0x1ca/0x480
         _raw_spin_unlock+0x1f/0x40
         dl_task_timer+0x1a1/0x2b0
         ? push_dl_task.part.31+0x190/0x190
        WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
        unpinning an unpinned lock
        Call Trace:
         dump_stack+0x99/0xd0
         __warn+0xd1/0xf0
         warn_slowpath_fmt+0x4f/0x60
         lock_unpin_lock+0x181/0x1a0
         dl_task_timer+0x127/0x2b0
         ? push_dl_task.part.31+0x190/0x190
      
      As per the comment before this code, it's safe to drop the RQ lock
      here, and since we (potentially) change rq, unpin and repin to avoid
      the splat.
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      [ Rewrote changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@unitn.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c0c8c9fa
    • sched/cputime: Mitigate performance regression in times()/clock_gettime() · 6075620b
      Committed by Giovanni Gherdovich
      Commit:
      
        6e998916 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
      
      fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
      allow a task to wake early. It addressed the problem by calling the scheduling
      classes update_curr() when the cputimer starts.
      
      Said change induced a considerable performance regression in the syscalls
      times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
      debuggers and applications that monitor their own performance and
      accidentally depend on the performance of these specific calls.
      
      This patch mitigates the performance loss by prefetching data into the CPU
      cache, as stalls due to cache misses appear to be where most time is spent
      in our benchmarks.
      
      Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
      box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
      variable number of threads, from 2 to 4*num_cpus; the results are in
      seconds and correspond to the average of 10 runs; the percentage gain is
      computed with (before-after)/before so a positive value is an improvement
      (it's faster). The improvement varies between a few percent for 5-20
      threads and more than 10% for 2 or >20 threads.
      
      pound_clock_gettime:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.48        3.06 ( 11.83%)
            5           3.33        3.25 (  2.40%)
            8           3.37        3.26 (  3.30%)
           12           3.32        3.37 ( -1.60%)
           21           4.01        3.90 (  2.74%)
           30           3.63        3.36 (  7.41%)
           48           3.71        3.11 ( 16.27%)
           79           3.75        3.16 ( 15.74%)
          110           3.81        3.25 ( 14.80%)
          128           3.88        3.31 ( 14.76%)
      
      pound_times:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.65        3.25 ( 11.03%)
            5           3.45        3.17 (  7.92%)
            8           3.52        3.22 (  8.69%)
           12           3.29        3.36 ( -2.04%)
           21           4.07        3.92 (  3.78%)
           30           3.87        3.40 ( 12.17%)
           48           3.79        3.16 ( 16.61%)
           79           3.88        3.28 ( 15.42%)
          110           3.90        3.38 ( 13.35%)
          128           4.00        3.38 ( 15.45%)
      
      pound_times and pound_clock_gettime are two benchmarks included in
      the MMTests framework. They launch a given number of threads which
      repeatedly call times() or clock_gettime(). The results above can be
      reproduced by cloning MMTests from github.com and running the "poundtime"
      workload:
      
        $ git clone https://github.com/gormanm/mmtests.git
        $ cd mmtests
        $ cp configs/config-global-dhp__workload_poundtime config
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      The above will run "poundtime" measuring the kernel currently running on
      the machine; Once a new kernel is installed and the machine rebooted,
      running again
      
        $ cd mmtests
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      will produce results to compare with. A comparison table will be output
      with:
      
        $ cd mmtests/work/log
        $ ../../compare-kernels.sh
      
      the table will contain a lot of entries; grepping for "Amean" (as in
      "arithmetic mean") will give the tables presented above. The source code
      for the two benchmarks is reported at the end of this changelog for
      clarity.
      
      The cache misses addressed by this patch were found using a combination of
      `perf top`, `perf record` and `perf annotate`. The incriminated lines were
      found to be
      
          struct sched_entity *curr = cfs_rq->curr;
      
      and
      
          delta_exec = now - curr->exec_start;
      
      in the function update_curr() from kernel/sched/fair.c. This patch
      prefetches that data from memory just before update_curr() is called on the
      affected execution path.
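
      A sketch of the mitigation (illustrative structure, not the kernel code): issue
      prefetches for the cache lines update_curr() is about to touch, just before it runs.

        /* Illustrative sketch: hide the cache-miss latency of the two loads that
         * `perf annotate` flagged by prefetching them ahead of update_curr(). */
        struct sched_entity_like { unsigned long long exec_start; /* ... */ };
        struct cfs_rq_like       { struct sched_entity_like *curr; /* ... */ };

        void prefetch_before_update_curr(struct cfs_rq_like *cfs_rq)
        {
            struct sched_entity_like *curr = cfs_rq->curr;

            __builtin_prefetch(curr);               /* line holding exec_start */
            __builtin_prefetch(&curr->exec_start);
        }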
      
      A comparison of the total number of cycles before and after the patch
      follows; the data is obtained using `perf stat -r 10 -ddd <program>`
      running over the same sequence of number of threads used above (a positive
      gain is an improvement):
      
        threads   cycles before                 cycles after                gain
      
          2      19,699,563,964  +-1.19%      17,358,917,517  +-1.85%      11.88%
          5      47,401,089,566  +-2.96%      45,103,730,829  +-0.97%       4.85%
          8      80,923,501,004  +-3.01%      71,419,385,977  +-0.77%      11.74%
         12     112,326,485,473  +-0.47%     110,371,524,403  +-0.47%       1.74%
         21     193,455,574,299  +-0.72%     180,120,667,904  +-0.36%       6.89%
         30     315,073,519,013  +-1.64%     271,222,225,950  +-1.29%      13.92%
         48     321,969,515,332  +-1.48%     273,353,977,321  +-1.16%      15.10%
         79     337,866,003,422  +-0.97%     289,462,481,538  +-1.05%      14.33%
        110     338,712,691,920  +-0.78%     290,574,233,170  +-0.77%      14.21%
        128     348,384,794,006  +-0.50%     292,691,648,206  +-0.66%      15.99%
      
      A comparison of cache miss vs total cache loads ratios, before and after
      the patch (again from the `perf stat -r 10 -ddd <program>` tables):
      
        threads   L1 misses/total*100     L1 misses/total*100            gain
                       (before)                 (after)
            2           7.43  +-4.90%           7.36  +-4.70%           0.94%
            5          13.09  +-4.74%          13.52  +-3.73%          -3.28%
            8          13.79  +-5.61%          12.90  +-3.27%           6.45%
           12          11.57  +-2.44%           8.71  +-1.40%          24.72%
           21          12.39  +-3.92%           9.97  +-1.84%          19.53%
           30          13.91  +-2.53%          11.73  +-2.28%          15.67%
           48          13.71  +-1.59%          12.32  +-1.97%          10.14%
           79          14.44  +-0.66%          13.40  +-1.06%           7.20%
          110          15.86  +-0.50%          14.46  +-0.59%           8.83%
          128          16.51  +-0.32%          15.06  +-0.78%           8.78%
      
      As a final note, the following shows the evolution of performance figures
      in the "poundtime" benchmark and pinpoints commit 6e998916
      ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
      major source of degradation, mostly unaddressed to this day (figures
      expressed in seconds).
      
      pound_clock_gettime:
      
        threads   parent of 6e998916   6e998916 itself        4.7-rc7
          2        2.23          3.68 ( -64.56%)        3.48 (-55.48%)
          5        2.83          3.78 ( -33.42%)        3.33 (-17.43%)
          8        2.84          4.31 ( -52.12%)        3.37 (-18.76%)
          12       3.09          3.61 ( -16.74%)        3.32 ( -7.17%)
          21       3.14          4.63 ( -47.36%)        4.01 (-27.71%)
          30       3.28          5.75 ( -75.37%)        3.63 (-10.80%)
          48       3.02          6.05 (-100.56%)        3.71 (-22.99%)
          79       2.88          6.30 (-118.90%)        3.75 (-30.26%)
          110      2.95          6.46 (-119.00%)        3.81 (-29.24%)
          128      3.05          6.42 (-110.08%)        3.88 (-27.04%)
      
      pound_times:
      
        threads   parent of 6e998916   6e998916 itself        4.7-rc7
          2        2.27          3.73 ( -64.71%)        3.65 (-61.14%)
          5        2.78          3.77 ( -35.56%)        3.45 (-23.98%)
          8        2.79          4.41 ( -57.71%)        3.52 (-26.05%)
          12       3.02          3.56 ( -17.94%)        3.29 ( -9.08%)
          21       3.10          4.61 ( -48.74%)        4.07 (-31.34%)
          30       3.33          5.75 ( -72.53%)        3.87 (-16.01%)
          48       2.96          6.06 (-105.04%)        3.79 (-28.10%)
          79       2.88          6.24 (-116.83%)        3.88 (-34.81%)
          110      2.98          6.37 (-114.08%)        3.90 (-31.12%)
          128      3.10          6.35 (-104.61%)        4.00 (-28.87%)
      
      The source code of the two benchmarks follows. To compile the two:
      
        NR_THREADS=42
        for FILE in pound_times pound_clock_gettime; do
            gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
        done
      
      ==== BEGIN pound_times.c ====
      
      /* Includes added so the file compiles as-is with the gcc line above. */
      #include <stdio.h>
      #include <pthread.h>
      #include <sys/times.h>

      struct tms start;
      
      void *pound (void *threadid)
      {
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;
        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                times(&end);
                utime = ((int)end.tms_utime - (int)start.tms_utime);
                if (oldutime > utime) {
                  printf("utime decreased, was %d, now %d!\n", oldutime, utime);
                }
                oldutime = utime;
        }
        pthread_exit(NULL);
      }
      
      int main()
      {
        pthread_t th[NUM_THREADS];
        long i;
        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
          pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
      }
      ==== END pound_times.c ====
      
      ==== BEGIN pound_clock_gettime.c ====
      
      /* Includes added so the file compiles as-is with the gcc line above. */
      #include <stdio.h>
      #include <pthread.h>
      #include <time.h>
      #include <sys/types.h>

      void *pound (void *threadid)
      {
      	struct timespec ts;
      	int rc, i;
      	unsigned long prev = 0, this = 0;
      
      	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
      		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
      		if (rc < 0)
      			perror("clock_gettime");
      		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
      		if (0 && this < prev)
      			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
      		prev = this;
      	}
      	pthread_exit(NULL);
      }
      
      int main()
      {
      	pthread_t th[NUM_THREADS];
      	long rc, i;
      	pid_t pgid;
      
      	for (i = 0; i < NUM_THREADS; i++) {
      		rc = pthread_create(&th[i], NULL, pound, (void *)i);
      		if (rc < 0)
      			perror("pthread_create");
      	}
      
      	pthread_exit(NULL);
      	return 0;
      }
      ==== END pound_clock_gettime.c ====
      Suggested-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6075620b
    • sched/fair: Fix typo in sync_throttle() · b8922125
      Committed by Xunlei Pang
      We should update cfs_rq->throttled_clock_task, not
      pcfs_rq->throttle_clock_task.
      
      The effects of this bug were probably occasionally erratic
      group scheduling, particularly in cgroup-intensive workloads.
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      [ Added changelog. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 55e16d30 ("sched/fair: Rework throttle_count sync")
      Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b8922125
    • sched/deadline: Fix wrap-around in DL heap · a23eadfa
      Committed by Tommaso Cucinotta
      Current code in cpudeadline.c has a bug in re-heapifying when adding a
      new element at the end of the heap, because a deadline value of 0 is
      temporarily set in the new elem, then cpudl_change_key() is called
      with the actual elem deadline as param.
      
      However, the function compares the new deadline to be set with the one
      previously in the elem, which is 0.  So, if current absolute deadlines
      have grown so large that they appear negative as s64, the comparison in
      cpudl_change_key() makes the wrong decision.  Instead, as dl_time_before()
      does, the kernel should handle absolute deadline wrap-arounds correctly.
      
      This patch fixes the problem with a minimally invasive change that
      forces cpudl_change_key() to heapify up in this case.
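
      For reference, a standalone illustration of the wrap-around-safe comparison
      that dl_time_before() relies on (simplified, illustrative):

        #include <stdio.h>
        #include <stdint.h>

        static int dl_before(uint64_t a, uint64_t b)
        {
            return (int64_t)(a - b) < 0;    /* signed difference survives wrap-around */
        }

        int main(void)
        {
            uint64_t near_wrap = UINT64_MAX - 5;   /* deadline just before the wrap */
            uint64_t wrapped   = 10;               /* deadline just after the wrap  */

            /* a naive '<' orders these wrongly; the signed-difference test does not */
            printf("naive <: %d, wrap-safe: %d\n",
                   near_wrap < wrapped, dl_before(near_wrap, wrapped));
            return 0;
        }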
      Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1468921493-10054-2-git-send-email-tommaso.cucinotta@sssup.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a23eadfa
  5. 22 Jul 2016, 1 commit
    • cpufreq: schedutil: map raw required frequency to driver frequency · 5cbea469
      Committed by Steve Muckle
      The slow-path frequency transition path is relatively expensive as it
      requires waking up a thread to do work. Should support be added for
      remote CPU cpufreq updates, that would also be expensive since it requires
      an IPI. These activities should be avoided if they are not necessary.
      
      To that end, calculate the actual driver-supported frequency required by
      the new utilization value in schedutil by using the recently added
      cpufreq_driver_resolve_freq API. If it is the same as the previously
      requested driver frequency then there is no need to continue with the
      update assuming the cpu frequency limits have not changed. This will
      have additional benefits should the semantics of the rate limit be
      changed to apply solely to frequency transitions rather than to
      frequency calculations in schedutil.
      
      The last raw required frequency is cached. This allows the driver
      frequency lookup to be skipped in the event that the new raw required
      frequency matches the last one, assuming a frequency update has not been
      forced due to limits changing (indicated by a next_freq value of
      UINT_MAX, see sugov_should_update_freq).
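
      A sketch of the two-level caching this describes (illustrative names, not the
      schedutil code):

        /* Illustrative sketch: skip the driver lookup when the raw frequency is
         * unchanged, and reuse the previously resolved driver frequency. */
        struct freq_cache {
            unsigned int cached_raw_freq;   /* last raw frequency computed        */
            unsigned int last_driver_freq;  /* last frequency asked of the driver */
        };

        unsigned int resolve_next_freq(struct freq_cache *c, unsigned int raw_freq,
                                       unsigned int (*resolve)(unsigned int),
                                       int forced)
        {
            if (!forced && raw_freq == c->cached_raw_freq)
                return c->last_driver_freq;      /* nothing changed: reuse it */

            c->cached_raw_freq = raw_freq;
            c->last_driver_freq = resolve(raw_freq); /* map to a driver-supported step */
            return c->last_driver_freq;
        }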
      Signed-off-by: Steve Muckle <smuckle@linaro.org>
      Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      5cbea469
  6. 14 Jul 2016, 4 commits
    • sched/cputime: Drop local_irq_save/restore from irqtime_account_irq() · 553bf6bb
      Committed by Rik van Riel
      Paolo pointed out that irqs are already blocked when irqtime_account_irq()
      is called. That means there is no reason to call local_irq_save/restore()
      again.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/1468421405-20056-6-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      553bf6bb
    • sched/cputime: Clean up the old vtime gen irqtime accounting completely · 0cfdf9a1
      Committed by Frederic Weisbecker
      Vtime generic irqtime accounting has been removed but there are a few
      remnants to clean up:
      
      * The vtime_accounting_cpu_enabled() check in irq entry was only used
        by CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can safely remove it.
      
      * Without the vtime_accounting_cpu_enabled(), we no longer need to
        have a vtime_common_account_irq_enter() indirect function.
      
      * Move vtime_account_irq_enter() implementation under
        CONFIG_VIRT_CPU_ACCOUNTING_NATIVE which is the last user.
      
      * The vtime_account_user() call was only used on irq entry for
        CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can remove that too.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/1468421405-20056-4-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0cfdf9a1
    • sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code · b58c3584
      Committed by Rik van Riel
      The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
      appear to currently work right.
      
      On CPUs without nohz_full=, only tick based irq time sampling is
      done, which breaks down when dealing with a nohz_idle CPU.
      
      On firewalls and similar systems, no ticks may happen on a CPU for a
      while, and the irq time spent may never get accounted properly. This
      can cause issues with capacity planning and power saving, which use
      the CPU statistics as inputs in decision making.
      
      Remove the VTIME_GEN vtime irq time code, and replace it with the
      IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b58c3584
    • sched/cputime: Count actually elapsed irq & softirq time · 57430218
      Committed by Rik van Riel
      Currently, if there was any irq or softirq time during 'ticks'
      jiffies, the entire period will be accounted as irq or softirq
      time.
      
      This is inaccurate if only a subset of the time was actually spent
      handling irqs, and could conceivably mis-count all of the ticks during
      a period as irq time, when there was some irq and some softirq time.
      
      This can actually happen when irqtime_account_process_tick is called
      from account_idle_ticks, which can pass a larger number of ticks down
      all at once.
      
      Fix this by changing irqtime_account_hi_update(), irqtime_account_si_update(),
      and steal_account_process_ticks() to work with cputime_t time units, and
      return the amount of time spent in each mode.
      
      Rename steal_account_process_ticks() to steal_account_process_time(), to
      reflect that time is now accounted in cputime_t, instead of ticks.
      
      Additionally, have irqtime_account_process_tick() take into account how
      much time was spent in each of steal, irq, and softirq time.
      
      The latter could help improve the accuracy of cputime
      accounting when returning from idle on a NO_HZ_IDLE CPU.
      
      Properly accounting how much time was spent in hardirq and
      softirq time will also allow the NO_HZ_FULL code to re-use
      these same functions for hardirq and softirq accounting.
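
      As a sketch of the accounting change (illustrative, not the kernel code): each
      helper now accounts at most the time that actually elapsed and tells the caller
      how much is left for other accounting.

        /* Illustrative sketch: account at most 'max_time' worth of pending irq time
         * and return the time that remains for user/system/idle accounting. */
        unsigned long long account_irq_time(unsigned long long max_time,
                                            unsigned long long *pending_irq_time)
        {
            unsigned long long accounted = *pending_irq_time;

            if (accounted > max_time)
                accounted = max_time;   /* never account more than actually elapsed */

            *pending_irq_time -= accounted; /* carry the remainder to the next tick */
            return max_time - accounted;
        }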
      Signed-off-by: Rik van Riel <riel@redhat.com>
      [ Make nsecs_to_cputime64() actually return cputime64_t. ]
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/1468421405-20056-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      57430218