1. 14 January 2017 (1 commit)
    • sched/core: Add wrappers for lockdep_(un)pin_lock() · d8ac8971
      Committed by Matt Fleming
      In preparation for adding diagnostic checks to catch missing calls to
      update_rq_clock(), provide wrappers for (re)pinning and unpinning
      rq->lock.
      
      Because the pending diagnostic checks allow state to be maintained in
      rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
      for 'struct rq_flags *'.
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@unitn.it>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d8ac8971
  2. 24 November 2016 (1 commit)
  3. 16 November 2016 (3 commits)
    • sched/fair: Propagate load during synchronous attach/detach · 09a43ace
      Committed by Vincent Guittot
      When a task moves from/to a cfs_rq, we set a flag which is then used to
      propagate the change at the parent level (sched_entity and cfs_rq) during
      the next update. If the cfs_rq is throttled, the flag stays pending until
      the cfs_rq is unthrottled.
      
      For propagating the utilization, we copy the utilization of group cfs_rq to
      the sched_entity.
      
      For propagating the load, we have to take into account the load of the
      whole task group in order to evaluate the load of the sched_entity.
      Similar to what was done before the rewrite of PELT, we add a correction
      factor in case the task group's load is greater than its share, so that it
      contributes the same load as a task of equal weight.
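
      As an illustration of the utilization copy described above, the propagation
      amounts to roughly the following (heavily simplified sketch; the helper name
      is illustrative, and the real patch uses dedicated update_tg_cfs_util()/
      update_tg_cfs_load() helpers plus the load correction factor):

        /* Sketch: mirror a group cfs_rq's utilization into its sched_entity. */
        static void propagate_group_util(struct sched_entity *se, struct cfs_rq *gcfs_rq)
        {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);
                long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;

                if (!delta)
                        return;

                /* The group entity takes over the group cfs_rq's utilization... */
                se->avg.util_avg = gcfs_rq->avg.util_avg;
                /* ...and the parent cfs_rq absorbs the difference. */
                cfs_rq->avg.util_avg += delta;
        }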
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: kernellwp@gmail.com
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      09a43ace
    • sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list · 9c2791f9
      Committed by Vincent Guittot
      Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
      child will always be called before its parent.
      
      The hierarchical order in shares update list has been introduced by
      commit:
      
        67e86250 ("sched: Introduce hierarchal order on shares update list")
      
      With the current implementation, a child can still be put after its
      parent.
      
      Let's take the example of:
      
             root
              \
               b
               /\
               c d*
                 |
                 e*
      
      with root -> b -> c already enqueued but not d -> e so the
      leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail
      
      The branch d -> e will be added the first time that they are enqueued,
      starting with e then d.
      
      When e is added, its parent is not yet on the list, so e is put at
      the tail: head -> c -> b -> root -> e -> tail
      
      Then, d is added at the head because its parent is already on the
      list: head -> d -> c -> b -> root -> e -> tail
      
      e is not placed at the right position and will be called last,
      whereas it should be called at the beginning.
      
      Because enqueueing follows the bottom-up sequence, we are sure that we
      will finish by adding either a cfs_rq without a parent or a cfs_rq whose
      parent is already on the list. We can use this event to detect
      when we have finished adding a new branch. For the others, whose
      parents have not been added yet, we have to ensure that they will be
      added after their children that have just been inserted in the previous
      steps, and after any potential parents that are already in the list.
      The easiest way is to put the cfs_rq just after the last inserted one
      and to keep track of it until the branch is fully added.
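
      In code, the insertion rule boils down to roughly the fragment below (a
      simplified reconstruction of list_add_leaf_cfs_rq(); 'cpu' stands for the
      runqueue's CPU, and tmp_alone_branch is the cursor tracking the last
      inserted cfs_rq until the branch is complete):

        if (cfs_rq->tg->parent &&
            cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
                /* Parent already listed: add the child just before it. */
                list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
                                  &cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list);
                rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;   /* branch done */
        } else if (!cfs_rq->tg->parent) {
                /* Root cfs_rq: put it at the tail of the list. */
                list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
                rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;   /* branch done */
        } else {
                /* Parent not listed yet: chain after the last inserted cfs_rq. */
                list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
                rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
        }
        cfs_rq->on_list = 1;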
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: kernellwp@gmail.com
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c2791f9
    • sched/fair: Add per-CPU min capacity to sched_group_capacity · bf475ce0
      Committed by Morten Rasmussen
      struct sched_group_capacity currently represents the compute capacity
      sum of all CPUs in the sched_group.
      
      Unless it is divided by the group_weight to get the average capacity
      per CPU, it hides differences in CPU capacity for mixed capacity systems
      (e.g. high RT/IRQ utilization or ARM big.LITTLE).
      
      But even the average may not be sufficient if the group covers CPUs of
      different capacities.
      
      Instead, by extending struct sched_group_capacity to indicate the minimum
      per-CPU capacity in the group, a suitable group for a given task utilization
      can more easily be found, such that CPUs with reduced capacity can be avoided
      for tasks with high utilization (not implemented by this patch).
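
      Concretely, the extension is along these lines (sketch; when a group spans
      several child groups the aggregation takes the minimum over them):

        struct sched_group_capacity {
                unsigned long capacity;         /* total compute capacity of the group */
                unsigned long min_capacity;     /* new: capacity of the weakest single CPU */
                /* ... existing fields ... */
        };

        /* Single-CPU group: both values are that CPU's capacity. */
        sdg->sgc->capacity     = capacity;
        sdg->sgc->min_capacity = capacity;

        /* Aggregating child groups: */
        capacity     += group->sgc->capacity;
        min_capacity  = min(group->sgc->min_capacity, min_capacity);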
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bf475ce0
  4. 30 September 2016 (6 commits)
    • sched/irqtime: Consolidate accounting synchronization with u64_stats API · 19d23dbf
      Committed by Frederic Weisbecker
      The irqtime accounting code currently implements its own ad hoc version
      of the u64_stats API. Let's consolidate it with the proper library
      instead.
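
      For reference, the u64_stats API this converts to is used roughly as follows
      (generic usage sketch, not the patch itself; the struct layout here is
      illustrative):

        struct irqtime_stats {
                u64                     hardirq_time;
                u64                     softirq_time;
                struct u64_stats_sync   sync;
        };

        /* Writer side (IRQ accounting path): */
        u64_stats_update_begin(&stats->sync);
        stats->hardirq_time += delta;
        u64_stats_update_end(&stats->sync);

        /* Reader side (may run on another CPU, retries on concurrent update): */
        do {
                seq   = u64_stats_fetch_begin(&stats->sync);
                total = stats->hardirq_time + stats->softirq_time;
        } while (u64_stats_fetch_retry(&stats->sync, seq));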
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/1474849761-12678-5-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19d23dbf
    • sched/debug: Add SCHED_WARN_ON() · 9148a3a1
      Committed by Peter Zijlstra
      Provide SCHED_WARN_ON() as a wrapper for WARN_ON_ONCE() to avoid
      open-coded CONFIG_SCHED_DEBUG wrappery.
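
      The wrapper is essentially (sketch of the definition):

        #ifdef CONFIG_SCHED_DEBUG
        # define SCHED_WARN_ON(x)       WARN_ONCE(x, #x)
        #else
        # define SCHED_WARN_ON(x)       ((void)(x))
        #endif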
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9148a3a1
    • sched/fair: Introduce set_curr_task() helper · b2bf6c31
      Committed by Peter Zijlstra
      Now that the ia64-only set_curr_task() symbol is gone, provide a
      helper just like put_prev_task().
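
      The helper is a thin indirection through the scheduling class, roughly:

        static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
        {
                curr->sched_class->set_curr_task(rq);
        }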
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b2bf6c31
    • sched/core: Optimize SCHED_SMT · 1b568f0a
      Committed by Peter Zijlstra
      Avoid pointless SCHED_SMT code when running on !SMT hardware.
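
      The optimization is a static key that lets the SMT-only paths be patched out
      on machines without SMT, roughly like this sketch (the key is enabled once an
      SMT-capable CPU comes online):

        #ifdef CONFIG_SCHED_SMT
        DEFINE_STATIC_KEY_FALSE(sched_smt_present);
        #endif

        /* SMT-only scans are skipped entirely when the key is off: */
        if (static_branch_likely(&sched_smt_present)) {
                /* ... core-level idle scan ... */
        }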
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b568f0a
    • sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Committed by Peter Zijlstra
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough, and sometimes it just
      gets things plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration, with
      the intent of finding an idle core (on SMT hardware). The problems
      this patch tries to address are:
      
       - it's pointless to look for idle cores if the machine is really busy;
         at that point you're just wasting cycles.
      
       - its behaviour is inconsistent between SMT and !SMT hardware in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and, if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guestimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages. Esp. hackbench was sensitive to this.
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in CPU id order. This keeps multiple
      CPUs in the LLC that are scanning for idle from ganging up and finding
      the same CPU quite as often. The downside is that tasks can end up
      hopping across the LLC for no apparent reason.
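
      The resulting top-level flow is roughly the skeleton below (simplified; the
      three scan helpers and the avg-cost/avg-idle heuristic live in the real
      functions):

        static int select_idle_sibling(struct task_struct *p, int target)
        {
                struct sched_domain *sd;
                int i;

                if (idle_cpu(target))
                        return target;

                sd = rcu_dereference(per_cpu(sd_llc, target));
                if (!sd)
                        return target;

                i = select_idle_core(p, sd, target);    /* 1) idle core in the LLC */
                if ((unsigned)i < nr_cpumask_bits)
                        return i;

                i = select_idle_cpu(p, sd, target);     /* 2) idle CPU in the LLC */
                if ((unsigned)i < nr_cpumask_bits)
                        return i;

                i = select_idle_smt(p, sd, target);     /* 3) idle thread in target core */
                if ((unsigned)i < nr_cpumask_bits)
                        return i;

                return target;
        }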
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10e2f1ac
    • sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Committed by Peter Zijlstra
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
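
      After the move, the shared per-LLC state looks roughly like this (sketch):

        struct sched_domain_shared {
                atomic_t        ref;
                atomic_t        nr_busy_cpus;   /* was sd->parent->groups->sgc->nr_busy_cpus */
        };

        /* Users go through the rq's cached LLC-shared pointer, e.g.: */
        sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
        if (sds)
                atomic_inc(&sds->nr_busy_cpus);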
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0e369d75
  5. 15 September 2016 (1 commit)
  6. 18 August 2016 (2 commits)
  7. 17 August 2016 (2 commits)
  8. 13 July 2016 (1 commit)
  9. 27 June 2016 (3 commits)
    • sched/fair: Rework throttle_count sync · 55e16d30
      Committed by Peter Zijlstra
      Since we already take rq->lock when creating a cgroup, use it to also
      sync the throttle_count and avoid the extra state and enqueue path
      branch.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      [ Fixed build warning. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      55e16d30
    • sched/fair: Reorder cgroup creation code · 8663e24d
      Committed by Peter Zijlstra
      A future patch needs rq->lock held _after_ we link the task_group into
      the hierarchy. In order to avoid taking every rq->lock twice, reorder
      things a little and create online_fair_sched_group() to be called
      after we link the task_group.
      
      All this code is still run from css_alloc(), so css_online() isn't in
      fact used for this.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8663e24d
    • sched/cgroup: Fix cpu_cgroup_fork() handling · ea86cb4b
      Committed by Vincent Guittot
      A new fair task is detached and attached from/to task_group with:
      
        cgroup_post_fork()
          ss->fork(child) := cpu_cgroup_fork()
            sched_move_task()
              task_move_group_fair()
      
      Which is wrong, because at this point in fork() the task isn't fully
      initialized and it cannot 'move' to another group, because it's not
      attached to any group yet.
      
      In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(), so we
      can just call that small part directly instead of sched_move_task(). And since
      the task doesn't really migrate because it is not yet attached, we
      need the following sequence:
      
        do_fork()
          sched_fork()
            __set_task_cpu()
      
          cgroup_post_fork()
            set_task_rq() # set task group and runqueue
      
          wake_up_new_task()
            select_task_rq() can select a new cpu
            __set_task_cpu
            post_init_entity_util_avg
              attach_task_cfs_rq()
            activate_task
              enqueue_task
      
      This patch makes that happen.
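
      The factored-out piece ends up looking roughly like this (sketch based on the
      description above; TASK_SET_GROUP only sets the task's group and runqueue,
      while a real move keeps the full detach/attach dance):

        static void sched_change_group(struct task_struct *tsk, int type)
        {
                struct task_group *tg;

                tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
                                  struct task_group, css);
                tg = autogroup_task_group(tsk, tg);
                tsk->sched_task_group = tg;

        #ifdef CONFIG_FAIR_GROUP_SCHED
                if (tsk->sched_class->task_change_group)
                        tsk->sched_class->task_change_group(tsk, type);
                else
        #endif
                        set_task_rq(tsk, task_cpu(tsk));
        }

        /* cpu_cgroup_fork() now only does: */
        rq = task_rq_lock(task, &rf);
        sched_change_group(task, TASK_SET_GROUP);
        task_rq_unlock(rq, task, &rf);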
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [ Added TASK_SET_GROUP to set depth properly. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ea86cb4b
  10. 24 June 2016 (1 commit)
  11. 14 June 2016 (2 commits)
  12. 12 May 2016 (1 commit)
    • sched/core: Kill sched_class::task_waking to clean up the migration logic · 59efa0ba
      Committed by Peter Zijlstra
      With sched_class::task_waking being called only when we do
      set_task_cpu(), we can make sched_class::migrate_task_rq() do the work
      and eliminate sched_class::task_waking entirely.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      59efa0ba
  13. 06 May 2016 (1 commit)
  14. 05 May 2016 (6 commits)
    • sched/fair: Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT and remove SCHED_LOAD_SCALE · 172895e6
      Committed by Yuyang Du
      After cleaning up the sched metrics, there are two definitions that are
      ambiguous and confusing: SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE.
      
      Resolve this:
      
       - Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT, which better reflects what
         it is.
      
       - Replace SCHED_LOAD_SCALE use with SCHED_CAPACITY_SCALE and remove SCHED_LOAD_SCALE.
      Suggested-by: Ben Segall <bsegall@google.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: lizefan@huawei.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1459829551-21625-3-git-send-email-yuyang.du@intel.com
      [ Rewrote the changelog and fixed the build on 32-bit kernels. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      172895e6
    • sched/fair: Generalize the load/util averages resolution definition · 6ecdd749
      Committed by Yuyang Du
      Integer metrics need fixed-point arithmetic. In sched/fair, a few
      metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
      may have different fixed-point ranges, which makes their update and
      usage error-prone.
      
      In order to avoid errors relating to the fixed-point range, we
      define a basic fixed-point range, and then formalize all metrics to
      be based on that basic range.
      
      The basic range is 1024 or (1 << 10). Further, one can recursively
      apply the basic range to obtain a larger range.
      
      As pointed out by Ben Segall, weight (visible to userspace; e.g., NICE-0 has
      1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
      must be well calibrated.
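
      The basic range is expressed as a pair of defines along these lines (sketch;
      the increased-resolution load range on 64-bit kernels, cleaned up by the
      rename commit listed above, stacks the same shift twice):

        # define SCHED_FIXEDPOINT_SHIFT 10
        # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

        /* e.g. NICE_0_LOAD at double resolution on 64-bit kernels: */
        # define NICE_0_LOAD_SHIFT      (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
        # define NICE_0_LOAD            (1L << NICE_0_LOAD_SHIFT)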
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: lizefan@huawei.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1459829551-21625-2-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6ecdd749
    • sched/core: Enable increased load resolution on 64-bit kernels · 2159197d
      Committed by Peter Zijlstra
      Mike ran into the low load resolution limitation on his big machine.
      
      So reenable these bits; nobody could ever reproduce/analyze the
      reported power usage claim and Google has been running with this for
      years as well.
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2159197d
    • locking/lockdep, sched/core: Implement a better lock pinning scheme · e7904a28
      Committed by Peter Zijlstra
      The problem with the existing lock pinning is that each pin is of
      value 1; this means you can simply unpin if you know it's pinned,
      without having any extra information.
      
      This scheme generates a random (16 bit) cookie for each pin and
      requires this same cookie to unpin. This means you have to keep the
      cookie in context.
      
      No objsize difference for !LOCKDEP kernels.
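
      The resulting interface shape is roughly (usage sketch):

        struct pin_cookie cookie;

        /* Pinning records a random 16-bit cookie against the lock... */
        cookie = lockdep_pin_lock(&rq->lock);

        /* ...and unpinning requires handing the very same cookie back. */
        lockdep_unpin_lock(&rq->lock, cookie);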
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e7904a28
    • sched/core: Introduce 'struct rq_flags' · eb580751
      Committed by Peter Zijlstra
      In order to be able to pass around more than just the IRQ flags in the
      future, add a rq_flags structure.
      
      No difference in code generation for the x86_64-defconfig build I
      tested.
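
      The structure starts out as little more than a container for the saved IRQ
      flags (sketch; the lockdep pin cookie from the pinning rework listed above is
      the next piece of state to land in it):

        struct rq_flags {
                unsigned long flags;    /* saved IRQ state for task_rq_lock() & co. */
        };

        /* Typical usage: */
        struct rq_flags rf;
        struct rq *rq = task_rq_lock(p, &rf);
        /* ... */
        task_rq_unlock(rq, p, &rf);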
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eb580751
    • sched/core: Move task_rq_lock() out of line · 3e71a462
      Committed by Peter Zijlstra
      It's a rather large function; inlining it doesn't seem to make much sense:
      
       $ size defconfig-build/kernel/sched/core.o{.orig,}
          text    data     bss     dec     hex filename
         56533   21037    2320   79890   13812 defconfig-build/kernel/sched/core.o.orig
         55733   21037    2320   79090   134f2 defconfig-build/kernel/sched/core.o
      
      The 'perf bench sched messaging' micro-benchmark shows a visible improvement
      of 4-5%:
      
        $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i ; done
        $ perf stat --null --repeat 25 -- perf bench sched messaging -g 40 -l 5000
      
        pre:
             4.582798193 seconds time elapsed          ( +-  1.41% )
             4.733374877 seconds time elapsed          ( +-  2.10% )
             4.560955136 seconds time elapsed          ( +-  1.43% )
             4.631062303 seconds time elapsed          ( +-  1.40% )
      
        post:
             4.364765213 seconds time elapsed          ( +-  0.91% )
             4.454442734 seconds time elapsed          ( +-  1.18% )
             4.448893817 seconds time elapsed          ( +-  1.41% )
             4.424346872 seconds time elapsed          ( +-  0.97% )
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3e71a462
  15. 23 April 2016 (2 commits)
  16. 02 April 2016 (1 commit)
    • cpufreq: schedutil: New governor based on scheduler utilization data · 9bdcb44e
      Committed by Rafael J. Wysocki
      Add a new cpufreq scaling governor, called "schedutil", that uses
      scheduler-provided CPU utilization information as input for making
      its decisions.
      
      Doing that is possible after commit 34e2c555 (cpufreq: Add
      mechanism for registering utilization update callbacks) that
      introduced cpufreq_update_util() called by the scheduler on
      utilization changes (from CFS) and RT/DL task status updates.
      In particular, CPU frequency scaling decisions may be based on
      the utilization data passed to cpufreq_update_util() by CFS.
      
      The new governor is relatively simple.
      
      The frequency selection formula used by it depends on whether or not
      the utilization is frequency-invariant.  In the frequency-invariant
      case the new CPU frequency is given by
      
      	next_freq = 1.25 * max_freq * util / max
      
      where util and max are the last two arguments of cpufreq_update_util().
      In turn, if util is not frequency-invariant, the maximum frequency in
      the above formula is replaced with the current frequency of the CPU:
      
      	next_freq = 1.25 * curr_freq * util / max
      
      The coefficient 1.25 corresponds to the frequency tipping point at
      (util / max) = 0.8.
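
      In code, the selection boils down to something like the sketch below (the
      parameter list is simplified here; the real governor also rate-limits updates
      and clamps the result to the policy's min/max limits):

        static unsigned int get_next_freq(unsigned int ref_freq,
                                          unsigned long util, unsigned long max)
        {
                /*
                 * next_freq = 1.25 * ref_freq * util / max, where ref_freq is
                 * max_freq if util is frequency-invariant and the current
                 * frequency otherwise.
                 */
                return (ref_freq + (ref_freq >> 2)) * util / max;
        }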
      
      All of the computations are carried out in the utilization update
      handlers provided by the new governor.  One of those handlers is
      used for cpufreq policies shared between multiple CPUs and the other
      one is for policies with one CPU only (and therefore it doesn't need
      to use any extra synchronization means).
      
      The governor supports fast frequency switching if that is supported
      by the cpufreq driver in use and possible for the given policy.
      In the fast switching case, all operations of the governor take
      place in its utilization update handlers.  If fast switching cannot
      be used, the frequency switch operations are carried out with the
      help of a work item which only calls __cpufreq_driver_target()
      (under a mutex) to trigger a frequency update (to a value already
      computed beforehand in one of the utilization update handlers).
      
      Currently, the governor treats all of the RT and DL tasks as
      "unknown utilization" and sets the frequency to the allowed
      maximum when updated from the RT or DL sched classes.  That
      heavy-handed approach should be replaced with something more
      subtle and specifically targeted at RT and DL tasks.
      
      The governor shares some tunables management code with the
      "ondemand" and "conservative" governors and uses some common
      definitions from cpufreq_governor.h, but apart from that it
      is stand-alone.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      9bdcb44e
  17. 31 March 2016 (1 commit)
    • sched/fair: Initiate a new task's util avg to a bounded value · 2b8c41da
      Committed by Yuyang Du
      A new task's util_avg is set to the full utilization of a CPU (100% of time
      running). This accelerates a new task's utilization ramp-up, which is useful
      to boost its execution early on. However, it may result in
      (insanely) high utilization for a transient period when a flood
      of tasks is spawned. Importantly, it violates the "fundamentally
      bounded" CPU utilization, and its side effect is negative if we don't
      take any measure to bound it.
      
      This patch proposes an algorithm to address this issue. It has
      two methods to approach a sensible initial util_avg:
      
      (1) An expected (or average) util_avg based on its cfs_rq's util_avg:
      
        util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
      
      (2) A trajectory of how successive new tasks' util develops, which
      gives 1/2 of the remaining utilization budget to a new task such that
      the additional util is noticeably large (when overall util is low) or
      unnoticeably small (when overall util is high enough). In the meantime,
      the aggregate utilization is well bounded:
      
        util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
      
      where n denotes the nth task.
      
      If util_avg is larger than util_avg_cap, then the effective util is
      clamped to the util_avg_cap.
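
      Put together, the initialization looks roughly like this sketch of the new
      post-init path (simplified; 1024 stands for full CPU capacity, and the
      matching util_sum update is included for completeness):

        long cap = (long)(1024 - cfs_rq->avg.util_avg) / 2;

        if (cap > 0) {
                if (cfs_rq->avg.util_avg != 0) {
                        /* Method (1): scale the cfs_rq's util by the task's weight. */
                        sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
                        sa->util_avg /= (cfs_rq->avg.load_avg + 1);

                        /* Clamp with the cap from method (2). */
                        if (sa->util_avg > cap)
                                sa->util_avg = cap;
                } else {
                        /* Empty cfs_rq: hand out half of the remaining budget. */
                        sa->util_avg = cap;
                }
                sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
        }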
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2b8c41da
  18. 11 March 2016 (1 commit)
  19. 09 March 2016 (1 commit)
  20. 05 March 2016 (1 commit)
    • sched/cputime: Fix steal time accounting vs. CPU hotplug · e9532e69
      Committed by Thomas Gleixner
      On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
      value across a CPU going down and coming back up. So after the CPU comes up
      again, the delta calculation in steal_account_process_tick() wrecks itself
      due to the unsigned math:
      
      	 u64 steal = paravirt_steal_clock(smp_processor_id());
      
      	 steal -= this_rq()->prev_steal_time;
      
      So if steal is smaller than rq->prev_steal_time, we end up with an insanely
      large value which then gets added to rq->prev_steal_time, resulting in permanent
      wreckage of the accounting. As a consequence the per-CPU stats in /proc/stat
      become stale.
      
      Nice trick to tell the world how idle the system is (100%) while the CPU is
      100% busy running tasks. Though we prefer realistic numbers.
      
      None of the accounting values which use a previous value to account for
      fractions are reset at CPU hotplug time. update_rq_clock_task() has a sanity
      check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
      deals with clock warps and only limits the /proc/stat-visible wreckage. The
      prev_time values are still wrong.
      
      Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
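
      The reset amounts to a small helper along these lines, called from the CPU-up
      path (sketch; exact placement and config guards follow the existing prev_*
      fields):

        static inline void account_reset_rq(struct rq *rq)
        {
        #ifdef CONFIG_IRQ_TIME_ACCOUNTING
                rq->prev_irq_time = 0;
        #endif
        #ifdef CONFIG_PARAVIRT
                rq->prev_steal_time = 0;
        #endif
        #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
                rq->prev_steal_time_rq = 0;
        #endif
        }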
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: commit 095c0aa8 "sched: adjust scheduler cpu power for stolen time"
      Fixes: commit aa483808 "sched: Remove irq time from available CPU power"
      Fixes: commit e6e6685a "KVM guest: Steal time accounting"
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e9532e69
  21. 02 March 2016 (2 commits)
    • sched: Migrate sched to use new tick dependency mask model · 76d92ac3
      Committed by Frederic Weisbecker
      Instead of providing asynchronous checks for the nohz subsystem to verify
      sched tick dependency, migrate sched to the new mask.
      
      Every time a task is enqueued or dequeued, we evaluate the state of the
      tick dependency on top of the policy of the tasks in the runqueue, in
      order of priority:
      
      SCHED_DEADLINE: Need the tick in order to periodically check for runtime
      SCHED_FIFO    : Don't need the tick (no round-robin)
      SCHED_RR      : Need the tick if more than 1 task of the same priority
                      for round robin (simplified with checking if more than
                      one SCHED_RR task no matter what priority).
      SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.
      
      We could optimize that further with one flag per sched policy on the tick
      dependency mask and perform only the checks relevant to the policy
      concerned by an enqueue/dequeue operation.
      
      Since the checks aren't based on the current task anymore, we could get
      rid of the task switch hook but it's still needed for posix cpu
      timers.
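
      The per-policy rules above translate into a check of roughly this shape
      (sketch of sched_can_stop_tick(); the enqueue/dequeue paths then set or clear
      the sched tick dependency bit based on its result):

        bool sched_can_stop_tick(struct rq *rq)
        {
                int fifo_nr_running;

                /* Deadline tasks need the tick for runtime enforcement. */
                if (rq->dl.dl_nr_running)
                        return false;

                /* Round-robin tasks need the tick only when there is more than one. */
                if (rq->rt.rr_nr_running)
                        return rq->rt.rr_nr_running == 1;

                /* Remaining FIFO tasks never need the tick to preempt each other. */
                fifo_nr_running = rq->rt.rt_nr_running - rq->rt.rr_nr_running;
                if (fifo_nr_running)
                        return true;

                /* More than one SCHED_NORMAL task: keep the tick for round-robin. */
                if (rq->nr_running > 1)
                        return false;

                return true;
        }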
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      76d92ac3
    • sched: Account rr tasks · 01d36d0a
      Committed by Frederic Weisbecker
      In order to evaluate the scheduler tick dependency without probing
      context switches, we need to know how many SCHED_RR and SCHED_FIFO tasks
      are enqueued, as those policies don't have the same preemption
      requirements.
      
      To prepare for that, let's account SCHED_RR tasks; we'll then be able to
      deduce the number of SCHED_FIFO tasks from it and the total number of RT
      tasks in the runqueue.
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      01d36d0a