1. 16 Nov 2016, 8 commits
  2. 27 Oct 2016, 1 commit
  3. 20 Oct 2016, 1 commit
  4. 19 Oct 2016, 1 commit
    •
      sched/fair: Fix incorrect task group ->load_avg · b5a9b340
      Vincent Guittot authored
      A scheduler performance regression has been reported by Joseph Salisbury,
      which he bisected back to:
      
        3d30544f ("sched/fair: Apply more PELT fixes")
      
      The regression triggers when several levels of task groups are involved
      (read: SystemD) and cpu_possible_mask != cpu_present_mask.
      
      The root cause is that a group entity's load (tg_child->se[i]->avg.load_avg)
      is initialized to scale_load_down(se->load.weight). During the creation of
      a child task group, its group entities on possible CPUs are attached to the
      parent's cfs_rq (tg_parent) and their loads are added to the parent's load
      (tg_parent->load_avg) with update_tg_load_avg().
      
      But only the load on online CPUs will then be updated to reflect real load,
      whereas load on other CPUs will stay at the initial value.
      
      The result is a tg_parent->load_avg that is higher than the real load, the
      weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
      smaller than it should be, and the task group gets less running time than
      it could expect.
      
      ( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
        of the task group will be much higher than the sum of ".tg_load_avg_contrib"
        of the task group's online cfs_rqs. )
      
      The load of group entities doesn't have to be initialized to anything other
      than 0, because their load will increase when an entity is attached.
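
      A minimal sketch of the resulting initialization, assuming the 4.8-era
      init_entity_runnable_average() in kernel/sched/fair.c (simplified;
      util_avg setup is left to post_init_entity_util_avg()):

        void init_entity_runnable_average(struct sched_entity *se)
        {
                struct sched_avg *sa = &se->avg;

                sa->last_update_time = 0;
                sa->period_contrib = 1023;
                /*
                 * Tasks keep the full-weight init so they look "heavy" until
                 * their signal stabilizes; group entities now start at 0
                 * because nothing is attached to the new task group yet.
                 */
                if (entity_is_task(se))
                        sa->load_avg = scale_load_down(se->load.weight);
                sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
        }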
      Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@vger.kernel.org> # 4.8.x
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: joonwoop@codeaurora.org
      Fixes: 3d30544f ("sched/fair: Apply more PELT fixes")
      Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5a9b340
  5. 11 Oct 2016, 2 commits
    •
      sched/fair: Fix sched domains NULL dereference in select_idle_sibling() · 9cfb38a7
      Wanpeng Li authored
      Commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... improved select_idle_sibling(), but also triggered a regression (crash)
      during CPU-hotplug:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
        IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
        Call Trace:
         <IRQ>
          select_task_rq_fair+0x749/0x930
          ? select_task_rq_fair+0xb4/0x930
          ? __lock_is_held+0x54/0x70
          try_to_wake_up+0x19a/0x5b0
          default_wake_function+0x12/0x20
          autoremove_wake_function+0x12/0x40
          __wake_up_common+0x55/0x90
          __wake_up+0x39/0x50
          wake_up_klogd_work_func+0x40/0x60
          irq_work_run_list+0x57/0x80
          irq_work_run+0x2c/0x30
          smp_irq_work_interrupt+0x2e/0x40
          irq_work_interrupt+0x96/0xa0
         <EOI>
          ? _raw_spin_unlock_irqrestore+0x45/0x80
          try_to_wake_up+0x4a/0x5b0
          wake_up_state+0x10/0x20
          __kthread_unpark+0x67/0x70
          kthread_unpark+0x22/0x30
          cpuhp_online_idle+0x3e/0x70
          cpu_startup_entry+0x6a/0x450
          start_secondary+0x154/0x180
      
      This can be reproduced by running the ftrace test case from kselftest: the
      test case hot-unplugs a CPU, and the CPU is attached to the NULL
      sched-domain during scheduler teardown.
      
      Step 2 of the select_idle_siblings() rewrite reads:
      
        | Step 2) tracks the average cost of the scan and compares this to the
        | average idle time guestimate for the CPU doing the wakeup.
      
      If the CPU doing the wakeup is the CPU being hot-unplugged, then the NULL
      sched domain will be dereferenced to acquire the average cost of the scan.
      
      This patch fixes it by failing the search for an idle CPU in the LLC if
      the sched domain is NULL.
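
      A minimal sketch of the guard, assuming the select_idle_cpu() helper
      introduced by the rewrite (simplified):

        static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
        {
                struct sched_domain *this_sd;

                /* The wakeup CPU may already have detached from its domains. */
                this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
                if (!this_sd)
                        return -1;      /* fail the idle-CPU search instead of crashing */

                /* ... this_sd->avg_scan_cost heuristic and the CPU scan, unchanged ... */
                return -1;
        }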
      Tested-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9cfb38a7
    •
      latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy authored
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or have variable loops, each of which provides some level of
      latent entropy.
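
      A hypothetical illustration of the attribute (the variable and function
      names below are made up; __latent_entropy itself is provided by the
      plugin's compiler headers and compiles away when the plugin is off):

        /* On a variable: the plugin seeds it with random contents. */
        static u64 example_entropy_pool[4] __latent_entropy;

        /* On a function: the plugin instruments its control flow so that
         * executing it mixes state into the latent entropy pool. */
        static int __init __latent_entropy example_early_setup(void)
        {
                /* unpredictable boot-time work ... */
                return 0;
        }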
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      0766f788
  6. 30 Sep 2016, 6 commits
    •
      sched/fair: Fix min_vruntime tracking · b60205c7
      Peter Zijlstra authored
      While going through enqueue/dequeue to review the movement of
      set_curr_task() I noticed that the (2nd) update_min_vruntime() call in
      dequeue_entity() is suspect.
      
      It turns out it's actually wrong, because it will consider
      cfs_rq->curr, which could be the entity we just normalized. This mixes
      different vruntime forms and leads to failure.
      
      The purpose of the second update_min_vruntime() is to move
      min_vruntime forward if the entity we just removed is the one that was
      holding it back; _except_ for the DEQUEUE_SAVE case, because then we
      know it's a temporary removal and it will come back.
      
      However, since we do put_prev_task() _after_ dequeue(), cfs_rq->curr
      will still be set (and, per the above, can be transformed into a
      different unit), so update_min_vruntime() should also consider
      curr->on_rq. This also fixes another corner case where the enqueue
      (which also does update_curr()->update_min_vruntime()) happens on the
      rq->lock break in schedule(), between dequeue and put_prev_task().
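
      A sketch of the resulting update_min_vruntime(), assuming the 4.8-era
      kernel/sched/fair.c (simplified):

        static void update_min_vruntime(struct cfs_rq *cfs_rq)
        {
                struct sched_entity *curr = cfs_rq->curr;
                u64 vruntime = cfs_rq->min_vruntime;

                /* Only trust curr's vruntime if it is actually queued (the fix). */
                if (curr) {
                        if (curr->on_rq)
                                vruntime = curr->vruntime;
                        else
                                curr = NULL;
                }

                if (cfs_rq->rb_leftmost) {
                        struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
                                                           struct sched_entity,
                                                           run_node);

                        if (!curr)
                                vruntime = se->vruntime;
                        else
                                vruntime = min_vruntime(vruntime, se->vruntime);
                }

                /* Ensure min_vruntime never moves backwards. */
                cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
        }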
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 1e876231 ("sched: Fix ->min_vruntime calculation in dequeue_entity()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b60205c7
    •
      sched/debug: Add SCHED_WARN_ON() · 9148a3a1
      Peter Zijlstra authored
      Provide SCHED_WARN_ON as wrapper for WARN_ON_ONCE() to avoid
      CONFIG_SCHED_DEBUG wrappery.
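
      A sketch of the wrapper, assuming it lives in kernel/sched/sched.h
      (the exact expansion may differ):

        #ifdef CONFIG_SCHED_DEBUG
        # define SCHED_WARN_ON(x)       WARN_ON_ONCE(x)
        #else
        # define SCHED_WARN_ON(x)       ((void)(x))
        #endif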
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9148a3a1
    •
      sched/core: Optimize SCHED_SMT · 1b568f0a
      Peter Zijlstra authored
      Avoid pointless SCHED_SMT code when running on !SMT hardware.
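
      A sketch of one way this is done, assuming a sched_smt_present static
      key that is only enabled when SMT siblings are detected (names
      simplified from the 4.8-era scheduler):

        #ifdef CONFIG_SCHED_SMT
        DEFINE_STATIC_KEY_FALSE(sched_smt_present);

        static void sched_init_smt(void)
        {
                /* Enable the key only if CPU0 actually has SMT siblings. */
                if (cpumask_weight(cpu_smt_mask(0)) > 1)
                        static_branch_enable(&sched_smt_present);
        }

        static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
        {
                /* Patched-out branch: costs (almost) nothing on !SMT hardware. */
                if (!static_branch_likely(&sched_smt_present))
                        return -1;

                /* ... scan for a fully idle core ... */
                return -1;
        }
        #endif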
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b568f0a
    •
      sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Peter Zijlstra authored
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough, and sometimes it just
      gets things plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration with the
      intent of finding an idle core (on SMT hardware). The problems this
      patch tries to address are:
      
       - it's pointless to look for idle cores if the machine is really busy;
         at that point you're just wasting cycles.
      
       - its behaviour is inconsistent between SMT and !SMT hardware in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and, if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guestimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages. Esp. hackbench was sensitive to this.
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in cpu id order. This keeps multiple
      CPUs in the LLC that are scanning for an idle CPU from ganging up and
      finding the same CPU quite as much. The downside is that tasks can end
      up hopping across the LLC for no apparent reason.
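
      A sketch of the resulting structure of select_idle_sibling(), assuming
      the three helpers described above (signatures simplified):

        static int select_idle_sibling(struct task_struct *p, int target)
        {
                struct sched_domain *sd;
                int i;

                if (idle_cpu(target))
                        return target;

                sd = rcu_dereference(per_cpu(sd_llc, target));
                if (!sd)
                        return target;

                i = select_idle_core(p, sd, target);    /* 1) idle core in the LLC */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                i = select_idle_cpu(p, sd, target);     /* 2) idle CPU in the LLC */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                i = select_idle_smt(p, sd, target);     /* 3) idle sibling of 'target' */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                return target;
        }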
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10e2f1ac
    •
      sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Peter Zijlstra authored
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
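
      A sketch of the new layout, assuming the sched_domain_shared structure
      introduced by this series (simplified):

        struct sched_domain_shared {
                atomic_t        ref;
                atomic_t        nr_busy_cpus;   /* was sd->parent->groups->sgc->nr_busy_cpus */
        };

        struct sched_domain {
                /* ... existing fields ... */
                struct sched_domain_shared *shared;     /* shared by all CPUs the domain spans */
        };

      The nohz accounting can then update the busy count through the LLC's
      shared structure instead of chasing parent/group pointers.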
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0e369d75
    •
      sched/fair: Fix fixed point arithmetic width for shares and effective load · ab522e33
      Dietmar Eggemann authored
      Since commit:
      
        2159197d ("sched/core: Enable increased load resolution on 64-bit kernels")
      
      we now have two different fixed point units for load:
      
      - 'shares' in calc_cfs_shares() has 20 bit fixed point unit on 64-bit
        kernels. Therefore use scale_load() on MIN_SHARES.
      
      - 'wl' in effective_load() has 10 bit fixed point unit. Therefore use
        scale_load_down() on tg->shares which has 20 bit fixed point unit on
        64-bit kernels.
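
      A sketch of the two adjustments (based on the 4.8-era calc_cfs_shares()
      and effective_load(); surrounding code elided):

        /* calc_cfs_shares(): 'shares' uses the 20-bit unit on 64-bit kernels,
         * so the MIN_SHARES clamp must be scaled up to match: */
        if (shares < scale_load(MIN_SHARES))
                shares = scale_load(MIN_SHARES);

        /* effective_load(): 'wl' uses the 10-bit unit, so tg->shares (20-bit
         * on 64-bit kernels) must be scaled down before it is used: */
        if (W > 0 && w < W)
                wl = (w * (long)scale_load_down(tg->shares)) / W;
        else
                wl = scale_load_down(tg->shares);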
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1471874441-24701-1-git-send-email-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ab522e33
  7. 22 Sep 2016, 1 commit
  8. 14 Sep 2016, 1 commit
    •
      cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition · 8c34ab19
      Rafael J. Wysocki authored
      Testing indicates that it is possible to improve performance
      significantly without increasing energy consumption too much by
      teaching cpufreq governors to bump up the CPU performance level if
      the in_iowait flag is set for the task in enqueue_task_fair().
      
      For this purpose, define a new cpufreq_update_util() flag
      SCHED_CPUFREQ_IOWAIT and modify enqueue_task_fair() to pass that
      flag to cpufreq_update_util() in the in_iowait case.  That generally
      requires cpufreq_update_util() to be called directly from there,
      because update_load_avg() may not be invoked in that case.
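
      A sketch of the hook in enqueue_task_fair(), assuming a
      cpufreq_update_util(rq, flags)-style callback (the exact wrapper used
      to reach the governor may differ):

        static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        {
                /*
                 * If in_iowait is set, the PELT path below may not call into
                 * cpufreq at all, so poke the governor here explicitly and
                 * tell it this is an iowait wakeup.
                 */
                if (p->in_iowait)
                        cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

                /* ... normal enqueue path (enqueue_entity(), hrtick, etc.) ... */
        }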
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Looks-good-to: Steve Muckle <smuckle@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      8c34ab19
  9. 10 Sep 2016, 1 commit
  10. 05 Sep 2016, 5 commits
  11. 18 Aug 2016, 1 commit
  12. 17 Aug 2016, 2 commits
  13. 10 Aug 2016, 5 commits
  14. 27 Jun 2016, 5 commits
    •
      sched/fair: Rework throttle_count sync · 55e16d30
      Peter Zijlstra authored
      Since we already take rq->lock when creating a cgroup, use it to also
      sync the throttle_count and avoid the extra state and enqueue path
      branch.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      [ Fixed build warning. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      55e16d30
    •
      sched/fair: Reorder cgroup creation code · 8663e24d
      Peter Zijlstra authored
      A future patch needs rq->lock held _after_ we link the task_group into
      the hierarchy. In order to avoid taking every rq->lock twice, reorder
      things a little and create online_fair_sched_group() to be called
      after we link the task_group.
      
      All this code is still run from css_alloc(), so css_online() isn't in
      fact used for this.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8663e24d
    •
      sched/fair: Apply more PELT fixes · 3d30544f
      Peter Zijlstra authored
      One additional 'rule' for using update_cfs_rq_load_avg() is that one
      should call update_tg_load_avg() if it returns true.
      
      Add a bunch of comments to hopefully clarify some of the rules:
      
       o  You need to update the cfs_rq _before_ any entity attach/detach;
          this is important because, while it isn't strictly needed for
          mathematical consistency, it is required for the physical
          interpretation of the model: you attach/detach _now_.
      
       o  When you modify the cfs_rq avg, you have to then call
          update_tg_load_avg() in order to propagate changes upwards.
      
       o  (Fair) entities are always attached, switched_{to,from}_fair()
          deal with !fair. This directly follows from the definition of the
          cfs_rq averages, namely that they are a direct sum of all
          (runnable or blocked) entities on that rq.
      
      It is the second rule that this patch enforces, but it adds comments
      pertaining to all of them.
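
      A sketch of the second rule as it appears in code, assuming the 4.8-era
      update_load_avg() (simplified):

        static inline void update_load_avg(struct sched_entity *se, int update_tg)
        {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);
                u64 now = cfs_rq_clock_task(cfs_rq);

                /* ... per-entity PELT update elided ... */

                /* Rule: if the cfs_rq average changed, propagate it upwards. */
                if (update_cfs_rq_load_avg(now, cfs_rq, true) && update_tg)
                        update_tg_load_avg(cfs_rq, 0);
        }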
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3d30544f
    •
      sched/fair: Fix PELT integrity for new tasks · 7dc603c9
      Peter Zijlstra authored
      Vincent and Yuyang found another few scenarios in which entity
      tracking goes wobbly.
      
      The scenarios are basically due to the fact that new tasks are not
      immediately attached and thereby differ from the normal situation -- a
      task is always attached to a cfs_rq load average (such that it
      includes its blocked contribution) and is explicitly
      detached/attached on migration to another cfs_rq.
      
      Scenario 1: switch to fair class
      
        p->sched_class = fair_class;
        if (queued)
          enqueue_task(p);
            ...
              enqueue_entity()
                enqueue_entity_load_avg()
                  migrated = !sa->last_update_time (true)
                  if (migrated)
                    attach_entity_load_avg()
        check_class_changed()
          switched_from() (!fair)
          switched_to()   (fair)
            switched_to_fair()
              attach_entity_load_avg()
      
      If @p is a new task that hasn't been fair before, it will have
      !last_update_time and, per the above, end up in
      attach_entity_load_avg() _twice_.
      
      Scenario 2: change between cgroups
      
        sched_move_group(p)
          if (queued)
            dequeue_task()
          task_move_group_fair()
            detach_task_cfs_rq()
              detach_entity_load_avg()
            set_task_rq()
            attach_task_cfs_rq()
              attach_entity_load_avg()
          if (queued)
            enqueue_task();
              ...
                enqueue_entity()
                  enqueue_entity_load_avg()
                    migrated = !sa->last_update_time (true)
                    if (migrated)
                      attach_entity_load_avg()
      
      As with scenario 1, if @p is a new task, it will have
      !last_update_time and we'll end up in attach_entity_load_avg()
      _twice_.
      
      Furthermore, notice how we do a detach_entity_load_avg() on something
      that wasn't attached to begin with.
      
      As stated above; the problem is that the new task isn't yet attached
      to the load tracking and thereby violates the invariant assumption.
      
      This patch remedies this by ensuring a new task is indeed properly
      attached to the load tracking on creation, through
      post_init_entity_util_avg().
      
      Of course, this isn't entirely as straightforward as one might think,
      since the task is hashed before we call wake_up_new_task() and thus
      can be poked at. We avoid this by adding TASK_NEW and teaching
      cpu_cgroup_can_attach() to refuse such tasks.
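
      A sketch of the refusal, assuming the TASK_NEW state introduced here
      and the 4.8-era cpu_cgroup_can_attach() (simplified):

        static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
        {
                struct task_struct *task;
                struct cgroup_subsys_state *css;
                int ret = 0;

                cgroup_taskset_for_each(task, css, tset) {
                        /* ... existing RT bandwidth / class checks elided ... */

                        /*
                         * Serialize against wake_up_new_task(): a TASK_NEW task
                         * is not attached to its load tracking yet, so refuse
                         * to move it between cgroups.
                         */
                        raw_spin_lock_irq(&task->pi_lock);
                        if (task->state == TASK_NEW)
                                ret = -EINVAL;
                        raw_spin_unlock_irq(&task->pi_lock);

                        if (ret)
                                break;
                }
                return ret;
        }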
      Reported-by: Yuyang Du <yuyang.du@intel.com>
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7dc603c9
    •
      sched/cgroup: Fix cpu_cgroup_fork() handling · ea86cb4b
      Vincent Guittot authored
      A new fair task is detached and attached from/to task_group with:
      
        cgroup_post_fork()
          ss->fork(child) := cpu_cgroup_fork()
            sched_move_task()
              task_move_group_fair()
      
      Which is wrong, because at this point in fork() the task isn't fully
      initialized and it cannot 'move' to another group, because it's not
      attached to any group yet.
      
      In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(),
      so we can just call that small part directly instead of sched_move_task().
      And the task doesn't really migrate because it is not yet attached, so we
      need the following sequence:
      
        do_fork()
          sched_fork()
            __set_task_cpu()
      
          cgroup_post_fork()
            set_task_rq() # set task group and runqueue
      
          wake_up_new_task()
            select_task_rq() can select a new cpu
            __set_task_cpu
            post_init_entity_util_avg
              attach_task_cfs_rq()
            activate_task
              enqueue_task
      
      This patch makes that happen.
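
      A sketch of the resulting fork-time path, assuming the TASK_SET_GROUP
      plumbing added here (simplified):

        static void cpu_cgroup_fork(struct task_struct *task)
        {
                struct rq_flags rf;
                struct rq *rq;

                rq = task_rq_lock(task, &rf);

                /* Only set the group and runqueue; there is nothing to
                 * detach/attach because the task isn't attached yet. */
                sched_change_group(task, TASK_SET_GROUP);

                task_rq_unlock(rq, task, &rf);
        }

        /* Fair-class side of TASK_SET_GROUP: */
        static void task_set_group_fair(struct task_struct *p)
        {
                struct sched_entity *se = &p->se;

                set_task_rq(p, task_cpu(p));
                se->depth = se->parent ? se->parent->depth + 1 : 0;
        }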
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [ Added TASK_SET_GROUP to set depth properly. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ea86cb4b