1. 04 Dec, 2015 7 commits
    • sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline · b0367629
      Waiman Long committed
      If a system with large number of sockets was driven to full
      utilization, it was found that the clock tick handling occupied a
      rather significant proportion of CPU time when fair group scheduling
      and autogroup were enabled.
      
      Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
      profile looked like:
      
        10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
         5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      In particular, the high CPU time consumed by update_cfs_shares()
      was mostly due to contention on the cacheline that contained the
      task_group's load_avg statistical counter. This cacheline may also
      contain variables like shares, cfs_rq & se which are accessed rather
      frequently during clock tick processing.
      
      This patch moves the load_avg variable into another cacheline
      separated from the other frequently accessed variables. It also
      creates a cacheline-aligned kmemcache for task_group to make sure
      that all allocated task_group structures are cacheline aligned.
      
      By doing so, the perf profile became:
      
         9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
         4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      The %cpu time is still pretty high, but it is better than before. The
      benchmark results before and after the patch were as follows:
      
        Before patch - Max-jOPs: 907533    Critical-jOps: 134877
        After patch  - Max-jOPs: 916011    Critical-jOps: 142366
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
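The layout change described in this commit can be illustrated with a minimal user-space sketch. It is not the kernel's code: `demo_task_group`, `CACHELINE_BYTES` and the helper functions are hypothetical names, a 64-byte line is assumed, and C11 `_Alignas` stands in for the kernel's own alignment annotation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size; real hardware and kernel configs may differ. */
#define CACHELINE_BYTES 64

/* Hypothetical, heavily simplified stand-in for the kernel's task_group. */
struct demo_task_group {
	/* Fields that are read on every clock tick. */
	unsigned long shares;
	void *cfs_rq;
	void *se;

	/* Frequently written counter, pushed onto a cache line of its own. */
	_Alignas(CACHELINE_BYTES) long load_avg;
};

/* The member alignment forces load_avg to start on a line boundary. */
static_assert(offsetof(struct demo_task_group, load_avg) % CACHELINE_BYTES == 0,
	      "load_avg must start a new cache line");

/* Return nonzero when two member offsets fall in the same cache line. */
static int same_cacheline(size_t a, size_t b)
{
	return a / CACHELINE_BYTES == b / CACHELINE_BYTES;
}

/*
 * Check that load_avg shares a line with none of the read-mostly fields.
 * This only holds at run time if the object itself is line aligned, which
 * is the job of the aligned allocation cache mentioned in the commit.
 */
static int load_avg_is_isolated(void)
{
	size_t hot = offsetof(struct demo_task_group, load_avg);

	return !same_cacheline(hot, offsetof(struct demo_task_group, shares)) &&
	       !same_cacheline(hot, offsetof(struct demo_task_group, cfs_rq)) &&
	       !same_cacheline(hot, offsetof(struct demo_task_group, se));
}
```

Writers of `load_avg` then invalidate only the line that holds the counter, so readers of the other fields should no longer take a miss on every remote update.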
    • sched/core: Move the sched_to_prio[] arrays out of line · ed82b8a1
      Andi Kleen committed
      When building a kernel with a gcc 6 snapshot, the compiler complains
      about unused const static variables for prio_to_weight and prio_to_mult
      for multiple scheduler files (all but core.c and autogroup.c).
      
      The way the array is currently declared, it will be duplicated in
      every scheduler file that includes sched.h, which seems rather wasteful.
      
      Move the array out of line into core.c. I also added a sched_ prefix
      to avoid any potential name space collisions.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
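The "out of line" pattern from this commit can be sketched in a single file, with comments marking what would live in the shared header and what would live in the one defining source file. All `demo_` names are hypothetical, and the five weights are only an illustrative excerpt (nice -2 to 2), not the full kernel table.

```c
#include <assert.h>

/* --- Would go in the shared header: a declaration only, no storage. --- */
#define DEMO_NICE_MIN (-2)
#define DEMO_NICE_MAX 2
extern const int demo_prio_to_weight[DEMO_NICE_MAX - DEMO_NICE_MIN + 1];

/* --- Would go in exactly one .c file: the single definition. --- */
const int demo_prio_to_weight[DEMO_NICE_MAX - DEMO_NICE_MIN + 1] = {
	1586, 1277, 1024, 820, 655,	/* illustrative excerpt */
};

/* Look up the weight for a nice value inside the demo range. */
static int demo_weight_for_nice(int nice)
{
	assert(nice >= DEMO_NICE_MIN && nice <= DEMO_NICE_MAX);
	return demo_prio_to_weight[nice - DEMO_NICE_MIN];
}
```

With a `static const` array in the header instead, every file including it would get a private copy, and files that never use it would trigger the unused-variable warning the commit describes.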
    • sched/fair: Make it possible to account fair load avg consistently · ad936d86
      Byungchul Park committed
      The current code accounts for the time a task was absent from the fair
      class (per ATTACH_AGE_LOAD). However it does not work correctly when a
      task got migrated or moved to another cgroup while outside of the fair
      class.
      
      This patch tries to address that by aging on migration. We locklessly
      read the 'last_update_time' stamp from both the old and new cfs_rq,
      age the load up to the old time, and set it to the new time.
      
      These timestamps should in general not be more than 1 tick apart from
      one another, so there is a definite bound on things.
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      [ Changelog, a few edits and !SMP build fix ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
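The "age up to the old time, then adopt the new time" step can be sketched as follows. This is a rough model under stated assumptions, not the kernel's implementation: the `demo_` types are hypothetical, timestamps are counted in abstract periods, plain reads stand in for the lockless reads, and the decay is approximated by halving once per 32 elapsed periods.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for a runqueue and a scheduling entity. */
struct demo_cfs_rq {
	uint64_t last_update_time;	/* in abstract periods */
};

struct demo_se {
	uint64_t last_update_time;
	uint64_t load_avg;
};

/* Assumed half-life of the decay, in periods. */
#define DEMO_HALF_LIFE 32

/* Coarse decay: halve the load once per full half-life that elapsed. */
static uint64_t demo_decay(uint64_t load, uint64_t periods)
{
	uint64_t halvings = periods / DEMO_HALF_LIFE;

	return halvings >= 64 ? 0 : load >> halvings;
}

/* Age the entity against the old queue's clock, then adopt the new clock. */
static void demo_migrate(struct demo_se *se,
			 const struct demo_cfs_rq *prev,
			 const struct demo_cfs_rq *next)
{
	assert(se && prev && next);

	uint64_t p_last = prev->last_update_time;
	uint64_t n_last = next->last_update_time;

	if (p_last > se->last_update_time)
		se->load_avg = demo_decay(se->load_avg,
					  p_last - se->last_update_time);

	se->last_update_time = n_last;
}
```

After the call, later updates on the new queue measure elapsed time from the new queue's stamp, so the time spent outside the fair class is decayed exactly once.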
    • sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() · ecf7d01c
      Peter Zijlstra committed
      Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
      that we'll prematurely continue with the wakeup and effectively run p on
      two CPUs at the same time.
      
      Even though the overlap is very limited (the task is in the middle of
      being scheduled out), it could still result in corruption of the
      scheduler data structures.
      
              CPU0                            CPU1
      
              set_current_state(...)
      
              <preempt_schedule>
                context_switch(X, Y)
                  prepare_lock_switch(Y)
                    Y->on_cpu = 1;
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
                                              try_to_wake_up(X)
                                                LOCK(p->pi_lock);
      
                                                t = X->on_cpu; // 0
      
                context_switch(Y, X)
                  prepare_lock_switch(X)
                    X->on_cpu = 1;
                  finish_lock_switch(Y)
                    store_release(Y->on_cpu, 0);
              </preempt_schedule>
      
              schedule();
                deactivate_task(X);
                X->on_rq = 0;
      
                                                if (X->on_rq) // false
      
                                                if (t) while (X->on_cpu)
                                                  cpu_relax();
      
                context_switch(X, ..)
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
      Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
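The ordering requirement in the diagram above (the `on_cpu` load must not be hoisted over the `on_rq` load) can be approximated in user space with C11 atomics. This is only an analogy under assumptions: `demo_task`, `demo_finish_switch` and `demo_wake` are hypothetical, and an acquire fence stands in for the kernel's read barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical task with the two flags from the diagram. */
struct demo_task {
	atomic_int on_rq;
	atomic_int on_cpu;
};

/* Scheduler side: publish "no longer running" with release semantics. */
static void demo_finish_switch(struct demo_task *prev)
{
	assert(prev != NULL);
	atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);
}

/*
 * Waker side. Returns 1 if the task was still queued, otherwise waits
 * until the task is off its CPU and returns 0.
 */
static int demo_wake(struct demo_task *p)
{
	assert(p != NULL);

	if (atomic_load_explicit(&p->on_rq, memory_order_relaxed))
		return 1;

	/* Keep the on_cpu load below from being satisfied before on_rq. */
	atomic_thread_fence(memory_order_acquire);

	/* Pairs with the release store in demo_finish_switch(). */
	while (atomic_load_explicit(&p->on_cpu, memory_order_acquire))
		;	/* the kernel would call cpu_relax() here */

	return 0;
}
```

Without the fence, a stale `on_cpu == 0` could be combined with a fresh `on_rq == 0`, which is the false observation the commit message describes.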
    • sched/core: Better document the try_to_wake_up() barriers · b75a2253
      Peter Zijlstra committed
      Explain how the control dependency and smp_rmb() end up providing
      ACQUIRE semantics and pair with smp_store_release() in
      finish_lock_switch().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Clear the root_domain cpumasks in init_rootdomain() · 8295c699
      Xunlei Pang committed
      root_domain::rto_mask allocated through alloc_cpumask_var()
      contains garbage data, which may cause problems. For instance,
      pull_rt_task() may do useless iterations if rto_mask retains
      some extra garbage bits. Worse still, this violates the
      isolated domain rule for clustered scheduling using cpuset,
      because tasks (with all CPUs allowed) that belong to one
      root domain can be pulled away into another root domain.
      
      The patch cleans the garbage by using zalloc_cpumask_var()
      instead of alloc_cpumask_var() for root_domain::rto_mask
      allocation, thereby addressing the issues.
      
      Do the same thing for root_domain's other cpumask members:
      dlo_mask, span, and online.
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
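Why a zeroing allocator matters for a CPU mask can be shown with a user-space analogy. The sketch below assumes a hypothetical 256-CPU bitmap and uses `calloc()` in place of the kernel's zeroing cpumask allocator; all `demo_` names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Assumed mask size for the sketch. */
#define DEMO_NR_CPUS 256
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_MASK_LONGS \
	((DEMO_NR_CPUS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Zeroing allocation: every CPU bit starts out cleared. */
static unsigned long *demo_zalloc_mask(void)
{
	return calloc(DEMO_MASK_LONGS, sizeof(unsigned long));
}

/* Mark one CPU in the mask. */
static void demo_set_cpu(unsigned long *mask, unsigned int cpu)
{
	assert(cpu < DEMO_NR_CPUS);
	mask[cpu / DEMO_BITS_PER_LONG] |= 1UL << (cpu % DEMO_BITS_PER_LONG);
}

/* Count the set bits, i.e. the CPUs a scan of the mask would visit. */
static int demo_mask_weight(const unsigned long *mask)
{
	int count = 0;

	for (size_t i = 0; i < DEMO_MASK_LONGS; i++)
		for (unsigned long w = mask[i]; w; w &= w - 1)
			count++;
	return count;
}
```

A mask obtained from a non-zeroing allocator could report an arbitrary weight before any CPU was set, which corresponds to the useless iterations and cross-domain pulls described in the commit.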
    • sched/core: Remove false-positive warning from wake_up_process() · 119d6f6a
      Sasha Levin committed
      Because wakeups can (fundamentally) be late, a task might not be in
      the expected state. Therefore testing against a task's state is racy,
      and can yield false positives.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: oleg@redhat.com
      Fixes: 9067ac85 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
      Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 23 Nov, 2015 2 commits
  3. 23 Oct, 2015 1 commit
  4. 20 Oct, 2015 4 commits
  5. 16 Oct, 2015 1 commit
    • cgroup: keep zombies associated with their original cgroups · 2e91fa7f
      Tejun Heo committed
      cgroup_exit() is called when a task exits; it disassociates the
      exiting task from its cgroups and half-attaches it to the root cgroup.
      This is unnecessary and undesirable.
      
      No controller actually needs an exiting task to be disassociated with
      non-root cgroups.  Both cpu and perf_event controllers update the
      association to the root cgroup from their exit callbacks just to keep
      consistent with the cgroup core behavior.
      
      Also, this disassociation makes it difficult to track resources held
      by zombies or determine where the zombies came from.  Currently, the pids
      controller is completely broken as it uncharges on exit and zombies
      always escape the resource restriction.  With cgroup association being
      reset on exit, fixing it is pretty painful.
      
      There's no reason to reset cgroup membership on exit.  The zombie can
      be removed from its css_set so that it doesn't show up on
      "cgroup.procs" and thus can't be migrated or interfere with cgroup
      removal.  It can still pin and point to the css_set so that its cgroup
      membership is maintained.  This patch makes cgroup core keep zombies
      associated with their cgroups at the time of exit.
      
      * Previous patches decoupled populated_cnt tracking from css_set
        lifetime, so a dying task can be simply unlinked from its css_set
        while pinning and pointing to the css_set.  This keeps css_set
        association from task side alive while hiding it from "cgroup.procs"
        and populated_cnt tracking.  The css_set reference is dropped when
        the task_struct is freed.
      
      * ->exit() callback no longer needs the css arguments as the
        associated css never changes once PF_EXITING is set.  Removed.
      
      * cpu and perf_events controllers no longer need ->exit() callbacks.
        There's no reason to explicitly switch away on exit.  The final
        schedule out is enough.  The callbacks are removed.
      
      * On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
        still reports "/" for all zombies.  On the default hierarchy,
        "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
        to at the time of exit.  If the cgroup gets removed before the task
        is reaped, " (deleted)" is appended.
      
      v2: Build breakage due to missing dummy cgroup_free() when
          !CONFIG_CGROUP fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
  6. 07 Oct, 2015 1 commit
  7. 06 Oct, 2015 13 commits
  8. 23 Sep, 2015 1 commit
  9. 18 Sep, 2015 3 commits
  10. 13 Sep, 2015 7 commits