1. 23 2月, 2016 1 次提交
  2. 05 12月, 2015 1 次提交
    • P
      rcu: Stop disabling interrupts in scheduler fastpaths · 46a5d164
      Paul E. McKenney 提交于
      We need the scheduler's fastpaths to be, well, fast, and unnecessarily
      disabling and re-enabling interrupts is not necessarily consistent with
      this goal.  Especially given that there are regions of the scheduler that
      already have interrupts disabled.
      
      This commit therefore moves the call to rcu_note_context_switch()
      to one of the interrupts-disabled regions of the scheduler, and
      removes the now-redundant disabling and re-enabling of interrupts from
      rcu_note_context_switch() and the functions it calls.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
        by Peter Zijlstra. ]
      46a5d164
  3. 04 12月, 2015 9 次提交
    • W
      sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline · b0367629
      Waiman Long 提交于
      If a system with large number of sockets was driven to full
      utilization, it was found that the clock tick handling occupied a
      rather significant proportion of CPU time when fair group scheduling
      and autogroup were enabled.
      
      Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
      profile looked like:
      
        10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
         5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      In particular, the high CPU time consumed by update_cfs_shares()
      was mostly due to contention on the cacheline that contained the
      task_group's load_avg statistical counter. This cacheline may also
      contains variables like shares, cfs_rq & se which are accessed rather
      frequently during clock tick processing.
      
      This patch moves the load_avg variable into another cacheline
      separated from the other frequently accessed variables. It also
      creates a cacheline aligned kmemcache for task_group to make sure
      that all the allocated task_group's are cacheline aligned.
      
      By doing so, the perf profile became:
      
         9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
         4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      The %cpu time is still pretty high, but it is better than before. The
      benchmark results before and after the patch was as follows:
      
        Before patch - Max-jOPs: 907533    Critical-jOps: 134877
        After patch  - Max-jOPs: 916011    Critical-jOps: 142366
      Signed-off-by: NWaiman Long <Waiman.Long@hpe.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b0367629
    • A
      sched/core: Move the sched_to_prio[] arrays out of line · ed82b8a1
      Andi Kleen 提交于
      When building a kernel with a gcc 6 snapshot the compiler complains
      about unused const static variables for prio_to_weight and prio_to_mult
      for multiple scheduler files (all but core.c and autogroup.c)
      
      The way the array is currently declared it will be duplicated in
      every scheduler file that includes sched.h, which seems rather wasteful.
      
      Move the array out of line into core.c. I also added a sched_ prefix
      to avoid any potential name space collisions.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ed82b8a1
    • B
      sched/fair: Make it possible to account fair load avg consistently · ad936d86
      Byungchul Park 提交于
      The current code accounts for the time a task was absent from the fair
      class (per ATTACH_AGE_LOAD). However it does not work correctly when a
      task got migrated or moved to another cgroup while outside of the fair
      class.
      
      This patch tries to address that by aging on migration. We locklessly
      read the 'last_update_time' stamp from both the old and new cfs_rq,
      ages the load upto the old time, and sets it to the new time.
      
      These timestamps should in general not be more than 1 tick apart from
      one another, so there is a definite bound on things.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      [ Changelog, a few edits and !SMP build fix ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ad936d86
    • P
      sched/core, locking: Document Program-Order guarantees · 8643cda5
      Peter Zijlstra 提交于
      These are some notes on the scheduler locking and how it provides
      program order guarantees on SMP systems.
      
      ( This commit is in the locking tree, because the new documentation
        refers to a newly introduced locking primitive. )
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8643cda5
    • P
      locking, sched: Introduce smp_cond_acquire() and use it · b3e0b1b6
      Peter Zijlstra 提交于
      Introduce smp_cond_acquire() which combines a control dependency and a
      read barrier to form acquire semantics.
      
      This primitive has two benefits:
      
       - it documents control dependencies,
       - its typically cheaper than using smp_load_acquire() in a loop.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b3e0b1b6
    • P
      sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() · ecf7d01c
      Peter Zijlstra 提交于
      Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
      that we'll prematurely continue with the wakeup and effectively run p on
      two CPUs at the same time.
      
      Even though the overlap is very limited; the task is in the middle of
      being scheduled out; it could still result in corruption of the
      scheduler data structures.
      
              CPU0                            CPU1
      
              set_current_state(...)
      
              <preempt_schedule>
                context_switch(X, Y)
                  prepare_lock_switch(Y)
                    Y->on_cpu = 1;
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
                                              try_to_wake_up(X)
                                                LOCK(p->pi_lock);
      
                                                t = X->on_cpu; // 0
      
                context_switch(Y, X)
                  prepare_lock_switch(X)
                    X->on_cpu = 1;
                  finish_lock_switch(Y)
                    store_release(Y->on_cpu, 0);
              </preempt_schedule>
      
              schedule();
                deactivate_task(X);
                X->on_rq = 0;
      
                                                if (X->on_rq) // false
      
                                                if (t) while (X->on_cpu)
                                                  cpu_relax();
      
                context_switch(X, ..)
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
      Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ecf7d01c
    • P
      sched/core: Better document the try_to_wake_up() barriers · b75a2253
      Peter Zijlstra 提交于
      Explain how the control dependency and smp_rmb() end up providing
      ACQUIRE semantics and pair with smp_store_release() in
      finish_lock_switch().
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b75a2253
    • X
      sched/core: Clear the root_domain cpumasks in init_rootdomain() · 8295c699
      Xunlei Pang 提交于
      root_domain::rto_mask allocated through alloc_cpumask_var()
      contains garbage data, this may cause problems. For instance,
      When doing pull_rt_task(), it may do useless iterations if
      rto_mask retains some extra garbage bits. Worse still, this
      violates the isolated domain rule for clustered scheduling
      using cpuset, because the tasks(with all the cpus allowed)
      belongs to one root domain can be pulled away into another
      root domain.
      
      The patch cleans the garbage by using zalloc_cpumask_var()
      instead of alloc_cpumask_var() for root_domain::rto_mask
      allocation, thereby addressing the issues.
      
      Do the same thing for root_domain's other cpumask memembers:
      dlo_mask, span, and online.
      Signed-off-by: NXunlei Pang <xlpang@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8295c699
    • S
      sched/core: Remove false-positive warning from wake_up_process() · 119d6f6a
      Sasha Levin 提交于
      Because wakeups can (fundamentally) be late, a task might not be in
      the expected state. Therefore testing against a task's state is racy,
      and can yield false positives.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: oleg@redhat.com
      Fixes: 9067ac85 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
      Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      119d6f6a
  4. 03 12月, 2015 2 次提交
    • O
      cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends · b53202e6
      Oleg Nesterov 提交于
      Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
      kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b53202e6
    • T
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo 提交于
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  5. 23 11月, 2015 2 次提交
  6. 23 10月, 2015 1 次提交
  7. 20 10月, 2015 4 次提交
  8. 16 10月, 2015 1 次提交
    • T
      cgroup: keep zombies associated with their original cgroups · 2e91fa7f
      Tejun Heo 提交于
      cgroup_exit() is called when a task exits and disassociates the
      exiting task from its cgroups and half-attach it to the root cgroup.
      This is unnecessary and undesirable.
      
      No controller actually needs an exiting task to be disassociated with
      non-root cgroups.  Both cpu and perf_event controllers update the
      association to the root cgroup from their exit callbacks just to keep
      consistent with the cgroup core behavior.
      
      Also, this disassociation makes it difficult to track resources held
      by zombies or determine where the zombies came from.  Currently, pids
      controller is completely broken as it uncharges on exit and zombies
      always escape the resource restriction.  With cgroup association being
      reset on exit, fixing it is pretty painful.
      
      There's no reason to reset cgroup membership on exit.  The zombie can
      be removed from its css_set so that it doesn't show up on
      "cgroup.procs" and thus can't be migrated or interfere with cgroup
      removal.  It can still pin and point to the css_set so that its cgroup
      membership is maintained.  This patch makes cgroup core keep zombies
      associated with their cgroups at the time of exit.
      
      * Previous patches decoupled populated_cnt tracking from css_set
        lifetime, so a dying task can be simply unlinked from its css_set
        while pinning and pointing to the css_set.  This keeps css_set
        association from task side alive while hiding it from "cgroup.procs"
        and populated_cnt tracking.  The css_set reference is dropped when
        the task_struct is freed.
      
      * ->exit() callback no longer needs the css arguments as the
        associated css never changes once PF_EXITING is set.  Removed.
      
      * cpu and perf_events controllers no longer need ->exit() callbacks.
        There's no reason to explicitly switch away on exit.  The final
        schedule out is enough.  The callbacks are removed.
      
      * On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
        still reports "/" for all zombies.  On the default hierarchy,
        "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
        to at the time of exit.  If the cgroup gets removed before the task
        is reaped, " (deleted)" is appended.
      
      v2: Build brekage due to missing dummy cgroup_free() when
          !CONFIG_CGROUP fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      2e91fa7f
  9. 07 10月, 2015 1 次提交
  10. 06 10月, 2015 13 次提交
  11. 23 9月, 2015 1 次提交
  12. 18 9月, 2015 3 次提交
  13. 13 9月, 2015 1 次提交