1. 19 Jan 2016, 1 commit
  2. 15 Jan 2016, 1 commit
    • vmstat: make vmstat_updater deferrable again and shut down on idle · 0eb77e98
      By Christoph Lameter
      Currently the vmstat updater is not deferrable as a result of commit
      ba4877b9 ("vmstat: do not use deferrable delayed work for
      vmstat_update").  This in turn can cause multiple interruptions of the
      applications because the vmstat updater may run at any time.
      
      Make vmstat_update deferrable again and provide a function that folds
      the differentials when the processor is going to idle mode, thus
      addressing the issue of the above commit in a clean way.
      
      Note that the shepherd thread will continue scanning the differentials
      from another processor and will reenable the vmstat workers if it
      detects any changes.
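
      A minimal sketch of the distinction this relies on, using the standard
      workqueue declaration macros (illustrative only; quiet_vmstat() is the
      fold-on-idle helper this commit introduces):

        #include <linux/workqueue.h>

        static void vmstat_update(struct work_struct *w);

        /* Non-deferrable: the timer fires even on an idle CPU, waking it
         * just to fold per-cpu counter differentials. */
        static DECLARE_DELAYED_WORK(vmstat_work, vmstat_update);

        /* Deferrable: an idle CPU is left alone and the work runs once the
         * CPU wakes for another reason; the idle entry path calls
         * quiet_vmstat() so differentials are folded before sleeping. */
        static DECLARE_DEFERRABLE_WORK(vmstat_work_def, vmstat_update);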
      
      Fixes: ba4877b9 ("vmstat: do not use deferrable delayed work for vmstat_update")
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0eb77e98
  3. 06 Jan 2016, 3 commits
  4. 21 Dec 2015, 1 commit
    • missing include asm/paravirt.h in cputime.c · 1fe7c4ef
      By Stefano Stabellini
      Add include asm/paravirt.h to cputime.c, as steal_account_process_tick
      calls paravirt_steal_clock, which is defined in asm/paravirt.h.
      
      The ifdef CONFIG_PARAVIRT is necessary because not all archs have an
      asm/paravirt.h to include.
      
      The reason cputime.c currently compiles even though the include of
      <asm/paravirt.h> is missing is that, on x86, asm/paravirt.h is pulled
      in by one of the other headers included in kernel/sched/cputime.c.
      
      On arm and arm64, where I am about to introduce asm/paravirt.h and
      stolen time support, I would get an error without #include
      <asm/paravirt.h> in cputime.c.
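
      The guarded include described above looks like this (a sketch of the
      addition to kernel/sched/cputime.c):

        #ifdef CONFIG_PARAVIRT
        #include <asm/paravirt.h>
        #endif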
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      1fe7c4ef
  5. 14 Dec 2015, 1 commit
  6. 09 Dec 2015, 1 commit
    • watchdog: introduce touch_softlockup_watchdog_sched() · 03e0d461
      By Tejun Heo
      touch_softlockup_watchdog() is used to tell the watchdog that a
      scheduler stall is expected.  One group of usages is from paths where
      the task may not be able to yield for a long time, such as performing
      slow PIO to a finicky device or coming out of suspend.  The other is
      to account for the scheduler and timer going idle.
      
      For scheduler softlockup detection, there's no reason to distinguish
      the two cases; however, a workqueue lockup detector is planned and it
      can use the same signals from the former group, while the latter would
      spuriously prevent detection.  This patch introduces a new function,
      touch_softlockup_watchdog_sched(), and converts the latter group to
      call it instead.  For now, it just calls touch_softlockup_watchdog()
      and there's no functional difference.
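
      A sketch of the resulting split (the per-cpu timestamp name is taken
      from the watchdog code of that era; treat the bodies as illustrative,
      since the commit states there is no functional difference yet):

        void touch_softlockup_watchdog_sched(void)
        {
                /* an apparent stall is expected; reset the touch stamp */
                __this_cpu_write(watchdog_touch_ts, 0);
        }

        void touch_softlockup_watchdog(void)
        {
                /* pure alias for now; the planned workqueue lockup
                 * detector will key off this variant but not _sched() */
                touch_softlockup_watchdog_sched();
        }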
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      03e0d461
  7. 05 Dec 2015, 1 commit
    • rcu: Stop disabling interrupts in scheduler fastpaths · 46a5d164
      By Paul E. McKenney
      We need the scheduler's fastpaths to be, well, fast, and unnecessarily
      disabling and re-enabling interrupts is not necessarily consistent with
      this goal, especially given that there are regions of the scheduler
      that already have interrupts disabled.
      
      This commit therefore moves the call to rcu_note_context_switch()
      to one of the interrupts-disabled regions of the scheduler, and
      removes the now-redundant disabling and re-enabling of interrupts from
      rcu_note_context_switch() and the functions it calls.
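
      In sketch form (illustrative before/after; the real function records
      more state than shown):

        /* Before: the function toggled IRQs itself on every call. */
        void rcu_note_context_switch(void)
        {
                unsigned long flags;

                local_irq_save(flags);
                rcu_sched_qs();         /* record the quiescent state */
                local_irq_restore(flags);
        }

        /* After: the scheduler calls it from a region that already runs
         * with IRQs disabled, so the save/restore pair is dropped. */
        void rcu_note_context_switch(void)
        {
                rcu_sched_qs();         /* caller guarantees IRQs off */
        }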
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
        by Peter Zijlstra. ]
      46a5d164
  8. 04 Dec 2015, 19 commits
    • sched/fair: Disable the task group load_avg update for the root_task_group · aa0b7ae0
      By Waiman Long
      Currently, the update_tg_load_avg() function attempts to update the
      tg's load_avg value whenever the load changes, even for the
      root_task_group, where the load_avg value will never be used. This
      patch disables the load_avg update when the given task group is the
      root_task_group.
      
      Running a Java benchmark with noautogroup and a 4.3 kernel on a
      16-socket IvyBridge-EX system, the amount of CPU time (as reported by
      perf) consumed by task_tick_fair(), which includes update_tg_load_avg(),
      decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
      results also increased slightly, from 983015 to 986449.
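
      A sketch of the check (the threshold logic is illustrative; the
      commit's point is the early return for the root group):

        static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
        {
                long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

                /* root_task_group's load_avg is never consumed, so skip
                 * the atomic add and the cacheline traffic it causes */
                if (cfs_rq->tg == &root_task_group)
                        return;

                if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                        atomic_long_add(delta, &cfs_rq->tg->load_avg);
                        cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
                }
        }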
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aa0b7ae0
    • sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline · b0367629
      By Waiman Long
      If a system with a large number of sockets was driven to full
      utilization, it was found that the clock tick handling occupied a
      rather significant proportion of CPU time when fair group scheduling
      and autogroup were enabled.
      
      Running a Java benchmark on a 16-socket IvyBridge-EX system, the perf
      profile looked like this:
      
        10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
         5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      In particular, the high CPU time consumed by update_cfs_shares()
      was mostly due to contention on the cacheline that contained the
      task_group's load_avg statistical counter. This cacheline may also
      contain variables like shares, cfs_rq and se, which are accessed
      rather frequently during clock tick processing.

      This patch moves the load_avg variable into another cacheline,
      separated from the other frequently accessed variables. It also
      creates a cacheline-aligned kmem_cache for task_group to make sure
      that all allocated task_groups are cacheline aligned.
      
      By doing so, the perf profile became:
      
         9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
         4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      The %cpu time is still pretty high, but it is better than before. The
      benchmark results before and after the patch were as follows:
      
        Before patch - Max-jOPs: 907533    Critical-jOps: 134877
        After patch  - Max-jOPs: 916011    Critical-jOps: 142366
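
      A sketch of both parts of the change (fields abbreviated; the slab
      flag shown is one way to get the alignment, used here illustratively):

        struct task_group {
                /* cfs_rq, se, shares, ...: read-mostly at tick time */
        #ifdef CONFIG_SMP
                /*
                 * Written frequently under load; its own cacheline keeps
                 * writers from invalidating the read-mostly fields above.
                 */
                atomic_long_t load_avg ____cacheline_aligned;
        #endif
        };

        /* allocate task_groups from a cacheline-aligned slab so the
         * layout above holds for every instance */
        task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);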
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b0367629
    • sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() · a426f99c
      By Waiman Long
      Part of the responsibility of the update_sg_lb_stats() function is to
      update the idle_cpus statistical counter in struct sg_lb_stats. This
      is done by calling idle_cpu(). The idle_cpu() function, in turn,
      checks a number of fields within the run queue structure, such as
      rq->curr and rq->nr_running.
      
      With the current layout of the run queue structure, rq->curr and
      rq->nr_running are in separate cachelines. The rq->curr variable is
      checked first, followed by nr_running. As nr_running is also accessed
      by update_sg_lb_stats() earlier, it makes no sense to load another
      cacheline when nr_running is not 0, since idle_cpu() will always
      return false in this case.
      
      This patch eliminates this redundant cacheline load by checking the
      cached nr_running before calling idle_cpu().
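
      In sketch form, inside the per-CPU loop of update_sg_lb_stats():

        unsigned int nr_running = rq->nr_running;  /* cacheline already hot */

        sgs->sum_nr_running += nr_running;
        /*
         * idle_cpu() starts by reading rq->curr on a different cacheline;
         * with nr_running != 0 it must return false anyway, so only pay
         * for the call when the cached value is zero.
         */
        if (!nr_running && idle_cpu(i))
                sgs->idle_cpus++;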
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a426f99c
    • sched/core: Move the sched_to_prio[] arrays out of line · ed82b8a1
      By Andi Kleen
      When building a kernel with a gcc 6 snapshot, the compiler complains
      about unused static const variables for prio_to_weight and
      prio_to_mult in multiple scheduler files (all but core.c and
      autogroup.c).
      
      The way the arrays are currently declared, they will be duplicated in
      every scheduler file that includes sched.h, which seems rather
      wasteful.

      Move the arrays out of line into core.c. I also added a sched_ prefix
      to avoid any potential namespace collisions.
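
      The before/after shape, in sketch form (array contents elided):

        /* before, in kernel/sched/sched.h: every includer got a copy */
        static const int prio_to_weight[40] = { /* ... */ };

        /* after: a single definition in kernel/sched/core.c ... */
        const int sched_prio_to_weight[40] = { /* ... */ };

        /* ... and an extern declaration in kernel/sched/sched.h */
        extern const int sched_prio_to_weight[40];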
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ed82b8a1
    • sched/cputime: Convert vtime_seqlock to seqcount · b7ce2277
      By Frederic Weisbecker
      The cputime can only be updated by the current task itself, even in
      the vtime case. So we can safely use a seqcount instead of a seqlock,
      as there is no writer concurrency involved.
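
      A sketch of the reader side after the conversion (the field becomes a
      seqcount_t named vtime_seqcount; reader loop shown illustratively):

        unsigned int seq;
        cputime_t gtime;

        do {
                seq = read_seqcount_begin(&t->vtime_seqcount);
                gtime = t->gtime;   /* plus the tickless delta, if any */
        } while (read_seqcount_retry(&t->vtime_seqcount, seq));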
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b7ce2277
    • sched/cputime: Introduce vtime accounting check for readers · e5925394
      By Frederic Weisbecker
      Readers need to know if vtime runs at all on some CPU somewhere; this
      is a fast-path check to determine if we need to check further the need
      to add up any tickless cputime delta.

      This fast-path check uses the context tracking state because vtime is
      tied to context tracking as of now. This check appears to be
      confusing, though, so let's use a vtime function that deals with the
      context tracking details inside the vtime implementation instead.
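
      The wrapper this introduces, in sketch form (per the description,
      vtime is tied to context tracking today, so the check delegates):

        static inline bool vtime_accounting_enabled(void)
        {
                /* true if vtime runs on some CPU somewhere */
                return context_tracking_is_enabled();
        }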
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-7-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e5925394
    • sched/cputime: Rename vtime_accounting_enabled() to vtime_accounting_cpu_enabled() · 55dbdcfa
      By Frederic Weisbecker
      vtime_accounting_enabled() checks if vtime is running on the current
      CPU and is as such a misnomer. Let's rename it to a function that
      reflects its locality. We are going to need the current name for a
      function that tells if vtime runs at all on some CPU.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-6-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      55dbdcfa
    • sched/cputime: Correctly handle task guest time on housekeepers · cab245d6
      By Frederic Weisbecker
      When a task runs on a housekeeper (a CPU running with the periodic
      tick alongside neighbours running tickless), it doesn't account
      cputime using vtime but relies on the tick. Such a task has its
      vtime_snap_whence value set to VTIME_INACTIVE.

      Readers don't handle that correctly, though. As long as vtime is
      running on some CPU, readers incorrectly assume that vtime runs on
      all CPUs and always compute the tickless cputime delta, which is
      just junk on housekeepers.

      So let's fix this by checking, through the appropriate state check,
      that the target runs on a vtime CPU before computing the tickless
      delta.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-5-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cab245d6
    • sched/cputime: Clarify vtime symbols and document them · 7098c1ea
      By Frederic Weisbecker
      The VTIME_SLEEPING state happens either when:

      1) The task is sleeping and no tickless delta is to be added to the
         task's cputime stats.
      2) The CPU isn't running vtime at all, so the same properties as in
         1) apply.

      Let's rename the vtime symbol to reflect both states.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-4-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7098c1ea
    • sched/cputime: Remove extra cost in task_cputime() · 7877a0ba
      By Hiroshi Shimamoto
      There is an extra cost in task_cputime() and task_cputime_scaled()
      when nohz_full is not activated. When vtime accounting is not enabled,
      we don't need to get deltas of utime and stime under the vtime
      seqlock.

      This patch removes that cost by adding a shortcut route if vtime
      accounting is not enabled.

      Use context_tracking_is_enabled() to check if vtime accounting runs
      on some CPU; only in that case do we need to check the tickless
      cputime delta.
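
      In sketch form (illustrative; the scaled variant is analogous):

        void task_cputime(struct task_struct *t,
                          cputime_t *utime, cputime_t *stime)
        {
                /* shortcut: no CPU does vtime accounting, so there is no
                 * tickless delta and no need for the vtime seqlock */
                if (!context_tracking_is_enabled()) {
                        *utime = t->utime;
                        *stime = t->stime;
                        return;
                }

                /* slow path: snapshot utime/stime under the vtime seqlock
                 * and fold in the tickless delta (elided here) */
        }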
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-3-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7877a0ba
    • sched/fair: Make it possible to account fair load avg consistently · ad936d86
      By Byungchul Park
      The current code accounts for the time a task was absent from the fair
      class (per ATTACH_AGE_LOAD). However, it does not work correctly when
      a task got migrated or moved to another cgroup while outside of the
      fair class.

      This patch tries to address that by aging on migration. We locklessly
      read the 'last_update_time' stamp from both the old and new cfs_rq,
      age the load up to the old time, and set the stamp to the new time.

      These timestamps should in general not be more than 1 tick apart from
      one another, so there is a definite bound on things.
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      [ Changelog, a few edits and !SMP build fix ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ad936d86
    • sched/core, locking: Document Program-Order guarantees · 8643cda5
      By Peter Zijlstra
      These are some notes on the scheduler locking and how it provides
      program order guarantees on SMP systems.
      
      ( This commit is in the locking tree, because the new documentation
        refers to a newly introduced locking primitive. )
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8643cda5
    • locking, sched: Introduce smp_cond_acquire() and use it · b3e0b1b6
      By Peter Zijlstra
      Introduce smp_cond_acquire(), which combines a control dependency and
      a read barrier to form acquire semantics.

      This primitive has two benefits:

       - it documents control dependencies,
       - it's typically cheaper than using smp_load_acquire() in a loop.
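
      The primitive in sketch form, matching that description (spin on the
      condition, then upgrade the control dependency with a read barrier):

        #define smp_cond_acquire(cond) do {             \
                while (!(cond))                         \
                        cpu_relax();                    \
                smp_rmb(); /* ctrl + rmb := acquire */  \
        } while (0)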
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b3e0b1b6
    • sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() · ecf7d01c
      By Peter Zijlstra
      Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
      that we'll prematurely continue with the wakeup and effectively run p
      on two CPUs at the same time.

      Even though the overlap is very limited (the task is in the middle of
      being scheduled out), it could still result in corruption of the
      scheduler data structures.
      
              CPU0                            CPU1
      
              set_current_state(...)
      
              <preempt_schedule>
                context_switch(X, Y)
                  prepare_lock_switch(Y)
                    Y->on_cpu = 1;
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
                                              try_to_wake_up(X)
                                                LOCK(p->pi_lock);
      
                                                t = X->on_cpu; // 0
      
                context_switch(Y, X)
                  prepare_lock_switch(X)
                    X->on_cpu = 1;
                  finish_lock_switch(Y)
                    store_release(Y->on_cpu, 0);
              </preempt_schedule>
      
              schedule();
                deactivate_task(X);
                X->on_rq = 0;
      
                                                if (X->on_rq) // false
      
                                                if (t) while (X->on_cpu)
                                                  cpu_relax();
      
                context_switch(X, ..)
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
      Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
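
      In sketch form, in try_to_wake_up() (pairs with the store_release()
      in finish_lock_switch() shown in the diagram):

        if (p->on_rq && ttwu_remote(p, wake_flags))
                goto stat;

        /*
         * Order the p->on_cpu load after the p->on_rq load so the
         * wait-for-descheduling loop below cannot see a stale 0.
         */
        smp_rmb();

        while (p->on_cpu)
                cpu_relax();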
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ecf7d01c
    • sched/core: Better document the try_to_wake_up() barriers · b75a2253
      By Peter Zijlstra
      Explain how the control dependency and smp_rmb() end up providing
      ACQUIRE semantics and pair with smp_store_release() in
      finish_lock_switch().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b75a2253
    • sched/cputime: Fix invalid gtime in proc · 2541117b
      By Hiroshi Shimamoto
      /proc/<pid>/stat shows an invalid gtime when the thread is running in
      a guest. When vtime accounting is not enabled, we cannot get a valid
      delta. The delta is calculated as now - tsk->vtime_snap, but
      tsk->vtime_snap is only updated when vtime accounting is enabled at
      runtime.

      This patch makes task_gtime() just return gtime, without computing
      the buggy, non-existing tickless delta, when vtime accounting is not
      enabled.

      Use context_tracking_is_enabled() to check if vtime accounting runs
      on some CPU; only in that case do we need to check the tickless
      delta. This way we fix the gtime value regression on machines not
      running nohz_full.
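
      A sketch of the fixed reader (this commit predates the seqcount
      conversion listed above, hence the seqlock; the PF_VCPU test marks
      guest time):

        cputime_t task_gtime(struct task_struct *t)
        {
                unsigned int seq;
                cputime_t gtime;

                /* no vtime accounting anywhere: vtime_snap is stale and
                 * any "delta" would be garbage, so return gtime as-is */
                if (!context_tracking_is_enabled())
                        return t->gtime;

                do {
                        seq = read_seqbegin(&t->vtime_seqlock);
                        gtime = t->gtime;
                        if (t->flags & PF_VCPU)
                                gtime += vtime_delta(t);
                } while (read_seqretry(&t->vtime_seqlock, seq));

                return gtime;
        }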
      
      The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
      CONFIG_NO_HZ_FULL_ALL=n, and the machine boots without nohz_full.

      I ran and stopped a busy loop in a VM and watched the gtime in the
      host, dumping the 43rd field (the gtime) every second:
      
      	 # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
      	S 4348
      	R 7064566
      	R 7064766
      	R 7064967
      	R 7065168
      	S 4759
      	S 4759
      
      While the busy loop is running, it returns a huge value.

      After applying this patch, we can see the right gtime:
      
      	 # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
      	S 5338
      	R 5365
      	R 5465
      	R 5566
      	R 5666
      	S 5726
      	S 5726
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2541117b
    • sched/core: Clear the root_domain cpumasks in init_rootdomain() · 8295c699
      By Xunlei Pang
      root_domain::rto_mask allocated through alloc_cpumask_var() contains
      garbage data; this may cause problems. For instance, when doing
      pull_rt_task(), it may do useless iterations if rto_mask retains some
      extra garbage bits. Worse still, this violates the isolated-domain
      rule for clustered scheduling using cpuset, because tasks (with all
      CPUs allowed) belonging to one root domain can be pulled away into
      another root domain.

      The patch cleans up the garbage by using zalloc_cpumask_var() instead
      of alloc_cpumask_var() for the root_domain::rto_mask allocation,
      thereby addressing the issues.

      Do the same thing for root_domain's other cpumask members: dlo_mask,
      span, and online.
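
      The fix in sketch form, in init_rootdomain() (error labels as in the
      surrounding code):

        if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
                goto out;
        if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
                goto free_span;
        if (!zalloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
                goto free_online;
        /* zeroed, so pull_rt_task() never sees stale garbage bits */
        if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
                goto free_dlo_mask;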
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8295c699
    • sched/core: Remove false-positive warning from wake_up_process() · 119d6f6a
      By Sasha Levin
      Because wakeups can (fundamentally) be late, a task might not be in
      the expected state. Therefore testing against a task's state is racy,
      and can yield false positives.
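
      The dropped assertion, in sketch form (checking p->state from the
      waker's side is inherently racy, hence the false positives):

        int wake_up_process(struct task_struct *p)
        {
                /* removed: a late wakeup can legitimately find the task
                 * already stopped/traced, so this warning fires spuriously
                 *
                 *      WARN_ON(task_is_stopped_or_traced(p));
                 */
                return try_to_wake_up(p, TASK_NORMAL, 0);
        }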
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: oleg@redhat.com
      Fixes: 9067ac85 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
      Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      119d6f6a
    • sched/wait: Fix signal handling in bit wait helpers · 68985633
      By Peter Zijlstra
      Vladimir reported getting RCU stall warnings and bisected it back to
      commit:

        74316201 ("sched: Remove proliferation of wait_on_bit() action functions")

      That commit inadvertently reversed the calls to schedule() and
      signal_pending(), thereby not handling the case where a signal is
      received while we sleep.
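
      In sketch form, for one of the helpers (return values illustrative;
      the other helpers follow the same pattern):

        /* buggy order: a signal arriving during schedule() is never seen,
         * so the wait loop can go back to sleep and stall indefinitely */
        __sched int bit_wait(struct wait_bit_key *word)
        {
                if (signal_pending(current))
                        return 1;
                schedule();
                return 0;
        }

        /* fixed order: sleep first, then check for a pending signal */
        __sched int bit_wait(struct wait_bit_key *word)
        {
                schedule();
                if (signal_pending(current))
                        return -EINTR;
                return 0;
        }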
      Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mark.rutland@arm.com
      Cc: neilb@suse.de
      Cc: oleg@redhat.com
      Fixes: 74316201 ("sched: Remove proliferation of wait_on_bit() action functions")
      Fixes: cbbce822 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
      Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      68985633
  9. 03 Dec 2015, 2 commits
    • cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends · b53202e6
      By Oleg Nesterov
      Now that nobody uses the "priv" arg passed to can_fork/cancel_fork/fork,
      we can kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc. and cgrp_ss_priv[] in
      copy_process().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b53202e6
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      By Tejun Heo
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css
      of P1.  Now if memory is enabled on P1's subtree_control, memory
      csses should be created on both A and B, and A's processes should be
      moved to the former and B's processes to the latter.  IOW, enabling
      controllers can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly, but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated, which is the parent of the
      target csses.  The pids controller depends on the migration methods to
      move charges, and this made the controller attribute charges to the
      wrong csses, often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(),
      and by updating the cgroup_taskset iteration helpers to also return
      the destination css in addition to the task being migrated (see the
      sketch after the list below).  All controllers are updated
      accordingly.
      
      * Controllers which don't care whether there is one target css or
        several can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's a single
        source and destination and thus doesn't support the v2 hierarchy
        already.  The only change made by this patchset is how that single
        destination css is obtained.
      
      * memory's migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated, and the prep stage
        of mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
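
      A sketch of the updated iteration, as seen from a controller method
      (the helper signature is per this commit; the pids calls shown are
      illustrative):

        static void pids_attach_sketch(struct cgroup_taskset *tset)
        {
                struct task_struct *task;
                struct cgroup_subsys_state *dst_css;

                cgroup_taskset_for_each(task, dst_css, tset) {
                        /* charge the css each task actually lands in, not
                         * the css where subtree_control was written */
                        pids_charge(css_pids(dst_css), 1);
                }
        }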
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  10. 23 Nov 2015, 10 commits