1. 13 Sep, 2015 1 commit
  2. 11 Sep, 2015 1 commit
    • sched: 'Annotate' migrate_tasks() · 5473e0cc
      Wanpeng Li committed
      Kernel testing triggered this warning:
      
      | WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
      | Modules linked in:
      | CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c73 #2
      | Call Trace:
      |   dump_stack+0x4b/0x75
      |   warn_slowpath_common+0x8b/0xc0
      |   warn_slowpath_null+0x22/0x30
      |   do_set_cpus_allowed+0x7e/0x80
      |   cpuset_cpus_allowed_fallback+0x7c/0x170
      |   select_fallback_rq+0x221/0x280
      |   migration_call+0xe3/0x250
      |   notifier_call_chain+0x53/0x70
      |   __raw_notifier_call_chain+0x1e/0x30
      |   cpu_notify+0x28/0x50
      |   take_cpu_down+0x22/0x40
      |   multi_cpu_stop+0xd5/0x140
      |   cpu_stopper_thread+0xbc/0x170
      |   smpboot_thread_fn+0x174/0x2f0
      |   kthread+0xc4/0xe0
      |   ret_from_kernel_thread+0x21/0x30
      
      As Peterz pointed out:
      
      | So the normal rules for changing task_struct::cpus_allowed are holding
      | both pi_lock and rq->lock, such that holding either stabilizes the mask.
      |
      | This is so that wakeup can happen without rq->lock and load-balance
      | without pi_lock.
      |
      | From this we already get the relaxation that we can omit acquiring
      | rq->lock if the task is not on the rq, because in that case
      | load-balancing will not apply to it.
      |
      | ** these are the rules currently tested in do_set_cpus_allowed() **
      |
      | Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
      | unconditionally acquires both locks, we could get away with holding just
      | rq->lock when on_rq for modification because that'd still exclude
      | __set_cpus_allowed_ptr(), it would also work against
      | __kthread_bind_mask() because that assumes !on_rq.
      |
      | That said, this is all somewhat fragile.
      |
      | Now, I don't think dropping rq->lock is quite as disastrous as it
      | usually is because !cpu_active at this point, which means load-balance
      | will not interfere, but that too is somewhat fragile.
      |
      | So we end up with a choice of two fragile..
      
      This patch fixes it by following the rules for changing
      task_struct::cpus_allowed with both pi_lock and rq->lock held.

      Reported-by: kernel test robot <ying.huang@intel.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      [ Modified changelog and patch. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 02 Sep, 2015 1 commit
  4. 25 Aug, 2015 1 commit
    • sched: Fix cpu_active_mask/cpu_online_mask race · dd9d3843
      Jan H. Schönherr committed
      There is a race condition in SMP bootup code, which may result
      in
      
          WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
          workqueue_cpu_up_callback()
      or
          kernel BUG at kernel/smpboot.c:135!
      
      It can be triggered with a bit of luck in Linux guests running
      on busy hosts.
      
      	CPU0                        CPUn
      	====                        ====
      
      	_cpu_up()
      	  __cpu_up()
      				    start_secondary()
      				      set_cpu_online()
      					cpumask_set_cpu(cpu,
      						   to_cpumask(cpu_online_bits));
      	  cpu_notify(CPU_ONLINE)
      	    <do stuff, see below>
      					cpumask_set_cpu(cpu,
      						   to_cpumask(cpu_active_bits));
      
      During the various CPU_ONLINE callbacks CPUn is online but not
      active. Several things can go wrong at that point, depending on
      the scheduling of tasks on CPU0.
      
      Variant 1:
      
        cpu_notify(CPU_ONLINE)
          workqueue_cpu_up_callback()
            rebind_workers()
              set_cpus_allowed_ptr()
      
        This call fails because it requires an active CPU; rebind_workers()
        ends with a warning:
      
          WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
          workqueue_cpu_up_callback()
      
      Variant 2:
      
        cpu_notify(CPU_ONLINE)
          smpboot_thread_call()
            smpboot_unpark_threads()
             ..
              __kthread_unpark()
                __kthread_bind()
                wake_up_state()
                 ..
                  select_task_rq()
                    select_fallback_rq()
      
        The ->wake_cpu of the unparked thread is not allowed, making a call
        to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
        find an allowed, active CPU and promptly resets the allowed CPUs, so
        that the task in question ends up on CPU0.
      
        When those unparked tasks are eventually executed, they run
        immediately into a BUG:
      
          kernel BUG at kernel/smpboot.c:135!
      
      Just changing the order in which the online/active bits are set
      (and adding some memory barriers), would solve the two issues
      above. However, it would change the order of operations back to
      the one before commit 6acbfb96 ("sched: Fix hotplug vs.
      set_cpus_allowed_ptr()"), thus, reintroducing that particular
      problem.
      
      Going further back into history, we have at least the following
      commits touching this topic:
      - commit 2baab4e9 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
      - commit 5fbd036b ("sched: Cleanup cpu_active madness")
      
      Together, these give us the following non-working solutions:
      
        - secondary CPU sets active before online, because active is assumed to
          be a subset of online;
      
        - secondary CPU sets online before active, because the primary CPU
          assumes that an online CPU is also active;
      
        - secondary CPU sets online and waits for primary CPU to set active,
          because it might deadlock.
      
      Commit 875ebe94 ("powerpc/smp: Wait until secondaries are
      active & online") introduces an arch-specific solution to this
      arch-independent problem.
      
      Now, go for a more general solution without explicit waiting and
      simply set active twice: once on the secondary CPU after online
      was set and once on the primary CPU after online was seen.
      
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@vger.kernel.org>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Wilson <msw@amazon.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6acbfb96 ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
      Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 12 Aug, 2015 4 commits
  6. 04 Aug, 2015 1 commit
  7. 03 Aug, 2015 6 commits
    • sched/fair: Init cfs_rq's sched_entity load average · 540247fb
      Yuyang Du committed
      The runnable load and utilization averages of cfs_rq's sched_entity
      were not initialized. As is done for a task, give a new cfs_rq's
      sched_entity start values that weight its load heavily in its early life.

      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Yuyang Du committed
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of one entity at a time, which leaves the
         cfs_rq's load average stale or partially updated: at any time, only
         one entity is up to date, and all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
         on the fact that if we regard the update as a function, then:
      
         w * update(e) = update(w * e) and
      
         update(e1) + update(e2) = update(e1 + e2), then
      
         w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
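
         The linearity argument in item 1 can be checked numerically. This is a
         minimal sketch, assuming update() is a pure geometric decay whose
         factor halves a contribution every 32 periods; the function names and
         sample values are illustrative and are not taken from the patch.

```python
import math

# Assumed decay factor: a contribution halves every 32 periods.
Y = 0.5 ** (1 / 32)

def update(value, periods):
    """Decay an accumulated average by `periods` elapsed periods."""
    return value * Y ** periods

def cfs_rq_sum_per_entity(weights, loads, periods):
    """Decay every entity on its own, then add the weighted results."""
    return sum(w * update(e, periods) for w, e in zip(weights, loads))

def cfs_rq_sum_aggregated(weights, loads, periods):
    """Add the weighted loads first, then decay the single aggregate."""
    return update(sum(w * e for w, e in zip(weights, loads)), periods)
```

         Both functions give the same result up to floating-point rounding,
         which is why one aggregate per cfs_rq can stand in for n per-entity
         updates.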
      
      2. cfs_rq's load average is different between top rq->cfs_rq and other
         task_group's per CPU cfs_rqs in whether or not blocked_load_average
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. How task_group entities' share is calculated is complex and imprecise.
      
         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per CPU cfs_rqs's
         load_avgs. Then group entity's weight is simply proportional to its
         own cfs_rq's load_avg / task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
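
         The share rule in item 3 can be sketched as follows. This is a
         simplified model using integer arithmetic; it leaves out any clamping
         the kernel may apply, and the zero-load branch is an assumption made
         only to keep the example well defined.

```python
def group_entity_shares(cfs_rq_avgs, tg_shares):
    """Split a task_group's shares across its per-CPU cfs_rqs by load_avg."""
    task_group_avg = sum(cfs_rq_avgs)
    if task_group_avg == 0:
        # Assumption: with no load anywhere there is nothing to distribute.
        return [0 for _ in cfs_rq_avgs]
    # cfs_rqx share = cfs_rqx_avg / task_group_avg * task_group's share
    return [tg_shares * avg // task_group_avg for avg in cfs_rq_avgs]
```

         For example, two cfs_rqs with load_avg 100 and 300 in a group with
         1024 shares receive 256 and 768 respectively.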
      
      To sum up, this rewrite is in principle equivalent to the current code, but
      fixes the issues described above. It also significantly reduces code
      complexity and hence increases clarity and efficiency. In addition,
      the new averages are smoother and more continuous (no spurious spikes and
      valleys) and are updated more consistently and quickly to reflect the load
      dynamics.

      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.

      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/preempt: Fix cond_resched_lock() and cond_resched_softirq() · fe32d3cd
      Konstantin Khlebnikov committed
      These functions check should_resched() before unlocking the spinlock or
      enabling bottom halves. At that point preempt_count is always non-zero,
      so should_resched() always returns false, and cond_resched_lock() only
      worked if spin_needbreak was set.
      
      This patch adds argument "preempt_offset" to should_resched().
      
      preempt_count offset constants for that:
      
        PREEMPT_DISABLE_OFFSET  - offset after preempt_disable()
        PREEMPT_LOCK_OFFSET     - offset after spin_lock()
        SOFTIRQ_DISABLE_OFFSET  - offset after local_bh_disable()
        SOFTIRQ_LOCK_OFFSET     - offset after spin_lock_bh()
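
      A toy model of the change, assuming the new should_resched() compares
      preempt_count against the offset the caller expects to be holding. The
      numeric values and the `_old` helper are placeholders for illustration;
      the real constants depend on kernel configuration.

```python
# Illustrative values only, not taken from the patch.
PREEMPT_OFFSET = 1
SOFTIRQ_OFFSET = 1 << 8

PREEMPT_DISABLE_OFFSET = PREEMPT_OFFSET
PREEMPT_LOCK_OFFSET = PREEMPT_DISABLE_OFFSET
SOFTIRQ_DISABLE_OFFSET = 2 * SOFTIRQ_OFFSET
SOFTIRQ_LOCK_OFFSET = SOFTIRQ_DISABLE_OFFSET + PREEMPT_LOCK_OFFSET

def should_resched_old(preempt_count, need_resched):
    """Old behaviour: only a zero preempt_count allows rescheduling."""
    return preempt_count == 0 and need_resched

def should_resched(preempt_count, need_resched, preempt_offset):
    """New behaviour: the caller states the count it is expected to hold."""
    return preempt_count == preempt_offset and need_resched
```

      With one spinlock held the count equals PREEMPT_LOCK_OFFSET, so the old
      check can never succeed while the new one does; a nested lock still
      blocks rescheduling because the count no longer matches the offset.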

      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: bdb43806 ("sched: Extract the basic add/sub preempt_count modifiers")
      Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Introduce the 'trace_sched_waking' tracepoint · fbd705a0
      Peter Zijlstra committed
      Mathieu reported that since 317f3941 ("sched: Move the second half
      of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of
      context of the waker.
      
      This is a problem when you want to analyse wakeup paths because it is
      now very hard to correlate the wakeup event to whoever issued the
      wakeup.
      
      OTOH trace_sched_wakeup() is issued at the point where we set
      p->state = TASK_RUNNING, which is right where we hand the task off to
      the scheduler, so this is an important point when looking at
      scheduling behaviour: up to here it's been the wakeup path, everything
      hereafter is due to scheduler policy.
      
      To bridge this gap, introduce a second tracepoint: trace_sched_waking.
      It is guaranteed to be called in the waker context.

      Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Francis Giraldeau <francis.giraldeau@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150609091336.GQ3644@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched, sysctl: Delete an unnecessary check before unregister_sysctl_table() · 781b0203
      Markus Elfring committed
      The unregister_sysctl_table() function tests whether its argument is NULL
      and then returns immediately. Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.

      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/5597877E.3060503@users.sourceforge.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/static_keys: Add static_key_{en,dis}able() helpers · e33886b3
      Peter Zijlstra committed
      Add two helpers to make it easier to treat the refcount as boolean.
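
      What "treating the refcount as boolean" can look like, as a toy model.
      The StaticKey class and its slow_inc()/slow_dec() methods are
      hypothetical stand-ins, and the sketch assumes the key is only ever
      touched through the two helpers.

```python
class StaticKey:
    """Toy stand-in for a refcounted key (not the kernel structure)."""

    def __init__(self):
        self.count = 0

    def slow_inc(self):
        self.count += 1

    def slow_dec(self):
        self.count -= 1

def static_key_enable(key):
    # Take a reference only if none is held, so repeated calls stay at 1.
    if key.count == 0:
        key.slow_inc()

def static_key_disable(key):
    # Drop the reference only if one is held, so repeated calls stay at 0.
    if key.count > 0:
        key.slow_dec()
```

      Calling either helper twice in a row has the same effect as calling it
      once, which is the property a plain increment/decrement pair lacks.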

      Suggested-by: Jason Baron <jasonbaron0@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 29 Jul, 2015 1 commit
  9. 23 Jul, 2015 1 commit
  10. 15 Jul, 2015 1 commit
    • cgroup: allow a cgroup subsystem to reject a fork · 7e47682e
      Aleksa Sarai committed
      Add a new cgroup subsystem callback can_fork that conditionally
      states whether or not the fork is accepted or rejected by a cgroup
      policy. In addition, add a cancel_fork callback so that if an error
      occurs later in the forking process, any state modified by can_fork can
      be reverted.
      
      Allow for a private opaque pointer to be passed from cgroup_can_fork to
      cgroup_post_fork, allowing for the fork state to be stored by each
      subsystem separately.
      
      Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG>
      enumerations to be defined and used. In addition, explicitly add a
      CGROUP_CANFORK_COUNT macro to make arrays easier to define.
      
      This is in preparation for implementing the pids cgroup subsystem.

      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  11. 04 Jul, 2015 2 commits
  12. 19 Jun, 2015 10 commits
    • timer: Reduce timer migration overhead if disabled · bc7a34b8
      Thomas Gleixner committed
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered to use a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The
      user-visible sysctl is still set to enabled. If the kernel switches to
      NOHZ, migration is enabled, unless the user disabled it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
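
      The resulting state rules can be modelled in a few lines. This is a
      sketch of the behaviour described above, not of the patch itself; the
      class, attribute, and method names are hypothetical.

```python
class TimerMigrationState:
    """Toy model of when timer migration ends up enabled."""

    def __init__(self, nohz_off_on_cmdline=False):
        self.sysctl_enabled = True        # user-visible knob starts enabled
        self.nohz_active = False          # kernel has not switched to NOHZ yet
        self.nohz_off = nohz_off_on_cmdline
        self.migration_enabled = False    # cached flag starts disabled

    def _refresh(self):
        # Migration needs both the sysctl and an active NOHZ mode.
        self.migration_enabled = self.sysctl_enabled and self.nohz_active

    def set_sysctl(self, enabled):
        self.sysctl_enabled = bool(enabled)
        self._refresh()

    def switch_to_nohz(self):
        # With nohz=off the switch never happens.
        if not self.nohz_off:
            self.nohz_active = True
        self._refresh()
```

      Caching the combined result is what lets the timer insertion path read
      one flag instead of re-evaluating the sysctl on every insertion.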
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu

      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Remove superfluous resetting of the p->dl_throttled flag · 6713c3aa
      Wanpeng Li committed
      Resetting the p->dl_throttled flag in rt_mutex_setprio() (for a task that is going
      to be boosted) is superfluous, as the natural place to do so is in
      replenish_dl_entity().
      
      If the task was on the runqueue and it is boosted by a DL task, it will be enqueued
      back with ENQUEUE_REPLENISH flag set, which can guarantee that dl_throttled is
      reset in replenish_dl_entity().
      
      This patch drops the resetting of throttled status in function rt_mutex_setprio().

      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431496867-4194-6-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/preempt: Add static_key() to preempt_notifiers · 1cde2930
      Peter Zijlstra committed
      Avoid touching the curr->preempt_notifier cacheline when not needed.
      
      Provides a small improvement on pipe-bench:
      
        taskset 01 perf stat --repeat 10 -- perf bench sched pipe
      
      before:
      
       Performance counter stats for 'perf bench sched pipe' (10 runs):
      
            12385.016204      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.34% )
               2,000,023      context-switches          #    0.161 M/sec                    ( +-  0.00% )
                       0      cpu-migrations            #    0.000 K/sec
                     175      page-faults               #    0.014 K/sec                    ( +-  0.26% )
          41,376,162,250      cycles                    #    3.341 GHz                      ( +-  0.11% )
          17,389,139,321      stalled-cycles-frontend   #   42.03% frontend cycles idle     ( +-  0.25% )
         <not supported>      stalled-cycles-backend
          68,788,588,003      instructions              #    1.66  insns per cycle
                                                        #    0.25  stalled cycles per insn  ( +-  0.02% )
          13,449,387,620      branches                  # 1085.940 M/sec                    ( +-  0.02% )
              20,880,690      branch-misses             #    0.16% of all branches          ( +-  0.98% )
      
            12.372646094 seconds time elapsed                                          ( +-  0.34% )
      
      after:
      
       Performance counter stats for 'perf bench sched pipe' (10 runs):
      
            12180.936528      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.33% )
               2,000,077      context-switches          #    0.164 M/sec                    ( +-  0.00% )
                       0      cpu-migrations            #    0.000 K/sec
                     174      page-faults               #    0.014 K/sec                    ( +-  0.27% )
          40,691,545,577      cycles                    #    3.341 GHz                      ( +-  0.06% )
          16,446,333,371      stalled-cycles-frontend   #   40.42% frontend cycles idle     ( +-  0.18% )
         <not supported>      stalled-cycles-backend
          68,570,100,387      instructions              #    1.69  insns per cycle
                                                        #    0.24  stalled cycles per insn  ( +-  0.01% )
          13,389,740,014      branches                  # 1099.237 M/sec                    ( +-  0.01% )
              20,175,440      branch-misses             #    0.15% of all branches          ( +-  0.52% )
      
            12.169253010 seconds time elapsed                                          ( +-  0.33% )

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration · d84525a8
      Mathieu Desnoyers committed
      preempt_notifier_unregister() documents:
      
        "This is safe to call from within a preemption notifier."
      
      However, both fire_sched_in_preempt_notifiers() and
      fire_sched_out_preempt_notifiers() are using hlist_for_each_entry(),
      which is not safe against entry removal during iteration.
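
      Why a plain walk is unsafe against removal can be shown with a toy
      singly linked list. The sketch relies on the fact that hlist_del()
      poisons the removed node's forward pointer; the HList class, the POISON
      sentinel, and the walker names are illustrative, not kernel code.

```python
POISON = object()  # stands in for the poison value a deleted node points to

class Node:
    def __init__(self, name):
        self.name = name
        self.next = None

class HList:
    def __init__(self, names):
        self.head = None
        for name in reversed(names):
            node = Node(name)
            node.next = self.head
            self.head = node

    def delete(self, node):
        """Unlink `node` and poison its forward pointer."""
        if self.head is node:
            self.head = node.next
        else:
            prev = self.head
            while prev.next is not node:
                prev = prev.next
            prev.next = node.next
        node.next = POISON

def walk_unsafe(hlist, callback):
    """Read ->next only after the callback, like a plain for-each walk."""
    visited = []
    node = hlist.head
    while node is not None:
        if node is POISON:
            raise RuntimeError("followed a poisoned pointer")
        visited.append(node.name)
        callback(hlist, node)
        node = node.next
    return visited

def walk_safe(hlist, callback):
    """Save ->next before the callback, like a removal-safe walk."""
    visited = []
    node = hlist.head
    while node is not None:
        nxt = node.next
        visited.append(node.name)
        callback(hlist, node)
        node = nxt
    return visited

def unregister_self(hlist, node):
    """A callback that removes the entry currently being visited."""
    hlist.delete(node)
```

      The unsafe walker fails as soon as a callback unregisters its own entry,
      while the safe walker visits every entry.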
      
      Inspection of the KVM code does not reveal any use of
      preempt_notifier_unregister() within the preempt notifiers.
      
      Therefore, fix the comment.

      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431881590-1456-1-git-send-email-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched,lockdep: Employ lock pinning · cbce1a68
      Peter Zijlstra committed
      Employ the new lockdep lock pinning annotation to ensure no
      'accidental' lock-breaks happen with rq->lock.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: oleg@redhat.com
      Cc: wanpeng.li@linux.intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Streamline the task migration locking a little · 5e16bbc2
      Peter Zijlstra committed
      The whole migrate_task{,s}() locking seems a little shaky; there's a
      lot of dropping and reacquiring happening. Pull the locking up into the
      callers as far as possible to streamline the lot.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: oleg@redhat.com
      Cc: wanpeng.li@linux.intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124743.755256708@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Move code around · 5cc389bc
      Peter Zijlstra committed
      In preparation to reworking set_cpus_allowed_ptr() move some code
      around. This also removes some superfluous #ifdefs and adds comments
      to some #endifs.
      
          text    data     bss      dec    hex filename
      12211532 1738144 1081344 15031020 e55aec defconfig-build/vmlinux.pre
      12211532 1738144 1081344 15031020 e55aec defconfig-build/vmlinux.post

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: oleg@redhat.com
      Cc: wanpeng.li@linux.intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124743.662086684@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Allow balance callbacks for check_class_changed() · 4c9a4bc8
      Peter Zijlstra committed
      In order to remove dropping rq->lock from the
      switched_{to,from}()/prio_changed() sched_class methods, run the
      balance callbacks after it.
      
      We need to remove dropping rq->lock because it's buggy. Suppose we use
      sched_setattr()/sched_setscheduler() to change a running task from
      FIFO to OTHER.
      
      By the time we get to switched_from_rt() the task is already enqueued
      on the cfs runqueues. If switched_from_rt() does pull_rt_task() and
      drops rq->lock, load-balancing can come in and move our task @p to
      another rq.
      
      The subsequent switched_to_fair() still assumes @p is on @rq and bad
      things will happen.
      
      By using balance callbacks we delay the load-balancing operations
      {rt,dl}x{push,pull} until we've done all the important work and the
      task is fully set up.
      
      Furthermore, the balance callbacks do not know about @p, therefore
      they cannot get confused like this.
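
      The ordering that balance callbacks give can be sketched with a toy
      runqueue. The class and helper bodies are illustrative only (they log
      instead of balancing) and the model leaves locking out entirely; what it
      shows is that the queued work runs after both class hooks have finished
      and receives no task argument.

```python
class RunQueue:
    """Toy runqueue that defers balancing work (not the kernel structure)."""

    def __init__(self):
        self.balance_callbacks = []
        self.log = []

    def queue_balance_callback(self, func):
        self.balance_callbacks.append(func)

    def run_balance_callbacks(self):
        pending, self.balance_callbacks = self.balance_callbacks, []
        for func in pending:
            func(self)  # the callback is not told about any task

def pull_rt_task(rq):
    rq.log.append("pull_rt_task")

def switched_from_rt(rq, task):
    rq.log.append(f"switched_from_rt({task})")
    rq.queue_balance_callback(pull_rt_task)  # defer instead of pulling now

def switched_to_fair(rq, task):
    rq.log.append(f"switched_to_fair({task})")

def check_class_changed(rq, task):
    switched_from_rt(rq, task)
    switched_to_fair(rq, task)

def change_class(rq, task):
    check_class_changed(rq, task)
    rq.run_balance_callbacks()  # only now does the balancing work run
    return rq.log
```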

      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: oleg@redhat.com
      Cc: wanpeng.li@linux.intel.com
      Link: http://lkml.kernel.org/r/20150611124742.615343911@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Use replace normalize_task() with __sched_setscheduler() · dbc7f069
      Peter Zijlstra committed
      Reduce duplicate logic; normalize_task() is a simplified version of
      __sched_setscheduler(). Parametrize the difference and collapse.
      
      This reduces the amount of check_class_changed() sites.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: oleg@redhat.com
      Cc: wanpeng.li@linux.intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124742.532642391@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      dbc7f069
    • P
      sched: Replace post_schedule with a balance callback list · e3fca9e7
      Peter Zijlstra committed
      Generalize the post_schedule() stuff into a balance callback list.
      This allows us to more easily use it outside of schedule() and cross
      sched_class.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: ktkhai@parallels.com
      Cc: rostedt@goodmis.org
      Cc: juri.lelli@gmail.com
      Cc: pang.xunlei@linaro.org
      Cc: oleg@redhat.com
      Cc: wanpeng.li@linux.intel.com
      Cc: umgwanakikbuti@gmail.com
      Link: http://lkml.kernel.org/r/20150611124742.424032725@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e3fca9e7
  13. 17 Jun, 2015 1 commit
  14. 07 Jun, 2015 2 commits
  15. 19 May, 2015 2 commits
    • F
      sched/preempt: Optimize preemption operations on __schedule() callers · b30f0e3f
      Frederic Weisbecker committed
      __schedule() disables preemption and some of its callers
      (the preempt_schedule*() family) also set PREEMPT_ACTIVE.
      
      So we have two preempt_count() modifications that could be performed
      at once.
      
      Let's remove the preemption disabling from __schedule() and move
      this responsibility to its callers in order to optimize preempt_count()
      operations in a single place.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431441711-29753-5-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b30f0e3f
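      The saving described in this commit message can be illustrated with a
      toy counter. This is a sketch under assumptions: the constants, the
      toy_* helpers and the update counter are hypothetical and do not
      reproduce the kernel's preempt_count() bit layout or its helpers. It
      compares two separate modifications on entry with a single combined one.

```c
#include <assert.h>

/* Illustrative values only; the kernel's real bit layout is not reproduced. */
#define TOY_PREEMPT_DISABLE_OFFSET 1
#define TOY_PREEMPT_ACTIVE         0x100

static int toy_preempt_count;
static int toy_count_updates;       /* how often the counter was modified */

static void toy_count_add(int val) { toy_preempt_count += val; toy_count_updates++; }
static void toy_count_sub(int val) { toy_preempt_count -= val; toy_count_updates++; }

/* The core expects to run with preemption disabled and the active flag set. */
static void toy_schedule_core(void)
{
	assert(toy_preempt_count ==
	       TOY_PREEMPT_ACTIVE + TOY_PREEMPT_DISABLE_OFFSET);
	/* ... pick the next task ... */
}

/* Before: the caller sets the flag and the scheduler disables preemption itself. */
int toy_preempt_schedule_old(void)
{
	toy_count_updates = 0;
	toy_count_add(TOY_PREEMPT_ACTIVE);            /* caller */
	toy_count_add(TOY_PREEMPT_DISABLE_OFFSET);    /* scheduler entry */
	toy_schedule_core();
	toy_count_sub(TOY_PREEMPT_DISABLE_OFFSET);    /* scheduler exit */
	toy_count_sub(TOY_PREEMPT_ACTIVE);            /* caller */
	return toy_count_updates;
}

/* After: the caller performs both changes in one modification each way. */
int toy_preempt_schedule_new(void)
{
	toy_count_updates = 0;
	toy_count_add(TOY_PREEMPT_ACTIVE + TOY_PREEMPT_DISABLE_OFFSET);
	toy_schedule_core();
	toy_count_sub(TOY_PREEMPT_ACTIVE + TOY_PREEMPT_DISABLE_OFFSET);
	return toy_count_updates;
}
```

      In this model the old path touches the counter four times and the new
      path twice, while the core sees the same counter value in both cases.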
    • S
      sched: always use blk_schedule_flush_plug in io_schedule_out · 10d784ea
      Shaohua Li committed
      A block plug callback could sleep, so we introduced a parameter
      'from_schedule' that the corresponding drivers can use to distinguish a
      schedule plug flush from a plug finish. Unfortunately io_schedule_out
      still uses blk_flush_plug(). This causes the output below (note: I added
      a might_sleep() in raid1_unplug to make it trigger faster, but the
      problem exists whether or not I add might_sleep). In raid1/10, this can
      cause deadlock.
      
      This patch makes io_schedule_out always use blk_schedule_flush_plug.
      This should only impact drivers (as far as I know, raid 1/10) which are
      sensitive to the 'from_schedule' parameter.
      
      [  370.817949] ------------[ cut here ]------------
      [  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
      [  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
      [  370.817971] Modules linked in: raid1
      [  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
      [  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
      [  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
      [  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
      [  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
      [  370.817993] Call Trace:
      [  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
      [  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
      [  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
      [  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
      [  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
      [  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
      [  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
      [  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
      [  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
      [  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
      [  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
      [  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
      [  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
      [  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
      [  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
      [  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
      [  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
      [  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
      [  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
      [  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
      [  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
      [  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
      [  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
      [  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
      [  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
      [  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
      [  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
      [  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
      [  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
      [  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
      [  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
      [  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
      [  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
      [  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
      [  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
      [  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
      [  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
      [  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818132] ---[ end trace 7b4deb71e68b6605 ]---
      
      V2: don't change ->in_iowait
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      10d784ea
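      The role of the 'from_schedule' parameter can be sketched with a toy
      model. Everything here is an assumption made for illustration: the
      toy_task_ctx structure and all toy_* functions are hypothetical, and the
      driver behaviour is reduced to "defer" versus "submit directly". The
      assert stands in for the check that produced the warning above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not the block layer's real interface. */
struct toy_task_ctx {
	bool task_running;   /* false once the task has set a sleeping state */
	int pending;         /* requests held back by the plug */
	int deferred;        /* requests handed off to a helper */
	int submitted;       /* requests submitted directly (may block) */
};

/* Hypothetical driver callback: it must not block when from_schedule is true. */
static void toy_unplug(struct toy_task_ctx *c, bool from_schedule)
{
	if (from_schedule) {
		c->deferred += c->pending;    /* hand off instead of blocking */
	} else {
		assert(c->task_running);      /* blocking needs a running task */
		c->submitted += c->pending;
	}
	c->pending = 0;
}

/* Plug finish: blocking is acceptable. */
static void toy_flush_plug(struct toy_task_ctx *c)
{
	toy_unplug(c, false);
}

/* Flush on the way into the scheduler: blocking is not acceptable. */
static void toy_schedule_flush_plug(struct toy_task_ctx *c)
{
	toy_unplug(c, true);
}

/* Models the fixed path: the task state is already set when the flush runs. */
static int toy_io_schedule(struct toy_task_ctx *c)
{
	c->task_running = false;      /* sleeping state already set by the waiter */
	toy_schedule_flush_plug(c);   /* toy_flush_plug(c) would trip the assert */
	c->task_running = true;       /* woken up again */
	return c->deferred;
}

int toy_plug_demo(void)
{
	struct toy_task_ctx c = { true, 2, 0, 0 };

	toy_flush_plug(&c);           /* ordinary plug finish: submits 2 */
	c.pending = 3;
	return toy_io_schedule(&c);   /* defers the 3 new requests */
}
```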
  16. 18 May, 2015 1 commit
    • P
      sched,perf: Fix periodic timers · 4cfafd30
      Peter Zijlstra committed
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      A further complication is that we want the timer handler to always do
      the forward such that it will always correctly deal with the overruns,
      and we do not want to race such that the handler has already decided
      to stop, but the (external) restart sees the timer still active and we
      end up with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect if you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
      4cfafd30
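      The external timer_active idea from this commit message can be sketched
      as a single-threaded toy. This is only an illustration under
      assumptions: toy_timer, toy_bandwidth and the toy_* functions are
      hypothetical, time is a plain integer, and the locking that would
      protect the flag in real code is omitted. The assert models the warning
      about forwarding an enqueued timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names; real code would hold a lock throughout. */
struct toy_timer {
	bool enqueued;        /* armed in the timer core */
	long expires;
	long period;
};

struct toy_bandwidth {
	bool timer_active;    /* external state kept next to the timer */
	int pending_work;     /* non-zero means the timer is still needed */
	struct toy_timer timer;
};

/* Forwarding an enqueued timer is the case that triggers the warning. */
static void toy_timer_forward(struct toy_timer *t, long now)
{
	assert(!t->enqueued);
	while (t->expires <= now)
		t->expires += t->period;
}

/* External (re)start: touches the timer only if the handler has stopped it. */
static void toy_start(struct toy_bandwidth *b, long now)
{
	if (b->timer_active)
		return;
	b->timer_active = true;
	toy_timer_forward(&b->timer, now);
	b->timer.enqueued = true;
}

/* Handler: always forwards first, then decides whether to keep running. */
static bool toy_handler(struct toy_bandwidth *b, long now)
{
	b->timer.enqueued = false;        /* dequeued before the callback runs */
	toy_timer_forward(&b->timer, now);
	if (b->pending_work == 0) {
		b->timer_active = false;  /* stop; a later toy_start() re-arms */
		return false;
	}
	b->pending_work--;
	b->timer.enqueued = true;         /* restart */
	return true;
}

long toy_timer_demo(void)
{
	struct toy_bandwidth b = { false, 1, { false, 0, 10 } };

	toy_start(&b, 0);                 /* arms the timer for t=10 */
	toy_handler(&b, 10);              /* work left: stays active, next t=20 */
	toy_start(&b, 15);                /* no-op: the handler is still in charge */
	toy_handler(&b, 20);              /* idle: handler stops the timer */
	toy_start(&b, 47);                /* restart forwards past "now" */
	return b.timer.expires;
}
```

      The restart at time 15 never forwards the timer because the flag says
      the handler still owns it, and the restart at time 47 cannot be lost
      because the handler cleared the flag when it decided to stop.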
  17. 09 May, 2015 1 commit
    • S
      sched: always use blk_schedule_flush_plug in io_schedule_out · 5596d0d5
      Shaohua Li committed
      A block plug callback could sleep, so we introduced a parameter
      'from_schedule' that the corresponding drivers can use to distinguish a
      schedule plug flush from a plug finish. Unfortunately io_schedule_out
      still uses blk_flush_plug(). This causes the output below (note: I added
      a might_sleep() in raid1_unplug to make it trigger faster, but the
      problem exists whether or not I add might_sleep). In raid1/10, this can
      cause deadlock.
      
      This patch makes io_schedule_out always use blk_schedule_flush_plug.
      This should only impact drivers (as far as I know, raid 1/10) which are
      sensitive to the 'from_schedule' parameter.
      
      [  370.817949] ------------[ cut here ]------------
      [  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
      [  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
      [  370.817971] Modules linked in: raid1
      [  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
      [  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
      [  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
      [  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
      [  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
      [  370.817993] Call Trace:
      [  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
      [  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
      [  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
      [  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
      [  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
      [  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
      [  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
      [  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
      [  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
      [  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
      [  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
      [  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
      [  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
      [  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
      [  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
      [  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
      [  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
      [  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
      [  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
      [  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
      [  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
      [  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
      [  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
      [  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
      [  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
      [  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
      [  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
      [  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
      [  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
      [  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
      [  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
      [  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
      [  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
      [  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
      [  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
      [  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
      [  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
      [  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818132] ---[ end trace 7b4deb71e68b6605 ]---
      
      V2: don't change ->in_iowait
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5596d0d5
  18. 08 May, 2015 3 commits
    • P
      perf: Fix software migrate events · ff303e66
      Peter Zijlstra committed
      Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it
      was borken:
      
       > The problem is that the task isn't actually scheduled while its being
       > migrated (obviously), and if its not scheduled, the counters aren't
       > scheduled either, so there's no observing of the fact.
       >
       > A further problem with migrations is that many migrations happen from
       > softirq context, which is nested inside the 'random' task context of
       > whoemever happens to run at that time, similarly for the wakeup
       > migrations triggered from (soft)irq context. All those end up being
       > accounted in the task that's currently running, eg. your 'ls'.
      
      The below cures this by marking a task as migrated and accounting it
      on the subsequent sched_in().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ff303e66
    • P
      sched: Implement lockless wake-queues · 76751049
      Peter Zijlstra committed
      This is useful for locking primitives that can effect multiple
      wakeups per operation and want to avoid lock-internal lock contention
      by delaying the wakeups until we've released the lock-internal locks.
      
      Alternatively it can be used to avoid issuing multiple wakeups, and
      thus save a few cycles, in packet processing. Queue all target tasks
      and wakeup once you've processed all packets. That way you avoid
      waking the target task multiple times if there were multiple packets
      for the same task.
      
      Properties of a wake_q are:
      - Lockless, as the queue head must reside on the stack.
      - Being a queue, it maintains the wakeup order passed by the callers.
        Otherwise, in scenarios with highly contended locks, reordering
        could affect any reliance on lock fairness.
      - A queued task cannot be added again until it is woken up.
      
      This patch adds the needed infrastructure into the scheduler code
      and uses the new wake_list to delay the futex wakeups until
      after we've released the hash bucket locks.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [tweaks, adjustments, comments, etc.]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: George Spelvin <linux@horizon.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      76751049
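      The wake_q properties listed in this commit message can be sketched
      with a single-threaded toy. This is an illustration under assumptions:
      toy_task, toy_wake_q and the toy_* functions are hypothetical names,
      "waking" is just a counter increment, and the plain pointer check below
      stands in for whatever atomic operation concurrent code would need.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, single-threaded, no real wakeups. */
struct toy_task {
	int id;
	int wakeups;
	struct toy_task *wake_next;   /* NULL means "not queued" */
};

/* Sentinel terminating the list, so a queued tail still has a non-NULL link. */
static struct toy_task toy_wake_q_tail;
#define TOY_WAKE_Q_TAIL (&toy_wake_q_tail)

struct toy_wake_q {
	struct toy_task *first;
	struct toy_task **lastp;
};

static void toy_wake_q_init(struct toy_wake_q *q)
{
	q->first = TOY_WAKE_Q_TAIL;
	q->lastp = &q->first;
}

/* Append in caller order; a task that is already queued is not added again. */
static int toy_wake_q_add(struct toy_wake_q *q, struct toy_task *t)
{
	if (t->wake_next != NULL)
		return 0;
	t->wake_next = TOY_WAKE_Q_TAIL;
	*q->lastp = t;
	q->lastp = &t->wake_next;
	return 1;
}

/* Wake every queued task in order; each becomes queueable again. */
static int toy_wake_up_q(struct toy_wake_q *q)
{
	struct toy_task *t = q->first;
	int woken = 0;

	while (t != TOY_WAKE_Q_TAIL) {
		struct toy_task *next = t->wake_next;

		t->wake_next = NULL;
		t->wakeups++;             /* stand-in for the real wakeup */
		woken++;
		t = next;
	}
	toy_wake_q_init(q);
	return woken;
}

int toy_wake_q_demo(void)
{
	struct toy_task a = { 1, 0, NULL }, b = { 2, 0, NULL };
	struct toy_wake_q q;              /* queue head lives on the stack */
	int woken;

	toy_wake_q_init(&q);
	/* ... lock held: only collect the tasks ... */
	toy_wake_q_add(&q, &a);
	toy_wake_q_add(&q, &b);
	toy_wake_q_add(&q, &a);           /* ignored: already queued */
	/* ... lock released: now issue the wakeups ... */
	woken = toy_wake_up_q(&q);
	assert(a.wakeups == 1 && b.wakeups == 1);
	return woken;
}
```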
    • J
      sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() · 316c1608
      Jason Low committed
      ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
      the rest of the existing usages of ACCESS_ONCE() in the scheduler, and uses
      the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Waiman Long <Waiman.Long@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      316c1608
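      The read/write split that this conversion relies on can be sketched in
      userspace. The macros below are simplified, scalar-only stand-ins
      written for this example with the GNU C __typeof__ extension; the TOY_
      names are hypothetical and these are not the kernel's definitions. They
      only show the shape of the API: one macro for loads, one for stores.

```c
#include <assert.h>

/* Simplified scalar-only stand-ins; not the kernel's READ_ONCE()/WRITE_ONCE(). */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int toy_flag;

/* Store through a volatile access so the compiler emits the write as written. */
static void toy_publish(int v)
{
	TOY_WRITE_ONCE(toy_flag, v);
}

/* Load through a volatile access so the compiler emits the read as written. */
static int toy_consume(void)
{
	return TOY_READ_ONCE(toy_flag);
}

int toy_once_demo(void)
{
	toy_publish(3);
	assert(toy_consume() == 3);
	return toy_consume();
}
```

      Separating loads from stores lets each direction be handled on its
      own, which a single lvalue-style macro cannot express.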