1. 29 February 2016 (10 commits)
    • sched/rt: Kick RT bandwidth timer immediately on start up · c3a990dc
      Committed by Steven Rostedt
      I've been debugging why deadline tasks can cause the RT scheduler to
      throttle, even when the deadline tasks are only taking up 50% of the
      CPU and RT tasks are not even using 1% of the CPU. Here's what I found.
      
      In order to keep a CPU from being hogged by RT tasks, the deadline
      scheduler adds its run time (delta_exec) to the rt_time of the RT
      bandwidth. That way, if the two use more than 95% of the CPU within one
      second (default settings), the RT tasks are throttled to allow non RT
      tasks to run.
      
      Although the deadline tasks add their run time to the RT bandwidth, it
      lets the RT tasks do the accounting. This is where the problem lies. If
      a deadline task runs for a bit, and no RT tasks are running, then it
      will continually add to the RT rt_time that is used to calculate how
      much CPU the RT tasks use. But no RT period is in play, and this
      accumulation of the runtime never gets reset.
      
      When an RT task finally gets to run, and the watchdog goes off, it can
      see that the RT task has used more than it should have, because the
      deadline task added all this runtime to its rt_time. Then the RT task
      that just woke up gets throttled for no good reason.
      
      I also noticed that when an RT task is queued, it starts the timer to
      account for overload and such. But that timer goes off one period
      later, which may be too late and the extra rt_time will trigger a
      throttle.
      
      This is a quick workaround for the problem. When a new RT task is
      queued, the bandwidth timer is set to go off immediately. Then the
      timer can clear out the extra time added to the rt_time while there was
      no RT task running. This stops my tests from triggering the throttle,
      and it will still throttle if an RT task runs too much, even while a
      deadline task is running.
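      
      Sketched (modeled on start_rt_bandwidth(); simplified, not the verbatim
      upstream diff):
      
        static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
        {
          if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
            return;
        
          raw_spin_lock(&rt_b->rt_runtime_lock);
          if (!rt_b->rt_period_active) {
            rt_b->rt_period_active = 1;
            /*
             * Kick the timer right away instead of one full period from
             * now, so its handler can clear out the rt_time that deadline
             * tasks accumulated while no RT task was running.
             */
            hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
            hrtimer_start_expires(&rt_b->rt_period_timer,
                                  HRTIMER_MODE_ABS_PINNED);
          }
          raw_spin_unlock(&rt_b->rt_runtime_lock);
        }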
      
      A better solution may be to subtract the bandwidth that the deadline
      task uses from the rt_runtime, and add it back when it's finished. Then
      there won't be a need for runtime tracking of the time used by deadline
      tasks.
      
      I may play with that solution tomorrow.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <juri.lelli@gmail.com>
      Cc: <williams@redhat.com>
      Cc: Clark Williams
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160216183746.349ec98b@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug · ef477183
      Committed by Steven Rostedt (Red Hat)
      Playing with SCHED_DEADLINE and cpusets, I found that I was unable to
      create new SCHED_DEADLINE tasks, getting an EBUSY error as if the
      bandwidth was already used up. I then realized there was no way to see
      what bandwidth is used by the runqueues to debug the issue.
      
      Adding dl_bw->bw and dl_bw->total_bw to the output of the deadline
      info in /proc/sched_debug allows us to see what bandwidth has been
      reserved and where a problem may exist.
      
      For example, before the issue we see the ratio of the bandwidth:
      
       # cat /proc/sys/kernel/sched_rt_runtime_us
       950000
       # cat /proc/sys/kernel/sched_rt_period_us
       1000000
      
        # grep dl /proc/sched_debug
        dl_rq[0]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
        dl_rq[1]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
        dl_rq[2]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
        dl_rq[3]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
        dl_rq[4]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
        dl_rq[5]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
        dl_rq[6]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
        dl_rq[7]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 0
      
      Note: (950000 / 1000000) << 20 == 996147
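      
      (That ratio is the kernel's fixed-point encoding of runtime/period; a
      sketch modeled on the to_ratio() helper of that era, with a shift of 20:
      
        unsigned long to_ratio(u64 period, u64 runtime)
        {
          if (runtime == RUNTIME_INF)
            return 1ULL << 20;
          if (period == 0)  /* returning 0 here is safe for the callers */
            return 0;
          return div64_u64(runtime << 20, period);
        }
      
      so to_ratio(1000000, 950000) == (950000 << 20) / 1000000 == 996147.)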
      
      After I played with cpusets and hit the issue, the result is now:
      
        # grep dl /proc/sched_debug
        dl_rq[0]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : -104857
        dl_rq[1]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 104857
        dl_rq[2]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 104857
        dl_rq[3]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : 104857
        dl_rq[4]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : -104857
        dl_rq[5]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : -104857
        dl_rq[6]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : -104857
        dl_rq[7]:
          .dl_nr_running                 : 0
          .dl_bw->bw                     : 996147
          .dl_bw->total_bw               : -104857
      
      This shows that there is definitely a problem as we should never have a
      negative total bandwidth.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160222212825.756849091@goodmis.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Move sched_domain_sysctl to debug.c · 3866e845
      Committed by Steven Rostedt (Red Hat)
      The sched_domain_sysctl setup is only enabled when SCHED_DEBUG is
      configured. As debug.c is only compiled when SCHED_DEBUG is configured as
      well, move the setup of sched_domain_sysctl into that file.
      
      Note, the (un)register_sched_domain_sysctl() functions had to be made
      non-static to allow access to them from core.c.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160222212825.599278093@goodmis.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c · d6ca41d7
      Committed by Steven Rostedt (Red Hat)
      As /sys/kernel/debug/sched_features is only created when SCHED_DEBUG is
      enabled, and the file debug.c is only compiled when SCHED_DEBUG is
      enabled, it makes sense to move sched_feature setup into that file and
      get rid of the #ifdef.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160222212825.464193063@goodmis.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Fix PI handling vs. sched_setscheduler() · ff77e468
      Committed by Peter Zijlstra
      Andrea Parri reported:
      
      > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
      > handled correctly:
      >
      >     T1 (prio = 20)
      >        lock(rtmutex);
      >
      >     T2 (prio = 20)
      >        blocks on rtmutex  (rt_nr_boosted = 0 on T1's rq)
      >
      >     T1 (prio = 20)
      >        sys_set_scheduler(prio = 0)
      >           [new_effective_prio == oldprio]
      >           T1 prio = 20    (rt_nr_boosted = 0 on T1's rq)
      >
      > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
      > in particular, if we continue with
      >
      >    T1 (prio = 20)
      >       unlock(rtmutex)
      >          wakeup(T2)
      >          adjust_prio(T1)
      >             [prio != rt_mutex_getprio(T1)]
      >             dequeue(T1)
      >                rt_nr_boosted = (unsigned long)(-1)
      >                ...
      >             T1 prio = 0
      >
      > then we end up leaving rt_nr_boosted in an "inconsistent" state.
      >
      > The simple program attached could reproduce the previous scenario; note
      > that, as a consequence of the presence of this state, the "assertion"
      >
      >     WARN_ON(!rt_nr_running && rt_nr_boosted)
      >
      > from dec_rt_group() may trigger.
      
      So normally we dequeue/enqueue tasks in sched_setscheduler(), which
      would ensure the accounting stays correct. However in the early PI path
      we fail to do so.
      
      So this was introduced at around v3.14, by:
      
        c365c292 ("sched: Consider pi boosting in setscheduler()")
      
      which fixed another problem exactly because of that dequeue/enqueue, joy.
      
      Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
      preserve runqueue location with that option. This requires decoupling
      the on_rt_rq() state from being on the list.
      
      In order to allow for explicit movement during the SAVE/RESTORE,
      introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
      cases to preserve other invariants.
      
      Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
      things like sys_nice()/sys_sched_setaffinity() also do not reorder
      FIFO tasks (whereas they used to before this patch).
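      
      A sketch of the resulting flag usage in __sched_setscheduler()
      (simplified; names follow the upstream flags):
      
        int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
        ...
        if (pi) {
          new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
          /*
           * If the effective priority is unchanged, keep the task at the
           * same position in its queue: SAVE/RESTORE without MOVE.
           */
          if (new_effective_prio == oldprio)
            queue_flags &= ~DEQUEUE_MOVE;
        }
        
        if (queued)
          dequeue_task(rq, p, queue_flags);
        /* ... switch class / priority ... */
        if (queued)
          enqueue_task(rq, p, queue_flags);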
      Reported-by: Andrea Parri <parri.andrea@gmail.com>
      Tested-by: Andrea Parri <parri.andrea@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Remove duplicated sched_group_set_shares() prototype · 41d93397
      Committed by Dongsheng Yang
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.fnst@cn.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Consolidate nohz CPU load update code · be68a682
      Committed by Frederic Weisbecker
      Let's factor out a bit of code there. We'll even have a third user soon.
      While at it, standardize the idle update function name against the
      others.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1452700891-21807-3-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Avoid using decay_load_missed() with a negative value · 7400d3bb
      Committed by Byungchul Park
      decay_load_missed() cannot handle negative values, so we need to prevent
      using the function with a negative value.
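      
      The shape of the fix in __update_cpu_load(), sketched: decay the load
      and the tickless component separately rather than decaying their
      (possibly negative) difference:
      
        old_load = this_rq->cpu_load[i];
        old_load = decay_load_missed(old_load, pending_updates - 1, i);
        if (tickless_load) {
          old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
          /*
           * old_load can never go negative: the decayed tickless_load
           * cannot be greater than the original tickless_load.
           */
          old_load += tickless_load;
        }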
      Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Fixes: 59543275 ("sched/fair: Prepare __update_cpu_load() to handle active tickless")
      Link: http://lkml.kernel.org/r/20160115070749.GA1914@X58A-UD3R
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Always calculate end of period on sched_yield() · 48be3a67
      Committed by Peter Zijlstra
      Steven noticed that occasionally a sched_yield() call would not result
      in a wait for the next period edge as expected.
      
      It turns out that when we call update_curr_dl() and end up with
      delta_exec <= 0, we will bail early and fail to throttle.
      
      Further inspection of the yield code revealed that yield_task_dl()
      clearing dl.runtime is wrong too: it will not account the last bit of
      runtime, which could result in dl.runtime < 0, which in turn means that
      replenish would gift us with too much runtime.
      
      Fix both issues by not relying on the dl.runtime value for yield.
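      
      The yield path after the fix, sketched: flag the task as yielded and let
      update_curr_dl() account the remaining runtime and throttle until the
      period edge, instead of zeroing dl.runtime directly:
      
        static void yield_task_dl(struct rq *rq)
        {
          /*
           * Make the task sleep until its current deadline: mark it
           * yielded so update_curr_dl() stops it, and the bandwidth
           * timer replenishes it with fresh parameters afterwards.
           */
          rq->curr->dl.dl_yielded = 1;
        
          update_rq_clock(rq);
          update_curr_dl(rq);  /* also accounts the last bit of runtime */
          rq_clock_skip_update(rq, true);
        }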
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160223122822.GP6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/cgroup: Fix cgroup entity load tracking tear-down · 6fe1f348
      Committed by Peter Zijlstra
      When a cgroup's CPU runqueue is destroyed, it should remove its
      remaining load accounting from its parent cgroup.
      
      The current site for doing so is unsuited because it's far too late and
      unordered against other cgroup removal (->css_free() will be ordered, but
      we're also in an RCU callback).
      
      Put it in the ->css_offline() callback, which is the start of cgroup
      destruction, right after the group has been made unavailable to
      userspace. The ->css_offline() callbacks are called in hierarchical order
      after the following v4.4 commit:
      
        aa226ff4 ("cgroup: make sure a parent css isn't offlined before its children")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 25 February 2016 (15 commits)
    • rcu: Use simple wait queues where possible in rcutree · abedf8e2
      Committed by Paul Gortmaker
      As of commit dae6e64d ("rcu: Introduce proper blocking to no-CBs kthreads
      GP waits") the RCU subsystem started making use of wait queues.
      
      Here we convert all additions of RCU wait queues to use simple wait queues,
      since they don't need the extra overhead of the full wait queue features.
      
      Originally this was done for RT kernels[1], since we would get things like...
      
        BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
        in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
        Pid: 8, comm: rcu_preempt Not tainted
        Call Trace:
         [<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0
         [<ffffffff817d77b4>] rt_spin_lock+0x24/0x50
         [<ffffffff8106fcf6>] __wake_up+0x36/0x70
         [<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680
         [<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50
         [<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80
         [<ffffffff8105eabb>] kthread+0xdb/0xe0
         [<ffffffff8106b912>] ? finish_task_switch+0x52/0x100
         [<ffffffff817e0754>] kernel_thread_helper+0x4/0x10
         [<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60
         [<ffffffff817e0750>] ? gs_change+0xb/0xb
      
      ...and hence simple wait queues were deployed on RT out of necessity
      (as simple wait uses a raw lock), but mainline might as well take
      advantage of the more streamlined support as well.
      
      [1] This is a carry forward of work from v3.10-rt; the original conversion
      was by Thomas on an earlier -rt version, and Sebastian extended it to
      additional post-3.10 added RCU waiters; here I've added a commit log and
      unified the RCU changes into one, and uprev'd it to match mainline RCU.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lock · 065bb78c
      Committed by Daniel Wagner
      rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
      this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
      not enable the IRQs. lockdep is happy.
      
      After switching over to swait this is not true anymore. swake_up_all()
      enables the IRQs while processing the waiters. __do_softirq() can now
      run and will eventually call rcu_process_callbacks(), which wants to
      grab rnp->lock.
      
      Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
      switch over to swait.
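      
      Sketched, the reworked path (as it ends up looking once combined with
      the swait conversion) hands the wait queue out of the locked region so
      that the wakeup happens after the unlock:
      
        static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp)
        {
          return &rnp->nocb_gp_wq[rnp->completed & 0x1];
        }
        
        /* caller, with rnp->lock held: */
        sq = rcu_nocb_gp_get(rnp);
        raw_spin_unlock_irq(&rnp->lock);
        rcu_nocb_gp_cleanup(sq);  /* swake_up_all() runs with IRQs enabled */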
      
      If we were to hold the rnp->lock and use swait, lockdep would report the
      following:
      
       =================================
       [ INFO: inconsistent lock state ]
       4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
       ---------------------------------
       inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
       rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
        (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
       {IN-SOFTIRQ-W} state was registered at:
         [<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0
         [<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0
         [<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0
         [<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0
         [<ffffffff810b1a9d>] __do_softirq+0x14d/0x670
         [<ffffffff810b2214>] irq_exit+0x104/0x110
         [<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60
         [<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80
         [<ffffffff810dba66>] rq_attach_root+0xa6/0x100
         [<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650
         [<ffffffff810e4b42>] build_sched_domains+0x942/0xb00
         [<ffffffff821777c2>] sched_init_smp+0x509/0x5c1
         [<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f
         [<ffffffff8182cdce>] kernel_init+0xe/0xe0
         [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
       irq event stamp: 76
       hardirqs last  enabled at (75): [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
       hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90
       softirqs last  enabled at (0): [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0
       softirqs last disabled at (0): [<          (null)>]           (null)
       other info that might help us debug this:
        Possible unsafe locking scenario:
              CPU0
              ----
         lock(rcu_node_1);
         <Interrupt>
           lock(rcu_node_1);
        *** DEADLOCK ***
       1 lock held by rcu_preempt/8:
        #0:  (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
       stack backtrace:
       CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
       Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
        0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
        0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
        0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
       Call Trace:
        [<ffffffff818379e0>] dump_stack+0x4f/0x7b
        [<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0
        [<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50
        [<ffffffff811087ad>] mark_lock+0x66d/0x6e0
        [<ffffffff81107790>] ? check_usage_forwards+0x150/0x150
        [<ffffffff81108898>] mark_held_locks+0x78/0xa0
        [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
        [<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220
        [<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10
        [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
        [<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0
        [<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0
        [<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220
        [<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60
        [<ffffffff81137c30>] ? rcu_barrier+0x20/0x20
        [<ffffffff810d2014>] kthread+0x104/0x120
        [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
        [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
        [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
        [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • wait.[ch]: Introduce the simple waitqueue (swait) implementation · 13b35686
      Committed by Peter Zijlstra (Intel)
      The existing wait queue support provides custom wake-up callbacks, wake
      flags, a wake key (passed to the callback) and exclusive flags that
      allow wakers to be tagged as exclusive, for limiting the number of
      wakers.
      
      In a lot of cases, none of these features are used, and hence we
      can benefit from a slimmed down version that lowers memory overhead
      and reduces runtime overhead.
      
      The concept originated from -rt, where waitqueues are a constant
      source of trouble, as we can't convert the head lock to a raw
      spinlock due to fancy and long lasting callbacks.
      
      With the removal of custom callbacks, we can use a raw lock for
      queue list manipulations, hence allowing the simple wait support
      to be used in -rt.
      
      [Patch is from PeterZ which is based on Thomas version. Commit message is
       written by Paul G.
       Daniel:  - Fixed some compile issues
       	  - Added non-lazy implementation of swake_up_locked as suggested
      	     by Boqun Feng.]
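      
      For reference, the basic usage pattern of the new API looks like this
      (a minimal sketch):
      
        #include <linux/swait.h>
        
        static DECLARE_SWAIT_QUEUE_HEAD(gp_wq);
        static int gp_done;
        
        /* waiter side */
        swait_event_interruptible(gp_wq, READ_ONCE(gp_done));
        
        /* waker side; the head lock is a raw spinlock, so this is safe
         * in contexts where a full wake_up() is not on -rt */
        WRITE_ONCE(gp_done, 1);
        swake_up(&gp_wq);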
      Originally-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-2-git-send-email-wagi@monom.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf: Robustify task_function_call() · 0da4cf3e
      Committed by Peter Zijlstra
      Since there is no serialization between task_function_call() doing
      task_curr() and the other CPU doing context switches, we could end
      up not sending an IPI even if we had to.
      
      And I'm not sure I still buy my own argument we're OK.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174948.340031200@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix scaling vs. perf_install_in_context() · a096309b
      Committed by Peter Zijlstra
      Completely reworks perf_install_in_context() (again!) in order to
      ensure that there will be no ctx time hole between add_event_to_ctx()
      and any potential ctx_sched_in().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174948.279399438@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix scaling vs. perf_event_enable() · bd2afa49
      Committed by Peter Zijlstra
      Similar to perf_enable_on_exec(), ensure that event timings are
      consistent across perf_event_enable().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174948.218288698@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix scaling vs. perf_event_enable_on_exec() · 7fce2509
      Committed by Peter Zijlstra
      The recent commit 3e349507 ("perf: Fix perf_enable_on_exec() event
      scheduling") caused this by moving task_ctx_sched_out() from before
      __perf_event_mask_enable() to after it.
      
      The overlooked consequence of that change is that task_ctx_sched_out()
      would update the ctx time fields, and now __perf_event_mask_enable()
      uses stale time.
      
      In order to fix this, explicitly stop our context's time before
      enabling the event(s).
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Fixes: 3e349507 ("perf: Fix perf_enable_on_exec() event scheduling")
      Link: http://lkml.kernel.org/r/20160224174948.159242158@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix ctx time tracking by introducing EVENT_TIME · 3cbaa590
      Committed by Peter Zijlstra
      Currently any ctx_sched_in() call will re-start the ctx time tracking,
      this means that calls like:
      
      	ctx_sched_in(.event_type = EVENT_PINNED);
      	ctx_sched_in(.event_type = EVENT_FLEXIBLE);
      
      will have a hole in their ctx time tracking. This is likely harmless
      but can confuse things a little. By adding EVENT_TIME, we can have the
      first ctx_sched_in() (is_active: 0 -> !0) start the time and any
      further ctx_sched_in() will leave the timestamps alone.
      
      Secondly, this allows for an early disable like:
      
      	ctx_sched_out(.event_type = EVENT_TIME);
      
      which would update the ctx time (if the ctx is active) and any further
      calls to ctx_sched_out() would not further modify the ctx time.
      
      For ctx_sched_in() any 0 -> !0 transition will automatically include
      EVENT_TIME.
      
      For ctx_sched_out(), any transition that clears EVENT_ALL will
      automatically clear EVENT_TIME.
      
      These two rules ensure that under normal circumstances we need not
      bother with EVENT_TIME and get natural ctx time behaviour.
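      
      In code, the new flag simply slots in next to the existing event types
      (sketch of the enum after this change; note that EVENT_ALL deliberately
      excludes EVENT_TIME):
      
        enum event_type_t {
          EVENT_FLEXIBLE = 0x1,
          EVENT_PINNED   = 0x2,
          EVENT_TIME     = 0x4,
          EVENT_ALL      = EVENT_FLEXIBLE | EVENT_PINNED,
        };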
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174948.100446561@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Cure event->pending_disable race · 28a967c3
      Committed by Peter Zijlstra
      Because event_sched_out() checks event->pending_disable _before_
      actually disabling the event, it can happen that the event fires after
      it checks but before it gets disabled.
      
      This would leave event->pending_disable set and the queued irq_work
      will try and process it.
      
      However, if the event trigger was during schedule(), the event might
      have been de-scheduled by the time the irq_work runs, and
      perf_event_disable_local() will fail.
      
      Fix this by checking event->pending_disable _after_ we call
      event->pmu->del(). This depends on the latter being a compiler
      barrier, such that the compiler does not lift the load and re-create
      the problem.
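      
      Sketch of the reordered check in event_sched_out():
      
        event->pmu->del(event, 0);
        event->oncpu = -1;
        
        event->state = PERF_EVENT_STATE_INACTIVE;
        /*
         * Only now look at pending_disable; the ->del() call above
         * doubles as a compiler barrier, so this load cannot be lifted
         * above it.
         */
        if (event->pending_disable) {
          event->pending_disable = 0;
          event->state = PERF_EVENT_STATE_OFF;
        }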
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174948.040469884@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix race between event install and jump_labels · 9107c89e
      Committed by Peter Zijlstra
      perf_install_in_context() relies upon the context switch hooks to have
      scheduled in events when the IPI misses its target -- after all, if
      the task has moved from the CPU (or wasn't running at all), it will
      have to context switch to run elsewhere.
      
      This however doesn't appear to be happening.
      
      It is possible for the IPI to not happen (task wasn't running) only to
      later observe the task running with an inactive context.
      
      The only possible explanation is that the context switch hooks are not
      called. Therefore put in a sync_sched() after toggling the jump_label
      to guarantee all CPUs will have them enabled before we install an
      event.
      
      A simple if (0->1) sync_sched() will not in fact work, because any
      further increment can race and complete before the sync_sched().
      Therefore we must jump through some hoops.
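      
      The hoops, sketched (modeled on the enable path in account_event()):
      
        if (atomic_inc_not_zero(&perf_sched_count))
          goto enabled;  /* fast path: a previous 0->1 already synced */
        
        mutex_lock(&perf_sched_mutex);
        if (!atomic_read(&perf_sched_count)) {
          static_branch_enable(&perf_sched_events);
          /*
           * Guarantee that all CPUs observe the key change and start
           * calling the context switch hooks before any event is
           * installed.
           */
          synchronize_sched();
        }
        /* only now allow further increments to take the fast path */
        atomic_inc(&perf_sched_count);
        mutex_unlock(&perf_sched_mutex);
      enabled:
        ...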
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174947.980211985@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix cloning · a69b0ca4
      Committed by Peter Zijlstra
      Alexander reported that when the 'original' context gets destroyed, no
      new clones happen.
      
      This can happen irrespective of the ctx switch optimization, any task
      can die, even the parent, and we want to continue monitoring the task
      hierarchy until we either close the event or no tasks are left in the
      hierarchy.
      
      perf_event_init_context() will attempt to pin the 'parent' context
      during clone(). At that point current is the parent, and since current
      cannot have exited while executing clone(), its context cannot have
      passed through perf_event_exit_task_context(). Therefore
      perf_pin_task_context() cannot observe ctx->task == TASK_TOMBSTONE.
      
      However, since inherit_event() does:
      
      	if (parent_event->parent)
      		parent_event = parent_event->parent;
      
      it looks at the 'original' event when it does is_orphaned_event().
      This can return true if the context that contains this event has
      passed through perf_event_exit_task_context(). And thus we'll fail to
      clone the perf context.
      
      Fix this by adding a new state: STATE_DEAD, which is set by
      perf_release() to indicate that the filedesc (or kernel reference) is
      dead and there are no observers for our data left.
      
      Only for STATE_DEAD will is_orphaned_event() be true and inhibit
      cloning.
      
      STATE_EXIT is otherwise preserved such that is_event_hup() remains
      functional and will report when the observed task hierarchy becomes
      empty.
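      
      Sketch of the new state and the narrowed check:
      
        enum perf_event_active_state {
          PERF_EVENT_STATE_DEAD     = -4,  /* new: no observers left */
          PERF_EVENT_STATE_EXIT     = -3,  /* task gone, still inherited */
          PERF_EVENT_STATE_ERROR    = -2,
          PERF_EVENT_STATE_OFF      = -1,
          PERF_EVENT_STATE_INACTIVE =  0,
          PERF_EVENT_STATE_ACTIVE   =  1,
        };
        
        static bool is_orphaned_event(struct perf_event *event)
        {
          /* only a DEAD parent inhibits cloning now */
          return event->state == PERF_EVENT_STATE_DEAD;
        }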
      Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Fixes: c6e5b732 ("perf: Synchronously clean up child events")
      Link: http://lkml.kernel.org/r/20160224174947.919845295@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Only update context time when active · 6f932e5b
      Committed by Peter Zijlstra
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174947.860690919@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Allow perf_release() with !event->ctx · a4f4bb6d
      Committed by Peter Zijlstra
      In the err_file: fput(event_file) case, the event will not yet have
      been attached to a context. However perf_release() does assume it has
      been. Cure this.
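      
      A sketch of the guard (assuming the perf_event_release_kernel() entry
      point; simplified):
      
        struct perf_event_context *ctx = event->ctx;
        
        /*
         * If we got here through err_file: fput(event_file), the event
         * was never attached to a context; just tear the event down.
         */
        if (!ctx) {
          WARN_ON_ONCE(event->attach_state &
                       (PERF_ATTACH_CONTEXT | PERF_ATTACH_GROUP));
          goto no_ctx;
        }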
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174947.793996260@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Do not double free · 13005627
      Committed by Peter Zijlstra
      In case of: err_file: fput(event_file), we'll end up calling
      perf_release() which in turn will free the event.
      
      Do not then free the event _again_.
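      
      Sketch of the guard on perf_event_open()'s error path:
      
        err_file:
          fput(event_file);  /* may call perf_release() and free the event */
        err_context:
          perf_unpin_context(ctx);
          put_ctx(ctx);
        err_alloc:
          /*
           * If event_file was set, the fput() above already went through
           * ->release() and freed the event; do not free it again.
           */
          if (!event_file)
            free_event(event);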
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Close install vs. exit race · 84c4e620
      Committed by Peter Zijlstra
      Consider the following scenario:
      
        CPU0					CPU1
      
        ctx = find_get_ctx();
      					perf_event_exit_task_context()
        mutex_lock(&ctx->mutex);
        perf_install_in_context(ctx, ...);
          /* NO-OP */
        mutex_unlock(&ctx->mutex);
      
        ...
      
        perf_release()
          WARN_ON_ONCE(event->state != STATE_EXIT);
      
      Since the event doesn't pass through perf_remove_from_context() (as
      perf_install_in_context() NO-OPs on the dead ctx), and
      perf_event_exit_task_context() will not observe the event because it's
      not attached yet, the event->state will never be set.
      
      Solve this by revalidating ctx->task after we acquire ctx->mutex and
      failing the event creation as a whole.
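      
      Sketch of the revalidation in perf_event_open():
      
        mutex_lock(&ctx->mutex);
        /*
         * Revalidate: perf_event_exit_task_context() may have marked
         * the task dead between find_get_context() and taking the lock.
         */
        if (ctx->task == TASK_TOMBSTONE) {
          err = -ESRCH;
          goto err_locked;
        }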
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dvyukov@google.com
      Cc: eranian@google.com
      Cc: oleg@redhat.com
      Cc: panand@redhat.com
      Cc: sasha.levin@oracle.com
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160224174947.626853419@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 24 February 2016 (2 commits)
    • tracing: Fix showing function event in available_events · d045437a
      Committed by Steven Rostedt (Red Hat)
      The ftrace:function event is only displayed for parsing the function tracer
      data. It is not used to enable function tracing, and does not include an
      "enable" file in its event directory.
      
      Originally, this event was kept separate from other events because it did
      not have a ->reg parameter. But perf added a "reg" parameter for its use,
      which caused issues, because it made the event available to functions
      with which it was not compatible.
      
      Commit 9b63776f "tracing: Do not enable function event with enable"
      added a TRACE_EVENT_FL_IGNORE_ENABLE flag that prevented the function event
      from being enabled by normal trace events. But this commit missed keeping
      the function event from being displayed in the "available_events" file,
      which is used to show what events can be enabled by set_event.
      
      One documented way to enable all events is to:
      
       cat available_events > set_event
      
      But because the function event is displayed in available_events, this
      now causes an EINVAL error:
      
       cat: write error: Invalid argument
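      
      Sketch of the fix in t_next(), the available_events iterator: also skip
      entries flagged TRACE_EVENT_FL_IGNORE_ENABLE:
      
        list_for_each_entry_continue(file, &tr->events, list) {
          call = file->event_call;
          /*
           * The ftrace subsystem is for showing formats only; its
           * events can not be enabled, so keep them out of the list.
           */
          if (call->class && call->class->reg &&
              !(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE))
            return file;
        }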
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Fixes: 9b63776f "tracing: Do not enable function event with enable"
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • devm_memremap: Fix error value when memremap failed · 93f834df
      Committed by Toshi Kani
      devm_memremap() returns an ERR_PTR() value in case of error.
      However, it returns NULL when memremap() failed.  This causes
      the caller, such as the pmem driver, to proceed and oops later.
      
      Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
      failed.
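      
      The shape of the fix, sketched:
      
        addr = memremap(offset, size, flags);
        if (!addr) {
          devres_free(ptr);
          return ERR_PTR(-ENXIO);  /* was: return NULL */
        }
        
        *ptr = addr;
        devres_add(dev, ptr);
        return addr;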
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 21 February 2016 (1 commit)
    • kernel/resource.c: fix muxed resource handling in __request_region() · 59ceeaaf
      Committed by Simon Guinot
      In __request_region, if a conflict with a BUSY and MUXED resource is
      detected, then the caller goes to sleep and waits for the resource to be
      released.  A pointer to the conflicting resource is kept.  At wake-up
      this pointer is used as a parent to retry requesting the region.
      
      A first problem is that this pointer might well be invalid (if for
      example the conflicting resource has already been freed).  Another
      problem is that the next call to __request_region() fails to detect a
      remaining conflict.  The previously conflicting resource is passed as a
      parameter and __request_region() will look for a conflict among the
      children of this resource and not at the resource itself.  It is likely
      to succeed anyway, even if there is still a conflict.
      
      Instead, the parent of the conflicting resource should be passed to
      __request_region().
      
      As a fix, this patch doesn't update the parent resource pointer in the
      case we have to wait for a muxed region right after.
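      
      Sketch of the change in __request_region()'s conflict loop:
      
        if (conflict != parent) {
          if (!(conflict->flags & IORESOURCE_BUSY)) {
            parent = conflict;  /* descend only into non-BUSY conflicts */
            continue;
          }
        }
        if (conflict->flags & flags & IORESOURCE_MUXED) {
          /*
           * Sleep and retry without having updated 'parent', so the
           * retry re-checks for conflicts at the same level even if
           * the old conflicting resource has been freed meanwhile.
           */
          ...
        }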
      Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com>
      Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
      Tested-by: Vincent Donnefort <vdonnefort@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 20 February 2016 (1 commit)
    • tracing, kasan: Silence Kasan warning in check_stack of stack_tracer · 6e22c836
      Committed by Yang Shi
      When enabling stack trace via "echo 1 > /proc/sys/kernel/stack_tracer_enabled",
      the below KASAN warning is triggered:
      
      BUG: KASAN: stack-out-of-bounds in check_stack+0x344/0x848 at addr ffffffc0689ebab8
      Read of size 8 by task ksoftirqd/4/29
      page:ffffffbdc3a27ac0 count:0 mapcount:0 mapping:          (null) index:0x0
      flags: 0x0()
      page dumped because: kasan: bad access detected
      CPU: 4 PID: 29 Comm: ksoftirqd/4 Not tainted 4.5.0-rc1 #129
      Hardware name: Freescale Layerscape 2085a RDB Board (DT)
      Call trace:
      [<ffffffc000091300>] dump_backtrace+0x0/0x3a0
      [<ffffffc0000916c4>] show_stack+0x24/0x30
      [<ffffffc0009bbd78>] dump_stack+0xd8/0x168
      [<ffffffc000420bb0>] kasan_report_error+0x6a0/0x920
      [<ffffffc000421688>] kasan_report+0x70/0xb8
      [<ffffffc00041f7f0>] __asan_load8+0x60/0x78
      [<ffffffc0002e05c4>] check_stack+0x344/0x848
      [<ffffffc0002e0c8c>] stack_trace_call+0x1c4/0x370
      [<ffffffc0002af558>] ftrace_ops_no_ops+0x2c0/0x590
      [<ffffffc00009f25c>] ftrace_graph_call+0x0/0x14
      [<ffffffc0000881bc>] fpsimd_thread_switch+0x24/0x1e8
      [<ffffffc000089864>] __switch_to+0x34/0x218
      [<ffffffc0011e089c>] __schedule+0x3ac/0x15b8
      [<ffffffc0011e1f6c>] schedule+0x5c/0x178
      [<ffffffc0001632a8>] smpboot_thread_fn+0x350/0x960
      [<ffffffc00015b518>] kthread+0x1d8/0x2b0
      [<ffffffc0000874d0>] ret_from_fork+0x10/0x40
      Memory state around the buggy address:
       ffffffc0689eb980: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
       ffffffc0689eba00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
      >ffffffc0689eba80: 00 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00
                                              ^
       ffffffc0689ebb00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffffffc0689ebb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      The stack tracer traverses the whole kernel stack when saving the max stack
      trace. It may touch the stack red zones, which causes the warning. So, just
      disable the instrumentation to silence the warning.
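      
      The fix, sketched: perform the stack read in check_stack() through
      READ_ONCE_NOCHECK(), which bypasses KASAN instrumentation for that one
      access:
      
        /* before: instrumented load that trips over the redzones */
        if (*p == stack_dump_trace[i]) {
      
        /* after: same value, but KASAN does not check this access */
        if (READ_ONCE_NOCHECK(*p) == stack_dump_trace[i]) {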
      
      Link: http://lkml.kernel.org/r/1455309960-18930-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  6. 19 February 2016 (1 commit)
  7. 18 February 2016 (1 commit)
  8. 17 February 2016 (5 commits)
  9. 12 February 2016 (2 commits)
  10. 11 February 2016 (2 commits)
    • bpf: fix branch offset adjustment on backjumps after patching ctx expansion · a1b14d27
      Committed by Daniel Borkmann
      When ctx access is used, the kernel often needs to expand/rewrite
      instructions, so after that patching, branch offsets have to be
      adjusted for both forward and backward jumps in the new eBPF program,
      but for backward jumps it fails to account for the delta. Meaning, for
      example, if the expansion happens exactly on the insn that sits at
      the jump target, it doesn't fix up the back jump offset.
      
      Analysis on what the check in adjust_branches() is currently doing:
      
        /* adjust offset of jmps if necessary */
        if (i < pos && i + insn->off + 1 > pos)
          insn->off += delta;
        else if (i > pos && i + insn->off + 1 < pos)
          insn->off -= delta;
      
      First condition (forward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- i/insn            insns[1] <--- i/insn
        insns[2] <--- pos               insns[P] <--- pos
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- target_X          insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- target_X
                                        insns[5]
      
      The first case is if we cross the pos-boundary and the jump instruction
      was before pos. This is handled correctly. I.e. if i == pos, then this
      would mean our jump that we currently check was the patchlet itself
      that we just injected. Since such patchlets are self-contained and
      have no awareness of any insns before or after the patched one, the
      delta is correctly not adjusted. Also, in the second part of the
      condition, i + insn->off + 1 == pos means we jump to that newly patched
      instruction, so no offset adjustment is needed. That part is correct.
      
      Second condition (backward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- target_X          insns[1] <--- target_X
        insns[2] <--- pos <-- target_Y  insns[P] <--- pos <-- target_Y
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- i/insn            insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- i/insn
                                        insns[5]
      
      Second interesting case is where we cross pos-boundary and the jump
      instruction was after pos. A backward jump with i == pos would be
      impossible and would indicate a bug somewhere in the patchlet, so the
      first condition checking i > pos is okay on its own. However, i +
      insn->off + 1 < pos does not always work as intended to trigger the
      adjustment. It works when jump targets would be far off where the
      delta wouldn't matter. But, for example, where the fixed insn->off
      before pointed to pos (target_Y), it now points to pos + delta, so
      that additional room needs to be taken into account for the check.
      This means that i) both tests here need to be adjusted into pos + delta,
      and ii) for the second condition, the test needs to be <= as pos
      itself can be a target in the backjump, too.
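      
      Which yields the corrected checks (sketch):
      
        /* adjust offset of jmps if necessary */
        if (i < pos && i + insn->off + 1 > pos)
          insn->off += delta;
        else if (i > pos + delta && i + insn->off + 1 <= pos + delta)
          insn->off -= delta;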
      
      Fixes: 9bac3d6d ("bpf: allow extended BPF programs access skb fields")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup · d6e022f1
      Committed by Tejun Heo
      When looking up the pool_workqueue to use for an unbound workqueue,
      workqueue assumes that the target CPU is always bound to a valid NUMA
      node.  However, currently, when a CPU goes offline, the mapping is
      destroyed and cpu_to_node() returns NUMA_NO_NODE.
      
      This has always been broken but hasn't triggered often enough before
      874bbfe6 ("workqueue: make sure delayed work run in local cpu").
      After the commit, workqueue forcefully assigns the local CPU for
      delayed work items without an explicit target CPU to fix a different
      issue.  This widens the window in which a CPU can go offline while a
      delayed work item is pending, causing delayed work items to be
      dispatched with their target CPU set to an already offlined CPU.  The
      resulting
      NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
      NULL pool_workqueue and thus crash.
      
      While 874bbfe6 has been reverted for a different reason making the
      bug less visible again, it can still happen.  Fix it by mapping
      NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
      This is a temporary workaround.  The long term solution is keeping CPU
      -> NODE mapping stable across CPU off/online cycles which is being
      worked on.
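      
      Sketch of the workaround in unbound_pwq_by_node():
      
        static struct pool_workqueue *
        unbound_pwq_by_node(struct workqueue_struct *wq, int node)
        {
          assert_rcu_or_wq_mutex_or_pool_mutex(wq);
        
          /*
           * Temporary workaround: @node can be NUMA_NO_NODE if the CPU
           * a delayed work item targeted went offline.  Fall back to
           * the default pwq until CPU -> NODE mappings are kept stable
           * across CPU on/offline cycles.
           */
          if (unlikely(node == NUMA_NO_NODE))
            return wq->dfl_pwq;
        
          return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
        }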
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
      Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com