1. 18 8月, 2016 7 次提交
    • M
      sched/core: Pass child domain into sd_init() · 3676b13e
      Morten Rasmussen 提交于
      If behavioural sched_domain flags depend on topology flags set at higher
      domain levels we need a way to update the child domain flags. Moving the
      child pointer assignment inside sd_init() should make that possible.
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-7-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3676b13e
    • M
      sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag · 1f6e6c7c
      Morten Rasmussen 提交于
      Add a topology flag to the sched_domain hierarchy indicating the lowest
      domain level where the full range of CPU capacities is represented by
      the domain members for asymmetric capacity topologies (e.g. ARM
      big.LITTLE).
      
      The flag is intended to indicate that extra care should be taken when
      placing tasks on CPUs and this level spans all the different types of
      CPUs found in the system (no need to look further up the domain
      hierarchy). This information is currently only available through
      iterating through the capacities of all the CPUs at parent levels in the
      sched_domain hierarchy.
      
        SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY
      
        SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY
      
        CPU:         0      1      2      3
        capacity:  756    756   1024   1024
      
      If the topology in the example above is duplicated to create an eight
      CPU example with third sched_domain level on top (SD 3), this level
      should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two group
      would both have all CPU capacities represented within them.
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1f6e6c7c
    • M
      sched/core: Remove unnecessary NULL-pointer check · 0e6d2a67
      Morten Rasmussen 提交于
      Checking if the sched_domain pointer returned by sd_init() is NULL seems
      pointless as sd_init() neither checks if it is valid to begin with nor
      set it to NULL.
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1469453670-2660-5-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0e6d2a67
    • P
      sched/core: Clarify SD_flags comment · 94f438c8
      Peter Zijlstra 提交于
      The SD_flags comment is very terse and doesn't explain why PACKING is
      odd.
      
      IIRC the distinction is that the 'normal' ones only describe topology,
      while the ASYM_PACKING one also prescribes behaviour. It is odd in the
      way that it doesn't only describe things.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: freedom.tan@mediatek.com
      Cc: keita.kobayashi.ym@renesas.com
      Cc: mgalbraith@suse.de
      Cc: sgurrappadi@nvidia.com
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/20160815105459.GS6879@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      94f438c8
    • W
      sched/cputime: Resync steal time when guest & host lose sync · 03cbc732
      Wanpeng Li 提交于
      Commit:
      
        57430218 ("sched/cputime: Count actually elapsed irq & softirq time")
      
      ... fixed a bug but also triggered a regression:
      
      On an i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
      CPU hog processes(for loop) running in the guest, I hot-unplug the pCPUs
      on host one by one until there is only one left, then observe CPU utilization
      via 'top' in the guest, it shows:
      
        100% st for cpu0(housekeeping)
         75% st for other CPUs (nohz full mode)
      
      However, w/o this commit it shows the correct 75% for all four CPUs.
      
      When a guest is interrupted for a longer amount of time, missed clock ticks
      are not redelivered later. Because of that, we should not limit the amount
      of steal time accounted to the amount of time that the calling functions
      think have passed.
      
      However, the interval returned by account_other_time() is NOT rounded down
      to the nearest jiffy, while the base interval in get_vtime_delta() it is
      subtracted from is, so the max cputime limit is required to avoid underflow.
      
      This patch fixes the regression by limiting the account_other_time() from
      get_vtime_delta() to avoid underflow, and lets the other three call sites
      (in account_other_time() and steal_account_process_time()) account however
      much steal time the host told us elapsed.
      Suggested-by: NRik van Riel <riel@redhat.com>
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
      [ Improved the changelog. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      03cbc732
    • R
      sched: Remove struct rq::nohz_stamp · 1fc770d5
      Rik van Riel 提交于
      The nohz_stamp member of struct rq has been unused since 2010,
      when this commit removed the code that referenced it:
      
        396e894d ("sched: Revert nohz_ratelimit() for now")
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160815121410.5ea1c98f@annuminas.surriel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1fc770d5
    • P
      sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression · 173be9a1
      Peter Zijlstra 提交于
      Mike reports:
      
       Roughly 10% of the time, ltp testcase getrusage04 fails:
       getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
       getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
       getrusage04    0  TINFO  :  utime:           0us; stime:         179us
       getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
       getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased > 5000us:
      
      And tracked it down to the case where the task simply doesn't get
      _any_ [us]time ticks.
      
      Update the code to assume all rtime is utime when we lack information,
      thus ensuring a task that elides the tick gets time accounted.
      Reported-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: stable@vger.kernel.org # 4.3+
      Fixes: 9d7fb042 ("sched/cputime: Guarantee stime + utime == rtime")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      173be9a1
  2. 13 8月, 2016 1 次提交
  3. 11 8月, 2016 2 次提交
  4. 10 8月, 2016 18 次提交
    • V
      sched/debug: Add taint on "BUG: Sleeping function called from invalid context" · f0b22e39
      Vegard Nossum 提交于
      Seeing this, it occurs to me that we should probably add a taint here:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17160 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17198 ffffffff81158067 0000000000000de6
           ffff88011a3c4c80 ffffffff8390e07c 0000000000000184 0000000000000000
          Call Trace:
          [...]
      
          BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1309
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17e08 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17e40 ffffffff81158067 0000000000000000
           ffff88011a3c4c80 ffffffff83437b20 000000000000051d 0000000000000000
          Call Trace:
          [...]
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Link: http://lkml.kernel.org/r/1469216762-19626-1-git-send-email-vegard.nossum@oracle.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f0b22e39
    • V
      sched/debug: Make the "Preemption disabled at ..." message more useful · d1c6d149
      Vegard Nossum 提交于
      This message is currently really useless since it always prints a value
      that comes from the printk() we just did, e.g.:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          BUG: sleeping function called from invalid context at include/linux/freezer.h:56
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
      Here, both down_trylock() and console_unlock() is somewhere in the
      printk() path.
      
      We should save the value before calling printk() and use the saved value
      instead. That immediately reveals the offending callsite:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
          Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150
      
      Bug report:
      
        http://marc.info/?l=linux-netdev&m=146925979821849&w=2Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d1c6d149
    • P
      locking/pvqspinlock: Fix a bug in qstat_read() · c2ace36b
      Pan Xinhui 提交于
      It's obviously wrong to set stat to NULL. So lets remove it.
      Otherwise it is always zero when we check the latency of kick/wake.
      Signed-off-by: NPan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NWaiman Long <Waiman.Long@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c2ace36b
    • W
      locking/pvqspinlock: Fix double hash race · 229ce631
      Wanpeng Li 提交于
      When the lock holder vCPU is racing with the queue head:
      
         CPU 0 (lock holder)    CPU1 (queue head)
         ===================    =================
         spin_lock();           spin_lock();
          pv_kick_node():        pv_wait_head_or_lock():
                                  if (!lp) {
                                   lp = pv_hash(lock, pn);
                                   xchg(&l->locked, _Q_SLOW_VAL);
                                  }
                                  WRITE_ONCE(pn->state, vcpu_halted);
           cmpxchg(&pn->state,
            vcpu_halted, vcpu_hashed);
           WRITE_ONCE(l->locked, _Q_SLOW_VAL);
           (void)pv_hash(lock, pn);
      
      In this case, lock holder inserts the pv_node of queue head into the
      hash table and set _Q_SLOW_VAL unnecessary. This patch avoids it by
      restoring/setting vcpu_hashed state after failing adaptive locking
      spinning.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NPan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <Waiman.Long@hpe.com>
      Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      229ce631
    • J
      sched/deadline: Remove useless parameter from setup_new_dl_entity() · 98b0a857
      Juri Lelli 提交于
      setup_new_dl_entity() takes two parameters, but it only actually uses
      one of them, under a different name, to setup a new dl_entity, after:
      
        2f9f3fdc928 "sched/deadline: Remove dl_new from struct sched_dl_entity"
      
      as we currently do:
      
        setup_new_dl_entity(&p->dl, &p->dl)
      
      However, before Luca's change we were doing:
      
        setup_new_dl_entity(dl_se, pi_se)
      
      in update_dl_entity() for a dl_se->new entity: we were using pi_se's
      parameters (the potential PI donor) for setting up a new entity.
      
      This change removes the useless second parameter of setup_new_dl_entity().
      
      While we are at it we also optimize things further calling setup_new_dl_
      entity() only for already queued tasks, since (as pointed out by Xunlei)
      we already do the very same update at tasks wakeup time anyway. By doing
      so, we don't need to worry about a potential PI donor anymore, as
      rt_mutex_setprio() takes care of that already for us.
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@unitn.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xunlei Pang <xpang@redhat.com>
      Link: http://lkml.kernel.org/r/1470409675-20935-1-git-send-email-juri.lelli@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      98b0a857
    • L
      sched/core: Add documentation for 'cookie' argument · 9279e0d2
      Luis de Bethencourt 提交于
      Add documentation for the cookie argument in try_to_wake_up_local().
      
      This caused the following warning when building documentation:
      
        kernel/sched/core.c:2088: warning: No description found for parameter 'cookie'
      Signed-off-by: NLuis de Bethencourt <luisbg@osg.samsung.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Fixes: e7904a28 ("ilocking/lockdep, sched/core: Implement a better lock pinning scheme")
      Link: http://lkml.kernel.org/r/1468159226-17674-1-git-send-email-luisbg@osg.samsung.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9279e0d2
    • M
      sched/fair: Optimize find_idlest_cpu() when there is no choice · eaecf41f
      Morten Rasmussen 提交于
      In the current find_idlest_group()/find_idlest_cpu() search we end up
      calling find_idlest_cpu() in a sched_group containing only one CPU in
      the end. Checking idle-states becomes pointless when there is no
      alternative, so bail out instead.
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: linux-kernel@vger.kernel.org
      Cc: mgalbraith@suse.de
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1466615004-3503-4-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      eaecf41f
    • M
      sched/fair: Make the use of prev_cpu consistent in the wakeup path · 772bd008
      Morten Rasmussen 提交于
      In commit:
      
        ac66f547 ("sched/numa: Introduce migrate_swap()")
      
      select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
      in special cases (NUMA task swapping).
      
      However, the select_task_rq_fair() helper functions: wake_affine() and
      select_idle_sibling(), still use task_cpu(p) directly to work out
      prev_cpu, which leads to inconsistencies.
      
      This patch passes prev_cpu (potentially overridden by NUMA code) into
      the helper functions to ensure prev_cpu is indeed the same CPU
      everywhere in the wakeup path.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: linux-kernel@vger.kernel.org
      Cc: mgalbraith@suse.de
      Cc: vincent.guittot@linaro.org
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1466615004-3503-3-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      772bd008
    • P
      sched/fair: Improve PELT stuff some more · 7c3edd2c
      Peter Zijlstra 提交于
      Vincent noted that the update_tg_load_avg() usage in commit:
      
        3d30544f ("sched/fair: Apply more PELT fixes")
      
      isn't entirely sufficient. We need to call this function every time
      cfs_rq->avg.load changes, this includes when update_cfs_rq_load_avg()
      returns true, but {attach,detach}_entity_load_avg() themselves also
      change it. This means we need to unconditionally call
      update_tg_load_avg().
      
      Also, add more comments.
      Reported-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7c3edd2c
    • L
      sched/core: Fix one typo · a1fd4656
      Leo Yan 提交于
      Fix one minor typo in the comment: s/targer/target/.
      Signed-off-by: NLeo Yan <leo.yan@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470378758-15066-1-git-send-email-leo.yan@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a1fd4656
    • L
      sched/fair: Remove 'cpu_busy' parameter from update_next_balance() · 31851a98
      Leo Yan 提交于
      The update_next_balance() function is only used by idle balancing, so its
      'cpu_busy' parameter is always 0.
      
      Open code it instead of passing it around.
      Signed-off-by: NLeo Yan <leo.yan@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470378689-14892-1-git-send-email-leo.yan@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      31851a98
    • W
      sched/deadline: Fix lock pinning warning during CPU hotplug · c0c8c9fa
      Wanpeng Li 提交于
      The following warning can be triggered by hot-unplugging the CPU
      on which an active SCHED_DEADLINE task is running on:
      
        WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
        releasing a pinned lock
        Call Trace:
         dump_stack+0x99/0xd0
         __warn+0xd1/0xf0
         ? dl_task_timer+0x1a1/0x2b0
         warn_slowpath_fmt+0x4f/0x60
         ? sched_clock+0x13/0x20
         lock_release+0x690/0x6a0
         ? enqueue_pushable_dl_task+0x9b/0xa0
         ? enqueue_task_dl+0x1ca/0x480
         _raw_spin_unlock+0x1f/0x40
         dl_task_timer+0x1a1/0x2b0
         ? push_dl_task.part.31+0x190/0x190
        WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
        unpinning an unpinned lock
        Call Trace:
         dump_stack+0x99/0xd0
         __warn+0xd1/0xf0
         warn_slowpath_fmt+0x4f/0x60
         lock_unpin_lock+0x181/0x1a0
         dl_task_timer+0x127/0x2b0
         ? push_dl_task.part.31+0x190/0x190
      
      As per the comment before this code, its safe to drop the RQ lock
      here, and since we (potentially) change rq, unpin and repin to avoid
      the splat.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      [ Rewrote changelog. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@unitn.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c0c8c9fa
    • G
      sched/cputime: Mitigate performance regression in times()/clock_gettime() · 6075620b
      Giovanni Gherdovich 提交于
      Commit:
      
        6e998916 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
      
      fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
      allow a task to wake early. It addressed the problem by calling the scheduling
      classes update_curr() when the cputimer starts.
      
      Said change induced a considerable performance regression on the syscalls
      times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some
      debuggers and applications that monitor their own performance that
      accidentally depend on the performance of these specific calls.
      
      This patch mitigates the performace loss by prefetching data in the CPU
      cache, as stalls due to cache misses appear to be where most time is spent
      in our benchmarks.
      
      Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
      box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
      variable number of threads, from 2 to 4*num_cpus; the results are in
      seconds and correspond to the average of 10 runs; the percentage gain is
      computed with (before-after)/before so a positive value is an improvement
      (it's faster). The improvement varies between a few percents for 5-20
      threads and more than 10% for 2 or >20 threads.
      
      pound_clock_gettime:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.48        3.06 ( 11.83%)
            5           3.33        3.25 (  2.40%)
            8           3.37        3.26 (  3.30%)
           12           3.32        3.37 ( -1.60%)
           21           4.01        3.90 (  2.74%)
           30           3.63        3.36 (  7.41%)
           48           3.71        3.11 ( 16.27%)
           79           3.75        3.16 ( 15.74%)
          110           3.81        3.25 ( 14.80%)
          128           3.88        3.31 ( 14.76%)
      
      pound_times:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.65        3.25 ( 11.03%)
            5           3.45        3.17 (  7.92%)
            8           3.52        3.22 (  8.69%)
           12           3.29        3.36 ( -2.04%)
           21           4.07        3.92 (  3.78%)
           30           3.87        3.40 ( 12.17%)
           48           3.79        3.16 ( 16.61%)
           79           3.88        3.28 ( 15.42%)
          110           3.90        3.38 ( 13.35%)
          128           4.00        3.38 ( 15.45%)
      
      pound_clock_gettime and pound_clock_gettime are two benchmarks included in
      the MMTests framework. They launch a given number of threads which
      repeatedly call times() or clock_gettimes(). The results above can be
      reproduced with cloning MMTests from github.com and running the "poundtime"
      workload:
      
        $ git clone https://github.com/gormanm/mmtests.git
        $ cd mmtests
        $ cp configs/config-global-dhp__workload_poundtime config
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      The above will run "poundtime" measuring the kernel currently running on
      the machine; Once a new kernel is installed and the machine rebooted,
      running again
      
        $ cd mmtests
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      will produce results to compare with. A comparison table will be output
      with:
      
        $ cd mmtests/work/log
        $ ../../compare-kernels.sh
      
      the table will contain a lot of entries; grepping for "Amean" (as in
      "arithmetic mean") will give the tables presented above. The source code
      for the two benchmarks is reported at the end of this changelog for
      clairity.
      
      The cache misses addressed by this patch were found using a combination of
      `perf top`, `perf record` and `perf annotate`. The incriminated lines were
      found to be
      
          struct sched_entity *curr = cfs_rq->curr;
      
      and
      
          delta_exec = now - curr->exec_start;
      
      in the function update_curr() from kernel/sched/fair.c. This patch
      prefetches the data from memory just before update_curr is called in the
      interested execution path.
      
      A comparison of the total number of cycles before and after the patch
      follows; the data is obtained using `perf stat -r 10 -ddd <program>`
      running over the same sequence of number of threads used above (a positive
      gain is an improvement):
      
        threads   cycles before                 cycles after                gain
      
          2      19,699,563,964  +-1.19%      17,358,917,517  +-1.85%      11.88%
          5      47,401,089,566  +-2.96%      45,103,730,829  +-0.97%       4.85%
          8      80,923,501,004  +-3.01%      71,419,385,977  +-0.77%      11.74%
         12     112,326,485,473  +-0.47%     110,371,524,403  +-0.47%       1.74%
         21     193,455,574,299  +-0.72%     180,120,667,904  +-0.36%       6.89%
         30     315,073,519,013  +-1.64%     271,222,225,950  +-1.29%      13.92%
         48     321,969,515,332  +-1.48%     273,353,977,321  +-1.16%      15.10%
         79     337,866,003,422  +-0.97%     289,462,481,538  +-1.05%      14.33%
        110     338,712,691,920  +-0.78%     290,574,233,170  +-0.77%      14.21%
        128     348,384,794,006  +-0.50%     292,691,648,206  +-0.66%      15.99%
      
      A comparison of cache miss vs total cache loads ratios, before and after
      the patch (again from the `perf stat -r 10 -ddd <program>` tables):
      
        threads   L1 misses/total*100     L1 misses/total*100            gain
      		         before                   after
            2           7.43  +-4.90%           7.36  +-4.70%           0.94%
            5          13.09  +-4.74%          13.52  +-3.73%          -3.28%
            8          13.79  +-5.61%          12.90  +-3.27%           6.45%
           12          11.57  +-2.44%           8.71  +-1.40%          24.72%
           21          12.39  +-3.92%           9.97  +-1.84%          19.53%
           30          13.91  +-2.53%          11.73  +-2.28%          15.67%
           48          13.71  +-1.59%          12.32  +-1.97%          10.14%
           79          14.44  +-0.66%          13.40  +-1.06%           7.20%
          110          15.86  +-0.50%          14.46  +-0.59%           8.83%
          128          16.51  +-0.32%          15.06  +-0.78%           8.78%
      
      As a final note, the following shows the evolution of performance figures
      in the "poundtime" benchmark and pinpoints commit 6e998916
      ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
      major source of degradation, mostly unaddressed to this day (figures
      expressed in seconds).
      
      pound_clock_gettime:
      
        threads   parent of         6e998916        4.7-rc7
      	    6e998916            itself
          2        2.23          3.68 ( -64.56%)        3.48 (-55.48%)
          5        2.83          3.78 ( -33.42%)        3.33 (-17.43%)
          8        2.84          4.31 ( -52.12%)        3.37 (-18.76%)
          12       3.09          3.61 ( -16.74%)        3.32 ( -7.17%)
          21       3.14          4.63 ( -47.36%)        4.01 (-27.71%)
          30       3.28          5.75 ( -75.37%)        3.63 (-10.80%)
          48       3.02          6.05 (-100.56%)        3.71 (-22.99%)
          79       2.88          6.30 (-118.90%)        3.75 (-30.26%)
          110      2.95          6.46 (-119.00%)        3.81 (-29.24%)
          128      3.05          6.42 (-110.08%)        3.88 (-27.04%)
      
      pound_times:
      
        threads   parent of         6e998916        4.7-rc7
      	    6e998916            itself
          2        2.27          3.73 ( -64.71%)        3.65 (-61.14%)
          5        2.78          3.77 ( -35.56%)        3.45 (-23.98%)
          8        2.79          4.41 ( -57.71%)        3.52 (-26.05%)
          12       3.02          3.56 ( -17.94%)        3.29 ( -9.08%)
          21       3.10          4.61 ( -48.74%)        4.07 (-31.34%)
          30       3.33          5.75 ( -72.53%)        3.87 (-16.01%)
          48       2.96          6.06 (-105.04%)        3.79 (-28.10%)
          79       2.88          6.24 (-116.83%)        3.88 (-34.81%)
          110      2.98          6.37 (-114.08%)        3.90 (-31.12%)
          128      3.10          6.35 (-104.61%)        4.00 (-28.87%)
      
      The source code of the two benchmarks follows. To compile the two:
      
        NR_THREADS=42
        for FILE in pound_times pound_clock_gettime; do
            gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
        done
      
      ==== BEGIN pound_times.c ====
      
      struct tms start;
      
      void *pound (void *threadid)
      {
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;
        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                times(&end);
                utime = ((int)end.tms_utime - (int)start.tms_utime);
                if (oldutime > utime) {
                  printf("utime decreased, was %d, now %d!\n", oldutime, utime);
                }
                oldutime = utime;
        }
        pthread_exit(NULL);
      }
      
      int main()
      {
        pthread_t th[NUM_THREADS];
        long i;
        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
          pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
      }
      ==== END pound_times.c ====
      
      ==== BEGIN pound_clock_gettime.c ====
      
      void *pound (void *threadid)
      {
      	struct timespec ts;
      	int rc, i;
      	unsigned long prev = 0, this = 0;
      
      	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
      		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
      		if (rc < 0)
      			perror("clock_gettime");
      		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
      		if (0 && this < prev)
      			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
      		prev = this;
      	}
      	pthread_exit(NULL);
      }
      
      int main()
      {
      	pthread_t th[NUM_THREADS];
      	long rc, i;
      	pid_t pgid;
      
      	for (i = 0; i < NUM_THREADS; i++) {
      		rc = pthread_create(&th[i], NULL, pound, (void *)i);
      		if (rc < 0)
      			perror("pthread_create");
      	}
      
      	pthread_exit(NULL);
      	return 0;
      }
      ==== END pound_clock_gettime.c ====
      Suggested-by: NMike Galbraith <mgalbraith@suse.de>
      Signed-off-by: NGiovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.czSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6075620b
    • X
      sched/fair: Fix typo in sync_throttle() · b8922125
      Xunlei Pang 提交于
      We should update cfs_rq->throttled_clock_task, not
      pcfs_rq->throttle_clock_task.
      
      The effects of this bug was probably occasionally erratic
      group scheduling, particularly in cgroups-intense workloads.
      Signed-off-by: NXunlei Pang <xlpang@redhat.com>
      [ Added changelog. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 55e16d30 ("sched/fair: Rework throttle_count sync")
      Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b8922125
    • T
      sched/deadline: Fix wrap-around in DL heap · a23eadfa
      Tommaso Cucinotta 提交于
      Current code in cpudeadline.c has a bug in re-heapifying when adding a
      new element at the end of the heap, because a deadline value of 0 is
      temporarily set in the new elem, then cpudl_change_key() is called
      with the actual elem deadline as param.
      
      However, the function compares the new deadline to set with the one
      previously in the elem, which is 0.  So, if current absolute deadlines
      grew so much to have negative values as s64, the comparison in
      cpudl_change_key() makes the wrong decision.  Instead, as from
      dl_time_before(), the kernel should handle correctly abs deadlines
      wrap-arounds.
      
      This patch fixes the problem with a minimally invasive change that
      forces cpudl_change_key() to heapify up in this case.
      Signed-off-by: NTommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NLuca Abeni <luca.abeni@unitn.it>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1468921493-10054-2-git-send-email-tommaso.cucinotta@sssup.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a23eadfa
    • D
      perf/core: Set cgroup in CPU contexts for new cgroup events · db4a8356
      David Carrillo-Cisneros 提交于
      There's a perf stat bug easy to observer on a machine with only one cgroup:
      
        $ perf stat -e cycles -I 1000 -C 0 -G /
        #          time             counts unit events
            1.000161699      <not counted>      cycles                    /
            2.000355591      <not counted>      cycles                    /
            3.000565154      <not counted>      cycles                    /
            4.000951350      <not counted>      cycles                    /
      
      We'd expect some output there.
      
      The underlying problem is that there is an optimization in
      perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
      if the old and new cgroups in a task switch are the same.
      
      This optimization interacts with the current code in two ways
      that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
      cgroup event matches the current task. These are:
      
        1. On creation of the first cgroup event in a CPU: In current code,
        cpuctx->cpu is only set in perf_cgroup_sched_in, but due to the
        aforesaid optimization, perf_cgroup_sched_in will run until the next
        cgroup switches in that CPU. This may happen late or never happen,
        depending on system's number of cgroups, CPU load, etc.
      
        2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
        cpuctx->cgrp is set NULL. Any new cgroup event will not be sched in
        because cpuctx->cgrp == NULL until a cgroup switch occurs and
        perf_cgroup_sched_in is executed (updating cpuctx->cgrp).
      
      This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
      mirroring what list_del_event does when removing a cgroup event from CPU
      context, as introduced in:
      
        commit 68cacd29 ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")
      
      With this patch, cpuctx->cgrp is always set/clear when installing/removing
      the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
      correctly set, event_filter_match works as intended when events are
      sched in/out.
      
      After the fix, the output is as expected:
      
        $ perf stat -e cycles -I 1000 -a -G /
        #         time             counts unit events
           1.004699159          627342882      cycles                    /
           2.007397156          615272690      cycles                    /
           3.010019057          616726074      cycles                    /
      Signed-off-by: NDavid Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      db4a8356
    • P
      perf/core: Fix sideband list-iteration vs. event ordering NULL pointer deference crash · 0b8f1e2e
      Peter Zijlstra 提交于
      Vegard Nossum reported that perf fuzzing generates a NULL
      pointer dereference crash:
      
      > Digging a bit deeper into this, it seems the event itself is getting
      > created by perf_event_open() and it gets added to the pmu_event_list
      > through:
      >
      > perf_event_open()
      >  - perf_event_alloc()
      >     - account_event()
      >        - account_pmu_sb_event()
      >           - attach_sb_event()
      >
      > so at this point the event is being attached but its ->ctx is still
      > NULL. It seems like ->ctx is set just a bit later in
      > perf_event_open(), though.
      >
      > But before that, __schedule() comes along and creates a stack trace
      > similar to the one above:
      >
      > __schedule()
      >  - __perf_event_task_sched_out()
      >    - perf_iterate_sb()
      >      - perf_iterate_sb_cpu()
      >         - event_filter_match()
      >           - perf_cgroup_match()
      >             - __get_cpu_context()
      >               - (dereference ctx which is NULL)
      >
      > So I guess the question is... should the event be attached (= put on
      > the list) before ->ctx gets set? Or should the cgroup code check for a
      > NULL ->ctx?
      
      The latter seems like the simplest solution. Moving the list-add later
      creates a bit of a mess.
      Reported-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: f2fb6bef ("perf/core: Optimize side-band event delivery")
      Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0b8f1e2e
    • L
      Revert "printk: create pr_<level> functions" · a0cba217
      Linus Torvalds 提交于
      This reverts commit 874f9c7d.
      
      Geert Uytterhoeven reports:
       "This change seems to have an (unintendent?) side-effect.
      
        Before, pr_*() calls without a trailing newline characters would be
        printed with a newline character appended, both on the console and in
        the output of the dmesg command.
      
        After this commit, no new line character is appended, and the output
        of the next pr_*() call of the same type may be appended, like in:
      
          - Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000
          - Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)
          + Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)"
      
      Joe Perches says:
       "No, that is not intentional.
      
        The newline handling code inside vprintk_emit is a bit involved and
        for now I suggest a revert until this has all the same behavior as
        earlier"
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Requested-by: NJoe Perches <joe@perches.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0cba217
  5. 09 8月, 2016 3 次提交
    • C
      timers: Fix get_next_timer_interrupt() computation · 46c8f0b0
      Chris Metcalf 提交于
      The tick_nohz_stop_sched_tick() routine is not properly
      canceling the sched timer when nothing is pending, because
      get_next_timer_interrupt() is no longer returning KTIME_MAX in
      that case.  This causes periodic interrupts when none are needed.
      
      When determining the next interrupt time, we first use
      __next_timer_interrupt() to get the first expiring timer in the
      timer wheel.  If no timer is found, we return the base clock value
      plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
      timer wheel.
      
      Back in get_next_timer_interrupt(), we set the "expires" value
      by converting the timer wheel expiry (in ticks) to a nsec value.
      But we don't want to do this if the timer wheel expiry value
      indicates no timer; we want to return KTIME_MAX.
      
      Prior to commit 500462a9 ("timers: Switch to a non-cascading
      wheel") we checked base->active_timers to see if any timers
      were active, and if not, we didn't touch the expiry value and so
      properly returned KTIME_MAX.  Now we don't have active_timers.
      
      To fix this, we now just check the timer wheel expiry value to
      see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
      try to compute a new value based on it, but instead simply let the
      KTIME_MAX value in expires remain.
      
      Fixes: 500462a9 "timers: Switch to a non-cascading wheel"
      Signed-off-by: NChris Metcalf <cmetcalf@mellanox.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      46c8f0b0
    • M
      genirq/msi: Make sure PCI MSIs are activated early · f3b0946d
      Marc Zyngier 提交于
      Bharat Kumar Gogada reported issues with the generic MSI code, where the
      end-point ended up with garbage in its MSI configuration (both for the vector
      and the message).
      
      It turns out that the two MSI paths in the kernel are doing slightly different
      things:
      
      generic MSI: disable MSI -> allocate MSI -> enable MSI -> setup EP
      PCI MSI: disable MSI -> allocate MSI -> setup EP -> enable MSI
      
      And it turns out that end-points are allowed to latch the content of the MSI
      configuration registers as soon as MSIs are enabled.  In Bharat's case, the
      end-point ends up using whatever was there already, which is not what you
      want.
      
      In order to make things converge, we introduce a new MSI domain flag
      (MSI_FLAG_ACTIVATE_EARLY) that is unconditionally set for PCI/MSI. When set,
      this flag forces the programming of the end-point as soon as the MSIs are
      allocated.
      
      A consequence of this is that we have an extra activate in irq_startup, but
      that should be without much consequence.
      
      tglx: 
      
       - Several people reported a VMWare regression with PCI/MSI-X passthrough. It
         turns out that the patch also cures that issue.
      
       - We need to have a look at the MSI disable interrupt path, where we write
         the msg to all zeros without disabling MSI in the PCI device. Is that
         correct?
      
      Fixes: 52f518a3 "x86/MSI: Use hierarchical irqdomains to manage MSI interrupts"
      Reported-and-tested-by: NBharat Kumar Gogada <bharat.kumar.gogada@xilinx.com>
      Reported-and-tested-by: NFoster Snowhill <forst@forstwoof.ru>
      Reported-by: NMatthias Prager <linux@matthiasprager.de>
      Reported-by: NJason Taylor <jason.taylor@simplivity.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: linux-pci@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1468426713-31431-1-git-send-email-marc.zyngier@arm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      f3b0946d
    • A
      printk: Remove unnecessary #ifdef CONFIG_PRINTK · 574673c2
      Andreas Ziegler 提交于
      In commit 874f9c7d ("printk: create pr_<level> functions"), new
      pr_level defines were added to printk.c.
      
      These new defines are guarded by an #ifdef CONFIG_PRINTK - however,
      there is already a surrounding #ifdef CONFIG_PRINTK starting a lot
      earlier in line 249 which means the newly introduced #ifdef is
      unnecessary.
      
      Let's remove it to avoid confusion.
      Signed-off-by: NAndreas Ziegler <andreas.ziegler@fau.de>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      574673c2
  6. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  7. 04 8月, 2016 6 次提交
    • J
      jump_label: remove bug.h, atomic.h dependencies for HAVE_JUMP_LABEL · 1f69bf9c
      Jason Baron 提交于
      The current jump_label.h includes bug.h for things such as WARN_ON().
      This makes the header problematic for inclusion by kernel.h or any
      headers that kernel.h includes, since bug.h includes kernel.h (circular
      dependency).  The inclusion of atomic.h is similarly problematic.  Thus,
      this should make jump_label.h 'includable' from most places.
      
      Link: http://lkml.kernel.org/r/7060ce35ddd0d20b33bf170685e6b0fab816bdf2.1467837322.git.jbaron@akamai.comSigned-off-by: NJason Baron <jbaron@akamai.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f69bf9c
    • M
      tree-wide: replace config_enabled() with IS_ENABLED() · 97f2645f
      Masahiro Yamada 提交于
      The use of config_enabled() against config options is ambiguous.  In
      practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
      author might have used it for the meaning of IS_ENABLED().  Using
      IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc.  makes the intention
      clearer.
      
      This commit replaces config_enabled() with IS_ENABLED() where possible.
      This commit is only touching bool config options.
      
      I noticed two cases where config_enabled() is used against a tristate
      option:
      
       - config_enabled(CONFIG_HWMON)
        [ drivers/net/wireless/ath/ath10k/thermal.c ]
      
       - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
        [ drivers/gpu/drm/gma500/opregion.c ]
      
      I did not touch them because they should be converted to IS_BUILTIN()
      in order to keep the logic, but I was not sure it was the authors'
      intention.
      
      Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Stas Sergeev <stsp@list.ru>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Joshua Kinard <kumba@gentoo.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: yu-cheng yu <yu-cheng.yu@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Nikolay Martynov <mar.kolya@gmail.com>
      Cc: Huacai Chen <chenhc@lemote.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
      Cc: Rafal Milecki <zajec5@gmail.com>
      Cc: James Cowgill <James.Cowgill@imgtec.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Alex Smith <alex.smith@imgtec.com>
      Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
      Cc: Qais Yousef <qais.yousef@imgtec.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: "Luis R. Rodriguez" <mcgrof@do-not-panic.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Kalle Valo <kvalo@qca.qualcomm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Tony Wu <tung7970@gmail.com>
      Cc: Huaitong Han <huaitong.han@intel.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Gelmini <andrea.gelmini@gelma.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Rabin Vincent <rabin@rab.in>
      Cc: "Maciej W. Rozycki" <macro@imgtec.com>
      Cc: David Daney <david.daney@cavium.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97f2645f
    • J
      modules: add ro_after_init support · 444d13ff
      Jessica Yu 提交于
      Add ro_after_init support for modules by adding a new page-aligned section
      in the module layout (after rodata) for ro_after_init data and enabling RO
      protection for that section after module init runs.
      Signed-off-by: NJessica Yu <jeyu@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      444d13ff
    • R
      jump_label: disable preemption around __module_text_address(). · bdc9f373
      Rusty Russell 提交于
      Steven reported a warning caused by not holding module_mutex or
      rcu_read_lock_sched: his backtrace was corrupted but a quick audit
      found this possible cause.  It's wrong anyway...
      Reported-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      bdc9f373
    • P
      modules: Add kernel parameter to blacklist modules · be7de5f9
      Prarit Bhargava 提交于
      Blacklisting a module in linux has long been a problem.  The current
      procedure is to use rd.blacklist=module_name, however, that doesn't
      cover the case after the initramfs and before a boot prompt (where one
      is supposed to use /etc/modprobe.d/blacklist.conf to blacklist
      runtime loading). Using rd.shell to get an early prompt is hit-or-miss,
      and doesn't cover all situations AFAICT.
      
      This patch adds this functionality of permanently blacklisting a module
      by its name via the kernel parameter module_blacklist=module_name.
      
      [v2]: Rusty, use core_param() instead of __setup() which simplifies
      things.
      
      [v3]: Rusty, undo wreckage from strsep()
      
      [v4]: Rusty, simpler version of blacklisted()
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      be7de5f9
    • S
      module: Do a WARN_ON_ONCE() for assert module mutex not held · 9502514f
      Steven Rostedt 提交于
      When running with lockdep enabled, I triggered the WARN_ON() in the
      module code that asserts when module_mutex or rcu_read_lock_sched are
      not held. The issue I have is that this can also be called from the
      dump_stack() code, causing us to enter an infinite loop...
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
       Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375 #14
       Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
        ffff880215e8fa70 ffff880215e8fa70 ffffffff812fc8e3 0000000000000000
        ffffffff81d3e55b ffff880215e8fac0 ffffffff8104fc88 ffffffff8104fcab
        0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
       Call Trace:
        [<ffffffff812fc8e3>] dump_stack+0x67/0x90
        [<ffffffff8104fc88>] __warn+0xcb/0xe9
        [<ffffffff8104fcab>] ? warn_slowpath_null+0x5/0x1f
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
       Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375 #14
       Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
        ffff880215e8f7a0 ffff880215e8f7a0 ffffffff812fc8e3 0000000000000000
        ffffffff81d3e55b ffff880215e8f7f0 ffffffff8104fc88 ffffffff8104fcab
        0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
       Call Trace:
        [<ffffffff812fc8e3>] dump_stack+0x67/0x90
        [<ffffffff8104fc88>] __warn+0xcb/0xe9
        [<ffffffff8104fcab>] ? warn_slowpath_null+0x5/0x1f
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
       Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc3-test-00013-g501c2375 #14
       Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
        ffff880215e8f4d0 ffff880215e8f4d0 ffffffff812fc8e3 0000000000000000
        ffffffff81d3e55b ffff880215e8f520 ffffffff8104fc88 ffffffff8104fcab
        0000000915e88300 0000000000000046 ffffffffa019b29a 0000000000000001
       Call Trace:
        [<ffffffff812fc8e3>] dump_stack+0x67/0x90
        [<ffffffff8104fc88>] __warn+0xcb/0xe9
        [<ffffffff8104fcab>] ? warn_slowpath_null+0x5/0x1f
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 0 at kernel/module.c:268 module_assert_mutex_or_preempt+0x3c/0x3e
      [...]
      
      Which gives us rather useless information. Worse yet, there's some race
      that causes this, and I seldom trigger it, so I have no idea what
      happened.
      
      This would not be an issue if that warning was a WARN_ON_ONCE().
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      9502514f
  8. 03 8月, 2016 2 次提交
    • R
      config: add android config fragments · 27eb6622
      Rob Herring 提交于
      Copy the config fragments from the AOSP common kernel android-4.4
      branch.  It is becoming possible to run mainline kernels with Android,
      but the kernel defconfigs don't work as-is and debugging missing config
      options is a pain.  Adding the config fragments into the kernel tree,
      makes configuring a mainline kernel as simple as:
      
        make ARCH=arm multi_v7_defconfig android-base.config android-recommended.config
      
      The following non-upstream config options were removed:
      
        CONFIG_NETFILTER_XT_MATCH_QTAGUID
        CONFIG_NETFILTER_XT_MATCH_QUOTA2
        CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
        CONFIG_PPPOLAC
        CONFIG_PPPOPNS
        CONFIG_SECURITY_PERF_EVENTS_RESTRICT
        CONFIG_USB_CONFIGFS_F_MTP
        CONFIG_USB_CONFIGFS_F_PTP
        CONFIG_USB_CONFIGFS_F_ACC
        CONFIG_USB_CONFIGFS_F_AUDIO_SRC
        CONFIG_USB_CONFIGFS_UEVENT
        CONFIG_INPUT_KEYCHORD
        CONFIG_INPUT_KEYRESET
      
      Link: http://lkml.kernel.org/r/1466708235-28593-1-git-send-email-robh@kernel.orgSigned-off-by: NRob Herring <robh@kernel.org>
      Cc: Amit Pundir <amit.pundir@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dmitry Shmidt <dimitrysh@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27eb6622
    • A
      relay: add global mode support for buffer-only channels · 59dbb2a0
      Akash Goel 提交于
      Commit 20d8b67c ("relay: add buffer-only channels; useful for early
      logging") added support to use channels with no associated files.
      
      This is useful when the exact location of relay file is not known or the
      the parent directory of relay file is not available, while creating the
      channel and the logging has to start right from the boot.
      
      But there was no provision to use global mode with buffer-only channels,
      which is added by this patch, without modifying the interface where
      initially there will be a dummy invocation of create_buf_file callback
      through which kernel client can convey the need of a global buffer.
      
      For the use case where drivers/kernel clients want a simple interface
      for the userspace, which enables them to capture data/logs from relay
      file inorder & without any post processing, support of Global buffer
      mode is warranted.
      
      Modules, like i915, using relay_open() in early init would have to later
      register their buffer-only relays, once debugfs is available, by calling
      relay_late_setup_files().  Hence relay_late_setup_files() symbol also
      needs to be exported.
      
      Link: http://lkml.kernel.org/r/1468404563-11653-1-git-send-email-akash.goel@intel.comSigned-off-by: NAkash Goel <akash.goel@intel.com>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59dbb2a0