1. 20 December 2015, 3 commits
  2. 04 December 2015, 11 commits
    • locking/pvqspinlock: Queue node adaptive spinning · cd0272fa
      Committed by Waiman Long
      In an overcommitted guest where some vCPUs have to be halted to make
      forward progress in other areas, it is highly likely that a vCPU later
      in the spinlock queue will be spinning while the ones earlier in the
      queue would have been halted. The spinning in the later vCPUs is then
      just a waste of precious CPU cycles, because they are not going to
      get the lock any time soon: the earlier ones have to be woken up and
      take their turn to get the lock first.
      
      This patch implements an adaptive spinning mechanism where the vCPU
      will call pv_wait() if the previous vCPU is not running.
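
      The heart of the mechanism is a small wait-early test, sampled every
      few iterations of the queue-node spin loop. A minimal sketch, following
      the shape of the PV queued-spinlock code (PV_PREV_CHECK_MASK is the
      sampling mask used there):

      	/*
      	 * Sketch: check the previous node's vCPU state once every
      	 * (PV_PREV_CHECK_MASK + 1) spin iterations; stop spinning
      	 * early (and pv_wait() instead) when it is not running.
      	 */
      	static inline bool pv_wait_early(struct pv_node *prev, int loop)
      	{
      		if ((loop & PV_PREV_CHECK_MASK) != 0)
      			return false;

      		return READ_ONCE(prev->state) != vcpu_running;
      	}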
      
      Linux kernel builds were run in KVM guest on an 8-socket, 4
      cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
      Haswell-EX system. Both systems are configured to have 32 physical
      CPUs. The kernel build times before and after the patch were:
      
                          Westmere                    Haswell
        Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
        -----         --------    --------    --------    --------
        Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
        After patch    3m03.0s     4m37.5s     1m43.0s     2m47.2s
      
      For 32 vCPUs, this patch doesn't cause any noticeable change in
      performance. For 48 vCPUs (over-committed), there is about 8%
      performance improvement.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/pvqspinlock: Allow limited lock stealing · 1c4941fd
      Committed by Waiman Long
      This patch allows one attempt for the lock waiter to steal the lock
      when entering the PV slowpath. To prevent lock starvation, the pending
      bit will be set by the queue head vCPU when it is in the active lock
      spinning loop to disable any lock stealing attempt.  This helps to
      reduce the performance penalty caused by lock waiter preemption while
      not having much of the downsides of a real unfair lock.
      
      The pv_wait_head() function was renamed to pv_wait_head_or_lock()
      as it was modified to acquire the lock before returning. This is
      necessary because of possible lock stealing attempts from other tasks.
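
      A minimal sketch of the one-shot steal attempt, following the shape of
      the PV slowpath code; the lock is grabbed only when neither the locked
      byte nor the pending bit is set:

      	/*
      	 * Sketch: a single trylock attempt made before queuing.
      	 * The pending bit, set by the queue head vCPU, disables it.
      	 */
      	static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
      	{
      		struct __qspinlock *l = (void *)lock;

      		return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
      		       (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
      	}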
      
      Linux kernel builds were run in KVM guest on an 8-socket, 4
      cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
      Haswell-EX system. Both systems are configured to have 32 physical
      CPUs. The kernel build times before and after the patch were:
      
                          Westmere                    Haswell
        Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
        -----         --------    --------    --------    --------
        Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
        After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s
      
      For the overcommitted case (48 vCPUs), this patch is able to reduce
      kernel build time by more than 54% for Westmere and 44% for Haswell.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/pvqspinlock: Collect slowpath lock statistics · 45e898b7
      Committed by Waiman Long
      This patch enables the accumulation of kicking and waiting related
      PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
      option is selected. It also enables the collection of data which
      enables us to calculate the kicking and wakeup latencies, which
      depend heavily on the CPUs being used.
      
      The statistical counters are per-cpu variables to minimize the
      performance overhead in their updates. These counters are exported
      via the debugfs filesystem under the qlockstat directory.  When the
      corresponding debugfs files are read, the required sums and derived
      values are computed.
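
      A minimal sketch of the counting scheme (qstat_inc()/qstat_read() and
      the array name are illustrative, not the exact upstream identifiers):

      	/* Sketch: lock-free per-cpu event counters, summed on read. */
      	static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

      	static inline void qstat_inc(enum qlock_stats stat)
      	{
      		this_cpu_inc(qstats[stat]);
      	}

      	/* debugfs read handler: sum the per-cpu counts of one event */
      	static u64 qstat_read(enum qlock_stats stat)
      	{
      		u64 sum = 0;
      		int cpu;

      		for_each_possible_cpu(cpu)
      			sum += per_cpu(qstats[stat], cpu);
      		return sum;
      	}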
      
      The measured latencies for different CPUs are:
      
      	CPU		Wakeup		Kicking
      	---		------		-------
      	Haswell-EX	63.6us		 7.4us
      	Westmere-EX	67.6us		 9.3us
      
      The measured latencies varied a bit from run to run. The wakeup
      latency is much higher than the kicking latency.
      
      A sample of statistical counters after system bootup (with vCPU
      overcommit) was:
      
      	pv_hash_hops=1.00
      	pv_kick_unlock=1148
      	pv_kick_wake=1146
      	pv_latency_kick=11040
      	pv_latency_wake=194840
      	pv_spurious_wakeup=7
      	pv_wait_again=4
      	pv_wait_head=23
      	pv_wait_node=1129
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core, locking: Document Program-Order guarantees · 8643cda5
      Committed by Peter Zijlstra
      These are some notes on the scheduler locking and how it provides
      program order guarantees on SMP systems.
      
      ( This commit is in the locking tree, because the new documentation
        refers to a newly introduced locking primitive. )
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking, sched: Introduce smp_cond_acquire() and use it · b3e0b1b6
      Committed by Peter Zijlstra
      Introduce smp_cond_acquire() which combines a control dependency and a
      read barrier to form acquire semantics.
      
      This primitive has two benefits:
      
       - it documents control dependencies,
       - it's typically cheaper than using smp_load_acquire() in a loop.
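
      The primitive itself is tiny; as introduced here it is essentially:

      	#define smp_cond_acquire(cond)	do {		\
      		while (!(cond))				\
      			cpu_relax();			\
      		smp_rmb(); /* ctrl + rmb := acquire */	\
      	} while (0)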
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() · ecf7d01c
      Committed by Peter Zijlstra
      Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
      that we'll prematurely continue with the wakeup and effectively run p on
      two CPUs at the same time.
      
      Even though the overlap is very limited (the task is in the middle of
      being scheduled out), it could still result in corruption of the
      scheduler data structures.
      
              CPU0                            CPU1
      
              set_current_state(...)
      
              <preempt_schedule>
                context_switch(X, Y)
                  prepare_lock_switch(Y)
                    Y->on_cpu = 1;
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
                                              try_to_wake_up(X)
                                                LOCK(p->pi_lock);
      
                                                t = X->on_cpu; // 0
      
                context_switch(Y, X)
                  prepare_lock_switch(X)
                    X->on_cpu = 1;
                  finish_lock_switch(Y)
                    store_release(Y->on_cpu, 0);
              </preempt_schedule>
      
              schedule();
                deactivate_task(X);
                X->on_rq = 0;
      
                                                if (X->on_rq) // false
      
                                                if (t) while (X->on_cpu)
                                                  cpu_relax();
      
                context_switch(X, ..)
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
      Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
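
      The fix is a read barrier between the two loads in try_to_wake_up();
      a sketch of the relevant fragment:

      	/* Pairs with the smp_store_release() in finish_lock_switch(). */
      	if (p->on_rq && ttwu_remote(p, wake_flags))
      		goto stat;

      	smp_rmb();	/* order the p->on_rq load before p->on_cpu */

      	while (p->on_cpu)
      		cpu_relax();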
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Better document the try_to_wake_up() barriers · b75a2253
      Committed by Peter Zijlstra
      Explain how the control dependency and smp_rmb() end up providing
      ACQUIRE semantics and pair with smp_store_release() in
      finish_lock_switch().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/cputime: Fix invalid gtime in proc · 2541117b
      Committed by Hiroshi Shimamoto
      /proc/<pid>/stat shows an invalid gtime when the thread is running in a guest.
      When vtime accounting is not enabled, we cannot get a valid delta.
      The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
      is only updated when vtime accounting is runtime enabled.
      
      This patch makes task_gtime() just return gtime without computing the
      buggy non-existing tickless delta when vtime accounting is not enabled.
      
      Use context_tracking_is_enabled() to check whether vtime accounting
      is enabled on some CPU; only in that case do we need to check the
      tickless delta. This way we fix the gtime value regression on
      machines not running nohz full.
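
      The fix amounts to an early exit in task_gtime(); a trimmed sketch:

      	cputime_t task_gtime(struct task_struct *t)
      	{
      		/* No tickless delta can exist when vtime is off. */
      		if (!context_tracking_is_enabled())
      			return t->gtime;

      		/* ... otherwise sample gtime plus the vtime delta
      		   under t->vtime_seqlock, as before ... */
      	}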
      
      The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
      CONFIG_NO_HZ_FULL_ALL=n, and the kernel was booted without nohz_full.
      
      I ran and then stopped a busy loop in a VM and watched the gtime on
      the host, dumping the 43rd field (which shows the gtime) every second:
      
      	 # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
      	S 4348
      	R 7064566
      	R 7064766
      	R 7064967
      	R 7065168
      	S 4759
      	S 4759
      
      While the busy loop was running, it returned a huge value.
      
      After applying this patch, we see the correct gtime.
      
      	 # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
      	S 5338
      	R 5365
      	R 5465
      	R 5566
      	R 5666
      	S 5726
      	S 5726
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Clear the root_domain cpumasks in init_rootdomain() · 8295c699
      Committed by Xunlei Pang
      root_domain::rto_mask, allocated through alloc_cpumask_var(),
      contains garbage data; this may cause problems. For instance,
      pull_rt_task() may do useless iterations if rto_mask retains
      some extra garbage bits. Worse still, this violates the
      isolated-domain rule for clustered scheduling using cpuset,
      because tasks (with all CPUs allowed) belonging to one root
      domain can be pulled away into another root domain.
      
      The patch cleans the garbage by using zalloc_cpumask_var()
      instead of alloc_cpumask_var() for root_domain::rto_mask
      allocation, thereby addressing the issues.
      
      Do the same thing for root_domain's other cpumask members:
      dlo_mask, span, and online.
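
      A sketch of the resulting allocations in init_rootdomain():

      	static int init_rootdomain(struct root_domain *rd)
      	{
      		/* zalloc_cpumask_var() returns a zeroed mask */
      		if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))
      			goto out;
      		if (!zalloc_cpumask_var(&rd->online, GFP_KERNEL))
      			goto free_span;
      		if (!zalloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
      			goto free_online;
      		if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
      			goto free_dlo_mask;
      		...
      	}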
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Remove false-positive warning from wake_up_process() · 119d6f6a
      Committed by Sasha Levin
      Because wakeups can (fundamentally) be late, a task might not be in
      the expected state. Therefore testing against a task's state is racy,
      and can yield false positives.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: oleg@redhat.com
      Fixes: 9067ac85 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
      Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Fix signal handling in bit wait helpers · 68985633
      Committed by Peter Zijlstra
      Vladimir reported getting RCU stall warnings and bisected it back to
      commit:
      
        74316201 ("sched: Remove proliferation of wait_on_bit() action functions")
      
      That commit inadvertently reversed the calls to schedule() and
      signal_pending(), thereby not handling the case where a signal
      arrives while we sleep.
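
      After the fix the helper sleeps first and only then checks for a
      pending signal; a minimal sketch of the corrected ordering:

      	__sched int bit_wait(struct wait_bit_key *word)
      	{
      		schedule();
      		if (signal_pending(current))
      			return -EINTR;
      		return 0;
      	}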
      Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mark.rutland@arm.com
      Cc: neilb@suse.de
      Cc: oleg@redhat.com
      Fixes: 74316201 ("sched: Remove proliferation of wait_on_bit() action functions")
      Fixes: cbbce822 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
      Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 23 November 2015, 5 commits
    • locking/pvqspinlock, x86: Optimize the PV unlock code path · d7804530
      Committed by Waiman Long
      The unlock function in queued spinlocks was optimized for better
      performance on bare metal systems at the expense of virtualized guests.
      
      For x86-64 systems, the unlock call needs to go through a
      PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
      registers before calling the real __pv_queued_spin_unlock()
      function. The thunk code may also be in a separate cacheline from
      __pv_queued_spin_unlock().
      
      This patch optimizes the PV unlock code path by:
      
       1) Moving the unlock slowpath code from the fastpath into a separate
          __pv_queued_spin_unlock_slowpath() function to make the fastpath
          as simple as possible (see the sketch below).
      
       2) For x86-64, hand-coded an assembly function to combine the register
          saving thunk code with the fastpath code. Only registers that
          are used in the fastpath will be saved and restored. If the
          fastpath fails, the slowpath function will be called via another
          PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
          __pv_queued_spin_unlock() code as the thunk saves and restores
          only one 32-bit register.
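
      A sketch of the resulting C fastpath (which the x86-64 assembly
      version mirrors):

      	__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
      	{
      		struct __qspinlock *l = (void *)lock;
      		u8 locked;

      		/* Fastpath: no PV node was hashed; just release the lock. */
      		locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
      		if (likely(locked == _Q_LOCKED_VAL))
      			return;

      		/* Someone is queued and halted: kick them awake. */
      		__pv_queued_spin_unlock_slowpath(lock, locked);
      	}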
      
      With a microbenchmark of 5M lock-unlock loop, the table below shows
      the execution times before and after the patch with different number
      of threads in a VM running on a 32-core Westmere-EX box with x86-64
      4.2-rc1 based kernels:
      
        Threads	Before patch	After patch	% Change
        -------	------------	-----------	--------
           1		   134.1 ms	  119.3 ms	  -11%
           2		   1286  ms	   953  ms	  -26%
           3		   3715  ms	  3480  ms	  -6.3%
           4		   4092  ms	  3764  ms	  -8.0%
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-5-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock: Avoid redundant read of next pointer · aa68744f
      Committed by Waiman Long
      With the optimistic prefetch of the next node cacheline, the next
      pointer may have already been properly initialized. As a result, the
      read of node->next in the contended path may be redundant. This patch
      eliminates the redundant read if the next pointer value is not NULL.
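
      A sketch of the change in the contended path:

      	/*
      	 * Sketch: reuse the value read for the prefetch instead of
      	 * unconditionally re-reading node->next.
      	 */
      	if (!next)
      		while (!(next = READ_ONCE(node->next)))
      			cpu_relax();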
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock: Prefetch the next node cacheline · 81b55986
      Committed by Waiman Long
      A queue head CPU, after acquiring the lock, will have to notify
      the next CPU in the wait queue that it has become the new queue
      head. This involves loading a new cacheline from the MCS node of the
      next CPU. That operation can be expensive and adds to the latency of
      the locking operation.
      
      This patch adds code to optimistically prefetch the next MCS node
      cacheline if the next pointer is defined and the CPU has been
      spinning for the MCS lock for a while. This reduces the locking
      latency and improves the system throughput.
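
      The prefetch itself is a single hint issued from the MCS spin loop;
      a sketch:

      	/* Sketch: warm the next node's cacheline for the handoff. */
      	next = READ_ONCE(node->next);
      	if (next)
      		prefetchw(next);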
      
      The performance change will depend on whether the prefetch overhead
      can be hidden within the latency of the lock spin loop. For really
      short critical sections, there may be no performance gain at all. With
      longer critical sections, however, it was found to give a performance
      boost of 5-10% over a range of different queue depths with a spinlock
      loop microbenchmark.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg() · 64d816cb
      Committed by Waiman Long
      This patch replaces the cmpxchg() and xchg() calls in the native
      qspinlock code with the more relaxed _acquire or _release versions of
      those calls to enable other architectures to adopt queued spinlocks
      with less memory barrier performance overhead.
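
      For example, the fastpath acquisition becomes (a sketch of the
      generic queued_spin_lock()):

      	static __always_inline void queued_spin_lock(struct qspinlock *lock)
      	{
      		u32 val;

      		/* Acquire semantics are needed only when the lock is taken. */
      		val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
      		if (likely(val == 0))
      			return;
      		queued_spin_lock_slowpath(lock, val);
      	}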
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Hide the push_irq_work_func() declaration · 89b41108
      Committed by Arnd Bergmann
      The push_irq_work_func() function is conditionally defined only
      when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the
      forward declaration remains visible without HAVE_RT_PUSH_IPI,
      causing a gcc warning in ARM64 allnoconfig:
      
        kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function]
      
      This changes the code to use the same condition for both the
      declaration and the function definition, which gets rid of the
      warning.
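
      The fix guards the declaration with the same conditions as the
      definition; a sketch:

      	#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
      	static void push_irq_work_func(struct irq_work *work);
      	#endif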
      
      As Peter Zijlstra noted, we can possibly get rid of the whole HAVE_RT_PUSH_IPI
      thing after:
      
        8053871d ("smp: Fix smp_call_function_single_async() locking")
      
      Until that is done, this patch can be used to avoid the warning.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b6366f04 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
      Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 21 November 2015, 2 commits
    • kernel/panic.c: turn off locks debug before releasing console lock · 7625b3a0
      Committed by Vitaly Kuznetsov
      Commit 08d78658 ("panic: release stale console lock to always get the
      logbuf printed out") introduced an unwanted bad unlock balance report when
      panic() is called directly and not from OOPS (e.g.  from out_of_memory()).
      The difference is that in the OOPS case we disable lock debugging in
      oops_enter(), while on a direct panic call nobody does that.
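
      The fix mirrors oops_enter() by turning lock debugging off before the
      console-lock dance in panic(); a sketch:

      	debug_locks_off();
      	console_trylock();
      	console_unlock();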
      
      Fixes: 08d78658 ("panic: release stale console lock to always get the logbuf printed out")
      Reported-by: kernel test robot <ying.huang@linux.intel.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/signal.c: unexport sigsuspend() · 9d8a7652
      Committed by Richard Weinberger
      sigsuspend() is nowhere used except in signal.c itself, so we can mark it
      static and stop polluting the global namespace.
      
      But this patch is more than a boring cleanup patch, it fixes a real issue
      on UserModeLinux.  UML has a special console driver to display ttys using
      xterm, or other terminal emulators, on the host side.  Vegard reported
      that sometimes UML is unable to spawn a xterm and he's facing the
      following warning:
      
        WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()
      
      It turned out that this warning makes absolutely no sense, as the UML
      xterm code calls sigsuspend() on the host side; at least it tries to.  But
      as the kernel itself offers a sigsuspend() symbol, the linker chose this
      one instead of the glibc wrapper.  Interestingly, this code had seemingly
      always worked, but it blocked signals on the wrong side.  Some recent
      kernel change made the WARN_ON() trigger and uncovered the bug.
      
      It is a wonderful example of how much works by chance on computers. :-)
      
      Fixes: 68f3f16d ("new helper: sigsuspend()")
      Signed-off-by: Richard Weinberger <richard@nod.at>
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 12 November 2015, 1 commit
  6. 11 November 2015, 1 commit
    • bpf_trace: Make dependent on PERF_EVENTS · a31d82d8
      Committed by Steven Rostedt
      Arnd Bergmann reported:
      
        In my ARM randconfig tests, I'm getting a build error for
        newly added code in bpf_perf_event_read and bpf_perf_event_output
        whenever CONFIG_PERF_EVENTS is disabled:
      
        kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
        kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
        if (event->oncpu != smp_processor_id() ||
                 ^
        kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
              event->pmu->count)
      
        This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
        is disabled. I'm not sure if that is a configuration we care
        about, otherwise we could prevent this case from occurring by
        adding Kconfig dependencies.
      
      Looking at this further, it's really that UPROBE_EVENT enables PERF_EVENTS.
      By just having BPF_EVENTS depend on PERF_EVENTS, then all is fine.
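
      A sketch of the resulting Kconfig entry:

      	config BPF_EVENTS
      		depends on BPF_SYSCALL
      		depends on (KPROBE_EVENT || UPROBE_EVENT) && PERF_EVENTS
      		bool
      		default y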
      
      Link: http://lkml.kernel.org/r/4525348.Aq9YoXkChv@wuerfel
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 10 November 2015, 5 commits
  8. 09 November 2015, 3 commits
  9. 08 November 2015, 1 commit
  10. 07 November 2015, 8 commits
    • panic: release stale console lock to always get the logbuf printed out · 08d78658
      Committed by Vitaly Kuznetsov
      In some cases we may end up killing the CPU holding the console lock
      while still having valuable data in logbuf. E.g. I'm observing the
      following:
      
      - A crash is happening on one CPU and console_unlock() is being called on
        some other.
      
      - console_unlock() tries to print out the buffer before releasing the lock
        and on slow console it takes time.
      
      - in the meanwhile crashing CPU does lots of printk()-s with valuable data
        (which go to the logbuf) and sends IPIs to all other CPUs.
      
      - console_unlock() finishes printing the previous chunk and enables
        interrupts before trying to print out the rest; the CPU catches the
        IPI and never releases the console lock.
      
      This is not the only possible case: in VT/fb subsystems we have many other
      console_lock()/console_unlock() users.  Non-masked interrupts (or
      receiving NMI in case of extreme slowness) will have the same result.
      Getting the whole console buffer printed out on crash should be top
      priority.
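
      A sketch of the resulting code at the end of panic():

      	/*
      	 * Try to acquire the console lock, then release it regardless
      	 * of the result; the release prints the buffers out.
      	 */
      	console_trylock();
      	console_unlock();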
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode · 8639b461
      Committed by Ben Segall
      setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
      the current pid namespace.  This is in contrast to both the other modes of
      setpriority and the example of kill(-1).  Fix this.  getpriority and
      ioprio have the same failure mode, fix them too.
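
      A sketch of the corrected PRIO_USER loop in setpriority(); the key is
      that task_pid_vnr() returns 0 for tasks outside the caller's pid
      namespace, so such tasks are skipped (details of the surrounding code
      may differ):

      	do_each_thread(g, p) {
      		if (uid_eq(task_uid(p), uid) && task_pid_vnr(p))
      			error = set_one_prio(p, niceval, error);
      	} while_each_thread(g, p);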
      
      Eric said:
      
      : After some more thinking about it this patch sounds justifiable.
      :
      : My goal with namespaces is not to build perfect isolation mechanisms
      : as that can get into ill defined territory, but to build well defined
      : mechanisms.  And to handle the corner cases so you can use only
      : a single namespace with well defined results.
      :
      : In this case you have found the two interfaces I am aware of that
      : identify processes by uid instead of by pid.  Which quite frankly is
      : weird.  Unfortunately the weird unexpected cases are hard to handle
      : in the usual way.
      :
      : I was hoping for a little more information.  Changes like this one we
      : have to be careful of because someone might be depending on the current
      : behavior.  I don't think they are and I do think this make sense as part
      : of the pid namespace.
      Signed-off-by: Ben Segall <bsegall@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ambrose Feinstein <ambrose@google.com>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: use file name as the output message prefix · de90a6bc
      Committed by Minfei Huang
      The kexec output messages lost the "kexec" prefix when Dave Young split
      the kexec code.  Now we use the file name as the output message prefix.
      
      Currently, the format of the output messages is:
      [  140.290795] SYSC_kexec_load: hello, world
      [  140.291534] kexec: sanity_check_segment_list: hello, world
      
      Ideally, the format of the output messages would be:
      [   30.791503] kexec: SYSC_kexec_load, Hello, world
      [   79.182752] kexec_core: sanity_check_segment_list, Hello, world
      
      Remove the custom "kexec" prefix from the output messages.
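
      With a per-file pr_fmt() definition, each message picks up its source
      file's module name automatically; a sketch:

      	/* at the top of kexec.c / kexec_core.c, before any #include */
      	#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt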
      Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
      Acked-by: Dave Young <dyoung@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP · 5fa534c9
      Committed by Oleg Nesterov
      task_will_free_mem() is wrong in many ways, and in particular the
      SIGNAL_GROUP_COREDUMP check is not reliable: a task can participate in the
      coredumping without the SIGNAL_GROUP_COREDUMP bit set.
      
      Change zap_threads() paths to always set SIGNAL_GROUP_COREDUMP, even if
      other CLONE_VM processes can't react to SIGKILL.  Fortunately, at least
      the oom-kill case is fine; it kills all tasks sharing the same mm, so it
      should also kill the process which actually dumps the core.
      
      The change in prepare_signal() is not strictly necessary, it just ensures
      that the patch does not bring another subtle behavioural change.  But it
      reminds us that this SIGNAL_GROUP_EXIT/COREDUMP case needs more changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle Walker <kwalker@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Stanislav Kozina <skozina@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: kill block_all_signals() and unblock_all_signals() · 2e01fabe
      Committed by Oleg Nesterov
      It is hardly possible to enumerate all problems with block_all_signals()
      and unblock_all_signals().  Just for example,
      
      1. block_all_signals(SIGSTOP/etc) simply can't help if the caller is
         multithreaded. Another thread can dequeue the signal and force the
         group stop.
      
      2. Even if the caller is single-threaded, it will "stop" anyway. It
         will not sleep, but it will spin in kernel space until SIGCONT or
         SIGKILL.
      
      And a lot more. In short, this interface doesn't work at all, and
      hasn't for at least the last 10+ years.
      
      Daniel said:
      
        Yeah the only times I played around with the DRM_LOCK stuff was when
        old drivers accidentally deadlocked - my impression is that the entire
        DRM_LOCK thing was never really tested properly ;-) Hence I'm all for
        purging where this leaks out of the drm subsystem.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: prevent userland from spoofing kernel messages · 3824657c
      Committed by Mathias Krause
      The following statement of ABI/testing/dev-kmsg is not quite right:
      
         It is not possible to inject messages from userspace with the
         facility number LOG_KERN (0), to make sure that the origin of the
         messages can always be reliably determined.
      
      Userland actually can inject messages with a facility of 0 by abusing the
      fact that the facility is stored in a u8 data type.  By using a facility
      which is a multiple of 256, the assignment of msg->facility in log_store()
      implicitly truncates it to 0, i.e.  LOG_KERN, allowing users of /dev/kmsg
      to spoof kernel messages as shown below:
      
      The following call...
         # printf '<%d>Kernel panic - not syncing: beer empty\n' 0 >/dev/kmsg
      ...leads to the following log entry (dmesg -x | tail -n 1):
         user  :emerg : [   66.137758] Kernel panic - not syncing: beer empty
      
      However, this call...
         # printf '<%d>Kernel panic - not syncing: beer empty\n' 0x800 >/dev/kmsg
      ...leads to the slightly different log entry (note the kernel facility):
         kern  :emerg : [   74.177343] Kernel panic - not syncing: beer empty
      
      Fix that by limiting the user-provided facility to 8 bits right from the
      beginning and catching the truncation early.
      
      Fixes: 7ff9554b ("printk: convert byte-buffer to variable-length...")
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Alex Elder <elder@linaro.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • module: export param_free_charp() · 3d9c637f
      Committed by Dan Streetman
      Change the param_free_charp() function from static to exported.
      
      It is used by zswap in the next patch ("zswap: use charp for zswap param
      strings").
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Committed by Mel Gorman
      Clearing __GFP_WAIT was the way a caller signalled that it was in atomic
      context and could not sleep.  Now it is possible to distinguish between
      true atomic context and callers that are not willing to sleep.  The latter
      should clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      it prevents it.
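
      After the rename the flag is simply the union of both reclaim bits;
      a sketch of the new definition:

      	#define __GFP_RECLAIM \
      		((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))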
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>