1. 27 Apr 2018, 9 commits
  2. 13 Feb 2018, 2 commits
    • locking/qspinlock: Ensure node->count is updated before initialising node · 11dc1322
      Authored by Will Deacon
      When queuing on the qspinlock, the count field for the current CPU's head
      node is incremented. This needn't be atomic because locking in e.g. IRQ
      context is balanced and so an IRQ will return with node->count as it
      found it.
      
      However, the compiler could in theory reorder the initialisation of
      node[idx] before the increment of the head node->count, causing an
      IRQ to overwrite the initialised node and potentially corrupt the lock
      state.
      
      Avoid the potential for this harmful compiler reordering by placing a
      barrier() between the increment of the head node->count and the subsequent
      node initialisation.
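      
      The fixed sequence can be sketched as follows (field and helper names as in
      kernel/locking/qspinlock.c of that era; simplified, not a verbatim copy):
      
        node = this_cpu_ptr(&mcs_nodes[0]);
        idx = node->count++;
        tail = encode_tail(smp_processor_id(), idx);
      
        node += idx;
      
        /*
         * Keep the compiler from reordering the node initialisation above the
         * node->count increment, so an IRQ taken here picks an unused node.
         */
        barrier();
      
        node->locked = 0;
        node->next = NULL;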
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      11dc1322
    • locking/qspinlock: Ensure node is initialised before updating prev->next · 95bcade3
      Authored by Will Deacon
      When a locker ends up queuing on the qspinlock locking slowpath, we
      initialise the relevant mcs node and publish it indirectly by updating
      the tail portion of the lock word using xchg_tail. If we find that there
      was a pre-existing locker in the queue, we subsequently update their
      ->next field to point at our node so that we are notified when it's our
      turn to take the lock.
      
      This can be roughly illustrated as follows:
      
        /* Initialise the fields in node and encode a pointer to node in tail */
        tail = initialise_node(node);
      
        /*
         * Exchange tail into the lockword using an atomic read-modify-write
         * operation with release semantics
         */
        old = xchg_tail(lock, tail);
      
        /* If there was a pre-existing waiter ... */
        if (old & _Q_TAIL_MASK) {
      	prev = decode_tail(old);
      	smp_read_barrier_depends();
      
      	/* ... then update their ->next field to point to node. */
      	WRITE_ONCE(prev->next, node);
        }
      
      The conditional update of prev->next therefore relies on the address
      dependency from the result of xchg_tail ensuring order against the
      prior initialisation of node. However, since the release semantics of
      the xchg_tail operation apply only to the write portion of the RmW,
      this ordering is not guaranteed and it is possible for the CPU
      to return old before the writes to node have been published, consequently
      allowing us to point prev->next to an uninitialised node.
      
      This patch fixes the problem by making the update of prev->next a RELEASE
      operation, which also removes the reliance on dependency ordering.
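      
      A minimal sketch of the resulting publish step (simplified from the patched
      slowpath; not a verbatim copy):
      
        if (old & _Q_TAIL_MASK) {
        	prev = decode_tail(old);
      
        	/*
        	 * RELEASE: orders the earlier initialisation of @node before
        	 * making it visible to the previous waiter.
        	 */
        	smp_store_release(&prev->next, node);
        }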
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      95bcade3
  3. 05 Dec 2017, 1 commit
  4. 17 Aug 2017, 1 commit
    • locking: Remove spin_unlock_wait() generic definitions · d3a024ab
      Authored by Paul E. McKenney
      There is no agreed-upon definition of spin_unlock_wait()'s semantics,
      and it appears that all callers could do just as well with a lock/unlock
      pair.  This commit therefore removes spin_unlock_wait() and related
      definitions from core code.
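      
      As an illustration only (some_lock is a placeholder, not taken from any
      specific caller), a former user of the primitive would be converted along
      these lines:
      
        /* before: wait until no CPU holds the lock */
        spin_unlock_wait(&some_lock);
      
        /* after: an empty critical section provides the same "drain" effect */
        spin_lock(&some_lock);
        spin_unlock(&some_lock);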
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      d3a024ab
  5. 11 Jul 2017, 1 commit
    • locking/qspinlock: Include linux/prefetch.h · d40e0d4f
      Authored by Paul Burton
      kernel/locking/qspinlock.c makes use of prefetchw() but hasn't included
      linux/prefetch.h up until now, instead relying upon indirect inclusion
      of some header that defines prefetchw().
      
      In the case of MIPS we generally obtain a definition of prefetchw() from
      asm/processor.h, included by way of linux/mutex.h, but only for
      configurations which select CONFIG_CPU_HAS_PREFETCH. For configurations
      which don't this leaves prefetchw() undefined leading to the following
      build failure:
      
        CC      kernel/locking/qspinlock.o
        kernel/locking/qspinlock.c: In function ‘queued_spin_lock_slowpath’:
        kernel/locking/qspinlock.c:549:4: error: implicit declaration of
            function ‘prefetchw’ [-Werror=implicit-function-declaration]
          prefetchw(next);
          ^~~~~~~~~
      
      Fix this by including linux/prefetch.h in order to ensure that we get a
      definition of prefetchw().
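      
      The fix is just the explicit include at the top of
      kernel/locking/qspinlock.c:
      
        #include <linux/prefetch.h>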
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      d40e0d4f
  6. 08 Jul 2017, 1 commit
  7. 27 Jun 2016, 1 commit
  8. 14 Jun 2016, 2 commits
  9. 08 Jun 2016, 3 commits
    • locking/qspinlock: Add comments · 055ce0fd
      Authored by Peter Zijlstra
      I figured we need to document the spin_is_locked() and
      spin_unlock_wait() constraints somewhere.
      
      Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
      could find a place in there. But currently that document is stale to
      the point of hardly being useful.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      055ce0fd
    • locking/qspinlock: Clarify xchg_tail() ordering · 8d53fa19
      Authored by Peter Zijlstra
      While going over the code I noticed that xchg_tail() is a RELEASE but
      had no obvious pairing commented.
      
      It pairs with a somewhat unique address dependency through
      decode_tail().
      
      So the store-release of xchg_tail() is paired with the address
      dependency on the value loaded by xchg_tail(), followed by the
      dereference of the pointer computed from that load.
      
      The @old -> @prev transformation itself is pure, and therefore does
      not depend on external state, so that is immaterial wrt. ordering.
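      
      The pairing can be sketched as follows (same names as in the qspinlock
      slowpath; simplified):
      
        old = xchg_tail(lock, tail);	/* RELEASE: publishes our node */
        if (old & _Q_TAIL_MASK) {
        	prev = decode_tail(old);	/* pure function of @old */
      
        	/* ordered after the load of @old by the address dependency */
        	WRITE_ONCE(prev->next, node);
        }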
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d53fa19
    • locking/qspinlock: Fix spin_unlock_wait() some more · 2c610022
      Authored by Peter Zijlstra
      While this prior commit:
      
        54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      
      ... fixes spin_is_locked() and spin_unlock_wait() for the usage
      in ipc/sem and netfilter, it does not in fact work right for the
      usage in task_work and futex.
      
      So while the 2 locks crossed problem:
      
      	spin_lock(A)		spin_lock(B)
      	if (!spin_is_locked(B)) spin_unlock_wait(A)
      	  foo()			foo();
      
      ... works with the smp_mb() injected by both spin_is_locked() and
      spin_unlock_wait(), this is not sufficient for:
      
      	flag = 1;
      	smp_mb();		spin_lock()
      	spin_unlock_wait()	if (!flag)
      				  // add to lockless list
      	// iterate lockless list
      
      ... because in this scenario, the store from spin_lock() can be delayed
      past the load of flag, uncrossing the variables and losing the
      guarantee.
      
      This patch reworks spin_is_locked() and spin_unlock_wait() to work in
      both cases by exploiting the observation that while the lock byte
      store can be delayed, the contender must have registered itself
      visibly in other state contained in the word.
      
      It also allows for architectures to override both functions, as PPC
      and ARM64 have an additional issue for which we currently have no
      generic solution.
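      
      A simplified sketch of that observation (illustrative, not the exact kernel
      code): a contender whose locked-byte store is still delayed has already made
      itself visible in the pending bit or the tail field, so looking at the whole
      lock word is enough:
      
        static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
        {
        	/*
        	 * Any non-zero state (locked, pending or tail) means someone
        	 * owns the lock or is in the process of acquiring it.
        	 */
        	return atomic_read(&lock->val);
        }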
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Giovanni Gherdovich <ggherdovich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org # v4.2 and later
      Fixes: 54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c610022
  10. 29 Feb 2016, 1 commit
  11. 04 Dec 2015, 3 commits
    • locking/pvqspinlock: Queue node adaptive spinning · cd0272fa
      Authored by Waiman Long
      In an overcommitted guest where some vCPUs have to be halted to make
      forward progress in other areas, it is highly likely that a vCPU later
      in the spinlock queue will be spinning while the ones earlier in the
      queue would have been halted. The spinning in the later vCPUs is then
      just a waste of precious CPU cycles because they are not going to
      get the lock anytime soon, as the earlier ones have to be woken up and take
      their turn to get the lock.
      
      This patch implements an adaptive spinning mechanism where the vCPU
      will call pv_wait() if the previous vCPU is not running.
      
      Linux kernel builds were run in KVM guest on an 8-socket, 4
      cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
      Haswell-EX system. Both systems are configured to have 32 physical
      CPUs. The kernel build times before and after the patch were:
      
      		    Westmere			Haswell
        Patch		32 vCPUs    48 vCPUs	32 vCPUs    48 vCPUs
        -----		--------    --------    --------    --------
        Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
        After patch    3m03.0s     4m37.5s	 1m43.0s     2m47.2s
      
      For 32 vCPUs, this patch doesn't cause any noticeable change in
      performance. For 48 vCPUs (over-committed), there is about 8%
      performance improvement.
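      
      The check driving the early pv_wait() can be sketched roughly as follows
      (helper, field and constant names are abbreviations of the PV qspinlock
      code, not a verbatim copy):
      
        /*
         * While spinning on node->locked, periodically peek at the previous
         * node's vCPU state and stop spinning early if that vCPU is halted.
         */
        static bool pv_wait_early(struct pv_node *prev, int loop)
        {
        	if (loop & PV_PREV_CHECK_MASK)	/* rate-limit the remote read */
        		return false;
      
        	return READ_ONCE(prev->state) != vcpu_running;
        }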
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cd0272fa
    • locking/pvqspinlock: Allow limited lock stealing · 1c4941fd
      Authored by Waiman Long
      This patch allows one attempt for the lock waiter to steal the lock
      when entering the PV slowpath. To prevent lock starvation, the pending
      bit will be set by the queue head vCPU when it is in the active lock
      spinning loop to disable any lock stealing attempt.  This helps to
      reduce the performance penalty caused by lock waiter preemption while
      not having much of the downsides of a real unfair lock.
      
      The pv_wait_head() function was renamed to pv_wait_head_or_lock()
      as it was modified to acquire the lock before returning. This is
      necessary because of possible lock stealing attempts from other tasks.
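      
      The single stealing attempt can be sketched as follows (assuming the
      struct __qspinlock overlay of the lock word used by the PV code of that
      era; simplified):
      
        /*
         * Steal the lock only if neither the locked byte nor the pending bit
         * is set, and only with a single cmpxchg attempt.
         */
        static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
        {
        	struct __qspinlock *l = (void *)lock;
      
        	return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
        	       (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
        }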
      
      Linux kernel builds were run in KVM guest on an 8-socket, 4
      cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
      Haswell-EX system. Both systems are configured to have 32 physical
      CPUs. The kernel build times before and after the patch were:
      
                          Westmere                    Haswell
        Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
        -----         --------    --------    --------    --------
        Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
        After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s
      
      For the overcommitted case (48 vCPUs), this patch is able to reduce
      kernel build time by more than 54% for Westmere and 44% for Haswell.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1c4941fd
    • locking, sched: Introduce smp_cond_acquire() and use it · b3e0b1b6
      Authored by Peter Zijlstra
      Introduce smp_cond_acquire() which combines a control dependency and a
      read barrier to form acquire semantics.
      
      This primitive has two benefits:
      
       - it documents control dependencies,
       - it's typically cheaper than using smp_load_acquire() in a loop (see the sketch below).
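      
      The primitive itself is small; roughly (paraphrasing the definition added
      by this patch, not quoting it):
      
        #define smp_cond_acquire(cond)	do {			\
        	while (!(cond))					\
        		cpu_relax();				\
        	smp_rmb(); /* ctrl + rmb := acquire */		\
        } while (0)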
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b3e0b1b6
  12. 23 Nov 2015, 3 commits
  13. 11 Sep 2015, 1 commit
  14. 03 Aug 2015, 1 commit
    • locking/pvqspinlock: Only kick CPU at unlock time · 75d22702
      Authored by Waiman Long
      For an over-committed guest with more vCPUs than physical CPUs
      available, it is possible that a vCPU may be kicked twice before
      getting the lock - once before it becomes queue head and once again
      before it gets the lock. All these CPU kicking and halting (VMEXIT)
      can be expensive and slow down system performance.
      
      This patch adds a new vCPU state (vcpu_hashed) which enables the code
      to delay CPU kicking until unlock time. Once this state is set,
      the new lock holder will set _Q_SLOW_VAL and fill in the hash table
      on behalf of the halted queue head vCPU. The original vcpu_halted
      state will be used by pv_wait_node() only to differentiate other
      queue nodes from the queue head.
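      
      The resulting vCPU state machine can be sketched as follows (state names
      from the commit message; the comments are a summary, not the kernel's):
      
        enum vcpu_state {
        	vcpu_running = 0,
        	vcpu_halted,	/* pv_wait_node() waiter, kicked when it becomes queue head */
        	vcpu_hashed,	/* hashed queue head, kicked only at unlock time */
        };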
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      75d22702
  15. 08 May 2015, 7 commits
    • locking/pvqspinlock: Implement simple paravirt support for the qspinlock · a23db284
      Authored by Waiman Long
      Provide a separate (second) version of the spin_lock_slowpath for
      paravirt along with a special unlock path.
      
      The second slowpath is generated by adding a few pv hooks to the
      normal slowpath, but where those will compile away for the native
      case, they expand into special wait/wake code for the pv version.
      
      The actual MCS queue can use extra storage in the mcs_nodes[] array to
      keep track of state and therefore uses directed wakeups.
      
      The head contender has no such storage directly visible to the
      unlocker.  So the unlocker searches a hash table with open addressing
      using a simple binary Galois linear feedback shift register.
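      
      For reference, one step of a textbook binary Galois LFSR looks like the
      sketch below; this is a generic illustration of the construction named
      above, not the kernel's hashing code:
      
        static inline u32 lfsr_step(u32 state, u32 taps)
        {
        	u32 lsb = state & 1;
      
        	state >>= 1;		/* shift the register */
        	if (lsb)
        		state ^= taps;	/* feed back through the tap mask */
      
        	return state;
        }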
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a23db284
    • locking/qspinlock: Revert to test-and-set on hypervisors · 2aa79af6
      Authored by Peter Zijlstra (Intel)
      When we detect a hypervisor (!paravirt, see qspinlock paravirt support
      patches), revert to a simple test-and-set lock to avoid the horrors
      of queue preemption.
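      
      On x86 the fallback amounts to something like the following sketch (the
      helper name and feature test are recalled from the patch, so treat them as
      assumptions):
      
        static inline bool virt_queued_spin_lock(struct qspinlock *lock)
        {
        	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
        		return false;	/* bare metal: use the real qspinlock */
      
        	/* simple test-and-set lock, immune to queue preemption */
        	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
        		cpu_relax();
      
        	return true;
        }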
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2aa79af6
    • locking/qspinlock: Use a simple write to grab the lock · 2c83e8e9
      Authored by Waiman Long
      Currently, atomic_cmpxchg() is used to get the lock. However, this
      is not really necessary if there is more than one task in the queue
      and the queue head doesn't need to reset the tail code. For that case,
      a simple write to set the lock bit is enough as the queue head will
      be the only one eligible to get the lock as long as it checks that
      both the lock and pending bits are not set. The current pending bit
      waiting code will ensure that the bit will not be set as soon as the
      tail code in the lock is set.
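      
      The simple write boils down to a plain byte store (sketch; assumes the
      struct __qspinlock overlay of the lock word used at the time):
      
        static __always_inline void set_locked(struct qspinlock *lock)
        {
        	struct __qspinlock *l = (void *)lock;
      
        	/* the queue head is the only CPU that can set the locked byte here */
        	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
        }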
      
      With that change, there is a slight improvement in the performance
      of the queued spinlock in the 5M loop micro-benchmark run on a 4-socket
      Westmere-EX machine, as shown in the tables below.
      
      		[Standalone/Embedded - same node]
        # of tasks	Before patch	After patch	%Change
        ----------	-----------	----------	-------
             3	 2324/2321	2248/2265	 -3%/-2%
             4	 2890/2896	2819/2831	 -2%/-2%
             5	 3611/3595	3522/3512	 -2%/-2%
             6	 4281/4276	4173/4160	 -3%/-3%
             7	 5018/5001	4875/4861	 -3%/-3%
             8	 5759/5750	5563/5568	 -3%/-3%
      
      		[Standalone/Embedded - different nodes]
        # of tasks	Before patch	After patch	%Change
        ----------	-----------	----------	-------
             3	12242/12237	12087/12093	 -1%/-1%
             4	10688/10696	10507/10521	 -2%/-2%
      
      It was also found that this change produced a much bigger performance
      improvement on the newer IvyBridge-EX chip, essentially closing
      the performance gap between the ticket spinlock and the queued spinlock.
      
      The disk workload of the AIM7 benchmark was run on a 4-socket
      Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
      on a 3.14 based kernel. The results of the test runs were:
      
                      AIM7 XFS Disk Test
        kernel                 JPM    Real Time   Sys Time    Usr Time
        -----                  ---    ---------   --------    --------
        ticketlock            5678233    3.17       96.61       5.81
        qspinlock             5750799    3.13       94.83       5.97
      
                      AIM7 EXT4 Disk Test
        kernel                 JPM    Real Time   Sys Time    Usr Time
        -----                  ---    ---------   --------    --------
        ticketlock            1114551   16.15      509.72       7.11
        qspinlock             2184466    8.24      232.99       6.01
      
      The ext4 filesystem run had a much higher spinlock contention than
      the xfs filesystem run.
      
      The "ebizzy -m" test was also run with the following results:
      
        kernel               records/s  Real Time   Sys Time    Usr Time
        -----                ---------  ---------   --------    --------
        ticketlock             2075       10.00      216.35       3.49
        qspinlock              3023       10.00      198.20       4.80
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-7-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c83e8e9
    • locking/qspinlock: Optimize for smaller NR_CPUS · 69f9cae9
      Authored by Peter Zijlstra (Intel)
      When we allow for a max NR_CPUS < 2^14 we can optimize the pending
      wait-acquire and the xchg_tail() operations.
      
      By growing the pending bit to a byte, we reduce the tail to 16bit.
      This means we can use xchg16 for the tail part and do away with all
      the repeated cmpxchg() operations.
      
      This in turn allows us to unconditionally acquire; the locked state
      as observed by the wait loops cannot change. And because both locked
      and pending are now a full byte we can use simple stores for the
      state transition, obviating one atomic operation entirely.
      
      This optimization is needed to make the qspinlock achieve performance
      parity with ticket spinlock at light load.
      
      All this is horribly broken on Alpha pre EV56 (and any other arch that
      cannot do single-copy atomic byte stores).
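      
      With a 16-bit tail, xchg_tail() reduces to roughly the following (sketch;
      assumes the struct __qspinlock overlay giving halfword access to the tail):
      
        static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
        {
        	struct __qspinlock *l = (void *)lock;
      
        	/* one halfword exchange replaces the cmpxchg loop */
        	return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
        }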
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      69f9cae9
    • locking/qspinlock: Extract out code snippets for the next patch · 6403bd7d
      Authored by Waiman Long
      This is a preparatory patch that extracts out the following 2 code
      snippets to prepare for the next performance optimization patch.
      
       1) the logic for the exchange of new and previous tail code words
          into a new xchg_tail() function.
       2) the logic for clearing the pending bit and setting the locked bit
          into a new clear_pending_set_locked() function.
      
      This patch also simplifies the trylock operation before queuing by
      calling queued_spin_trylock() directly.
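      
      The two extracted helpers look roughly like this (a sketch of the
      interfaces described above, not a verbatim copy of the patch):
      
        /* assumes: pending bit set, locked byte clear */
        static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
        {
        	atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
        }
      
        static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
        {
        	u32 old, new, val = atomic_read(&lock->val);
      
        	for (;;) {
        		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
        		old = atomic_cmpxchg(&lock->val, val, new);
        		if (old == val)
        			break;
      
        		val = old;
        	}
        	return old;
        }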
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-5-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6403bd7d
    • locking/qspinlock: Add pending bit · c1fb159d
      Authored by Peter Zijlstra (Intel)
      Because the qspinlock needs to touch a second cacheline (the per-cpu
      mcs_nodes[]), add a pending bit and allow a single in-word spinner
      before we punt to the second cacheline.
      
      It is possible to observe the pending bit without the locked bit when
      the last owner has just released but the pending owner has not yet
      taken ownership.
      
      In this case we would normally queue -- because the pending bit is
      already taken. However, in this case the pending bit is guaranteed
      to be released 'soon', therefore wait for it and avoid queueing.
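      
      A heavily simplified sketch of the pending-bit path (an illustration of
      the idea, not the actual slowpath, which handles more races):
      
        val = atomic_read(&lock->val);
        if (!(val & ~_Q_LOCKED_MASK)) {		/* no pending bit, no tail */
        	/* try to become the single in-word spinner */
        	if (atomic_cmpxchg(&lock->val, val, val | _Q_PENDING_VAL) == val) {
        		/* wait for the current owner to release the lock ... */
        		while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
        			cpu_relax();
        		/* ... then take it over: set locked, clear pending */
        		atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
        		return;
        	}
        }
        /* otherwise fall back to queuing on the per-CPU MCS node */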
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c1fb159d
    • locking/qspinlock: Introduce a simple generic 4-byte queued spinlock · a33fda35
      Authored by Waiman Long
      This patch introduces a new generic queued spinlock implementation that
      can serve as an alternative to the default ticket spinlock. Compared
      with the ticket spinlock, this queued spinlock should be almost as fair
      as the ticket spinlock. It has about the same speed in single-thread
      and it can be much faster in high contention situations especially when
      the spinlock is embedded within the data structure to be protected.
      
      Only in light to moderate contention where the average queue depth
      is around 1-3 will this queued spinlock be potentially a bit slower
      due to the higher slowpath overhead.
      
      This queued spinlock is especially suited to NUMA machines with a large
      number of cores as the chance of spinlock contention is much higher
      in those machines. The cost of contention is also higher because of
      slower inter-node memory traffic.
      
      Due to the fact that spinlocks are acquired with preemption disabled,
      the process will not be migrated to another CPU while it is trying
      to get a spinlock. Ignoring interrupt handling, a CPU can only be
      contending in one spinlock at any one time. Counting soft IRQ, hard
      IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
      activities.  By allocating a set of per-cpu queue nodes and using them
      to form a waiting queue, we can encode the queue node address into a
      much smaller 24-bit size (including CPU number and queue node index)
      leaving one byte for the lock.
      
      Please note that the queue node is only needed when waiting for the
      lock. Once the lock is acquired, the queue node can be released to
      be used later.
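      
      The tail encoding described above can be sketched as follows (offset
      macros as defined in the qspinlock headers; simplified):
      
        /*
         * Pack (CPU number + 1, per-CPU node index) into the upper bits of the
         * 32-bit lock word; a tail of 0 means "no queued waiters".
         */
        static inline u32 encode_tail(int cpu, int idx)
        {
        	u32 tail;
      
        	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
        	tail |= idx << _Q_TAIL_IDX_OFFSET;
      
        	return tail;
        }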
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-2-git-send-email-Waiman.Long@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a33fda35