1. 18 2月, 2015 1 次提交
    • R
      x86/spinlocks/paravirt: Fix memory corruption on unlock · d6abfdb2
      Raghavendra K T 提交于
      Paravirt spinlock clears slowpath flag after doing unlock.
      As explained by Linus currently it does:
      
                      prev = *lock;
                      add_smp(&lock->tickets.head, TICKET_LOCK_INC);
      
                      /* add_smp() is a full mb() */
      
                      if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                              __ticket_unlock_slowpath(lock, prev);
      
      which is *exactly* the kind of things you cannot do with spinlocks,
      because after you've done the "add_smp()" and released the spinlock
      for the fast-path, you can't access the spinlock any more.  Exactly
      because a fast-path lock might come in, and release the whole data
      structure.
      
      Linus suggested that we should not do any writes to lock after unlock(),
      and we can move slowpath clearing to fastpath lock.
      
      So this patch implements the fix with:
      
       1. Moving slowpath flag to head (Oleg):
          Unlocked locks don't care about the slowpath flag; therefore we can keep
          it set after the last unlock, and clear it again on the first (try)lock.
          -- this removes the write after unlock. note that keeping slowpath flag would
          result in unnecessary kicks.
          By moving the slowpath flag from the tail to the head ticket we also avoid
          the need to access both the head and tail tickets on unlock.
      
       2. use xadd to avoid read/write after unlock that checks the need for
          unlock_kick (Linus):
          We further avoid the need for a read-after-release by using xadd;
          the prev head value will include the slowpath flag and indicate if we
          need to do PV kicking of suspended spinners -- on modern chips xadd
          isn't (much) more expensive than an add + load.
      
      Result:
       setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
       benchmark overcommit %improve
       kernbench  1x           -0.13
       kernbench  2x            0.02
       dbench     1x           -1.77
       dbench     2x           -0.63
      
      [Jeremy: Hinted missing TICKET_LOCK_INC for kick]
      [Oleg: Moved slowpath flag to head, ticket_equals idea]
      [PeterZ: Added detailed changelog]
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: a.ryabinin@samsung.com
      Cc: dave@stgolabs.net
      Cc: hpa@zytor.com
      Cc: jasowang@redhat.com
      Cc: jeremy@goop.org
      Cc: paul.gortmaker@windriver.com
      Cc: riel@redhat.com
      Cc: tglx@linutronix.de
      Cc: waiman.long@hp.com
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/20150215173043.GA7471@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d6abfdb2
  2. 19 1月, 2015 1 次提交
  3. 18 12月, 2014 1 次提交
  4. 08 12月, 2014 1 次提交
  5. 10 9月, 2014 1 次提交
  6. 06 6月, 2014 1 次提交
  7. 12 3月, 2014 1 次提交
  8. 03 9月, 2013 1 次提交
    • L
      lockref: implement lockless reference count updates using cmpxchg() · bc08b449
      Linus Torvalds 提交于
      Instead of taking the spinlock, the lockless versions atomically check
      that the lock is not taken, and do the reference count update using a
      cmpxchg() loop.  This is semantically identical to doing the reference
      count update protected by the lock, but avoids the "wait for lock"
      contention that you get when accesses to the reference count are
      contended.
      
      Note that a "lockref" is absolutely _not_ equivalent to an atomic_t.
      Even when the lockref reference counts are updated atomically with
      cmpxchg, the fact that they also verify the state of the spinlock means
      that the lockless updates can never happen while somebody else holds the
      spinlock.
      
      So while "lockref_put_or_lock()" looks a lot like just another name for
      "atomic_dec_and_lock()", and both optimize to lockless updates, they are
      fundamentally different: the decrement done by atomic_dec_and_lock() is
      truly independent of any lock (as long as it doesn't decrement to zero),
      so a locked region can still see the count change.
      
      The lockref structure, in contrast, really is a *locked* reference
      count.  If you hold the spinlock, the reference count will be stable and
      you can modify the reference count without using atomics, because even
      the lockless updates will see and respect the state of the lock.
      
      In order to enable the cmpxchg lockless code, the architecture needs to
      do three things:
      
       (1) Make sure that the "arch_spinlock_t" and an "unsigned int" can fit
           in an aligned u64, and have a "cmpxchg()" implementation that works
           on such a u64 data type.
      
       (2) define a helper function to test for a spinlock being unlocked
           ("arch_spin_value_unlocked()")
      
       (3) select the "ARCH_USE_CMPXCHG_LOCKREF" config variable in its
           Kconfig file.
      
      This enables it for x86-64 (but not 32-bit, we'd need to make sure
      cmpxchg() turns into the proper cmpxchg8b in order to enable it for
      32-bit mode).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc08b449
  9. 13 8月, 2013 1 次提交
    • O
      sched: fix the theoretical signal_wake_up() vs schedule() race · e0acd0a6
      Oleg Nesterov 提交于
      This is only theoretical, but after try_to_wake_up(p) was changed
      to check p->state under p->pi_lock the code like
      
      	__set_current_state(TASK_INTERRUPTIBLE);
      	schedule();
      
      can miss a signal. This is the special case of wait-for-condition,
      it relies on try_to_wake_up/schedule interaction and thus it does
      not need mb() between __set_current_state() and if(signal_pending).
      
      However, this __set_current_state() can move into the critical
      section protected by rq->lock, now that try_to_wake_up() takes
      another lock we need to ensure that it can't be reordered with
      "if (signal_pending(current))" check inside that section.
      
      The patch is actually one-liner, it simply adds smp_wmb() before
      spin_lock_irq(rq->lock). This is what try_to_wake_up() already
      does by the same reason.
      
      We turn this wmb() into the new helper, smp_mb__before_spinlock(),
      for better documentation and to allow the architectures to change
      the default implementation.
      
      While at it, kill smp_mb__after_lock(), it has no callers.
      
      Perhaps we can also add smp_mb__before/after_spinunlock() for
      prepare_to_wait().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0acd0a6
  10. 09 8月, 2013 4 次提交
  11. 22 8月, 2012 1 次提交
  12. 31 3月, 2012 1 次提交
  13. 07 2月, 2012 1 次提交
  14. 26 11月, 2011 1 次提交
  15. 28 9月, 2011 1 次提交
  16. 30 8月, 2011 4 次提交
  17. 27 7月, 2011 1 次提交
  18. 21 7月, 2011 1 次提交
    • J
      x86: Fix write lock scalability 64-bit issue · a750036f
      Jan Beulich 提交于
      With the write lock path simply subtracting RW_LOCK_BIAS there
      is, on large systems, the theoretical possibility of overflowing
      the 32-bit value that was used so far (namely if 128 or more
      CPUs manage to do the subtraction, but don't get to do the
      inverse addition in the failure path quickly enough).
      
      A first measure is to modify RW_LOCK_BIAS itself - with the new
      value chosen, it is good for up to 2048 CPUs each allowed to
      nest over 2048 times on the read path without causing an issue.
      Quite possibly it would even be sufficient to adjust the bias a
      little further, assuming that allowing for significantly less
      nesting would suffice.
      
      However, as the original value chosen allowed for even more
      nesting levels, to support more than 2048 CPUs (possible
      currently only for 64-bit kernels) the lock itself gets widened
      to 64 bits.
      Signed-off-by: NJan Beulich <jbeulich@novell.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4E258E0D020000780004E3F0@nat28.tlf.novell.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      a750036f
  19. 15 12月, 2009 4 次提交
  20. 10 7月, 2009 1 次提交
  21. 16 5月, 2009 1 次提交
    • J
      x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Jeremy Fitzhardinge 提交于
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides arel 1) it will replicate the
      preempt_disable/enable code at eack lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still it is a
      reasonable short-term workaround.
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: N"Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: N"Li Xin" <xin.li@intel.com>
      Analysed-by: N"Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b4ecc126
  22. 03 4月, 2009 1 次提交
  23. 10 2月, 2009 1 次提交
  24. 26 1月, 2009 1 次提交
  25. 21 1月, 2009 1 次提交
    • J
      x86: remove byte locks · afb33f8c
      Jiri Kosina 提交于
      Impact: cleanup
      
      Remove byte locks implementation, which was introduced by Jeremy in
      8efcbab6 ("paravirt: introduce a "lock-byte" spinlock implementation"),
      but turned out to be dead code that is not used by any in-kernel
      virtualization guest (Xen uses its own variant of spinlocks implementation
      and KVM is not planning to move to byte locks).
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      afb33f8c
  26. 23 10月, 2008 2 次提交
  27. 05 9月, 2008 3 次提交
  28. 20 8月, 2008 1 次提交