1. 15 Apr 2015, 1 commit
  2. 03 Apr 2015, 1 commit
  3. 04 Feb 2015, 1 commit
  4. 19 Nov 2014, 1 commit
    • x86: Cleanly separate use of asm-generic/mm_hooks.h · a1ea1c03
      Dave Hansen authored
      asm-generic/mm_hooks.h provides some generic fillers for the 90%
      of architectures that do not need to hook some mmap-manipulation
      functions.  A comment inside says:
      
      > Define generic no-op hooks for arch_dup_mmap and
      > arch_exit_mmap, to be included in asm-FOO/mmu_context.h
      > for any arch FOO which doesn't need to hook these.
      
      So, does x86 need to hook these?  It depends on CONFIG_PARAVIRT.
      We *conditionally* include this generic header if we have
      CONFIG_PARAVIRT=n.  That's madness.
      
      With this patch, x86 stops using asm-generic/mm_hooks.h entirely.
      We use our own copies of the functions.  The paravirt code
      provides some stubs if it is disabled, and we always call those
      stubs in our x86-private versions of arch_exit_mmap() and
      arch_dup_mmap().
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182349.14567FA5@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a1ea1c03
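      The shape of the change can be sketched in plain C (a user-space model, not the kernel code; the counter is demo instrumentation added here so the stubs are observable, and CONFIG_PARAVIRT is modeled with an ordinary #ifdef):

```c
#include <assert.h>

struct mm_struct { int id; };

static int pv_hook_calls;  /* demo instrumentation, not in the kernel */

#ifdef CONFIG_PARAVIRT
/* With paravirt enabled, these would dispatch through pv_mmu_ops. */
void paravirt_arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm);
void paravirt_arch_exit_mmap(struct mm_struct *mm);
#else
/* With paravirt disabled, the paravirt layer provides no-op stubs
 * (here they bump a counter so the demo can be checked). */
static inline void paravirt_arch_dup_mmap(struct mm_struct *oldmm,
                                          struct mm_struct *mm)
{ (void)oldmm; (void)mm; pv_hook_calls++; }

static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
{ (void)mm; pv_hook_calls++; }
#endif

/* x86 now defines its own hooks unconditionally and always calls the
 * paravirt functions; no conditional include of asm-generic/mm_hooks.h
 * remains. */
static inline void arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{ paravirt_arch_dup_mmap(oldmm, mm); }

static inline void arch_exit_mmap(struct mm_struct *mm)
{ paravirt_arch_exit_mmap(mm); }
```

      Either way the generic header is out of the picture: the hook names exist unconditionally, and only their bodies vary with the config.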
  5. 30 Jan 2014, 1 commit
  6. 09 Aug 2013, 3 commits
    • x86, ticketlock: Add slowpath logic · 96f853ea
      Jeremy Fitzhardinge authored
      Maintain a flag in the LSB of the ticket lock tail which indicates
      whether anyone is in the lock slowpath and may need kicking when
      the current holder unlocks.  The flag is set when the first locker
      enters the slowpath, and cleared when unlocking to an empty queue (ie,
      no contention).
      
      In the specific implementation of lock_spinning(), make sure to set
      the slowpath flags on the lock just before blocking.  We must do
      this before the last-chance pickup test to prevent a deadlock
      with the unlocker:
      
      Unlocker			Locker
      				test for lock pickup
      					-> fail
      unlock
      test slowpath
      	-> false
      				set slowpath flags
      				block
      
      Whereas this works in any ordering:
      
      Unlocker			Locker
      				set slowpath flags
      				test for lock pickup
      					-> fail
      				block
      unlock
      test slowpath
      	-> true, kick
      
      If the unlocker finds that the lock has the slowpath flag set but it is
      actually uncontended (ie, head == tail, so nobody is waiting), then it
      clears the slowpath flag.
      
      The unlock code uses a locked add to update the head counter.  This also
      acts as a full memory barrier, so it's safe to subsequently
      read back the slowpath flag state, knowing that the updated lock is visible
      to the other CPUs.  If it were an unlocked add, the flag read might
      just be forwarded from the store buffer before the update was visible to
      the other CPUs, which could result in a deadlock.
      
      Unfortunately this means we need to do a locked instruction when
      unlocking with PV ticketlocks.  However, if PV ticketlocks are not
      enabled, then the old non-locked "add" is the only unlocking code.
      
      Note: this code relies on gcc keeping unlikely() code out of
      line of the fastpath, which only happens when OPTIMIZE_SIZE=n.  If it
      doesn't, the generated code isn't too bad, but it's definitely suboptimal.
      
      Thanks to Srivatsa Vaddagiri for providing a bugfix to the original
      version of this change, which has been folded in.
      Thanks to Stephan Diestelhorst for commenting on some code which relied
      on an inaccurate reading of the x86 memory ordering rules.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Link: http://lkml.kernel.org/r/1376058122-8248-11-git-send-email-raghavendra.kt@linux.vnet.ibm.com
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Stephan Diestelhorst <stephan.diestelhorst@amd.com>
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      96f853ea
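      The unlock side of the tail-LSB scheme described above can be modeled in user-space C (illustrative only; the real code lives in arch/x86/include/asm/spinlock.h, kick_next_waiter stands in for the unlock_kick pvop, and the kicks counter is demo instrumentation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Tickets advance in steps of 2, so bit 0 of the tail is free to mean
 * "someone is in the slowpath and may need a kick". */
#define TICKET_SLOWPATH_FLAG 1u
#define TICKET_INC           2u

struct ticketlock {
        _Atomic uint16_t head;   /* ticket currently being served */
        _Atomic uint16_t tail;   /* next ticket to hand out, LSB = flag */
};

static int kicks;   /* demo instrumentation: counts waiter wakeups */

static void kick_next_waiter(struct ticketlock *lk)
{ (void)lk; kicks++; /* stands in for the unlock_kick pvop */ }

static void ticket_unlock(struct ticketlock *lk)
{
        /* The locked add is a full barrier: the new head is globally
         * visible before the flag is read back, so a waiter that set
         * the flag before we released cannot be missed. */
        uint16_t head = atomic_fetch_add(&lk->head, TICKET_INC) + TICKET_INC;
        uint16_t tail = atomic_load(&lk->tail);

        if (tail & TICKET_SLOWPATH_FLAG) {
                if ((uint16_t)(tail & ~TICKET_SLOWPATH_FLAG) == head)
                        /* head == tail: nobody is actually waiting,
                         * so just clear the stale flag. */
                        atomic_store(&lk->tail,
                                     (uint16_t)(tail & ~TICKET_SLOWPATH_FLAG));
                else
                        kick_next_waiter(lk);
        }
}
```

      Note how the two cases from the message fall out: a contended unlock kicks, and an uncontended unlock with a stale flag merely clears it.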
    • x86, pvticketlock: Use callee-save for lock_spinning · 354714dd
      Jeremy Fitzhardinge authored
      Although the lock_spinning calls in the spinlock code are on the
      uncommon path, their presence can cause the compiler to generate many
      more register save/restores in the function pre/postamble, which is in
      the fast path.  To avoid this, convert it to using the pvops callee-save
      calling convention, which defers all the save/restores until the actual
      function is called, keeping the fastpath clean.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Link: http://lkml.kernel.org/r/1376058122-8248-8-git-send-email-raghavendra.kt@linux.vnet.ibm.com
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Tested-by: Attilio Rao <attilio.rao@citrix.com>
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      354714dd
    • x86, spinlock: Replace pv spinlocks with pv ticketlocks · 545ac138
      Jeremy Fitzhardinge authored
      Rather than outright replacing the entire spinlock implementation in
      order to paravirtualize it, keep the ticket lock implementation but add
      a couple of pvops hooks on the slow path (long spin on lock, unlocking
      a contended lock).
      
      Ticket locks have a number of nice properties, but they also have some
      surprising behaviours in virtual environments.  They enforce a strict
      FIFO ordering on cpus trying to take a lock; however, if the hypervisor
      scheduler does not schedule the cpus in the correct order, the system can
      waste a huge amount of time spinning until the next cpu can take the lock.
      
      (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
      http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
      
      To address this, we add two hooks:
       - __ticket_spin_lock which is called after the cpu has been
         spinning on the lock for a significant number of iterations but has
         failed to take the lock (presumably because the cpu holding the lock
         has been descheduled).  The lock_spinning pvop is expected to block
         the cpu until it has been kicked by the current lock holder.
       - __ticket_spin_unlock, which, on releasing a contended lock
         (there are more cpus with tail tickets), looks to see if the next
         cpu is blocked and wakes it if so.
      
      When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
      functions causes all the extra code to go away.
      
      Results:
      =======
      setup: 32 core machine with 32 vcpu KVM guest (HT off)  with 8GB RAM
      base = 3.11-rc
      patched = base + pvspinlock V12
      
      +-----------------+----------------+--------+
       dbench (Throughput in MB/sec. Higher is better)
      +-----------------+----------------+--------+
      |   base (stdev %)|patched(stdev%) | %gain  |
      +-----------------+----------------+--------+
      | 15035.3   (0.3) |15150.0   (0.6) |   0.8  |
      |  1470.0   (2.2) | 1713.7   (1.9) |  16.6  |
      |   848.6   (4.3) |  967.8   (4.3) |  14.0  |
      |   652.9   (3.5) |  685.3   (3.7) |   5.0  |
      +-----------------+----------------+--------+
      
      pvspinlock shows benefits for overcommit ratios > 1 in PLE-enabled cases,
      and undercommit results are flat.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Link: http://lkml.kernel.org/r/1376058122-8248-2-git-send-email-raghavendra.kt@linux.vnet.ibm.com
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Tested-by: Attilio Rao <attilio.rao@citrix.com>
      [ Raghavendra: Changed SPIN_THRESHOLD, fixed redefinition of arch_spinlock_t]
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      545ac138
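      The lock-side hook can be sketched like this (user-space model; the SPIN_THRESHOLD value is made up for the demo, and lock_spinning's behaviour is a single-threaded stand-in, since the real pvop halts the vcpu until the holder kicks it):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SPIN_THRESHOLD 100   /* demo value; the kernel's is far larger */

static int slowpath_entries;  /* demo instrumentation */

/* Stand-in for the lock_spinning pvop.  The real hook blocks the vcpu
 * until the current lock holder kicks it; in this single-threaded demo
 * we simply pretend the holder released while we were blocked. */
static void lock_spinning(_Atomic uint16_t *head, uint16_t want)
{
        slowpath_entries++;
        atomic_store(head, want);     /* "the kick arrived" */
}

static void ticket_lock(_Atomic uint16_t *head, _Atomic uint16_t *tail)
{
        uint16_t me = atomic_fetch_add(tail, 1);  /* take a ticket */

        for (;;) {
                int count = SPIN_THRESHOLD;
                /* Fast path: spin a bounded number of iterations. */
                while (count--)
                        if (atomic_load(head) == me)
                                return;
                /* Spun too long; the holder is presumably descheduled,
                 * so enter the slowpath and block until kicked. */
                lock_spinning(head, me);
                if (atomic_load(head) == me)
                        return;
        }
}
```

      With CONFIG_PARAVIRT_SPINLOCKS=n the slowpath call compiles away and only the plain bounded spin remains, matching the stub behaviour the message describes.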
  7. 12 Apr 2013, 1 commit
  8. 11 Apr 2013, 1 commit
  9. 19 Dec 2012, 1 commit
    • x86, paravirt: fix build error when thp is disabled · c36e0501
      David Rientjes authored
      With CONFIG_PARAVIRT=y and CONFIG_TRANSPARENT_HUGEPAGE=n, the build breaks
      because set_pmd_at() is undeclared:
      
        mm/memory.c: In function 'do_pmd_numa_page':
        mm/memory.c:3520: error: implicit declaration of function 'set_pmd_at'
        mm/mprotect.c: In function 'change_pmd_protnuma':
        mm/mprotect.c:120: error: implicit declaration of function 'set_pmd_at'
      
      This is because paravirt defines set_pmd_at() only when
      CONFIG_TRANSPARENT_HUGEPAGE=y and such a restriction is unneeded.  The
      fix is to define it for all CONFIG_PARAVIRT configurations.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c36e0501
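      The fix is purely a matter of which #ifdef the definition sits under; in miniature (user-space model with stand-in types, not the kernel's):

```c
#include <assert.h>

typedef unsigned long pmd_t;
struct mm_struct { int dummy; };

/* Before the fix the definition sat inside
 *   #ifdef CONFIG_TRANSPARENT_HUGEPAGE ... #endif
 * so a CONFIG_PARAVIRT=y, THP=n build saw no declaration at all and
 * mm/memory.c failed with "implicit declaration of set_pmd_at".
 * After the fix it is defined for every paravirt configuration: */
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
                              pmd_t *pmdp, pmd_t pmd)
{
        (void)mm; (void)addr;
        *pmdp = pmd;   /* stand-in for the pv_mmu_ops.set_pmd_at call */
}
```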
  10. 28 Jun 2012, 1 commit
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Alex Shi authored
      x86 has no flush_tlb_range support at the instruction level. Currently
      flush_tlb_range is just implemented by flushing the whole TLB. That is
      not the best solution for all scenarios. In fact, if we just use
      'invlpg' to flush a few lines from the TLB, we get a performance gain
      from the remaining TLB lines on later accesses.
      
      But the 'invlpg' instruction is costly. Its execution time can compete
      with a cr3 rewrite, and is even a bit slower on SNB CPUs.
      
      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns(assumed TLB refill cost) =
      		X(TLB flush entries) * 100ns(assumed invlpg cost)
      
      Here, X is 256, that is 1/2 of 512 entries.
      
      But with the mysterious CPU pre-fetcher and page miss handler unit, the
      assumed TLB refill cost is far lower than 100ns for sequential access.
      And 2 HT siblings in one core make memory access faster if they are
      accessing the same memory. So, in the patch, I only make the change when
      the number of target entries is less than 1/16 of the whole active TLB.
      Actually, I have no data to support the ratio '1/16', so any
      suggestions are welcome.
      
      As for hugetlb, presumably due to the smaller page tables and fewer
      active TLB entries, I didn't see a benefit in my benchmark, so there is
      no optimization for it now.
      
      My micro benchmark shows that in ideal scenarios read performance
      improves by 70 percent. In the worst scenario, reading/writing
      performance is similar to the unpatched 3.4-rc4 kernel.
      
      Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
      'always':
      
      multi thread testing, the '-t' parameter is the thread number:
      	       	        with patch   unpatched 3.4-rc4
      ./mprotect -t 1           14ns		24ns
      ./mprotect -t 2           13ns		22ns
      ./mprotect -t 4           12ns		19ns
      ./mprotect -t 8           14ns		16ns
      ./mprotect -t 16          28ns		26ns
      ./mprotect -t 32          54ns		51ns
      ./mprotect -t 128         200ns		199ns
      
      Single process with sequential flushing and memory accessing:
      
      		       	with patch   unpatched 3.4-rc4
      ./mprotect		    7ns			11ns
      ./mprotect -p 4096  -l 8 -n 10240
      			    21ns		21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      e7b52ffd
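      The resulting policy can be sketched as a small predicate (user-space model; the 1/16 ratio comes from the message above, while the 512-entry TLB size and the function name are assumptions for the demo):

```c
#include <assert.h>
#include <stdbool.h>

#define ACTIVE_TLB_ENTRIES 512  /* assumed per-CPU TLB capacity */
#define FLUSHALL_SHIFT     4    /* full flush above 1/16 of the TLB */

/* Returns true when per-page 'invlpg' is expected to beat rewriting
 * cr3 (a full TLB flush), per the balance-point argument above. */
static bool flush_one_by_one(unsigned long start, unsigned long end,
                             unsigned long page_size)
{
        unsigned long pages = (end - start) / page_size;

        /* Few entries: invlpg keeps the remaining TLB lines warm.
         * Many entries: one full flush beats many costly invlpgs. */
        return pages <= (ACTIVE_TLB_ENTRIES >> FLUSHALL_SHIFT);
}
```

      With these numbers, a 32-page range is flushed page by page while a 64-page range falls back to a full flush.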
  11. 08 Jun 2012, 1 commit
  12. 06 Jun 2012, 1 commit
  13. 20 Apr 2012, 1 commit
  14. 05 Mar 2012, 1 commit
    • BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Paul Gortmaker authored
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      187f1882
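      The rule in miniature, using a user-space analogue (assert.h plays the role of linux/bug.h; checked_div is a made-up example function):

```c
/* A header whose static inline uses assert() must include <assert.h>
 * itself rather than expecting every includer to have pulled it in;
 * the same goes for BUG/BUG_ON users and <linux/bug.h>. */
#include <assert.h>

static inline int checked_div(int a, int b)
{
        assert(b != 0);   /* kernel analogue: BUG_ON(b == 0) */
        return a / b;
}
```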
  15. 24 Feb 2012, 1 commit
    • static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]() · c5905afb
      Ingo Molnar authored
      
      So here's a boot tested patch on top of Jason's series that does
      all the cleanups I talked about and turns jump labels into a
      more intuitive to use facility. It should also address the
      various misconceptions and confusions that surround jump labels.
      
      Typical usage scenarios:
      
              #include <linux/static_key.h>
      
              struct static_key key = STATIC_KEY_INIT_TRUE;
      
              if (static_key_false(&key))
                      do unlikely code
              else
                      do likely code
      
      Or:
      
              if (static_key_true(&key))
                      do likely code
              else
                      do unlikely code
      
      The static key is modified via:
      
              static_key_slow_inc(&key);
              ...
              static_key_slow_dec(&key);
      
      The 'slow' prefix makes it abundantly clear that this is an
      expensive operation.
      
      I've updated all in-kernel code to use this everywhere. Note
      that I (intentionally) have not pushed through the rename
      blindly through to the lowest levels: the actual jump-label
      patching arch facility should be named like that, so we want to
      decouple jump labels from the static-key facility a bit.
      
      On non-jump-label enabled architectures static keys default to
      likely()/unlikely() branches.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: a.p.zijlstra@chello.nl
      Cc: mathieu.desnoyers@efficios.com
      Cc: davem@davemloft.net
      Cc: ddaney.cavm@gmail.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c5905afb
  16. 14 Jul 2011, 1 commit
  17. 26 Jan 2011, 1 commit
  18. 14 Jan 2011, 1 commit
  19. 28 Dec 2010, 1 commit
    • x86, paravirt: Use native_halt on a halt, not native_safe_halt · c8217b83
      Cliff Wickman authored
      halt() should use native_halt();
      safe_halt() should use native_safe_halt().
      
      If CONFIG_PARAVIRT=y, halt() is defined in arch/x86/include/asm/paravirt.h as
      
      static inline void halt(void)
      {
              PVOP_VCALL0(pv_irq_ops.safe_halt);
      }
      
      Otherwise (no CONFIG_PARAVIRT) halt() in arch/x86/include/asm/irqflags.h is
      
      static inline void halt(void)
      {
              native_halt();
      }
      
      So it looks to me like the CONFIG_PARAVIRT case of using native_safe_halt()
      for a halt() is an oversight.
      Am I missing something?
      
      It probably hasn't shown up as a problem because the local apic is disabled
      on a shutdown or restart.  But if we disable interrupts and call halt()
      we shouldn't expect that the halt() will re-enable interrupts.
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      LKML-Reference: <E1PSBcz-0001g1-FM@eag09.americas.sgi.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      c8217b83
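      The fix, modeled in user-space C (the globals are demo instrumentation; the real dispatch goes through pv_irq_ops, and the native_* bodies here only mimic the IRQ-flag side effect):

```c
#include <assert.h>

static int irqs_enabled;   /* demo model of the CPU interrupt flag */

static void native_halt(void)      { /* hlt: leaves the IRQ flag alone */ }
static void native_safe_halt(void) { irqs_enabled = 1; /* sti; hlt */ }

/* The bug: with CONFIG_PARAVIRT=y, halt() dispatched to the safe_halt
 * op and so silently re-enabled interrupts.  After the fix each wrapper
 * routes to its matching native operation: */
static void halt(void)      { native_halt(); }
static void safe_halt(void) { native_safe_halt(); }
```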
  20. 11 Nov 2010, 1 commit
    • tracing: Force arch_local_irq_* notrace for paravirt · b5908548
      Steven Rostedt authored
      When running ktest.pl randconfig tests, I would sometimes trigger
      a lockdep annotation bug (possible reason: unannotated irqs-on).
      
      This triggering happened right after function tracer self test was
      executed. After doing a config bisect I found that this was caused with
      having function tracer, paravirt guest, prove locking, and rcu torture
      all enabled.
      
      The rcu torture just enhanced the likelihood of triggering the bug.
      Prove locking was needed, since it was the thing reporting the bug.
      Function tracer would trace and disable interrupts in all sorts
      of funny places.
      paravirt guest would turn arch_local_irq_* into functions that would
      be traced.
      
      Besides the fact that tracing arch_local_irq_* is just a bad idea,
      this is what is happening.
      
      The bug happened simply in the local_irq_restore() code:
      
      		if (raw_irqs_disabled_flags(flags)) {	\
      			raw_local_irq_restore(flags);	\
      			trace_hardirqs_off();		\
      		} else {				\
      			trace_hardirqs_on();		\
      			raw_local_irq_restore(flags);	\
      		}					\
      
      The raw_local_irq_restore() was defined as arch_local_irq_restore().
      
      Now imagine, we are about to enable interrupts. We go into the else
      case and call trace_hardirqs_on() which tells lockdep that we are enabling
      interrupts, so it sets the current->hardirqs_enabled = 1.
      
      Then we call raw_local_irq_restore() which calls arch_local_irq_restore()
      which gets traced!
      
      Now in the function tracer we disable interrupts with local_irq_save().
      This is fine, but the saved flags record that interrupts were disabled.
      
      When the function tracer calls local_irq_restore() it does it, but this
      time with flags set as disabled, so we go into the if () path.
      This keeps interrupts disabled and calls trace_hardirqs_off() which
      sets current->hardirqs_enabled = 0.
      
      When the tracer is finished and proceeds with the original code,
      we enable interrupts but leave current->hardirqs_enabled as 0, which
      now breaks lockdep's internal processing.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b5908548
  21. 07 Oct 2010, 1 commit
    • Fix IRQ flag handling naming · df9ee292
      David Howells authored
      Fix the IRQ flag handling naming.  In linux/irqflags.h under one configuration,
      it maps:
      
      	local_irq_enable() -> raw_local_irq_enable()
      	local_irq_disable() -> raw_local_irq_disable()
      	local_irq_save() -> raw_local_irq_save()
      	...
      
      and under the other configuration, it maps:
      
      	raw_local_irq_enable() -> local_irq_enable()
      	raw_local_irq_disable() -> local_irq_disable()
      	raw_local_irq_save() -> local_irq_save()
      	...
      
      This is quite confusing.  There should be one set of names expected of the
      arch, and this should be wrapped to give another set of names that are expected
      by users of this facility.
      
      Change this to have the arch provide:
      
      	flags = arch_local_save_flags()
      	flags = arch_local_irq_save()
      	arch_local_irq_restore(flags)
      	arch_local_irq_disable()
      	arch_local_irq_enable()
      	arch_irqs_disabled_flags(flags)
      	arch_irqs_disabled()
      	arch_safe_halt()
      
      Then linux/irqflags.h wraps these to provide:
      
      	raw_local_save_flags(flags)
      	raw_local_irq_save(flags)
      	raw_local_irq_restore(flags)
      	raw_local_irq_disable()
      	raw_local_irq_enable()
      	raw_irqs_disabled_flags(flags)
      	raw_irqs_disabled()
      	raw_safe_halt()
      
      with type checking on the flags 'arguments', and then wraps those to provide:
      
      	local_save_flags(flags)
      	local_irq_save(flags)
      	local_irq_restore(flags)
      	local_irq_disable()
      	local_irq_enable()
      	irqs_disabled_flags(flags)
      	irqs_disabled()
      	safe_halt()
      
      with tracing included if enabled.
      
      The arch functions can now all be inline functions rather than some of them
      having to be macros.
      
      Signed-off-by: David Howells <dhowells@redhat.com> [X86, FRV, MN10300]
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> [Tile]
      Signed-off-by: Michal Simek <monstr@monstr.eu> [Microblaze]
      Tested-by: Catalin Marinas <catalin.marinas@arm.com> [ARM]
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com> [AVR]
      Acked-by: Tony Luck <tony.luck@intel.com> [IA-64]
      Acked-by: Hirokazu Takata <takata@linux-m32r.org> [M32R]
      Acked-by: Greg Ungerer <gerg@uclinux.org> [M68K/M68KNOMMU]
      Acked-by: Ralf Baechle <ralf@linux-mips.org> [MIPS]
      Acked-by: Kyle McMartin <kyle@mcmartin.ca> [PA-RISC]
      Acked-by: Paul Mackerras <paulus@samba.org> [PowerPC]
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [S390]
      Acked-by: Chen Liqin <liqin.chen@sunplusct.com> [Score]
      Acked-by: Matt Fleming <matt@console-pimps.org> [SH]
      Acked-by: David S. Miller <davem@davemloft.net> [Sparc]
      Acked-by: Chris Zankel <chris@zankel.net> [Xtensa]
      Reviewed-by: Richard Henderson <rth@twiddle.net> [Alpha]
      Reviewed-by: Yoshinori Sato <ysato@users.sourceforge.jp> [H8300]
      Cc: starvik@axis.com [CRIS]
      Cc: jesper.nilsson@axis.com [CRIS]
      Cc: linux-cris-kernel@axis.com
      df9ee292
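      The three layers can be modeled in a few lines (user-space sketch; trace_events stands in for the trace_hardirqs_on/off() hooks, and the real raw_ layer additionally type-checks the flags argument):

```c
#include <assert.h>

static int irqs_enabled = 1;  /* demo model of the CPU IRQ flag */
static int trace_events;      /* stands in for trace_hardirqs_on/off() */

/* Arch layer: among the only names an architecture must now provide. */
static void arch_local_irq_disable(void) { irqs_enabled = 0; }
static void arch_local_irq_enable(void)  { irqs_enabled = 1; }

/* raw_ layer: thin wrappers (with flags type checking in the kernel). */
#define raw_local_irq_disable() arch_local_irq_disable()
#define raw_local_irq_enable()  arch_local_irq_enable()

/* local_ layer: the user-facing names, with tracing folded in. */
#define local_irq_disable() \
        do { trace_events++; raw_local_irq_disable(); } while (0)
#define local_irq_enable() \
        do { trace_events++; raw_local_irq_enable(); } while (0)
```

      Each layer only ever calls downward, so an arch implements one fixed set of names and every caller sees another, which is exactly the confusion the renaming removes.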
  22. 24 Aug 2010, 1 commit
  23. 28 Feb 2010, 1 commit
  24. 15 Dec 2009, 2 commits
  25. 13 Oct 2009, 1 commit
    • x86/paravirt: Use normal calling sequences for irq enable/disable · 71999d98
      Jeremy Fitzhardinge authored
      Bastian Blank reported a boot crash with stackprotector enabled,
      and debugged it back to edx register corruption.
      
      For historical reasons irq enable/disable/save/restore had special
      calling sequences to make them more efficient.  With the more
      recent introduction of higher-level and more general optimisations
      this is no longer necessary so we can just use the normal PVOP_
      macros.
      
      This fixes some residual bugs in the old implementations which left
      edx liable to inadvertent clobbering. Also, fix some bugs in
      __PVOP_VCALLEESAVE which were revealed by actual use.
      Reported-by: Bastian Blank <bastian@waldi.eu.org>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stable Kernel <stable@kernel.org>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4AD3BC9B.7040501@goop.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      71999d98
  26. 16 Sep 2009, 1 commit
  27. 01 Sep 2009, 2 commits
  28. 31 Aug 2009, 7 commits
  29. 18 Jun 2009, 1 commit
  30. 16 May 2009, 1 commit
    • x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Jeremy Fitzhardinge authored
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides are: 1) it will replicate the
      preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still it is a
      reasonable short-term workaround.
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: "Li Xin" <xin.li@intel.com>
      Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b4ecc126