1. 29 5月, 2009 29 次提交
  2. 27 5月, 2009 4 次提交
  3. 25 5月, 2009 1 次提交
  4. 23 5月, 2009 1 次提交
  5. 22 5月, 2009 1 次提交
  6. 16 5月, 2009 1 次提交
    • J
      x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Jeremy Fitzhardinge 提交于
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides arel 1) it will replicate the
      preempt_disable/enable code at eack lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still it is a
      reasonable short-term workaround.
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: N"Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: N"Li Xin" <xin.li@intel.com>
      Analysed-by: N"Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b4ecc126
  7. 15 5月, 2009 1 次提交
  8. 14 5月, 2009 1 次提交
    • S
      x86/function-graph: fix constraint for recording old return value · aa512a27
      Steven Rostedt 提交于
      After upgrading from gcc 4.2.2 to 4.4.0, the function graph tracer broke.
      Investigating, I found that in the asm that replaces the return value,
      gcc was using the same register for the old value as it was for the
      new value.
      
      	mov	(addr), old
      	mov	new, (addr)
      
      But if old and new are the same register, we clobber new with old!
      I first thought this was a bug in gcc 4.4.0 and reported it:
      
        http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40132
      
      Andrew Pinski responded (quickly), saying that it was correct gcc behavior
      and the code needed to denote old as an "early clobber".
      
      Instead of "=r"(old), we need "=&r"(old).
      
      [Impact: keep function graph tracer from breaking with gcc 4.4.0 ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      aa512a27
  9. 11 5月, 2009 1 次提交
    • Y
      x86: mtrr: Fix high_width computation when phys-addr is >= 44bit · 917a0153
      Yinghai Lu 提交于
      found one system where cpu address line is 44bits, mtrr printout
      is not right:
      
       [    0.000000] MTRR variable ranges enabled:
       [    0.000000]   0 base 0   00000000 mask FF0 00000000 write-back
       [    0.000000]   1 base 10  00000000 mask FFF 80000000 write-back
       [    0.000000]   2 base 0   80000000 mask FFF 80000000 uncachable
       [    0.000000]   3 base 0   7F800000 mask FFF FF800000 uncachable
      
      Li Zefan and Frederic pointed out the high_width could be -4 some how.
      
      It turns out when phys_addr is 44bit, size_or_mask will be
      ffffffff,00000000 so ffs(size_or_mask) will be 0.
      
      Try to check low 32 bit, to get correct high_width.
      Signed-off-by: NYinghai Lu <yinghai@kerne.org>
      Also-analyzed-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Also-analyzed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Zhaolei <zhaolei@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4A026540.8060504@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      917a0153