1. 31 Aug 2009, 4 commits
  2. 27 Aug 2009, 1 commit
  3. 16 May 2009, 1 commit
      x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Committed by Jeremy Fitzhardinge
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides are: 1) it will replicate the
      preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still, it is a
      reasonable short-term workaround.
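
      A rough sketch of what the config-option approach looks like (the
      option name CONFIG_PARAVIRT_SPINLOCKS and the surrounding details
      are assumptions in this sketch, not the exact diff):

      #ifdef CONFIG_PARAVIRT_SPINLOCKS
      static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
      {
      	/* indirect pvop call; patched to a direct call at boot */
      	PVOP_VCALL1(pv_lock_ops.spin_lock, lock);
      }
      #else
      static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
      {
      	/* no pvops indirection: plain ticket lock, no extra call/return */
      	__ticket_spin_lock(lock);
      }
      #endif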
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: "Li Xin" <xin.li@intel.com>
      Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b4ecc126
  4. 30 Mar 2009, 6 commits
  5. 19 Mar 2009, 1 commit
  6. 23 Feb 2009, 1 commit
      x86: refactor x86_quirks support · 8e6dafd6
      Committed by Ingo Molnar
      Impact: cleanup
      
      Make x86_quirks support more transparent. The high-level
      methods are now named:
      
        extern void x86_quirk_pre_intr_init(void);
        extern void x86_quirk_intr_init(void);
      
        extern void x86_quirk_trap_init(void);
      
        extern void x86_quirk_pre_time_init(void);
        extern void x86_quirk_time_init(void);
      
      This makes it clear that if some platform extension has to
      do something here, it is considered ... weird, and is
      discouraged.
      
      Also remove arch_hooks.h and move its contents into setup.h (and
      other header files where appropriate).
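
      For illustration, a minimal sketch of one such wrapper (the
      x86_quirks field name is an assumption): the quirk hook only runs
      if a platform actually registered one.

      void __init x86_quirk_pre_time_init(void)
      {
      	/* call the platform's pre-time-init quirk, if any */
      	if (x86_quirks->arch_pre_time_init)
      		x86_quirks->arch_pre_time_init();
      }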
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8e6dafd6
  7. 13 Feb 2009, 2 commits
  8. 31 Jan 2009, 3 commits
      x86/paravirt: use callee-saved convention for pte_val/make_pte/etc · da5de7c2
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      In the native case, pte_val, make_pte, etc are all just identity
      functions, so there's no need to clobber a lot of registers over them.
      
      (This changes the 32-bit callee-save calling convention to return both
      EAX and EDX so functions can return 64-bit values.)
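
      For illustration, a sketch under assumed names: the native
      implementation is a pure identity function, and registering it via
      the callee-saved convention keeps nearly every caller register live
      across the call.

      /* native pte_val returns its argument unchanged */
      static pteval_t native_pte_val(pte_t pte)
      {
      	return pte.pte;
      }

      /* hooked up through the callee-save wrapper, roughly:
       *	.pte_val = PV_CALLEE_SAVE(native_pte_val),
       */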
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      da5de7c2
      x86/paravirt: add register-saving thunks to reduce caller register pressure · ecb93d1c
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      One of the problems with inserting a pile of C calls where previously
      there were none is that the register pressure is greatly increased.
      The C calling convention says that the caller must expect a certain
      set of registers may be trashed by the callee, and that the callee can
      use those registers without restriction.  This includes the function
      argument registers, and several others.
      
      This patch seeks to alleviate this pressure by introducing wrapper
      thunks that will do the register saving/restoring, so that the
      callsite doesn't need to worry about it, but the callee function can
      be conventional compiler-generated code.  In many cases (particularly
      performance-sensitive cases) the callee will be in assembler anyway,
      and need not use the compiler's calling convention.
      
      Standard calling convention is:
      	 arguments	    return	scratch
      x86-32	 eax edx ecx	    eax		?
      x86-64	 rdi rsi rdx rcx    rax		r8 r9 r10 r11
      
      The thunk preserves all argument and scratch registers.  The return
      register is not preserved, and is available as a scratch register for
      unwrapped callee code (and of course the return value).
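
      A sketch of such a thunk for x86-64 (the symbol names here are
      hypothetical; the real thunks are macro-generated):

      extern void native_irq_disable(void);

      asm(".pushsection .text\n"
          "irq_disable_thunk:\n"
          "  push %rdi; push %rsi; push %rdx; push %rcx\n"
          "  push %r8;  push %r9;  push %r10; push %r11\n"
          "  sub  $8, %rsp\n"		/* keep the stack 16-byte aligned */
          "  call native_irq_disable\n"
          "  add  $8, %rsp\n"
          "  pop  %r11; pop %r10; pop %r9;  pop %r8\n"
          "  pop  %rcx; pop %rdx; pop %rsi; pop %rdi\n"
          "  ret\n"
          ".popsection\n");

      The call site can then treat the thunk as clobbering only the
      return register, rather than the whole scratch set.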
      
      Wrapped function pointers are themselves wrapped in a struct
      paravirt_callee_save structure, in order to get some warning from the
      compiler when functions with mismatched calling conventions are used.
      
      The most common paravirt ops, both statically and dynamically, are
      interrupt enable/disable/save/restore, so handle them first.  This is
      particularly easy since their calls are handled specially anyway.
      
      XXX Deal with VMI.  What's their calling convention?
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      ecb93d1c
      x86/pvops: add a paravirt_ident functions to allow special patching · 41edafdb
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      Several paravirt ops implementations simply return their arguments,
      the most obvious being the make_pte/pte_val class of operations on
      native.
      
      On 32-bit, the identity function is literally a no-op, as the calling
      convention uses the same registers for the first argument and return.
      On 64-bit, it can be implemented with a single "mov".
      
      This patch adds special identity functions for 32- and 64-bit arguments,
      and machinery to recognize them and replace them with either nops or a
      mov as appropriate.
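
      For illustration, a sketch of the identity helpers (spellings
      assumed from the subject line):

      /* 32-bit: argument and return share a register, so this compiles
       * to a bare 'ret'; 64-bit compiles to 'mov %rdi,%rax; ret'.  The
       * patcher recognizes these targets and inlines the equivalent
       * (nothing, or the single mov) at the call site. */
      u32 _paravirt_ident_32(u32 x) { return x; }
      u64 _paravirt_ident_64(u64 x) { return x; }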
      
      At the moment, the only users for the identity functions are the
      pagetable entry conversion functions.
      
      The result is a measurable improvement on pagetable-heavy benchmarks
      (2-3%, reducing the pvops overhead from 5% to 2%).
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      41edafdb
  9. 23 Jan 2009, 1 commit
  10. 22 Aug 2008, 1 commit
  11. 21 Aug 2008, 1 commit
  12. 24 Jul 2008, 2 commits
  13. 22 Jul 2008, 1 commit
      x86: fix pte_flags() to only return flags, fix lguest (updated) · c2e3277f
      Committed by Rusty Russell
      (Jeremy said:
      	rusty: use PTE_MASK
      	rusty: use PTE_MASK
      	rusty: use PTE_MASK
       When I asked:
      	jsgf: does that include the NX flag?
       He responded eloquently:
      	rusty: use PTE_MASK
      	rusty: use PTE_MASK
      	yes, it's the official constant of masking flags out of ptes
      )
      
      Change a15af1c9 'x86/paravirt: add
      pte_flags to just get pte flags' removed lguest's private pte_flags()
      in favor of a generic one.
      
      Unfortunately, the generic one doesn't filter out the non-flags bits:
      this results in lguest creating corrupt shadow page tables and blowing
      up host memory.
      
      Since no one is supposed to use the pfn part of pte_flags(), it seems
      safest to always do the filtering.
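
      A sketch of the always-filtering accessor (the exact expression is
      an assumption):

      static inline pteval_t pte_flags(pte_t pte)
      {
      	/* PTE_MASK covers the pfn bits, so its complement is the
      	 * flags, including NX; the pfn part can never leak out */
      	return native_pte_val(pte) & ~PTE_MASK;
      }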
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-and-morning-tea-spilled-by: Ingo Molnar <mingo@elte.hu>
      c2e3277f
  14. 18 Jul 2008, 1 commit
      x86: APIC: remove apic_write_around(); use alternatives · 593f4a78
      Committed by Maciej W. Rozycki
      Use alternatives to select the workaround for the 11AP Pentium erratum
      for the affected steppings on the fly rather than build time.  Remove the
      X86_GOOD_APIC configuration option and replace all the calls to
      apic_write_around() with plain apic_write(), protecting accesses to the
      ESR as appropriate due to the 3AP Pentium erratum.  Remove
      apic_read_around() and all its invocations altogether as not needed.
      Remove apic_write_atomic() and all its implementing backends.  The use of
      ASM_OUTPUT2() is not strictly needed for input constraints, but I have
      used it for readability's sake.
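
      A sketch of the resulting pattern (details assumed; ASM_OUTPUT2 is
      the constraint-wrapping macro mentioned above):

      static inline void apic_write(u32 reg, u32 v)
      {
      	volatile u32 *addr = (volatile u32 *)(APIC_BASE + reg);

      	/* plain mov by default; patched to a serializing xchg
      	 * on steppings with the 11AP erratum */
      	alternative_io("movl %0, %1", "xchgl %0, %1", X86_FEATURE_11AP,
      		       ASM_OUTPUT2("=r" (v), "=m" (*addr)),
      		       ASM_OUTPUT2("0" (v), "m" (*addr)));
      }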
      
      I had the feeling no one else was brave enough to do it, so I went ahead
      and here it is.  Verified by checking the generated assembly and tested
      with both a 32-bit and a 64-bit configuration, also with the 11AP
      "feature" forced on and verified with gdb on /proc/kcore to work as
      expected (as 11AP machines are quite hard to get hold of these days).
      Some script complained about the use of "volatile", but apic_write() needs
      it for the same reason and is effectively a replacement for writel(), so I
      have disregarded it.
      
      I am not sure what the policy wrt defconfig files is; they are generated
      and there is a risk of a conflict resulting from an unrelated change, so I
      have left changes to them out.  The option will get removed from them at
      the next run.
      
      Some testing with machines other than mine will be needed to avoid some
      stupid mistake, but despite its volume, the change is not really that
      intrusive, so I am fairly confident that because it works for me, it
      will work everywhere.
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      593f4a78
  15. 16 Jul 2008, 5 commits
      x86: paravirt spinlocks, modular build fix · 9af98578
      Committed by Ingo Molnar
      fix:
      
        MODPOST 408 modules
      ERROR: "pv_lock_ops" [net/dccp/dccp.ko] undefined!
      ERROR: "pv_lock_ops" [fs/jbd2/jbd2.ko] undefined!
      ERROR: "pv_lock_ops" [drivers/media/common/saa7146_vv.ko] undefined!
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9af98578
      x86: paravirt spinlocks, !CONFIG_SMP build fixes · 4bb689ee
      Committed by Ingo Molnar
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4bb689ee
      paravirt: introduce a "lock-byte" spinlock implementation · 8efcbab6
      Committed by Jeremy Fitzhardinge
      Implement a version of the old spinlock algorithm, in which everyone
      spins waiting for a lock byte.  In order to be compatible with the
      ticket-lock's use of a zero initializer, this uses the convention of
      '0' for unlocked and '1' for locked.
      
      This algorithm is much better than ticket locks in a virtual
      environment, because it doesn't interact badly with the vcpu scheduler.
      If there are multiple vcpus spinning on a lock and the lock is
      released, the next vcpu to be scheduled will take the lock, rather
      than cycling around until the next ticketed vcpu gets it.
      
      To use this, you must call paravirt_use_bytelocks() very early, before
      any spinlocks have been taken.
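
      A minimal, freestanding sketch of the algorithm (assumed shape;
      the kernel version is hand-written inline asm behind the
      pv_lock_ops hooks):

      static void byte_spin_lock(volatile unsigned char *slock)
      {
      	/* atomic xchg: reading back 0 means the lock was free */
      	while (__sync_lock_test_and_set(slock, 1) != 0)
      		while (*slock)			/* spin read-only */
      			__builtin_ia32_pause();	/* rep;nop */
      }

      static void byte_spin_unlock(volatile unsigned char *slock)
      {
      	__sync_lock_release(slock);	/* store 0, release semantics */
      }

      Unlike a ticket lock, whichever vcpu wins the xchg takes the lock,
      so the hypervisor's scheduling decision can never be "wrong".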
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <clameter@linux-foundation.org>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Virtualization <virtualization@lists.linux-foundation.org>
      Cc: Xen devel <xen-devel@lists.xensource.com>
      Cc: Thomas Friebel <thomas.friebel@amd.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8efcbab6
      x86/paravirt: add hooks for spinlock operations · 74d4affd
      Committed by Jeremy Fitzhardinge
      Ticket spinlocks have absolutely ghastly worst-case performance
      characteristics in a virtual environment.  If there is any contention
      for physical CPUs (ie, there are more runnable vcpus than cpus), then
      ticket locks can cause the system to end up spending 90+% of its time
      spinning.
      
      The problem is that (v)cpus waiting on a ticket spinlock will be
      granted access to the lock in the strict order in which they got their tickets.  If
      the hypervisor scheduler doesn't give the vcpus time in that order,
      they will burn timeslices waiting for the scheduler to give the right
      vcpu some time.  In the worst case it could take O(n^2) vcpu scheduler
      timeslices for everyone waiting on the lock to get it, not counting
      new cpus trying to take the lock while the log-jam is sorted out.
      
      These hooks allow a paravirt backend to replace the spinlock
      implementation.
      
      At the very least, this could revert the implementation to the
      old lock algorithm, which allows the next scheduled vcpu to take the
      lock, and has reasonably good performance.
      
      It also allows the spinlocks to take advantage of hypervisor
      features to make locks more efficient (spin and block, for example).
      
      The cost to native execution is an extra direct call when using a
      spinlock function.  There's no overhead if CONFIG_PARAVIRT is turned
      off.
      
      The lock structure is fixed at a single "unsigned int", initialized to
      zero, but the spinlock implementation can use it as it wishes.
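
      A sketch of the hook structure (the member list is inferred from
      the description, not copied from the code):

      struct pv_lock_ops {
      	int  (*spin_is_locked)(struct raw_spinlock *lock);
      	int  (*spin_is_contended)(struct raw_spinlock *lock);
      	void (*spin_lock)(struct raw_spinlock *lock);
      	int  (*spin_trylock)(struct raw_spinlock *lock);
      	void (*spin_unlock)(struct raw_spinlock *lock);
      };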
      
      Thanks to Thomas Friebel's Xen Summit talk "Preventing Guests from
      Spinning Around" for pointing out this problem.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <clameter@linux-foundation.org>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Virtualization <virtualization@lists.linux-foundation.org>
      Cc: Xen devel <xen-devel@lists.xensource.com>
      Cc: Thomas Friebel <thomas.friebel@amd.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      74d4affd
      x86/paravirt: call paravirt_pagetable_setup_{start, done} · a312b37b
      Committed by Eduardo Habkost
      Call paravirt_pagetable_setup_{start,done}
      
      These paravirt_ops functions were not being called on x86_64.
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a312b37b
  16. 14 Jul 2008, 1 commit
  17. 12 Jul 2008, 2 commits
      x64, x2apic/intr-remap: Interrupt-remapping and x2apic support · 372e92d8
      Committed by Suresh Siddha
      On Thu, Jul 10, 2008 at 12:53:20PM -0700, Ingo Molnar wrote:
      >
      > Btw., i threw it at the -tip test-cluster and got back a quick build
      > bugreport:
      >
      > arch/x86/xen/enlighten.c: In function 'xen_patch':
      > arch/x86/xen/enlighten.c:1084: warning: label 'patch_site' defined but not used
      > arch/x86/xen/enlighten.c: At top level:
      > arch/x86/xen/enlighten.c:1272: error: expected identifier before '(' token
      > arch/x86/xen/enlighten.c:1273: error: expected '}' before '.' token
      > arch/x86/kernel/paravirt.c:376:2: error: invalid preprocessing directive #ifnded
      > arch/x86/kernel/paravirt.c:384:2: error: #endif without #if
      >
      > with this config:
      >
      >   http://redhat.com/~mingo/misc/config-Thu_Jul_10_21_43_28_CEST_2008.bad
      
      fix the typo.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: "Siddha
      Cc: Suresh B" <suresh.b.siddha@intel.com>
      Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>
      Cc: "arjan@linux.intel.com" <arjan@linux.intel.com>
      Cc: "andi@firstfloor.org" <andi@firstfloor.org>
      Cc: "ebiederm@xmission.com" <ebiederm@xmission.com>
      Cc: "jbarnes@virtuousgeek.org" <jbarnes@virtuousgeek.org>
      Cc: "steiner@sgi.com" <steiner@sgi.com>
      Cc: jeremy@goop.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      372e92d8
      x64, x2apic/intr-remap: basic apic ops support · 1b374e4d
      Committed by Suresh Siddha
      Introduce basic apic operations which handle the apic programming. This
      will be used later to introduce x2apic-specific operations.
      
      For the performance-critical accesses like IPIs, EOI, etc., we use the
      native operations, as they are already referenced through different
      indirections like genapic, irq_chip, etc.
      
      64-bit paravirt ops can also define their apic operations accordingly.
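
      A sketch of the ops table this introduces (the member list is an
      assumption based on the description):

      struct apic_ops {
      	u32  (*read)(u32 reg);
      	void (*write)(u32 reg, u32 v);
      	u64  (*icr_read)(void);
      	void (*icr_write)(u32 low, u32 high);
      	void (*wait_icr_idle)(void);
      	u32  (*safe_wait_icr_idle)(void);
      };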
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: akpm@linux-foundation.org
      Cc: arjan@linux.intel.com
      Cc: andi@firstfloor.org
      Cc: ebiederm@xmission.com
      Cc: jbarnes@virtuousgeek.org
      Cc: steiner@sgi.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1b374e4d
  18. 09 Jul 2008, 1 commit
  19. 08 Jul 2008, 5 commits