1. 04 Dec 2009, 1 commit
  2. 31 Aug 2009, 1 commit
  3. 16 May 2009, 1 commit
    •
      x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Committed by Jeremy Fitzhardinge
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
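
      For reference, the shape of that indirection in C is roughly the following.
      This is a simplified sketch only loosely modeled on pv_lock_ops, not the
      kernel's actual definitions:

      struct pv_lock_ops {
              void (*spin_lock)(raw_spinlock_t *lock);
              void (*spin_unlock)(raw_spinlock_t *lock);
      };

      /* on native, .spin_lock points at __ticket_spin_lock */
      extern struct pv_lock_ops pv_lock_ops;

      static inline void __raw_spin_lock(raw_spinlock_t *lock)
      {
              /* indirect call; paravirt patching later turns it into a direct
               * call, but the extra call/return pair in the path remains */
              pv_lock_ops.spin_lock(lock);
      }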
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides are: 1) it will replicate the
      preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still, it is a
      reasonable short-term workaround.
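
      In practice that workaround amounts to compiling the byte-lock indirection
      out unless it is explicitly requested. A sketch, assuming a
      CONFIG_PARAVIRT_SPINLOCKS-style option (simplified, not the real headers):

      #ifdef CONFIG_PARAVIRT_SPINLOCKS
      /* Xen (or another pv guest) may replace these with blocking locks */
      #define __raw_spin_lock(l)      pv_lock_ops.spin_lock(l)
      #define __raw_spin_unlock(l)    pv_lock_ops.spin_unlock(l)
      #else
      /* native-oriented builds call the ticket-lock code directly,
       * with no extra call/return in the path */
      #define __raw_spin_lock(l)      __ticket_spin_lock(l)
      #define __raw_spin_unlock(l)    __ticket_spin_unlock(l)
      #endif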
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: "Li Xin" <xin.li@intel.com>
      Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b4ecc126
  4. 09 Apr 2009, 1 commit
  5. 31 Mar 2009, 1 commit
  6. 30 Mar 2009, 1 commit
  7. 05 Feb 2009, 1 commit
  8. 31 Jan 2009, 1 commit
  9. 17 Dec 2008, 1 commit
  10. 01 Dec 2008, 1 commit
  11. 09 Sep 2008, 1 commit
  12. 25 Aug 2008, 1 commit
    •
      xen: implement CPU hotplugging · d68d82af
      Committed by Alex Nixon
      Note the changes from 2.6.18-xen CPU hotplugging:
      
      A vcpu_down request from the remote admin via Xenbus both hot-unplugs the
      CPU and disables it, by removing it from the cpu_present map and removing
      its entry in /sys.
      
      A vcpu_up request from the remote admin only re-enables the CPU, and does
      not immediately bring the CPU up. A udev event is emitted, which userspace
      can catch to automatically re-up CPUs when they become available, or to
      implement a more complex policy.
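
      The xenstore side of this is small; a sketch of the handler, assuming the
      per-cpu "availability" node carries "online"/"offline" (a simplified,
      helper-free version, not the exact driver code):

      static void vcpu_hotplug(unsigned int cpu, const char *state)
      {
              if (strcmp(state, "offline") == 0) {
                      /* hot-unplug the vcpu, then hide it from /sys and
                       * the cpu_present map */
                      (void)cpu_down(cpu);
                      cpu_clear(cpu, cpu_present_map);
              } else {
                      /* re-enable only; a udev event tells userspace it may
                       * bring the cpu up when (and if) it wants to */
                      cpu_set(cpu, cpu_present_map);
              }
      }
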
      Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
      Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d68d82af
  13. 22 Aug 2008, 1 commit
  14. 31 Jul 2008, 1 commit
  15. 24 Jul 2008, 1 commit
  16. 16 Jul 2008, 5 commits
  17. 09 Jul 2008, 1 commit
  18. 26 Jun 2008, 1 commit
  19. 02 Jun 2008, 2 commits
  20. 27 May 2008, 4 commits
  21. 25 Apr 2008, 3 commits
  22. 17 Apr 2008, 1 commit
  23. 17 Oct 2007, 3 commits
    •
      xen: deal with stale cr3 values when unpinning pagetables · 9f79991d
      Committed by Jeremy Fitzhardinge
      When a pagetable is no longer in use, it must be unpinned so that its
      pages can be freed.  However, this is only possible if there are no
      stray uses of the pagetable.  The code currently deals with all the
      usual cases, but there's a rare case where a vcpu is changing cr3, but
      is doing so lazily, and the change hasn't actually happened by the time
      the pagetable is unpinned, even though it appears to have been completed.
      
      This change adds a second per-cpu cr3 variable - xen_current_cr3 -
      which tracks the actual state of the vcpu cr3.  It is only updated once
      the actual hypercall to set cr3 has been completed.  Other processors
      wishing to unpin a pagetable can check the other vcpus' xen_current_cr3
      values to see if any cross-cpu IPIs are needed to clean things up.
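
      In outline (a simplified sketch; the function and IPI-helper names here are
      illustrative, not the exact kernel code):

      /* cr3 as the hypervisor actually has it; updated only after the
       * set-cr3 hypercall has really completed */
      static DEFINE_PER_CPU(unsigned long, xen_current_cr3);

      /* before unpinning, find the cpus whose live cr3 still points at this
       * mm's pagetable and IPI them to drop the reference */
      static void drop_stale_mm_refs(struct mm_struct *mm)
      {
              cpumask_t mask = CPU_MASK_NONE;
              unsigned int cpu;

              for_each_online_cpu(cpu)
                      if (per_cpu(xen_current_cr3, cpu) == __pa(mm->pgd))
                              cpu_set(cpu, mask);

              if (!cpus_empty(mask))
                      smp_call_function_mask(mask, drop_other_mm_ref, mm, 1);
      }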
      
      [ Stable folks: 2.6.23 bugfix ]
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Stable Kernel <stable@kernel.org>
      9f79991d
    •
      xen: yield to IPI target if necessary · f0d73394
      Committed by Jeremy Fitzhardinge
      When sending a call-function IPI to a vcpu, yield if the vcpu isn't
      running.
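
      Concretely, something along these lines in the call-function send path
      (a sketch; SCHEDOP_yield is Xen's scheduler op, the wrapper name is
      illustrative):

      static void xen_yield_to_targets(cpumask_t mask)
      {
              bool yield = false;
              unsigned int cpu;

              /* is any target vcpu currently preempted by the hypervisor? */
              for_each_cpu_mask(cpu, mask)
                      if (xen_vcpu_stolen(cpu))
                              yield = true;

              /* if so, give up our timeslice rather than spin waiting for it */
              if (yield)
                      HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
      }
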
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      f0d73394
    •
      paravirt: clean up lazy mode handling · 8965c1c0
      Committed by Jeremy Fitzhardinge
      Currently, the set_lazy_mode pv_op is overloaded with 5 functions:
       1. enter lazy cpu mode
       2. leave lazy cpu mode
       3. enter lazy mmu mode
       4. leave lazy mmu mode
       5. flush pending batched operations
      
      This complicates each paravirt backend, since it needs to deal with
      all the possible state transitions, handling flushing, etc. In
      particular, flushing is quite distinct from the other 4 functions, and
      seems to just cause complication.
      
      This patch removes the set_lazy_mode operation, and adds "enter" and
      "leave" lazy mode operations on mmu_ops and cpu_ops.  All the logic
      associated with enter and leaving lazy states is now in common code
      (basically BUG_ONs to make sure that no mode is current when entering
      a lazy mode, and that the correct mode is current when leaving).
      Also, flush is handled in a common way, by simply leaving and
      re-entering the lazy mode.
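
      The common code boils down to something like this (a simplified sketch of
      the shared enter/leave/flush helpers, not the exact kernel code):

      enum paravirt_lazy_mode {
              PARAVIRT_LAZY_NONE,
              PARAVIRT_LAZY_MMU,
              PARAVIRT_LAZY_CPU,
      };

      static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode);

      static void enter_lazy(enum paravirt_lazy_mode mode)
      {
              /* no lazy mode may already be active when entering one */
              BUG_ON(__get_cpu_var(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
              __get_cpu_var(paravirt_lazy_mode) = mode;
      }

      static void leave_lazy(enum paravirt_lazy_mode mode)
      {
              /* the mode being left must be the one currently active */
              BUG_ON(__get_cpu_var(paravirt_lazy_mode) != mode);
              __get_cpu_var(paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
      }

      /* flushing is just leaving and immediately re-entering the mode */
      void arch_flush_lazy_mmu_mode(void)
      {
              if (__get_cpu_var(paravirt_lazy_mode) == PARAVIRT_LAZY_MMU) {
                      arch_leave_lazy_mmu_mode();
                      arch_enter_lazy_mmu_mode();
              }
      }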
      
      The result is that the Xen, lguest and VMI lazy mode implementations
      are much simpler.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Zach Amsden <zach@vmware.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Anthony Liguory <aliguori@us.ibm.com>
      Cc: "Glauber de Oliveira Costa" <glommer@gmail.com>
      Cc: Jun Nakajima <jun.nakajima@intel.com>
      8965c1c0
  24. 11 Oct 2007, 1 commit
  25. 18 Jul 2007, 4 commits
    •
      xen: use iret directly when possible · 9ec2b804
      Committed by Jeremy Fitzhardinge
      Most of the time we can simply use the iret instruction to exit the
      kernel, rather than having to use the iret hypercall - the only
      exception is if we're returning into vm86 mode, or from delivering an
      NMI (which we don't support yet).
      
      When running native, iret has the behaviour of testing for a pending
      interrupt atomically with re-enabling interrupts.  Unfortunately
      there's no way to do this with Xen, so there's a window in which we
      could get a recursive exception after enabling events but before
      actually returning to userspace.
      
      This causes a problem: if the nested interrupt causes one of the
      task's TIF_WORK_MASK flags to be set, they will not be checked again
      before returning to userspace.  This means that pending work may be
      left pending indefinitely, until the process enters and leaves the
      kernel again.  The net effect is that a pending signal or reschedule
      event could be delayed for an unbounded amount of time.
      
      To deal with this, the xen event upcall handler checks to see if the
      EIP is within the critical section of the iret code, which runs from the
      point where events are (potentially) re-enabled up to the iret itself.  If
      it's within this range, it calls the iret critical section fixup, which adjusts the
      stack to deal with any unrestored registers, and then shifts the
      stack frame up to replace the previous invocation.
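
      Expressed in C for clarity (in the kernel both the range check and the
      fixup live in the assembly upcall path; the symbols below mark the
      critical region, and the wrapper name is illustrative):

      extern const char xen_iret_start_crit[], xen_iret_end_crit[];

      static void check_iret_critical(struct pt_regs *regs)
      {
              /* was the upcall taken inside the iret critical section? */
              if (regs->eip >= (unsigned long)xen_iret_start_crit &&
                  regs->eip <  (unsigned long)xen_iret_end_crit)
                      /* unwind any partially-restored registers and replace
                       * the interrupted iret frame with this upcall's frame */
                      xen_iret_crit_fixup(regs);
      }
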
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      9ec2b804
    •
      xen: Attempt to patch inline versions of common operations · 6487673b
      Committed by Jeremy Fitzhardinge
      This patch adds the mechanism to allow us to patch inline versions of
      common operations.
      
      The implementations of the direct-access versions save_fl, restore_fl,
      irq_enable and irq_disable are now in assembler, and the same code is
      used for both out of line and inline uses.
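
      The patching step is roughly as follows (a sketch; the template-lookup
      helper is illustrative, while paravirt_patch_insns/paravirt_patch_default
      are the generic helpers):

      static unsigned xen_patch(u8 type, u16 clobbers, void *insns, unsigned len)
      {
              const char *start = NULL, *end = NULL;

              /* the same assembler body is used out of line and as the
               * inline template */
              lookup_template(type, &start, &end);    /* illustrative helper */

              if (start && end && (unsigned)(end - start) <= len)
                      return paravirt_patch_insns(insns, len, start, end);

              /* template doesn't fit (or none exists): keep the patched call */
              return paravirt_patch_default(type, clobbers, insns, len);
      }
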
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Keir Fraser <keir@xensource.com>
      6487673b
    •
      xen: Place vcpu_info structure into per-cpu memory · 60223a32
      Committed by Jeremy Fitzhardinge
      An experimental patch for Xen allows guests to place their vcpu_info
      structs anywhere.  We try to use this to place the vcpu_info into the
      PDA, which allows direct access.
      
      If this works, we then switch to using direct-access operations for
      irq_enable, irq_disable, save_fl and restore_fl.
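
      The placement itself is one hypercall per vcpu, roughly as below (a sketch;
      the real setup also falls back to the fixed shared_info slot when the
      hypercall is unsupported):

      static DEFINE_PER_CPU(struct vcpu_info *, xen_vcpu);
      static DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);

      static void xen_vcpu_setup(int cpu)
      {
              struct vcpu_register_vcpu_info info;
              struct vcpu_info *vcpup = &per_cpu(xen_vcpu_info, cpu);

              info.mfn = virt_to_mfn(vcpup);          /* machine frame of the percpu copy */
              info.offset = offset_in_page(vcpup);

              /* ask Xen to keep this vcpu's vcpu_info at our percpu address;
               * if it works, the direct-access irq ops become usable */
              if (HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info) == 0)
                      per_cpu(xen_vcpu, cpu) = vcpup;
      }
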
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Keir Fraser <keir@xensource.com>
      60223a32
    •
      xen: SMP guest support · f87e4cac
      Committed by Jeremy Fitzhardinge
      This is a fairly straightforward Xen implementation of smp_ops.
      
      Xen has its own IPI mechanisms, and has no dependency on any
      APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
      allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
      operation is a single apic_read for the apic version number).
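
      The hooks themselves are just a table of function pointers, along these
      lines (a sketch; member names only approximate the smp_ops interface of
      that era):

      static const struct smp_ops xen_smp_ops = {
              .smp_prepare_boot_cpu   = xen_smp_prepare_boot_cpu,
              .smp_prepare_cpus       = xen_smp_prepare_cpus,
              .cpu_up                 = xen_cpu_up,
              .smp_cpus_done          = xen_smp_cpus_done,

              .smp_send_stop          = xen_smp_send_stop,
              .smp_send_reschedule    = xen_smp_send_reschedule,
              .smp_call_function_mask = xen_smp_call_function_mask,
      };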
      
      One subtle point which needs to be addressed is unpinning pagetables
      when another cpu may have a lazy tlb reference to the pagetable. Xen
      will not allow an in-use pagetable to be unpinned, so we must find any
      other cpus with a reference to the pagetable and get them to shoot
      down their references.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      f87e4cac