1. 31 Aug 2009 — 1 commit
  2. 18 Jun 2009 — 1 commit
  3. 16 May 2009 — 1 commit
    • x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Committed by Jeremy Fitzhardinge
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
       case.  The downsides are: 1) it will replicate the
       preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
       system CPU when guests block rather than spin).  Still, it is a
       reasonable short-term workaround.
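       As a sketch of that workaround (close in spirit to the real change,
       though the exact #ifdef layout across paravirt.h and spinlock.h
       differs), the new CONFIG_PARAVIRT_SPINLOCKS option gates whether the
       lock operations go through the patchable pvops call at all:

           /* Hedged sketch, not the verbatim kernel code. */
           #ifdef CONFIG_PARAVIRT_SPINLOCKS
           /* patchable call through the pvops table */
           static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
           {
           	PVOP_VCALL1(pv_lock_ops.spin_lock, lock);
           }
           #else
           /* no pvops indirection: straight to the native ticket lock */
           static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
           {
           	__ticket_spin_lock(lock);
           }
           #endif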
      
      [ Impact: fix pvops performance regression when running native ]
       Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
       Analysed-by: "Li Xin" <xin.li@intel.com>
       Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
       Acked-by: H. Peter Anvin <hpa@zytor.com>
       Cc: Nick Piggin <npiggin@suse.de>
       Cc: Xen-devel <xen-devel@lists.xensource.com>
       LKML-Reference: <4A0B62F7.5030802@goop.org>
       [ fixed the help text ]
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 11 Apr 2009 — 1 commit
  5. 10 Apr 2009 — 1 commit
  6. 30 Mar 2009 — 3 commits
  7. 19 Mar 2009 — 1 commit
  8. 17 Mar 2009 — 1 commit
    • x86, paravirt: prevent gcc from generating the wrong addressing mode · 42854dc0
      Committed by Jeremy Fitzhardinge
      Impact: fix crash on VMI (VMware)
      
      When we generate a call sequence for calling a paravirtualized
      function, we presume that the generated code is "call *0xXXXXX",
      which is a 6 byte opcode; this is larger than a normal
      direct call, and so we can patch a direct call over it.
      
       At the moment, however, we give gcc enough rope to hang us by
       putting the address in a register and generating a two-byte
      indirect-via-register call.  Prevent this by explicitly
      dereferencing the function pointer and passing it into the
      asm as a constant.
      
      This prevents crashes in VMI, as it cannot handle unpatchable
      callsites.
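       The shape of the fix, as a hedged sketch (the operand shown is
       illustrative; the real change lives in the PVOP_CALL/PARAVIRT_CALL
       macros):

           /* before: "m" lets gcc load the pointer into a register and
            * emit a 2-byte call *%reg, which is too short to patch */
           asm volatile("call *%[op]"
           	     : : [op] "m" (pv_irq_ops.irq_disable));

           /* after: pass the address of the slot as an immediate; the %c
            * modifier prints it bare, forcing the 6-byte call *<address> */
           asm volatile("call *%c[op]"
           	     : : [op] "i" (&pv_irq_ops.irq_disable));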
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
       Cc: Alok Kataria <akataria@vmware.com>
       LKML-Reference: <49BEEDC2.2070809@goop.org>
       Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  9. 13 Feb 2009 — 1 commit
  10. 12 Feb 2009 — 2 commits
  11. 10 Feb 2009 — 1 commit
  12. 04 Feb 2009 — 1 commit
  13. 03 Feb 2009 — 1 commit
  14. 31 Jan 2009 — 6 commits
    • x86/paravirt: fix missing callee-save call on pud_val · 4767afbf
      Committed by Jeremy Fitzhardinge
      Impact: Fix build when CONFIG_PARAVIRT_DEBUG is enabled
      
       Fix a missed conversion to callee-saved calls for pud_val, which
       causes a compile error when CONFIG_PARAVIRT_DEBUG is enabled.
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
       Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86/paravirt: use callee-saved convention for pte_val/make_pte/etc · da5de7c2
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      In the native case, pte_val, make_pte, etc are all just identity
      functions, so there's no need to clobber a lot of registers over them.
      
      (This changes the 32-bit callee-save calling convention to return both
      EAX and EDX so functions can return 64-bit values.)
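       At the ops-table level the native side then looks roughly like this
       (a sketch; PV_CALLEE_SAVE comes from the thunk machinery introduced
       below in this series):

           /* native pte_val really is the identity function */
           static pteval_t native_pte_val(pte_t pte)
           {
           	return pte.pte;
           }

           /* installed wrapped, so callers may assume the callee-save
            * convention */
           .pte_val = PV_CALLEE_SAVE(native_pte_val),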
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
       Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/paravirt: implement PVOP_CALL macros for callee-save functions · 791bad9d
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      Functions with the callee save calling convention clobber many fewer
      registers than the normal C calling convention.  Implement variants of
      PVOP_V?CALL* accordingly.  This only bothers with functions up to 3
      args, since functions with more args may as well use the normal
      calling convention.
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
       Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/paravirt: add register-saving thunks to reduce caller register pressure · ecb93d1c
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      One of the problems with inserting a pile of C calls where previously
      there were none is that the register pressure is greatly increased.
      The C calling convention says that the caller must expect a certain
      set of registers may be trashed by the callee, and that the callee can
      use those registers without restriction.  This includes the function
      argument registers, and several others.
      
      This patch seeks to alleviate this pressure by introducing wrapper
      thunks that will do the register saving/restoring, so that the
      callsite doesn't need to worry about it, but the callee function can
      be conventional compiler-generated code.  In many cases (particularly
      performance-sensitive cases) the callee will be in assembler anyway,
      and need not use the compiler's calling convention.
      
      Standard calling convention is:
      	 arguments	    return	scratch
      x86-32	 eax edx ecx	    eax		?
      x86-64	 rdi rsi rdx rcx    rax		r8 r9 r10 r11
      
      The thunk preserves all argument and scratch registers.  The return
      register is not preserved, and is available as a scratch register for
      unwrapped callee code (and of course the return value).
      
      Wrapped function pointers are themselves wrapped in a struct
      paravirt_callee_save structure, in order to get some warning from the
      compiler when functions with mismatched calling conventions are used.
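       A trimmed sketch of that machinery (close to the real code, with
       details elided):

           struct paravirt_callee_save {
           	void *func;
           };

           /* wrap a pointer so convention mismatches at least warn */
           #define PV_CALLEE_SAVE(func)					\
           	((struct paravirt_callee_save) { __raw_callee_save_##func })

           /* 32-bit thunk: preserve the C-clobbered registers (eax is the
            * return register and deliberately not saved) around an
            * ordinary compiler-generated callee */
           #define PV_CALLEE_SAVE_REGS_THUNK(func)				\
           	extern typeof(func) __raw_callee_save_##func;		\
           	asm(".pushsection .text;"				\
           	    "__raw_callee_save_" #func ": "			\
           	    "push %ecx; push %edx;"				\
           	    "call " #func ";"					\
           	    "pop %edx; pop %ecx;"				\
           	    "ret;"						\
           	    ".popsection")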
      
      The most common paravirt ops, both statically and dynamically, are
      interrupt enable/disable/save/restore, so handle them first.  This is
      particularly easy since their calls are handled specially anyway.
      
      XXX Deal with VMI.  What's their calling convention?
       Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/paravirt: selectively save/restore regs around pvops calls · 9104a18d
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      Each asm paravirt-ops call says what registers are available for
      clobbering.  This patch makes use of this to selectively save/restore
      registers around each pvops call.  In many cases this significantly
      shrinks code size.
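       The available-register sets are encoded as small clobber masks; on
       32-bit they are roughly (a sketch of the defines):

           #define CLBR_NONE	0x0
           #define CLBR_EAX	0x1
           #define CLBR_ECX	0x2
           #define CLBR_EDX	0x4
           #define CLBR_ANY	0x7

       A call site declared with CLBR_EAX, for instance, only needs to
       save/restore ECX and EDX around the call instead of all three.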
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
       Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/pvops: add paravirt_ident functions to allow special patching · 41edafdb
      Committed by Jeremy Fitzhardinge
      Impact: Optimization
      
      Several paravirt ops implementations simply return their arguments,
      the most obvious being the make_pte/pte_val class of operations on
      native.
      
      On 32-bit, the identity function is literally a no-op, as the calling
      convention uses the same registers for the first argument and return.
      On 64-bit, it can be implemented with a single "mov".
      
       This patch adds special identity functions for 32- and 64-bit
       arguments, and machinery to recognize them and replace them with
       either nops or a mov as appropriate.
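       A sketch of the identity functions and the patch-time recognition
       (simplified from the real patching code):

           u32 _paravirt_ident_32(u32 x) { return x; }
           u64 _paravirt_ident_64(u64 x) { return x; }

           /* in the default patcher: spot the identity ops and inline
            * them instead of emitting a call */
           if (opfunc == (void *)_paravirt_ident_32)
           	ret = paravirt_patch_ident_32(insnbuf, len);
           else if (opfunc == (void *)_paravirt_ident_64)
           	ret = paravirt_patch_ident_64(insnbuf, len);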
      
      At the moment, the only users for the identity functions are the
      pagetable entry conversion functions.
      
       The result is a measurable improvement on pagetable-heavy benchmarks
       (2-3%, reducing the pvops overhead from 5% to 2%).
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
       Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  15. 23 Jan 2009 — 1 commit
  16. 21 Jan 2009 — 1 commit
    • x86: remove byte locks · afb33f8c
      Committed by Jiri Kosina
      Impact: cleanup
      
      Remove byte locks implementation, which was introduced by Jeremy in
      8efcbab6 ("paravirt: introduce a "lock-byte" spinlock implementation"),
      but turned out to be dead code that is not used by any in-kernel
       virtualization guest (Xen uses its own variant of the spinlock
       implementation and KVM is not planning to move to byte locks).
       Signed-off-by: Jiri Kosina <jkosina@suse.cz>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 12 Jan 2009 — 1 commit
    • x86: change flush_tlb_others to take a const struct cpumask · 4595f962
      Committed by Rusty Russell
      Impact: reduce stack usage, use new cpumask API.
      
      This is made a little more tricky by uv_flush_tlb_others which
      actually alters its argument, for an IPI to be sent to the remaining
      cpus in the mask.
      
      I solve this by allocating a cpumask_var_t for this case and falling back
      to IPI should this fail.
      
      To eliminate temporaries in the caller, all flush_tlb_others implementations
      now do the this-cpu-elimination step themselves.
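       The uv path then follows this pattern (a hedged sketch with an
       invented local name; the cpumask calls are the real API):

           cpumask_var_t flush_mask;	/* mutable copy of the const arg */

           if (!alloc_cpumask_var(&flush_mask, GFP_ATOMIC))
           	return cpumask;		/* caller falls back to an IPI */

           /* the this-cpu-elimination step, now done by the callee */
           cpumask_andnot(flush_mask, cpumask,
           	       cpumask_of(smp_processor_id()));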
      
      Note also the curious "cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask)"
      which has been there since pre-git and yet f->flush_cpumask is always zero
      at this point.
       Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
       Signed-off-by: Mike Travis <travis@sgi.com>
  18. 23 Oct 2008 — 2 commits
  19. 22 Aug 2008 — 2 commits
  20. 20 Aug 2008 — 1 commit
  21. 24 Jul 2008 — 1 commit
  22. 23 Jul 2008 — 1 commit
    • x86: consolidate header guards · 77ef50a5
      Committed by Vegard Nossum
      This patch is the result of an automatic script that consolidates the
      format of all the headers in include/asm-x86/.
      
      The format:
      
      1. No leading underscore. Names with leading underscores are reserved.
      2. Pathname components are separated by two underscores. So we can
         distinguish between mm_types.h and mm/types.h.
      3. Everything except letters and numbers are turned into single
         underscores.
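       
       For example, under these rules include/asm-x86/paravirt.h ends up
       guarded as:
       
           #ifndef ASM_X86__PARAVIRT_H
           #define ASM_X86__PARAVIRT_H
           /* ... header body ... */
           #endif /* ASM_X86__PARAVIRT_H */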
       Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
  23. 22 Jul 2008 — 2 commits
    • x86: rename PTE_MASK to PTE_PFN_MASK · 59438c9f
      Committed by Jeremy Fitzhardinge
      Rusty, in his peevish way, complained that macros defining constants
      should have a name which somewhat accurately reflects the actual
      purpose of the constant.
      
      Aside from the fact that PTE_MASK gives no clue as to what's actually
      being masked, and is misleadingly similar to the functionally entirely
      different PMD_MASK, PUD_MASK and PGD_MASK, I don't really see what the
      problem is.
      
       But if this patch silences the incessant noise, then it will have
       achieved its goal (TODO: write test-case).
       Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
       Cc: Rusty Russell <rusty@rustcorp.com.au>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: fix pte_flags() to only return flags, fix lguest (updated) · c2e3277f
      Committed by Rusty Russell
      (Jeremy said:
      	rusty: use PTE_MASK
      	rusty: use PTE_MASK
      	rusty: use PTE_MASK
       When I asked:
      	jsgf: does that include the NX flag?
       He responded eloquently:
      	rusty: use PTE_MASK
      	rusty: use PTE_MASK
      	yes, it's the official constant of masking flags out of ptes
      )
      
      Change a15af1c9 'x86/paravirt: add
      pte_flags to just get pte flags' removed lguest's private pte_flags()
      in favor of a generic one.
      
      Unfortunately, the generic one doesn't filter out the non-flags bits:
      this results in lguest creating corrupt shadow page tables and blowing
      up host memory.
      
       Since no one is supposed to use the pfn part of pte_flags(), it seems
       safest to always do the filtering.
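       A sketch of the resulting filter (PTE_MASK here is still the pfn
       mask; see its rename to PTE_PFN_MASK in the commit above):

           static inline pteval_t pte_flags(pte_t pte)
           {
           	/* drop the pfn bits; keep only the flag bits */
           	return pte_val(pte) & ~PTE_MASK;
           }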
       Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
       Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
       Signed-off-and-morning-tea-spilled-by: Ingo Molnar <mingo@elte.hu>
  24. 18 Jul 2008 — 2 commits
    • x86: suppress sparse returning void warnings · 32172561
      Committed by Harvey Harrison
      include/asm/paravirt.h:1404:2: warning: returning void-valued expression
      include/asm/paravirt.h:1414:2: warning: returning void-valued expression
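       The warning fires on the shorthand of returning a void expression
       from a void function; the fix is simply to drop the "return" (a
       generic illustration with hypothetical names, not the verbatim
       paravirt.h lines):

           void do_flush_lazy_mode(void);	/* hypothetical helper */

           /* before: sparse warns about the void-valued expression */
           static inline void flush_lazy_mode(void)
           {
           	return do_flush_lazy_mode();
           }

           /* after */
           static inline void flush_lazy_mode(void)
           {
           	do_flush_lazy_mode();
           }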
       Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: APIC: remove apic_write_around(); use alternatives · 593f4a78
      Committed by Maciej W. Rozycki
      Use alternatives to select the workaround for the 11AP Pentium erratum
      for the affected steppings on the fly rather than build time.  Remove the
      X86_GOOD_APIC configuration option and replace all the calls to
      apic_write_around() with plain apic_write(), protecting accesses to the
      ESR as appropriate due to the 3AP Pentium erratum.  Remove
      apic_read_around() and all its invocations altogether as not needed.
      Remove apic_write_atomic() and all its implementing backends.  The use of
      ASM_OUTPUT2() is not strictly needed for input constraints, but I have
      used it for readability's sake.
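       The heart of the change (a lightly trimmed sketch of the new write
       path, matching the description above):

           static inline void native_apic_mem_write(u32 reg, u32 v)
           {
           	volatile u32 *addr = (volatile u32 *)(APIC_BASE + reg);

           	/* plain store normally; xchg (which is implicitly
           	 * locked) on 11AP-affected steppings, chosen at boot */
           	alternative_io("movl %0, %1", "xchgl %0, %1",
           		       X86_FEATURE_11AP,
           		       ASM_OUTPUT2("=r" (v), "=m" (*addr)),
           		       ASM_OUTPUT2("0" (v), "m" (*addr)));
           }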
      
      I had the feeling no one else was brave enough to do it, so I went ahead
      and here it is.  Verified by checking the generated assembly and tested
      with both a 32-bit and a 64-bit configuration, also with the 11AP
      "feature" forced on and verified with gdb on /proc/kcore to work as
      expected (as an 11AP machines are quite hard to get hands on these days).
      Some script complained about the use of "volatile", but apic_write() needs
      it for the same reason and is effectively a replacement for writel(), so I
      have disregarded it.
      
      I am not sure what the policy wrt defconfig files is, they are generated
      and there is risk of a conflict resulting from an unrelated change, so I
      have left changes to them out.  The option will get removed from them at
      the next run.
      
      Some testing with machines other than mine will be needed to avoid some
      stupid mistake, but despite its volume, the change is not really that
       intrusive, so I am fairly confident that because it works for me, it
       will work everywhere.
       Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
  25. 16 Jul 2008 — 4 commits
    • x86: paravirt spinlocks, !CONFIG_SMP build fixes · 4bb689ee
      Committed by Ingo Molnar
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • paravirt: introduce a "lock-byte" spinlock implementation · 8efcbab6
      Committed by Jeremy Fitzhardinge
      Implement a version of the old spinlock algorithm, in which everyone
      spins waiting for a lock byte.  In order to be compatible with the
      ticket-lock's use of a zero initializer, this uses the convention of
      '0' for unlocked and '1' for locked.
      
       This algorithm is much better than ticket locks in a virtual
       environment, because it doesn't interact badly with the vcpu scheduler.
      If there are multiple vcpus spinning on a lock and the lock is
      released, the next vcpu to be scheduled will take the lock, rather
      than cycling around until the next ticketed vcpu gets it.
      
      To use this, you must call paravirt_use_bytelocks() very early, before
      any spinlocks have been taken.
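       The lock/unlock pair is a classic test-and-set byte lock, roughly
       (a hedged sketch; the real code wraps the byte in its own struct):

           static void byte_spin_lock(raw_spinlock_t *lock)
           {
           	u8 *byte = (u8 *)lock;	/* '0' free, '1' held */

           	while (1) {
           		if (xchg(byte, 1) == 0)	/* atomic acquire */
           			return;
           		while (*byte)		/* spin read-only */
           			cpu_relax();
           	}
           }

           static void byte_spin_unlock(raw_spinlock_t *lock)
           {
           	smp_wmb();	/* order the critical section first */
           	*(u8 *)lock = 0;
           }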
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <clameter@linux-foundation.org>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Virtualization <virtualization@lists.linux-foundation.org>
      Cc: Xen devel <xen-devel@lists.xensource.com>
      Cc: Thomas Friebel <thomas.friebel@amd.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86/paravirt: add hooks for spinlock operations · 74d4affd
      Committed by Jeremy Fitzhardinge
      Ticket spinlocks have absolutely ghastly worst-case performance
      characteristics in a virtual environment.  If there is any contention
      for physical CPUs (ie, there are more runnable vcpus than cpus), then
      ticket locks can cause the system to end up spending 90+% of its time
      spinning.
      
       The problem is that (v)cpus waiting on a ticket spinlock will be
       granted the lock in the strict order they got their tickets.  If
      the hypervisor scheduler doesn't give the vcpus time in that order,
      they will burn timeslices waiting for the scheduler to give the right
      vcpu some time.  In the worst case it could take O(n^2) vcpu scheduler
      timeslices for everyone waiting on the lock to get it, not counting
      new cpus trying to take the lock while the log-jam is sorted out.
      
      These hooks allow a paravirt backend to replace the spinlock
      implementation.
      
       At the very least, this could revert the implementation back to the
       old lock algorithm, which allows the next scheduled vcpu to take the
       lock, and has fairly good performance.
      
       It also allows the spinlocks to take advantage of hypervisor
       features to make locks more efficient (spin and block, for example).
      
      The cost to native execution is an extra direct call when using a
      spinlock function.  There's no overhead if CONFIG_PARAVIRT is turned
      off.
      
      The lock structure is fixed at a single "unsigned int", initialized to
      zero, but the spinlock implementation can use it as it wishes.
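       The hook table added here (as in the commit, with the raw_spinlock
       typedef spelled out):

           struct pv_lock_ops {
           	int (*spin_is_locked)(struct raw_spinlock *lock);
           	int (*spin_is_contended)(struct raw_spinlock *lock);
           	void (*spin_lock)(struct raw_spinlock *lock);
           	int (*spin_trylock)(struct raw_spinlock *lock);
           	void (*spin_unlock)(struct raw_spinlock *lock);
           };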
      
      Thanks to Thomas Friebel's Xen Summit talk "Preventing Guests from
      Spinning Around" for pointing out this problem.
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <clameter@linux-foundation.org>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Virtualization <virtualization@lists.linux-foundation.org>
      Cc: Xen devel <xen-devel@lists.xensource.com>
      Cc: Thomas Friebel <thomas.friebel@amd.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • xen64: save lots of registers · c24481e9
      Committed by Jeremy Fitzhardinge
      The Xen hypercall interface is allowed to trash any or all of the
      argument registers, so we need to be careful that the kernel state
       isn't damaged.  On 32-bit kernels, the hypercall parameter registers
       are the same as for a regparm function call, so we've got away without
       explicit clobbering so far.  The 64-bit ABI defines lots of caller-save
      registers, so save them all for safety.  We can trim this set later by
      re-distributing the responsibility for saving all these registers.
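       As a sketch, the 64-bit save/restore bracket around a hypercall
       looks roughly like this (macro names invented here; rax is the
       return register and is handled separately):

           #define HCALL_SAVE_REGS						\
           	"push %rdi; push %rsi; push %rdx; push %rcx;"		\
           	"push %r8; push %r9; push %r10; push %r11;"
           #define HCALL_RESTORE_REGS					\
           	"pop %r11; pop %r10; pop %r9; pop %r8;"			\
           	"pop %rcx; pop %rdx; pop %rsi; pop %rdi;"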
       Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
       Signed-off-by: Ingo Molnar <mingo@elte.hu>