1. 29 5月, 2009 5 次提交
  2. 16 5月, 2009 1 次提交
    • J
      x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Jeremy Fitzhardinge 提交于
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides arel 1) it will replicate the
      preempt_disable/enable code at eack lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still it is a
      reasonable short-term workaround.
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: N"Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: N"Li Xin" <xin.li@intel.com>
      Analysed-by: N"Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b4ecc126
  3. 22 4月, 2009 1 次提交
  4. 17 4月, 2009 1 次提交
    • Y
      x86/irq: mark NUMA_MIGRATE_IRQ_DESC broken · ca713c2a
      Yinghai Lu 提交于
      It causes crash on system with lots of cards with MSI-X
      when irq_balancer enabled...
      
      The patches fixing it were both complex and fragile, according
      to Eric they were also doing quite dangerous things to the
      hardware.
      
      Instead we now have patches that solve this problem via static
      NUMA node mappings - not dynamic allocation and balancing.
      
      The patches are much simpler than this method but are still too
      large outside of the merge window, so we mark the dynamic balancer
      as broken for now, and queue up the new approach for v2.6.31.
      
      [ Impact: deactivate broken kernel feature ]
      Reported-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      LKML-Reference: <49E68C41.4020801@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ca713c2a
  5. 08 4月, 2009 1 次提交
  6. 07 4月, 2009 1 次提交
    • D
      x86, intel-iommu: fix X2APIC && !ACPI build failure · f7d7f866
      David Woodhouse 提交于
      This build failure:
      
      | drivers/pci/dmar.c:47: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘dmar_tbl_size’
      | drivers/pci/dmar.c:62: warning: ‘struct acpi_dmar_device_scope’ declared inside parameter list
      | drivers/pci/dmar.c:62: warning: its scope is only this definition or declaration, which is probably not what you want
      
      Triggers due to this commit:
      
        d0b03bd1: x2apic/intr-remap: decouple interrupt remapping from x2apic
      
      Which exposed a pre-existing but dormant fragility of the 'select X86_X2APIC'
      it moved around and turned that fragility into a build failure.
      
      Replace it with a proper 'depends on' construct.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      LKML-Reference: <1239084280.22733.404.camel@macbook.infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f7d7f866
  7. 04 4月, 2009 1 次提交
  8. 01 4月, 2009 1 次提交
  9. 30 3月, 2009 1 次提交
  10. 26 3月, 2009 2 次提交
  11. 17 3月, 2009 1 次提交
  12. 13 3月, 2009 3 次提交
  13. 11 3月, 2009 2 次提交
  14. 27 2月, 2009 1 次提交
  15. 25 2月, 2009 1 次提交
    • A
      x86, mce, cmci: factor out threshold interrupt handler · b2762686
      Andi Kleen 提交于
      Impact: cleanup; preparation for feature
      
      The mce_amd_64 code has an own private MC threshold vector with an own
      interrupt handler. Since Intel needs a similar handler
      it makes sense to share the vector because both can not
      be active at the same time.
      
      I factored the common APIC handler code into a separate file which can
      be used by both the Intel or AMD MC code.
      
      This is needed for the next patch which adds an Intel specific
      CMCI handler.
      
      This patch should be a nop for AMD, it just moves some code
      around.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      b2762686
  16. 24 2月, 2009 1 次提交
    • T
      bootmem: clean up arch-specific bootmem wrapping · c1329375
      Tejun Heo 提交于
      Impact: cleaner and consistent bootmem wrapping
      
      By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define
      arch-specific wrappers for bootmem allocation.  However, this is done
      a bit strangely in that only the high level convenience macros can be
      changed while lower level, but still exported, interface functions
      can't be wrapped.  This not only is messy but also leads to strange
      situation where alloc_bootmem() does what the arch wants it to do but
      the equivalent __alloc_bootmem() call doesn't although they should be
      able to be used interchangeably.
      
      This patch updates bootmem such that archs can override / wrap the
      backend function - alloc_bootmem_core() instead of the highlevel
      interface functions to allow simpler and consistent wrapping.  Also,
      HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      c1329375
  17. 23 2月, 2009 2 次提交
    • I
      x86: remove the Voyager 32-bit subarch · 965c7eca
      Ingo Molnar 提交于
      Impact: remove unused/broken code
      
      The Voyager subarch last built successfully on the v2.6.26 kernel
      and has been stale since then and does not build on the v2.6.27,
      v2.6.28 and v2.6.29-rc5 kernels.
      
      No actual users beyond the maintainer reported this breakage.
      Patches were sent and most of the fixes were accepted but the
      discussion around how to do a few remaining issues cleanly
      fizzled out with no resolution and the code remained broken.
      
      In the v2.6.30 x86 tree development cycle 32-bit subarch support
      has been reworked and removed - and the Voyager code, beyond the
      build problems already known, needs serious and significant
      changes and probably a rewrite to support it.
      
      CONFIG_X86_VOYAGER has been marked BROKEN then. The maintainer has
      been notified but no patches have been sent so far to fix it.
      
      While all other subarchs have been converted to the new scheme,
      voyager is still broken. We'd prefer to receive patches which
      clean up the current situation in a constructive way, but even in
      case of removal there is no obstacle to add that support back
      after the issues have been sorted out in a mutually acceptable
      fashion.
      
      So remove this inactive code for now.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      965c7eca
    • R
      x86: improve the help text of X86_EXTENDED_PLATFORM · 8425091f
      Ravikiran G Thirumalai 提交于
      Change the CONFIG_X86_EXTENDED_PLATFORM help text to display the
      32bit/64bit extended platform list. This is as suggested by Ingo.
      Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org>
      Cc: shai@scalex86.org
      Cc: "Benzi Galili (Benzi@ScaleMP.com)" <benzi@scalemp.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8425091f
  18. 20 2月, 2009 1 次提交
    • T
      x86: convert to the new dynamic percpu allocator · 11124411
      Tejun Heo 提交于
      Impact: use new dynamic allocator, unified access to static/dynamic
              percpu memory
      
      Convert to the new dynamic percpu allocator.
      
      * implement populate_extra_pte() for both 32 and 64
      * update setup_per_cpu_areas() to use pcpu_setup_static()
      * define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr()
      * define config HAVE_DYNAMIC_PER_CPU_AREA
      Signed-off-by: NTejun Heo <tj@kernel.org>
      11124411
  19. 17 2月, 2009 2 次提交
  20. 12 2月, 2009 1 次提交
  21. 11 2月, 2009 1 次提交
  22. 10 2月, 2009 2 次提交
    • T
      x86: implement x86_32 stack protector · 60a5317f
      Tejun Heo 提交于
      Impact: stack protector for x86_32
      
      Implement stack protector for x86_32.  GDT entry 28 is used for it.
      It's set to point to stack_canary-20 and have the length of 24 bytes.
      CONFIG_CC_STACKPROTECTOR turns off CONFIG_X86_32_LAZY_GS and sets %gs
      to the stack canary segment on entry.  As %gs is otherwise unused by
      the kernel, the canary can be anywhere.  It's defined as a percpu
      variable.
      
      x86_32 exception handlers take register frame on stack directly as
      struct pt_regs.  With -fstack-protector turned on, gcc copies the
      whole structure after the stack canary and (of course) doesn't copy
      back on return thus losing all changed.  For now, -fno-stack-protector
      is added to all files which contain those functions.  We definitely
      need something better.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      60a5317f
    • T
      x86: make lazy %gs optional on x86_32 · ccbeed3a
      Tejun Heo 提交于
      Impact: pt_regs changed, lazy gs handling made optional, add slight
              overhead to SAVE_ALL, simplifies error_code path a bit
      
      On x86_32, %gs hasn't been used by kernel and handled lazily.  pt_regs
      doesn't have place for it and gs is saved/loaded only when necessary.
      In preparation for stack protector support, this patch makes lazy %gs
      handling optional by doing the followings.
      
      * Add CONFIG_X86_32_LAZY_GS and place for gs in pt_regs.
      
      * Save and restore %gs along with other registers in entry_32.S unless
        LAZY_GS.  Note that this unfortunately adds "pushl $0" on SAVE_ALL
        even when LAZY_GS.  However, it adds no overhead to common exit path
        and simplifies entry path with error code.
      
      * Define different user_gs accessors depending on LAZY_GS and add
        lazy_save_gs() and lazy_load_gs() which are noop if !LAZY_GS.  The
        lazy_*_gs() ops are used to save, load and clear %gs lazily.
      
      * Define ELF_CORE_COPY_KERNEL_REGS() which always read %gs directly.
      
      xen and lguest changes need to be verified.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ccbeed3a
  23. 08 2月, 2009 2 次提交
    • S
      ftrace: change function graph tracer to use new in_nmi · 9a5fd902
      Steven Rostedt 提交于
      The function graph tracer piggy backed onto the dynamic ftracer
      to use the in_nmi custom code for dynamic tracing. The problem
      was (as Andrew Morton pointed out) it really only wanted to bail
      out if the context of the current CPU was in NMI context. But the
      dynamic ftrace in_nmi custom code was true if _any_ CPU happened
      to be in NMI context.
      
      Now that we have a generic in_nmi interface, this patch changes
      the function graph code to use it instead of the dynamic ftarce
      custom code.
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      9a5fd902
    • S
      ring-buffer: add NMI protection for spinlocks · 78d904b4
      Steven Rostedt 提交于
      Impact: prevent deadlock in NMI
      
      The ring buffers are not yet totally lockless with writing to
      the buffer. When a writer crosses a page, it grabs a per cpu spinlock
      to protect against a reader. The spinlocks taken by a writer are not
      to protect against other writers, since a writer can only write to
      its own per cpu buffer. The spinlocks protect against readers that
      can touch any cpu buffer. The writers are made to be reentrant
      with the spinlocks disabling interrupts.
      
      The problem arises when an NMI writes to the buffer, and that write
      crosses a page boundary. If it grabs a spinlock, it can be racing
      with another writer (since disabling interrupts does not protect
      against NMIs) or with a reader on the same CPU. Luckily, most of the
      users are not reentrant and protects against this issue. But if a
      user of the ring buffer becomes reentrant (which is what the ring
      buffers do allow), if the NMI also writes to the ring buffer then
      we risk the chance of a deadlock.
      
      This patch moves the ftrace_nmi_enter called by nmi_enter() to the
      ring buffer code. It replaces the current ftrace_nmi_enter that is
      used by arch specific code to arch_ftrace_nmi_enter and updates
      the Kconfig to handle it.
      
      When an NMI is called, it will set a per cpu variable in the ring buffer
      code and will clear it when the NMI exits. If a write to the ring buffer
      crosses page boundaries inside an NMI, a trylock is used on the spin
      lock instead. If the spinlock fails to be acquired, then the entry
      is discarded.
      
      This bug appeared in the ftrace work in the RT tree, where event tracing
      is reentrant. This workaround solved the deadlocks that appeared there.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      78d904b4
  24. 06 2月, 2009 1 次提交
  25. 05 2月, 2009 1 次提交
  26. 30 1月, 2009 2 次提交
  27. 29 1月, 2009 1 次提交