1. 15 June 2009, 1 commit
  2. 29 May 2009, 5 commits
  3. 16 May 2009, 1 commit
    • x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Committed by Jeremy Fitzhardinge
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
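      
      At the source level, the shape that produces this instruction stream
      is roughly the following (a simplified C sketch modelled on the
      kernel's pv_lock_ops; illustrative, not the actual source):
      
      	/* One table of ops per subsystem; only the lock op is shown. */
      	struct pv_lock_ops {
      		void (*spin_lock)(unsigned int *lock);
      	};
      
      	/* Set up at boot; natively it points at __ticket_spin_lock. */
      	extern struct pv_lock_ops pv_lock_ops;
      
      	static inline void __raw_spin_lock(unsigned int *lock)
      	{
      		pv_lock_ops.spin_lock(lock);	/* the callq above */
      	}
      
      	void _spin_lock(unsigned int *lock)
      	{
      		/* outer call level; the real one also disables preemption */
      		__raw_spin_lock(lock);
      	}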
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides are: 1) it will replicate the
      preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still, it is a
      reasonable short-term workaround.
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: "Li Xin" <xin.li@intel.com>
      Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 12 May 2009, 2 commits
    • x86: make CONFIG_RELOCATABLE the default · 26717808
      Committed by H. Peter Anvin
      Remove the EXPERIMENTAL tag from CONFIG_RELOCATABLE and make it the
      default.  Relocatable kernels have been used for a while now, and
      should now have identical semantics to non-relocatable kernels when
      loaded by a non-relocating bootloader.
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86: default CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN to 16 MB · ceefccc9
      Committed by H. Peter Anvin
      Default CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN each to 16 MB,
      so that both non-relocatable and relocatable kernels are loaded at
      16 MB by a non-relocating bootloader.  This is somewhat hacky, but it
      appears to be the only way to do this that does not break some set
      of existing bootloaders.
      
      We want to avoid the bottom 16 MB because of large page breakup,
      memory holes, and ZONE_DMA.  Embedded systems may need to reduce this,
      or update their bootloaders to be aware of the new min_alignment field.
      
      [ Impact: performance improvement, avoids problems on some systems ]
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  5. 09 May 2009, 1 commit
  6. 02 May 2009, 1 commit
    • x86/irq: use move_irq_desc() in create_irq_nr() · 15e957d0
      Committed by Yinghai Lu
      In create_irq_nr(), move_irq_desc() will try to move the irq_desc to
      the home node if the one that was allocated is not on the correct node.
      
      ( This can happen with MSI-capable devices that sit on different
        nodes, as drivers are loaded and unloaded randomly. )
      
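      A rough sketch of the resulting pattern (simplified, with a
      hypothetical function name and error handling; the real code in
      create_irq_nr() differs in detail):
      
      	static int create_irq_on_node(unsigned int irq, int node)
      	{
      		struct irq_desc *desc = irq_to_desc_alloc_node(irq, node);
      
      		if (!desc)
      			return -ENOMEM;
      
      		/*
      		 * The desc may have been allocated on another node by an
      		 * earlier driver load; migrate it so its data stays local
      		 * to the device's home node.
      		 */
      		desc = move_irq_desc(desc, node);
      		return desc ? irq : -ENOMEM;
      	}
      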
      v2: fix non-smp build
      v3: add NUMA_IRQ_DESC to eliminate #ifdefs
      
      [ Impact: improve irq descriptor locality on NUMA systems ]
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      LKML-Reference: <49F95EAE.2050903@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 28 April 2009, 1 commit
    • x86/irq: remove leftover code from NUMA_MIGRATE_IRQ_DESC · fcef5911
      Committed by Yinghai Lu
      The original feature of migrating irq_desc dynamically was too fragile
      and was causing problems: it caused crashes on systems with lots of
      cards with MSI-X when user-space irq-balancer was enabled.
      
      We now have new patches that create the irq_desc according to the
      device's numa node. This patch removes the leftover bits of the
      dynamic balancer.
      
      [ Impact: remove dead code ]
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      LKML-Reference: <49F654AF.8000808@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 27 April 2009, 1 commit
    • x86: unify arch/x86/boot/compressed/vmlinux_*.lds · 51b26ada
      Committed by Linus Torvalds
      Look at the:
      
      	diff -u arch/x86/boot/compressed/vmlinux_*.lds
      
      output and realize that they're basically exactly the same except for
      trivial naming differences, and the fact that the 64-bit version has a
      "pgtable" thing.
      
      So unify them.
      
      There's some trivial cleanup there (make the output format a Kconfig thing
      rather than doing #ifdef's for it, and unify both 32-bit and 64-bit BSS
      end to "_ebss", where 32-bit used to use the traditional "_end"), but
      other than that it's really very mindless and straight conversion.
      
      For example, I think we should aim to remove "startup_32" vs "startup_64",
      and just call it "startup", and get rid of one more difference. I didn't
      do that.
      
      Also, notice the comment in the unified vmlinux.lds.S talks about
      "head_64" and "startup_32" which is an odd and incorrect mix, but that was
      actually what the old 64-bit only lds file had, so the confusion isn't
      new, and now that mixing is arguably more accurate thanks to the
      vmlinux.lds.S file being shared between the two cases ;)
      
      [ Impact: cleanup, unification ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 22 April 2009, 1 commit
  10. 21 April 2009, 1 commit
  11. 17 April 2009, 1 commit
    • x86/irq: mark NUMA_MIGRATE_IRQ_DESC broken · ca713c2a
      Committed by Yinghai Lu
      It causes crashes on systems with lots of MSI-X cards when the
      user-space irq balancer is enabled...
      
      The patches fixing it were both complex and fragile, according
      to Eric they were also doing quite dangerous things to the
      hardware.
      
      Instead we now have patches that solve this problem via static
      NUMA node mappings - not dynamic allocation and balancing.
      
      The patches are much simpler than this method, but are still too
      large to apply outside of the merge window, so we mark the dynamic
      balancer as broken for now, and queue up the new approach for v2.6.31.
      
      [ Impact: deactivate broken kernel feature ]
      Reported-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      LKML-Reference: <49E68C41.4020801@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 08 April 2009, 1 commit
  13. 07 April 2009, 1 commit
    • x86, intel-iommu: fix X2APIC && !ACPI build failure · f7d7f866
      Committed by David Woodhouse
      This build failure:
      
      | drivers/pci/dmar.c:47: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘dmar_tbl_size’
      | drivers/pci/dmar.c:62: warning: ‘struct acpi_dmar_device_scope’ declared inside parameter list
      | drivers/pci/dmar.c:62: warning: its scope is only this definition or declaration, which is probably not what you want
      
      It is triggered by this commit:
      
        d0b03bd1: x2apic/intr-remap: decouple interrupt remapping from x2apic
      
      which exposed a pre-existing but dormant fragility of the
      'select X86_X2APIC' statement it moved around, and turned that
      fragility into a build failure.
      
      Replace it with a proper 'depends on' construct.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      LKML-Reference: <1239084280.22733.404.camel@macbook.infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 04 April 2009, 1 commit
  15. 01 April 2009, 1 commit
  16. 30 March 2009, 1 commit
  17. 26 March 2009, 2 commits
  18. 17 March 2009, 1 commit
  19. 13 March 2009, 3 commits
  20. 11 March 2009, 2 commits
  21. 27 February 2009, 1 commit
  22. 25 February 2009, 1 commit
    • x86, mce, cmci: factor out threshold interrupt handler · b2762686
      Committed by Andi Kleen
      Impact: cleanup; preparation for feature
      
      The mce_amd_64 code has an own private MC threshold vector with an own
      interrupt handler. Since Intel needs a similar handler
      it makes sense to share the vector because both can not
      be active at the same time.
      
      I factored the common APIC handler code into a separate file which can
      be used by both the Intel and AMD MC code.
      
      This is needed for the next patch which adds an Intel specific
      CMCI handler.
      
      This patch should be a nop for AMD; it just moves some code
      around.
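      
      A hedged sketch of the shared dispatch this sets up (names modelled
      on the patch description, simplified; not the exact kernel code):
      
      	static void default_threshold_interrupt(void)
      	{
      		printk(KERN_ERR "Unexpected threshold interrupt\n");
      	}
      
      	/* AMD or Intel MC init code repoints this at its own handler. */
      	void (*mce_threshold_vector)(void) = default_threshold_interrupt;
      
      	asmlinkage void mce_threshold_interrupt(void)
      	{
      		irq_enter();
      		mce_threshold_vector();	/* whichever vendor is active */
      		irq_exit();
      		ack_APIC_irq();		/* ack the APIC interrupt last */
      	}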
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  23. 24 February 2009, 1 commit
    • bootmem: clean up arch-specific bootmem wrapping · c1329375
      Committed by Tejun Heo
      Impact: cleaner and consistent bootmem wrapping
      
      By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define
      arch-specific wrappers for bootmem allocation.  However, this is done
      a bit strangely in that only the high level convenience macros can be
      changed while lower level, but still exported, interface functions
      can't be wrapped.  This is not only messy but also leads to the
      strange situation where alloc_bootmem() does what the arch wants it
      to do while the equivalent __alloc_bootmem() call doesn't, although
      they should be usable interchangeably.
      
      This patch updates bootmem such that archs can override / wrap the
      backend function - alloc_bootmem_core() instead of the highlevel
      interface functions to allow simpler and consistent wrapping.  Also,
      HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM.
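      
      A simplified sketch of the layering this establishes (hypothetical
      prototypes; the real interfaces in mm/bootmem.c take more
      parameters):
      
      	/* Backend: the single point an arch may now override / wrap. */
      	void *alloc_bootmem_core(unsigned long size, unsigned long align);
      
      	/* Every exported interface funnels into the backend ... */
      	void *__alloc_bootmem(unsigned long size, unsigned long align)
      	{
      		return alloc_bootmem_core(size, align);
      	}
      
      	/* ... so the macro and the low-level call behave consistently. */
      	#define alloc_bootmem(size) __alloc_bootmem(size, SMP_CACHE_BYTES)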
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@saeurebad.de>
  24. 23 February 2009, 2 commits
    • x86: remove the Voyager 32-bit subarch · 965c7eca
      Committed by Ingo Molnar
      Impact: remove unused/broken code
      
      The Voyager subarch last built successfully on the v2.6.26 kernel
      and has been stale since then; it does not build on the v2.6.27,
      v2.6.28 and v2.6.29-rc5 kernels.
      
      No actual users beyond the maintainer reported this breakage.
      Patches were sent and most of the fixes were accepted but the
      discussion around how to do a few remaining issues cleanly
      fizzled out with no resolution and the code remained broken.
      
      In the v2.6.30 x86 tree development cycle 32-bit subarch support
      has been reworked and removed - and the Voyager code, beyond the
      build problems already known, needs serious and significant
      changes and probably a rewrite to support it.
      
      CONFIG_X86_VOYAGER has therefore been marked BROKEN. The maintainer
      has been notified but no patches have been sent so far to fix it.
      
      While all other subarchs have been converted to the new scheme,
      voyager is still broken. We'd prefer to receive patches which
      clean up the current situation in a constructive way, but even in
      case of removal there is no obstacle to adding that support back
      after the issues have been sorted out in a mutually acceptable
      fashion.
      
      So remove this inactive code for now.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: improve the help text of X86_EXTENDED_PLATFORM · 8425091f
      Committed by Ravikiran G Thirumalai
      Change the CONFIG_X86_EXTENDED_PLATFORM help text to display the
      32bit/64bit extended platform list. This is as suggested by Ingo.
      Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: shai@scalex86.org
      Cc: "Benzi Galili (Benzi@ScaleMP.com)" <benzi@scalemp.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  25. 20 February 2009, 1 commit
    • x86: convert to the new dynamic percpu allocator · 11124411
      Committed by Tejun Heo
      Impact: use new dynamic allocator, unified access to static/dynamic
              percpu memory
      
      Convert to the new dynamic percpu allocator.
      
      * implement populate_extra_pte() for both 32 and 64
      * update setup_per_cpu_areas() to use pcpu_setup_static()
      * define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() (see the sketch below)
      * define config HAVE_DYNAMIC_PER_CPU_AREA
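      
      A hedged sketch of the two translation helpers named above
      (simplified; the real macros in arch/x86/kernel/setup_percpu.c may
      differ in detail). A dynamic percpu pointer is simply the address
      the object would have had inside the original static percpu section:
      
      	#define __addr_to_pcpu_ptr(addr)			\
      		(void *)((unsigned long)(addr)			\
      			 - (unsigned long)pcpu_base_addr	\
      			 + (unsigned long)__per_cpu_start)
      
      	#define __pcpu_ptr_to_addr(ptr)				\
      		(void *)((unsigned long)(ptr)			\
      			 + (unsigned long)pcpu_base_addr	\
      			 - (unsigned long)__per_cpu_start)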
      Signed-off-by: Tejun Heo <tj@kernel.org>
  26. 17 February 2009, 2 commits
  27. 12 February 2009, 1 commit
  28. 11 February 2009, 1 commit
  29. 10 February 2009, 1 commit
    • x86: implement x86_32 stack protector · 60a5317f
      Committed by Tejun Heo
      Impact: stack protector for x86_32
      
      Implement stack protector for x86_32.  GDT entry 28 is used for it.
      It's set to point to stack_canary-20 and has a length of 24 bytes.
      CONFIG_CC_STACKPROTECTOR turns off CONFIG_X86_32_LAZY_GS and sets %gs
      to the stack canary segment on entry.  As %gs is otherwise unused by
      the kernel, the canary can be anywhere.  It's defined as a percpu
      variable.
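      
      A hedged sketch of the layout (the kernel's exact declarations may
      differ; the helper name below is hypothetical). gcc's i386
      -fstack-protector reads the canary from the fixed address %gs:20, so
      the segment base is placed 20 bytes below the per-cpu variable:
      
      	DECLARE_PER_CPU(unsigned long, stack_canary);
      
      	static inline void setup_stack_canary_segment(int cpu)
      	{
      		unsigned long base =
      			(unsigned long)&per_cpu(stack_canary, cpu) - 20;
      
      		/* Install 'base' as the base of GDT entry 28, with a
      		 * 24-byte limit, so %gs:20 lands on stack_canary. */
      		install_gdt_entry(cpu, 28, base, 24);	/* hypothetical */
      	}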
      
      x86_32 exception handlers take the register frame on the stack
      directly as struct pt_regs.  With -fstack-protector turned on, gcc
      copies the whole structure to after the stack canary and (of course)
      doesn't copy it back on return, thus losing all changes.  For now,
      -fno-stack-protector
      is added to all files which contain those functions.  We definitely
      need something better.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>