1. 28 Jun, 2012 (1 commit)
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Authored by Alex Shi
      x86 has no flush_tlb_range support at the instruction level. Currently
      flush_tlb_range is implemented by flushing the whole TLB. That is not
      the best solution for all scenarios. In fact, if we just use 'invlpg'
      to flush a few lines from the TLB, we gain performance later when the
      remaining TLB lines are accessed.
      
      But the 'invlpg' instruction itself costs a lot of time. Its execution
      time is comparable to a cr3 rewrite, and even a bit higher on SNB CPUs.
      
      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns(assumed TLB refill cost) =
      		X(TLB flush entries) * 100ns(assumed invlpg cost)
      
      Here, X is 256, that is 1/2 of 512 entries.
      
      But with the mysterious CPU prefetcher and page miss handler unit, the
      assumed TLB refill cost is far lower than 100ns for sequential access.
      And 2 HT siblings in one core make memory access even faster when they
      are accessing the same memory. So, in the patch, I only do the change
      when the number of target entries is less than 1/16 of the whole
      active TLB entries. Actually, I have no data to support the '1/16'
      percentage, so any suggestions are welcome.
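
      As a rough illustration of that heuristic (a sketch only, with made-up
      names such as flush_tlb_range_sketch() and FLUSHALL_SHIFT; the real
      patch wires this into the x86 flush_tlb_range() path):

      	#define FLUSHALL_SHIFT	4	/* flush all if range > 1/16 of TLB */

      	static void flush_one_page(unsigned long addr)
      	{
      		/* 'invlpg' drops a single 4KB translation from the TLB */
      		asm volatile("invlpg (%0)" :: "r" (addr) : "memory");
      	}

      	static void flush_tlb_range_sketch(unsigned long start,
      					   unsigned long end,
      					   unsigned long tlb_entries)
      	{
      		unsigned long nr_pages = (end - start) / PAGE_SIZE;
      		unsigned long addr;

      		if (nr_pages > (tlb_entries >> FLUSHALL_SHIFT)) {
      			local_flush_tlb();	/* cr3 rewrite: flush everything */
      			return;
      		}

      		for (addr = start; addr < end; addr += PAGE_SIZE)
      			flush_one_page(addr);
      	}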
      
      As for hugetlb, I guess due to the smaller page tables and fewer active
      TLB entries, I didn't see a benefit in my benchmark, so no optimization
      for now.
      
      My micro benchmark shows that in the ideal scenario the read
      performance improves by 70 percent, and in the worst scenario the
      reading/writing performance is similar to the unpatched 3.4-rc4 kernel.
      
      Here is the read data on my 2P * 4 cores * HT NHM EP machine, with THP
      set to 'always':
      
      multi thread testing, '-t' parameter is the thread number:
      	       	        with patch   unpatched 3.4-rc4
      ./mprotect -t 1           14ns		24ns
      ./mprotect -t 2           13ns		22ns
      ./mprotect -t 4           12ns		19ns
      ./mprotect -t 8           14ns		16ns
      ./mprotect -t 16          28ns		26ns
      ./mprotect -t 32          54ns		51ns
      ./mprotect -t 128         200ns		199ns
      
      Single process with sequential flushing and memory accessing:
      
      		       	with patch   unpatched 3.4-rc4
      ./mprotect		    7ns			11ns
      ./mprotect -p 4096  -l 8 -n 10240
      			    21ns		21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      e7b52ffd
  2. 20 Apr, 2012 (1 commit)
  3. 05 Mar, 2012 (1 commit)
    • BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Authored by Paul Gortmaker
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define), then
      that header really should include <linux/bug.h> itself rather than
      just expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
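
      For illustration, a hypothetical header of the kind this change
      targets (the header and function names below are made up):

      	/* hypothetical_header.h -- illustrative only */
      	#ifndef _HYPOTHETICAL_HEADER_H
      	#define _HYPOTHETICAL_HEADER_H

      	#include <linux/bug.h>	/* needed for BUG_ON() below */

      	static inline void check_slot(unsigned int idx, unsigned int max)
      	{
      		BUG_ON(idx >= max);	/* BUG_ON in a static inline, not a #define */
      	}

      	#endif /* _HYPOTHETICAL_HEADER_H */
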
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      187f1882
  4. 24 Feb, 2012 (1 commit)
    • static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]() · c5905afb
      Authored by Ingo Molnar
      
      So here's a boot tested patch on top of Jason's series that does
      all the cleanups I talked about and turns jump labels into a
      more intuitive to use facility. It should also address the
      various misconceptions and confusions that surround jump labels.
      
      Typical usage scenarios:
      
              #include <linux/static_key.h>
      
              struct static_key key = STATIC_KEY_INIT_TRUE;
      
              if (static_key_false(&key))
                      do unlikely code
              else
                      do likely code
      
      Or:
      
              if (static_key_true(&key))
                      do likely code
              else
                      do unlikely code
      
      The static key is modified via:
      
              static_key_slow_inc(&key);
              ...
              static_key_slow_dec(&key);
      
      The 'slow' prefix makes it abundantly clear that this is an
      expensive operation.
      
      I've updated all in-kernel code to use this everywhere. Note
      that I (intentionally) have not pushed the rename blindly through
      to the lowest levels: the actual jump-label patching arch facility
      should keep that name, so we want to decouple jump labels from the
      static-key facility a bit.
      
      On non-jump-label enabled architectures static keys default to
      likely()/unlikely() branches.
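
      A small sketch tying the calls above together (the function names here
      are made up, not taken from the patch):

      	#include <linux/static_key.h>

      	static struct static_key tracing_key = STATIC_KEY_INIT_FALSE;

      	static void do_expensive_tracing(void)
      	{
      		/* slow-path placeholder */
      	}

      	void hot_path(void)
      	{
      		/* compiles to a "likely off" branch, patched at runtime
      		 * once the key is enabled */
      		if (static_key_false(&tracing_key))
      			do_expensive_tracing();
      	}

      	void tracing_register(void)
      	{
      		static_key_slow_inc(&tracing_key);	/* enable the branch */
      	}

      	void tracing_unregister(void)
      	{
      		static_key_slow_dec(&tracing_key);	/* disable it again */
      	}
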
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: a.p.zijlstra@chello.nl
      Cc: mathieu.desnoyers@efficios.com
      Cc: davem@davemloft.net
      Cc: ddaney.cavm@gmail.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c5905afb
  5. 14 Jul, 2011 (1 commit)
  6. 26 Jan, 2011 (1 commit)
  7. 14 Jan, 2011 (1 commit)
  8. 28 Dec, 2010 (1 commit)
    • x86, paravirt: Use native_halt on a halt, not native_safe_halt · c8217b83
      Authored by Cliff Wickman
      halt() should use native_halt()
      safe_halt() uses native_safe_halt()
      
      If CONFIG_PARAVIRT=y, halt() is defined in arch/x86/include/asm/paravirt.h as
      
      static inline void halt(void)
      {
              PVOP_VCALL0(pv_irq_ops.safe_halt);
      }
      
      Otherwise (no CONFIG_PARAVIRT) halt() in arch/x86/include/asm/irqflags.h is
      
      static inline void halt(void)
      {
              native_halt();
      }
      
      So it looks to me like the CONFIG_PARAVIRT case of using native_safe_halt()
      for a halt() is an oversight.
      Am I missing something?
      
      It probably hasn't shown up as a problem because the local APIC is
      disabled on a shutdown or restart.  But if we disable interrupts and
      call halt(), we shouldn't expect halt() to re-enable interrupts.
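
      Presumably the fix boils down to routing halt() through the plain halt
      op (a sketch of the intended distinction, not the literal patch hunk):

      	static inline void halt(void)
      	{
      		PVOP_VCALL0(pv_irq_ops.halt);		/* plain hlt, IRQs untouched */
      	}

      	static inline void safe_halt(void)
      	{
      		PVOP_VCALL0(pv_irq_ops.safe_halt);	/* sti; hlt */
      	}
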
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      LKML-Reference: <E1PSBcz-0001g1-FM@eag09.americas.sgi.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      c8217b83
  9. 11 Nov, 2010 (1 commit)
    • tracing: Force arch_local_irq_* notrace for paravirt · b5908548
      Authored by Steven Rostedt
      When running ktest.pl randconfig tests, I would sometimes trigger
      a lockdep annotation bug (possible reason: unannotated irqs-on).
      
      This happened right after the function tracer self-test was executed.
      After doing a config bisect I found that it was caused by having the
      function tracer, paravirt guest, prove locking, and rcu torture all
      enabled.
      
      The rcu torture just increased the likelihood of triggering the bug.
      Prove locking was needed, since it was the thing reporting the bug.
      The function tracer would trace and disable interrupts in all sorts
      of funny places, and the paravirt guest would turn arch_local_irq_*
      into functions that would be traced.
      
      Besides the fact that tracing arch_local_irq_* is just a bad idea,
      this is what is happening.
      
      The bug happened simply in the local_irq_restore() code:
      
      		if (raw_irqs_disabled_flags(flags)) {	\
      			raw_local_irq_restore(flags);	\
      			trace_hardirqs_off();		\
      		} else {				\
      			trace_hardirqs_on();		\
      			raw_local_irq_restore(flags);	\
      		}					\
      
      The raw_local_irq_restore() was defined as arch_local_irq_restore().
      
      Now imagine we are about to enable interrupts. We go into the else
      case and call trace_hardirqs_on(), which tells lockdep that we are
      enabling interrupts, so it sets current->hardirqs_enabled = 1.
      
      Then we call raw_local_irq_restore() which calls arch_local_irq_restore()
      which gets traced!
      
      Now in the function tracer we disable interrupts with local_irq_save().
      This is fine, but the saved flags record that interrupts are disabled.
      
      When the function tracer calls local_irq_restore(), it does so with
      flags that say interrupts are disabled, so we go into the if () path.
      This keeps interrupts disabled and calls trace_hardirqs_off(), which
      sets current->hardirqs_enabled = 0.
      
      When the tracer is finished and proceeds with the original code,
      we enable interrupts but leave current->hardirqs_enabled as 0, which
      now breaks lockdep's internal processing.
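
      The fix implied by the subject is to keep the function tracer out of
      these wrappers. A sketch, assuming the paravirt irq-flag inlines look
      roughly like this:

      	static inline notrace unsigned long arch_local_save_flags(void)
      	{
      		return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl);
      	}

      	static inline notrace void arch_local_irq_restore(unsigned long f)
      	{
      		PVOP_VCALLEE1(pv_irq_ops.restore_fl, f);
      	}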
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b5908548
  10. 07 Oct, 2010 (1 commit)
    • Fix IRQ flag handling naming · df9ee292
      Authored by David Howells
      Fix the IRQ flag handling naming.  In linux/irqflags.h under one configuration,
      it maps:
      
      	local_irq_enable() -> raw_local_irq_enable()
      	local_irq_disable() -> raw_local_irq_disable()
      	local_irq_save() -> raw_local_irq_save()
      	...
      
      and under the other configuration, it maps:
      
      	raw_local_irq_enable() -> local_irq_enable()
      	raw_local_irq_disable() -> local_irq_disable()
      	raw_local_irq_save() -> local_irq_save()
      	...
      
      This is quite confusing.  There should be one set of names expected of the
      arch, and this should be wrapped to give another set of names that are expected
      by users of this facility.
      
      Change this to have the arch provide:
      
      	flags = arch_local_save_flags()
      	flags = arch_local_irq_save()
      	arch_local_irq_restore(flags)
      	arch_local_irq_disable()
      	arch_local_irq_enable()
      	arch_irqs_disabled_flags(flags)
      	arch_irqs_disabled()
      	arch_safe_halt()
      
      Then linux/irqflags.h wraps these to provide:
      
      	raw_local_save_flags(flags)
      	raw_local_irq_save(flags)
      	raw_local_irq_restore(flags)
      	raw_local_irq_disable()
      	raw_local_irq_enable()
      	raw_irqs_disabled_flags(flags)
      	raw_irqs_disabled()
      	raw_safe_halt()
      
      with type checking on the flags 'arguments', and then wraps those to provide:
      
      	local_save_flags(flags)
      	local_irq_save(flags)
      	local_irq_restore(flags)
      	local_irq_disable()
      	local_irq_enable()
      	irqs_disabled_flags(flags)
      	irqs_disabled()
      	safe_halt()
      
      with tracing included if enabled.
      
      The arch functions can now all be inline functions rather than some of them
      having to be macros.
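
      For reference, a sketch of how the raw_ layer can wrap the arch_ layer
      with type checking (illustrative macro bodies, not the verbatim
      linux/irqflags.h text):

      	#include <linux/typecheck.h>

      	#define raw_local_save_flags(flags)			\
      		do {						\
      			typecheck(unsigned long, flags);	\
      			(flags) = arch_local_save_flags();	\
      		} while (0)

      	#define raw_local_irq_restore(flags)			\
      		do {						\
      			typecheck(unsigned long, flags);	\
      			arch_local_irq_restore(flags);		\
      		} while (0)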
      
      Signed-off-by: David Howells <dhowells@redhat.com> [X86, FRV, MN10300]
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> [Tile]
      Signed-off-by: Michal Simek <monstr@monstr.eu> [Microblaze]
      Tested-by: Catalin Marinas <catalin.marinas@arm.com> [ARM]
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com> [AVR]
      Acked-by: Tony Luck <tony.luck@intel.com> [IA-64]
      Acked-by: Hirokazu Takata <takata@linux-m32r.org> [M32R]
      Acked-by: Greg Ungerer <gerg@uclinux.org> [M68K/M68KNOMMU]
      Acked-by: Ralf Baechle <ralf@linux-mips.org> [MIPS]
      Acked-by: Kyle McMartin <kyle@mcmartin.ca> [PA-RISC]
      Acked-by: Paul Mackerras <paulus@samba.org> [PowerPC]
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [S390]
      Acked-by: Chen Liqin <liqin.chen@sunplusct.com> [Score]
      Acked-by: Matt Fleming <matt@console-pimps.org> [SH]
      Acked-by: David S. Miller <davem@davemloft.net> [Sparc]
      Acked-by: Chris Zankel <chris@zankel.net> [Xtensa]
      Reviewed-by: Richard Henderson <rth@twiddle.net> [Alpha]
      Reviewed-by: Yoshinori Sato <ysato@users.sourceforge.jp> [H8300]
      Cc: starvik@axis.com [CRIS]
      Cc: jesper.nilsson@axis.com [CRIS]
      Cc: linux-cris-kernel@axis.com
      df9ee292
  11. 24 Aug, 2010 (1 commit)
  12. 28 Feb, 2010 (1 commit)
  13. 15 Dec, 2009 (2 commits)
  14. 13 Oct, 2009 (1 commit)
    • x86/paravirt: Use normal calling sequences for irq enable/disable · 71999d98
      Authored by Jeremy Fitzhardinge
      Bastian Blank reported a boot crash with stackprotector enabled,
      and debugged it back to edx register corruption.
      
      For historical reasons irq enable/disable/save/restore had special
      calling sequences to make them more efficient.  With the more
      recent introduction of higher-level and more general optimisations
      this is no longer necessary so we can just use the normal PVOP_
      macros.
      
      This fixes some residual bugs in the old implementations which left
      edx liable to inadvertent clobbering. Also, fix some bugs in
      __PVOP_VCALLEESAVE which were revealed by actual use.
      Reported-by: Bastian Blank <bastian@waldi.eu.org>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Stable Kernel <stable@kernel.org>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4AD3BC9B.7040501@goop.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      71999d98
  15. 16 Sep, 2009 (1 commit)
  16. 01 Sep, 2009 (2 commits)
  17. 31 Aug, 2009 (7 commits)
  18. 18 Jun, 2009 (1 commit)
  19. 16 May, 2009 (1 commit)
    • x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Authored by Jeremy Fitzhardinge
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
      		nopv		Pv-nospin	Pv-spin
      CPU cycles	100.00%		99.89%		102.18%
      instructions	100.00%		100.10%		100.15%
      CPI		100.00%		99.79%		102.03%
      cache ref	100.00%		100.84%		100.28%
      cache miss	100.00%		90.47%		88.56%
      cache miss rate	100.00%		89.72%		88.31%
      branches	100.00%		99.93%		100.04%
      branch miss	100.00%		103.66%		107.72%
      branch miss rt	100.00%		103.73%		107.67%
      wallclock	100.00%		99.90%		102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides are: 1) it will replicate the
      preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still it is a
      reasonable short-term workaround.
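
      A sketch of the config-option approach (assuming a
      CONFIG_PARAVIRT_SPINLOCKS-style switch; the guard shown here is
      illustrative rather than the exact hunk):

      	#ifdef CONFIG_PARAVIRT_SPINLOCKS
      	static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
      	{
      		/* patchable pvops call, only when the option is enabled */
      		PVOP_VCALL1(pv_lock_ops.spin_lock, lock);
      	}
      	#else
      	static __always_inline void __raw_spin_lock(raw_spinlock_t *lock)
      	{
      		__ticket_spin_lock(lock);	/* direct call, no pvops layer */
      	}
      	#endif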
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: N"Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: N"Li Xin" <xin.li@intel.com>
      Analysed-by: N"Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b4ecc126
  20. 11 Apr, 2009 (1 commit)
  21. 10 Apr, 2009 (1 commit)
  22. 30 Mar, 2009 (3 commits)
  23. 19 Mar, 2009 (1 commit)
  24. 17 Mar, 2009 (1 commit)
    • x86, paravirt: prevent gcc from generating the wrong addressing mode · 42854dc0
      Authored by Jeremy Fitzhardinge
      Impact: fix crash on VMI (VMware)
      
      When we generate a call sequence for calling a paravirtualized
      function, we presume that the generated code is "call *0xXXXXX",
      which is a 6 byte opcode; this is larger than a normal
      direct call, and so we can patch a direct call over it.
      
      At the moment, however, we give gcc enough rope to hang us by
      putting the address in a register and generating a two-byte
      indirect-via-register call.  Prevent this by explicitly
      dereferencing the function pointer and passing it into the
      asm as a constant.
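
      Roughly, the difference looks like this (a sketch with a made-up
      pointer name; real code also lists the clobbered registers):

      	void (*pv_op_ptr)(void);	/* stand-in for one pv_ops slot */

      	static void indirect_call_examples(void)
      	{
      		/* gcc may load the pointer into a register and emit a
      		 * 2-byte "call *%reg" -- too short to patch over */
      		asm volatile("call *%0" : : "r" (pv_op_ptr));

      		/* passing the address of the pointer as an "i" constant
      		 * forces the longer memory-indirect "call *addr" form,
      		 * leaving room to patch in a direct call later */
      		asm volatile("call *%c0" : : "i" (&pv_op_ptr));
      	}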
      
      This prevents crashes in VMI, as it cannot handle unpatchable
      callsites.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Alok Kataria <akataria@vmware.com>
      LKML-Reference: <49BEEDC2.2070809@goop.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      42854dc0
  25. 13 Feb, 2009 (1 commit)
  26. 12 Feb, 2009 (2 commits)
  27. 10 Feb, 2009 (1 commit)
  28. 04 Feb, 2009 (1 commit)
  29. 03 Feb, 2009 (1 commit)