  1. 15 Sep, 2012  1 commit
  2. 23 Aug, 2012  1 commit
    • ftrace/x86: Add support for -mfentry to x86_64 · d57c5d51
      Authored by Steven Rostedt
      If the kernel is compiled with gcc 4.6.0, which supports -mfentry,
      then use that instead of mcount.
      
      With mcount, frame pointers are forced with the -pg option and we
      get something like:
      
      <can_vma_merge_before>:
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             53                      push   %rbx
             41 51                   push   %r9
             e8 fe 6a 39 00          callq  ffffffff81483d00 <mcount>
             31 c0                   xor    %eax,%eax
             48 89 fb                mov    %rdi,%rbx
             48 89 d7                mov    %rdx,%rdi
             48 33 73 30             xor    0x30(%rbx),%rsi
             48 f7 c6 ff ff ff f7    test   $0xfffffffff7ffffff,%rsi
      
      With -mfentry, frame pointers are no longer forced and the call looks
      like this:
      
      <can_vma_merge_before>:
             e8 33 af 37 00          callq  ffffffff81461b40 <__fentry__>
             53                      push   %rbx
             48 89 fb                mov    %rdi,%rbx
             31 c0                   xor    %eax,%eax
             48 89 d7                mov    %rdx,%rdi
             41 51                   push   %r9
             48 33 73 30             xor    0x30(%rbx),%rsi
             48 f7 c6 ff ff ff f7    test   $0xfffffffff7ffffff,%rsi
      
      This adds the ftrace hook at the very beginning of the function,
      before a frame is set up, and allows the function callbacks to access
      the function's parameters. As kprobes can now use function tracing
      (at least on x86), this speeds up kprobe hooks placed at the beginning
      of a function. (The resulting hook-symbol selection is sketched after
      this entry.)
      
      Link: http://lkml.kernel.org/r/20120807194100.130477900@goodmis.org
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      d57c5d51
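      A minimal sketch of the hook-symbol selection this change enables,
      assuming the build defines CC_USING_FENTRY when -mfentry is in use;
      the macro name here is illustrative, not the commit's exact diff:

          #ifdef CC_USING_FENTRY
          # define FTRACE_ENTRY_SYM  "__fentry__"  /* hook placed before any frame setup */
          #else
          # define FTRACE_ENTRY_SYM  "mcount"      /* legacy -pg hook, after the prologue */
          #endif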
  3. 22 Aug, 2012  1 commit
  4. 10 Aug, 2012  2 commits
    • perf: Factor __output_copy to be usable with specific copy function · 91d7753a
      Authored by Frederic Weisbecker
      Add a generic way to use the __output_copy function with a specific
      copy function via the DEFINE_PERF_OUTPUT_COPY macro.
      
      Use this to add a new __output_copy_user function that copies output
      from user pointers. On x86 the copy_from_user_nmi function is used;
      the remaining architectures use __copy_from_user_inatomic. (The macro
      pattern is sketched after this entry.)
      
      This new function will be used for the user stack dump on sample,
      coming in the next patches.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Arun Sharma <asharma@fb.com>
      Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Ulrich Drepper <drepper@gmail.com>
      Link: http://lkml.kernel.org/r/1344345647-11536-4-git-send-email-jolsa@redhat.com
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      91d7753a
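      A hedged sketch of the macro pattern described above. The copy
      primitives and the return convention are simplified stand-ins so the
      sketch is self-contained; the real kernel helpers and wrap handling
      differ in detail:

          #include <string.h>

          /*
           * One macro stamps out an output-copy routine around a given
           * low-level copy primitive.  The return value is simplified;
           * real code reports how much was actually copied.
           */
          #define DEFINE_PERF_OUTPUT_COPY(func_name, copy_fn)          \
          static unsigned long                                         \
          func_name(void *dst, const void *src, unsigned long n)       \
          {                                                            \
                  copy_fn(dst, src, n);                                \
                  return n;                                            \
          }

          /* Kernel-space source pointers use a plain copy ...         */
          DEFINE_PERF_OUTPUT_COPY(__output_copy_sketch, memcpy)
          /* ... while the user-pointer variant would plug in
           * copy_from_user_nmi on x86 and __copy_from_user_inatomic
           * elsewhere; memcpy stands in here.                          */
          DEFINE_PERF_OUTPUT_COPY(__output_copy_user_sketch, memcpy)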
    • perf: Unified API to record selective sets of arch registers · c5e63197
      Authored by Jiri Olsa
      This brings a new API to support the selective dump of registers on
      event sampling, and its implementation for the x86 arch.
      
      Add a HAVE_PERF_REGS config option to indicate whether the
      architecture provides a perf registers ABI.
      
      The information about the desired registers is passed in a u64 mask.
      It's up to the architecture to map the registers onto the mask bits.
      
      For the x86 implementation, the bits for both the 32- and 64-bit
      registers are defined within a single enum, so that a 64-bit system
      can provide a register dump for a compat task if needed in the future.
      (The enum/mask idea is sketched after this entry.)
      Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
      [ Added missing linux/errno.h include ]
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Arun Sharma <asharma@fb.com>
      Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Ulrich Drepper <drepper@gmail.com>
      Link: http://lkml.kernel.org/r/1344345647-11536-2-git-send-email-jolsa@redhat.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      c5e63197
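      A hedged sketch of the enum-plus-u64-mask idea described above. The
      register names, ordering and prefix are illustrative, not the actual
      perf_regs ABI header:

          /* Each architecture enumerates the registers it can dump ...  */
          enum sketch_x86_regs {
                  SKETCH_REG_AX,
                  SKETCH_REG_BX,
                  SKETCH_REG_CX,
                  SKETCH_REG_DX,
                  SKETCH_REG_SP,
                  SKETCH_REG_IP,
                  SKETCH_REG_MAX,
          };

          /* ... and a selective dump is requested with one bit per register. */
          static unsigned long long sample_regs_user =
                  (1ULL << SKETCH_REG_IP) | (1ULL << SKETCH_REG_SP);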
  5. 01 Aug, 2012  3 commits
  6. 31 Jul, 2012  3 commits
  7. 26 Jul, 2012  2 commits
  8. 21 Jul, 2012  4 commits
  9. 20 Jul, 2012  5 commits
  10. 16 Jul, 2012  2 commits
  11. 12 Jul, 2012  2 commits
    • KVM: VMX: Implement PCID/INVPCID for guests with EPT · ad756a16
      Authored by Mao, Junjie
      This patch handles PCID/INVPCID for guests.
      
      Process-context identifiers (PCIDs) are a facility by which a logical processor
      may cache information for multiple linear-address spaces so that the processor
      may retain cached information when software switches to a different linear
      address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
      Volume 3A for details.
      
      For guests with EPT, the PCID feature is enabled and INVPCID behaves
      as it does when running natively.
      For guests without EPT, the PCID feature is disabled and INVPCID
      triggers #UD. (This policy is sketched after this entry.)
      Signed-off-by: Junjie Mao <junjie.mao@intel.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      ad756a16
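      A hedged sketch of the EPT-gated policy stated above; the helper and
      field names are illustrative, not the actual kvm/vmx code:

          /* Expose PCID/INVPCID to the guest only when EPT is in use;
           * otherwise hide both so INVPCID raises #UD in the guest.    */
          struct guest_features_sketch {
                  int pcid;
                  int invpcid;
          };

          static void sketch_update_pcid_exposure(int ept_enabled,
                                                  struct guest_features_sketch *f)
          {
                  f->pcid    = ept_enabled;
                  f->invpcid = ept_enabled;
          }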
    • KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check · fc73373b
      Authored by Prarit Bhargava
      While debugging I noticed that, unlike all the other hypervisor code
      in the kernel, kvm does not have an entry for x86_hyper, which is used
      in detect_hypervisor_platform() and results in a nice printk in the
      syslog. This is really only a stub function, but it does make kvm more
      consistent with the other hypervisors. (A sketch of such an entry
      follows this one.)
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Avi Kivity <avi@redhat.com>
      fc73373b
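      A hedged sketch of the kind of stub entry described above: a name plus
      a detect callback, so detect_hypervisor_platform() can print which
      hypervisor was found. The struct is trimmed and renamed; the real
      x86_hyper has more fields:

          struct hyper_sketch {
                  const char   *name;
                  unsigned int (*detect)(void);
          };

          static unsigned int kvm_detect_sketch(void)
          {
                  /* the real code checks for the KVM CPUID signature */
                  return 0;
          }

          static const struct hyper_sketch x86_hyper_kvm_sketch = {
                  .name   = "KVM",
                  .detect = kvm_detect_sketch,
          };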
  12. 09 Jul, 2012  2 commits
  13. 06 Jul, 2012  5 commits
  14. 04 Jul, 2012  1 commit
  15. 30 Jun, 2012  1 commit
  16. 28 Jun, 2012  5 commits
    • x86/tlb: do flush_tlb_kernel_range by 'invlpg' · effee4b9
      Authored by Alex Shi
      This patch does flush_tlb_kernel_range by 'invlpg'. The performance
      cost and gain were analyzed in the previous patch
      (x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).
      
      In the testing: http://lkml.org/lkml/2012/6/21/10
      
      The cost is mostly hidden by the long kernel path, but the gain is
      still quite clear: memory access in a user application can speed up by
      30+% while the kernel executes this function. (The per-page flush is
      sketched after this entry.)
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      effee4b9
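      A hedged sketch of the per-page flush described above; the threshold,
      PAGE constant and low-level helpers are illustrative stand-ins, not
      the commit's actual code:

          #define SKETCH_PAGE_SIZE  4096UL
          #define SKETCH_THRESHOLD  64UL    /* pages; illustrative balance point */

          static void sketch_flush_all(void)            { /* reload CR3 / global flush */ }
          static void sketch_invlpg(unsigned long addr) { (void)addr; /* INVLPG one page */ }

          static void sketch_flush_tlb_kernel_range(unsigned long start, unsigned long end)
          {
                  if ((end - start) / SKETCH_PAGE_SIZE > SKETCH_THRESHOLD) {
                          sketch_flush_all();
                  } else {
                          unsigned long addr;

                          for (addr = start; addr < end; addr += SKETCH_PAGE_SIZE)
                                  sketch_invlpg(addr);
                  }
          }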
    • x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR · 52aec330
      Authored by Alex Shi
      There are 32 INVALIDATE_TLB_VECTORs in the kernel now. That is quite a
      large number of vectors in the IDT, yet it is still not enough, since
      modern x86 servers have even more CPUs. That still causes heavy lock
      contention in TLB flushing.
      
      The patch uses the generic SMP call function to replace them (see the
      sketch after this entry). That saves 32 vector numbers in the IDT and
      resolves the lock contention in TLB flushing on large systems.
      
      On an NHM-EX machine (4P * 8 cores * HT = 64 CPUs), hackbench pthread
      shows a 3% performance increase.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      52aec330
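      A hedged sketch of routing remote flushes through the generic SMP
      call-function IPI instead of dedicated invalidate vectors.
      smp_call_function_many() is the real kernel API; the payload struct
      and flush body are illustrative:

          #include <linux/smp.h>
          #include <linux/cpumask.h>

          struct flush_sketch {
                  unsigned long start, end;
          };

          static void flush_sketch_fn(void *arg)
          {
                  struct flush_sketch *f = arg;

                  /* flush the local TLB for [f->start, f->end) here */
                  (void)f;
          }

          static void sketch_flush_others(const struct cpumask *mask,
                                          unsigned long start, unsigned long end)
          {
                  struct flush_sketch info = { .start = start, .end = end };

                  /* one shared call-function vector replaces the 32 dedicated ones */
                  smp_call_function_many(mask, flush_sketch_fn, &info, true);
          }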
    • x86/tlb: enable tlb flush range support for x86 · 611ae8e3
      Authored by Alex Shi
      Not every tlb_flush really needs to evacuate all TLB entries; in
      munmap, for example, a few 'invlpg' instructions are better for
      whole-process performance, since they leave most of the TLB entries in
      place for later accesses.
      
      This patch also rewrites flush_tlb_range for 2 purposes:
      1, split it out to get a flush_tlb_mm_range function (sketched after
         this entry).
      2, clean up to reduce line breaking, thanks to Borislav's input.
      
      My micro benchmark 'mummap' (http://lkml.org/lkml/2012/5/17/59) shows
      that random memory access on the other CPU gets a 0~50% speed up on a
      2P * 4 cores * HT NHM-EP while doing 'munmap'.
      
      Thanks to Yongjie for his testing of this patch:
      -------------
      I used Linux 3.4-RC6 with and without his patches as the Xen dom0 and
      guest kernel.
      After running two benchmarks in a Xen HVM guest, I found his patches
      brought about a 1%~3% performance gain in the 'kernel build' and
      'netperf' testing, though the performance gain was not very stable in
      the 'kernel build' testing.
      
      Some detailed testing results are below.
      
      Testing Environment:
      	Hardware: Romley-EP platform
      	Xen version: latest upstream
      	Linux kernel: 3.4-RC6
      	Guest vCPU number: 8
      	NIC: Intel 82599 (10GB bandwidth)
      
      In 'kernel build' testing in the guest:
      	Command line  |  performance gain
      	make -j 4     |    3.81%
      	make -j 8     |    0.37%
      	make -j 16    |    -0.52%
      
      In the 'netperf' testing, we tested TCP_STREAM with the default socket
      size, using 16384 bytes as the large packet and 64 bytes as the small
      packet.
      I used several clients to add networking pressure, then the 'netperf'
      server automatically generated several threads to respond to them.
      Both the large-size and small-size packets were used in the testing.
      	Packet size  |  Thread number | performance gain
      	16384 bytes  |      4       |   0.02%
      	16384 bytes  |      8       |   2.21%
      	16384 bytes  |      16      |   2.04%
      	64 bytes     |      4       |   1.07%
      	64 bytes     |      8       |   3.31%
      	64 bytes     |      16      |   0.71%
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
      Tested-by: Ren, Yongjie <yongjie.ren@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      611ae8e3
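      A hedged sketch of the range-vs-full-flush decision this change
      introduces (the real routine is flush_tlb_mm_range); the constants,
      shift value and helpers are illustrative stand-ins:

          #define SKETCH_PAGE_SHIFT      12
          #define SKETCH_ACT_ENTRIES     512UL  /* active TLB entries, illustrative */
          #define SKETCH_FLUSHALL_SHIFT  4      /* flush all above 1/16 of the TLB  */

          static void sketch_local_flush_all(void)         { /* CR3 reload */ }
          static void sketch_local_invlpg(unsigned long a) { (void)a; }

          static void sketch_flush_tlb_mm_range(unsigned long start, unsigned long end)
          {
                  unsigned long nr_pages = (end - start) >> SKETCH_PAGE_SHIFT;

                  if (nr_pages > (SKETCH_ACT_ENTRIES >> SKETCH_FLUSHALL_SHIFT)) {
                          sketch_local_flush_all();
                  } else {
                          unsigned long addr;

                          for (addr = start; addr < end; addr += (1UL << SKETCH_PAGE_SHIFT))
                                  sketch_local_invlpg(addr);
                  }
          }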
    • x86/tlb: add tlb_flushall_shift for specific CPU · c4211f42
      Authored by Alex Shi
      Testing shows that different CPU types (micro-architectures and NUMA
      modes) have different balance points between a full TLB flush and
      multiple invlpg instructions. There are also cases where the tlb flush
      change does not help at all.
      
      This patch gives an interface to let x86 vendor developers set a
      different shift for each CPU type (see the sketch after this entry).
      
      For machines in my hands, for example, the balance point is 16 entries
      on Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 on an IVB mobile
      CPU; but on a model 15 Core2 Xeon, using invlpg does not help at all.
      
      For untested machines, do a conservative optimization, the same as for
      NHM CPUs.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      c4211f42
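      A hedged sketch of the per-CPU-type tuning hook described above. The
      struct, the model check and the shift values are illustrative only,
      not the actual cpu-detect code; the Core2 case follows the message's
      note that invlpg does not help there:

          struct cpu_sketch {
                  unsigned char x86;            /* family */
                  unsigned char x86_model;
                  signed   char tlb_flushall_shift;
          };

          /* Vendor code picks the balance point per micro-architecture;
           * -1 means "never use the range flush".                        */
          static void sketch_intel_tlb_flushall_shift_set(struct cpu_sketch *c)
          {
                  switch (c->x86_model) {
                  case 15:                      /* Core2 Xeon: invlpg does not help */
                          c->tlb_flushall_shift = -1;
                          break;
                  default:                      /* untested: conservative, NHM-like */
                          c->tlb_flushall_shift = 1;
                          break;
                  }
          }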
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Authored by Alex Shi
      x86 has no flush_tlb_range support at the instruction level. Currently
      flush_tlb_range is just implemented by flushing all TLB entries. That
      is not the best solution for all scenarios. In fact, if we just use
      'invlpg' to flush a few lines from the TLB, we can gain performance
      from later accesses to the remaining TLB lines.
      
      But the 'invlpg' instruction costs a lot of time. Its execution time
      is comparable with a cr3 rewrite, and even a bit longer on SNB CPUs.
      
      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns (assumed TLB refill cost) =
      		X (TLB flush entries) * 100ns (assumed invlpg cost)
      
      Here X is 256, that is, 1/2 of the 512 entries. (This arithmetic is
      also worked through in the small helper after this entry.)
      
      But with the mysterious CPU prefetcher and page-miss-handler unit, the
      assumed TLB refill cost is far lower than 100ns for sequential access.
      And 2 HT siblings in one core make memory access much faster when they
      are accessing the same memory. So, in the patch, I only do the change
      when the target entries are fewer than 1/16 of the whole set of active
      tlb entries. Actually, I have no data supporting the ratio '1/16', so
      any suggestions are welcome.
      
      As for hugetlb, presumably due to the smaller page tables and fewer
      active TLB entries, I didn't see a benefit in my benchmark, so it is
      not optimized for now.
      
      My micro benchmark shows that in ideal scenarios the reading
      performance improves by 70 percent, and in the worst scenario the
      reading/writing performance is similar to the unpatched 3.4-rc4
      kernel.
      
      Here is the reading data on my 2P * 4 cores * HT NHM-EP machine, with
      THP set to 'always':
      
      Multi-thread testing, where the '-t' parameter is the thread number:
      	       	        with patch   unpatched 3.4-rc4
      ./mprotect -t 1           14ns		24ns
      ./mprotect -t 2           13ns		22ns
      ./mprotect -t 4           12ns		19ns
      ./mprotect -t 8           14ns		16ns
      ./mprotect -t 16          28ns		26ns
      ./mprotect -t 32          54ns		51ns
      ./mprotect -t 128         200ns		199ns
      
      Single process with sequential flushing and memory access:
      
      		       	with patch   unpatched 3.4-rc4
      ./mprotect		    7ns			11ns
      ./mprotect -p 4096  -l 8 -n 10240
      			    21ns		21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      e7b52ffd
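      The balance-point arithmetic from the message above, written out as a
      tiny helper; the 100ns costs are the commit's assumed figures and
      purely illustrative:

          /* (entries - X) * refill_cost = X * invlpg_cost
           *   =>  X = entries * refill_cost / (refill_cost + invlpg_cost)  */
          static unsigned int sketch_balance_point(unsigned int tlb_entries,
                                                   unsigned int refill_cost_ns,
                                                   unsigned int invlpg_cost_ns)
          {
                  return (unsigned int)((unsigned long long)tlb_entries * refill_cost_ns /
                                        (refill_cost_ns + invlpg_cost_ns));
          }

          /* sketch_balance_point(512, 100, 100) == 256, half of a 512-entry TLB. */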