1. 01 8月, 2012 3 次提交
  2. 31 7月, 2012 2 次提交
  3. 26 7月, 2012 2 次提交
  4. 21 7月, 2012 4 次提交
  5. 20 7月, 2012 1 次提交
  6. 16 7月, 2012 2 次提交
  7. 12 7月, 2012 2 次提交
    • M
      KVM: VMX: Implement PCID/INVPCID for guests with EPT · ad756a16
      Mao, Junjie 提交于
      This patch handles PCID/INVPCID for guests.
      
      Process-context identifiers (PCIDs) are a facility by which a logical processor
      may cache information for multiple linear-address spaces so that the processor
      may retain cached information when software switches to a different linear
      address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
      Volume 3A for details.
      
      For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
      natively.
      For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.
      Signed-off-by: NJunjie Mao <junjie.mao@intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      ad756a16
    • P
      KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check · fc73373b
      Prarit Bhargava 提交于
      While debugging I noticed that unlike all the other hypervisor code in the
      kernel, kvm does not have an entry for x86_hyper which is used in
      detect_hypervisor_platform() which results in a nice printk in the
      syslog.  This is only really a stub function but it
      does make kvm more consistent with the other hypervisors.
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marcelo Tostatti <mtosatti@redhat.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      fc73373b
  8. 09 7月, 2012 2 次提交
  9. 06 7月, 2012 5 次提交
  10. 04 7月, 2012 1 次提交
  11. 30 6月, 2012 1 次提交
  12. 28 6月, 2012 6 次提交
    • A
      x86/tlb: do flush_tlb_kernel_range by 'invlpg' · effee4b9
      Alex Shi 提交于
      This patch do flush_tlb_kernel_range by 'invlpg'. The performance pay
      and gain was analyzed in previous patch
      (x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).
      
      In the testing: http://lkml.org/lkml/2012/6/21/10
      
      The pay is mostly covered by long kernel path, but the gain is still
      quite clear, memory access in user APP can increase 30+% when kernel
      execute this funtion.
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.comSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      effee4b9
    • A
      x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR · 52aec330
      Alex Shi 提交于
      There are 32 INVALIDATE_TLB_VECTOR now in kernel. That is quite big
      amount of vector in IDT. But it is still not enough, since modern x86
      sever has more cpu number. That still causes heavy lock contention
      in TLB flushing.
      
      The patch using generic smp call function to replace it. That saved 32
      vector number in IDT, and resolved the lock contention in TLB
      flushing on large system.
      
      In the NHM EX machine 4P * 8cores * HT = 64 CPUs, hackbench pthread
      has 3% performance increase.
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.comSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      52aec330
    • A
      x86/tlb: enable tlb flush range support for x86 · 611ae8e3
      Alex Shi 提交于
      Not every tlb_flush execution moment is really need to evacuate all
      TLB entries, like in munmap, just few 'invlpg' is better for whole
      process performance, since it leaves most of TLB entries for later
      accessing.
      
      This patch also rewrite flush_tlb_range for 2 purposes:
      1, split it out to get flush_blt_mm_range function.
      2, clean up to reduce line breaking, thanks for Borislav's input.
      
      My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
      show that the random memory access on other CPU has 0~50% speed up
      on a 2P * 4cores * HT NHM EP while do 'munmap'.
      
      Thanks Yongjie's testing on this patch:
      -------------
      I used Linux 3.4-RC6 w/ and w/o his patches as Xen dom0 and guest
      kernel.
      After running two benchmarks in Xen HVM guest, I found his patches
      brought about 1%~3% performance gain in 'kernel build' and 'netperf'
      testing, though the performance gain was not very stable in 'kernel
      build' testing.
      
      Some detailed testing results are below.
      
      Testing Environment:
      	Hardware: Romley-EP platform
      	Xen version: latest upstream
      	Linux kernel: 3.4-RC6
      	Guest vCPU number: 8
      	NIC: Intel 82599 (10GB bandwidth)
      
      In 'kernel build' testing in guest:
      	Command line  |  performance gain
          make -j 4      |    3.81%
          make -j 8      |    0.37%
          make -j 16     |    -0.52%
      
      In 'netperf' testing, we tested TCP_STREAM with default socket size
      16384 byte as large packet and 64 byte as small packet.
      I used several clients to add networking pressure, then 'netperf' server
      automatically generated several threads to response them.
      I also used large-size packet and small-size packet in the testing.
      	Packet size  |  Thread number | performance gain
      	16384 bytes  |      4       |   0.02%
      	16384 bytes  |      8       |   2.21%
      	16384 bytes  |      16      |   2.04%
      	64 bytes     |      4       |   1.07%
      	64 bytes     |      8       |   3.31%
      	64 bytes     |      16      |   0.71%
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.comTested-by: NRen, Yongjie <yongjie.ren@intel.com>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      611ae8e3
    • A
      x86/tlb: add tlb_flushall_shift for specific CPU · c4211f42
      Alex Shi 提交于
      Testing show different CPU type(micro architectures and NUMA mode) has
      different balance points between the TLB flush all and multiple invlpg.
      And there also has cases the tlb flush change has no any help.
      
      This patch give a interface to let x86 vendor developers have a chance
      to set different shift for different CPU type.
      
      like some machine in my hands, balance points is 16 entries on
      Romely-EP; while it is at 8 entries on Bloomfield NHM-EP; and is 256 on
      IVB mobile CPU. but on model 15 core2 Xeon using invlpg has nothing
      help.
      
      For untested machine, do a conservative optimization, same as NHM CPU.
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.comSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      c4211f42
    • A
      x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Alex Shi 提交于
      x86 has no flush_tlb_range support in instruction level. Currently the
      flush_tlb_range just implemented by flushing all page table. That is not
      the best solution for all scenarios. In fact, if we just use 'invlpg' to
      flush few lines from TLB, we can get the performance gain from later
      remain TLB lines accessing.
      
      But the 'invlpg' instruction costs much of time. Its execution time can
      compete with cr3 rewriting, and even a bit more on SNB CPU.
      
      So, on a 512 4KB TLB entries CPU, the balance points is at:
      	(512 - X) * 100ns(assumed TLB refill cost) =
      		X(TLB flush entries) * 100ns(assumed invlpg cost)
      
      Here, X is 256, that is 1/2 of 512 entries.
      
      But with the mysterious CPU pre-fetcher and page miss handler Unit, the
      assumed TLB refill cost is far lower then 100ns in sequential access. And
      2 HT siblings in one core makes the memory access more faster if they are
      accessing the same memory. So, in the patch, I just do the change when
      the target entries is less than 1/16 of whole active tlb entries.
      Actually, I have no data support for the percentage '1/16', so any
      suggestions are welcomed.
      
      As to hugetlb, guess due to smaller page table, and smaller active TLB
      entries, I didn't see benefit via my benchmark, so no optimizing now.
      
      My micro benchmark show in ideal scenarios, the performance improves 70
      percent in reading. And in worst scenario, the reading/writing
      performance is similar with unpatched 3.4-rc4 kernel.
      
      Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
      'always':
      
      multi thread testing, '-t' paramter is thread number:
      	       	        with patch   unpatched 3.4-rc4
      ./mprotect -t 1           14ns		24ns
      ./mprotect -t 2           13ns		22ns
      ./mprotect -t 4           12ns		19ns
      ./mprotect -t 8           14ns		16ns
      ./mprotect -t 16          28ns		26ns
      ./mprotect -t 32          54ns		51ns
      ./mprotect -t 128         200ns		199ns
      
      Single process with sequencial flushing and memory accessing:
      
      		       	with patch   unpatched 3.4-rc4
      ./mprotect		    7ns			11ns
      ./mprotect -p 4096  -l 8 -n 10240
      			    21ns		21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.comSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      e7b52ffd
    • A
      x86/tlb_info: get last level TLB entry number of CPU · e0ba94f1
      Alex Shi 提交于
      For 4KB pages, x86 CPU has 2 or 1 level TLB, first level is data TLB and
      instruction TLB, second level is shared TLB for both data and instructions.
      
      For hupe page TLB, usually there is just one level and seperated by 2MB/4MB
      and 1GB.
      
      Although each levels TLB size is important for performance tuning, but for
      genernal and rude optimizing, last level TLB entry number is suitable. And
      in fact, last level TLB always has the biggest entry number.
      
      This patch will get the biggest TLB entry number and use it in furture TLB
      optimizing.
      
      Accroding Borislav's suggestion, except tlb_ll[i/d]_* array, other
      function and data will be released after system boot up.
      
      For all kinds of x86 vendor friendly, vendor specific code was moved to its
      specific files.
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-2-git-send-email-alex.shi@intel.comSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      e0ba94f1
  13. 27 6月, 2012 6 次提交
  14. 26 6月, 2012 1 次提交
  15. 25 6月, 2012 2 次提交
    • C
      x86/uv: Work around UV2 BAU hangs · 8b6e511e
      Cliff Wickman 提交于
      On SGI's UV2 the BAU (Broadcast Assist Unit) driver can hang
      under a heavy load. To cure this:
      
      - Disable the UV2 extended status mode (see UV2_EXT_SHFT), as
        this mode changes BAU behavior in more ways then just delivering
        an extra bit of status.  Revert status to just two meaningful bits,
        like UV1.
      
      - Use no IPI-style resets on UV2.  Just give up the request for
        whatever the reason it failed and let it be accomplished with
        the legacy IPI method.
      
      - Use no alternate sending descriptor (the former UV2 workaround
        bcp->using_desc and handle_uv2_busy() stuff).  Just disable the
        use of the BAU for a period of time in favor of the legacy IPI
        method when the h/w bug leaves a descriptor busy.
      
        -- new tunable: giveup_limit determines the threshold at which a hub is
           so plugged that it should do all requests with the legacy IPI method for a
           period of time
        -- generalize disable_for_congestion() (renamed disable_for_period()) for
           use whenever a hub should avoid using the BAU for a period of time
      
      Also:
      
       - Fix find_another_by_swack(), which is part of the UV2 bug workaround
      
       - Correct and clarify the statistics (new stats s_overipilimit, s_giveuplimit,
         s_enters, s_ipifordisabled, s_plugged, s_congested)
      Signed-off-by: NCliff Wickman <cpw@sgi.com>
      Link: http://lkml.kernel.org/r/20120622131459.GC31884@sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8b6e511e
    • C
      x86/uv: Implement UV BAU runtime enable and disable control via /proc/sgi_uv/ · 26ef8577
      Cliff Wickman 提交于
      This patch enables the BAU to be turned on or off dynamically.
      
        echo "on"  > /proc/sgi_uv/ptc_statistics
        echo "off" > /proc/sgi_uv/ptc_statistics
      
      The system may be booted with or without the nobau option.
      
      Whether the system currently has the BAU off can be seen in
      the /proc file -- normally with the baustats script.
      Each cpu will have a 1 in the bauoff field if the BAU was turned
      off, so baustats will give a count of cpus that have it off.
      Signed-off-by: NCliff Wickman <cpw@sgi.com>
      Link: http://lkml.kernel.org/r/20120622131330.GB31884@sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      26ef8577