1. 28 June 2012 (5 commits)
• x86/tlb: add tlb_flushall_shift knob into debugfs · 3df3212f
Committed by Alex Shi
The kernel will replace a CR3 rewrite with invlpg when
  tlb_flush_entries <= active_tlb_entries / 2^tlb_flushall_factor
If tlb_flushall_factor is -1, the kernel will not do this replacement.
      
Users can tune the value via this knob to match a specific CPU or workload.
      
Thanks to Borislav for providing the help message for
CONFIG_DEBUG_TLBFLUSH.
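
Below is a condensed sketch of how such a debugfs knob is typically wired
up; the handler names and the use of arch_debugfs_dir are illustrative
assumptions, not a verbatim copy of the patch:

#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/uaccess.h>

extern struct dentry *arch_debugfs_dir;	/* /sys/kernel/debug/x86 */
extern int tlb_flushall_shift;		/* -1 disables the invlpg path */

static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
				  size_t count, loff_t *ppos)
{
	char buf[32];
	unsigned int len;

	/* Report the current shift value. */
	len = sprintf(buf, "%d\n", tlb_flushall_shift);
	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
}

static ssize_t tlbflush_write_file(struct file *file,
				   const char __user *user_buf,
				   size_t count, loff_t *ppos)
{
	char buf[32];
	size_t len;
	int shift;

	len = min(count, sizeof(buf) - 1);
	if (copy_from_user(buf, user_buf, len))
		return -EFAULT;
	buf[len] = '\0';

	if (kstrtoint(buf, 0, &shift))
		return -EINVAL;

	/* Takes effect on the next flush_tlb_range() decision. */
	tlb_flushall_shift = shift;
	return count;
}

static const struct file_operations fops_tlbflush = {
	.read	= tlbflush_read_file,
	.write	= tlbflush_write_file,
	.llseek	= default_llseek,
};

static int __init create_tlb_flushall_shift(void)
{
	debugfs_create_file("tlb_flushall_shift", S_IRUSR | S_IWUSR,
			    arch_debugfs_dir, NULL, &fops_tlbflush);
	return 0;
}
late_initcall(create_tlb_flushall_shift);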
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-6-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
• x86/tlb: add tlb_flushall_shift for specific CPU · c4211f42
Committed by Alex Shi
Testing shows that different CPU types (microarchitectures and NUMA modes)
have different balance points between a full TLB flush and multiple invlpg
operations, and there are also cases where the invlpg change does not help
at all.
      
This patch gives x86 vendor developers an interface for setting a
different shift for each CPU type.
      
For example, on the machines at hand the balance point is 16 entries on
Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 entries on an IVB
mobile CPU, while on a model 15 Core2 Xeon using invlpg does not help
at all.
      
For untested machines, apply a conservative setting: the same as for NHM CPUs.
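
A sketch of such a vendor hook follows. The shift values are inferred from
the balance points quoted above assuming a 512-entry last-level TLB
(512 >> 6 = 8, 512 >> 5 = 16, 512 >> 1 = 256); treat the model numbers and
values as illustrative rather than the authoritative upstream table:

int tlb_flushall_shift __read_mostly = 6;	/* conservative default */

static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
{
	switch ((c->x86 << 8) + c->x86_model) {
	case 0x60f:	/* model 15 Core2 Xeon: invlpg never helps */
		tlb_flushall_shift = -1;
		break;
	case 0x61a:	/* Bloomfield NHM-EP: balance point at 8 entries */
		tlb_flushall_shift = 6;
		break;
	case 0x62d:	/* Romley-EP (SNB-EP): balance point at 16 entries */
		tlb_flushall_shift = 5;
		break;
	case 0x63a:	/* IVB: balance point at 256 entries */
		tlb_flushall_shift = 1;
		break;
	default:	/* untested models: be conservative, same as NHM */
		tlb_flushall_shift = 6;
		break;
	}
}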
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
• x86/tlb: fall back to flush all when meet a THP large page · d8dfe60d
Committed by Alex Shi
We do not need to flush large pages in PAGE_SIZE steps; that just wastes
time. In fact, according to our micro-benchmark, large pages do not
benefit from the 'invlpg' optimization at all, so flushing the whole TLB
is enough for them.
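
A sketch of the detection this fallback relies on is below: walk the range
at the PMD level and bail out to a full flush as soon as a large mapping
is found. The helper name follows the patch's intent and the walk is
simplified, so treat it as illustrative:

static inline int has_large_page(struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	unsigned long addr = ALIGN(start, HPAGE_SIZE);

	/* Only huge-page-aligned addresses can begin a large mapping. */
	for (; addr < end; addr += HPAGE_SIZE) {
		pgd = pgd_offset(mm, addr);
		if (pgd_none(*pgd))
			continue;
		pud = pud_offset(pgd, addr);
		if (pud_none(*pud))
			continue;
		pmd = pmd_offset(pud, addr);
		if (!pmd_none(*pmd) && pmd_large(*pmd))
			return 1;	/* THP/huge mapping: flush all */
	}
	return 0;
}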
      
The following results were measured on a 2-socket * 4-core * 2-HT NHM-EP
machine with THP set to 'always'.
      
Multi-threaded testing; the '-t' parameter is the thread count:
                        without this patch   with this patch
./mprotect -t 1         14ns                 13ns
./mprotect -t 2         13ns                 13ns
./mprotect -t 4         12ns                 11ns
./mprotect -t 8         14ns                 10ns
./mprotect -t 16        28ns                 28ns
./mprotect -t 32        54ns                 52ns
./mprotect -t 128       200ns                200ns
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-4-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
• x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
Committed by Alex Shi
x86 has no instruction-level support for flush_tlb_range; currently
flush_tlb_range is simply implemented by flushing the entire TLB. That is
not the best solution for all scenarios. If we instead use 'invlpg' to
flush only a few TLB entries, later accesses can still benefit from the
entries that remain.
      
But the 'invlpg' instruction itself is costly: its execution time rivals
that of a CR3 rewrite, and on SNB CPUs it is even a bit higher.
      
So, on a CPU with 512 4KB TLB entries, the balance point is at:
	(512 - X) * 100ns (assumed TLB refill cost) =
		X (TLB flush entries) * 100ns (assumed invlpg cost)

which gives X = 256, i.e. half of the 512 entries.
      
But with the CPU's prefetcher and page-miss-handler unit, the actual TLB
refill cost is far lower than 100ns for sequential access, and two HT
siblings in one core make memory access even faster when they touch the
same memory. So the patch only switches to invlpg when the number of
target entries is less than 1/16 of all active TLB entries (see the
sketch below). Admittedly, I have no data supporting the '1/16' ratio,
so any suggestions are welcome.
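
A sketch of the resulting flush decision is below; tlb_entries stands for
the CPU's last-level TLB size, only the local flush is shown, and the
names are illustrative rather than the exact upstream ones:

#define FLUSHALL_BAR	16	/* the 1/16 threshold discussed above */

extern unsigned long tlb_entries;	/* last-level TLB size of this CPU */

void flush_tlb_range(struct vm_area_struct *vma,
		     unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long addr, act_entries;

	/* The active entries cannot exceed the task's mapped pages. */
	act_entries = min(mm->total_vm, tlb_entries);

	if ((end - start) >> PAGE_SHIFT > act_entries / FLUSHALL_BAR) {
		local_flush_tlb();	/* CR3 rewrite: flush everything */
	} else {
		for (addr = start; addr < end; addr += PAGE_SIZE)
			__flush_tlb_single(addr);	/* invlpg per page */
	}
}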
      
As for hugetlb, presumably due to the smaller page tables and fewer active
TLB entries, I saw no benefit in my benchmark, so it is not optimized for now.
      
My micro-benchmark shows that in the ideal scenario read performance
improves by 70 percent, while in the worst scenario read/write performance
is similar to the unpatched 3.4-rc4 kernel.
      
Here are the read numbers from my 2P * 4-core * HT NHM-EP machine, with
THP set to 'always':
      
Multi-threaded testing; the '-t' parameter is the thread count:
                        with patch   unpatched 3.4-rc4
./mprotect -t 1         14ns         24ns
./mprotect -t 2         13ns         22ns
./mprotect -t 4         12ns         19ns
./mprotect -t 8         14ns         16ns
./mprotect -t 16        28ns         26ns
./mprotect -t 32        54ns         51ns
./mprotect -t 128       200ns        199ns
      
Single process with sequential flushing and memory access:
      
                                    with patch   unpatched 3.4-rc4
./mprotect                          7ns          11ns
./mprotect -p 4096 -l 8 -n 10240    21ns         21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
• x86/tlb_info: get last level TLB entry number of CPU · e0ba94f1
Committed by Alex Shi
For 4KB pages, x86 CPUs have one or two TLB levels: the first level
consists of separate data and instruction TLBs, and the second level is a
single TLB shared by data and instructions.

For huge pages there is usually just one TLB level, separated by page
size into 2MB/4MB and 1GB entries.
      
Although the size of each TLB level matters for fine-grained performance
tuning, the last-level TLB entry count is good enough for a general,
coarse optimization, and in fact the last level always has the largest
entry count.
      
This patch obtains the largest TLB entry count and uses it in future TLB
optimizations.
      
Following Borislav's suggestion, everything other than the tlb_ll[i/d]_*
arrays, i.e. the remaining functions and data, is released after system
boot.
      
To stay friendly to every x86 vendor, vendor-specific code was moved into
the corresponding vendor files (see the sketch below).
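
Below is a sketch of the shape of the detection path: a per-vendor
callback fills the last-level entry counts, and only the arrays outlive
boot. The array names follow the commit text; the enum index, hook name,
and print format are illustrative:

enum tlb_infos { ENTRIES, NR_INFO };

u16 __read_mostly tlb_lli_4k[NR_INFO];	/* last-level iTLB, 4KB pages */
u16 __read_mostly tlb_lld_4k[NR_INFO];	/* last-level dTLB, 4KB pages */

void __cpuinit cpu_detect_tlb(struct cpuinfo_x86 *c)
{
	/* this_cpu is the vendor descriptor chosen during CPU
	 * identification; vendor-specific parsing (e.g. CPUID leaf 2 on
	 * Intel) lives in the vendor file and fills the arrays. */
	if (this_cpu->c_detect_tlb)
		this_cpu->c_detect_tlb(c);

	pr_info("Last level iTLB entries: 4KB %d; dTLB entries: 4KB %d\n",
		tlb_lli_4k[ENTRIES], tlb_lld_4k[ENTRIES]);
}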
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-2-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2. 14 June 2012 (2 commits)
• x86: Add read_mostly declaration/definition to variables from smp.h · 0816b0f0
Committed by Vlad Zolotarov
      Add "read-mostly" qualifier to the following variables in
      smp.h:
      
       - cpu_sibling_map
       - cpu_core_map
       - cpu_llc_shared_map
       - cpu_llc_id
       - cpu_number
       - x86_cpu_to_apicid
       - x86_bios_cpu_apicid
       - x86_cpu_to_logical_apicid
      
Since all of the variables above are written only during
initialization, this change is meant to prevent false
sharing. More specifically, on the vSMP Foundation platform,
x86_cpu_to_apicid shared an internode cache line with the
frequently written lapic_events.
      
An analysis of the first 33 of the 219 per_cpu variables (more
precisely, of the memory they describe) found that 8 are read-mostly
in nature (tlb_vector_offset, cpu_loops_per_jiffy, xen_debug_irq,
etc.) and 25 are frequently written (irq_stack_union, gdt_page,
exception_stacks, idt_desc, etc.).
      
Assuming that the rest of the per_cpu variables are spread similarly,
marking the read-mostly memory makes more sense for long-term code
maintenance than marking the frequently written memory; an example of
the change follows below.
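
For one of the listed variables, the change looks like this; the data and
its uses are untouched, only the section annotation moves it into the
per-cpu read-mostly section so it no longer shares a cache line with hot
writers (DECLARE_PER_CPU_READ_MOSTLY is the standard percpu-defs.h macro):

/* before */
DECLARE_PER_CPU(u16, cpu_llc_id);

/* after */
DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);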
Signed-off-by: Vlad Zolotarov <vlad@scalemp.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
      Cc: Shai Fultheim (Shai@ScaleMP.com) <Shai@scalemp.com>
      Cc: ido@wizery.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1719258.EYKzE4Zbq5@vlad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• x86: Define early read-mostly per-cpu macros · c35f7741
Committed by Ido Yariv
      Some read-mostly per-cpu data may need to be declared or defined
      early, so it can be initialized and accessed before per_cpu
      areas are allocated.
      
      Only the data that resides in the per_cpu areas should be
      read-mostly, as there is little benefit in optimizing cache
      lines on initialization.
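
A sketch of the new macro pair, modeled on the existing
DEFINE_EARLY_PER_CPU() pattern, is below: the early map stays in
__initdata (there is no read-mostly benefit during init), while the real
per-cpu copy is placed read-mostly. Treat the exact expansion as
illustrative:

#define DECLARE_EARLY_PER_CPU_READ_MOSTLY(_type, _name)			\
	DECLARE_PER_CPU_READ_MOSTLY(_type, _name);			\
	extern __typeof__(_type) _name##_early_map[];			\
	extern __typeof__(_type) *_name##_early_ptr

#define DEFINE_EARLY_PER_CPU_READ_MOSTLY(_type, _name, _initvalue)	\
	DEFINE_PER_CPU_READ_MOSTLY(_type, _name) = _initvalue;		\
	__typeof__(_type) _name##_early_map[NR_CPUS] __initdata =	\
		{ [0 ... NR_CPUS-1] = _initvalue };			\
	__typeof__(_type) *_name##_early_ptr __refdata = _name##_early_map

/* Usage, e.g. for an APIC ID map accessed before per-cpu areas exist: */
DEFINE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid, BAD_APICID);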
Signed-off-by: Ido Yariv <ido@wizery.com>
[ Added the missing declarations in !SMP code. ]
Signed-off-by: Vlad Zolotarov <vlad@scalemp.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/46188571.ddB8aVQYWo@vlad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
3. 08 June 2012 (4 commits)
4. 06 June 2012 (17 commits)
5. 02 June 2012 (12 commits)