1. 05 Jun, 2013 1 commit
  2. 01 Feb, 2013 1 commit
  3. 30 Nov, 2012 1 commit
  4. 21 Jul, 2012 1 commit
  5. 28 Jun, 2012 3 commits
    • x86/tlb: do flush_tlb_kernel_range by 'invlpg' · effee4b9
      Committed by Alex Shi
      This patch implements flush_tlb_kernel_range with 'invlpg'. The performance
      cost and gain were analyzed in the previous patch
      (x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).
      
      In testing (http://lkml.org/lkml/2012/6/21/10), the cost is mostly hidden
      by the long kernel path, but the gain is still quite clear: memory access
      in a user application can speed up by more than 30% while the kernel
      executes this function.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
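The page-by-page idea in the commit above can be sketched in user space. This is only an illustration, not the kernel's code: `demo_flush_range_by_page`, `demo_count_flush`, and `DEMO_PAGE_SIZE` are hypothetical names, 4 KB pages are assumed, and the privileged 'invlpg' instruction is replaced by a callback so the loop can run outside the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL /* assumption: 4 KB pages */

/* Hypothetical stand-in for a single-page flush ('invlpg' in the kernel). */
typedef void (*demo_flush_one_fn)(uintptr_t addr, void *ctx);

/*
 * Call flush_one once for every page that overlaps [start, end).
 * Returns the number of pages visited. Address wrap-around at the top
 * of the address space is not handled in this sketch.
 */
size_t demo_flush_range_by_page(uintptr_t start, uintptr_t end,
                                demo_flush_one_fn flush_one, void *ctx)
{
    size_t pages = 0;
    /* Round the start address down to its page boundary. */
    uintptr_t addr = start & ~(uintptr_t)(DEMO_PAGE_SIZE - 1);

    for (; addr < end; addr += DEMO_PAGE_SIZE) {
        flush_one(addr, ctx);
        pages++;
    }
    return pages;
}

/* Example callback: counts calls instead of touching the TLB. */
void demo_count_flush(uintptr_t addr, void *ctx)
{
    (void)addr;
    (*(size_t *)ctx)++;
}
```

Passing the flush operation as a callback keeps the range walk testable; in the kernel the loop body would be the real single-page flush.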
    • x86/tlb: enable tlb flush range support for x86 · 611ae8e3
      Committed by Alex Shi
      Not every TLB flush really needs to evict all TLB entries. In munmap,
      for example, a few 'invlpg' instructions are better for overall process
      performance, since they leave most TLB entries in place for later
      accesses.
      
      This patch also rewrites flush_tlb_range for two purposes:
      1. Split it out to get the flush_tlb_mm_range function.
      2. Clean up to reduce line breaking, thanks to Borislav's input.
      
      My micro benchmark 'mummap' (http://lkml.org/lkml/2012/5/17/59)
      shows that random memory access on other CPUs gets a 0~50% speedup
      on a 2P * 4 cores * HT NHM EP machine while doing 'munmap'.
      
      Thanks to Yongjie for testing this patch:
      -------------
      I used Linux 3.4-RC6 with and without his patches as the Xen dom0 and
      guest kernel.
      After running two benchmarks in a Xen HVM guest, I found that his patches
      brought about a 1%~3% performance gain in 'kernel build' and 'netperf'
      testing, though the gain was not very stable in 'kernel build'
      testing.
      
      Some detailed testing results are below.
      
      Testing Environment:
      	Hardware: Romley-EP platform
      	Xen version: latest upstream
      	Linux kernel: 3.4-RC6
      	Guest vCPU number: 8
      	NIC: Intel 82599 (10 Gb/s bandwidth)
      
      In 'kernel build' testing in the guest:
      	Command line  |  performance gain
      	make -j 4     |    3.81%
      	make -j 8     |    0.37%
      	make -j 16    |    -0.52%
      
      In 'netperf' testing, we tested TCP_STREAM with the default socket size of
      16384 bytes as the large packet and 64 bytes as the small packet.
      I used several clients to add networking pressure; the 'netperf' server
      then automatically spawned several threads to respond to them.
      	Packet size  |  Thread number | performance gain
      	16384 bytes  |      4       |   0.02%
      	16384 bytes  |      8       |   2.21%
      	16384 bytes  |      16      |   2.04%
      	64 bytes     |      4       |   1.07%
      	64 bytes     |      8       |   3.31%
      	64 bytes     |      16      |   0.71%
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
      Tested-by: Ren, Yongjie <yongjie.ren@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Committed by Alex Shi
      x86 has no flush_tlb_range support at the instruction level. Currently
      flush_tlb_range is implemented by flushing the whole TLB. That is not
      the best solution for all scenarios. In fact, if we use 'invlpg' to
      flush just a few entries from the TLB, we gain performance from later
      accesses that hit the remaining TLB entries.
      
      But the 'invlpg' instruction is expensive. Its execution time is
      comparable to a cr3 rewrite, and even a bit higher on SNB CPUs.
      
      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns (assumed TLB refill cost) =
      		X (TLB entries flushed) * 100ns (assumed invlpg cost)
      
      Here, X is 256, that is, 1/2 of the 512 entries.
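The balance point can be solved for X in general: X = entries * refill / (refill + invlpg). The helper below is a minimal sketch of that arithmetic; `demo_break_even_entries` is a hypothetical name, and the costs are the assumed per-entry figures from the message, not measurements.

```c
/*
 * Solve (entries - X) * refill_ns = X * invlpg_ns for X, rounding down.
 * Both costs are assumed per-entry costs in nanoseconds.
 * Returns 0 when both costs are zero to avoid dividing by zero.
 */
unsigned long demo_break_even_entries(unsigned long tlb_entries,
                                      unsigned long refill_ns,
                                      unsigned long invlpg_ns)
{
    unsigned long total_ns = refill_ns + invlpg_ns;

    if (total_ns == 0)
        return 0;
    return tlb_entries * refill_ns / total_ns;
}
```

With 512 entries and 100ns for both costs this gives 256, matching the figure above. A lower refill cost moves the balance point down, for example 25ns against 100ns gives 102 entries.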
      
      But with the mysterious CPU prefetcher and page miss handler unit, the
      actual TLB refill cost is far lower than 100ns for sequential access. And
      two HT siblings in one core make memory access even faster if they are
      accessing the same memory. So, in this patch, I only make the change when
      the number of target entries is less than 1/16 of all active TLB entries.
      Actually, I have no data to support the '1/16' ratio, so any
      suggestions are welcome.
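The 1/16 rule described above can be expressed as a small predicate. This is a sketch of the decision only, not the patch's actual code: `demo_use_single_flush` is a hypothetical name, and expressing the ratio as a right shift (4 for 1/16) is an assumption made for illustration.

```c
#include <stdbool.h>

/*
 * Return true when flushing page by page is expected to pay off, that is,
 * when the number of pages to flush is less than active_tlb_entries / 2^shift.
 * A shift of 4 corresponds to the 1/16 ratio from the commit message.
 */
bool demo_use_single_flush(unsigned long pages_to_flush,
                           unsigned long active_tlb_entries,
                           unsigned int shift)
{
    return pages_to_flush < (active_tlb_entries >> shift);
}
```

For 512 active entries and a shift of 4 the cutoff is 32 pages: 31 pages would be flushed one by one, while 32 or more would fall back to a full flush.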
      
      As for hugetlb, I saw no benefit in my benchmark, I guess because of the
      smaller page table and fewer active TLB entries, so there is no
      optimization for it now.
      
      My micro benchmark shows that in ideal scenarios, read performance
      improves by 70 percent. In the worst scenario, read/write
      performance is similar to the unpatched 3.4-rc4 kernel.
      
      Here is the read data on my 2P * 4 cores * HT NHM EP machine, with THP
      'always':
      
      Multi-thread testing; the '-t' parameter is the thread count:
                              with patch   unpatched 3.4-rc4
      ./mprotect -t 1         14ns         24ns
      ./mprotect -t 2         13ns         22ns
      ./mprotect -t 4         12ns         19ns
      ./mprotect -t 8         14ns         16ns
      ./mprotect -t 16        28ns         26ns
      ./mprotect -t 32        54ns         51ns
      ./mprotect -t 128       200ns        199ns
      
      Single process with sequential flushing and memory accessing:
      
                                           with patch   unpatched 3.4-rc4
      ./mprotect                           7ns          11ns
      ./mprotect -p 4096  -l 8 -n 10240    21ns         21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  6. 18 May, 2012 1 commit
  7. 15 May, 2012 1 commit
  8. 29 Mar, 2012 1 commit
  9. 21 Oct, 2010 1 commit
  10. 12 Jun, 2009 1 commit
    • x86: make zap_low_mapping could be used early · 55cd6367
      Committed by Yinghai Lu
      Only one CPU is running at this point, so just call __flush_tlb for it.
      This fixes the following boot warning on x86:
      
        [    0.000000] Memory: 885032k/915540k available (5993k kernel code, 29844k reserved, 3842k data, 428k init, 0k highmem)
        [    0.000000] virtual kernel memory layout:
        [    0.000000]     fixmap  : 0xffe17000 - 0xfffff000   (1952 kB)
        [    0.000000]     vmalloc : 0xf8615000 - 0xffe15000   ( 120 MB)
        [    0.000000]     lowmem  : 0xc0000000 - 0xf7e15000   ( 894 MB)
        [    0.000000]       .init : 0xc19a5000 - 0xc1a10000   ( 428 kB)
        [    0.000000]       .data : 0xc15da4bb - 0xc199af6c   (3842 kB)
        [    0.000000]       .text : 0xc1000000 - 0xc15da4bb   (5993 kB)
        [    0.000000] Checking if this processor honours the WP bit even in supervisor mode...Ok.
        [    0.000000] ------------[ cut here ]------------
        [    0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x50/0x1b0()
        [    0.000000] Hardware name: System Product Name
        [    0.000000] Modules linked in:
        [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip #52504
        [    0.000000] Call Trace:
        [    0.000000]  [<c104aa16>] warn_slowpath_common+0x65/0x95
        [    0.000000]  [<c104aa58>] warn_slowpath_null+0x12/0x15
        [    0.000000]  [<c1073bbe>] smp_call_function_many+0x50/0x1b0
        [    0.000000]  [<c1037615>] ? do_flush_tlb_all+0x0/0x41
        [    0.000000]  [<c1037615>] ? do_flush_tlb_all+0x0/0x41
        [    0.000000]  [<c1073d4f>] smp_call_function+0x31/0x58
        [    0.000000]  [<c1037615>] ? do_flush_tlb_all+0x0/0x41
        [    0.000000]  [<c104f635>] on_each_cpu+0x26/0x65
        [    0.000000]  [<c10374b5>] flush_tlb_all+0x19/0x1b
        [    0.000000]  [<c1032ab3>] zap_low_mappings+0x4d/0x56
        [    0.000000]  [<c15d64b5>] ? printk+0x14/0x17
        [    0.000000]  [<c19b42a8>] mem_init+0x23d/0x245
        [    0.000000]  [<c19a56a1>] start_kernel+0x17a/0x2d5
        [    0.000000]  [<c19a5347>] ? unknown_bootoption+0x0/0x19a
        [    0.000000]  [<c19a5039>] __init_begin+0x39/0x41
        [    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  11. 24 Apr, 2009 1 commit
  12. 22 Apr, 2009 1 commit
    • FRV: Fix the section attribute on UP DECLARE_PER_CPU() · 9b8de747
      Committed by David Howells
      In non-SMP mode, the variable section attribute specified by DECLARE_PER_CPU()
      does not agree with that specified by DEFINE_PER_CPU().  This means that
      architectures that reference a small data section relative to a base
      register may throw up linkage errors due to too great a displacement between
      where the base register points and the per-CPU variable.
      
      On FRV, the .h declaration says that the variable is in the .sdata section, but
      the .c definition says it's actually in the .data section.  The linker throws
      up the following errors:
      
      kernel/built-in.o: In function `release_task':
      kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o
      kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o
      
      To fix this, DECLARE_PER_CPU() should simply apply the same section attribute
      as does DEFINE_PER_CPU().  However, this is made slightly more complex by
      virtue of the fact that there are several variants on DEFINE, so these need to
      be matched by variants on DECLARE.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 18 Jan, 2009 1 commit
  14. 12 Jan, 2009 1 commit
    • x86: change flush_tlb_others to take a const struct cpumask · 4595f962
      Committed by Rusty Russell
      Impact: reduce stack usage, use new cpumask API.
      
      This is made a little more tricky by uv_flush_tlb_others which
      actually alters its argument, for an IPI to be sent to the remaining
      cpus in the mask.
      
      I solve this by allocating a cpumask_var_t for this case and falling back
      to IPI should this fail.
      
      To eliminate temporaries in the caller, all flush_tlb_others implementations
      now do the this-cpu-elimination step themselves.
      
      Note also the curious "cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask)"
      which has been there since pre-git and yet f->flush_cpumask is always zero
      at this point.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Mike Travis <travis@sgi.com>
  15. 07 Jan, 2009 1 commit
  16. 23 Oct, 2008 2 commits
  17. 05 Sep, 2008 1 commit
  18. 23 Jul, 2008 1 commit
    • x86: consolidate header guards · 77ef50a5
      Committed by Vegard Nossum
      This patch is the result of an automatic script that consolidates the
      format of all the headers in include/asm-x86/.
      
      The format:
      
      1. No leading underscore. Names with leading underscores are reserved.
      2. Pathname components are separated by two underscores. So we can
         distinguish between mm_types.h and mm/types.h.
      3. Everything except letters and numbers is turned into a single
         underscore.
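The three rules above can be sketched as a small path-to-guard conversion. This is not the script the commit used: `demo_header_guard` is a hypothetical helper, the input is assumed to be a path relative to the include directory, and upper-casing the result is an assumption that the rules do not state.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Build a header guard name from a path such as "asm-x86/mm_types.h".
 * Rule 1: no leading underscore (leading non-alphanumerics are skipped).
 * Rule 2: each '/' between path components becomes two underscores.
 * Rule 3: any other non-alphanumeric character becomes one underscore.
 * Letters are upper-cased (assumption). Returns 0 on success, -1 if the
 * output buffer is too small.
 */
int demo_header_guard(const char *path, char *out, size_t out_size)
{
    size_t n = 0;

    while (*path && !isalnum((unsigned char)*path))
        path++;

    for (; *path; path++) {
        unsigned char c = (unsigned char)*path;
        size_t need = (c == '/') ? 2 : 1;

        /* Keep room for the characters written now plus the final NUL. */
        if (n + need >= out_size)
            return -1;

        if (c == '/') {
            out[n++] = '_';
            out[n++] = '_';
        } else if (isalnum(c)) {
            out[n++] = (char)toupper(c);
        } else {
            out[n++] = '_';
        }
    }

    if (n >= out_size)
        return -1;
    out[n] = '\0';
    return 0;
}
```

Under these assumptions "asm-x86/mm_types.h" maps to ASM_X86__MM_TYPES_H and "asm-x86/mm/types.h" maps to ASM_X86__MM__TYPES_H, which shows how the double underscore keeps the two paths distinct.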
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
  19. 24 May, 2008 1 commit
  20. 17 Apr, 2008 1 commit
  21. 30 Jan, 2008 1 commit
  22. 11 Oct, 2007 1 commit