1. 25 Jan, 2014 3 commits
  2. 12 Sep, 2013 2 commits
    • mm: vmstats: track TLB flush stats on UP too · 6df46865
      Committed by Dave Hansen
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      
      UP systems do not do remote TLB flushes, so compile those counters out on
      UP.
      
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmstats: tlb flush counters · 9824cf97
      Committed by Dave Hansen
      I was investigating some TLB flush scaling issues and realized that we do
      not have any good methods for figuring out how many TLB flushes we are
      doing.
      
      It would be nice to be able to do these in generic code, but the
      arch-independent calls don't explicitly specify whether we actually need
      to do remote flushes or not.  In the end, we really need to know if we
      actually _did_ global vs.  local invalidations, so that leaves us with few
      options other than to muck with the counters from arch-specific code.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 25 Jan, 2013 1 commit
  4. 30 Nov, 2012 1 commit
  5. 15 Nov, 2012 1 commit
  6. 28 Sep, 2012 1 commit
  7. 07 Sep, 2012 1 commit
  8. 28 Jun, 2012 7 commits
    • x86/tlb: do flush_tlb_kernel_range by 'invlpg' · effee4b9
      Committed by Alex Shi
      This patch implements flush_tlb_kernel_range with 'invlpg'. The
      performance cost and gain were analyzed in the previous patch
      (x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).
      
      In the testing: http://lkml.org/lkml/2012/6/21/10
      
      The cost is mostly hidden by the long kernel path, but the gain is still
      quite clear: memory access in a user application can speed up by 30+%
      when the kernel executes this function.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR · 52aec330
      Committed by Alex Shi
      There are currently 32 INVALIDATE_TLB_VECTOR entries in the kernel. That
      is a large number of vectors in the IDT, but it is still not enough,
      since modern x86 servers have more CPUs. That still causes heavy lock
      contention in TLB flushing.
      
      This patch uses the generic SMP call function to replace them. That
      saves 32 vector numbers in the IDT and resolves the lock contention in
      TLB flushing on large systems.
      
      On an NHM EX machine (4P * 8cores * HT = 64 CPUs), hackbench pthread
      shows a 3% performance increase.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/tlb: enable tlb flush range support for x86 · 611ae8e3
      Committed by Alex Shi
      Not every TLB flush really needs to evict all TLB entries. In munmap,
      for example, a few 'invlpg' instructions are better for overall
      process performance, since they leave most TLB entries in place for
      later accesses.
      
      This patch also rewrites flush_tlb_range for 2 purposes:
      1, split it out to get the flush_tlb_mm_range function.
      2, clean up to reduce line breaking, thanks to Borislav's input.
      
      My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
      shows that random memory access on other CPUs gets a 0~50% speedup
      on a 2P * 4cores * HT NHM EP while doing 'munmap'.
      
      Thanks to Yongjie for testing this patch:
      -------------
      I used Linux 3.4-RC6 w/ and w/o his patches as Xen dom0 and guest
      kernel.
      After running two benchmarks in Xen HVM guest, I found his patches
      brought about 1%~3% performance gain in 'kernel build' and 'netperf'
      testing, though the performance gain was not very stable in 'kernel
      build' testing.
      
      Some detailed testing results are below.
      
      Testing Environment:
      	Hardware: Romley-EP platform
      	Xen version: latest upstream
      	Linux kernel: 3.4-RC6
      	Guest vCPU number: 8
      	NIC: Intel 82599 (10GB bandwidth)
      
      In 'kernel build' testing in guest:
          Command line  |  performance gain
          make -j 4     |    3.81%
          make -j 8     |    0.37%
          make -j 16    |   -0.52%
      
      In 'netperf' testing, we tested TCP_STREAM with the default socket size
      of 16384 bytes as the large packet and 64 bytes as the small packet.
      I used several clients to add networking pressure, then the 'netperf'
      server automatically generated several threads to respond to them.
      Both the large and the small packet sizes were used in the testing.
          Packet size  |  Thread number  |  performance gain
          16384 bytes  |       4         |    0.02%
          16384 bytes  |       8         |    2.21%
          16384 bytes  |       16        |    2.04%
          64 bytes     |       4         |    1.07%
          64 bytes     |       8         |    3.31%
          64 bytes     |       16        |    0.71%
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
      Tested-by: Ren, Yongjie <yongjie.ren@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/tlb: add tlb_flushall_shift knob into debugfs · 3df3212f
      Committed by Alex Shi
      The kernel will replace the cr3 rewrite with invlpg when
        tlb_flush_entries <= active_tlb_entries / 2^tlb_flushall_factor
      If tlb_flushall_factor is -1, the kernel won't do this replacement.
      
      Users can modify its value according to the specific CPU/applications.
      
      Thanks to Borislav for providing the help message of
      CONFIG_DEBUG_TLBFLUSH.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-6-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
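The threshold rule quoted in the commit message above can be sketched in C. This is only an illustration under stated assumptions, not the kernel's implementation: the helper name `use_invlpg` and its parameters are hypothetical, and the `shift` argument stands in for the value the message calls `tlb_flushall_factor`.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not kernel code: decide whether to flush page by
 * page with invlpg (true) or to rewrite cr3 and flush everything (false).
 * Rule from the commit message:
 *   flush_entries <= active_entries / 2^shift
 * A negative shift (the message uses -1) disables the invlpg path.
 */
static bool use_invlpg(unsigned long flush_entries,
                       unsigned long active_entries,
                       int shift)
{
    if (shift < 0)
        return false;                    /* -1: always flush the whole TLB */

    /* Shifting by the full width is undefined in C; the quotient is 0. */
    if (shift >= (int)(sizeof(unsigned long) * 8))
        return flush_entries == 0;

    return flush_entries <= (active_entries >> shift);
}
```

With 512 active entries and a shift of 4, the threshold is 512 / 16 = 32 entries, so a request to flush 32 pages would take the invlpg path and a request to flush 33 would fall back to a full flush.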
    • x86/tlb: add tlb_flushall_shift for specific CPU · c4211f42
      Committed by Alex Shi
      Testing shows that different CPU types (microarchitectures and NUMA
      modes) have different balance points between flushing the whole TLB
      and issuing multiple invlpg instructions. There are also cases where
      the TLB flush change does not help at all.
      
      This patch provides an interface that lets x86 vendor developers set a
      different shift for different CPU types.
      
      For example, on the machines I have at hand, the balance point is 16
      entries on Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 on an
      IVB mobile CPU. On a model 15 Core 2 Xeon, however, using invlpg does
      not help.
      
      For untested machines, do a conservative optimization, the same as for
      NHM CPUs.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/tlb: fall back to flush all when meet a THP large page · d8dfe60d
      Committed by Alex Shi
      We don't need to flush large pages in PAGE_SIZE steps; that just wastes
      time. In fact, large pages don't need the 'invlpg' optimization
      according to our micro benchmark. So flushing the whole TLB is enough
      for them.
      
      The following results were measured on a 2CPU * 4cores * 2HT NHM EP
      machine, with the THP 'always' setting.
      
      Multi-thread testing, the '-t' parameter is the thread number:
                              without this patch    with this patch
      ./mprotect -t 1         14ns                  13ns
      ./mprotect -t 2         13ns                  13ns
      ./mprotect -t 4         12ns                  11ns
      ./mprotect -t 8         14ns                  10ns
      ./mprotect -t 16        28ns                  28ns
      ./mprotect -t 32        54ns                  52ns
      ./mprotect -t 128       200ns                 200ns
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-4-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Committed by Alex Shi
      x86 has no instruction-level support for flush_tlb_range. Currently
      flush_tlb_range is implemented by flushing the entire TLB. That is not
      the best solution for all scenarios. In fact, if we use 'invlpg' to
      flush just a few entries from the TLB, we gain performance from later
      accesses to the remaining TLB entries.
      
      But the 'invlpg' instruction is expensive. Its execution time can
      rival a cr3 rewrite, and is even a bit higher on SNB CPUs.
      
      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns(assumed TLB refill cost) =
      		X(TLB flush entries) * 100ns(assumed invlpg cost)
      
      Here, X is 256, that is, 1/2 of the 512 entries.
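
The balance point above can be written out in general form. The symbols are introduced here for illustration only: N is the number of active TLB entries, X the number of entries flushed, and t_refill and t_invlpg the assumed per-entry costs (both 100ns in the message).

```latex
\begin{align*}
(N - X)\, t_{\mathrm{refill}} &= X\, t_{\mathrm{invlpg}} \\
X &= \frac{N\, t_{\mathrm{refill}}}{t_{\mathrm{refill}} + t_{\mathrm{invlpg}}}
   = \frac{512 \times 100\,\mathrm{ns}}{100\,\mathrm{ns} + 100\,\mathrm{ns}}
   = 256
\end{align*}
```

Under this model, a refill cost lower than the invlpg cost moves the balance point below N/2.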
      
      But with the mysterious CPU prefetcher and page miss handler unit, the
      assumed TLB refill cost is far lower than 100ns for sequential access.
      And 2 HT siblings in one core make memory access faster if they are
      accessing the same memory. So, in this patch, I only make the change
      when the number of target entries is less than 1/16 of all active TLB
      entries. Actually, I have no data to support the fraction '1/16', so
      any suggestions are welcome.
      
      As for hugetlb, presumably because of the smaller page table and fewer
      active TLB entries, I didn't see any benefit in my benchmark, so there
      is no optimization for it now.
      
      My micro benchmark shows that in ideal scenarios, read performance
      improves by 70 percent. In the worst scenario, read/write performance
      is similar to the unpatched 3.4-rc4 kernel.
      
      Here is the read data on my 2P * 4cores * HT NHM EP machine, with THP
      'always':
      
      Multi-thread testing, the '-t' parameter is the thread number:
                              with patch    unpatched 3.4-rc4
      ./mprotect -t 1         14ns          24ns
      ./mprotect -t 2         13ns          22ns
      ./mprotect -t 4         12ns          19ns
      ./mprotect -t 8         14ns          16ns
      ./mprotect -t 16        28ns          26ns
      ./mprotect -t 32        54ns          51ns
      ./mprotect -t 128       200ns         199ns
      
      Single process with sequential flushing and memory access:
      
                                           with patch    unpatched 3.4-rc4
      ./mprotect                           7ns           11ns
      ./mprotect -p 4096 -l 8 -n 10240     21ns          21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  9. 15 May, 2012 1 commit
  10. 23 Mar, 2012 1 commit
  11. 15 Mar, 2011 1 commit
  12. 14 Feb, 2011 1 commit
  13. 18 Nov, 2010 1 commit
    • x86: Use online node real index in calulate_tbl_offset() · 9223081f
      Committed by Yinghai Lu
      Found a NUMA system with no RAM installed on the first socket
      that hangs while executing init scripts.
      
      Bisected it to:
      
       | commit 93296720
       | Author: Shaohua Li <shaohua.li@intel.com>
       | Date:   Wed Oct 20 11:07:03 2010 +0800
       |
       |     x86: Spread tlb flush vector between nodes
      
      It turns out that when the first socket is not online, CPUs on
      node1 can have tlb_offset set to a value bigger than
      NUM_INVALIDATE_TLB_VECTORS.
      
      That could also affect systems with, for example, 4 sockets where
      socket 2 has no RAM installed: socket 3 will get a tlb_offset that
      is too big.
      
      Need to use the real online node idx.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <4CDEDE59.40603@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 01 Nov, 2010 1 commit
    • x86, mm: Fix section mismatch in tlb.c · cf38d0ba
      Committed by Rakib Mullick
      Mark tlb_cpuhp_notify as __cpuinit. It's basically a callback
      function, which is called from __cpuinit init_smp_flash(). So -
      it's safe.
      
      We were warned by the following warning:
      
       WARNING: arch/x86/mm/built-in.o(.text+0x356d): Section mismatch
       in reference from the function tlb_cpuhp_notify() to the
       function .cpuinit.text:calculate_tlb_offset()
       The function tlb_cpuhp_notify() references
       the function __cpuinit calculate_tlb_offset().
       This is often because tlb_cpuhp_notify lacks a __cpuinit
       annotation or the annotation of calculate_tlb_offset is wrong.
      Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <AANLkTinWQRG=HA9uB3ad0KAqRRTinL6L_4iKgF84coph@mail.gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  15. 21 Oct, 2010 1 commit
    • x86: Spread tlb flush vector between nodes · 93296720
      Committed by Shaohua Li
      Currently flush tlb vector allocation is based on the equation below:
      	sender = smp_processor_id() % 8
      This isn't optimal: CPUs from different nodes can have the same vector,
      which causes a lot of lock contention. Instead, we can assign the same
      vectors to CPUs from the same node, while different nodes get different
      vectors. This has the following advantages:
      a. if there is lock contention, it is between CPUs from one node. This
      should be much cheaper than contention between nodes.
      b. lock contention between nodes is avoided completely. This especially
      benefits kswapd, which is the biggest user of tlb flush, since kswapd
      sets its affinity to a specific node.
      
      In my test, this could reduce CPU overhead by > 20% in the extreme case.
      The test machine has 4 nodes and each node has 16 CPUs. I bound each
      node's kswapd to the first CPU of the node, then ran a workload with 4
      sequential mmap file read threads. The files are empty sparse files.
      This workload triggers a lot of page reclaim and tlb flushes. kswapd is
      bound to make it easy to trigger the extreme tlb flush lock contention,
      because otherwise kswapd keeps migrating between CPUs of a node and I
      can't get stable results. Of course, real workloads won't always see
      such heavy tlb flush lock contention, but it's possible.
      
      [ hpa: folded in fix from Eric Dumazet to use this_cpu_read() ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1287544023.4571.8.camel@sli10-conroe.sh.intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
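The difference between the two allocation schemes in the commit above can be sketched in C. This is a rough illustration, not the kernel's code: the function names, the parameters, and the way vectors are split evenly across nodes are all assumptions made for the example. Only `sender = smp_processor_id() % 8` comes from the commit message.

```c
#define NUM_INVALIDATE_TLB_VECTORS 8

/* Old scheme from the commit message: sender = smp_processor_id() % 8. */
static unsigned int vector_by_cpu(unsigned int cpu)
{
    return cpu % NUM_INVALIDATE_TLB_VECTORS;
}

/*
 * Hypothetical per-node scheme: each online node gets its own slice of
 * the vectors. node_idx is the node's index among ONLINE nodes
 * (0 .. nr_online_nodes - 1); cpu_in_node is the CPU's position within
 * its node.
 */
static unsigned int vector_by_node(unsigned int cpu,
                                   unsigned int node_idx,
                                   unsigned int cpu_in_node,
                                   unsigned int nr_online_nodes)
{
    unsigned int per_node;

    /* More nodes than vectors: fall back to the per-CPU scheme. */
    if (nr_online_nodes == 0 ||
        nr_online_nodes > NUM_INVALIDATE_TLB_VECTORS)
        return cpu % NUM_INVALIDATE_TLB_VECTORS;

    per_node = NUM_INVALIDATE_TLB_VECTORS / nr_online_nodes;
    return node_idx * per_node + cpu_in_node % per_node;
}
```

On a machine shaped like the one in the test (4 nodes of 16 CPUs, assuming CPUs 0-15 sit on node 0 and so on), the old scheme maps CPU 0 and CPU 16 to the same vector, while this sketch gives node 0 vectors 0-1 and node 1 vectors 2-3. The sketch stays in range only if `node_idx` counts online nodes: with two online nodes, passing a raw node ID of 3 would yield 3 * 4 = 12, beyond the 8 available vectors.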
  16. 22 Jul, 2010 1 commit
  17. 18 Feb, 2010 1 commit
  18. 19 Nov, 2009 1 commit
    • x86: Eliminate redundant/contradicting cache line size config options · 350f8f56
      Committed by Jan Beulich
      Rather than having X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT
      (with inconsistent defaults), just having the latter suffices as
      the former can be easily calculated from it.
      
      To be consistent, also change X86_INTERNODE_CACHE_BYTES to
      X86_INTERNODE_CACHE_SHIFT, and set it to 7 (128 bytes) for NUMA
      to account for last level cache line size (which here matters
      more than L1 cache line size).
      
      Finally, make sure the default value for X86_L1_CACHE_SHIFT,
      when X86_GENERIC is selected, is being seen before that for the
      individual CPU model options (other than on x86-64, where
      GENERIC_CPU is part of the choice construct, X86_GENERIC is a
      separate option on ix86).
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Acked-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Acked-by: Nick Piggin <npiggin@suse.de>
      LKML-Reference: <4AFD5710020000780001F8F0@vpn.id2.novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
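The reason a shift option alone suffices is that the byte count is a power of two derived from it. A minimal sketch, using a hypothetical helper name rather than any kernel macro:

```c
/*
 * Hypothetical helper: derive a cache line size in bytes from its
 * shift value. shift must be smaller than the bit width of unsigned int.
 */
static unsigned int cache_bytes_from_shift(unsigned int shift)
{
    return 1u << shift;
}
```

A shift of 7 gives 128 bytes, matching the value the commit message sets for X86_INTERNODE_CACHE_SHIFT on NUMA.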
  19. 24 Sep, 2009 1 commit
  20. 22 Aug, 2009 1 commit
    • x86: don't call '->send_IPI_mask()' with an empty mask · b04e6373
      Committed by Linus Torvalds
      As noted in 83d349f3 ("x86: don't send
      an IPI to the empty set of CPU's"), some APIC's will be very unhappy
      with an empty destination mask.  That commit added a WARN_ON() for that
      case, and avoided the resulting problem, but didn't fix the underlying
      reason for why those empty mask cases happened.
      
      This fixes that, by checking the result of 'cpumask_andnot()' of the
      current CPU actually has any other CPU's left in the set of CPU's to be
      sent a TLB flush, and not calling down to the IPI code if the mask is
      empty.
      
      The reason this started happening at all is that we started passing just
      the CPU mask pointers around in commit 4595f962 ("x86: change
      flush_tlb_others to take a const struct cpumask"), and when we did that,
      the cpumask was no longer thread-local.
      
      Before that commit, flush_tlb_mm() used to create its own copy of
      'mm->cpu_vm_mask' and pass that copy down to the low-level flush
      routines after having tested that it was not empty.  But after changing
      it to just pass down the CPU mask pointer, the lower level TLB flush
      routines would now get a pointer to that 'mm->cpu_vm_mask', and that
      could still change - and become empty - after the test due to other
      CPU's having flushed their own TLB's.
      
      See
      
      	http://bugzilla.kernel.org/show_bug.cgi?id=13933
      
      for details.
      Tested-by: Thomas Björnell <thomas.bjornell@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
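The pattern described in the commit above (compute the set of other CPUs once, then skip the IPI when that set is empty) can be sketched with a plain 64-bit mask. This is an illustration under stated assumptions, not the kernel's code: `flush_others`, `send_ipi_mask`, and the `ipis_sent` counter are hypothetical stand-ins, and a `uint64_t` replaces the kernel's cpumask type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counts calls that reach the stand-in IPI layer. */
static unsigned int ipis_sent;

/* Stand-in for the low-level IPI send; a real one must never see 0. */
static void send_ipi_mask(uint64_t mask)
{
    (void)mask;
    ipis_sent++;
}

/*
 * Read the shared mask once, drop the current CPU from that snapshot,
 * and only call down to the IPI layer if any other CPU is left.
 * self must be below 64. Returns true if an IPI was sent.
 */
static bool flush_others(const uint64_t *shared_mask, unsigned int self)
{
    uint64_t others = *shared_mask & ~(UINT64_C(1) << self);

    if (others == 0)
        return false;       /* empty set: do not touch the APIC */

    send_ipi_mask(others);  /* pass the tested snapshot, not the pointer */
    return true;
}
```

The emptiness test and the send use the same local value, so a concurrent change to the shared mask after the test cannot turn the mask handed to the IPI layer into an empty one.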
  21. 18 Mar, 2009 1 commit
    • x86: add x2apic_wrmsr_fence() to x2apic flush tlb paths · ce4e240c
      Committed by Suresh Siddha
      Impact: optimize APIC IPI related barriers
      
      Uncached MMIO accesses for xapic are inherently serializing and hence
      we don't need explicit barriers for xapic IPI paths.
      
      x2apic MSR writes/reads don't have serializing semantics and hence need
      a serializing instruction or mfence, to make all the previous memory
      stores globally visible before the x2apic msr write for IPI.
      
      Add x2apic_wrmsr_fence() in flush tlb path to x2apic specific paths.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "steiner@sgi.com" <steiner@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      LKML-Reference: <1237313814.27006.203.camel@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  22. 18 Feb, 2009 2 commits
  23. 29 Jan, 2009 2 commits
  24. 21 Jan, 2009 5 commits
    • x86, mm: move tlb.c to arch/x86/mm/ · 55f4949f
      Committed by Ingo Molnar
      Impact: cleanup
      
      Now that it's unified, move the (SMP) TLB flushing code from arch/x86/kernel/
      to arch/x86/mm/, where it belongs logically.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: rename tlb_64.c to tlb.c · 16c2d3f8
      Committed by Tejun Heo
      Impact: file rename
      
      tlb_64.c is now the tlb code for both 32 and 64.  Rename it to tlb.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • x86: make x86_32 use tlb_64.c · 02cf94c3
      Committed by Tejun Heo
      Impact: less contention when issuing invalidate IPI, cleanup
      
      Make x86_32 use the same tlb code as 64bit.  The 64bit code uses
      multiple IPI vectors for tlb shootdown to reduce contention.  This
      patch makes x86_32 allocate the same 8 IPIs as x86_64 and share the
      code paths.
      
      Note that the usage of asmlinkage is inconsistent for x86_32 and 64
      and calls for further cleanup.  This has been noted with a FIXME
      comment in tlb_64.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • x86: prepare for tlb merge · 6dd01bed
      Committed by Tejun Heo
      Impact: clean up, ipi vector number reordering for x86_32
      
      Make the following changes to prepare for tlb merge.
      
      * reorder x86_32 ipi vectors
      
      * adjust tlb_32.c and tlb_64.c such that their logics coincide exactly
      	- on spurious invalidate ipi, tlb_32 acks the irq
      	- tlb_64 now has proper memory barriers around clearing
                flush_cpumask (no change in generated code)
      
      * unexport flush_tlb_page from tlb_32.c, there's no user
      
      * use unsigned int for cpu id
      
      * drop unnecessary includes from tlb_64.c
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • x86: uv cleanup · bdbcdd48
      Committed by Tejun Heo
      Impact: cleanup
      
      Make the following uv related cleanups.
      
      * collect visible uv related definitions and interfaces into uv/uv.h
        and use it.  this cleans up the messy situation where on 64bit, uv
        is defined properly, on 32bit generic it's dummy and on the rest
        undefined.  after this clean up, uv is defined on 64 and dummy on
        32.
      
      * update uv_flush_tlb_others() such that it takes cpumask of
        to-be-flushed cpus as argument, instead of that minus self, and
        returns yet-to-be-flushed cpumask, instead of modifying the passed
        in parameter.  this interface change will ease dummy implementation
        of uv_flush_tlb_others() and makes uv tlb flush related stuff
        defined in tlb_uv proper.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  25. 18 Jan, 2009 1 commit