1. 18 November 2012 · 2 commits
  2. 26 October 2012 · 1 commit
  3. 25 October 2012 · 1 commit
  4. 24 October 2012 · 2 commits
  5. 09 October 2012 · 8 commits
    • readahead: fault retry breaks mmap file read random detection · 45cac65b
      Committed by Shaohua Li
      .fault can now retry.  The retry can break the .fault state machine: in
      filemap_fault, if the page is missing, ra->mmap_miss is increased; on the
      second try, since the page is now in the page cache, ra->mmap_miss is
      decreased.  Both happen within a single fault, so random mmap file access
      can no longer be detected.

      Add a new flag to indicate that .fault has already been tried once.  On
      the second try, skip the ra->mmap_miss decrement.  The filemap_fault
      state machine is fine with this.

      I only tested x86, not other architectures, but the change for the other
      architectures looks obvious, but who knows :)
      Signed-off-by: Shaohua Li <shaohua.li@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45cac65b
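      A minimal sketch of the idea above, assuming a retry flag along the lines
      of the FAULT_FLAG_TRIED bit this change introduces; the function shape is
      simplified from the filemap readahead helpers and is not the actual diff.

        #include <linux/fs.h>   /* struct file_ra_state */
        #include <linux/mm.h>   /* struct vm_fault, FAULT_FLAG_TRIED */

        /* Illustrative only: skip the mmap_miss accounting on a retried
         * fault, so the readahead heuristic still sees one logical fault. */
        static void sketch_async_mmap_readahead(struct file_ra_state *ra,
                                                struct vm_fault *vmf)
        {
                /* The first attempt already counted a miss and faulted the
                 * page in; decrementing mmap_miss again on the retry would
                 * cancel that out and hide random-access patterns. */
                if (vmf->flags & FAULT_FLAG_TRIED)
                        return;

                if (ra->mmap_miss > 0)
                        ra->mmap_miss--;

                /* ... kick off asynchronous readahead as before ... */
        }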
    • rbtree: move augmented rbtree functionality to rbtree_augmented.h · 9c079add
      Committed by Michel Lespinasse
      Provide rb_insert_augmented() and rb_erase_augmented() through a new
      rbtree_augmented.h include file.  rb_erase_augmented() is defined there as
      an __always_inline function, in order to allow inlining of augmented
      rbtree callbacks into it.  Since this generates a relatively large
      function, each augmented rbtree user should make sure to have a single
      call site.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c079add
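      To illustrate what the new header expects from its users, here is a
      hand-written, hedged example of the callback set that
      rb_insert_augmented()/rb_erase_augmented() take, for a hypothetical node
      that caches the maximum value in its subtree.  The node type and helper
      names are invented, and the callback semantics are summarized from
      memory; real users would normally let the RB_DECLARE_CALLBACKS() macro
      (added later in this series) generate this boilerplate.

        #include <linux/rbtree_augmented.h>

        struct mynode {
                struct rb_node rb;
                unsigned long value;
                unsigned long subtree_max;      /* max of 'value' below here */
        };

        static unsigned long mynode_compute_max(struct mynode *n)
        {
                unsigned long max = n->value;
                struct mynode *c;

                if (n->rb.rb_left) {
                        c = rb_entry(n->rb.rb_left, struct mynode, rb);
                        if (c->subtree_max > max)
                                max = c->subtree_max;
                }
                if (n->rb.rb_right) {
                        c = rb_entry(n->rb.rb_right, struct mynode, rb);
                        if (c->subtree_max > max)
                                max = c->subtree_max;
                }
                return max;
        }

        /* Recompute subtree_max from 'rb' up to (but excluding) 'stop'. */
        static void mynode_propagate(struct rb_node *rb, struct rb_node *stop)
        {
                while (rb != stop) {
                        struct mynode *n = rb_entry(rb, struct mynode, rb);
                        unsigned long max = mynode_compute_max(n);

                        if (n->subtree_max == max)
                                break;
                        n->subtree_max = max;
                        rb = rb_parent(rb);
                }
        }

        static void mynode_copy(struct rb_node *old, struct rb_node *new)
        {
                rb_entry(new, struct mynode, rb)->subtree_max =
                        rb_entry(old, struct mynode, rb)->subtree_max;
        }

        static void mynode_rotate(struct rb_node *old, struct rb_node *new)
        {
                mynode_copy(old, new);
                rb_entry(old, struct mynode, rb)->subtree_max =
                        mynode_compute_max(rb_entry(old, struct mynode, rb));
        }

        static const struct rb_augment_callbacks mynode_callbacks = {
                .propagate = mynode_propagate,
                .copy      = mynode_copy,
                .rotate    = mynode_rotate,
        };

        /* Keep a single rb_erase_augmented() call site, as advised above. */
        static void mynode_erase(struct mynode *n, struct rb_root *root)
        {
                rb_erase_augmented(&n->rb, root, &mynode_callbacks);
        }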
    • mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Committed by Michel Lespinasse
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
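      The max-endpoint trick behind such an interval tree can be shown in
      plain, standalone C (a toy unbalanced tree, not the kernel template):
      each node caches the largest interval end in its subtree, which lets an
      overlap query prune whole subtrees.

        #include <stdio.h>
        #include <stdlib.h>

        struct itnode {
                unsigned long start, last;      /* closed interval [start, last] */
                unsigned long max_last;         /* max 'last' in this subtree */
                struct itnode *left, *right;
        };

        static struct itnode *it_insert(struct itnode *root,
                                        unsigned long start, unsigned long last)
        {
                if (!root) {
                        struct itnode *n = calloc(1, sizeof(*n));

                        n->start = start;
                        n->last = last;
                        n->max_last = last;
                        return n;
                }
                if (start < root->start)
                        root->left = it_insert(root->left, start, last);
                else
                        root->right = it_insert(root->right, start, last);
                if (last > root->max_last)
                        root->max_last = last;
                return root;
        }

        /* Return any node overlapping [start, last], or NULL. */
        static struct itnode *it_any_overlap(struct itnode *n,
                                             unsigned long start,
                                             unsigned long last)
        {
                while (n) {
                        if (n->left && n->left->max_last >= start)
                                n = n->left;    /* left subtree may overlap */
                        else if (n->start <= last && n->last >= start)
                                return n;       /* this node overlaps */
                        else if (n->start > last)
                                return NULL;    /* everything right starts too late */
                        else
                                n = n->right;
                }
                return NULL;
        }

        int main(void)
        {
                struct itnode *root = NULL;

                root = it_insert(root, 10, 20);
                root = it_insert(root, 30, 40);
                root = it_insert(root, 15, 35);
                printf("overlap with [32,33]: %s\n",
                       it_any_overlap(root, 32, 33) ? "yes" : "no");
                return 0;
        }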
    • rbtree: add RB_DECLARE_CALLBACKS() macro · 3908836a
      Committed by Michel Lespinasse
      As proposed by Peter Zijlstra, this makes it easier to define the augmented
      rbtree callbacks.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3908836a
    • rbtree: remove prior augmented rbtree implementation · 9d9e6f97
      Committed by Michel Lespinasse
      Convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree API
      and remove the old augmented rbtree implementation.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d9e6f97
    • mm, x86, pat: rework linear pfn-mmap tracking · b3b9c293
      Committed by Konstantin Khlebnikov
      Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.
      
      We can pass the mapping address from remap_pfn_range() into
      track_pfn_vma_new() and collect all PAT-related logic together in
      arch/x86/.

      This patch also restores the original, frustration-free is_cow_mapping()
      check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73a
      ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

      The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
      because that case is already handled by VM_PFNMAP in the VM_NO_THP bit-mask.
      
      [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3b9c293
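      For reference, the "frustration-free" is_cow_mapping() test mentioned
      above is the classic private-but-writable check; a simplified sketch of
      it and of the whole-VMA condition remap_pfn_range() enforces for COW
      mappings (the wrapper name and shape are illustrative):

        #include <linux/errno.h>
        #include <linux/mm.h>

        /* A mapping is COW when it may be written yet is not shared. */
        static inline bool sketch_is_cow_mapping(vm_flags_t flags)
        {
                return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
        }

        /* Sketch of the kind of check restored in remap_pfn_range(): a COW
         * mapping may only be remapped if the remap covers the whole VMA. */
        static int sketch_check_remap(struct vm_area_struct *vma,
                                      unsigned long addr, unsigned long end)
        {
                if (sketch_is_cow_mapping(vma->vm_flags) &&
                    (addr != vma->vm_start || end != vma->vm_end))
                        return -EINVAL;
                return 0;
        }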
    • x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn · 5180da41
      Committed by Suresh Siddha
      With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
      attribute and uses it.  The expectation is that the driver reserves the
      memory attributes for the pfn before calling vm_insert_pfn().
      
      remap_pfn_range() (when called for the whole vma) will set up a new
      attribute (based on the prot argument) for the specified pfn range.
      This addresses the legacy usage which typically calls remap_pfn_range()
      with a desired memory attribute.  For ranges smaller than the vma size
      (which is typically not the case), remap_pfn_range() will use the
      existing memory attribute for the pfn range.
      
      Expose two different APIs for these two behaviors: track_pfn_insert()
      for tracking the pfn attribute set by vm_insert_pfn(), and
      track_pfn_remap() for remap_pfn_range().
      
      This cleanup also prepares the ground for the track/untrack pfn vma
      routines to take over the ownership of setting PAT specific vm_flag in
      the 'vma'.
      
      [khlebnikov@openvz.org: Clear checks in track_pfn_remap()]
      [akpm@linux-foundation.org: tweak a few comments]
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5180da41
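      A hedged sketch of the split described above.  The parameter lists are
      paraphrased, and the two PAT helpers marked hypothetical stand in for
      the real reserve/lookup routines; this is not the kernel's actual API.

        #include <linux/mm.h>
        #include <asm/pgtable_types.h>

        /* Hypothetical PAT helpers, standing in for the real routines: */
        unsigned long lookup_reserved_memtype(unsigned long pfn);
        int reserve_range_memtype(unsigned long pfn, unsigned long size,
                                  pgprot_t *prot);

        /* vm_insert_pfn() path: the driver has already reserved a memory
         * type for this pfn, so only look it up and fold it into the page
         * protection. */
        static int sketch_track_pfn_insert(struct vm_area_struct *vma,
                                           pgprot_t *prot, unsigned long pfn)
        {
                unsigned long cache_bits = lookup_reserved_memtype(pfn);

                *prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
                                 cache_bits);
                return 0;
        }

        /* remap_pfn_range() path: a remap of the whole VMA reserves a new
         * memory type based on the requested prot; a partial remap only
         * reuses what is already reserved. */
        static int sketch_track_pfn_remap(struct vm_area_struct *vma,
                                          pgprot_t *prot, unsigned long pfn,
                                          unsigned long addr,
                                          unsigned long size)
        {
                if (addr == vma->vm_start &&
                    size == vma->vm_end - vma->vm_start)
                        return reserve_range_memtype(pfn, size, prot);

                return sketch_track_pfn_insert(vma, prot, pfn);
        }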
    • x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines · b1a86e15
      Committed by Suresh Siddha
      The 'pfn' argument of track_pfn_vma_new() can be used to reserve the
      attribute for the pfn range; there is no need to depend on 'vm_pgoff'.
      
      Similarly, untrack_pfn_vma() can depend on the 'pfn' argument if it is
      non-zero or can use follow_phys() to get the starting value of the pfn
      range.
      
      Also, the non-zero 'size' argument can be used instead of recomputing it
      from the vma.
      
      This cleanup also prepares the ground for the track/untrack pfn vma
      routines to take over the ownership of setting PAT specific vm_flag in the
      'vma'.
      
      [khlebnikov@openvz.org: Clear pfn to paddr conversion]
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1a86e15
  6. 28 September 2012 · 1 commit
  7. 26 September 2012 · 1 commit
    • x86: Exception hooks for userspace RCU extended QS · 6ba3c97a
      Committed by Frederic Weisbecker
      Add the necessary hooks to the x86 exception handlers for userspace
      RCU extended quiescent state support.

      This includes traps, page faults, debug exceptions, etc.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      6ba3c97a
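      The general shape of such a hook, as a hedged sketch (not the actual
      traps.c/fault.c diff): on entry from userspace the handler leaves the
      RCU extended quiescent state, and re-enters it on the way back.
      rcu_user_exit()/rcu_user_enter() are the RCU-side entry points from
      this patch series as I recall them; their exact placement and guards
      here are assumptions.

        #include <linux/ptrace.h>
        #include <linux/rcupdate.h>

        /* Simplified exception handler skeleton. */
        void sketch_do_exception(struct pt_regs *regs, long error_code)
        {
                /* If the exception interrupted userspace, RCU considered
                 * this CPU quiescent; announce that we are back in the
                 * kernel so read-side sections below are tracked.  The
                 * real hooks only do this for user-mode entries. */
                rcu_user_exit();

                /* ... handle the trap / page fault / debug exception ... */

                /* About to return to userspace: resume the extended QS. */
                rcu_user_enter();
        }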
  8. 22 September 2012 · 2 commits
  9. 13 September 2012 · 1 commit
  10. 12 September 2012 · 4 commits
  11. 07 September 2012 · 1 commit
  12. 22 August 2012 · 1 commit
    • mm: hugetlbfs: correctly populate shared pmd · eb48c071
      Committed by Michal Hocko
      Each page mapped in a process's address space must be correctly
      accounted for in _mapcount.  Normally the rules for this are
      straightforward but hugetlbfs page table sharing is different.  The page
      table pages at the PMD level are reference counted while the mapcount
      remains the same.
      
      If this accounting is wrong, it causes bugs like this one reported by
      Larry Woodman:
      
        kernel BUG at mm/filemap.c:135!
        invalid opcode: 0000 [#1] SMP
        CPU 22
        Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
        Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
        RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
        Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
        Call Trace:
          delete_from_page_cache+0x40/0x80
          truncate_hugepages+0x115/0x1f0
          hugetlbfs_evict_inode+0x18/0x30
          evict+0x9f/0x1b0
          iput_final+0xe3/0x1e0
          iput+0x3e/0x50
          d_kill+0xf8/0x110
          dput+0xe2/0x1b0
          __fput+0x162/0x240
      
      During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
      shared page tables with the check dst_pte == src_pte.  The logic is if
      the PMD page is the same, they must be shared.  This assumes that the
      sharing is between the parent and child.  However, if the sharing is
      with a different process entirely then this check fails as in this
      diagram:
      
        parent
          |
          ------------>pmd
                       src_pte----------> data page
                                              ^
        other--------->pmd--------------------|
                        ^
        child-----------|
                       dst_pte
      
      For this situation to occur, it must be possible for Parent and Other to
      have faulted and failed to share page tables with each other.  This is
      possible due to the following style of race.
      
        PROC A                                          PROC B
        copy_hugetlb_page_range                         copy_hugetlb_page_range
          src_pte == huge_pte_offset                      src_pte == huge_pte_offset
          !src_pte so no sharing                          !src_pte so no sharing
      
        (time passes)
      
        hugetlb_fault                                   hugetlb_fault
          huge_pte_alloc                                  huge_pte_alloc
            huge_pmd_share                                 huge_pmd_share
              LOCK(i_mmap_mutex)
              find nothing, no sharing
              UNLOCK(i_mmap_mutex)
                                                            LOCK(i_mmap_mutex)
                                                            find nothing, no sharing
                                                            UNLOCK(i_mmap_mutex)
            pmd_alloc                                       pmd_alloc
            LOCK(instantiation_mutex)
            fault
            UNLOCK(instantiation_mutex)
                                                        LOCK(instantiation_mutex)
                                                        fault
                                                        UNLOCK(instantiation_mutex)
      
      These two processes are now pointing to the same data page but are not
      sharing page tables because the opportunity was missed.  When either
      process later forks, the src_pte == dst_pte check is potentially
      insufficient.  As the check falls through, the wrong PTE information is
      copied in (harmless but wrong) and the mapcount is bumped for a page
      mapped by a shared page table, leading to the BUG_ON.
      
      This patch addresses the issue by moving pmd_alloc into huge_pmd_share,
      which guarantees that the shared pud is populated in the same critical
      section as the pmd.  This also means that the huge_pte_offset test in
      huge_pmd_share is now serialized correctly, which in turn means that
      sharing will succeed more often, as the racing tasks see the pud and
      pmd populated together.
      
      Race identified and changelog written mostly by Mel Gorman.
      
      [akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
      Reported-by: Larry Woodman <lwoodman@redhat.com>
      Tested-by: Larry Woodman <lwoodman@redhat.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb48c071
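      A hedged sketch of the fixed flow.  Field and lock names follow the
      3.x-era code as I remember it, the sharing walk itself is elided, and
      this is not the real function body; the point is only that the search
      for a shareable pmd and the population of this task's pud now happen
      under one i_mmap_mutex critical section.

        #include <linux/fs.h>
        #include <linux/mm.h>

        static pte_t *sketch_huge_pmd_share(struct mm_struct *mm,
                                            struct vm_area_struct *vma,
                                            unsigned long addr, pud_t *pud)
        {
                struct address_space *mapping = vma->vm_file->f_mapping;
                pte_t *pte;

                mutex_lock(&mapping->i_mmap_mutex);

                /* 1. Walk the other VMAs of this file; if one already has a
                 *    pmd page covering this range, take a reference and
                 *    install it into our pud (elided). */

                /* 2. Allocate or fetch our pmd while still holding
                 *    i_mmap_mutex, so a racing task doing the same
                 *    huge_pte_offset() check under this mutex sees the pud
                 *    and pmd populated together.  Before the fix this
                 *    pmd_alloc() happened in the caller, outside the
                 *    critical section. */
                pte = (pte_t *)pmd_alloc(mm, pud, addr);

                mutex_unlock(&mapping->i_mmap_mutex);
                return pte;
        }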
  13. 15 August 2012 · 1 commit
  14. 03 August 2012 · 1 commit
  15. 28 June 2012 · 7 commits
    • x86/tlb: do flush_tlb_kernel_range by 'invlpg' · effee4b9
      Committed by Alex Shi
      This patch makes flush_tlb_kernel_range use 'invlpg'.  The performance
      cost and gain were analyzed in a previous patch
      (x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).

      In the testing: http://lkml.org/lkml/2012/6/21/10

      The cost is mostly hidden by the long kernel path, but the gain is still
      quite clear: memory access in a user application can improve by 30+%
      while the kernel executes this function.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      effee4b9
    • x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR · 52aec330
      Committed by Alex Shi
      There are now 32 INVALIDATE_TLB_VECTORs in the kernel, which is quite a
      large number of IDT vectors.  Yet it is still not enough, since modern
      x86 servers have even more CPUs, so TLB flushing still suffers heavy
      lock contention.

      This patch uses the generic SMP call function to replace them.  That
      frees 32 vector numbers in the IDT and resolves the lock contention in
      TLB flushing on large systems.

      On an NHM-EX machine with 4 sockets * 8 cores * HT = 64 CPUs, hackbench
      pthread shows a 3% performance increase.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      52aec330
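      A hedged sketch of the replacement pattern: instead of per-flush IPI
      vectors guarded by their own locks, a small descriptor is handed to the
      generic SMP cross-call machinery.  The descriptor layout and function
      names are illustrative, not the actual arch/x86/mm/tlb.c code.

        #include <linux/cpumask.h>
        #include <linux/mm_types.h>
        #include <linux/smp.h>

        struct sketch_flush_info {
                struct mm_struct *mm;
                unsigned long start, end;
        };

        /* Runs on each target CPU via the generic CALL_FUNCTION IPI. */
        static void sketch_flush_tlb_func(void *data)
        {
                struct sketch_flush_info *f = data;

                /* Flush the whole TLB or just [start, end) locally,
                 * depending on the range in 'f' (elided). */
                (void)f;
        }

        static void sketch_flush_tlb_others(const struct cpumask *mask,
                                            struct mm_struct *mm,
                                            unsigned long start,
                                            unsigned long end)
        {
                struct sketch_flush_info info = {
                        .mm = mm, .start = start, .end = end,
                };

                /* One generic cross-call replaces the 32 private
                 * INVALIDATE_TLB_VECTORs and their per-vector locks;
                 * wait=1 because 'info' lives on this stack. */
                smp_call_function_many(mask, sketch_flush_tlb_func, &info, 1);
        }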
    • x86/tlb: enable tlb flush range support for x86 · 611ae8e3
      Committed by Alex Shi
      Not every tlb_flush really needs to evict all TLB entries; in munmap, for
      example, a few 'invlpg' instructions are better for whole-process
      performance, since they leave most TLB entries in place for later
      accesses.

      This patch also rewrites flush_tlb_range for two purposes:
      1. split it out into a flush_tlb_mm_range function;
      2. clean it up to reduce line breaking, thanks to Borislav's input.

      My micro-benchmark 'mummap' (http://lkml.org/lkml/2012/5/17/59) shows
      that random memory access on another CPU gets a 0~50% speed-up on a
      2P * 4 cores * HT NHM-EP while 'munmap' runs.

      Thanks to Yongjie for his testing of this patch:
      -------------
      I used Linux 3.4-RC6 w/ and w/o his patches as Xen dom0 and guest
      kernel.
      After running two benchmarks in Xen HVM guest, I found his patches
      brought about 1%~3% performance gain in 'kernel build' and 'netperf'
      testing, though the performance gain was not very stable in 'kernel
      build' testing.
      
      Some detailed testing results are below.
      
      Testing Environment:
      	Hardware: Romley-EP platform
      	Xen version: latest upstream
      	Linux kernel: 3.4-RC6
      	Guest vCPU number: 8
      	NIC: Intel 82599 (10GB bandwidth)
      
      In 'kernel build' testing in guest:
      	Command line  |  performance gain
          make -j 4      |    3.81%
          make -j 8      |    0.37%
          make -j 16     |    -0.52%
      
      In 'netperf' testing, we tested TCP_STREAM with default socket size
      16384 byte as large packet and 64 byte as small packet.
      I used several clients to add networking pressure, then 'netperf' server
      automatically generated several threads to response them.
      I also used large-size packet and small-size packet in the testing.
      	Packet size  |  Thread number | performance gain
      	16384 bytes  |      4       |   0.02%
      	16384 bytes  |      8       |   2.21%
      	16384 bytes  |      16      |   2.04%
      	64 bytes     |      4       |   1.07%
      	64 bytes     |      8       |   3.31%
      	64 bytes     |      16      |   0.71%
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
      Tested-by: Ren, Yongjie <yongjie.ren@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      611ae8e3
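      A hedged sketch of the decision this rework introduces.  The variable
      names tlb_flushall_shift and act_entries stand in for the real per-CPU
      values, and this is not the real flush_tlb_mm_range: flush page by page
      only when the range is a small fraction of the active TLB, otherwise
      fall back to a full flush.

        #include <linux/mm.h>
        #include <asm/tlbflush.h>

        /* Stand-ins for the per-CPU tunables this series introduces. */
        static int tlb_flushall_shift = 6;
        static unsigned long act_entries = 512;

        static void sketch_flush_tlb_mm_range(struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end,
                                              unsigned long vmflag)
        {
                unsigned long addr;
                unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

                (void)mm;  /* the real code also cross-calls other CPUs */

                /* Heuristic disabled, or a huge-page range: full flush. */
                if (tlb_flushall_shift < 0 || (vmflag & VM_HUGETLB)) {
                        local_flush_tlb();
                        return;
                }

                /* invlpg only pays off for a small slice of the TLB. */
                if (nr_pages > (act_entries >> tlb_flushall_shift)) {
                        local_flush_tlb();
                        return;
                }

                for (addr = start; addr < end; addr += PAGE_SIZE)
                        __flush_tlb_single(addr);
        }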
    • x86/tlb: add tlb_flushall_shift knob into debugfs · 3df3212f
      Committed by Alex Shi
      The kernel will replace a cr3 rewrite with invlpg when
        tlb_flush_entries <= active_tlb_entries / 2^tlb_flushall_factor
      If tlb_flushall_factor is -1, the kernel won't do this replacement.

      Users can modify its value according to their specific CPU/applications.

      Thanks to Borislav for providing the help message for
      CONFIG_DEBUG_TLBFLUSH.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-6-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      3df3212f
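      A hedged sketch of exposing such a knob through debugfs using the
      generic simple-attribute helpers; the real patch's file name, location,
      and handling of the special -1 value may well differ.

        #include <linux/debugfs.h>
        #include <linux/fs.h>
        #include <linux/init.h>

        static int sketch_shift = 6;    /* stand-in for the real tunable */

        static int shift_get(void *data, u64 *val)
        {
                *val = sketch_shift;
                return 0;
        }

        static int shift_set(void *data, u64 val)
        {
                /* The real knob also accepts -1 to disable the heuristic,
                 * which needs signed parsing; elided here. */
                sketch_shift = val;
                return 0;
        }
        DEFINE_SIMPLE_ATTRIBUTE(shift_fops, shift_get, shift_set, "%llu\n");

        static int __init sketch_tlb_debugfs_init(void)
        {
                debugfs_create_file("tlb_flushall_shift", 0600, NULL, NULL,
                                    &shift_fops);
                return 0;
        }
        late_initcall(sketch_tlb_debugfs_init);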
    • x86/tlb: add tlb_flushall_shift for specific CPU · c4211f42
      Committed by Alex Shi
      Testing shows that different CPU types (micro-architectures and NUMA
      modes) have different balance points between a full TLB flush and
      multiple invlpg instructions.  There are also cases where the TLB flush
      change does not help at all.

      This patch gives x86 vendor developers an interface to set a different
      shift for each CPU type.

      For example, on machines at hand the balance point is 16 entries on
      Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 on an IVB mobile CPU;
      but on a model 15 Core2 Xeon, using invlpg does not help at all.

      For untested machines, apply a conservative optimization, the same as
      for NHM CPUs.
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      c4211f42
    • x86/tlb: fall back to flush all when meet a THP large page · d8dfe60d
      Committed by Alex Shi
      We don't need to flush large pages in PAGE_SIZE steps; that just wastes
      time.  And in fact, large pages don't need the 'invlpg' optimization,
      according to our micro-benchmark, so just flushing the whole TLB is
      enough for them.

      The following results were measured on a 2-CPU * 4-core * 2-HT NHM-EP
      machine, with the THP 'always' setting.

      Multi-threaded testing, where the '-t' parameter is the thread count:
                             without this patch 	with this patch
      ./mprotect -t 1         14ns                       13ns
      ./mprotect -t 2         13ns                       13ns
      ./mprotect -t 4         12ns                       11ns
      ./mprotect -t 8         14ns                       10ns
      ./mprotect -t 16        28ns                       28ns
      ./mprotect -t 32        54ns                       52ns
      ./mprotect -t 128       200ns                      200ns
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-4-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      d8dfe60d
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Committed by Alex Shi
      x86 has no flush_tlb_range support at the instruction level.  Currently
      flush_tlb_range is simply implemented by flushing the whole page table,
      which is not the best solution for all scenarios.  In fact, if we just
      use 'invlpg' to flush a few lines from the TLB, we gain performance from
      later accesses to the remaining TLB lines.

      But the 'invlpg' instruction costs a lot of time.  Its execution time can
      compete with a cr3 rewrite, and is even a bit higher on SNB CPUs.

      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns (assumed TLB refill cost) =
      		X (TLB flush entries) * 100ns (assumed invlpg cost)

      Here X is 256, that is, 1/2 of the 512 entries.

      But with the mysterious CPU prefetcher and page-miss-handler unit, the
      assumed TLB refill cost is far lower than 100ns for sequential access,
      and two HT siblings in one core make memory access even faster when they
      access the same memory.  So, in this patch, I only make the change when
      the number of target entries is less than 1/16 of all active TLB
      entries.  Admittedly, I have no data to support the '1/16' figure, so
      any suggestions are welcome.

      As for hugetlb, probably due to the smaller page table and the smaller
      number of active TLB entries, I didn't see a benefit in my benchmark, so
      it is not optimized for now.

      My micro-benchmark shows that in ideal scenarios the read performance
      improves by 70 percent, and in the worst scenario read/write performance
      is similar to the unpatched 3.4-rc4 kernel.

      Here is the read data on my 2P * 4 cores * HT NHM-EP machine, with THP
      'always':

      Multi-threaded testing, where the '-t' parameter is the thread count:
      	       	        with patch   unpatched 3.4-rc4
      ./mprotect -t 1           14ns		24ns
      ./mprotect -t 2           13ns		22ns
      ./mprotect -t 4           12ns		19ns
      ./mprotect -t 8           14ns		16ns
      ./mprotect -t 16          28ns		26ns
      ./mprotect -t 32          54ns		51ns
      ./mprotect -t 128         200ns		199ns
      
      Single process with sequential flushing and memory accessing:
      
      		       	with patch   unpatched 3.4-rc4
      ./mprotect		    7ns			11ns
      ./mprotect -p 4096  -l 8 -n 10240
      			    21ns		21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      e7b52ffd
  16. 20 June 2012 · 1 commit
  17. 11 June 2012 · 1 commit
  18. 08 June 2012 · 1 commit
  19. 06 June 2012 · 2 commits
    • x86/numa: Set numa_nodes_parsed at acpi_numa_memory_affinity_init() · 4af463d2
      Committed by Yasuaki Ishimatsu
      When hot-adding a CPU, the system outputs the following messages because
      node_to_cpumask_map[2] was not allocated memory:
      
      Booting Node 2 Processor 32 APIC 0xc0
      node_to_cpumask_map[2] NULL
      Pid: 0, comm: swapper/32 Tainted: G       A     3.3.5-acd #21
      Call Trace:
       [<ffffffff81048845>] debug_cpumask_set_cpu+0x155/0x160
       [<ffffffff8105e28a>] ? add_timer_on+0xaa/0x120
       [<ffffffff8150665f>] numa_add_cpu+0x1e/0x22
       [<ffffffff815020bb>] identify_cpu+0x1df/0x1e4
       [<ffffffff815020d6>] identify_econdary_cpu+0x16/0x1d
       [<ffffffff81504614>] smp_store_cpu_info+0x3c/0x3e
       [<ffffffff81505263>] smp_callin+0x139/0x1be
       [<ffffffff815052fb>] start_secondary+0x13/0xeb
      
      The reason is that the bit for node 2 was not set in numa_nodes_parsed.
      numa_nodes_parsed is set only by acpi_numa_processor_affinity_init /
      acpi_numa_x2apic_affinity_init.  Thus, even if hot-added memory in the
      same PXM as the hot-added CPU is described in the ACPI SRAT table,
      numa_nodes_parsed is not set when the hot-added CPU itself is not
      described there.

      But the ACPI Spec Rev 5.0 says the following about the ACPI SRAT table:
      this optional table provides information that allows OSPM to associate
      processors and memory ranges, including ranges of memory provided by
      hot-added memory devices, with system localities / proximity domains and
      clock domains.

      It means that the ACPI SRAT table only provides information for CPUs
      present at boot time, while for memory it also covers hot-added memory.
      So hot-added memory is described in the ACPI SRAT table, but a hot-added
      CPU is not.  Thus numa_nodes_parsed should be set not only by
      acpi_numa_processor_affinity_init / acpi_numa_x2apic_affinity_init but
      also by acpi_numa_memory_affinity_init to cover this case.

      Additionally, if the system has a CPU-less memory node,
      acpi_numa_processor_affinity_init / acpi_numa_x2apic_affinity_init
      cannot set numa_nodes_parsed, since these functions find no CPU
      description for that node.  In this case, too, numa_nodes_parsed needs
      to be set by acpi_numa_memory_affinity_init.
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: liuj97@gmail.com
      Cc: kosaki.motohiro@gmail.com
      Link: http://lkml.kernel.org/r/4FCC2098.4030007@jp.fujitsu.com
      [ merged it ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4af463d2
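      The essence of the fix, as a hedged sketch: also mark the node as parsed
      when a memory affinity entry is seen, not only when a CPU entry is.  The
      surrounding SRAT parsing is elided and the function shape is simplified;
      only node_set()/numa_nodes_parsed are the real kernel symbols here.

        #include <linux/init.h>
        #include <linux/nodemask.h>
        #include <asm/numa.h>

        /* Called (conceptually) while handling one SRAT memory affinity
         * entry that maps to 'node'; range registration is elided. */
        static void __init sketch_memory_affinity_seen(int node)
        {
                /* The fix: record that this node exists even when no CPU
                 * entry in the SRAT referenced it (hot-added CPU, or a
                 * CPU-less memory node), so per-node structures such as
                 * node_to_cpumask_map[] get set up for it later. */
                node_set(node, numa_nodes_parsed);
        }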
    • x86-64/efi: Use EFI to deal with platform wall clock · bacef661
      Committed by Jan Beulich
      Unlike ix86, x86-64 on EFI so far didn't set the {g,s}et_wallclock
      accessors to the EFI routines, and thus incorrectly used raw RTC
      accesses instead.
      
      Simply removing the #ifdef around the respective code isn't
      enough, however: While so far early get-time calls were done in
      physical mode, this doesn't work properly for x86-64, as virtual
      addresses would still need to be set up for all runtime regions
      (which wasn't the case on the system I have access to), so
      instead the patch moves the call to efi_enter_virtual_mode()
      ahead (which in turn allows dropping all code related to calling
      efi-get-time in physical mode).
      
      Additionally, calling efi_set_executable() earlier requires the CPA code
      to cope, i.e. during early boot it must avoid calling cpa_flush_array(),
      as the first thing that function does is a BUG_ON(irqs_disabled()).
      
      Also make the two EFI functions in question here static -
      they're not being referenced elsewhere.
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Tested-by: Matt Fleming <matt.fleming@intel.com>
      Acked-by: Matthew Garrett <mjg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4FBFBF5F020000780008637F@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bacef661
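      A hedged sketch of the hook wiring this entry describes.
      x86_platform.get_wallclock/set_wallclock and efi_get_time /
      efi_set_rtc_mmss are the names as I recall them, with their signatures
      declared below as assumptions; in the real code the assignment lives in
      the EFI setup path itself, which is what lets the two functions become
      static as the entry notes.

        #include <linux/init.h>
        #include <asm/x86_init.h>

        /* Signatures as I recall them (assumption): */
        unsigned long efi_get_time(void);
        int efi_set_rtc_mmss(unsigned long nowtime);

        /* Previously this wiring was done only for 32-bit (ix86); routing
         * x86-64 through it as well makes the wall clock use the EFI
         * runtime services instead of raw RTC accesses. */
        static void __init sketch_efi_wallclock_init(void)
        {
                x86_platform.get_wallclock = efi_get_time;
                x86_platform.set_wallclock = efi_set_rtc_mmss;
        }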
  20. 30 May 2012 · 1 commit