1. 23 Oct 2008, 1 commit
2. 20 Oct 2008, 1 commit
• mm: rewrite vmap layer · db64fe02
  Nick Piggin authored
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
The biggest problem with vmap is actually vunmap.  Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush their TLBs.  This is all done under a global lock.  As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
      
Another problem is that the entire vmap subsystem works under a single
lock.  It is an rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never run concurrently anyway,
so it's just pointless.
      
This is a rewrite of the vmap subsystem to solve those problems.  The
existing vmalloc API is implemented on top of the rewritten subsystem.
      
The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them (that would be a use-after-free)
until they are reallocated.  So the addresses aren't handed out again until
a subsequent TLB flush, and a single TLB flush can then cover multiple
vunmaps from each CPU.
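
In sketch form (simplified; these helpers approximate, but are not
exactly, the patch's mm/vmalloc.c symbols):

	/* vunmap only clears the page tables and queues the range; the
	 * expensive global TLB flush is amortized over many vunmaps. */
	static atomic_t vmap_lazy_nr = ATOMIC_INIT(0);

	static void free_vmap_area_lazy(struct vmap_area *va)
	{
		atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT,
			   &vmap_lazy_nr);
		if (atomic_read(&vmap_lazy_nr) > lazy_max_pages())
			/* one flush_tlb_kernel_range() covers every
			 * queued vunmap before any address is reused */
			purge_vmap_area_lazy();
	}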
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
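
For example, address lookup becomes an O(log n) descent of that rbtree
(close in shape to the patch's __find_vmap_area()):

	static struct vmap_area *__find_vmap_area(unsigned long addr)
	{
		struct rb_node *n = vmap_area_root.rb_node;

		while (n) {
			struct vmap_area *va;

			va = rb_entry(n, struct vmap_area, rb_node);
			if (addr < va->va_start)
				n = n->rb_left;
			else if (addr >= va->va_end)
				n = n->rb_right;
			else
				return va;
		}
		return NULL;
	}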
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
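
A usage sketch of the new interface (signatures as introduced here;
error handling trimmed):

	static int use_pages_mapped(struct page **pages, unsigned int nr,
				    const void *data, size_t len)
	{
		void *addr = vm_map_ram(pages, nr, -1 /* any node */,
					PAGE_KERNEL);

		if (!addr)
			return -ENOMEM;
		memcpy(addr, data, len);   /* linear view of the pages */
		vm_unmap_ram(addr, nr);    /* flush is deferred and batched */
		return 0;
	}

Code that cannot tolerate stale aliases (Xen, PAT) calls
vm_unmap_aliases() to force the deferred flush, as noted above.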
      
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping, touching, then unmapping 4 pages.  Different numbers
of tests were run in parallel on a 4-core, 2-socket Opteron.  Results are
in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
vmap has gone from being in the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3. 18 Oct 2008, 1 commit
4. 16 Oct 2008, 2 commits
5. 14 Oct 2008, 5 commits
6. 13 Oct 2008, 11 commits
7. 12 Oct 2008, 2 commits
• x86: memory corruption check - cleanup · 46eaa670
  Ingo Molnar authored
      Move the prototypes from the generic kernel.h header to the more
      appropriate include/asm-x86/bios_ebda.h header file.
      
      Also, remove the check from the power management code - this is a
      pure x86 matter for now.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86, early_ioremap: fix fencepost error · c613ec1a
  Alan Cox authored
The x86 implementation of early_ioremap has an off-by-one error.  If we get
an object which ends on the first byte of a page, we undermap it by one page,
and this causes a crash on boot with the ASUS P5QL, whose DMI table happens
to fit this alignment.
      
      The size computation is currently
      
      	last_addr = phys_addr + size - 1;
      	npages = (PAGE_ALIGN(last_addr) - phys_addr)
      
      (Consider a request for 1 byte at alignment 0...)
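
Worked through, with the shape of the corrected rounding (hedged, since
the message does not quote the fix itself):

	/* request: size = 1 byte at phys_addr = 0 */
	last_addr = phys_addr + size - 1;                /* = 0        */
	nrpages = (PAGE_ALIGN(last_addr) - phys_addr)
			>> PAGE_SHIFT;                   /* = 0 pages! */

	/* rounding up from one byte past the end maps the full page */
	nrpages = (PAGE_ALIGN(last_addr + 1) - phys_addr)
			>> PAGE_SHIFT;                   /* = 1 page   */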
      
      Closes #11693
      
      Debugging work by Ian Campbell/Felix Geyer
Signed-off-by: Alan Cox <alan@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
8. 11 Oct 2008, 9 commits
• x86, cpa: make the kernel physical mapping initialization a two pass sequence, fix · b27a43c1
  Suresh Siddha authored
      Jeremy Fitzhardinge wrote:
      
      > I'd noticed that current tip/master hasn't been booting under Xen, and I
      > just got around to bisecting it down to this change.
      >
      > commit 065ae73c5462d42e9761afb76f2b52965ff45bd6
      > Author: Suresh Siddha <suresh.b.siddha@intel.com>
      >
      >    x86, cpa: make the kernel physical mapping initialization a two pass sequence
      >
      > This patch is causing Xen to fail various pagetable updates because it
      > ends up remapping pagetables to RW, which Xen explicitly prohibits (as
      > that would allow guests to make arbitrary changes to pagetables, rather
      > than have them mediated by the hypervisor).
      
Instead of making init a two-pass sequence to satisfy Intel's TLB
application note (developer.intel.com/design/processor/applnots/317080.pdf,
Section 6, page 26), we preserve the original page permissions
when fragmenting the large mappings, and don't touch the existing memory
mapping (which satisfies Xen's requirements).
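
The shape of the idea (illustrative only; split_preserving_prot and its
arguments are stand-ins, not the patch's actual code):

	static void split_preserving_prot(pte_t *large_pte, pte_t *new_ptes,
					  unsigned long pfn)
	{
		/* reuse the large page's existing protections instead of
		 * fresh defaults, so a pagetable page that Xen made
		 * read-only stays read-only after the split */
		pgprot_t prot = pte_pgprot(pte_clrhuge(*large_pte));
		int i;

		for (i = 0; i < PTRS_PER_PTE; i++)
			set_pte(&new_ptes[i], pfn_pte(pfn + i, prot));
	}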
      
The only open issue is: on a native Linux kernel, we will go back to mapping
the first 0-1GB of the kernel identity mapping as executable (because of the
static mapping setup in head_64.S).  We can fix this in a different
patch if needed.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86, pat: cleanups · ad2cde16
  Ingo Molnar authored
Clean up recently added code to be more consistent with other x86 code.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86: fix pagetable init 64-bit breakage · 28dd033f
  Suresh Siddha authored
Fix the _end alignment check - it can trigger a crash if _end happens to
fall on a page boundary.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86: track memtype for RAM in page struct · 9542ada8
  Suresh Siddha authored
Track the memtype for RAM pages in the page struct instead of in the
memtype list.  This avoids the explosion in the number of entries in the
memtype list (on the order of 20,000 with AGP) and makes the PAT
tracking simpler.
      
We use the PG_arch_1 bit in page->flags.

We still use the memtype list for non-RAM pages.
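
Roughly, the per-page tracking amounts to this (accessor names here are
illustrative, not necessarily the patch's):

	/* record "this RAM page has a non-WB mapping" in the page
	 * itself instead of adding a global memtype-list entry */
	static inline int page_is_non_wb(struct page *page)
	{
		return test_bit(PG_arch_1, &page->flags);
	}

	static inline void set_page_non_wb(struct page *page)
	{
		set_bit(PG_arch_1, &page->flags);
	}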
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86, cpa: srlz cpa(), global flush tlb after splitting big page and before doing cpa · ad5ca55f
  Suresh Siddha authored
Do a global TLB flush after splitting the large page and before doing the
actual change of the page attribute in the PTE.
      
Without this, we violate the TLB application note, which says:
          "The TLBs may contain both ordinary and large-page translations for
           a 4-KByte range of linear addresses. This may occur if software
           modifies the paging structures so that the page size used for the
           address range changes. If the two translations differ with respect
           to page frame or attributes (e.g., permissions), processor behavior
           is undefined and may be implementation-specific."
      
Also, serialize cpa() (for !DEBUG_PAGEALLOC, which uses large identity
mappings) using cpa_lock, so that no CPU with stale large TLB entries can
change a page attribute in parallel with another CPU that is splitting a
large page entry and changing its attribute.
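
The resulting ordering, sketched with simplified locking (helper names
follow arch/x86/mm/pageattr.c loosely):

	spin_lock(&cpa_lock);
	split_large_page(kpte, address); /* large mapping -> 4K PTEs  */
	spin_unlock(&cpa_lock);

	flush_tlb_all();                 /* no CPU keeps a stale
					    large-page translation    */

	spin_lock(&cpa_lock);
	set_pte_atomic(kpte, new_pte);   /* now change the attribute  */
	spin_unlock(&cpa_lock);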
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86, cpa: remove cpa pool code · 8311eb84
  Suresh Siddha authored
Interrupt context no longer splits large pages in cpa(), so we can do away
with the cpa memory pool code.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86, cpa: no need to check alias for __set_pages_p/__set_pages_np · 55121b43
  Suresh Siddha authored
No alias checking is needed when setting a present/not-present mapping.
Otherwise, we may need to break large pages for 64-bit kernel text mappings
(this adds complexity, especially if we want to do it from atomic context,
e.g. with CONFIG_DEBUG_PAGEALLOC).  Let's keep it simple!
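
Close to the shape of the change (a hedged reconstruction, not the
verbatim patch):

	static int __set_pages_np(struct page *page, int numpages)
	{
		unsigned long tempaddr = (unsigned long)page_address(page);
		struct cpa_data cpa = { .vaddr = tempaddr,
					.numpages = numpages,
					.mask_set = __pgprot(0),
					.mask_clr = __pgprot(_PAGE_PRESENT |
							     _PAGE_RW) };

		/* checkalias = 0: skip the alias fix-up pass entirely */
		return __change_page_attr_set_clr(&cpa, 0);
	}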
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86, cpa: dont use large pages for kernel identity mapping with DEBUG_PAGEALLOC · 0b8fdcbc
  Suresh Siddha authored
      Don't use large pages for kernel identity mapping with DEBUG_PAGEALLOC.
      This will remove the need to split the large page for the
      allocated kernel page in the interrupt context.
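
In effect (a sketch; use_pse/use_gbpages are my best guess at the
mapping-init variables of that era):

	#ifdef CONFIG_DEBUG_PAGEALLOC
		/* identity-map everything with 4K PTEs so that
		 * kernel_map_pages() never has to split a large page
		 * from interrupt context */
		use_pse = use_gbpages = 0;
	#endif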
      
This simplifies the cpa code, as we no longer do the split from interrupt
context.  The cpa code simplification follows in subsequent patches.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
• x86, cpa: make the kernel physical mapping initialization a two pass sequence · a2699e47
  Suresh Siddha authored
In the first pass, the kernel physical mapping is set up using large or
small pages, but with the same PTE attributes as the early PTE attributes
set up by the early boot code in head_[32|64].S.
      
After flushing the TLBs, we go through a second pass, which sets up the
direct-mapped PTEs with the appropriate attributes (like NX, GLOBAL, etc.)
that are runtime detectable.
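
In outline (illustrative pseudocode; the attribute parameters are
hypothetical, not the actual init_64.c signatures):

	/* pass 1: map with the early boot attributes that
	 * head_[32|64].S already used */
	kernel_physical_mapping_init(start, end, early_pte_attrs);

	flush_tlb_all();  /* no stale size/attribute mix in any TLB */

	/* pass 2: rewrite with runtime-detected attributes
	 * (NX, GLOBAL, ...) */
	kernel_physical_mapping_init(start, end, runtime_pte_attrs);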
      
      This two pass mechanism conforms to the TLB app note which says:
      
      "Software should not write to a paging-structure entry in a way that would
       change, for any linear address, both the page size and either the page frame
       or attributes."
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
9. 30 Sep 2008, 1 commit
10. 26 Sep 2008, 1 commit
11. 15 Sep 2008, 1 commit
12. 07 Sep 2008, 4 commits
13. 05 Sep 2008, 1 commit