1. 29 Oct 2009, 1 commit
  2. 23 Sep 2009, 1 commit
  3. 22 Sep 2009, 4 commits
  4. 14 Aug 2009, 2 commits
    • vmalloc: implement pcpu_get_vm_areas() · ca23e405
      Committed by Tejun Heo
      To directly use spread NUMA memories for percpu units, the percpu
      allocator will be updated to allow sparsely mapping units in a chunk.
      As the distances between units can be very large, this makes
      allocating a single vmap area for each chunk undesirable.  This patch
      implements pcpu_get_vm_areas() and pcpu_free_vm_areas(), which
      allocate and free sparse congruent vmap areas.
      
      pcpu_get_vm_areas() takes @offsets and @sizes arrays which define the
      distances and sizes of the vmap areas.  It scans down from the top of
      the vmalloc area looking for the top-most address which can
      accommodate all the areas.  The scan is top-down to avoid interacting
      with regular vmallocs, which could otherwise push these congruent
      areas up little by little, wasting address space and page tables.
      
      To speed up the top-down scan, the highest possible address hint is
      maintained.  Although the scan is linear from the hint, given the
      usually large holes between the memory addresses of different NUMA
      nodes, the scan is highly likely to finish after finding the first
      hole for the last unit, which is scanned first.  (A sketch of the
      interface follows this entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Nick Piggin <npiggin@suse.de>
      ca23e405
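      A hedged sketch of the interface described above, as declared in the
      include/linux/vmalloc.h of this era (consult the tree at ca23e405 for
      the authoritative form):

          /*
           * Allocate nr_vms congruent vmap areas: area i starts at
           * base + offsets[i] and spans sizes[i] bytes, with the one base
           * address found by the top-down scan described above.
           * Returns an array of vm_struct pointers, or NULL on failure.
           */
          struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
                                               const size_t *sizes, int nr_vms,
                                               size_t align);

          /* Free areas previously obtained from pcpu_get_vm_areas(). */
          void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);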
    • vmalloc: separate out insert_vmalloc_vm() · cf88c790
      Committed by Tejun Heo
      Separate out insert_vmalloc_vm() from __get_vm_area_node().
      insert_vmalloc_vm() initializes a vm_struct from a vmap_area and
      inserts it into vmlist.  It only initializes the fields which can be
      determined from @vm, @flags and @caller; the rest should be
      initialized by the caller.  For __get_vm_area_node(), all other
      fields just need to be cleared, which is done by using kzalloc
      instead of kmalloc.
      
      This will be used to implement pcpu_get_vm_areas().  (A sketch of the
      helper follows this entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Nick Piggin <npiggin@suse.de>
      cf88c790
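      A reconstructed sketch of the split-out helper, hedged from the
      description above (field and lock names follow the 2.6.31-era
      mm/vmalloc.c; the authoritative version is in cf88c790):

          static void insert_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
                                        unsigned long flags, void *caller)
          {
                  struct vm_struct *tmp, **p;

                  /* Fields determined from the vmap_area, @flags and @caller... */
                  vm->flags = flags;
                  vm->addr = (void *)va->va_start;
                  vm->size = va->va_end - va->va_start;
                  vm->caller = caller;
                  va->private = vm;
                  va->flags |= VM_VM_AREA;

                  /* ...then link into the address-ordered vmlist. */
                  write_lock(&vmlist_lock);
                  for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
                          if (tmp->addr >= vm->addr)
                                  break;
                  }
                  vm->next = *p;
                  *p = vm;
                  write_unlock(&vmlist_lock);
          }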
  5. 12 Jun 2009, 2 commits
  6. 07 May 2009, 1 commit
  7. 01 Apr 2009, 1 commit
  8. 28 Feb 2009, 2 commits
    • mm: fix lazy vmap purging (use-after-free error) · cbb76676
      Committed by Vegard Nossum
      I just got this new warning from kmemcheck:
      
          WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
          a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
           f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
           ^
      
          Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
          EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0
          EIP is at __purge_vmap_area_lazy+0x117/0x140
          EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
          ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
           DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
          CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
          DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
          DR6: 00004000 DR7: 00000000
           [<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70
           [<c1096f6a>] remove_vm_area+0x2a/0x70
           [<c1097025>] __vunmap+0x45/0xe0
           [<c10970de>] vunmap+0x1e/0x30
           [<c1008ba5>] text_poke+0x95/0x150
           [<c1008ca9>] alternatives_smp_unlock+0x49/0x60
           [<c171ef47>] alternative_instructions+0x11b/0x124
           [<c171f991>] check_bugs+0xbd/0xdc
           [<c17148c5>] start_kernel+0x2ed/0x360
           [<c171409e>] __init_begin+0x9e/0xa9
           [<ffffffff>] 0xffffffff
      
      It happened here:
      
          $ addr2line -e vmlinux -i c1096df7
          mm/vmalloc.c:540
      
      Code:
      
      	list_for_each_entry(va, &valist, purge_list)
      		__free_vmap_area(va);
      
      It's this instruction:
      
          mov    0x20(%ebx),%edx
      
      Which corresponds to a dereference of va->purge_list.next:
      
          (gdb) p ((struct vmap_area *) 0)->purge_list.next
          Cannot access memory at address 0x20
      
      It seems that we should use "safe" list traversal here, as the
      element is freed inside the loop.  Please verify that this is the
      right fix.  (A before/after sketch follows this entry.)
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbb76676
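      The fix implied above is the "_safe" list walk from <linux/list.h>,
      which caches the next pointer before the current element is freed; a
      minimal before/after sketch:

          /*
           * Buggy: __free_vmap_area(va) frees 'va', then the loop macro
           * dereferences va->purge_list.next to advance -- the
           * use-after-free that kmemcheck caught above.
           */
          list_for_each_entry(va, &valist, purge_list)
                  __free_vmap_area(va);

          /* Fixed: 'n_va' saves the next element before 'va' is freed. */
          list_for_each_entry_safe(va, n_va, &valist, purge_list)
                  __free_vmap_area(va);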
    • mm: vmap fix overflow · 7766970c
      Committed by Nick Piggin
      The new vmap allocator can wrap the address and become confused when
      an allocation is large or VMALLOC_END is near the end of the address
      space.  (A sketch of the needed guard follows this entry.)
      
      Problem reported by Christoph Hellwig on a 32-bit XFS workload.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7766970c
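      A hedged sketch of the kind of wraparound guard such a fix needs in
      the allocator ('vend' and the 'overflow' label are illustrative, not
      taken from the commit):

          /*
           * With 'addr' near the top of the address space, 'addr + size'
           * can wrap past zero and still pass a naive "addr + size > vend"
           * check.  Testing the last byte of the range catches the wrap.
           */
          if (addr + size - 1 < addr)     /* arithmetic wrapped */
                  goto overflow;
          if (addr + size > vend)         /* runs past VMALLOC_END */
                  goto overflow;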
  9. 25 Feb 2009, 1 commit
  10. 24 Feb 2009, 1 commit
    • vmalloc: add @align to vm_area_register_early() · c0c0a293
      Committed by Tejun Heo
      Impact: allow larger alignment for early vmalloc area allocation
      
      Some early vmalloc users might want larger alignment, for example,
      for custom large page mapping.  Add @align to
      vm_area_register_early().  While at it, drop the docbook comment on
      the non-existent @size.  (The resulting prototype is sketched after
      this entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      c0c0a293
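      The resulting prototype, hedged from the description above
      (previously the function took only the vm_struct pointer):

          void __init vm_area_register_early(struct vm_struct *vm, size_t align);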
  11. 21 Feb 2009, 1 commit
  12. 20 Feb 2009, 3 commits
    • vmalloc: add un/map_kernel_range_noflush() · 8fc48985
      Committed by Tejun Heo
      Impact: two more public map/unmap functions
      
      Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
      These functions respectively map and unmap an address range in the
      kernel VM area, but do no vcache or TLB flushing.  They will be used
      by the new percpu allocator.  (Prototypes are sketched after this
      entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      8fc48985
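      The two prototypes, hedged from the description above (callers are
      responsible for any flush_cache_vmap()/flush_tlb_kernel_range() they
      need, hence "noflush"):

          /* Map [addr, addr + size) to 'pages'; no vcache or TLB flush. */
          int map_kernel_range_noflush(unsigned long addr, unsigned long size,
                                       pgprot_t prot, struct page **pages);

          /* Tear down [addr, addr + size); no vcache or TLB flush. */
          void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);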
    • vmalloc: implement vm_area_register_early() · f0aa6617
      Committed by Tejun Heo
      Impact: allow multiple early vm areas
      
      There are places where a kernel VM area needs to be allocated before
      vmalloc is initialized.  This is done by allocating a static
      vm_struct, initializing several fields and linking it onto vmlist;
      vmalloc initialization later picks these entries up from vmlist.
      This is currently done manually, and if there is more than one such
      area, there is no defined way to arbitrate who gets which address.
      
      This patch implements vm_area_register_early(), which takes a
      vm_struct with its flags and size initialized, assigns an address to
      it, and puts it on the vmlist.  This way, multiple early vm areas can
      determine which addresses they should use.  The only current user,
      alpha mm init, is converted to use it.  (A hypothetical usage sketch
      follows this entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f0aa6617
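      A hypothetical usage sketch (the caller name and NR_EARLY_PAGES are
      illustrative only, and the @align argument is the one added by the
      later c0c0a293 entry above):

          static struct vm_struct early_console_area;     /* static, pre-vmalloc */

          void __init arch_reserve_console_map(void)      /* hypothetical caller */
          {
                  early_console_area.flags = VM_IOREMAP;
                  early_console_area.size = NR_EARLY_PAGES * PAGE_SIZE;
                  vm_area_register_early(&early_console_area, PAGE_SIZE);
                  /* early_console_area.addr now holds the assigned address */
          }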
    • vmalloc: call flush_cache_vunmap() from unmap_kernel_range() · 73426952
      Committed by Tejun Heo
      Impact: proper vcache flush on unmap_kernel_range()
      
      flush_cache_vunmap() should be called before pages are unmapped.  Add
      a call to it in unmap_kernel_range().  (The resulting ordering is
      sketched after this entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      73426952
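      The resulting ordering in unmap_kernel_range(), sketched from the
      description (flush the vcache while the mapping still exists, then
      tear down the page tables, then flush TLBs):

          void unmap_kernel_range(unsigned long addr, unsigned long size)
          {
                  unsigned long end = addr + size;

                  flush_cache_vunmap(addr, end);          /* the call this patch adds */
                  vunmap_page_range(addr, end);           /* remove page table entries */
                  flush_tlb_kernel_range(addr, end);
          }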
  13. 19 Feb 2009, 1 commit
  14. 16 Jan 2009, 2 commits
    • revert "mm: vmalloc use mutex for purge" · 46666d8a
      Committed by Andrew Morton
      Revert commit e97a630e ("mm: vmalloc use
      mutex for purge")
      
      Bryan Donlan reports:
      
      : After testing 2.6.29-rc1 on xen-x86 with a btrfs root filesystem, I
      : got the OOPS quoted below and a hard freeze shortly after boot.
      : Boot messages and config are attached.
      :
      : ------------[ cut here ]------------
      : Kernel BUG at c05ef80d [verbose debug info unavailable]
      : invalid opcode: 0000 [#1] SMP
      : last sysfs file: /sys/block/xvdc/size
      : Modules linked in:
      :
      : Pid: 0, comm: swapper Not tainted (2.6.29-rc1 #6)
      : EIP: 0061:[<c05ef80d>] EFLAGS: 00010087 CPU: 2
      : EIP is at schedule+0x7cd/0x950
      : EAX: d5aeca80 EBX: 00000002 ECX: 00000000 EDX: d4cb9a40
      : ESI: c12f5600 EDI: d4cb9a40 EBP: d6033fa4 ESP: d6033ef4
      :  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
      : Process swapper (pid: 0, ti=d6032000 task=d6020b70 task.ti=d6032000)
      : Stack:
      :  000d85bc 00000000 000186a0 00000000 0dd11410 c0105417 c12efe00 0dc367c3
      :  00000011 c0105d46 d5a5d310 deadbeef d4cb9a40 c07cc600 c05f1340 c12e0060
      :  deadbeef d6020b70 d6020d08 00000002 c014377d 00000000 c12f5600 00002c22
      : Call Trace:
      :  [<c0105417>] xen_force_evtchn_callback+0x17/0x30
      :  [<c0105d46>] check_events+0x8/0x12
      :  [<c05f1340>] _spin_unlock_irqrestore+0x20/0x40
      :  [<c014377d>] hrtimer_start_range_ns+0x12d/0x2e0
      :  [<c014c4f6>] tick_nohz_restart_sched_tick+0x146/0x160
      :  [<c0107485>] cpu_idle+0xa5/0xc0
      
      and bisected it to this commit.
      
      Let's remove it now while we have a think about the problem.
      Reported-by: Bryan Donlan <bdonlan@gmail.com>
      Tested-by: Christophe Saout <christophe@saout.de>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46666d8a
    • alpha: fix vmalloc breakage · 822c18f2
      Committed by Ivan Kokshaysky
      On alpha, we have to map some stuff in the VMALLOC space very early
      in the boot process (to make SRM console callbacks work and so on,
      see arch/alpha/mm/init.c).  With the old VM allocator, we just
      manually placed a vm_struct onto the global vmlist, and this worked
      for ages.
      
      Unfortunately, the new allocator isn't aware of this, so it constantly
      tries to allocate the VM space which is already in use, making vmalloc on
      alpha defunct.
      
      This patch forces KVA to import vmlist entries on init.  (A
      reconstruction of the import loop follows this entry.)
      
      [akpm@linux-foundation.org: remove unneeded check (per Johannes)]
      Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      822c18f2
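      A hedged reconstruction of the import loop the patch adds to
      vmalloc_init(), so that the new allocator's tree knows about the
      addresses early boot code already claimed via vmlist:

          /* Import existing vmlist entries into the vmap_area tree. */
          for (tmp = vmlist; tmp; tmp = tmp->next) {
                  struct vmap_area *va = alloc_bootmem(sizeof(struct vmap_area));

                  va->flags = tmp->flags | VM_VM_AREA;
                  va->va_start = (unsigned long)tmp->addr;
                  va->va_end = va->va_start + tmp->size;
                  __insert_vmap_area(va);
          }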
  15. 07 Jan 2009, 4 commits
  16. 05 Jan 2009, 1 commit
  17. 11 Dec 2008, 1 commit
  18. 02 Dec 2008, 1 commit
    • mm: vmalloc fix lazy unmapping cache aliasing · b29acbdc
      Committed by Nick Piggin
      Jim Radford has reported that the vmap subsystem rewrite was
      sometimes causing his VIVT ARM system to behave strangely (it seemed
      to go into infinite loops trying to fault pages into userspace).
      
      We determined that the problem was most likely due to a cache
      aliasing issue.  flush_cache_vunmap was only being called at the
      moment the page tables were to be taken down; with lazy unmapping,
      however, this can happen after the page has already been freed and
      allocated for something else.  The dangling alias may still have
      dirty data attached to it.
      
      The fix for this problem is to do the cache flushing when the caller
      has called vunmap -- it would be a bug for them to write anything
      else to the mapping at that point.  (The shape of the fix is sketched
      after this entry.)
      
      That appeared to solve Jim's problems.
      Reported-by: Jim Radford <radford@blackbean.org>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b29acbdc
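      The shape of the fix, hedged from the description: flush the possibly
      dirty alias at vunmap time, while the mapping is still known, and
      keep only the page-table teardown and TLB flush lazy:

          static void free_unmap_vmap_area(struct vmap_area *va)
          {
                  /* Flush the virtual cache while the alias is still mapped... */
                  flush_cache_vunmap(va->va_start, va->va_end);
                  /* ...then defer the actual unmap + TLB flush as before. */
                  free_unmap_vmap_area_noflush(va);
          }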
  19. 20 Nov 2008, 3 commits
  20. 07 Nov 2008, 2 commits
  21. 31 Oct 2008, 1 commit
  22. 23 Oct 2008, 1 commit
  23. 21 Oct 2008, 1 commit
  24. 20 Oct 2008, 1 commit
    • mm: rewrite vmap layer · db64fe02
      Committed by Nick Piggin
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB
      flush.  This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a
      single lock.  It is an rwlock, but it is actually taken for write in
      all the fast paths, and the read locking would likely never run
      concurrently anyway, so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive
      than a single vunmap under the old scheme), but it should be OK if
      not called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram
      interfaces must be used in place of vmap and vunmap (prototypes are
      sketched after this entry).  Vmalloc does not use these interfaces at
      the moment, so it will not be quite so scalable (although it will use
      lazy TLB flushing).
      
      As a quick test of performance, I ran a test that loops in the
      kernel, linearly mapping, touching, then unmapping 4 pages.
      Different numbers of tests were run in parallel on a 4 core, 2 socket
      Opteron.  Results are in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being in the top 5 on the profiles and flushing
      the crap out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db64fe02
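      The per-CPU fast-path interface mentioned above, hedged from the
      description (prototypes per the 2.6.28-era include/linux/vmalloc.h):

          /* Map 'count' pages; 'node' hints NUMA placement of internal data. */
          void *vm_map_ram(struct page **pages, unsigned int count,
                           int node, pgprot_t prot);

          /* Unmap a region mapped by vm_map_ram(); 'count' must match. */
          void vm_unmap_ram(const void *mem, unsigned int count);

          /* Flush lazily unmapped aliases (needed by Xen, PAT and friends). */
          void vm_unmap_aliases(void);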
  25. 16 Oct 2008, 1 commit
    • Introduce is_vmalloc_or_module_addr() and use with DEBUG_VIRTUAL · 73bdf0a6
      Committed by Linus Torvalds
      Impact: crash on module insertion with CONFIG_DEBUG_VIRTUAL
      
      We would incorrectly BUG due to:
      
         VIRTUAL_BUG_ON(!is_vmalloc_addr(vmalloc_addr) &&
         	          !is_module_address(addr));
      
      ... because, at least on x86-64, is_module_address() doesn't do what
      it should.  This patch introduces is_vmalloc_or_module_addr(), which
      is what we really want anyway, and uses it instead (sketched after
      this entry).
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      73bdf0a6
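      A sketch of the helper, reconstructed from the description above (the
      MODULES_VADDR window test is the natural reading of "module address"
      for architectures with a dedicated module mapping area):

          static inline int is_vmalloc_or_module_addr(const void *x)
          {
          #if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
                  unsigned long addr = (unsigned long)x;

                  /* Module text lives in its own window on e.g. x86-64. */
                  if (addr >= MODULES_VADDR && addr < MODULES_END)
                          return 1;
          #endif
                  return is_vmalloc_addr(x);
          }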