1. 01 Apr, 2009 · 4 commits
  2. 31 Mar, 2009 · 1 commit
    • lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB · 19cefdff
      Committed by Ingo Molnar
      Impact: build fix
      
      fix typo in mm/slob.c:
      
       mm/slob.c:469: error: ‘flags’ undeclared (first use in this function)
       mm/slob.c:469: error: (Each undeclared identifier is reported only once
       mm/slob.c:469: error: for each function it appears in.)
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090128135457.350751756@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 30 Mar, 2009 · 2 commits
  4. 27 Mar, 2009 · 1 commit
    • writeback: double the dirty thresholds · 1b5e62b4
      Committed by Wu Fengguang
      Enlarge default dirty ratios from 5/10 to 10/20.  This fixes [Bug
      #12809] iozone regression with 2.6.29-rc6.
      
      The iozone benchmarks are performed on a 1200M file, with 8GB ram.
      
        iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
        iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
      
      The performance regression is triggered by commit 1cf6e7d8 (mm: task
      dirty accounting fix), which makes dirty accounting more correct and
      thorough.
      
      The default 5/10 dirty ratios were picked (a) with the old dirty logic
      and (b) largely at random and (c) designed to be aggressive.  In
      particular, that (a) means that having fixed some of the dirty
      accounting, maybe the real bug is now that it was always too aggressive,
      just hidden by an accounting issue.
      
      The enlarged 10/20 dirty ratios are just about enough to fix the regression.
      
      [ We will have to look at how this affects the old fsync() latency issue,
        but that probably will need independent work.  - Linus ]
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: "Lin, Ming M" <ming.m.lin@intel.com>
      Tested-by: "Lin, Ming M" <ming.m.lin@intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
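As a rough illustration of what doubling the ratios means on the 8GB test machine, the sketch below computes a threshold as a plain percentage of memory. The helper name `dirty_threshold_bytes` is hypothetical, and the calculation is a simplification: the kernel derives its limits from dirtyable memory rather than total RAM, so the real figures would differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: a threshold expressed as a percentage of a
 * memory size.  This is NOT the kernel's calculation, which works
 * from dirtyable memory rather than total RAM.
 */
static uint64_t dirty_threshold_bytes(uint64_t mem_bytes, unsigned int ratio_pct)
{
    assert(ratio_pct <= 100);
    /* Divide first so the multiplication cannot overflow. */
    return mem_bytes / 100 * ratio_pct;
}
```

Under this simplification, with 8 GiB of memory the 5/10 pair corresponds to roughly 410 MiB and 819 MiB, and the 10/20 pair to roughly 819 MiB and 1638 MiB.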
  5. 26 Mar, 2009 · 1 commit
  6. 23 Mar, 2009 · 2 commits
  7. 16 Mar, 2009 · 1 commit
    • highmem: atomic highmem kmap page pinning · 3297e760
      Committed by Nicolas Pitre
      Most ARM machines have a non IO coherent cache, meaning that the
      dma_map_*() set of functions must clean and/or invalidate the affected
      memory manually before DMA occurs.  And because the majority of those
      machines have a VIVT cache, the cache maintenance operations must be
      performed using virtual addresses.
      
      When a highmem page is kunmap'd, its mapping (and cache) remains in place
      in case it is kmap'd again. However if dma_map_page() is then called with
      such a page, some cache maintenance on the remaining mapping must be
      performed. In that case, page_address(page) is non null and we can use
      that to synchronize the cache.
      
      It is unlikely but still possible for kmap() to race and recycle the
      virtual address obtained above, and use it for another page before some
      on-going cache invalidation loop in dma_map_page() is done. In that case,
      the new mapping could end up with dirty cache lines for another page,
      and the unsuspecting cache invalidation loop in dma_map_page() might
      simply discard those dirty cache lines resulting in data loss.
      
      For example, let's consider this sequence of events:
      
      	- dma_map_page(..., DMA_FROM_DEVICE) is called on a highmem page.
      
      	-->	- vaddr = page_address(page) is non null. In this case
      		it is likely that the page has valid cache lines
      		associated with vaddr. Remember that the cache is VIVT.
      
      		-->	for (i = vaddr; i < vaddr + PAGE_SIZE; i += 32)
      				invalidate_cache_line(i);
      
      	*** preemption occurs in the middle of the loop above ***
      
      	- kmap_high() is called for a different page.
      
      	-->	- last_pkmap_nr wraps to zero and flush_all_zero_pkmaps()
      		  is called.  The pkmap_count value for the page passed
      		  to dma_map_page() above happens to be 1, so the page
      		  is unmapped.  But prior to that, flush_cache_kmaps()
      		  cleared the cache for it.  So far so good.
      
      		- A fresh pkmap entry is assigned for this kmap request.
      		  Murphy's law says this pkmap entry will eventually
      		  happen to use the same vaddr as the one which used to
      		  belong to the other page being processed by
      		  dma_map_page() in the preempted thread above.
      
      	- The kmap_high() caller starts dirtying the cache using the
      	  just assigned virtual mapping for its page.
      
      	*** the first thread is rescheduled ***
      
      			- The for(...) loop is resumed, but now cached
      			  data belonging to a different physical page is
      			  being discarded!
      
      And this is not only a preemption issue as ARM can be SMP as well,
      making the above scenario just as likely. Hence the need for some kind
      of pkmap page pinning which can be used in any context, primarily for
      the benefit of dma_map_page() on ARM.
      
      This provides the necessary interface to cope with the above issue if
      ARCH_NEEDS_KMAP_HIGH_GET is defined, otherwise the resulting code is
      unchanged.
      Signed-off-by: Nicolas Pitre <nico@marvell.com>
      Reviewed-by: MinChan Kim <minchan.kim@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
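The pinning idea can be sketched in user-space C as a reference count that blocks recycling while a mapping is in use. This is a minimal sketch, not the kernel implementation: `struct map_slot`, `slot_pin()`, `slot_unpin()` and `slot_try_recycle()` are hypothetical names, and C11 atomics stand in for whatever synchronization the kernel actually uses. The count convention (1 means mapped but unused) follows the pkmap_count description in the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one mapping slot; not a kernel structure. */
struct map_slot {
    atomic_int count;   /* 0: free, 1: mapped but unused, >1: pinned */
    void *vaddr;        /* virtual address the slot maps the page at */
};

/* Pin the slot only if a mapping already exists; NULL if it does not. */
static void *slot_pin(struct map_slot *s)
{
    int c = atomic_load(&s->count);

    while (c >= 1) {
        /* On failure, c is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&s->count, &c, c + 1))
            return s->vaddr;
    }
    return NULL;
}

/* Drop a reference previously taken with slot_pin(). */
static void slot_unpin(struct map_slot *s)
{
    atomic_fetch_sub(&s->count, 1);
}

/* Recycling succeeds only when the slot is mapped and nobody holds a pin. */
static bool slot_try_recycle(struct map_slot *s)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&s->count, &expected, 0);
}
```

In this model, a cache maintenance loop would run between `slot_pin()` and `slot_unpin()`, so a concurrent `slot_try_recycle()` fails instead of handing the virtual address to another page.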
  8. 15 Mar, 2009 · 1 commit
  9. 14 Mar, 2009 · 1 commit
  10. 13 Mar, 2009 · 2 commits
  11. 11 Mar, 2009 · 1 commit
  12. 10 Mar, 2009 · 3 commits
    • percpu: generalize embedding first chunk setup helper · 66c3a757
      Committed by Tejun Heo
      Impact: code reorganization
      
      Separate out embedding first chunk setup helper from x86 embedding
      first chunk allocator and put it in mm/percpu.c.  This will be used by
      the default percpu first chunk allocator and possibly by other archs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk() · 6074d5b0
      Committed by Tejun Heo
      Impact: cleanup, more flexibility for first chunk init
      
      Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto.
      This restriction stemmed from implementation detail and made things a
      bit less intuitive.  This patch allows @dyn_size to be specified
      regardless of @unit_size and swaps the positions of @dyn_size and
      @unit_size so that the parameter order makes more sense (static,
      reserved and dyn sizes followed by enclosing unit_size).
      
      While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: make x86 addr <-> pcpu ptr conversion macros generic · e0100983
      Committed by Tejun Heo
      Impact: generic addr <-> pcpu ptr conversion macros
      
      There's nothing arch specific about x86 __addr_to_pcpu_ptr() and
      __pcpu_ptr_to_addr().  With proper __per_cpu_load and __per_cpu_start
      defined, they'll do the right thing regardless of actual layout.
      
      Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c
      and allow archs to override it as necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
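A minimal user-space sketch of this kind of offset-based conversion follows. It is an assumption about the general shape, not a copy of the kernel macros: `static_start` plays the role of `__per_cpu_start`, `chunk_base` is a hypothetical stand-in for the first chunk's runtime base address, and both function names are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the link-time start symbol and runtime base. */
static char static_start[256];  /* plays the role of __per_cpu_start */
static char chunk_base[256];    /* plays the role of the chunk's base address */

/* Translate an address inside the chunk into its link-time equivalent. */
static void *addr_to_pcpu_ptr(void *addr)
{
    return (void *)((uintptr_t)addr - (uintptr_t)chunk_base
                    + (uintptr_t)static_start);
}

/* Inverse translation: link-time pointer back to an address in the chunk. */
static void *pcpu_ptr_to_addr(void *ptr)
{
    return (void *)((uintptr_t)ptr + (uintptr_t)chunk_base
                    - (uintptr_t)static_start);
}
```

Because both directions only add or subtract the same constant offset, nothing in the arithmetic depends on a particular architecture's layout, which is the point the commit message makes.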
  13. 07 Mar, 2009 · 1 commit
    • percpu: finer grained locking to break deadlock and allow atomic free · ccea34b5
      Committed by Tejun Heo
      Impact: fix deadlock and allow atomic free
      
      Percpu allocation always uses GFP_KERNEL and whole alloc/free paths
      were protected by single mutex.  All percpu allocations have been from
      GFP_KERNEL-safe context and the original allocator had this assumption
      too.  However, by protecting both alloc and free paths with the same
      mutex, the new allocator creates free -> alloc -> GFP_KERNEL
      dependency which the original allocator didn't have.  This can lead to
      deadlock if free is called from FS or IO paths.  Also, in general,
      allocators are expected to allow free to be called from atomic
      context.
      
      This patch implements finer grained locking to break the deadlock and
      allow atomic free.  For details, please read the "Synchronization
      rules" comment.
      
      While at it, also add CONTEXT: to function comments to describe which
      context they expect to be called from and what they do to it.
      
      This problem was reported by Thomas Gleixner and Peter Zijlstra.
      
        http://thread.gmane.org/gmane.linux.kernel/802384
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Peter Zijlstra <peterz@infradead.org>
  14. 06 Mar, 2009 · 8 commits
    • percpu: move fully free chunk reclamation into a work · a56dbddf
      Committed by Tejun Heo
      Impact: code reorganization for later changes
      
      Do fully free chunk reclamation using a work.  This change is to
      prepare for locking changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: move chunk area map extension out of area allocation · 9f7dcf22
      Committed by Tejun Heo
      Impact: code reorganization for later changes
      
      Separate out chunk area map extension into a separate function -
      pcpu_extend_area_map() - and call it directly from pcpu_alloc() such
      that pcpu_alloc_area() is guaranteed to have enough area map slots on
      invocation.
      
      With this change, pcpu_alloc_area() does only area allocation and the
      only failure mode is when the chunk doesn't have enough room, so
      there's no need to distinguish it from memory allocation failures.
      Make it return -1 on such cases instead of hacky -ENOSPC.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: replace pcpu_realloc() with pcpu_mem_alloc() and pcpu_mem_free() · 1880d93b
      Committed by Tejun Heo
      Impact: code reorganization for later changes
      
      With static map handling moved to pcpu_split_block(), pcpu_realloc()
      only clutters the code and it's also unsuitable for scheduled locking
      changes.  Implement and use pcpu_mem_alloc/free() instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu, module: implement reserved allocation and use it for module percpu variables · edcb4639
      Committed by Tejun Heo
      Impact: add reserved allocation functionality and use it for module
      	percpu variables
      
      This patch implements reserved allocation from the first chunk.  When
      setting up the first chunk, arch can ask to set aside certain number
      of bytes right after the core static area which is available only
      through a separate reserved allocator.  This will be used primarily
      for module static percpu variables on architectures with limited
      relocation range to ensure that the module percpu symbols are inside
      the relocatable range.
      
      If reserved area is requested, the first chunk becomes reserved and
      isn't available for regular allocation.  If the first chunk also
      includes piggy-back dynamic allocation area, a separate chunk mapping
      the same region is created to serve dynamic allocation.  The first one
      is called static first chunk and the second dynamic first chunk.
      Although they share the page map, their different area map
      initializations guarantee they serve disjoint areas according to their
      purposes.
      
      If arch doesn't set up reserved area, reserved allocation is handled
      like any other allocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: add an indirection ptr for chunk page map access · 3e24aa58
      Committed by Tejun Heo
      Impact: allow sharing page map, no functional difference yet
      
      Make chunk->page access indirect by adding a pointer and renaming the
      actual array to page_ar.  This will be used by future changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: use negative for auto for pcpu_setup_first_chunk() arguments · cafe8816
      Committed by Tejun Heo
      Impact: argument semantic cleanup
      
      In pcpu_setup_first_chunk(), zero @unit_size and @dyn_size meant
      auto-sizing.  It's okay for @unit_size as 0 doesn't make sense but 0
      dynamic reserve size is valid.  Also, if arch @dyn_size is calculated
      from other parameters, it might end up passing in 0 @dyn_size and
      malfunction when the size is automatically adjusted.
      
      This patch makes both @unit_size and @dyn_size ssize_t and uses -1 for
      auto sizing.
      Signed-off-by: Tejun Heo <tj@kernel.org>
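The sentinel change can be illustrated with a small sketch. The helper `resolve_size()` and its parameters are hypothetical and only show why a negative value is a safer "auto" marker than zero: zero remains available as a legitimate explicit size.

```c
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/*
 * Hypothetical helper: a negative request means "pick the size
 * automatically"; zero and positive values are taken literally.
 */
static size_t resolve_size(ssize_t requested, size_t auto_size)
{
    return requested < 0 ? auto_size : (size_t)requested;
}
```

With this convention, a caller that computes a dynamic size of 0 from other parameters gets exactly 0 rather than silently triggering auto-sizing.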
    • percpu: improve first chunk initial area map handling · 61ace7fa
      Committed by Tejun Heo
      Impact: no functional change
      
      When the first chunk is created, its initial area map is not allocated
      because kmalloc isn't online yet.  The map is allocated and
      initialized on the first allocation request on the chunk.  This works
      fine but the scattering of initialization logic between the init
      function and allocation path is a bit confusing.
      
      This patch makes the first chunk initialize and use minimal statically
      allocated map from pcpu_setup_first_chunk().  The map resizing path
      still needs to handle this specially but it's more straight-forward
      and gives more latitude to the init path.  This will ease future
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: cosmetic renames in pcpu_setup_first_chunk() · 2441d15c
      Committed by Tejun Heo
      Impact: cosmetic, preparation for future changes
      
      Make the following renames in pcpu_setup_first_chunk() in preparation
      for future changes.
      
      * s/free_size/dyn_size/
      * s/static_vm/first_vm/
      * s/static_chunk/schunk/
      Signed-off-by: Tejun Heo <tj@kernel.org>
  15. 02 Mar, 2009 · 1 commit
    • x86, mm: dont use non-temporal stores in pagecache accesses · f1800536
      Committed by Ingo Molnar
      Impact: standardize IO on cached ops
      
      On modern CPUs it is almost always a bad idea to use non-temporal stores,
      as the regression in this commit has shown:
      
        30d697fa: x86: fix performance regression in write() syscall
      
      The kernel simply has no good information about whether using non-temporal
      stores is a good idea or not - and trying to add heuristics only increases
      complexity and inserts fragility.
      
      The regression on cached write()s took very long to be found - over two
      years. So don't take any chances and let the hardware decide how it makes
      use of its caches.
      
      The only exception is drivers/gpu/drm/i915/i915_gem.c: there we are
      absolutely sure that another entity (the GPU) will pick up the dirty
      data immediately and that the CPU will not touch that data before the
      GPU will.
      
      Also, keep the _nocache() primitives to make it easier for people to
      experiment with these details. There may be more clear-cut cases where
      non-cached copies can be used, outside of filemap.c.
      
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 01 Mar, 2009 · 2 commits
    • bootmem, x86: further fixes for arch-specific bootmem wrapping · d0c4f570
      Committed by Tejun Heo
      Impact: fix new breakages introduced by previous fix
      
      Commit c1329375 tried to clean up bootmem arch wrapper but it wasn't
      quite correct.  Before the commit, the following were broken.
      
      * Low level interface functions prefixed with __ ignored arch
        preference.
      
      * reserve_bootmem(...) can't be mapped into
        reserve_bootmem_node(NODE_DATA(0)->bdata, ...) because the node is
        not a preference here.  The region specified MUST fall into the
        specified region; otherwise, it will panic.
      
      After the commit,
      
      * If allocation fails for the arch preferred node, it should fall back
        to whatever is available.  Instead, it simply failed allocation.
      
      There are too many internal details to allow generic wrapping and
      still keep things simple for archs.  Plus, all that arch wants is a
      way to prefer certain node over another.
      
      This patch drops the generic wrapping around alloc_bootmem_core() and
      adds alloc_arch_preferred_bootmem() instead.  If necessary, arch can
      define bootmem_arch_preferred_node() macro or function which takes all
      allocation information and returns the preferred node.  bootmem
      generic code will always try the preferred node first and then
      fall back to other nodes as usual.
      
      Breakages noted and changes reviewed by Johannes Weiner.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
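The "try the preferred node first, then fall back" policy described in this commit can be sketched as follows. This is a simplified user-space model under stated assumptions: `struct node_pool`, `node_alloc()` and `alloc_prefer_node()` are hypothetical names, and each node is reduced to a byte counter rather than a real bootmem bitmap.

```c
#include <stddef.h>

/* Hypothetical per-node pool: only tracks how many bytes remain. */
struct node_pool {
    size_t free_bytes;
};

/* Try to take `size` bytes from one node; 0 on success, -1 on failure. */
static int node_alloc(struct node_pool *n, size_t size)
{
    if (n->free_bytes < size)
        return -1;
    n->free_bytes -= size;
    return 0;
}

/*
 * Try the preferred node first, then every other node in order.
 * Returns the index of the node that satisfied the request, or -1.
 */
static int alloc_prefer_node(struct node_pool nodes[], int nr_nodes,
                             int preferred, size_t size)
{
    if (preferred >= 0 && preferred < nr_nodes &&
        node_alloc(&nodes[preferred], size) == 0)
        return preferred;

    for (int i = 0; i < nr_nodes; i++) {
        if (i == preferred)
            continue;   /* already tried above */
        if (node_alloc(&nodes[i], size) == 0)
            return i;
    }
    return -1;
}
```

The preference is only a hint here: a request that does not fit on the preferred node still succeeds if any other node has room, which is the behaviour the earlier cleanup had lost.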
    • percpu: kill compile warning in pcpu_populate_chunk() · 02d51fdf
      Committed by Tejun Heo
      Impact: remove compile warning
      
      Mark local variable map_end in pcpu_populate_chunk() with
      uninitialized_var().  The variable is always used in tandem with
      map_start and guaranteed to be initialized before use but gcc doesn't
      understand that.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Ingo Molnar <mingo@elte.hu>
  17. 28 Feb, 2009 · 2 commits
    • mm: fix lazy vmap purging (use-after-free error) · cbb76676
      Committed by Vegard Nossum
      I just got this new warning from kmemcheck:
      
          WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
          a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
           f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
           ^
      
          Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
          EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0
          EIP is at __purge_vmap_area_lazy+0x117/0x140
          EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
          ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
           DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
          CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
          DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
          DR6: 00004000 DR7: 00000000
           [<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70
           [<c1096f6a>] remove_vm_area+0x2a/0x70
           [<c1097025>] __vunmap+0x45/0xe0
           [<c10970de>] vunmap+0x1e/0x30
           [<c1008ba5>] text_poke+0x95/0x150
           [<c1008ca9>] alternatives_smp_unlock+0x49/0x60
           [<c171ef47>] alternative_instructions+0x11b/0x124
           [<c171f991>] check_bugs+0xbd/0xdc
           [<c17148c5>] start_kernel+0x2ed/0x360
           [<c171409e>] __init_begin+0x9e/0xa9
           [<ffffffff>] 0xffffffff
      
      It happened here:
      
          $ addr2line -e vmlinux -i c1096df7
          mm/vmalloc.c:540
      
      Code:
      
      	list_for_each_entry(va, &valist, purge_list)
      		__free_vmap_area(va);
      
      It's this instruction:
      
          mov    0x20(%ebx),%edx
      
      Which corresponds to a dereference of va->purge_list.next:
      
          (gdb) p ((struct vmap_area *) 0)->purge_list.next
          Cannot access memory at address 0x20
      
      It seems that we should use "safe" list traversal here, as the element
      is freed inside the loop. Please verify that this is the right fix.
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
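The bug class reported here, reading an element's link after the element has been freed, can be reproduced and avoided in a few lines of user-space C. The sketch below uses a hypothetical singly linked `struct node` rather than the kernel's list API. The kernel's `list_for_each_entry_safe()` follows the same idea of remembering the next element before the loop body runs.

```c
#include <stdlib.h>

/* Hypothetical list node for illustration; not a kernel structure. */
struct node {
    struct node *next;
    int value;
};

/* Prepend a node; returns the new head (or the old one if malloc fails). */
static struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return head;
    n->value = value;
    n->next = head;
    return n;
}

/* Free every node and return how many were freed. */
static size_t free_all(struct node *head)
{
    size_t freed = 0;
    struct node *cur = head;

    while (cur) {
        /* Read the link BEFORE freeing: cur is invalid afterwards. */
        struct node *next = cur->next;

        free(cur);
        cur = next;
        freed++;
    }
    return freed;
}
```

Writing the loop as `for (cur = head; cur; cur = cur->next) free(cur);` would dereference freed memory on every step, which is the pattern kmemcheck flagged.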
    • mm: vmap fix overflow · 7766970c
      Committed by Nick Piggin
      The new vmap allocator can wrap the address and get confused in the case
      of large allocations or VMALLOC_END near the end of address space.
      
      Problem reported by Christoph Hellwig on a 32-bit XFS workload.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
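Address wraparound of this kind is easy to demonstrate with unsigned arithmetic. The sketch below contrasts a naive range check with an overflow-safe one. Both function names are hypothetical and this is not the code from the patch, only an illustration of why `addr + size` cannot be trusted near the top of the address space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive check: addr + size may wrap past zero and look small. */
static bool naive_range_fits(uintptr_t addr, uintptr_t size, uintptr_t end)
{
    return addr + size <= end;
}

/* Overflow-safe check: compare size against the room left below end. */
static bool range_fits(uintptr_t addr, uintptr_t size, uintptr_t end)
{
    if (addr > end)
        return false;
    return size <= end - addr;  /* end - addr cannot wrap here */
}
```

For a start address 0x1000 bytes below the top of the address space and a size of 0x2000, the naive version wraps and accepts the range, while the safe version rejects it.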
  18. 26 Feb, 2009 · 1 commit
    • shmem: fix shared anonymous accounting · 0b0a0806
      Committed by Hugh Dickins
      Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost
      400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should
      prohibit.
      
      Commit fc8744ad "Stop playing silly
      games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the
      shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in
      the vma flags; but did so only _after_ the shmem_acct_size(flags, size)
      call which is expected to pre-account a shared anonymous object.
      
      It's all clearer if we switch shmem.c over to use VM_NORESERVE
      throughout in place of !VM_ACCOUNT.
      
      But I very nearly sent in a patch which mistakenly removed the
      accounting from tmpfs files: shmem_get_inode()'s memset was good for not
      setting VM_ACCOUNT, but now it needs to set VM_NORESERVE.
      
      Rather than setting that by default, then perhaps clearing it again in
      shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that
      allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup().
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 25 Feb, 2009 · 3 commits
  20. 24 Feb, 2009 · 2 commits
    • percpu: add __read_mostly to variables which are mostly read only · 40150d37
      Committed by Tejun Heo
      Most global variables in percpu allocator are initialized during boot
      and read only from that point on.  Add __read_mostly as per Rusty's
      suggestion.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
    • percpu: give more latitude to arch specific first chunk initialization · 8d408b4b
      Committed by Tejun Heo
      Impact: more latitude for first percpu chunk allocation
      
      The first percpu chunk serves the kernel static percpu area and may or
      may not contain extra room for further dynamic allocation.
      Initialization of the first chunk needs to be done before normal
      memory allocation service is up, so it has its own init path -
      pcpu_setup_static().
      
      It seems archs need more latitude while initializing the first chunk
      for example to take advantage of large page mapping.  This patch makes
      the following changes to allow this.
      
      * Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space
        to reserve in the first chunk for further dynamic allocation.
      
      * Rename pcpu_setup_static() to pcpu_setup_first_chunk().
      
      * Make pcpu_setup_first_chunk() much more flexible by fetching page
        pointer by callback and adding optional @unit_size, @free_size and
        @base_addr arguments which allow archs to selectively customize
        parts of chunk initialization to their liking.
      Signed-off-by: Tejun Heo <tj@kernel.org>