1. 22 Feb, 2009 3 commits
  2. 21 Feb, 2009 2 commits
  3. 19 Feb, 2009 4 commits
    • mm: fix memmap init for handling memory hole · cc2559bc
      KAMEZAWA Hiroyuki committed
      Now, early_pfn_in_nid(PFN, NID) may return false if PFN is a hole,
      in which case memmap initialization is not done for that PFN. This
      caused trouble when booting sparc.
      
      To fix this, the PFN should be initialized and marked as PG_reserved.
      This patch changes early_pfn_in_nid() to return true if PFN is a hole.
      
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reported-by: David Miller <davem@davemlloft.net>
      Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
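
The behaviour change described in cc2559bc can be sketched in userspace roughly as follows. This is a minimal model, not the kernel code: the `model_*` names, the `pfn_range` struct, and the two-entry `demo_map` layout are all assumptions made for illustration, and a hole is represented by a lookup that returns -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one early_node_map[] entry. */
struct pfn_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
	int nid;
};

/* Illustrative layout: two nodes with holes between and after them. */
static const struct pfn_range demo_map[] = {
	{ 0x000000, 0x020000, 0 },
	{ 0x800000, 0x81f7ff, 1 },
};

/* Return the node owning pfn, or -1 if pfn falls into a hole. */
int model_pfn_to_nid(unsigned long pfn)
{
	size_t i;

	for (i = 0; i < sizeof(demo_map) / sizeof(demo_map[0]); i++) {
		if (pfn >= demo_map[i].start_pfn && pfn < demo_map[i].end_pfn)
			return demo_map[i].nid;
	}
	return -1;
}

/* Old behaviour: a hole matches no node, so its memmap is never set up. */
bool model_pfn_in_nid_old(unsigned long pfn, int node)
{
	return model_pfn_to_nid(pfn) == node;
}

/* New behaviour: only a pfn known to belong to another node is rejected. */
bool model_pfn_in_nid_new(unsigned long pfn, int node)
{
	int nid = model_pfn_to_nid(pfn);

	assert(node >= 0);
	if (nid >= 0 && nid != node)
		return false;
	return true;	/* matching node, or a hole */
}
```

In this model the hole at pfn 0x81f7ff is skipped by the old check for every node, while the new check accepts it, so the caller gets a chance to initialize the page and mark it reserved.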
    • mm: clean up for early_pfn_to_nid() · f2dbcfa7
      KAMEZAWA Hiroyuki committed
      What's happening is that the assertion in mm/page_alloc.c:move_freepages()
      is triggering:
      
      	BUG_ON(page_zone(start_page) != page_zone(end_page));
      
      Once I knew this is what was happening, I added some annotations:
      
      	if (unlikely(page_zone(start_page) != page_zone(end_page))) {
      		printk(KERN_ERR "move_freepages: Bogus zones: "
      		       "start_page[%p] end_page[%p] zone[%p]\n",
      		       start_page, end_page, zone);
      		printk(KERN_ERR "move_freepages: "
      		       "start_zone[%p] end_zone[%p]\n",
      		       page_zone(start_page), page_zone(end_page));
      		printk(KERN_ERR "move_freepages: "
      		       "start_pfn[0x%lx] end_pfn[0x%lx]\n",
      		       page_to_pfn(start_page), page_to_pfn(end_page));
      		printk(KERN_ERR "move_freepages: "
      		       "start_nid[%d] end_nid[%d]\n",
      		       page_to_nid(start_page), page_to_nid(end_page));
       ...
      
      And here's what I got:
      
      	move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
      	move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
      	move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
      	move_freepages: start_nid[1] end_nid[0]
      
      My memory layout on this box is:
      
      [    0.000000] Zone PFN ranges:
      [    0.000000]   Normal   0x00000000 -> 0x0081ff5d
      [    0.000000] Movable zone start PFN for each node
      [    0.000000] early_node_map[8] active PFN ranges
      [    0.000000]     0: 0x00000000 -> 0x00020000
      [    0.000000]     1: 0x00800000 -> 0x0081f7ff
      [    0.000000]     1: 0x0081f800 -> 0x0081fe50
      [    0.000000]     1: 0x0081fed1 -> 0x0081fed8
      [    0.000000]     1: 0x0081feda -> 0x0081fedb
      [    0.000000]     1: 0x0081fedd -> 0x0081fee5
      [    0.000000]     1: 0x0081fee7 -> 0x0081ff51
      [    0.000000]     1: 0x0081ff59 -> 0x0081ff5d
      
      So it's a block move in that 0x81f600-->0x81f7ff region which triggers
      the problem.
      
      This patch:
      
      Declarations of early_pfn_to_nid() are scattered over per-arch include
      files, and it is complicated to know when each declaration is used.
      I think this makes the memmap-init fix harder.
      
      This patch moves all declarations to include/linux/mm.h.
      
      After this,
        if !CONFIG_ARCH_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
           -> Use static definition in include/linux/mm.h
        else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
           -> Use generic definition in mm/page_alloc.c
        else
           -> per-arch back end function will be called.
      
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: David Miller <davem@davemlloft.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: task dirty accounting fix · 1cf6e7d8
      Nick Piggin committed
      YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
      cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).
      
      Additionally, there is some inconsistency about when task_dirty_inc is
      called.  It is used for dirty balancing, however it even gets called for
      __set_page_dirty_no_writeback.
      
      So rather than increment it in a set_page_dirty wrapper, move it down to
      exactly where the dirty page accounting stats are incremented.
      
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmalloc: add __get_vm_area_caller() · c2968612
      Benjamin Herrenschmidt committed
      We have get_vm_area_caller() and __get_vm_area() but not
      __get_vm_area_caller()
      
      On powerpc, I use __get_vm_area() to separate the ranges of addresses
      given to vmalloc vs.  ioremap (various good reasons for that) so in order
      to be able to implement the new caller tracking in /proc/vmallocinfo, I
      need a "_caller" variant of it.
      
      (akpm: needed for ongoing powerpc development, so merge it early)
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 18 Feb, 2009 1 commit
  5. 13 Feb, 2009 1 commit
    • Fix page writeback thinko, causing Berkeley DB slowdown · 3a4c6800
      Nick Piggin committed
      A bug was introduced into write_cache_pages cyclic writeout by commit
      31a12666 ("mm: write_cache_pages cyclic fix").  The intention (and
      comments) is that we should cycle back and
      look for more dirty pages at the beginning of the file if there is no
      more work to be done.
      
      But the !done condition was dropped from the test.  This means that any
      time the page writeout loop breaks (eg.  due to nr_to_write == 0), we
      will set index to 0, then goto again.  This will set done_index to
      index, then find done is set, so will proceed to the end of the
      function.  When updating mapping->writeback_index for cyclic writeout,
      we now use done_index == 0, so we're always cycling back to 0.
      
      This seemed to be causing random mmap writes (slapadd and iozone) to
      start writing more pages from the LRU, and writeout would slow down.
      It caused bugzilla entry
      
      	http://bugzilla.kernel.org/show_bug.cgi?id=12604
      
      about Berkeley DB slowing down dramatically.
      
      With this patch, iozone random write performance is increased nearly
      5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
      
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reported-and-tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
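
The control flow described in 3a4c6800 can be traced with a small userspace model. This is a sketch under stated assumptions, not the kernel function: `demo_mapping`, `demo_write_cache_pages()`, the eight-page file, and the `check_done` switch are all hypothetical, and the model keeps only the index bookkeeping that the commit message talks about.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_PAGES 8

/* Hypothetical, much-simplified stand-in for an address_space. */
struct demo_mapping {
	bool dirty[DEMO_NR_PAGES];
	unsigned long writeback_index;	/* where the next cyclic pass starts */
};

/*
 * Write back up to nr_to_write dirty pages, starting at writeback_index
 * and wrapping to page 0 at most once. Returns the number of pages written.
 * check_done == true models the fixed test (!cycled && !done);
 * check_done == false models the buggy test (!cycled).
 */
long demo_write_cache_pages(struct demo_mapping *m, long nr_to_write,
			    bool check_done)
{
	unsigned long index = m->writeback_index;
	unsigned long end = DEMO_NR_PAGES;	/* exclusive */
	unsigned long done_index;
	bool cycled = (index == 0);	/* starting at 0 covers the whole file */
	bool done = false;
	long written = 0;

	assert(nr_to_write > 0);
retry:
	done_index = index;
	while (!done && index < end) {
		if (m->dirty[index]) {
			m->dirty[index] = false;
			written++;
			done_index = index + 1;
			if (--nr_to_write == 0)
				done = true;	/* budget used up */
		}
		index++;
	}
	if (!cycled && (!check_done || !done)) {
		/* Wrap around and scan [0, writeback_index). */
		cycled = true;
		index = 0;
		end = m->writeback_index;
		goto retry;
	}
	m->writeback_index = done_index;
	return written;
}
```

With every page dirty, a start index of 4, and a budget of 2, the buggy variant leaves `writeback_index` at 0, whereas the fixed variant leaves it at 6, so the next pass resumes where this one stopped.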
  6. 12 Feb, 2009 8 commits
  7. 11 Feb, 2009 2 commits
    • x86, ptrace, mm: fix double-free on race · 9f339e70
      Markus Metzger committed
      ptrace_detach() races with __ptrace_unlink() if the traced task is
      reaped while detaching. This might cause a double-free of the BTS
      buffer.
      
      Change the ptrace_detach() path to only do the memory accounting in
      ptrace_bts_detach() and leave freeing the buffer to ptrace_bts_untrace(),
      which will be called from __ptrace_unlink().
      
      The fix follows a proposal from Oleg Nesterov.
      
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Do not account for the address space used by hugetlbfs using VM_ACCOUNT · 5a6fe125
      Mel Gorman committed
      When overcommit is disabled, the core VM accounts for pages used by anonymous
      shared, private mappings and special mappings. It keeps track of VMAs that
      should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
      with VM_NORESERVE.
      
      Overcommit for hugetlbfs is much riskier than overcommit for base pages
      due to contiguity requirements. It avoids overcommitting on both shared and
      private mappings using reservation counters that are checked and updated
      during mmap(). This ensures (within limits) that hugepages exist in the
      future when faults occur; otherwise it is too easy for applications to be
      SIGKILLed.
      
      As hugetlbfs makes its own reservations of a different unit to the base page
      size, VM_ACCOUNT should never be set. Even if the units were correct, we would
      double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
      be set because an application can request no reserves be made for hugetlbfs
      at the risk of getting killed later.
      
      With commit fc8744ad, VM_NORESERVE and VM_ACCOUNT are getting
      unconditionally set for hugetlbfs-backed mappings. This breaks the
      accounting for both the core VM and hugetlbfs: it can trigger an OOM
      storm when hugepage pools are too small, and lockups and corrupted
      counters otherwise. This patch brings hugetlbfs more in line with how the
      core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 09 Feb, 2009 1 commit
  9. 06 Feb, 2009 1 commit
    • do_wp_page: fix regression with execute in place · ab92661d
      Carsten Otte committed
      Fix do_wp_page for VM_MIXEDMAP mappings.
      
      In the case where pfn_valid returns 0 for a pfn at the beginning of
      do_wp_page and the mapping is not shared writable, the code branches to
      label `gotten:' with old_page == NULL.
      
      In case the vma is locked (vma->vm_flags & VM_LOCKED), lock_page,
      clear_page_mlock, and unlock_page try to access the old_page.
      
      This patch checks whether old_page is valid before it is dereferenced.
      
      The regression was introduced by "mlock: mlocked pages are unevictable"
      (commit b291f000).
      
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
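
The guard described in ab92661d amounts to checking the pointer before the mlock handling touches it. The following is a minimal userspace sketch, assuming hypothetical `demo_page` and `demo_vma` types and an illustrative `DEMO_VM_LOCKED` value; the single assignment stands in for the lock_page, clear_page_mlock, and unlock_page sequence.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_VM_LOCKED 0x1UL	/* illustrative value, not the kernel's */

/* Hypothetical stand-ins for struct page and struct vm_area_struct. */
struct demo_page { int mlocked; };
struct demo_vma { unsigned long vm_flags; };

/*
 * Model of the mlock cleanup on the copy-on-write path. Returns 1 if the
 * old page was touched, 0 if it was skipped. old_page may be NULL when the
 * original pfn had no struct page behind it.
 */
int demo_cow_clear_mlock(const struct demo_vma *vma, struct demo_page *old_page)
{
	assert(vma != NULL);
	/* The old_page test is what the fix adds in this model. */
	if ((vma->vm_flags & DEMO_VM_LOCKED) && old_page) {
		old_page->mlocked = 0;
		return 1;
	}
	return 0;
}
```

Without the `old_page` test, a locked vma combined with a NULL `old_page` would dereference a null pointer, which is the regression the commit message describes.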
  10. 04 Feb, 2009 1 commit
    • write-back: fix nr_to_write counter · dcf6a79d
      Artem Bityutskiy committed
      Commit 05fe478d introduced some @wbc->nr_to_write breakage.
      
      It made the following changes:
       1. Decrement wbc->nr_to_write instead of nr_to_write
       2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
       3. If synced nr_to_write pages, stop only if wbc->sync_mode ==
          WB_SYNC_NONE, otherwise keep going.
      
      However, according to the commit message, the intention was to only make
      change 3.  Change 1 is a bug.  Change 2 does not seem to be necessary,
      and it breaks UBIFS expectations, so if needed, it should be done
      separately later.  And change 2 does not seem to be documented in the
      commit message.
      
      This patch does the following:
       1. Undo changes 1 and 2
       2. Add a comment explaining change 3 (it is very useful to have comments
          in _code_, not only in the commit).
      
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
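
The counter rules that dcf6a79d ends up with (changes 1 and 2 undone, change 3 kept) can be modelled as below. This is a sketch only: `demo_wbc`, `demo_writeout()`, and the `DEMO_WB_*` constants are hypothetical, and copying the local counter back into the control structure at the end is an assumption of this model.

```c
#include <assert.h>

enum demo_sync_mode { DEMO_WB_SYNC_NONE, DEMO_WB_SYNC_ALL };

/* Hypothetical stand-in for struct writeback_control. */
struct demo_wbc {
	long nr_to_write;
	enum demo_sync_mode sync_mode;
};

/*
 * Write out nr_dirty pages. Each written page decrements a local copy of
 * the budget while it is positive, regardless of sync mode. Reaching zero
 * stops the loop only for DEMO_WB_SYNC_NONE. Returns the pages written.
 */
long demo_writeout(struct demo_wbc *wbc, long nr_dirty)
{
	long nr_to_write = wbc->nr_to_write;	/* local counter, not wbc's */
	long written = 0;

	assert(nr_dirty >= 0);
	while (written < nr_dirty) {
		written++;
		if (nr_to_write > 0) {
			nr_to_write--;
			/*
			 * Budget exhausted: background writeback stops here,
			 * data-integrity sync keeps going until all is clean.
			 */
			if (nr_to_write == 0 &&
			    wbc->sync_mode == DEMO_WB_SYNC_NONE)
				break;
		}
	}
	wbc->nr_to_write = nr_to_write;	/* assumption: report back once */
	return written;
}
```

With a budget of 3 and 10 dirty pages, the model writes 3 pages in DEMO_WB_SYNC_NONE mode and all 10 in DEMO_WB_SYNC_ALL mode, and the budget ends at 0 in both cases.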
  11. 02 Feb, 2009 1 commit
    • Manually revert "mlock: downgrade mmap sem while populating mlocked regions" · 27421e21
      Linus Torvalds committed
      This essentially reverts commit 8edb08ca.
      
      It downgraded our mmap semaphore to a read-lock while mlocking pages, in
      order to allow other threads (and external accesses like "ps" et al) to
      walk the vma lists and take page faults etc.  Which is a nice idea, but
      the implementation does not work.
      
      Because we cannot upgrade the lock back to a write lock without
      releasing the mmap semaphore, the code had to release the lock entirely
      and then re-take it as a writelock.  However, that meant that the caller
      possibly lost the vma chain that it was following, since now another
      thread could come in and mmap/munmap the range.
      
      The code tried to work around that by just looking up the vma again and
      erroring out if that happened, but quite frankly, that was just a buggy
      hack that doesn't actually protect against anything (the other thread
      could just have replaced the vma with another one instead of totally
      unmapping it).
      
      The only way to downgrade to a read lock _reliably_ is to do it at the
      end, which is likely the right thing to do: do all the 'vma' operations
      with the write-lock held, then downgrade to a read after completing them
      all, and then do the "populate the newly mlocked regions" while holding
      just the read lock.  And then just drop the read-lock and return to user
      space.
      
      The (perhaps somewhat simpler) alternative is to just make all the
      callers of mlock_vma_pages_range() know that the mmap lock got dropped,
      and just re-grab the mmap semaphore if it needs to mlock more than one
      vma region.
      
      So we can do this "downgrade mmap sem while populating mlocked regions"
      thing right, but the way it was done here was absolutely not correct.
      Thus the revert, in the expectation that we will do it all correctly
      some day.
      
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 01 Feb, 2009 1 commit
    • Stop playing silly games with the VM_ACCOUNT flag · fc8744ad
      Linus Torvalds committed
      The mmap_region() code would temporarily set the VM_ACCOUNT flag for
      anonymous shared mappings just to inform shmem_zero_setup() that it
      should enable accounting for the resulting shm object.  It would then
      clear the flag after calling ->mmap (for the /dev/zero case) or doing
      shmem_zero_setup() (for the MAP_ANON case).
      
      This not only resulted in vma merge issues, but also made for
      unnecessary confusion.  Use the already-existing VM_NORESERVE flag for
      this instead, and let shmem_{zero|file}_setup() just figure it out from
      that.
      
      This also happens to make it obvious that the new DRI2 GEM layer uses a
      non-reserving backing store for its object allocation - which is quite
      possibly not intentional.  But since I didn't want to change semantics
      in this patch, I left it alone, and just updated the caller to use the
      new flag semantics.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 31 Jan, 2009 1 commit
    • Allow opportunistic merging of VM_CAN_NONLINEAR areas · 33bfad54
      Linus Torvalds committed
      Commit de33c8db ("Fix OOPS in mmap_region() when merging adjacent
      VM_LOCKED file segments") unified the vma merging of anonymous and file
      maps to just one place, which simplified the code and fixed a
      use-after-free bug that could cause an oops.
      
      But by doing the merge opportunistically before even having called
      ->mmap() on the file method, it now compares two different 'vm_flags'
      values: the pre-mmap() value of the new not-yet-formed vma, and previous
      mappings of the same file around it.
      
      And in doing so, it refused to merge the common file case, which adds a
      marker to say "I can be made non-linear".
      
      This fixes it by just adding a set of flags that don't have to match,
      because we know they are ok to merge.  Currently it's only that single
      VM_CAN_NONLINEAR flag, but at least conceptually there could be others
      in the future.
      
      Reported-and-acked-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg KH <gregkh@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
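
The "set of flags that don't have to match" from 33bfad54 can be illustrated with a small bitmask comparison. This is a sketch with assumed names and values: the `DEMO_VM_*` bits are illustrative rather than the kernel's definitions, and `demo_flags_mergeable()` is a hypothetical helper covering only the flag comparison, not the rest of the merge checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; the kernel's actual definitions may differ. */
#define DEMO_VM_READ		0x1UL
#define DEMO_VM_WRITE		0x2UL
#define DEMO_VM_CAN_NONLINEAR	0x4UL

/* Flags that may differ between two otherwise mergeable vmas. */
#define DEMO_VM_MERGEABLE_FLAGS	(DEMO_VM_CAN_NONLINEAR)

/* Permission bits must never be treated as ignorable. */
static_assert((DEMO_VM_MERGEABLE_FLAGS & (DEMO_VM_READ | DEMO_VM_WRITE)) == 0,
	      "permission bits must always match");

/* Return true if the two flag words differ only in ignorable bits. */
bool demo_flags_mergeable(unsigned long old_flags, unsigned long new_flags)
{
	/* XOR keeps the differing bits; mask off the harmless ones. */
	return ((old_flags ^ new_flags) & ~DEMO_VM_MERGEABLE_FLAGS) == 0;
}
```

In this model a new mapping that lacks the "can be made non-linear" marker still compares as mergeable with a neighbour that has it, while any difference in the permission bits blocks the merge.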
  14. 30 Jan, 2009 4 commits
  15. 28 Jan, 2009 1 commit
  16. 27 Jan, 2009 1 commit
  17. 21 Jan, 2009 1 commit
  18. 16 Jan, 2009 6 commits