1. 25 7月, 2008 4 次提交
    • M
      mm: print out the zonelists on request for manual verification · 68ad8df4
      Mel Gorman 提交于
      This patch prints out the zonelists during boot for manual verification by the
      user if the mminit_loglevel is MMINIT_VERIFY or higher.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68ad8df4
    • M
      mm: make defensive checks around PFN values registered for memory usage · 2dbb51c4
      Mel Gorman 提交于
      There are a number of different views to how much memory is currently active.
      There is the arch-independent zone-sizing view, the bootmem allocator and
      memory models view.
      
      Architectures register this information at different times and is not
      necessarily in sync particularly with respect to some SPARSEMEM limitations.
      
      This patch introduces mminit_validate_memmodel_limits() which is able to
      validate and correct PFN ranges with respect to the memory model.  It is only
      SPARSEMEM that currently validates itself.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2dbb51c4
    • M
      mm: verify the page links and memory model · 708614e6
      Mel Gorman 提交于
      Print out information on how the page flags are being used if mminit_loglevel
      is MMINIT_VERIFY or higher and unconditionally performs sanity checks on the
      flags regardless of loglevel.
      
      When the page flags are updated with section, node and zone information, a
      check are made to ensure the values can be retrieved correctly.  Finally we
      confirm that pfn_to_page and page_to_pfn are the correct inverse functions.
      
      [akpm@linux-foundation.org: fix printk warnings]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      708614e6
    • M
      mm: add a basic debugging framework for memory initialisation · 6b74ab97
      Mel Gorman 提交于
      Boot initialisation is very complex, with significant numbers of
      architecture-specific routines, hooks and code ordering.  While significant
      amounts of the initialisation is architecture-independent, it trusts the data
      received from the architecture layer.  This is a mistake, and has resulted in
      a number of difficult-to-diagnose bugs.
      
      This patchset adds some validation and tracing to memory initialisation.  It
      also introduces a few basic defensive measures.  The validation code can be
      explicitly disabled for embedded systems.
      
      This patch:
      
      Add additional debugging and verification code for memory initialisation.
      
      Once enabled, the verification checks are always run and when required
      additional debugging information may be outputted via a mminit_loglevel=
      command-line parameter.
      
      The verification code is placed in a new file mm/mm_init.c.  Ideally other mm
      initialisation code will be moved here over time.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b74ab97
  2. 20 7月, 2008 1 次提交
  3. 19 7月, 2008 1 次提交
  4. 17 7月, 2008 1 次提交
  5. 16 7月, 2008 3 次提交
  6. 15 7月, 2008 1 次提交
  7. 14 7月, 2008 1 次提交
  8. 12 7月, 2008 2 次提交
    • A
      mm: Add range_cont mode for writeback · 06d6cf69
      Aneesh Kumar K.V 提交于
      Filesystems like ext4 needs to start a new transaction in
      the writepages for block allocation. This happens with delayed
      allocation and there is limit to how many credits we can request
      from the journal layer. So we call write_cache_pages multiple
      times with wbc->nr_to_write set to the maximum possible value
      limitted by the max journal credits available.
      
      Add a new mode to writeback that enables us to handle this
      behaviour. In the new mode we update the wbc->range_start
      to point to the new offset to be written. Next call to
      call to write_cache_pages will start writeout from specified
      range_start offset. In the new mode we also limit writing
      to the specified wbc->range_end.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      06d6cf69
    • J
      vfs: export filemap_fdatawrite_range() · f4c0a0fd
      Jan Kara 提交于
      Make filemap_fdatawrite_range() function public, so that it can later
      be used in ordered mode rewrite by JBD/JBD2.
      Signed-off-by: NJan Kara <jack@suse.cz>
      f4c0a0fd
  9. 11 7月, 2008 1 次提交
    • D
      slub: Fix use-after-preempt of per-CPU data structure · bdb21928
      Dmitry Adamushko 提交于
      Vegard Nossum reported a crash in kmem_cache_alloc():
      
      	BUG: unable to handle kernel paging request at da87d000
      	IP: [<c01991c7>] kmem_cache_alloc+0xc7/0xe0
      	*pde = 28180163 *pte = 1a87d160
      	Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      	Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
      	EIP: 0060:[<c01991c7>] EFLAGS: 00210203 CPU: 0
      	EIP is at kmem_cache_alloc+0xc7/0xe0
      	EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
      	ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
      	DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      
      and analyzed it:
      
        "The register %ecx looks innocent but is very important here. The disassembly:
      
             mov    %edx,%ecx
             shr    $0x2,%ecx
             rep stos %eax,%es:(%edi) <-- the fault
      
         So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
         (0x6b6b6b6b >> 2 == 0x1adadada.)
      
         %ecx is the counter for the memset, from here:
      
             memset(object, 0, c->objsize);
      
        i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
        Where did "c" come from? Uh-oh...
      
             c = get_cpu_slab(s, smp_processor_id());
      
        This looks like it has very much to do with CPU hotplug/unplug. Is
        there a race between SLUB/hotplug since the CPU slab is used after it
        has been freed?"
      
      Good analysis.
      
      Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc()
      can be migrated on another CPU right after local_irq_restore() and
      before memset().  The inital cpu can become offline in the mean time (or
      a migration is a consequence of the CPU going offline) so its
      'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback).
      
      At some point of time the caller continues on another CPU having an
      obsolete pointer...
      Signed-off-by: NDmitry Adamushko <dmitry.adamushko@gmail.com>
      Reported-by: NVegard Nossum <vegard.nossum@gmail.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bdb21928
  10. 09 7月, 2008 1 次提交
  11. 08 7月, 2008 6 次提交
  12. 05 7月, 2008 4 次提交
  13. 04 7月, 2008 2 次提交
  14. 02 7月, 2008 1 次提交
  15. 26 6月, 2008 1 次提交
  16. 25 6月, 2008 1 次提交
    • J
      mm: add a ptep_modify_prot transaction abstraction · 1ea0704e
      Jeremy Fitzhardinge 提交于
      This patch adds an API for doing read-modify-write updates to a pte's
      protection bits which may race against hardware updates to the pte.
      After reading the pte, the hardware may asynchonously set the accessed
      or dirty bits on a pte, which would be lost when writing back the
      modified pte value.
      
      The existing technique to handle this race is to use
      ptep_get_and_clear() atomically fetch the old pte value and clear it
      in memory.  This has the effect of marking the pte as non-present,
      which will prevent the hardware from updating its state.  When the new
      value is written back, the pte will be present again, and the hardware
      can resume updating the access/dirty flags.
      
      When running in a virtualized environment, pagetable updates are
      relatively expensive, since they generally involve some trap into the
      hypervisor.  To mitigate the cost of these updates, we tend to batch
      them.
      
      However, because of the atomic nature of ptep_get_and_clear(), it is
      inherently non-batchable.  This new interface allows batching by
      giving the underlying implementation enough information to open a
      transaction between the read and write phases:
      
      ptep_modify_prot_start() returns the current pte value, and puts the
        pte entry into a state where either the hardware will not update the
        pte, or if it does, the updates will be preserved on commit.
      
      ptep_modify_prot_commit() writes back the updated pte, makes sure that
        any hardware updates made since ptep_modify_prot_start() are
        preserved.
      
      ptep_modify_prot_start() and _commit() must be exactly paired, and
      used while holding the appropriate pte lock.  They do not protect
      against other software updates of the pte in any way.
      
      The current implementations of ptep_modify_prot_start and _commit are
      functionally unchanged from before: _start() uses ptep_get_and_clear()
      fetch the pte and zero the entry, preventing any hardware updates.
      _commit() simply writes the new pte value back knowing that the
      hardware has not updated the pte in the meantime.
      
      The only current user of this interface is mprotect
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1ea0704e
  17. 24 6月, 2008 2 次提交
    • N
      mm: fix race in COW logic · 945754a1
      Nick Piggin 提交于
      There is a race in the COW logic.  It contains a shortcut to avoid the
      COW and reuse the page if we have the sole reference on the page,
      however it is possible to have two racing do_wp_page()ers with one
      causing the other to mistakenly believe it is safe to take the shortcut
      when it is not.  This could lead to data corruption.
      
      Process 1 and process2 each have a wp pte of the same anon page (ie.
      one forked the other).  The page's mapcount is 2.  Then they both
      attempt to write to it around the same time...
      
        proc1				proc2 thr1			proc2 thr2
        CPU0				CPU1				CPU3
        do_wp_page()			do_wp_page()
      				 trylock_page()
      				  can_share_swap_page()
      				   load page mapcount (==2)
      				  reuse = 0
      				 pte unlock
      				 copy page to new_page
      				 pte lock
      				 page_remove_rmap(page);
         trylock_page()
          can_share_swap_page()
           load page mapcount (==1)
          reuse = 1
         ptep_set_access_flags (allow W)
      
        write private key into page
      								read from page
      				ptep_clear_flush()
      				set_pte_at(pte of new_page)
      
      Fix this by moving the page_remove_rmap of the old page after the pte
      clear and flush.  Potentially the entire branch could be moved down
      here, but in order to stay consistent, I won't (should probably move all
      the *_mm_counter stuff with one patch).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      945754a1
    • L
      Fix ZERO_PAGE breakage with vmware · 672ca28e
      Linus Torvalds 提交于
      Commit 89f5b7da ("Reinstate ZERO_PAGE
      optimization in 'get_user_pages()' and fix XIP") broke vmware, as
      reported by Jeff Chua:
      
        "This broke vmware 6.0.4.
         Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
         /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"
      
      and the reason seems to be that there's an old bug in how we handle do
      FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
      triggered if the whole page table was missing, nobody had apparently hit
      it before.
      
      The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
      not just for whole missing page tables, but for individual pages as
      well, and exposed this problem.
      
      This fixes it by making the test for when FOLL_ANON is used more
      careful, and also makes the code easier to read and understand by moving
      the logic to a separate inline function.
      Reported-and-tested-by: NJeff Chua <jeff.chua.linux@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      672ca28e
  18. 22 6月, 2008 2 次提交
  19. 21 6月, 2008 1 次提交
    • L
      Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP · 89f5b7da
      Linus Torvalds 提交于
      KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
      557ed1fa ("remove ZERO_PAGE") removed
      the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
      generally now populate the VM with real empty pages needlessly.
      
      We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
      since fault handling no longer uses ZERO_PAGE for new anonymous pages,
      we now need to handle that special case in follow_page() instead.
      
      In particular, the removal of ZERO_PAGE effectively removed the core
      file writing optimization where we would skip writing pages that had not
      been populated at all, and increased memory pressure a lot by allocating
      all those useless newly zeroed pages.
      
      This reinstates the optimization by making the unmapped PTE case the
      same as for a non-existent page table, which already did this correctly.
      
      While at it, this also fixes the XIP case for follow_page(), where the
      caller could not differentiate between the case of a page that simply
      could not be used (because it had no "struct page" associated with it)
      and a page that just wasn't mapped.
      
      We do that by simply returning an error pointer for pages that could not
      be turned into a "struct page *".  The error is arbitrarily picked to be
      EFAULT, since that was what get_user_pages() already used for the
      equivalent IO-mapped page case.
      
      [ Also removed an impossible test for pte_offset_map_lock() failing:
        that's not how that function works ]
      Acked-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89f5b7da
  20. 13 6月, 2008 2 次提交
  21. 12 6月, 2008 1 次提交
    • P
      nommu: Correct kobjsize() page validity checks. · 5a1603be
      Paul Mundt 提交于
      This implements a few changes on top of the recent kobjsize() refactoring
      introduced by commit 6cfd53fc.
      
      As Christoph points out:
      
      	virt_to_head_page cannot return NULL. virt_to_page also
      	does not return NULL. pfn_valid() needs to be used to
      	figure out if a page is valid.  Otherwise the page struct
      	reference that was returned may have PageReserved() set
      	to indicate that it is not a valid page.
      
      As discussed further in the thread, virt_addr_valid() is the preferable
      way to validate the object pointer in this case. In addition to fixing
      up the reserved page case, it also has the benefit of encapsulating the
      hack introduced by commit 4016a139 on
      the impacted platforms, allowing us to get rid of the extra checking in
      kobjsize() for the platforms that don't perform this type of bizarre
      memory_end abuse (every nommu platform that isn't blackfin). If blackfin
      decides to get in line with every other platform and use PageReserved
      for the DMA pages in question, kobjsize() will also continue to work
      fine.
      
      It also turns out that compound_order() will give us back 0-order for
      non-head pages, so we can get rid of the PageCompound check and just
      use compound_order() directly. Clean that up while we're at it.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a1603be
  22. 10 6月, 2008 1 次提交
    • Y
      mm, x86: shrink_active_range() should check all · cc1a9d86
      Yinghai Lu 提交于
      Now we are using register_e820_active_regions() instead of
      add_active_range() directly. So end_pfn could be different between the
      value in early_node_map to node_end_pfn.
      
      So we need to make shrink_active_range() smarter.
      
      shrink_active_range() is a generic MM function in mm/page_alloc.c but
      it is only used on 32-bit x86. Should we move it back to some file in
      arch/x86?
      Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cc1a9d86