1. 15 Nov, 2007 8 commits
    • A
      hugetlb: fix quota management for private mappings · c79fb75e
      Adam Litke committed
      The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
      mappings when that support was added.  Currently, quota is debited at page
      instantiation and credited at file truncation.  This approach works correctly
      for shared pages but is incomplete for private pages.  In addition to
      hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
      this function does not respect quotas.
      
      Private huge pages are treated very much like normal, anonymous pages.  They
      are not "backed" by the hugetlbfs file and are not stored in the mapping's
      radix tree.  This means that private pages are invisible to
      truncate_hugepages() so that function will not credit the quota.
      
      This patch (based on a prototype provided by Ken Chen) moves quota crediting
      for all pages into free_huge_page().  page->private is used to store a pointer
      to the mapping to which this page belongs.  This is used to credit quota on
      the appropriate hugetlbfs instance.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
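The accounting scheme described in c79fb75e (debit at allocation, credit in free_huge_page(), with page->private pointing back at the owning mapping) can be sketched in user-space C. This is only an illustration under stated assumptions: the struct names, the `private_data` field, and the helpers `quota_get()`, `quota_put()`, `alloc_page_sketch()`, and `free_page_sketch()` are hypothetical stand-ins, not the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; names are illustrative. */
struct hugetlbfs_quota {
    long limit;  /* maximum number of huge pages this mount may hold */
    long used;   /* huge pages currently charged against the mount */
};

struct mapping {
    struct hugetlbfs_quota *quota;
};

struct huge_page {
    struct mapping *private_data;  /* plays the role of page->private */
};

/* Debit one page; returns 0 on success, -1 if the mount is at its limit. */
static int quota_get(struct hugetlbfs_quota *q)
{
    if (q->used >= q->limit)
        return -1;
    q->used++;
    return 0;
}

/* Credit one page back to the mount. */
static void quota_put(struct hugetlbfs_quota *q)
{
    assert(q->used > 0);
    q->used--;
}

/* Allocation path: charge quota and remember which mapping owns the page. */
static int alloc_page_sketch(struct huge_page *page, struct mapping *m)
{
    if (quota_get(m->quota) != 0)
        return -1;
    page->private_data = m;
    return 0;
}

/* Free path: credit whichever instance was recorded at allocation time.
 * This works for private and shared pages alike, because it does not
 * depend on the page being found during file truncation. */
static void free_page_sketch(struct huge_page *page)
{
    struct mapping *m = page->private_data;

    if (m != NULL) {
        quota_put(m->quota);
        page->private_data = NULL;
    }
}
```

With a limit of one page, a second allocation fails until the first page passes through `free_page_sketch()`, at which point the quota is available again.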
    • A
      hugetlb: split alloc_huge_page into private and shared components · 348ea204
      Adam Litke committed
      Hugetlbfs implements a quota system which can limit the amount of memory that
      can be used by the filesystem.  Before allocating a new huge page for a file,
      the quota is checked and debited.  The quota is then credited when truncating
      the file.  I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED
      mappings.  Before detailing the problems and my proposed solutions, we should
      agree on a definition of quotas that properly addresses both private and
      shared pages.  Since the purpose of quotas is to limit total memory
      consumption on a per-filesystem basis, I argue that all pages allocated by the
      fs (private and shared) should be charged against quota.
      
      Private Mappings
      ================
      
      The current code will debit quota for private pages sometimes, but will never
      credit it.  At a minimum, this causes a leak in the quota accounting which
      renders the accounting essentially useless as it is.  Shared pages have a one
      to one mapping with a hugetlbfs file and are easy to account by debiting on
      allocation and crediting on truncate.  Private pages are anonymous in nature
      and have a many to one relationship with their hugetlbfs files (due to copy on
      write).  Because private pages are not indexed by the mapping's radix tree,
      their quota cannot be credited at file truncation time.  Crediting must be
      done when the page is unmapped and freed.
      
      Shared Pages
      ============
      
      I discovered an issue concerning the interaction between the MAP_SHARED
      reservation system and quotas.  Since quota is not checked until page
      instantiation, an over-quota mmap/reservation will initially succeed.  When
      instantiating the first over-quota page, the program will receive SIGBUS.
      This is inconsistent since the reservation is supposed to be a guarantee.  The
      solution is to debit the full amount of quota at reservation time and credit
      the unused portion when the reservation is released.
      
      This patch series brings quotas back in line by making the following
      modifications:
       * Private pages
         - Debit quota in alloc_huge_page()
         - Credit quota in free_huge_page()
       * Shared pages
         - Debit quota for entire reservation at mmap time
         - Credit quota for instantiated pages in free_huge_page()
         - Credit quota for unused reservation at munmap time
      
      This patch:
      
      The shared page reservation and dynamic pool resizing features have made the
      allocation of private vs.  shared huge pages quite different.  By splitting
      out the private/shared-specific portions of the process into their own
      functions, readability is greatly improved.  alloc_huge_page now calls the
      proper helper and performs common operations.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
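The shared-page scheme from the 348ea204 message (debit the whole reservation at mmap time, credit the unused part at munmap time) might look roughly like the following user-space sketch. All names here (`struct quota`, `struct reservation`, `reserve_at_mmap()`, `instantiate()`, `release_at_munmap()`) are assumptions made for illustration and do not come from the kernel source.

```c
#include <assert.h>

/* Hypothetical per-filesystem quota, counted in huge pages. */
struct quota {
    long limit;
    long used;
};

/* Hypothetical bookkeeping for one shared mapping's reservation. */
struct reservation {
    long reserved;      /* pages charged to quota on behalf of this mapping */
    long instantiated;  /* pages actually faulted in so far */
};

/* Debit several pages at once; 0 on success, -1 if that would exceed the limit. */
static int quota_debit(struct quota *q, long pages)
{
    if (pages < 0 || q->used + pages > q->limit)
        return -1;
    q->used += pages;
    return 0;
}

static void quota_credit(struct quota *q, long pages)
{
    assert(pages >= 0 && pages <= q->used);
    q->used -= pages;
}

/* mmap time: charge the full reservation up front, so an over-quota
 * request fails here instead of raising SIGBUS at fault time. */
static int reserve_at_mmap(struct quota *q, struct reservation *r, long pages)
{
    if (quota_debit(q, pages) != 0)
        return -1;
    r->reserved = pages;
    r->instantiated = 0;
    return 0;
}

/* Fault time: the page was already paid for, so no quota check is needed. */
static int instantiate(struct reservation *r)
{
    if (r->instantiated >= r->reserved)
        return -1;
    r->instantiated++;
    return 0;
}

/* munmap time: credit only the unused portion.  Instantiated pages keep
 * their charge until they are freed. */
static void release_at_munmap(struct quota *q, struct reservation *r)
{
    quota_credit(q, r->reserved - r->instantiated);
    r->reserved = r->instantiated;
}
```

The point of charging at reservation time is that the failure moves to the mmap call, which can return an error, rather than surfacing later as a signal in the middle of the program.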
    • A
      hugetlb: follow_hugetlb_page() for write access · 5b23dbe8
      Adam Litke committed
      When calling get_user_pages(), a write flag is passed in by the caller to
      indicate if write access is required on the faulted-in pages.  Currently,
      follow_hugetlb_page() ignores this flag and always faults pages for
      read-only access.  This can cause data corruption because a device driver
      that calls get_user_pages() with write set will not expect COW faults to
      occur on the returned pages.
      
      This patch passes the write flag down to follow_hugetlb_page() and makes
      sure hugetlb_fault() is called with the right write_access parameter.
      
      [ezk@cs.sunysb.edu: build fix]
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Reviewed-by: Ken Chen <kenchen@google.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      Add IORESOUCE_BUSY flag for System RAM · 887c3cb1
      Yasunori Goto committed
      i386 and x86-64 register System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.

      But ia64 registers it as IORESOURCE_MEM only.
      In addition, the memory hotplug code registers new memory as IORESOURCE_MEM too.

      This difference causes memory unplug to fail on x86-64.  This patch
      fixes it.

      This patch adds IORESOURCE_BUSY to avoid potentially overlapping mappings by
      PCI devices.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: speed up writeback ramp-up on clean systems · 5fce25a9
      Peter Zijlstra committed
      We allow violation of bdi limits if there is a lot of room on the system.
      Once we hit half the total limit we start enforcing bdi limits and bdi
      ramp-up should happen.  Doing it this way avoids many small writeouts on an
      otherwise idle system and should also speed up the ramp-up.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
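The policy in 5fce25a9 (ignore per-device limits while the system as a whole is below half of its total dirty limit) can be pictured with a small decision function. This is a simplified sketch, not the kernel's throttling code: `struct dirty_state`, its fields, and `should_throttle()` are hypothetical, and the sketch models only the per-bdi check named in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of dirty-page counters, in pages. */
struct dirty_state {
    unsigned long total_dirty;  /* dirty + writeback pages, system-wide */
    unsigned long total_limit;  /* system-wide dirty limit */
    unsigned long bdi_dirty;    /* dirty + writeback pages on this device */
    unsigned long bdi_limit;    /* this device's share of the limit */
};

/* Decide whether a writer to this device should be throttled. */
static bool should_throttle(const struct dirty_state *s)
{
    assert(s->total_limit > 0);

    /* Plenty of room on the system: allow the bdi limit to be violated,
     * which avoids many small writeouts on an otherwise idle machine. */
    if (s->total_dirty < s->total_limit / 2)
        return false;

    /* Past the halfway mark: start enforcing the per-device limit. */
    return s->bdi_dirty > s->bdi_limit;
}
```

On a mostly clean system a device may therefore run well past its own limit; once total dirty pages reach half the global limit, the same device state starts to throttle.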
    • K
      memory hotremove: unset migrate type "ISOLATE" after removal · dbc0e4ce
      KAMEZAWA Hiroyuki committed
      We should unset the "ISOLATE" migrate type once memory has been removed
      successfully.  The current code has a bug and does not work correctly.

      This patch also includes a bugfix that changes get_pageblock_flags to
      get_pageblock_migratetype().
      
      Thanks to Badari Pulavarty for finding this.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Migration: find correct vma in new_vma_page() · 3ad33b24
      Lee Schermerhorn committed
      We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
      mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas.  For
      anon-regions, we just fail to migrate any pages beyond the 1st vma in the
      range.
      
      This occurs because do_mbind() collects a list of pages to migrate by
      calling check_range().  check_range() walks the task's mm, spanning vmas as
      necessary, to collect the migratable pages into a list.  Then, do_mbind()
      calls migrate_pages() passing the list of pages, a function to allocate new
      pages based on vma policy [new_vma_page()], and a pointer to the first vma
      of the range.
      
      For each page in the list, new_vma_page() calls page_address_in_vma()
      passing the page and the vma [first in range] to obtain the address to
      pass to alloc_page_vma().  The page address is needed to get interleaving
      policy correct.  If the pages in the list come from multiple vmas,
      eventually, new_vma_page() will pass a page to page_address_in_vma()
      with the incorrect vma.  For !PageAnon pages, this will result in a bug
      check in rmap.c:vma_address().  For anon pages, vma_address() will just
      return EFAULT and fail the migration.
      
      This patch modifies new_vma_page() to check the return value from
      page_address_in_vma().  If the return value is EFAULT, new_vma_page()
      searches forward via vm_next for the vma that maps the page--i.e., that does
      not return EFAULT.  This assumes that the pages in the list handed to
      migrate_pages() are in address order.  This is currently the case.  The patch
      documents this assumption in a new comment block for new_vma_page().
      
      If new_vma_page() cannot locate the vma mapping the page in a forward
      search in the mm, it will pass a NULL vma to alloc_page_vma().  This will
      result in the allocation using the task policy, if any, else system default
      policy.  This situation is unlikely, but the patch documents this behavior
      with a comment.
      
      Note, this patch results in restarting from the first vma in a multi-vma
      range each time new_vma_page() is called.  If this is not acceptable, we
      can make the vma argument a pointer, both in new_vma_page() and its caller
      unmap_and_move() so that the value held by the loop in migrate_pages()
      always passes down the last vma in which a page was found.  This will
      require changes to all new_page_t functions passed to migrate_pages().  Is
      this necessary?
      
      For this patch to work, we can't bug check in vma_address() for pages
      outside the argument vma.  This patch removes the BUG_ON().  All other
      callers [besides new_vma_page()] already check the return status.
      
      Tested on x86_64, 4 node NUMA platform.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
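The forward search described in 3ad33b24 can be approximated in user space as a walk along a singly linked list of address ranges. This is a sketch under assumptions: the `struct vma` below keeps only three fields, and `find_vma_forward()` is a hypothetical helper that uses a plain address-containment test where the kernel checks the result of page_address_in_vma().

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a vma: a half-open range [vm_start, vm_end). */
struct vma {
    unsigned long vm_start;
    unsigned long vm_end;
    struct vma *vm_next;  /* next vma in ascending address order */
};

/* Starting at `vma`, walk forward until a vma covers `addr`.
 * Returns NULL if none does; per the commit message, the caller then
 * allocates with a NULL vma and falls back to the task or system
 * default policy. */
static struct vma *find_vma_forward(struct vma *vma, unsigned long addr)
{
    while (vma != NULL) {
        assert(vma->vm_start < vma->vm_end);
        if (addr >= vma->vm_start && addr < vma->vm_end)
            return vma;
        vma = vma->vm_next;
    }
    return NULL;
}
```

Because the search always restarts from the first vma of the range, each lookup is linear in the number of vmas spanned, which is the cost the commit message raises when it asks whether the vma argument should become a pointer.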
    • A
      slab: fix typo in allocation failure handling · cc550def
      Akinobu Mita committed
      This patch fixes a wrong array index in the allocation failure handling.
      
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 13 Nov, 2007 2 commits
  3. 06 Nov, 2007 1 commit
  4. 05 Nov, 2007 1 commit
  5. 01 Nov, 2007 1 commit
    • L
      Remove broken ptrace() special-case code from file mapping · 5307cc1a
      Linus Torvalds committed
      The kernel has for random historical reasons allowed ptrace() accesses
      to access (and insert) pages into the page cache above the size of the
      file.
      
      However, Nick broke that by mistake when doing the new fault handling in
      commit 54cb8821 ("mm: merge populate and
      nopage into fault (fixes nonlinear)").  The breakage caused a hang with
      gdb when trying to access the invalid page.
      
      The ptrace "feature" really isn't worth resurrecting, since it really is
      wrong both from a portability _and_ from an internal page cache validity
      standpoint.  So this removes those old broken remnants, and fixes the
      ptrace() hang in the process.
      
      Noticed and bisected by Duane Griffin, who also supplied a test-case
      (quoth Nick: "Well that's probably the best bug report I've ever had,
      thanks Duane!").
      
      Cc: Duane Griffin <duaneg@dghda.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 31 Oct, 2007 1 commit
    • Z
      dio: fix cache invalidation after sync writes · bdb76ef5
      Zach Brown committed
      Commit 65b8291c ("dio: invalidate
      clean pages before dio write") introduced a bug which stopped dio from
      ever invalidating the page cache after writes.  It still invalidated it
      before writes so most users were fine.
      
      Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting
      this bug when he had a buffered reader immediately reading file data
      after an O_DIRECT writer had written the data.  The kernel issued
      read-ahead beyond the position of the reader which overlapped with the
      O_DIRECT writer.  The failure to invalidate after writes caused the
      reader to see stale data from the read-ahead.
      
      The following patch is originally from Karl.  The following commentary
      is his:
      
      	The below 3rd try takes on your suggestion of just invalidating
      	no matter what the retval from the direct_IO call.  I ran it
      	thru the test-case several times and it has worked every time.
      	The post-invalidate is probably still too early for async-directio,
      	but I don't have a testcase for that;  just sync.  And, this
      	won't be any worse in the async case.
      
      I added a test to the aio-dio-regress repository which mimics Karl's IO
      pattern.  It verified the bad behaviour and that the patch fixed it.  I
      agree with Karl, this still doesn't help the case where a buffered
      reader follows an AIO O_DIRECT writer.  That will require a bit more
      work.
      
      This gives up on the idea of returning EIO to indicate to userspace that
      stale data remains if the invalidation failed.
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Cc: Karl Schendel <kschendel@datallegro.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 30 Oct, 2007 3 commits
    • H
      fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE · 487e9bf2
      Hugh Dickins committed
      It's possible to provoke unionfs (not yet in mainline, though in mm and
      some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)).  I expect
      it's possible to provoke the 2.6.23 ecryptfs in the same way (but the
      2.6.24 ecryptfs no longer calls lower level's ->writepage).
      
      This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could
      leak from tmpfs via write_cache_pages and unionfs to userspace.  There's
      already a fix (e4230030 - writeback: don't
      propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so
      far as it goes; but insufficient because it doesn't address the underlying
      issue, that shmem_writepage expects to be called only by vmscan (relying on
      backing_dev_info capabilities to prevent the normal writeback path from
      ever approaching it).
      
      That's an increasingly fragile assumption, and ramdisk_writepage (the other
      source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check
      wbc->for_reclaim before returning it.  Make the same check in
      shmem_writepage, thereby sidestepping the page_mapped BUG also.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Erez Zadok <ezk@cs.sunysb.edu>
      Cc: <stable@kernel.org>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      mm/sparse-vmemmap.c: make sure init_mm is included · 8bca44bb
      Glauber de Oliveira Costa committed
      mm/sparse-vmemmap.c uses init_mm in some places.  However, it is not
      present in any of the headers currently included in the file.
      
      init_mm is declared extern in sched.h, so we add it to the list of
      included headers.

      Up to now, this problem was masked by the fact that functions like
      set_pte_at() and pmd_populate_kernel() are usually macros that expand to
      simpler variants that do not use the first parameter at all.
      Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Revert "x86_64: allocate sparsemem memmap above 4G" · 6a22c57b
      Linus Torvalds committed
      This reverts commit 2e1c49db.
      
      First off, testing in Fedora has shown it to cause boot failures,
      bisected down by Martin Ebourne, and reported by Dave Jones.  So the
      commit will likely be reverted in the 2.6.23 stable kernels.
      
      Secondly, in the 2.6.24 model, x86-64 has now grown support for
      SPARSEMEM_VMEMMAP, which disables the relevant code anyway, so while the
      bug is not visible any more, it's become invisible due to the code just
      being irrelevant and no longer enabled on the only architecture that
      this ever affected.
      Reported-by: Dave Jones <davej@redhat.com>
      Tested-by: Martin Ebourne <fedora@ebourne.me.uk>
      Cc: Zou Nan hai <nanhai.zou@intel.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 29 Oct, 2007 3 commits
  9. 23 Oct, 2007 1 commit
  10. 22 Oct, 2007 4 commits
  11. 21 Oct, 2007 1 commit
  12. 20 Oct, 2007 14 commits