1. 28 June 2006, 1 commit
  2. 27 June 2006, 3 commits
    • E
      [PATCH] proc: don't lock task_structs indefinitely · 99f89551
      Eric W. Biederman authored
      Every inode in /proc holds a reference to a struct task_struct.  If a
      directory or file is opened and remains open after the task exits, this
      pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
      file descriptor is about 10K.
      
      Normally I would figure a reasonable per-user process limit is about 100
      processes.  With 80 processes, each with 1000 open file descriptors, I can
      trigger the OOM killer on a 32bit kernel, because I have pinned about 800MB
      of useless data.
      
      This patch replaces the struct task_struct pointer with a pointer to a struct
      task_ref which has a struct task_struct pointer.  That way, the pinning of
      dead tasks does not happen.
      
      The code now has to contend with the fact that the task may now exit at any
      time, which is a little, but not much, more complicated.
      
      With this change it takes about 1000 processes each opening up 1000 file
      descriptors before I can trigger the OOM killer.  Much better.
      
      [mlp@google.com: task_mmu small fixes]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Albert Cahalan <acahalan@gmail.com>
      Signed-off-by: Prasanna Meda <mlp@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      99f89551
    • A
      [PATCH] core: use list_move() · 1bfba4e8
      Akinobu Mita authored
      This patch converts the combination of list_del(A) and list_add(A, B) to
      list_move(A, B).
      
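      As an illustration of the conversion (the list and entry names here are
      hypothetical, not taken from any particular file touched by the patch):

      	/* before: two calls */
      	list_del(&req->queue);
      	list_add(&req->queue, &pending);

      	/* after: one equivalent call */
      	list_move(&req->queue, &pending);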
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1bfba4e8
    • A
      spelling fixes · d6e05edc
      Andreas Mohr authored
      acquired (aquired)
      contiguous (contigious)
      successful (succesful, succesfull)
      surprise (suprise)
      whether (weather)
      some other misspellings
      Signed-off-by: Andreas Mohr <andi@lisas.de>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      d6e05edc
  3. 26 June 2006, 8 commits
    • W
      [PATCH] readahead: backoff on I/O error · 76d42bd9
      Wu Fengguang authored
      Backoff readahead size exponentially on I/O error.
      
      Michael Tokarev <mjt@tls.msk.ru> described the problem as:
      
      [QUOTE]
      Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
      In order to "fix" it, one has to read it and write it to another CD-rom,
      or something.. or just ignore the error (if it's just a skip in a video
      stream).  Let's assume the unreadable block is number U.
      
      But current behavior is just insane.  An application requests block
      number N, which is before U. Kernel tries to read-ahead blocks N..U.
      Cdrom drive tries to read it, re-read it.. for some time.  Finally,
      when all the N..U-1 blocks are read, kernel returns block number N
      (as requested) to an application, successfully.
      
      Now an app requests block number N+1, and kernel tries to read
      blocks N+1..U+1.  Retrying again as in previous step.
      
      And so on, up to when an app requests block number U-1.  And when,
      finally, it requests block U, it receives read error.
      
      So, the kernel currently tries to re-read the same failing block as
      many times as the current readahead value (256 (times?) by default).
      
      This whole process already killed my cdrom drive (I posted about it
      to LKML several months ago) - literally, the drive has fried, and
      does not work anymore.  Of course that problem was a bug in firmware
      (or whatever) of the drive *too*, but.. main problem with that is
      current readahead logic as described above.
      [/QUOTE]
      
      Which was confirmed by Jens Axboe <axboe@suse.de>:
      
      [QUOTE]
      For ide-cd, it tends to only end the first part of the request on a
      medium error. So you may see a lot of repeats :/
      [/QUOTE]
      
      With this patch, retries are expected to be reduced from, say, 256, to 5.
      
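      A minimal sketch of the backoff idea (simplified; the helper name and the
      exact factor are illustrative rather than a verbatim copy of the patch):

      	/* On a read I/O error, shrink the readahead window sharply so a bad
      	 * sector is retried only a handful of times rather than once per
      	 * readahead page. */
      	static void shrink_readahead_size_eio(struct file_ra_state *ra)
      	{
      		ra->ra_pages /= 4;	/* e.g. 256 -> 64 -> 16 -> 4 -> 1 */
      	}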
      [akpm@osdl.org: cleanups]
      Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      76d42bd9
    • R
      [PATCH] kernel-doc: mm/readahead fixup · bd40cdda
      Randy Dunlap authored
      Put short function description for read_cache_pages() on one line as needed
      by kernel-doc.
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bd40cdda
    • N
      [PATCH] Prepare for __copy_from_user_inatomic to not zero missed bytes · 01408c49
      NeilBrown authored
      The problem is that when we write to a file, the copy from userspace to
      pagecache is first done with preemption disabled, so if the source address is
      not immediately available the copy fails *and* *zeros* *the* *destination*.
      
      This is a problem because a concurrent read (which admittedly is an odd thing
      to do) might see zeros rather than what was there before the write, or what
      was there after, or some mixture of the two (any of these being a reasonable
      thing to see).
      
      If the copy did fail, it will immediately be retried with preemption
      re-enabled so any transient problem with accessing the source won't cause an
      error.
      
      The first copying does not need to zero any uncopied bytes, and doing so
      causes the problem.  It uses copy_from_user_atomic rather than copy_from_user
      so the simple expedient is to change copy_from_user_atomic to *not* zero out
      bytes on failure.
      
      The first of these two patches prepares for the change by fixing two places
      which assume copy_from_user_atomic does zero the tail.  The two usages are
      very similar pieces of code which copy from a userspace iovec into one or more
      page-cache pages.  These are changed to remove the assumption.
      
      The second patch changes __copy_from_user_inatomic* to not zero the tail.
      Once these are accepted, I will look at similar patches of other architectures
      where this is important (ppc, mips and sparc being the ones I can find).
      
      This patch:
      
      There is a problem with __copy_from_user_inatomic zeroing the tail of the
      buffer in the case of an error.  As it is called in atomic context, the error
      may be transient, so it results in zeros being written where maybe they
      shouldn't be.
      
      In the usage in filemap, this opens a window for a well timed read to see data
      (zeros) which is not consistent with any ordering of reads and writes.
      
      Most cases where __copy_from_user_inatomic is called, a failure results in
      __copy_from_user being called immediately.  As long as the latter zeros the
      tail, the former doesn't need to.  However in *copy_from_user_iovec
      implementations (in both filemap and ntfs/file), it is assumed that
      copy_from_user_inatomic will zero the tail.
      
      This patch removes that assumption, so that after this patch it will
      be safe for copy_from_user_inatomic to not zero the tail.
      
      This patch also adds some commentary to filemap.h and asm-i386/uaccess.h.
      
      After this patch, all architectures that might disable preempt when
      kmap_atomic is called need to have their __copy_from_user_inatomic* "fixed".
      This includes
       - powerpc
       - i386
       - mips
       - sparc
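
      For reference, a minimal sketch of the fast-path/slow-path copy described
      above, assuming the kmap_atomic interface of this era; it is not the
      verbatim mm/filemap.c code:

      	char *kaddr;
      	size_t left;

      	/* fast path: pagefaults/preemption are off inside kmap_atomic() */
      	kaddr = kmap_atomic(page, KM_USER0);
      	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
      	kunmap_atomic(kaddr, KM_USER0);

      	if (left) {
      		/* slow path: retry with faults allowed; only this copy needs
      		 * to zero the uncopied tail on a real failure */
      		kaddr = kmap(page);
      		left = __copy_from_user(kaddr + offset, buf, bytes);
      		kunmap(page);
      	}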
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      01408c49
    • C
      [PATCH] cpuset: remove extra cpuset_zone_allowed check in __alloc_pages · 43b0bc00
      Chris Wright authored
      This is redundant with the check in wakeup_kswapd.
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Acked-by: Paul Jackson <pj@sgi.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      43b0bc00
    • A
      [PATCH] pdflush: handle resume wakeups · d616e09a
      Andrew Morton authored
      pdflush is carefully designed to ensure that all wakeups have some
      corresponding work to do - if a woken-up pdflush thread discovers that it
      hasn't been given any work to do then this is considered an error.
      
      That all broke when swsusp came along - because a timer-delivered wakeup to a
      frozen pdflush thread will just get lost.  This causes the pdflush thread to
      get lost as well: the writeback timer is supposed to be re-armed by pdflush in
      process context, but pdflush doesn't execute the callout which does this.
      
      Fix that up by ignoring the return value from try_to_freeze(): just proceed,
      see if we have any work pending, and only go back to sleep if that is not the
      case.
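
      A rough sketch of the resulting loop structure (simplified; the helper
      names are hypothetical, not the literal mm/pdflush.c code):

      	for (;;) {
      		set_current_state(TASK_INTERRUPTIBLE);
      		schedule();			/* wait for the next wakeup */

      		/* A frozen thread can miss a timer wakeup entirely, so ignore
      		 * the return value of try_to_freeze() ... */
      		try_to_freeze();

      		/* ... and simply check whether any work is pending. */
      		if (have_pending_work())	/* hypothetical helper */
      			do_pending_work();	/* hypothetical helper */
      	}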
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d616e09a
    • C
      [PATCH] Allow migration of mlocked pages · e6a1530d
      Christoph Lameter authored
      Hugh clarified the role of VM_LOCKED.  So we can now implement page
      migration for mlocked pages.
      
      Allow the migration of mlocked pages.  This means that try_to_unmap must
      unmap mlocked pages in the migration case.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e6a1530d
    • C
      [PATCH] page migration: Support a vma migration function · 7b2259b3
      Christoph Lameter authored
      Hooks for calling vma specific migration functions
      
      With this patch a vma may define a vma->vm_ops->migrate function.  That
      function may perform page migration on its own (some vmas may not contain page
      structs and therefore cannot be handled by regular page migration, and pages
      in a vma may require special preparatory treatment before migration is
      possible, etc.).  Only mmap_sem is held when the migration function is called.  The
      migrate() function gets passed two sets of nodemasks describing the source and
      the target of the migration.  The flags parameter either contains
      
      MPOL_MF_MOVE	which means that only pages used exclusively by
      		the specified mm should be moved
      
      or
      
      MPOL_MF_MOVE_ALL which means that pages shared with other processes
      		should also be moved.
      
      The migration function returns 0 on success or an error condition.  An error
      condition will prevent regular page migration from occurring.
      
      On its own this patch cannot be included since there are no users for this
      functionality.  But it seems that the uncached allocator will need this
      functionality at some point.
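
      A hedged sketch of how a driver might wire up the hook described above
      (only the ->migrate member and its arguments come from the description;
      everything else is illustrative):

      	static int myvma_migrate(struct vm_area_struct *vma,
      				 const nodemask_t *from, const nodemask_t *to,
      				 unsigned long flags)
      	{
      		/* Move this vma's backing memory from the 'from' nodes to the
      		 * 'to' nodes, honouring MPOL_MF_MOVE vs MPOL_MF_MOVE_ALL in
      		 * flags.  Return 0 on success; an error suppresses regular
      		 * page migration for this vma. */
      		return 0;
      	}

      	static struct vm_operations_struct myvma_ops = {
      		.migrate = myvma_migrate,
      	};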
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      7b2259b3
    • Z
      [PATCH] AOP_TRUNCATED_PAGE victims in read_pages() belong in the LRU · 9f1a3cfc
      Zach Brown authored
      
      Nick Piggin rightly pointed out that the introduction of AOP_TRUNCATED_PAGE
      to read_pages() was wrong to leave A_T_P victim pages in the page cache but
      not put them in the LRU.  Failing to do so hid them from the VM.
      
      A_T_P just means that the aop method unlocked the page rather than
      performing IO.  It would be very rare that the page was truncated between
      the unlock and testing A_T_P.  So we leave the pages in the LRU for likely
      reuse soon rather than backing them back out of the page cache.  We do this
      by matching the behaviour before the A_T_P introduction which added pages
      to the LRU regardless of what ->readpage() did.
      
      This doesn't include the unrelated cleanup in Nick's initial fix which
      changed read_pages() to return void to match its only caller's behaviour of
      ignoring errors.
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9f1a3cfc
  4. 23 June 2006, 28 commits
    • J
      [PATCH] Kill PF_SYNCWRITE flag · b31dc66a
      Jens Axboe authored
      A process flag to indicate whether we are doing sync io is incredibly
      ugly. It also causes performance problems when one does a lot of async
      io and then proceeds to sync it. Part of the io will go out as async,
      and the other part as sync. This causes a disconnect between the
      previously submitted io and the synced io.  For io schedulers such as CFQ,
      this will cause lost merges and suboptimal behaviour in scheduling.
      
      Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
      the O_DIRECT path just directly indicate that the writes are sync
      by using WRITE_SYNC instead.
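
      Conceptually the change replaces a per-task hint with a per-request one;
      a hedged before/after sketch:

      	/* before: flag the whole task for the duration of the sync */
      	current->flags |= PF_SYNCWRITE;
      	/* ... issue the writes ... */
      	current->flags &= ~PF_SYNCWRITE;

      	/* after: tag the individual request as synchronous instead */
      	submit_bio(WRITE_SYNC, bio);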
      Signed-off-by: Jens Axboe <axboe@suse.de>
      b31dc66a
    • E
      [PATCH] More BUG_ON conversion · 125e1874
      Eric Sesterhenn authored
      Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: James Bottomley <James.Bottomley@steeleye.com>
      Acked-by: "Salyzyn, Mark" <mark_salyzyn@adaptec.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      125e1874
    • N
      [PATCH] Remove semi-softlockup from invalidate_mapping_pages · e0f23603
      NeilBrown authored
      If invalidate_mapping_pages is called to invalidate a very large mapping
      (e.g.  a very large block device) and if the only active page in that
      device is near the end (or at least, at a very large index), such as, say,
      the superblock of an md array, and if that page happens to be locked when
      invalidate_mapping_pages is called, then
      
        pagevec_lookup will return this page and, as it is locked, 'next' will be
        incremented and pagevec_lookup will be called again, and again, and again,
        while we count from 0 up to a very large number.
      
      We should really always set 'next' to 'page->index+1' before going around
      the loop again, not just if the page isn't locked.
      
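      A simplified sketch of the fixed loop (variable names abbreviated; not
      the verbatim function body):

      	while (next <= end &&
      	       pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
      		for (i = 0; i < pagevec_count(&pvec); i++) {
      			struct page *page = pvec.pages[i];

      			next = page->index + 1;	/* always advance, even if ... */
      			if (TestSetPageLocked(page))
      				continue;	/* ... this page is locked */
      			/* invalidate the page, then unlock it */
      		}
      		pagevec_release(&pvec);
      	}
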
      Cc: "Steinar H. Gunderson" <sgunderson@bigfoot.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e0f23603
    • R
      [PATCH] percpu_counters: create lib/percpu_counter.c · 3cbc5640
      Ravikiran G Thirumalai authored
      - Move percpu_counter routines from mm/swap.c to lib/percpu_counter.c
      Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3cbc5640
    • P
      [PATCH] read_mapping_page for address space · 090d2b18
      Pekka Enberg authored
      Add read_mapping_page() which is used for callers that pass
      mapping->a_ops->readpage as the filler for read_cache_page.  This removes
      some duplication from filesystem code.
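
      The helper is essentially a thin wrapper; a hedged sketch of its shape
      (the actual definition may differ in detail):

      	static inline struct page *read_mapping_page(struct address_space *mapping,
      						     unsigned long index, void *data)
      	{
      		filler_t *filler = (filler_t *)mapping->a_ops->readpage;

      		return read_cache_page(mapping, index, filler, data);
      	}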
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      090d2b18
    • H
      [PATCH] x86: cache pollution aware __copy_from_user_ll() · c22ce143
      Hiro Yoshioka authored
      Use the x86 cache-bypassing copy instructions for copy_from_user().
      
      Some performance data:
      
      Total of GLOBAL_POWER_EVENTS (CPU cycle samples)
      
      2.6.12.4.orig    1921587
      2.6.12.4.nt      1599424
      1599424/1921587=83.23% (16.77% reduction)
      
      BSQ_CACHE_REFERENCE (L3 cache miss)
      2.6.12.4.orig      57427
      2.6.12.4.nt        20858
      20858/57427=36.32% (63.7% reduction)
      
      L3 cache miss reduction of __copy_from_user_ll
      samples  %
      37408    65.1412  vmlinux                  __copy_from_user_ll
      23        0.1103  vmlinux                  __copy_user_zeroing_intel_nocache
      23/37408=0.061% (99.94% reduction)
      
      Top 5 of 2.6.12.4.nt
      Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
      samples  %        app name                 symbol name
      128392    8.0274  vmlinux                  __copy_user_zeroing_intel_nocache
      64206     4.0143  vmlinux                  journal_add_journal_head
      59746     3.7355  vmlinux                  do_get_write_access
      47674     2.9807  vmlinux                  journal_put_journal_head
      46021     2.8774  vmlinux                  journal_dirty_metadata
      pattern9-0-cpu4-0-09011728/summary.out
      
      Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
      samples  %        app name                 symbol name
      69755     4.2861  vmlinux                  __copy_user_zeroing_intel_nocache
      55685     3.4215  vmlinux                  journal_add_journal_head
      52371     3.2179  vmlinux                  __find_get_block
      45504     2.7960  vmlinux                  journal_put_journal_head
      36005     2.2123  vmlinux                  journal_stop
      pattern9-0-cpu4-0-09011744/summary.out
      
      Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
      samples  %        app name                 symbol name
      1147      5.4994  vmlinux                  journal_add_journal_head
      881       4.2240  vmlinux                  journal_dirty_data
      872       4.1809  vmlinux                  blk_rq_map_sg
      734       3.5192  vmlinux                  journal_commit_transaction
      617       2.9582  vmlinux                  radix_tree_delete
      pattern9-0-cpu4-0-09011731/summary.out
      
      iozone results are
      
      original 2.6.12.4 CPU time = 207.768 sec
      cache aware       CPU time = 184.783 sec
      (run three times)
      184.783/207.768=88.94% (11.06% reduction)
      
      original:
      pattern9-0-cpu4-0-08191720/iozone.out:  CPU Utilization: Wall time   45.997    CPU time   64.527    CPU utilization 140.28 %
      pattern9-0-cpu4-0-08191741/iozone.out:  CPU Utilization: Wall time   46.878    CPU time   71.933    CPU utilization 153.45 %
      pattern9-0-cpu4-0-08191743/iozone.out:  CPU Utilization: Wall time   45.152    CPU time   71.308    CPU utilization 157.93 %
      
      cache aware:
      pattern9-0-cpu4-0-09011728/iozone.out:  CPU Utilization: Wall time   44.842    CPU time   62.465    CPU utilization 139.30 %
      pattern9-0-cpu4-0-09011731/iozone.out:  CPU Utilization: Wall time   44.718    CPU time   59.273    CPU utilization 132.55 %
      pattern9-0-cpu4-0-09011744/iozone.out:  CPU Utilization: Wall time   44.367    CPU time   63.045    CPU utilization 142.10 %
      Signed-off-by: Hiro Yoshioka <hyoshiok@miraclelinux.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c22ce143
    • D
      [PATCH] SELinux: add security_task_movememory calls to mm code · 86c3a764
      David Quigley authored
      This patch inserts security_task_movememory hook calls into memory management
      code to enable security modules to mediate this operation between tasks.
      
      Since the last posting, the hook has been renamed following feedback from
      Christoph Lameter.
      Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: James Morris <jmorris@namei.org>
      Cc: Andi Kleen <ak@muc.de>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      86c3a764
    • C
      [PATCH] page migration: sys_move_pages(): support moving of individual pages · 742755a1
      Christoph Lameter authored
      move_pages() is used to move individual pages of a process. The function can
      be used to determine the location of pages and to move them onto the desired
      node. move_pages() returns status information for each page.
      
      long move_pages(pid, number_of_pages_to_move,
      		addresses_of_pages[],
      		nodes[] or NULL,
      		status[],
      		flags);
      
      The addresses_of_pages argument is an array of void * pointers to the
      pages to be moved.
      
      The nodes array contains the node numbers that the pages should be moved
      to. If a NULL is passed instead of an array then no pages are moved but
      the status array is updated. The status request may be used to determine
      the page state before issuing another move_pages() to move pages.
      
      The status array will contain the state of all individual page migration
      attempts when the function terminates. The status array is only valid if
      move_pages() completed successfully.
      
      Possible page states in status[]:
      
      0..MAX_NUMNODES	The page is now on the indicated node.
      
      -ENOENT		Page is not present
      
      -EACCES		Page is mapped by multiple processes and can only
      		be moved if MPOL_MF_MOVE_ALL is specified.
      
      -EPERM		The page has been mlocked by a process/driver and
      		cannot be moved.
      
      -EBUSY		Page is busy and cannot be moved. Try again later.
      
      -EFAULT		Invalid address (no VMA or zero page).
      
      -ENOMEM		Unable to allocate memory on target node.
      
      -EIO		Unable to write back page. The page must be written
      		back in order to move it since the page is dirty and the
      		filesystem does not provide a migration function that
      		would allow the moving of dirty pages.
      
      -EINVAL		A dirty page cannot be moved. The filesystem does not provide
      		a migration function and has no ability to write back pages.
      
      The flags parameter indicates what types of pages to move:
      
      MPOL_MF_MOVE	Move pages that are only mapped by the process.
      
      MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
      		Requires sufficient capabilities.
      
      Possible return codes from move_pages()
      
      -ENOENT		No pages found that would require moving. All pages
      		are either already on the target node, not present, had an
      		invalid address or could not be moved because they were
      		mapped by multiple processes.
      
      -EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
      		to migrate pages in a kernel thread.
      
      -EPERM		MPOL_MF_MOVE_ALL specified without sufficient privileges,
      		or an attempt to move a process belonging to another user.
      
      -EACCES		One of the target nodes is not allowed by the current cpuset.
      
      -ENODEV		One of the target nodes is not online.
      
      -ESRCH		Process does not exist.
      
      -E2BIG		Too many pages to move.
      
      -ENOMEM		Not enough memory to allocate control array.
      
      -EFAULT		Parameters could not be accessed.
      
      A test program for move_pages() may be found with the patches
      on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
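
      A hedged user-space usage sketch of the interface described above (raw
      syscall invocation; it assumes __NR_move_pages and MPOL_MF_MOVE are
      visible through the usual headers, and addr0/addr1 are page-aligned
      addresses in the target process):

      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <numaif.h>

      	void *pages[2]  = { addr0, addr1 };
      	int   nodes[2]  = { 1, 1 };		/* desired node for each page */
      	int   status[2];

      	long rc = syscall(__NR_move_pages, getpid(), 2UL, pages, nodes,
      			  status, MPOL_MF_MOVE);
      	/* rc < 0: the call itself failed; otherwise inspect status[i] */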
      
      From: Christoph Lameter <clameter@sgi.com>
      
        Detailed results for sys_move_pages()
      
        Pass a pointer to an integer to get_new_page() that may be used to
        indicate where the completion status of a migration operation should be
        placed.  This allows sys_move_pages() to report back exactly what happened to
        each page.
      
        Wish there would be a better way to do this. Looks a bit hacky.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      742755a1
    • C
      [PATCH] page migration: use allocator function for migrate_pages() · 95a402c3
      Christoph Lameter authored
      Instead of passing a list of new pages, pass a function to allocate a new
      page.  This allows the correct placement of MPOL_INTERLEAVE pages during page
      migration.  It also further simplifies the callers of migrate pages.
      migrate_pages() becomes similar to migrate_pages_to() so drop
      migrate_pages_to().  The batching of new page allocations becomes unnecessary.
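
      A hedged sketch of the allocator-callback shape described above (the
      status pointer corresponds to the completion-status extension mentioned
      in the sys_move_pages() entry; details are illustrative):

      	static struct page *new_node_page(struct page *page, unsigned long private,
      					  int **result)
      	{
      		/* 'private' carries the target node in this sketch */
      		return alloc_pages_node((int)private, GFP_HIGHUSER, 0);
      	}

      	err = migrate_pages(&pagelist, new_node_page, (unsigned long)target_node);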
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      95a402c3
    • C
      [PATCH] page migration: handle freeing of pages in migrate_pages() · aaa994b3
      Christoph Lameter authored
      Do not leave pages on the lists passed to migrate_pages().  Seems that we will
      not need any postprocessing of pages.  This will simplify the handling of
      pages by the callers of migrate_pages().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      aaa994b3
    • C
      [PATCH] page migration: simplify migrate_pages() · e24f0b8f
      Christoph Lameter authored
      Currently migrate_pages() is a mess with lots of gotos.  Extract two functions
      from migrate_pages() and get rid of the gotos.
      
      Plus we can just unconditionally set the locked bit on the new page since we
      are the only one holding a reference.  Locking is to stop others from
      accessing the page once we establish references to the new page.
      
      Remove the list_del from move_to_lru in order to have finer control over list
      processing.
      
      [akpm@osdl.org: add debug check]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e24f0b8f
    • K
      [PATCH] printk() should not be called under zone->lock · 8f9de51a
      Kirill Korotaev authored
      This patch fixes printk() under zone->lock in show_free_areas().  It can be
      unsafe to call printk() under this lock, since the caller can try to
      allocate/free some memory and self-deadlock on this lock.  I found
      allocations/freeing of memory in both netconsole and the serial console.
      
      This issue was hit in reality when meminfo was periodically printed for
      debug purposes and netconsole was used.
      Signed-off-by: Kirill Korotaev <dev@openvz.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8f9de51a
    • R
      [PATCH] kernel-doc for mm/filemap.c · 485bb99b
      Randy Dunlap authored
      mm/filemap.c:
      - add lots of kernel-doc;
      - fix some typos and kernel-doc errors;
      - drop some blank lines between function close and EXPORT_SYMBOL();
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      485bb99b
    • P
      [PATCH] slab: kmalloc, kzalloc comments cleanup and fix · 800590f5
      Paul Drynoff authored
      - Move the comments for kmalloc to the right place; they currently sit near
        __do_kmalloc

      - Add comments for kzalloc

      - Add more detailed comments for kmalloc

      - Make the "kmalloc" and "kzalloc" man pages appear after "make mandocs"
      
      [rdunlap@xenotime.net: simplification]
      Signed-off-by: Paul Drynoff <pauldrynoff@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      800590f5
    • A
      [PATCH] initialise total_memory() earlier · bd1e22b8
      Andrew Morton authored
      Initialise total_memory earlier in boot, because if for some reason we run
      page reclaim early in boot, we don't want total_memory to be zero when we use
      it as a divisor.
      
      And rename total_memory to vm_total_pages to avoid naming clashes with
      architectures.
      
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Martin Bligh <mbligh@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bd1e22b8
    • I
      [PATCH] mm/slab.c: fix early init assumption · e0a42726
      Ingo Molnar authored
      The SLAB bootstrap code assumes that the first two kmalloc caches created
      (the INDEX_AC and INDEX_L3 kmalloc caches) won't be off-slab.  But due to the
      AC and L3 structure size increase in lockdep, one of them ended up being
      off-slab, and subsequently crashing with:
      
      Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
       [<ffffffff80267478>] kmem_cache_alloc+0x26/0x7d
      
      The fix is to introduce a bootstrap flag and to use it to prevent off-slab
      caches being created so early during bootup.
      
      (The calculation for off-slab caches is quite complex, so I didn't want to
      complicate things by introducing yet another INDEX_ calculation; the flag
      approach is simpler and smaller.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e0a42726
    • H
      [PATCH] fix update_mmu_cache in fremap.c · 668e0d8f
      Hugh Dickins authored
      There are two calls to update_mmu_cache in fremap.c, both defective.
      The one in install_page needs to be accompanied by lazy_mmu_prot_update
      (some other cleanup time, move that into ia64 update_mmu_cache itself); and
      the one in install_file_pte should be removed since the pte is not present.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      668e0d8f
    • H
      [PATCH] swapoff: use atomic_inc_not_zero() on mm_users · 70af7c5c
      Hugh Dickins authored
      Now that we have atomic_inc_not_zero, it's more elegant for try_to_unuse to
      use that on mm_users: doesn't actually matter at present, but safer to be
      sure that once mm_users has gone to 0, nothing raises it for an instant.
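
      The pattern being adopted, in a short hedged sketch:

      	/* old: blindly bump the count (could in principle resurrect a dead mm) */
      	atomic_inc(&mm->mm_users);

      	/* new: only take a reference if mm_users has not already hit zero */
      	if (!atomic_inc_not_zero(&mm->mm_users))
      		mm = NULL;	/* treat the mm as already gone */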
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      70af7c5c
    • D
      [PATCH] add page_mkwrite() vm_operations method · 9637a5ef
      David Howells authored
      Add a new VMA operation to notify a filesystem or other driver about the
      MMU generating a fault because userspace attempted to write to a page
      mapped through a read-only PTE.
      
      This facility permits the filesystem or driver to:
      
       (*) Implement storage allocation/reservation on attempted write, and so to
           deal with problems such as ENOSPC more gracefully (perhaps by generating
           SIGBUS).
      
       (*) Delay making the page writable until the contents have been written to a
           backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
           It permits the filesystem to have some guarantee about the state of the
           cache.
      
       (*) Account and limit number of dirty pages. This is one piece of the puzzle
           needed to make shared writable mapping work safely in FUSE.
      
      Needed by cachefs (Or is it cachefiles?  Or fscache? <head spins>).
      
      At least four other groups have stated an interest in it or a desire to use
      the functionality it provides: FUSE, OCFS2, NTFS and JFFS2.  Also, things like
      EXT3 really ought to use it to deal with the case of shared-writable mmap
      encountering ENOSPC before we permit the page to be dirtied.
      
      From: Peter Zijlstra <a.p.zijlstra@chello.nl>
      
        get_user_pages(.write=1, .force=1) can generate COW hits on read-only
        shared mappings, this patch traps those as mkpage_write candidates and fails
        to handle them the old way.
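
      A hedged sketch of a filesystem or driver hooking this up (only the
      ->page_mkwrite member comes from the description above; the handler body,
      the myfs_reserve_space() helper and the exact return-value convention are
      hypothetical):

      	static int myfs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
      	{
      		/* e.g. reserve blocks for this page, or start pushing it to a
      		 * backing cache; failing here lets the fault path raise SIGBUS
      		 * instead of silently dirtying an unbacked page */
      		if (myfs_reserve_space(vma->vm_file, page))
      			return -ENOSPC;
      		return 0;
      	}

      	static struct vm_operations_struct myfs_vm_ops = {
      		.page_mkwrite = myfs_page_mkwrite,
      	};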
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Joel Becker <Joel.Becker@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9637a5ef
    • A
      [PATCH] sparsemem: record nid during memory present · 30c253e6
      Andy Whitcroft authored
      Record the node id as we mark sections for instantiation.  Use this nid
      during instantiation to direct allocations.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Bob Picco <bob.picco@hp.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Martin Bligh <mbligh@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      30c253e6
    • P
      [PATCH] slab: verify pointers before free · ddc2e812
      Pekka Enberg authored
      Passing an invalid pointer to kfree() and kmem_cache_free() is likely to
      cause bad memory corruption or even take down the whole system because the
      bad pointer is likely reused immediately due to the per-CPU caches.  Until
      now, we don't do any verification for this if CONFIG_DEBUG_SLAB is
      disabled.
      
      As suggested by Linus, add PageSlab check to page_to_cache() and
      page_to_slab() to verify pointers passed to kfree().  Also, move the
      stronger check from cache_free_debugcheck() to kmem_cache_free() to ensure
      that the passed pointer actually belongs to the cache from which we're about
      to free the object.
      
      For page_to_cache() and page_to_slab(), the assertions should have
      virtually no extra cost (two instructions, no data cache pressure) and for
      kmem_cache_free() the overhead should be minimal.
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Linus Torvalds <torvalds@osdl.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ddc2e812
    • C
      [PATCH] More page migration: use migration entries for file pages · 04e62a29
      Christoph Lameter authored
      This implements the use of migration entries to preserve ptes of file backed
      pages during migration.  Processes can therefore be migrated back and forth
      without losing their connection to pagecache pages.
      
      Note that we implement the migration entries only for linear mappings.
      Nonlinear mappings still require the unmapping of the ptes for migration.
      
      And another writepage() ugliness shows up.  writepage() can drop the page
      lock.  Therefore we have to remove migration ptes before calling writepages()
      in order to avoid having migration entries point to unlocked pages.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      04e62a29
    • C
      [PATCH] More page migration: do not inc/dec rss counters · 442c9137
      Christoph Lameter authored
      If we install a migration entry then the rss does not really decrease, since
      the page is just moved somewhere else.  We can save ourselves the work of
      decrementing and later incrementing it, which would just eventually cause
      cacheline bouncing.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      442c9137
    • C
      [PATCH] Swapless page migration: modify core logic · 6c5240ae
      Christoph Lameter authored
      Use the migration entries for page migration
      
      This modifies the migration code to use the new migration entries.  It now
      becomes possible to migrate anonymous pages without having to add a swap
      entry.
      
      We add a couple of new functions to replace migration entries with the proper
      ptes.
      
      We cannot take the tree_lock for migrating anonymous pages anymore.  However,
      we know that we hold the only remaining reference to the page when the page
      count reaches 1.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6c5240ae
    • C
      [PATCH] Swapless page migration: rip out swap based logic · d75a0fcd
      Christoph Lameter authored
      Rip the page migration logic out.
      
      Remove all code that has to do with swapping during page migration.
      
      This also guts the ability to migrate pages to swap.  No one used that, so
      let's let it go for good.
      
      Page migration should be a bit broken after this patch.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d75a0fcd
    • C
      [PATCH] Swapless page migration: add R/W migration entries · 0697212a
      Christoph Lameter authored
      Implement read/write migration ptes
      
      We take the upper two swapfiles for the two types of migration ptes and define
      a series of macros in swapops.h.
      
      The VM is modified to handle the migration entries.  migration entries can
      only be encountered when the page they are pointing to is locked.  This limits
      the number of places one has to fix.  We also check in copy_pte_range and in
      mprotect_pte_range() for migration ptes.
      
      We check for migration ptes in do_swap_cache and call a function that will
      then wait on the page lock.  This allows us to effectively stop all accesses
      to the page.
      
      Migration entries are created by try_to_unmap if called for migration and
      removed by local functions in migrate.c
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Several times while testing swapless page migration (I've no NUMA, just
        hacking it up to migrate recklessly while running load), I've hit the
        BUG_ON(!PageLocked(p)) in migration_entry_to_page.
      
        This comes from an orphaned migration entry, unrelated to the current
        correctly locked migration, but hit by remove_anon_migration_ptes as it
        checks an address in each vma of the anon_vma list.
      
        Such an orphan may be left behind if an earlier migration raced with fork:
        copy_one_pte can duplicate a migration entry from parent to child, after
        remove_anon_migration_ptes has checked the child vma, but before it has
        removed it from the parent vma.  (If the process were later to fault on this
        orphaned entry, it would hit the same BUG from migration_entry_wait.)
      
        This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
        not.  There's no such problem with file pages, because vma_prio_tree_add
        adds child vma after parent vma, and the page table locking at each end is
        enough to serialize.  Follow that example with anon_vma: add new vmas to the
        tail instead of the head.
      
        (There's no corresponding problem when inserting migration entries,
        because a missed pte will leave the page count and mapcount high, which is
        allowed for.  And there's no corresponding problem when migrating via swap,
        because a leftover swap entry will be correctly faulted.  But the swapless
        method has no refcounting of its entries.)
      
      From: Ingo Molnar <mingo@elte.hu>
      
        pte_unmap_unlock() takes the pte pointer as an argument.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Several times while testing swapless page migration, gcc has tried to exec
        a pointer instead of a string: smells like COW mappings are not being
        properly write-protected on fork.
      
        The protection in copy_one_pte looks very convincing, until at last you
        realize that the second arg to make_migration_entry is a boolean "write",
        and SWP_MIGRATION_READ is 30.
      
        Anyway, it's better done like in change_pte_range, using
        is_write_migration_entry and make_migration_entry_read.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Remove unnecessary obfuscation from sys_swapon's range check on swap type,
        which blew up causing memory corruption once swapless migration made
        MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
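
      Hugh's copy_one_pte point above, as a hedged sketch (the helper names are
      the ones this patch introduces; the surrounding code is simplified):

      	swp_entry_t entry = pte_to_swp_entry(pte);

      	if (is_write_migration_entry(entry) && is_cow_mapping(vm_flags)) {
      		/* A COW mapping must not hand the child a writable migration
      		 * entry: downgrade it to the read variant explicitly instead
      		 * of relying on the boolean "write" argument. */
      		make_migration_entry_read(&entry);
      		set_pte_at(src_mm, addr, src_pte, swp_entry_to_pte(entry));
      	}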
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      From: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0697212a
    • C
      [PATCH] page migration cleanup: move fallback handling into special function · 8351a6e4
      Christoph Lameter authored
      Move the fallback code into a new fallback function and make the function
      behave like any other migration function.  This requires retaking the lock if
      pageout() drops it.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8351a6e4