1. 23 Jun, 2006 40 commits
    • A
      [PATCH] fs/locks.c: make posix_locks_deadlock() static · b0904e14
      Adrian Bunk committed
      We can now make posix_locks_deadlock() static.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b0904e14
    • M
      [PATCH] vfs: add lock owner argument to flush operation · 75e1fcc0
      Miklos Szeredi committed
      Pass the POSIX lock owner ID to the flush operation.
      
      This is useful for filesystems which don't want to store any locking state
      in inode->i_flock but want to handle locking/unlocking POSIX locks
      internally.  FUSE is one such filesystem, but I think it is possible that
      some network filesystems would need this also.
      
      Also add a flag to indicate that a POSIX locking request was generated by
      close(), so filesystems using the above feature won't send an extra locking
      request in this case.
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      75e1fcc0
    • P
      [PATCH] read_mapping_page for address space · 090d2b18
      Pekka Enberg committed
      Add read_mapping_page() which is used for callers that pass
      mapping->a_ops->readpage as the filler for read_cache_page.  This removes
      some duplication from filesystem code.
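
      The shape of such a helper can be sketched in userspace C.  The types
      below are toy stand-ins, not the kernel definitions, and toy_readpage()
      is a hypothetical filler used only for illustration:

```c
#include <stddef.h>

/* Toy stand-ins for the kernel types; NOT the real definitions. */
struct page { unsigned long index; int uptodate; };
typedef int filler_t(void *data, struct page *page);

struct address_space_operations {
    filler_t *readpage;
};

struct address_space {
    const struct address_space_operations *a_ops;
    struct page pages[4];               /* toy page cache */
};

/* Toy read_cache_page(): look a page up and fill it if it is not up to date. */
struct page *read_cache_page(struct address_space *mapping, unsigned long index,
                             filler_t *filler, void *data)
{
    struct page *page;

    if (index >= 4)
        return NULL;
    page = &mapping->pages[index];
    page->index = index;
    if (!page->uptodate && filler(data, page) != 0)
        return NULL;
    return page;
}

/* The helper: callers no longer spell out mapping->a_ops->readpage. */
struct page *read_mapping_page(struct address_space *mapping,
                               unsigned long index, void *data)
{
    return read_cache_page(mapping, index, mapping->a_ops->readpage, data);
}

/* Hypothetical filler that just marks the page as read. */
int toy_readpage(void *data, struct page *page)
{
    (void)data;
    page->uptodate = 1;
    return 0;
}
```

      A filesystem that previously passed its own readpage as the filler would
      call read_mapping_page(mapping, index, data) instead.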
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      090d2b18
    • J
      [PATCH] Add a sysfs file to determine if a kexec kernel is loaded · c330dda9
      Jeff Moyer committed
      Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded.  Each
      file contains a simple boolean value indicating whether the relevant kernel
      has been loaded into memory.  The motivation for this is geared around
      support.
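
      A support tool could read the two files along these lines.  This is a
      sketch that assumes each file holds a single decimal 0 or 1; the helper
      names are made up for the example:

```c
#include <stdio.h>

/*
 * Returns 1 if the file holds a non-zero integer, 0 if it holds 0,
 * and -1 if the file cannot be opened or parsed.
 */
int read_sysfs_bool(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return value != 0;
}

/* Paths taken from the commit message above. */
int kexec_kernel_loaded(void)
{
    return read_sysfs_bool("/sys/kernel/kexec_loaded");
}

int kexec_crash_kernel_loaded(void)
{
    return read_sysfs_bool("/sys/kernel/kexec_crash_loaded");
}
```

      A return value of -1 covers kernels that do not provide the files.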
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c330dda9
    • R
      [PATCH] m68k: Remove some unused definitions in zorro.h · 98317f12
      Roman Zippel committed
      These definitions have long been superseded by asm-offsets.h.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      98317f12
    • S
      [PATCH] swsusp: add architecture special saveable pages support · ce4ab001
      Shaohua Li committed
      1. Add architecture specific pages save/restore support.  The next two
         patches will use this to save/restore 'ACPI NVS' pages.
      
      2. Allow reserved pages to be 'nosave'.  This could avoid saving/restoring
         BIOS reserved pages.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Nigel Cunningham <nigel@suspend2.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ce4ab001
    • Z
      [PATCH] x86: kernel irq balance doesn't work · 1b61b910
      Zhang Yanmin committed
      On i386, kernel irq balance doesn't work.
      
      1) In function do_irq_balance, after the kernel finds the min_loaded cpu but
         before calling set_pending_irq to really pin the selected_irq to the
         target cpu, the kernel does a cpus_and with irq_affinity[selected_irq].
         Later on, when the irq is acked, the kernel calls
         move_native_irq=>desc->handler->set_affinity to change the irq affinity.
         However, every function pointed to by
         hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask)
         always changes irq_affinity[irq] to cpumask.  The next time
         do_irq_balance is called, it does the cpus_and again with
         irq_affinity[selected_irq], but irq_affinity[selected_irq] has already
         become the one cpu selected by the first irq balance.
      
      2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same
         issue.
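
      The narrowing effect can be modelled in a few lines of userspace C.  This
      is a toy model, not the i386 code: cpumask_t is reduced to a plain
      bitmask, and user_affinity[] is a hypothetical second array showing one
      possible way to keep the requested mask apart from the pinned one:

```c
typedef unsigned int cpumask_t;      /* one bit per CPU; toy stand-in */
#define NR_IRQS 4

cpumask_t irq_affinity[NR_IRQS];     /* overwritten by every set_affinity call */
cpumask_t user_affinity[NR_IRQS];    /* hypothetical: the mask that was requested */

/* Models the set_affinity handlers: they replace irq_affinity[irq]. */
void set_affinity(unsigned int irq, cpumask_t mask)
{
    irq_affinity[irq] = mask;
}

/* Lowest-numbered CPU in the mask, or 0 if the mask is empty. */
cpumask_t first_cpu_of(cpumask_t mask)
{
    return mask & (~mask + 1u);
}

/* Problematic step: candidates are filtered through a mask that shrinks. */
cpumask_t balance_buggy(unsigned int irq, cpumask_t candidates)
{
    cpumask_t target = first_cpu_of(candidates & irq_affinity[irq]);

    if (target)
        set_affinity(irq, target);
    return target;
}

/* One possible repair: filter through a mask set_affinity never narrows. */
cpumask_t balance_separate_mask(unsigned int irq, cpumask_t candidates)
{
    cpumask_t target = first_cpu_of(candidates & user_affinity[irq]);

    if (target)
        set_affinity(irq, target);
    return target;
}
```

      After one call to balance_buggy() the recorded affinity is a single CPU,
      so a later call that offers only other CPUs finds nothing to pick.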
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1b61b910
    • H
      [PATCH] x86: cache pollution aware __copy_from_user_ll() · c22ce143
      Hiro Yoshioka committed
      Use the x86 cache-bypassing copy instructions for copy_from_user().
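
      For readers unfamiliar with cache-bypassing stores, here is a userspace
      analogue using SSE2 intrinsics.  It is only a sketch of the idea and not
      the kernel routine: copy_nocache() is a made-up name, and the code falls
      back to memcpy() when SSE2 is unavailable or the buffer does not fit the
      streaming path:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy n bytes.  The streaming path is taken only when dst is 16-byte
 * aligned and n is a multiple of 16; otherwise a plain memcpy() is used.
 */
void copy_nocache(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    if (((uintptr_t)dst % 16) == 0 && (n % 16) == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i;

        for (i = 0; i < n / 16; i++) {
            /* Unaligned load, then a store that bypasses the cache. */
            _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
        }
        _mm_sfence();   /* make the streaming stores globally visible in order */
        return;
    }
#endif
    memcpy(dst, src, n);
}
```

      The destination data is written without being pulled into the cache,
      which is the effect the measurements below are about.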
      
      Some performance data are
      
      Total of GLOBAL_POWER_EVENTS (CPU cycle samples)
      
      2.6.12.4.orig    1921587
      2.6.12.4.nt      1599424
      1599424/1921587=83.23% (16.77% reduction)
      
      BSQ_CACHE_REFERENCE (L3 cache miss)
      2.6.12.4.orig      57427
      2.6.12.4.nt        20858
      20858/57427=36.32% (63.7% reduction)
      
      L3 cache miss reduction of __copy_from_user_ll
      samples  %
      37408    65.1412  vmlinux                  __copy_from_user_ll
      23        0.1103  vmlinux                  __copy_user_zeroing_intel_nocache
      23/37408=0.061% (99.94% reduction)
      
      Top 5 of 2.6.12.4.nt
      Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
      samples  %        app name                 symbol name
      128392    8.0274  vmlinux                  __copy_user_zeroing_intel_nocache
      64206     4.0143  vmlinux                  journal_add_journal_head
      59746     3.7355  vmlinux                  do_get_write_access
      47674     2.9807  vmlinux                  journal_put_journal_head
      46021     2.8774  vmlinux                  journal_dirty_metadata
      pattern9-0-cpu4-0-09011728/summary.out
      
      Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
      samples  %        app name                 symbol name
      69755     4.2861  vmlinux                  __copy_user_zeroing_intel_nocache
      55685     3.4215  vmlinux                  journal_add_journal_head
      52371     3.2179  vmlinux                  __find_get_block
      45504     2.7960  vmlinux                  journal_put_journal_head
      36005     2.2123  vmlinux                  journal_stop
      pattern9-0-cpu4-0-09011744/summary.out
      
      Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
      samples  %        app name                 symbol name
      1147      5.4994  vmlinux                  journal_add_journal_head
      881       4.2240  vmlinux                  journal_dirty_data
      872       4.1809  vmlinux                  blk_rq_map_sg
      734       3.5192  vmlinux                  journal_commit_transaction
      617       2.9582  vmlinux                  radix_tree_delete
      pattern9-0-cpu4-0-09011731/summary.out
      
      iozone results are
      
      original 2.6.12.4 CPU time = 207.768 sec
      cache aware       CPU time = 184.783 sec
      (three times run)
      184.783/207.768=88.94% (11.06% reduction)
      
      original:
      pattern9-0-cpu4-0-08191720/iozone.out:  CPU Utilization: Wall time   45.997    CPU time   64.527    CPU utilization 140.28 %
      pattern9-0-cpu4-0-08191741/iozone.out:  CPU Utilization: Wall time   46.878    CPU time   71.933    CPU utilization 153.45 %
      pattern9-0-cpu4-0-08191743/iozone.out:  CPU Utilization: Wall time   45.152    CPU time   71.308    CPU utilization 157.93 %
      
      cache aware:
      pattern9-0-cpu4-0-09011728/iozone.out:  CPU Utilization: Wall time   44.842    CPU time   62.465    CPU utilization 139.30 %
      pattern9-0-cpu4-0-09011731/iozone.out:  CPU Utilization: Wall time   44.718    CPU time   59.273    CPU utilization 132.55 %
      pattern9-0-cpu4-0-09011744/iozone.out:  CPU Utilization: Wall time   44.367    CPU time   63.045    CPU utilization 142.10 %
      Signed-off-by: Hiro Yoshioka <hyoshiok@miraclelinux.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c22ce143
    • D
      [PATCH] SELinux: add task_movememory hook · 35601547
      David Quigley committed
      This patch adds a new security hook, task_movememory, to be called when memory
      owned by a task is to be moved (e.g. when migrating pages to a different
      node).  The SELinux implementation of this hook is identical to the
      setscheduler implementation, but a separate hook is introduced to allow this
      check to be specialized in the future if necessary.
      
      Since the last posting, the hook has been renamed following feedback from
      Christoph Lameter.
      Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: James Morris <jmorris@namei.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      35601547
    • J
      [PATCH] lsm: add task_setioprio hook · 03e68060
      James Morris committed
      Implement an LSM hook for setting a task's IO priority, similar to the hook
      for setting a task's nice value.
      
      A previous version of this LSM hook was included in an older version of
      multiadm by Jan Engelhardt, although I don't recall it being submitted
      upstream.
      
      Also included is the corresponding SELinux hook, which re-uses the setsched
      permission in the process class.
      Signed-off-by: James Morris <jmorris@namei.org>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Jens Axboe <axboe@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      03e68060
    • C
      [PATCH] move_pages: fix 32 -> 64 bit compat function · 9216dfad
      Christoph Lameter committed
      The third parameter is defined as a pointer to an array of virtual
      addresses, which gave us some trouble.  The existing code calculated the
      wrong address in the array since I used void to avoid having to specify a
      type.
      
      I now use the correct type "compat_uptr_t __user *" in the definition of
      the function in kernel/compat.c.
      
      However, I used __u32 in syscalls.h.  We would have to include compat.h
      there in order to provide the same definition, which would generate an ugly
      include situation.
      
      On both ia64 and x86_64 compat_uptr_t is u32. So this works although
      parameter declarations differ.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9216dfad
    • C
      [PATCH] sys_move_pages: 32bit support (i386, x86_64) · 1b2db9fb
      Christoph Lameter committed
      sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)
      
      Add support for move_pages() on i386 and also add the compat functions
      necessary to run 32 bit binaries on x86_64.
      
      Add compat_sys_move_pages to the x86_64 32bit binary layer.  Note that it is
      not up to date so I added the missing pieces.  Not sure if this is done the
      right way.
      
      [akpm@osdl.org: compile fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1b2db9fb
    • C
      [PATCH] page migration: sys_move_pages(): support moving of individual pages · 742755a1
      Christoph Lameter committed
      move_pages() is used to move individual pages of a process. The function can
      be used to determine the location of pages and to move them onto the desired
      node. move_pages() returns status information for each page.
      
      long move_pages(pid, number_of_pages_to_move,
      		addresses_of_pages[],
      		nodes[] or NULL,
      		status[],
      		flags);
      
      The addresses of pages is an array of void * pointing to the
      pages to be moved.
      
      The nodes array contains the node numbers that the pages should be moved
      to. If a NULL is passed instead of an array then no pages are moved but
      the status array is updated. The status request may be used to determine
      the page state before issuing another move_pages() to move pages.
      
      The status array will contain the state of all individual page migration
      attempts when the function terminates. The status array is only valid if
      move_pages() completed successfully.
      
      Possible page states in status[]:
      
      0..MAX_NUMNODES	The page is now on the indicated node.
      
      -ENOENT		Page is not present
      
      -EACCES		Page is mapped by multiple processes and can only
      		be moved if MPOL_MF_MOVE_ALL is specified.
      
      -EPERM		The page has been mlocked by a process/driver and
      		cannot be moved.
      
      -EBUSY		Page is busy and cannot be moved. Try again later.
      
      -EFAULT		Invalid address (no VMA or zero page).
      
      -ENOMEM		Unable to allocate memory on target node.
      
      -EIO		Unable to write back page. The page must be written
      		back in order to move it since the page is dirty and the
      		filesystem does not provide a migration function that
      		would allow the moving of dirty pages.
      
      -EINVAL		A dirty page cannot be moved. The filesystem does not provide
      		a migration function and has no ability to write back pages.
      
      The flags parameter indicates what types of pages to move:
      
      MPOL_MF_MOVE	Move pages that are only mapped by the process.
      
      MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
      		Requires sufficient capabilities.
      
      Possible return codes from move_pages()
      
      -ENOENT		No pages found that would require moving. All pages
      		are either already on the target node, not present, had an
      		invalid address or could not be moved because they were
      		mapped by multiple processes.
      
      -EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
      		to migrate pages in a kernel thread.
      
      -EPERM		MPOL_MF_MOVE_ALL specified without sufficient privileges,
      		or an attempt to move a process belonging to another user.
      
      -EACCES		One of the target nodes is not allowed by the current cpuset.
      
      -ENODEV		One of the target nodes is not online.
      
      -ESRCH		Process does not exist.
      
      -E2BIG		Too many pages to move.
      
      -ENOMEM		Not enough memory to allocate control array.
      
      -EFAULT		Parameters could not be accessed.
      
      A test program for move_pages() may be found with the patches
      on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
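
      As a rough illustration of the interface described above (this is not
      the test program mentioned there), the sketch below queries the node of
      one page by passing NULL for nodes[] and decodes a status[] entry.  It
      assumes a Linux libc that defines SYS_move_pages and that pid 0 refers to
      the calling process; page_status_str() and query_page_node() are made-up
      helper names:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Translate one status[] entry into text, following the table above. */
const char *page_status_str(int status)
{
    if (status >= 0)
        return "on node";
    switch (status) {
    case -ENOENT: return "not present";
    case -EACCES: return "mapped by multiple processes";
    case -EPERM:  return "mlocked";
    case -EBUSY:  return "busy, try again later";
    case -EFAULT: return "invalid address";
    case -ENOMEM: return "no memory on target node";
    case -EIO:    return "unable to write back page";
    case -EINVAL: return "dirty page cannot be moved";
    default:      return "unknown";
    }
}

/*
 * Ask where one page lives without moving it (nodes == NULL).
 * Returns 0 on success, -1 with errno set on failure.
 */
long query_page_node(void *addr, int *status)
{
    assert(status != NULL);
#ifdef SYS_move_pages
    {
        void *pages[1];

        pages[0] = addr;
        return syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0);
    }
#else
    (void)addr;
    errno = ENOSYS;
    return -1;
#endif
}
```

      On kernels built without page migration the call is expected to fail,
      so callers should check the return value before reading the status.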
      
      From: Christoph Lameter <clameter@sgi.com>
      
        Detailed results for sys_move_pages()
      
        Pass a pointer to an integer to get_new_page() that may be used to
        indicate where the completion status of a migration operation should be
        placed.  This allows sys_move_pages() to report back exactly what happened
        to each page.
      
        Wish there would be a better way to do this. Looks a bit hacky.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      742755a1
    • C
      [PATCH] page migration: use allocator function for migrate_pages() · 95a402c3
      Christoph Lameter committed
      Instead of passing a list of new pages, pass a function to allocate a new
      page.  This allows the correct placement of MPOL_INTERLEAVE pages during page
      migration.  It also further simplifies the callers of migrate pages.
      migrate_pages() becomes similar to migrate_pages_to() so drop
      migrate_pages_to().  The batching of new page allocations becomes unnecessary.
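
      The callback style can be sketched with toy types.  Nothing below is
      the kernel's code: struct page is reduced to a node id, and the two
      allocator callbacks are hypothetical examples of placement policies:

```c
#include <stdlib.h>

struct page { int node; struct page *next; };   /* toy page: just a node id */

/* Allocator callback: given the old page and caller data, return a new page. */
typedef struct page *new_page_t(struct page *old, unsigned long private_data);

/*
 * Toy migrate_pages(): ask the callback for a destination page for every
 * page on the list.  Returns the number of pages that could not be moved.
 */
int migrate_pages(struct page *from, new_page_t *get_new_page,
                  unsigned long private_data)
{
    int failed = 0;
    struct page *p;

    for (p = from; p; p = p->next) {
        struct page *newpage = get_new_page(p, private_data);

        if (!newpage) {
            failed++;
            continue;
        }
        p->node = newpage->node;    /* stand-in for copying and remapping */
        free(newpage);
    }
    return failed;
}

/* Policy 1: everything goes to the node passed as private data. */
struct page *new_page_on_node(struct page *old, unsigned long node)
{
    struct page *p = malloc(sizeof(*p));

    (void)old;
    if (p) {
        p->node = (int)node;
        p->next = NULL;
    }
    return p;
}

/* Policy 2: alternate between nodes 0 and 1; private data is a counter. */
struct page *new_page_interleave(struct page *old, unsigned long private_data)
{
    unsigned int *counter = (unsigned int *)private_data;

    return new_page_on_node(old, (*counter)++ % 2);
}
```

      Because each destination is chosen at the moment a page is migrated,
      an interleaving policy can decide per page, which a pre-built list of
      new pages cannot do as easily.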
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      95a402c3
    • C
      [PATCH] page migration: handle freeing of pages in migrate_pages() · aaa994b3
      Christoph Lameter committed
      Do not leave pages on the lists passed to migrate_pages().  Seems that we will
      not need any postprocessing of pages.  This will simplify the handling of
      pages by the callers of migrate_pages().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      aaa994b3
    • P
      [PATCH] slab: kmalloc, kzalloc comments cleanup and fix · 800590f5
      Paul Drynoff committed
      - Move comments for kmalloc to the right place; currently they are near
        __do_kmalloc
      
      - Comments for kzalloc
      
      - More detailed comments for kmalloc
      
      - Appearance of "kmalloc" and "kzalloc" man pages after "make mandocs"
      
      [rdunlap@xenotime.net: simplification]
      Signed-off-by: Paul Drynoff <pauldrynoff@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      800590f5
    • A
      [PATCH] initialise total_memory() earlier · bd1e22b8
      Andrew Morton committed
      Initialise total_memory earlier in boot: if for some reason we run page
      reclaim early in boot, we don't want total_memory to be zero when we use
      it as a divisor.
      
      And rename total_memory to vm_total_pages to avoid naming clashes with
      architectures.
      
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Martin Bligh <mbligh@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bd1e22b8
    • D
      [PATCH] add page_mkwrite() vm_operations method · 9637a5ef
      David Howells committed
      Add a new VMA operation to notify a filesystem or other driver about the
      MMU generating a fault because userspace attempted to write to a page
      mapped through a read-only PTE.
      
      This facility permits the filesystem or driver to:
      
       (*) Implement storage allocation/reservation on attempted write, and so to
           deal with problems such as ENOSPC more gracefully (perhaps by generating
           SIGBUS).
      
       (*) Delay making the page writable until the contents have been written to a
           backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
           It permits the filesystem to have some guarantee about the state of the
           cache.
      
       (*) Account and limit number of dirty pages. This is one piece of the puzzle
           needed to make shared writable mapping work safely in FUSE.
      
      Needed by cachefs (Or is it cachefiles?  Or fscache? <head spins>).
      
      At least four other groups have stated an interest in it or a desire to use
      the functionality it provides: FUSE, OCFS2, NTFS and JFFS2.  Also, things like
      EXT3 really ought to use it to deal with the case of shared-writable mmap
      encountering ENOSPC before we permit the page to be dirtied.
      
      From: Peter Zijlstra <a.p.zijlstra@chello.nl>
      
        get_user_pages(.write=1, .force=1) can generate COW hits on read-only
        shared mappings; this patch traps those as page_mkwrite candidates and
        fails to handle them the old way.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Joel Becker <Joel.Becker@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9637a5ef
    • C
      [PATCH] mm: fix swap unused warning · bd96b9eb
      Con Kolivas committed
      If CONFIG_SWAP is not defined we get:
      
      mm/vmscan.c: In function ‘remove_mapping’:
      mm/vmscan.c:387: warning: unused variable ‘swap’
      
      Convert defines in swap.h into blank inline functions to fix this warning
      and be consistent.
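
      The difference between an empty define and an empty inline function can
      be shown with a small sketch.  It assumes swap_free() is one of the
      affected helpers; the types and remove_mapping_sketch() are toy stand-ins:

```c
typedef struct { unsigned long val; } swp_entry_t;   /* toy stand-in */

/*
 * Before: with swap disabled the call expands to nothing, so a local
 * variable that is only passed to it counts as unused:
 *
 *     #define swap_free(swp)   do { } while (0)
 */

/* After: an empty inline function still consumes its argument. */
static inline void swap_free(swp_entry_t swp)
{
    (void)swp;
}

/* Toy caller shaped like the code the warning pointed at. */
int remove_mapping_sketch(unsigned long private_data)
{
    swp_entry_t swap = { private_data };

    swap_free(swap);    /* "swap" is used, so no warning is emitted */
    return 1;
}
```

      The inline version also keeps type checking of the argument, which the
      empty macro silently drops.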
      Signed-off-by: Con Kolivas <kernel@kolivas.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bd96b9eb
    • A
      [PATCH] sparsemem: record nid during memory present · 30c253e6
      Andy Whitcroft committed
      Record the node id as we mark sections for instantiation.  Use this nid
      during instantiation to direct allocations.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Bob Picco <bob.picco@hp.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Martin Bligh <mbligh@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      30c253e6
    • C
      [PATCH] More page migration: use migration entries for file pages · 04e62a29
      Christoph Lameter committed
      This implements the use of migration entries to preserve ptes of file backed
      pages during migration.  Processes can therefore be migrated back and forth
      without losing their connection to pagecache pages.
      
      Note that we implement the migration entries only for linear mappings.
      Nonlinear mappings still require the unmapping of the ptes for migration.
      
      And another writepage() ugliness shows up.  writepage() can drop the page
      lock.  Therefore we have to remove migration ptes before calling writepages()
      in order to avoid having migration entries point to unlocked pages.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      04e62a29
    • C
      [PATCH] Swapless page migration: rip out swap based logic · d75a0fcd
      Christoph Lameter committed
      Rip the page migration logic out.
      
      Remove all code that has to do with swapping during page migration.
      
      This also guts the ability to migrate pages to swap.  No one used that, so
      let's let it go for good.
      
      Page migration should be a bit broken after this patch.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d75a0fcd
    • C
      [PATCH] Swapless page migration: add R/W migration entries · 0697212a
      Christoph Lameter committed
      Implement read/write migration ptes
      
      We take the upper two swapfiles for the two types of migration ptes and define
      a series of macros in swapops.h.
      
      The VM is modified to handle the migration entries.  migration entries can
      only be encountered when the page they are pointing to is locked.  This limits
      the number of places one has to fix.  We also check in copy_pte_range and in
      mprotect_pte_range() for migration ptes.
      
      We check for migration ptes in do_swap_cache and call a function that will
      then wait on the page lock.  This allows us to effectively stop all accesses
      to a page.
      
      Migration entries are created by try_to_unmap if called for migration and
      removed by local functions in migrate.c
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Several times while testing swapless page migration (I've no NUMA, just
        hacking it up to migrate recklessly while running load), I've hit the
        BUG_ON(!PageLocked(p)) in migration_entry_to_page.
      
        This comes from an orphaned migration entry, unrelated to the current
        correctly locked migration, but hit by remove_anon_migration_ptes as it
        checks an address in each vma of the anon_vma list.
      
        Such an orphan may be left behind if an earlier migration raced with fork:
        copy_one_pte can duplicate a migration entry from parent to child, after
        remove_anon_migration_ptes has checked the child vma, but before it has
        removed it from the parent vma.  (If the process were later to fault on this
        orphaned entry, it would hit the same BUG from migration_entry_wait.)
      
        This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
        not.  There's no such problem with file pages, because vma_prio_tree_add
        adds child vma after parent vma, and the page table locking at each end is
        enough to serialize.  Follow that example with anon_vma: add new vmas to the
        tail instead of the head.
      
        (There's no corresponding problem when inserting migration entries,
        because a missed pte will leave the page count and mapcount high, which is
        allowed for.  And there's no corresponding problem when migrating via swap,
        because a leftover swap entry will be correctly faulted.  But the swapless
        method has no refcounting of its entries.)
      
      From: Ingo Molnar <mingo@elte.hu>
      
        pte_unmap_unlock() takes the pte pointer as an argument.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Several times while testing swapless page migration, gcc has tried to exec
        a pointer instead of a string: smells like COW mappings are not being
        properly write-protected on fork.
      
        The protection in copy_one_pte looks very convincing, until at last you
        realize that the second arg to make_migration_entry is a boolean "write",
        and SWP_MIGRATION_READ is 30.
      
        Anyway, it's better done like in change_pte_range, using
        is_write_migration_entry and make_migration_entry_read.
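
        The mix-up can be reproduced in a toy model.  The value 30 for
        SWP_MIGRATION_READ comes from the text above; 31 for
        SWP_MIGRATION_WRITE and the 27-bit offset layout are assumptions made
        for this sketch, and the helpers are simplified stand-ins for the
        swapops.h macros:

```c
#define SWP_MIGRATION_READ   30UL   /* value quoted in the text above */
#define SWP_MIGRATION_WRITE  31UL   /* assumed: the other "upper" type */
#define SWP_OFFSET_BITS      27     /* assumed layout for illustration */
#define SWP_OFFSET_MASK      ((1UL << SWP_OFFSET_BITS) - 1)

typedef struct { unsigned long val; } swp_entry_t;

swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
    swp_entry_t e;

    e.val = (type << SWP_OFFSET_BITS) | (offset & SWP_OFFSET_MASK);
    return e;
}

unsigned long swp_type(swp_entry_t e)   { return e.val >> SWP_OFFSET_BITS; }
unsigned long swp_offset(swp_entry_t e) { return e.val & SWP_OFFSET_MASK; }

/* The second argument is a boolean "write", not a swap type. */
swp_entry_t make_migration_entry(unsigned long pfn, int write)
{
    return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ, pfn);
}

int is_write_migration_entry(swp_entry_t e)
{
    return swp_type(e) == SWP_MIGRATION_WRITE;
}

/* Downgrade an entry to read-only in place, keeping its offset. */
void make_migration_entry_read(swp_entry_t *e)
{
    *e = swp_entry(SWP_MIGRATION_READ, swp_offset(*e));
}
```

        Passing SWP_MIGRATION_READ as the second argument yields a write entry
        because 30 is non-zero, which is the opposite of what the caller meant.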
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Remove unnecessary obfuscation from sys_swapon's range check on swap type,
        which blew up causing memory corruption once swapless migration made
        MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      From: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0697212a
    • C
      [PATCH] page migration cleanup: pass "mapping" to migration functions · 2d1db3b1
      Christoph Lameter committed
      Change handling of address spaces.
      
      Pass a pointer to the address space in which the page is migrated to all
      migration functions.  This avoids repeatedly having to retrieve the address
      space pointer from the page and checking it for validity.  The old page
      mapping will change once migration has gone to a certain step, so it is less
      confusing to have the pointer always available.
      
      Move the setting of the mapping and index for the new page into
      migrate_pages().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2d1db3b1
    • C
      [PATCH] page migration cleanup: remove useless definitions · e7340f73
      Christoph Lameter committed
      Remove the exports for migrate_page_remove_references() and
      migrate_page_copy(), which are unlikely to be used directly by filesystems
      implementing migration.  The exports were useful when buffer_migrate_page()
      lived in fs/buffer.c, but it has now been moved to migrate.c in the
      migration reorg.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e7340f73
    • O
      [PATCH] writeback: fix range handling · 111ebb6e
      OGAWA Hirofumi committed
      When a writeback_control's `start' and `end' fields are used to
      indicate a one-byte-range starting at file offset zero, the required
      values of .start=0,.end=0 mean that the ->writepages() implementation
      has no way of telling that it is being asked to perform a range
      request, because we're currently overloading (start == 0 && end == 0)
      to mean "this is not a write-a-range request".
      
      To make all this sane, the patch changes the range fields of
      writeback_control.
      
      A caller that invokes ->writepages() to write pages now always sets the
      range (range_start/end or range_cyclic).
      
      If range_cyclic is true, ->writepages() treats the range as cyclic;
      otherwise it just uses range_start and range_end.
      
      This patch:
      
          - Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
            -1 is usually ok for range_end (type is long long). But, if someone did,
      
      		range_end += val;		range_end is "val - 1"
      		u64val = range_end >> bits;	u64val is "~(0ULL)"
      
            or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
            things, and uses LLONG_MAX for range_end.
      
          - All callers of ->writepages() set range_start/end or range_cyclic.
      
          - Fix updates of ->writeback_index. It already seems a bit strange.
            If scanning starts at 0 and is ended by the nr_to_write check, this last
            index may reduce chance to scan end of file.  So, this updates
            ->writeback_index only if range_cyclic is true or whole-file is
            scanned.
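
      How an implementation might interpret the new fields can be sketched as
      follows.  The struct is a cut-down stand-in for writeback_control, the
      4K page shift is an assumption, and wbc_page_range() is a made-up helper:

```c
#include <limits.h>

/* Cut-down stand-in: only the range fields described above. */
struct wbc_sketch {
    long long range_start;      /* byte offset, inclusive */
    long long range_end;        /* byte offset, inclusive */
    int range_cyclic;           /* non-zero: continue from writeback_index */
};

#define SKETCH_PAGE_SHIFT 12    /* assumed 4K pages */

/*
 * Work out the first and last page index to scan.
 * Returns 1 when the request covers the whole file, otherwise 0.
 */
int wbc_page_range(const struct wbc_sketch *wbc,
                   unsigned long long writeback_index,
                   unsigned long long *first, unsigned long long *last)
{
    if (wbc->range_cyclic) {
        *first = writeback_index;       /* resume where the last pass stopped */
        *last = ULLONG_MAX;
        return 0;
    }
    *first = (unsigned long long)wbc->range_start >> SKETCH_PAGE_SHIFT;
    *last = (unsigned long long)wbc->range_end >> SKETCH_PAGE_SHIFT;
    return wbc->range_start == 0 && wbc->range_end == LLONG_MAX;
}
```

      With these fields a request for the single byte at offset zero
      (range_start = 0, range_end = 0) is no longer mistaken for "no range".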
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: "Vladimir V. Saveliev" <vs@namesys.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      111ebb6e
    • N
      [PATCH] radix-tree: direct data · 612d6c19
      Nick Piggin committed
      The ability to have height 0 radix trees (a direct pointer to the data item
      rather than going through a full node->slot) quietly disappeared with
      old-2.6-bkcvs commit ffee171812d51652f9ba284302d9e5c5cc14bdfd.  On 64-bit
      machines this causes nearly 600 bytes to be used for every <= 4K file in
      pagecache.
      
      Re-introduce this feature, with root tags stored in spare ->gfp_mask bits.
      
      Simplify radix_tree_delete's complex tag clearing arrangement (which would
      become even more complex) by just falling back to tag clearing functions
      (the pagecache radix-tree never uses this path anyway, so the icache
      savings will mean it's actually a speedup).
      
      On my 4GB G5, this saves 8MB RAM per kernel source+object tree in
      pagecache.
      
      Pagecache lookup, insertion, and removal speed for small files will also be
      improved.
      
      This makes RCU radix tree harder, but it's worth it.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      612d6c19
    • D
      [PATCH] change gen_pool allocator to not touch managed memory · 929f9727
      Dean Nelson committed
      Modify the gen_pool allocator (lib/genalloc.c) to utilize a bitmap scheme
      instead of the buddy scheme.  The purpose of this change is to eliminate
      the touching of the actual memory being allocated.
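
A minimal userspace sketch of the bitmap idea follows. It is not the lib/genalloc.c implementation; `struct pool_sketch` and the `pool_*` helpers are hypothetical names, and the fixed 64-granule pool is an assumption made to keep the example short. The point it illustrates is that all bookkeeping lives in the bitmap, so the managed addresses are never dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS 64  /* hypothetical: the pool holds 64 granules */

struct pool_sketch {
    uintptr_t base;   /* start address of the managed region */
    unsigned  order;  /* log2 of the allocation granule in bytes */
    uint64_t  bits;   /* one bit per granule, stored outside the region */
};

static void pool_init(struct pool_sketch *p, uintptr_t base, unsigned order)
{
    assert(base != 0);  /* 0 is reserved as the failure return value */
    p->base = base;
    p->order = order;
    p->bits = 0;
}

/* Number of granules needed to cover size bytes. */
static size_t granules(const struct pool_sketch *p, size_t size)
{
    return (size + ((size_t)1 << p->order) - 1) >> p->order;
}

static uint64_t run_mask(size_t n)
{
    return n >= 64 ? ~0ULL : (1ULL << n) - 1;
}

/* First-fit search over the bitmap; returns 0 when no run is free. */
static uintptr_t pool_alloc(struct pool_sketch *p, size_t size)
{
    size_t n = granules(p, size);

    if (n == 0 || n > SKETCH_BITS)
        return 0;
    for (size_t start = 0; start + n <= SKETCH_BITS; start++) {
        uint64_t mask = run_mask(n) << start;

        if ((p->bits & mask) == 0) {
            p->bits |= mask;  /* only the bitmap is written */
            return p->base + ((uintptr_t)start << p->order);
        }
    }
    return 0;
}

static void pool_free(struct pool_sketch *p, uintptr_t addr, size_t size)
{
    size_t start = (size_t)((addr - p->base) >> p->order);

    assert(addr >= p->base && start < SKETCH_BITS);
    p->bits &= ~(run_mask(granules(p, size)) << start);
}
```

Because the caller passes the size back to `pool_free()`, no header needs to be stored inside the managed memory, which is the property uncached or device memory needs.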
      
      Since the change modifies the interface, a change to the uncached allocator
      (arch/ia64/kernel/uncached.c) is also required.
      
      Both Andrey Volkov and Jes Sorensen have expressed a desire that the
      gen_pool allocator not write to the memory being managed. See the
      following:
      
        http://marc.theaimsgroup.com/?l=linux-kernel&m=113518602713125&w=2
        http://marc.theaimsgroup.com/?l=linux-kernel&m=113533568827916&w=2
      Signed-off-by: Dean Nelson <dcn@sgi.com>
      Cc: Andrey Volkov <avolkov@varma-el.com>
      Acked-by: Jes Sorensen <jes@trained-monkey.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      929f9727
    • N
      [PATCH] mm: introduce remap_vmalloc_range() · 83342314
      Nick Piggin committed
      Add remap_vmalloc_range, vmalloc_user, and vmalloc_32_user so that drivers
      can have a nice interface for remapping vmalloc memory.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      83342314
    • Y
      [PATCH] Unify pxm_to_node() and node_to_pxm() · 762834e8
      Yasunori Goto committed
      Consolidate the various arch-specific implementations of pxm_to_node() and
      node_to_pxm() into a single generic version.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      762834e8
    • C
      [PATCH] tightening hugetlb strict accounting · a43a8c39
      Chen, Kenneth W committed
      Current hugetlb strict accounting for shared mappings always assumes the
      mapping starts at file offset zero and reserves pages between zero and the
      size of the file.  This assumption often reserves (or locks down) many more
      pages than necessary if the application maps at a non-zero file offset.
      libhugetlbfs is one example that requires proper reservation for a shared
      mapping that starts at a non-zero offset.
      
      This patch extends the reservation and hugetlb strict accounting to support
      any arbitrary pair of (offset, len), resulting in a much more robust and
      accurate scheme.  More importantly, it won't lock down any hugetlb pages
      outside the file mapping.
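
The difference between the two accounting schemes can be shown with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, a 2 MiB huge page size is assumed, and the file size is assumed to equal offset + len.

```c
#include <assert.h>

#define HPAGE_SHIFT_SKETCH 21  /* assumption: 2 MiB huge pages */
#define HPAGE_SIZE_SKETCH  (1ULL << HPAGE_SHIFT_SKETCH)

/* Round a byte count up to whole huge pages. */
static unsigned long long hpages_up(unsigned long long bytes)
{
    return (bytes + HPAGE_SIZE_SKETCH - 1) >> HPAGE_SHIFT_SKETCH;
}

/* Old behaviour as described: reserve everything from offset zero. */
static unsigned long long reserve_from_zero(unsigned long long offset,
                                            unsigned long long len)
{
    return hpages_up(offset + len);
}

/* New behaviour as described: reserve only the (offset, len) range. */
static unsigned long long reserve_range(unsigned long long offset,
                                        unsigned long long len)
{
    /* The sketch assumes a huge-page-aligned mapping offset. */
    assert(offset % HPAGE_SIZE_SKETCH == 0);
    return hpages_up(offset + len) - (offset >> HPAGE_SHIFT_SKETCH);
}
```

For a mapping of two huge pages at an offset of eight huge pages, the first function reserves ten pages and the second reserves two.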
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a43a8c39
    • A
      [PATCH] reserve space for swap label · e8f03d02
      Andreas Dilger committed
      Reserve space in the swap disk header for a LABEL and UUID to be specified.
      This has been possible with util-linux-2.12b (via e2fsprogs 1.36
      libblkid), and is used by at least FC3 and later.  The kernel doesn't
      really care about this, but the space shouldn't accidentally be used by
      something else either.
      
      Also make the on-disk structures be fixed-size types, instead of "int",
      though I don't know of any architecture in use where an "int" isn't the
      same size as a "__u32" (all current kernel arches have it as "unsigned
      int").
      Signed-off-by: Andreas Dilger <adilger@shaw.ca>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e8f03d02
    • K
      [PATCH] support for panic at OOM · fadd8fbd
      KAMEZAWA Hiroyuki committed
      This patch adds a panic_on_oom sysctl under sys.vm.
      
      When sysctl vm.panic_on_oom = 1, the kernel panics instead of killing rogue
      processes.  If vm.panic_on_oom is 0, the kernel does oom_kill() in
      the same way as it does today.  The default value is 0 and only
      root can modify it.
      
      In general, the oom_killer works well and kills rogue processes, so the whole
      system can survive.  But there are environments where a panic is preferable
      to killing some processes.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fadd8fbd
    • Y
      [PATCH] wait_table and zonelist initializing for memory hotadd: add return... · 718127cc
      Yasunori Goto committed
      [PATCH] wait_table and zonelist initializing for memory hotadd: add return code for init_current_empty_zone
      
      When add_zone() is called on an empty (unpopulated) zone, we have to
      initialize the zone, which was not initialized at boot time.  But
      init_currently_empty_zone() may fail when allocating the wait table.  So
      this patch checks its return code.
      
      Changes to wait_table are in the next patch.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      718127cc
    • Y
      [PATCH] wait_table and zonelist initializing for memory hotadd: change to... · 86356ab1
      Yasunori Goto committed
      [PATCH] wait_table and zonelist initializing for memory hotadd: change to meminit for build_zonelist
      
      Change definitions of some functions and data from __init to __meminit.
      
      With this patch, these functions and data remain usable after bootup, so
      the hot-add code can use them.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      86356ab1
    • Y
      [PATCH] wait_table and zonelist initializing for memory hotadd: change name of wait_table_size() · 02b694de
      Yasunori Goto committed
      This just renames wait_table_size() to wait_table_hash_nr_entries().
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      02b694de
    • A
      [PATCH] PG_uncached is ia64 only · f886ed44
      Andrew Morton committed
      As Nick points out, only ia64 uses PG_uncached.  So we can push it up into the
      higher bits of the lower half of page->flags and make room for another flag on
      32-bit machines.
      
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Jesse Barnes <jbarnes@sgi.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f886ed44
    • A
      [PATCH] zone handle unaligned zone boundaries · cb2b95e1
      Andy Whitcroft committed
      The buddy allocator requires that boundaries between contiguous
      zones be aligned with the MAX_ORDER ranges.  Where they are not, we
      will incorrectly merge pages across zone boundaries.  This can lead to pages
      from the wrong zone being handed out.
      
      Originally the buddy allocator would check that buddies were in the same
      zone by referencing the zone start and end page frame numbers.  This was
      removed as it became very expensive and the buddy allocator already made
      the assumption that zone boundaries were aligned.
      
      It is clear that not all configurations and architectures are honouring
      this alignment requirement.  Therefore it seems safest to reintroduce
      support for non-aligned zone boundaries.  This patch introduces a new check
      when considering whether a page is a buddy: it compares the zone_table index
      for the two pages and refuses to merge the pages where they do not match.
      The zone_table index is unique for each node/zone combination when
      FLATMEM/DISCONTIGMEM is enabled and for each section/zone combination when
      SPARSEMEM is enabled (a SPARSEMEM section is at least a MAX_ORDER size).
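
The shape of the new check can be sketched in userspace C. This is a simplified illustration, not the kernel's code: `struct page_sketch`, its fields, and `page_is_buddy_sketch()` are hypothetical, and `MAX_ORDER_SKETCH` is an assumed value.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER_SKETCH 11  /* assumption, used only for a sanity check */

/* Hypothetical, simplified page descriptor. */
struct page_sketch {
    unsigned zone_index;  /* stands in for the page's zone_table index */
    bool     is_free;     /* page heads a free block in the allocator */
    unsigned order;       /* order of that free block */
};

/*
 * A candidate is only a buddy if it is a free block of the same order
 * and, per the new check, carries the same zone_table index as the page
 * being merged.
 */
static bool page_is_buddy_sketch(const struct page_sketch *page,
                                 const struct page_sketch *buddy,
                                 unsigned order)
{
    assert(order < MAX_ORDER_SKETCH);

    if (page->zone_index != buddy->zone_index)
        return false;  /* refuse to merge across a zone boundary */

    return buddy->is_free && buddy->order == order;
}
```

Comparing a small index that is already stored with each page is cheap, in contrast to the earlier comparison against the zone's start and end page frame numbers that the message describes as expensive.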
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cb2b95e1
    • D
      [PATCH] VFS: Permit filesystem to perform statfs with a known root dentry · 726c3342
      David Howells committed
      Give the statfs superblock operation a dentry pointer rather than a superblock
      pointer.
      
      This complements the get_sb() patch.  That reduced the significance of
      sb->s_root, allowing NFS to place a fake root there.  However, NFS does
      require a dentry to use as a target for the statfs operation.  This permits
      the root in the vfsmount to be used instead.
      
      linux/mount.h has been added where necessary to make allyesconfig build
      successfully.
      
      Interest has also been expressed for use with the FUSE and XFS filesystems.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      726c3342
    • D
      [PATCH] VFS: Permit filesystem to override root dentry on mount · 454e2398
      David Howells committed
      Extend the get_sb() filesystem operation to take an extra argument that
      permits the VFS to pass in the target vfsmount that defines the mountpoint.
      
      The filesystem is then required to manually set the superblock and root dentry
      pointers.  For most filesystems, this should be done with simple_set_mnt()
      which will set the superblock pointer and then set the root dentry to the
      superblock's s_root (as per the old default behaviour).
      
      The get_sb() op now returns an integer as there's now no need to return the
      superblock pointer.
      
      This patch permits a superblock to be implicitly shared amongst several mount
      points, such as can be done with NFS to avoid potential inode aliasing.  In
      such a case, simple_set_mnt() would not be called, and instead the mnt_root
      and mnt_sb would be set directly.
      
      The patch also makes the following changes:
      
       (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
           pointer argument and return an integer, so most filesystems have to change
           very little.
      
       (*) If one of the convenience functions is not used, then get_sb() should
           normally call simple_set_mnt() to instantiate the vfsmount. This will
           always return 0, and so can be tail-called from get_sb().
      
       (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
           dcache upon superblock destruction rather than shrink_dcache_anon().
      
           This is required because the superblock may now have multiple trees that
           aren't actually bound to s_root, but that still need to be cleaned up. The
           currently called functions assume that the whole tree is rooted at s_root,
           and that anonymous dentries are not the roots of trees, which results in
           dentries being left unculled.
      
           However, with the way NFS superblock sharing is currently set to be
           implemented, these assumptions are violated: the root of the filesystem is
           simply a dummy dentry and inode (the real inode for '/' may well be
           inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
           with child trees.
      
           [*] Anonymous until discovered from another tree.
      
       (*) The documentation has been adjusted, including the additional bit of
           changing ext2_* into foo_* in the documentation.
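
The calling convention described above can be sketched with userspace stand-ins. Everything suffixed `_sketch` is hypothetical and simplified; the reference count taken on the root dentry is an assumption of this sketch, since the message only says that the superblock pointer and root dentry are set and that 0 is returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct dentry_sketch      { int refcount; };
struct super_block_sketch { struct dentry_sketch *s_root; };
struct vfsmount_sketch {
    struct super_block_sketch *mnt_sb;
    struct dentry_sketch      *mnt_root;
};

/* Assumption: the mount takes its own reference on the root dentry. */
static struct dentry_sketch *dget_sketch(struct dentry_sketch *d)
{
    if (d)
        d->refcount++;
    return d;
}

/* Default case: root the mount at the superblock's s_root; returns 0. */
static int simple_set_mnt_sketch(struct vfsmount_sketch *mnt,
                                 struct super_block_sketch *sb)
{
    mnt->mnt_sb = sb;
    mnt->mnt_root = dget_sketch(sb->s_root);
    return 0;
}

/* Shared-superblock case: the filesystem picks a different root itself. */
static int set_mnt_shared_sketch(struct vfsmount_sketch *mnt,
                                 struct super_block_sketch *sb,
                                 struct dentry_sketch *root)
{
    mnt->mnt_sb = sb;
    mnt->mnt_root = dget_sketch(root);
    return 0;
}

/* A get_sb()-style operation now returns int and fills in the vfsmount. */
static int foo_get_sb_sketch(struct super_block_sketch *sb,
                             struct vfsmount_sketch *mnt)
{
    /* Filesystem-specific superblock setup would happen before this. */
    return simple_set_mnt_sketch(mnt, sb);  /* tail call */
}
```

Returning an integer works because the superblock now reaches the caller through the vfsmount, and the shared-superblock helper shows how two mounts could point at one superblock with different roots.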
      
      [akpm@osdl.org: convert ipath_fs, do other stuff]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      454e2398