1. 24 Mar, 2006 16 commits
    • A
      [PATCH] msync(): use do_fsync() · 8f2e9f15
      Andrew Morton authored
      No need to duplicate all that code.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • A
      [PATCH] msync: fix return value · 676758bd
      Andrew Morton authored
      msync() does a strange thing.  Essentially:
      
      	vma = find_vma();
      	for ( ; ; ) {
      		if (!vma)
      			return -ENOMEM;
      		...
      		vma = vma->vm_next;
      	}
      
      so an msync() request which starts within or before a valid VMA and which ends
      within or beyond the final VMA will incorrectly return -ENOMEM.
      
      Fix.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
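The loop pattern quoted in the msync return-value commit (676758bd) can be modelled in user space. The sketch below is not the kernel patch, which is not shown on this page: `struct vma`, `find_vma_model()`, `walk_old()` and `walk_new()` are hypothetical names, and the handling of holes between VMAs is deliberately left out. It only illustrates why "ran off the end of the list" should not by itself mean -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's VMA list. */
struct vma {
    unsigned long vm_start;   /* inclusive */
    unsigned long vm_end;     /* exclusive */
    struct vma *vm_next;
};

/* First VMA whose end lies above addr (find_vma()-like); NULL if none. */
static struct vma *find_vma_model(struct vma *head, unsigned long addr)
{
    while (head && head->vm_end <= addr)
        head = head->vm_next;
    return head;
}

/* Pattern from the commit message: an exhausted list is always an error. */
static int walk_old(struct vma *head, unsigned long start, unsigned long end)
{
    struct vma *vma = find_vma_model(head, start);

    assert(start <= end);
    for (;;) {
        if (!vma)
            return -ENOMEM;
        if (end <= vma->vm_end)
            return 0;         /* request ends inside this VMA */
        vma = vma->vm_next;
    }
}

/* One possible repair: fail only if no VMA exists at or after start. */
static int walk_new(struct vma *head, unsigned long start, unsigned long end)
{
    struct vma *vma = find_vma_model(head, start);

    assert(start <= end);
    if (!vma)
        return -ENOMEM;
    do {
        if (end <= vma->vm_end)
            break;            /* request ends inside this VMA */
        vma = vma->vm_next;
    } while (vma);
    return 0;
}
```

With two VMAs and a request that starts in the first and ends beyond the last, `walk_old()` reports -ENOMEM while `walk_new()` reports success; a request that starts past every VMA still fails in both.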
    • A
      [PATCH] msync(MS_SYNC): don't hold mmap_sem while syncing · 707c21c8
      Andrew Morton authored
      It seems bad to hold mmap_sem while performing synchronous disk I/O.  Alter
      the msync(MS_SYNC) code so that the lock is released while we sync the file.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • A
      [PATCH] msync(): perform dirty page levelling · 9c50823e
      Andrew Morton authored
      It seems sensible to perform dirty page throttling in msync: as the application
      dirties pages we can kick off pdflush early, or even force the msync() caller
      to perform writeout, or even throttle the msync() caller.
      
      The main effect of this is to start disk writeback earlier if we've just
      discovered that a large amount of pagecache has been dirtied.  (Otherwise it
      wouldn't happen for up to five seconds, next time pdflush wakes up).
      
      It will also cause the page-dirtying process to get penalised for dirtying
      those pages rather than whacking someone else with the problem.
      
      We should do this for munmap() and possibly even exit(), too.
      
      We drop the mmap_sem while performing the dirty page balancing.  It doesn't
      seem right to hold mmap_sem for that long.
      
      Note that this patch only affects MS_ASYNC.  MS_SYNC will be syncing all the
      dirty pages anyway.
      
      We note that msync(MS_SYNC) does a full-file-sync inside mmap_sem, and always
      has.  We can fix that up...
      
      The patch also tightens up the mmap_sem coverage in sys_msync(): no point in
      taking it while we perform the incoming arg checking.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • A
      [PATCH] set_page_dirty() return value fixes · 4741c9fd
      Andrew Morton authored
      We need set_page_dirty() to return true if it actually transitioned the page
      from a clean to dirty state.  This wasn't right in a couple of places.  Do a
      kernel-wide audit, fix things up.
      
      This leaves open the possibility of returning a negative errno from
      set_page_dirty() sometime in the future.  But we don't do that at present.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • A
      [PATCH] balance_dirty_pages_ratelimited: take nr_pages arg · fa5a734e
      Andrew Morton authored
      Modify balance_dirty_pages_ratelimited() so that it can take a
      number-of-pages-which-I-just-dirtied argument.  For msync().
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
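The idea behind fa5a734e, a rate-limited balancing call that accepts a count of freshly dirtied pages, might look roughly like this in a user-space model. Everything here is an assumption made for illustration: the threshold `RATELIMIT_PAGES`, the counters, and the function names `ratelimited_nr()` and `ratelimited()` are hypothetical and do not reproduce the kernel code.

```c
#include <assert.h>

#define RATELIMIT_PAGES 32          /* hypothetical threshold */

static unsigned long dirtied_since_balance;
static unsigned long balance_calls;

/* Stand-in for the expensive dirty-page balancing work. */
static void balance_dirty_pages_model(void)
{
    balance_calls++;
}

/* Account nr_pages newly dirtied pages; balance once the threshold is hit. */
static void ratelimited_nr(unsigned long nr_pages)
{
    assert(nr_pages > 0);
    dirtied_since_balance += nr_pages;
    if (dirtied_since_balance >= RATELIMIT_PAGES) {
        dirtied_since_balance = 0;
        balance_dirty_pages_model();
    }
}

/* The single-page entry point becomes a thin wrapper. */
static void ratelimited(void)
{
    ratelimited_nr(1);
}
```

A caller such as msync() that has just found many dirty pages can then report them in one call instead of invoking the single-page variant in a loop.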
    • A
      [PATCH] fadvise(): write commands · ebcf28e1
      Andrew Morton authored
      Add two new Linux-specific fadvise() extensions:
      
      LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
      offsets `offset' and `offset+len'.  Any pages which are currently under
      writeout are skipped, whether or not they are dirty.
      
      LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
      offsets `offset' and `offset+len'.
      
      By combining these two operations the application may do several things:
      
      LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
      
      LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
      pages at the disk.
      
      LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
      of the currently dirty pages at the disk, wait until they have been written.
      
      It should be noted that none of these operations write out the file's
      metadata.  So unless the application is strictly performing overwrites of
      already-instantiated disk blocks, there are no guarantees here that the data
      will be available after a crash.
      
      To complete this suite of operations I guess we should have a "sync file
      metadata only" operation.  This gives applications access to all the building
      blocks needed for all sorts of sync operations.  But sync-metadata doesn't fit
      well with the fadvise() interface.  Probably it should be a new syscall:
      sys_fmetadatasync().
      
      The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
      It is made to represent the last affected byte in the file (i.e. it is
      inclusive).  Generally, all these byterange and pagerange functions are
      inclusive so we can easily represent EOF with -1.
      
      As Ulrich notes, these two functions are somewhat abusive of the fadvise()
      concept, which appears to be "set the future policy for this fd".
      
      But these commands are a perfect fit with the fadvise() implementation, and
      several of the existing fadvise() commands are synchronous and don't affect
      future policy either.   I think we can live with the slight incongruity.
      
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
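The inclusive `endbyte' convention described in ebcf28e1 can be sketched as a small helper. This is an illustration only: `inclusive_endbyte()` is a hypothetical name, and the treatment of `len == 0` as "through end of file" follows the usual posix_fadvise() convention rather than anything quoted from the patch.

```c
#include <assert.h>
#include <limits.h>

/* Inclusive last byte of (offset, len); -1 stands for "through EOF". */
static long long inclusive_endbyte(long long offset, long long len)
{
    assert(offset >= 0 && len >= 0);
    if (len == 0)
        return -1;                      /* whole rest of the file */
    if (len > LLONG_MAX - offset)
        return -1;                      /* range would overflow: clamp to EOF */
    return offset + len - 1;            /* last byte actually covered */
}
```

For example, a 4096-byte range starting at offset 0 ends at byte 4095, not 4096, which is what lets -1 serve as an unambiguous EOF marker.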
    • A
      [PATCH] filemap_fdatawrite_range() api: clarify -end parameter · 469eb4d0
      Andrew Morton authored
      I had trouble working out whether filemap_fdatawrite_range()'s
      `end' parameter describes the last-byte-to-be-written or the last-plus-one.
      Clarify that in comments.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] cpuset: memory_spread_slab drop useless PF_SPREAD_PAGE check · b2455396
      Paul Jackson authored
      The hook in the slab cache allocation path to handle cpuset memory
      spreading for tasks in cpusets with 'memory_spread_slab' enabled has a
      modest performance bug.  The hook calls into the memory spreading handler
      alternate_node_alloc() if either of 'memory_spread_slab' or
      'memory_spread_page' is enabled, even though the handler does nothing
      (albeit harmlessly) for the page case.
      
      Fix - drop PF_SPREAD_PAGE from the set of flag bits that are used to
      trigger a call to alternate_node_alloc().
      
      The page case is handled by separate hooks -- see the calls conditioned on
      cpuset_do_page_mem_spread() in mm/filemap.c.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] cpuset memory spread slab cache optimizations · c61afb18
      Paul Jackson authored
      The hooks in the slab cache allocator code path for support of NUMA
      mempolicies and cpuset memory spreading are in an important code path.  Many
      systems will use neither feature.
      
      This patch optimizes those hooks down to a single check of some bits in the
      current task's task_struct flags.  For non-NUMA systems, this hook and related
      code is already ifdef'd out.
      
      The optimization is done by using another task flag, set if the task is using
      a non-default NUMA mempolicy.  Taking this flag bit along with the
      PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
      memory spreading' patch set, one can check for the combination of any of these
      special case memory placement mechanisms with a single test of the current
      task's task_struct flags.
      
      This patch also tightens up the code, to save a few bytes of kernel text
      space, and moves some of it out of line.  Due to the nested inlines called
      from multiple places, we were ending up with three copies of this code, which
      once we get off the main code path (for local node allocation) seems a bit
      wasteful of instruction memory.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
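The "single test of the task flags" described in c61afb18 amounts to OR-ing the relevant bits into one mask. The sketch below is a user-space illustration under stated assumptions: the bit values are made up, `PF_NONDEFAULT_MEMPOLICY` is a hypothetical name for the "another task flag" the message mentions, and `struct task_model` stands in for task_struct.

```c
#include <assert.h>

/* Bit values are invented for this sketch; only the names PF_SPREAD_PAGE
 * and PF_SPREAD_SLAB appear in the commit text. */
#define PF_SPREAD_PAGE          0x1u
#define PF_SPREAD_SLAB          0x2u
#define PF_NONDEFAULT_MEMPOLICY 0x4u   /* hypothetical name */

/* Mask as described in c61afb18: any special placement mechanism. */
#define MASK_ALL_PLACEMENT \
    (PF_SPREAD_PAGE | PF_SPREAD_SLAB | PF_NONDEFAULT_MEMPOLICY)

/* Narrower mask in the spirit of b2455396: the page bit is not needed
 * on the slab path. */
#define MASK_SLAB_PLACEMENT \
    (PF_SPREAD_SLAB | PF_NONDEFAULT_MEMPOLICY)

struct task_model {
    unsigned int flags;
};

/* One AND and one compare decide whether to leave the fast path. */
static int needs_slow_path(const struct task_model *t, unsigned int mask)
{
    assert(t);
    return (t->flags & mask) != 0;
}
```

A task that only has PF_SPREAD_PAGE set takes the slow path under the wider mask but stays on the fast path under the narrower one, which is the "modest performance bug" the later commit removes.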
    • P
      [PATCH] cpuset memory spread slab cache implementation · 101a5001
      Paul Jackson authored
      Provide the slab cache infrastructure to support cpuset memory spreading.
      
      See the previous patches, cpuset_mem_spread, for an explanation of cpuset
      memory spreading.
      
      This patch provides a slab cache SLAB_MEM_SPREAD flag.  If set in the
      kmem_cache_create() call defining a slab cache, then any task marked with the
      process state flag PF_MEMSPREAD will spread memory page allocations for that
      cache over all the allowed nodes, instead of preferring the local (faulting)
      node.
      
      On systems not configured with CONFIG_NUMA, this results in no change to the
      page allocation code path for slab caches.
      
      On systems with cpusets configured in the kernel, but the "memory_spread"
      cpuset option not enabled for the current task's cpuset, this adds a call to a
      cpuset routine and a failed bit test of the process state flag PF_SPREAD_SLAB.
      
      For tasks so marked, a second inline test is done for the slab cache flag
      SLAB_MEM_SPREAD, and if that is set and if the allocation is not
      in_interrupt(), this adds a call to a cpuset routine that computes which of
      the task's mems_allowed nodes should be preferred for this allocation.
      
      ==> This patch adds another hook into the performance critical
          code path to allocating objects from the slab cache, in the
          ____cache_alloc() chunk, below.  The next patch optimizes this
          hook, reducing the impact of the combined mempolicy plus memory
          spreading hooks on this critical code path to a single check
          against the task's task_struct flags word.
      
      This patch provides the generic slab flags and logic needed to apply memory
      spreading to a particular slab.
      
      A subsequent patch will mark a few specific slab caches for this placement
      policy.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] cpuset memory spread page cache implementation and hooks · 44110fe3
      Paul Jackson authored
      Change the page cache allocation calls to support cpuset memory spreading.
      
      See the previous patch, cpuset_mem_spread, for an explanation of cpuset memory
      spreading.
      
      On systems without cpusets configured in the kernel, this is no change.
      
      On systems with cpusets configured in the kernel, but the "memory_spread"
      cpuset option not enabled for the current task's cpuset, this adds a call to a
      cpuset routine and a failed bit test of the process state flag PF_SPREAD_PAGE.
      
      On tasks in cpusets with "memory_spread" enabled, this adds a call to a cpuset
      routine that computes which of the task's mems_allowed nodes should be
      preferred for this allocation.
      
      If memory spreading applies to a particular allocation, then any other NUMA
      mempolicy does not apply.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] cpusets: only wakeup kswapd for zones in the current cpuset · 0b1303fc
      Christoph Lameter authored
      If we get under some memory pressure in a cpuset (we only scan zones that
      are in the cpuset for memory) then kswapd is woken up for all zones.  This
      patch only wakes up kswapd in zones that are part of the current cpuset.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • B
      [PATCH] Represent laptop_mode as jiffies internally · ed5b43f1
      Bart Samwel authored
      Store the internal value for /proc/sys/vm/laptop_mode as jiffies instead
      of seconds.  Let the sysctl interface do the conversions, instead of doing
      on-the-fly conversions every time the value is used.
      
      Add a description of the fact that laptop_mode doubles as a flag and a
      timeout to the comment above the laptop_mode variable.
      Signed-off-by: Bart Samwel <bart@samwel.tk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • B
      [PATCH] Represent dirty_*_centisecs as jiffies internally · f6ef9438
      Bart Samwel authored
      Store the internal values for:
      
      /proc/sys/vm/dirty_writeback_centisecs
      /proc/sys/vm/dirty_expire_centisecs
      
      as jiffies instead of centiseconds.  Let the sysctl interface do the
      conversions with full precision using clock_t_to_jiffies, instead of
      doing overflow-sensitive on-the-fly conversions every time the values are
      used.
      
      Cons: apparent precision loss if HZ is not a multiple of 100, because of
      conversion back and forth.  This is a common problem for all sysctl values
      that use proc_dointvec_userhz_jiffies.  (There is only one other in-tree
      use, in net/core/neighbour.c.)
      Signed-off-by: Bart Samwel <bart@samwel.tk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
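The "apparent precision loss" noted in f6ef9438 can be seen with plain integer arithmetic. The helpers below are a simplified model that truncates on division; they are hypothetical and are not the kernel's clock_t_to_jiffies(), whose rounding may differ. `hz` is passed as a parameter so that different tick rates can be compared.

```c
#include <assert.h>

/* Convert centiseconds to timer ticks at a given tick rate (truncating). */
static long centisecs_to_jiffies(long cs, long hz)
{
    assert(cs >= 0 && hz > 0);
    return cs * hz / 100;
}

/* Convert timer ticks back to centiseconds (truncating). */
static long jiffies_to_centisecs(long jiffies, long hz)
{
    assert(jiffies >= 0 && hz > 0);
    return jiffies * 100 / hz;
}
```

At a tick rate of 1000 every centisecond value survives the round trip, while at 250 a written value of 499 is stored as 1247 ticks and reads back as 498.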
    • J
      2056a782
  2. 23 Mar, 2006 3 commits
    • A
      [PATCH] ext3_readdir: use generic readahead · d8733c29
      Andrew Morton authored
      Linus points out that ext3_readdir's readahead only cuts in when
      ext3_readdir() is operating at the very start of the directory.  So for large
      directories we end up performing no readahead at all and we suck.
      
      So take it all out and use the core VM's page_cache_readahead().  This means
      that ext3 directory reads will use all of readahead's dynamic sizing goop.
      
      Note that we're using the directory's filp->f_ra to hold the readahead state,
      but readahead is actually being performed against the underlying blockdev's
      address_space.  Fortunately the readahead code is all set up to handle this.
      
      Tested with printk.  It works.  I was struggling to find a real workload which
      actually cared.
      
      (The patch also exports page_cache_readahead() to GPL modules)
      
      Cc: "Stephen C. Tweedie" <sct@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • R
      [PATCH] swsusp: userland interface · 6e1819d6
      Rafael J. Wysocki authored
      This patch introduces a user space interface for swsusp.
      
      The interface is based on a special character device, called the snapshot
      device, that allows user space processes to perform suspend and resume-related
      operations with the help of some ioctls and the read()/write() functions.
      Additionally, it allows these processes to allocate free swap pages from a
      selected swap partition, called the resume partition, so that they know which
      sectors of the resume partition are available to them.
      
      The interface uses the same low-level system memory snapshot-handling
      functions that are used by the built-in swap-writing/reading code of swsusp.
      
      The interface documentation is included in the patch.
      
      The patch assumes that the major and minor numbers of the snapshot device will
      be 10 (i.e. misc device) and 231, the registration of which has already been
      requested.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • R
      [PATCH] swsusp: low level interface · f577eb30
      Rafael J. Wysocki authored
      Introduce the low level interface that can be used for handling the
      snapshot of the system memory by the in-kernel swap-writing/reading code of
      swsusp and the userland interface code (to be introduced shortly).
      
      Also change the way in which swsusp records the allocated swap pages and,
      consequently, simplify the in-kernel swap-writing/reading code (this is
      necessary for the userland interface too).  To this end, introduce two
      helper functions in mm/swapfile.c, so that the swsusp code does not refer
      directly to the swap internals.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  3. 22 Mar, 2006 21 commits
    • C
      [PATCH] page migration reorg · b20a3503
      Christoph Lameter authored
      Centralize the page migration functions in anticipation of additional
      tinkering.  Creates a new file, mm/migrate.c.
      
      1. Extract buffer_migrate_page() from fs/buffer.c
      
      2. Extract central migration code from vmscan.c
      
      3. Extract some components from mempolicy.c
      
      4. Export pageout() and remove_from_swap() from vmscan.c
      
      5. Make it possible to configure NUMA systems without page migration
         and non-NUMA systems with page migration.
      
      I had to do some #ifdeffing in mempolicy.c that may need a cleanup.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] mm: slab cache interleave rotor fix · 442295c9
      Paul Jackson authored
      The alien cache rotor in mm/slab.c assumes that the first online node is
      node 0.  Eventually for some archs, especially with hotplug, this will no
      longer be true.
      
      Fix the interleave rotor to handle the general case of node numbering.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] mm: hugetlb alloc_fresh_huge_page bogus node loop fix · fdb7cc59
      Paul Jackson authored
      Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
      assuming that nodes are numbered contiguously from 0 to num_online_nodes().
      Once the hotplug folks get this far, that will be false.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
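Both node-loop fixes above (442295c9 and fdb7cc59) come down to advancing through whichever nodes are actually online rather than counting from 0. A minimal user-space sketch of such a rotor follows; the bitmask representation, `MAX_NODES`, and `next_online_node()` are assumptions made for illustration and are not the kernel's nodemask API.

```c
#include <assert.h>

#define MAX_NODES 32   /* hypothetical limit, one bit per node */

/* Next online node after `node`, wrapping around; -1 if none are online.
 * Pass node == -1 to get the first online node. */
static int next_online_node(unsigned int online_mask, int node)
{
    assert(node >= -1 && node < MAX_NODES);
    for (int step = 1; step <= MAX_NODES; step++) {
        int candidate = (node + step) % MAX_NODES;

        if (online_mask & (1u << candidate))
            return candidate;
    }
    return -1;
}
```

With nodes 1, 4 and 6 online, the rotor visits 1, 4, 6 and wraps back to 1, whereas a loop over 0..num_online_nodes()-1 would touch nodes 0 and 2, which do not exist in this configuration.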
    • A
      [PATCH] fix swap cluster offset · 9b65ef59
      Akinobu Mita authored
      When we've allocated SWAPFILE_CLUSTER pages, ->cluster_next should be the
      first index of the swap cluster.  But the current code probably sets it to
      the wrong offset.
      Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] drain_node_pages: interrupt latency reduction / optimization · 879336c3
      Christoph Lameter authored
      1. Only disable interrupts if there is actually something to free
      
      2. Only dirty the pcp cacheline if we actually freed something.
      
      3. Disable interrupts for each single pcp and not for cleaning
        all the pcps in all zones of a node.
      
      drain_node_pages is called every 2 seconds from cache_reap. This
      fix should avoid most disabling of interrupts.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
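The three points of the drain_node_pages change (879336c3) follow a common "check before you lock" shape. The model below is a hedged sketch, not the kernel code: `struct pcp_model`, the `irq_disable_events` counter standing in for local_irq_save(), and both function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical per-CPU page list: only the count matters here. */
struct pcp_model {
    int count;
};

/* Counts how often interrupts would have been disabled. */
static int irq_disable_events;

/* Drain one list, touching it only if it holds something. */
static int drain_pcp(struct pcp_model *pcp)
{
    int freed;

    assert(pcp && pcp->count >= 0);
    if (pcp->count == 0)
        return 0;              /* nothing to free: no irq-off, no write */
    irq_disable_events++;      /* stand-in for local_irq_save() */
    freed = pcp->count;
    pcp->count = 0;            /* the list's cacheline is dirtied only here */
    /* stand-in for local_irq_restore() */
    return freed;
}

/* Interrupts are toggled per list, not once around the whole loop. */
static int drain_all(struct pcp_model *pcps, int n)
{
    int total = 0;

    for (int i = 0; i < n; i++)
        total += drain_pcp(&pcps[i]);
    return total;
}
```

On a mostly idle system nearly every periodic call finds empty lists, so in this model the simulated interrupt-disable count stays flat between bursts of activity.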
    • C
      [PATCH] slab: fix drain_array() so that it works correctly with the shared_array · b18e7e65
      Christoph Lameter authored
      The list_lock also protects the shared array and we call drain_array() with
      the shared array.  Therefore we cannot go as far as I wanted to but have to
      take the lock in a way so that it also protects the array_cache in
      drain_pages.
      
      (Note: maybe we should make the array_cache locking more consistent?  I.e.
      always take the array cache lock for shared arrays and disable interrupts
      for the per cpu arrays?)
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] slab: remove drain_array_locked · 1b55253a
      Christoph Lameter authored
      Remove drain_array_locked and use that opportunity to further limit the
      time the l3 lock is held.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] slab: make drain_array more universal by adding more parameters · aab2207c
      Christoph Lameter authored
      Add a parameter to drain_array() to control the freeing of all objects and
      then use drain_array() to replace instances of drain_array_locked().  Doing
      so will avoid taking locks in those locations if the arrays are empty.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] slab: cache_reap(): further reduction in interrupt holdoff · 35386e3b
      Christoph Lameter authored
      cache_reap takes the l3->list_lock (disabling interrupts) unconditionally
      and then does a few checks and maybe does some cleanup.  This patch makes
      cache_reap() only take the lock if there is work to do and then the lock is
      taken and released for each cleaning action.
      
      The checking of when to do the next reaping is done without any locking and
      becomes racy.  Should not matter since reaping can also be skipped if the
      slab mutex cannot be acquired.
      
      The same is true for the touched processing.  If we get this wrong once in
      a while then we will mistakenly clean or not clean the shared cache.  This
      will impact performance slightly.
      
      Note that the additional drain_array() function introduced here will fall
      out in a subsequent patch since array cleaning will now be very similar
      from all callers.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • R
      [PATCH] mm: make shrink_all_memory try harder · 248a0301
      Rafael J. Wysocki authored
      Make shrink_all_memory() repeat the attempts to free more memory if there
      seem to be no pages to free.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] optimize follow_hugetlb_page · d5d4b0aa
      Chen, Kenneth W authored
      follow_hugetlb_page() walks a range of user virtual addresses and then fills
      in a list of struct page * into an array that is passed in through the
      argument list.  It also takes a reference count via get_page().  For a
      compound page, get_page() actually traverses back to the head page via the
      page_private() macro and then adds a reference count to the head page.
      Since we are doing a virt to pte lookup, the kernel already has a struct
      page pointer to the head page.  So instead of traversing into the small unit
      page struct and then following a link back to the head page, optimize that
      by incrementing the reference count directly on the head page.
      
      The benefit is that we don't take a cache miss on accessing the page struct
      for the corresponding user address and, more importantly, we don't pollute
      the cache with a "not very useful" round trip of pointer chasing.  This adds
      a moderate performance gain on an I/O intensive database transaction
      workload.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
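The pointer-chasing round trip that d5d4b0aa avoids can be pictured with a toy compound page. This is a user-space sketch under stated assumptions: `struct page_model`, its `head` field, the `pointer_hops` counter, and both function names are hypothetical and do not mirror the real struct page layout.

```c
#include <assert.h>

/* Toy compound page: every sub-page records where its head page is. */
struct page_model {
    int refcount;
    struct page_model *head;   /* head page points to itself */
};

/* Counts how many times a tail page had to be dereferenced. */
static int pointer_hops;

/* Generic path: touch the tail page, then follow its link to the head. */
static void get_page_model(struct page_model *page)
{
    assert(page && page->head);
    pointer_hops++;
    page->head->refcount++;
}

/* Optimized path: the caller already holds the head page pointer. */
static void get_head_directly(struct page_model *head)
{
    assert(head && head->head == head);
    head->refcount++;
}
```

Both paths leave the reference on the head page; the second one simply never reads the tail page's struct, which is where the cache miss would occur.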
    • D
      [PATCH] hugepage: Fix hugepage logic in free_pgtables() harder · 4866920b
      David Gibson authored
      Turns out the hugepage logic in free_pgtables() was doubly broken.  The
      loop coalescing multiple normal page VMAs into one call to free_pgd_range()
      had an off by one error, which could mean it would coalesce one hugepage
      VMA into the same bundle (checking 'vma' not 'next' in the loop).  I
      transferred this bug into the new is_vm_hugetlb_page() based version.
      Here's the fix.
      
      This one didn't bite on powerpc previously for the same reason the
      is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range()
      is identical to free_pgd_range().  It didn't bite on ia64 because the
      hugepage region is distant enough from any other region that the separated
      PMD_SIZE distance test would always prevent coalescing the two together.
      
      No libhugetlbfs testsuite regressions (ppc64, POWER5).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • D
      [PATCH] hugepage: Fix hugepage logic in free_pgtables() · 9da61aef
      David Gibson authored
      free_pgtables() has special logic to call hugetlb_free_pgd_range() instead
      of the normal free_pgd_range() on hugepage VMAs.  However, the test it uses
      to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized
      range at the start of the vma.  is_hugepage_only_range() will return true
      if the given range has any intersection with a hugepage address region, and
      in this case the given region need not be hugepage aligned.  So, for
      example, this test can return true if called on, say, a 4k VMA immediately
      preceding a (nicely aligned) hugepage VMA.
      
      At present we get away with this because the powerpc version of
      hugetlb_free_pgd_range() is just a call to free_pgd_range().  On ia64 (the
      only other arch with a non-trivial is_hugepage_only_range()) we get away
      with it for a different reason; the hugepage area is not contiguous with
      the rest of the user address space, and VMAs are not permitted in between,
      so the test can't return a false positive there.
      
      Nonetheless this should be fixed.  We do that in the patch below by
      replacing the is_hugepage_only_range() test with an explicit test of the
      VMA using is_vm_hugetlb_page().
      
      This in turn changes behaviour for platforms where is_hugepage_only_range()
      returns false always (everything except powerpc and ia64).  We address this
      by ensuring that hugetlb_free_pgd_range() is defined to be identical to
      free_pgd_range() (instead of a no-op) on everything except ia64.  Even so,
      it will prevent some otherwise possible coalescing of calls down to
      free_pgd_range().  Since this only happens for hugepage VMAs, removing this
      small optimization seems unlikely to cause any trouble.
      
      This patch causes no regressions on the libhugetlbfs testsuite - ppc64
      POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
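The false positive described in 9da61aef (a 4k VMA immediately preceding a hugepage VMA) can be reproduced with two small predicates. The sketch is illustrative only: the region bounds, `HPAGE_SIZE`, the flag `VM_HUGETLB_MODEL`, and all function names are assumptions, chosen to contrast a range-intersection test with a per-VMA test.

```c
#include <assert.h>

#define HPAGE_SIZE       (4UL << 20)     /* hypothetical 4 MiB huge page */
#define HREGION_START    0x40000000UL    /* hypothetical hugepage region */
#define HREGION_END      0x80000000UL
#define VM_HUGETLB_MODEL 0x1u            /* hypothetical VMA flag */

struct vma_model {
    unsigned long vm_start;   /* inclusive */
    unsigned long vm_end;     /* exclusive */
    unsigned int vm_flags;
};

/* True if [addr, addr + len) intersects the hugepage region at all. */
static int range_touches_hregion(unsigned long addr, unsigned long len)
{
    return addr < HREGION_END && addr + len > HREGION_START;
}

/* Old-style test: probe a hugepage-sized range at the start of the VMA. */
static int old_is_huge(const struct vma_model *vma)
{
    assert(vma);
    return range_touches_hregion(vma->vm_start, HPAGE_SIZE);
}

/* New-style test: ask the VMA itself. */
static int new_is_huge(const struct vma_model *vma)
{
    assert(vma);
    return (vma->vm_flags & VM_HUGETLB_MODEL) != 0;
}
```

For a normal 4k VMA ending exactly where the region begins, the old-style probe reaches into the region and answers "huge", while the flag test answers correctly; for a genuine hugepage VMA both agree.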
    • D
      [PATCH] hugepage: Make {alloc,free}_huge_page() local · 27a85ef1
      David Gibson authored
      Originally, mm/hugetlb.c just handled the hugepage physical allocation path
      and its {alloc,free}_huge_page() functions were used from the arch specific
      hugepage code.  These days those functions are only used within mm/hugetlb.c
      itself.  Therefore, this patch makes them static and removes their
      prototypes from hugetlb.h.  This requires a small rearrangement of code in
      mm/hugetlb.c to avoid a forward declaration.
      
      This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
      POWER5).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • D
      [PATCH] hugepage: Strict page reservation for hugepage inodes · b45b5bd6
      David Gibson authored
      These days, hugepages are demand-allocated at first fault time.  There's a
      somewhat dubious (and racy) heuristic when making a new mmap() to check if
      there are enough available hugepages to fully satisfy that mapping.
      
      A particularly obvious case where the heuristic breaks down is where a
      process maps its hugepages not as a single chunk, but as a bunch of
      individually mmap()ed (or shmat()ed) blocks without touching and
      instantiating the pages in between allocations.  In this case the size of
      each block is compared against the total number of available hugepages.
      It's thus easy for the process to become overcommitted, because each block
      mapping will succeed, although the total number of hugepages required by
      all blocks exceeds the number available.  In particular, this defeats a
      program which detects a mapping failure and adjusts its hugepage usage
      downward accordingly.
      
      The patch below addresses this problem, by strictly reserving a number of
      physical hugepages for hugepage inodes which have been mapped, but not
      instantiated.  MAP_SHARED mappings are thus "safe" - they will fail on
      mmap(), not later with an OOM SIGKILL.  MAP_PRIVATE mappings can still
      trigger an OOM.  (Actually SHARED mappings can technically still OOM, but
      only if the sysadmin explicitly reduces the hugepage pool between mapping
      and instantiation)
      
      This patch appears to address the problem at hand - it allows DB2 to start
      correctly, for instance, which previously suffered the failure described
      above.
      
      This patch causes no regressions on the libhugetlbfs testsuite, and makes a
      test (designed to catch this problem) pass which previously failed (ppc64,
      POWER5).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b45b5bd6
    • D
      [PATCH] hugepage: serialize hugepage allocation and instantiation · 3935baa9
      David Gibson committed
      Currently, no lock or mutex is held between allocating a hugepage and
      inserting it into the pagetables / page cache.  When we do go to insert the
      page into pagetables or page cache, we recheck and may free the newly
      allocated hugepage.  However, since the number of hugepages in the system
      is strictly limited, and it's usual to want to use all of them, this can
      still lead to spurious allocation failures.
      
      For example, suppose two processes are both mapping (MAP_SHARED) the same
      hugepage file, large enough to consume the entire available hugepage pool.
      If they race instantiating the last page in the mapping, they will both
      attempt to allocate the last available hugepage.  One will fail, of course,
      returning OOM from the fault and thus causing the process to be killed,
      despite the fact that the entire mapping can, in fact, be instantiated.
      
      The patch fixes this race by the simple method of adding a (sleeping) mutex
      to serialize the hugepage fault path between allocation and insertion into
      pagetables and/or page cache.  It would be possible to avoid the
      serialization by catching the allocation failures, waiting on some
      condition, then rechecking to see if someone else has instantiated the page
      for us.  Given the likely frequency of hugepage instantiations, it seems
      very doubtful it's worth the extra complexity.
      
      This patch causes no regression on the libhugetlbfs testsuite, and one
      test, which can trigger this race, now passes where it previously failed.
      
      Actually, the test still sometimes fails, though less often and only as a
      shmat() failure, rather than processes getting OOM killed by the VM.  The
      dodgy heuristic tests in fs/hugetlbfs/inode.c for whether there's enough
      hugepage space aren't protected by the new mutex, and it would be ugly to
      make them so, so there's still a race there.  Another patch to replace
      those tests with something saner, for this reason as well as others, is
      coming...
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3935baa9
    • D
      [PATCH] hugepage: Small fixes to hugepage clear/copy path · 79ac6ba4
      David Gibson committed
      Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
      own functions for clarity.  As we do so, we add some checks of need_resched
      - we are, after all, copying megabytes of memory here.  We also add
      might_sleep() accordingly.  We generally dropped locks around the clear and
      copy already, but not everyone has PREEMPT enabled, so we should still be
      checking explicitly.
      
      For this to work, we need to remove the clear_huge_page() from
      alloc_huge_page(), which is called with the page_table_lock held in the COW
      path.  We move the clear_huge_page() to just after the alloc_huge_page() in
      the hugepage no-page path.  In the COW path, the new page is about to be
      copied over, so clearing it was just a waste of time anyway.  So as a side
      effect we also fix the fact that we held the page_table_lock for far too
      long in this path by calling alloc_huge_page() under it.
      
      It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      79ac6ba4
    • Z
      [PATCH] Enable mprotect on huge pages · 8f860591
      Zhang, Yanmin committed
      2.6.16-rc3 uses hugetlb on-demand paging, but it doesn't support hugetlb
      mprotect.
      
      From: David Gibson <david@gibson.dropbear.id.au>
      
        Remove a test from the mprotect() path which checks that the mprotect()ed
        range on a hugepage VMA is hugepage aligned (yes, really, the sense of
        is_aligned_hugepage_range() is the opposite of what you'd guess :-/).
      
        In fact, we don't need this test.  If the given addresses match the
        beginning/end of a hugepage VMA they must already be suitably aligned.  If
        they don't, then mprotect_fixup() will attempt to split the VMA.  The very
        first test in split_vma() will check for a badly aligned address on a
        hugepage VMA and return -EINVAL if necessary.
      
      From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      
        On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE.  The
        identity of the hugetlb pte is lost when changing page protection via
        mprotect.  A page fault that occurs later will trigger a bug check in
        huge_pte_alloc().
      
        The fix is to always make the new pte a hugetlb pte and also to clean up
        legacy code where _PAGE_PRESENT is forced on from the pre-faulting days.
      Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8f860591
    • S
      [PATCH] readahead: fix initial window size calculation · aed75ff3
      Steven Pratt committed
      The current get_init_ra_size is not optimal across different IO
      sizes and max_readahead values.  Here is a quick summary of sizes computed
      under current design and under the attached patch.  All of these assume 1st
      IO at offset 0, or 1st detected sequential IO.
      
      	32k max, 4k request
      	old         new
      	-----------------
      	  8k          8k
      	 16k         16k
      	 32k         32k
      
      	128k max, 4k request
      	old         new
      	-----------------
      	 32k         16k
      	 64k         32k
      	128k         64k
      	128k        128k
      
      	128k max, 32k request
      	old         new
      	-----------------
      	 32k         64k    <-----
      	 64k        128k
      	128k        128k
      
      	512k max, 4k request
      	old         new
      	-----------------
      	  4k         32k    <-----
      	 16k         64k
      	 64k        128k
      	128k        256k
      	512k        512k
      
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      aed75ff3
    • O
      [PATCH] readahead: ->prev_page can overrun the ahead window · a564da39
      Oleg Nesterov committed
      If get_next_ra_size() does not grow fast enough, ->prev_page can overrun
      the ahead window.  This means the caller will read the pages from
      ->ahead_start + ->ahead_size to ->prev_page synchronously.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a564da39
    • H
      [PATCH] shmem: inline to avoid warning · d15c023b
      Hugh Dickins committed
      shmem.c was named and shamed in Jesper's "Building 100 kernels" warnings:
      shmem_parse_mpol is only used when CONFIG_TMPFS parses mount options; and
      only called from that one site, so mark it inline like its non-NUMA stub.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d15c023b