1. 17 Oct, 2007 39 commits
    • implement simple fs aops · 800d15a5
      Nick Piggin committed
      Implement new aops for some of the simpler filesystems.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: restore KERNEL_DS optimisations · 674b892e
      Nick Piggin committed
      Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
      path.
      
      This may be a pretty questionable gain in most cases, especially after the
      legacy 2copy write path is removed, but it doesn't cost much.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: introduce write_begin, write_end, and perform_write aops · afddba49
      Nick Piggin committed
      These are intended to replace prepare_write and commit_write with more
      flexible alternatives that are also able to avoid the buffered write
      deadlock problems efficiently (which prepare_write is unable to do).
      
      [mark.fasheh@oracle.com: API design contributions, code review and fixes]
      [akpm@linux-foundation.org: various fixes]
      [dmonakhov@sw.ru: new aop block_write_begin fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: buffered write iterator · 2f718ffc
      Nick Piggin committed
      Add an iterator data structure to operate over an iovec.  Add usercopy
      operators needed by generic_file_buffered_write, and convert that function
      over.
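A minimal user-space sketch of the iterator idea described above, assuming a cursor that tracks the current segment, the offset within it, and the bytes remaining. The `iov_cursor` type and its helpers are hypothetical names chosen for illustration; they are not the kernel's structure or its usercopy operators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical user-space model of an iterator over an iovec array. */
struct iov_cursor {
    const struct iovec *iov; /* current segment */
    size_t nr_segs;          /* segments left, including the current one */
    size_t iov_offset;       /* bytes already consumed in the current segment */
    size_t count;            /* total bytes left across all segments */
};

static void iov_cursor_init(struct iov_cursor *c, const struct iovec *iov,
                            size_t nr_segs)
{
    size_t i, total = 0;

    for (i = 0; i < nr_segs; i++)
        total += iov[i].iov_len;
    c->iov = iov;
    c->nr_segs = nr_segs;
    c->iov_offset = 0;
    c->count = total;
}

/*
 * Copy up to `bytes` bytes from the cursor into dst, crossing segment
 * boundaries and skipping empty segments. Returns the number of bytes copied.
 */
static size_t iov_cursor_copy(struct iov_cursor *c, char *dst, size_t bytes)
{
    size_t copied = 0;

    if (bytes > c->count)
        bytes = c->count;
    while (copied < bytes) {
        size_t avail = c->iov->iov_len - c->iov_offset;
        size_t n = bytes - copied;

        if (avail == 0) {
            /* Current segment is exhausted (or empty): move to the next. */
            assert(c->nr_segs > 1);
            c->iov++;
            c->nr_segs--;
            c->iov_offset = 0;
            continue;
        }
        if (n > avail)
            n = avail;
        memcpy(dst + copied, (const char *)c->iov->iov_base + c->iov_offset, n);
        copied += n;
        c->iov_offset += n;
        c->count -= n;
    }
    return copied;
}
```

Keeping the segment bookkeeping inside the cursor means the caller only asks for "the next N bytes" and never tests the segment count itself.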
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix pagecache write deadlocks · 08291429
      Nick Piggin committed
      Modify the core write() code so that it won't take a pagefault while holding a
      lock on the pagecache page. There are a number of different deadlocks possible
      if we try to do such a thing:
      
      1.  generic_buffered_write
      2.   lock_page
      3.    prepare_write
      4.     unlock_page+vmtruncate
      5.     copy_from_user
      6.      mmap_sem(r)
      7.       handle_mm_fault
      8.        lock_page (filemap_nopage)
      9.    commit_write
      10.  unlock_page
      
      a. sys_munmap / sys_mlock / others
      b.  mmap_sem(w)
      c.   make_pages_present
      d.    get_user_pages
      e.     handle_mm_fault
      f.      lock_page (filemap_nopage)
      
      2,8	- recursive deadlock if page is same
      2,8;2,8	- ABBA deadlock if page is different
      2,6;b,f	- ABBA deadlock if page is same
      
      The solution is as follows:
      1.  If we find the destination page is uptodate, continue as normal, but use
          atomic usercopies which do not take pagefaults and do not zero the uncopied
          tail of the destination. The destination is already uptodate, so we can
          commit_write the full length even if there was a partial copy: it does not
          matter that the tail was not modified, because if it is dirtied and written
          back to disk it will not cause any problems (uptodate *means* that the
          destination page is as new or newer than the copy on disk).
      
      1a. The above requires that fault_in_pages_readable correctly returns access
          information, because atomic usercopies cannot distinguish between
          non-present pages in a readable mapping, from lack of a readable mapping.
      
      2.  If we find the destination page is non uptodate, unlock it (this could be
          made slightly more optimal), then allocate a temporary page to copy the
          source data into. Relock the destination page and continue with the copy.
          However, instead of a usercopy (which might take a fault), copy the data
          from the pinned temporary page via the kernel address space.
      
      (also, rename maxlen to seglen, because it was confusing)
      
      This increases the CPU/memory copy cost by almost 50% on the affected
      workloads. That will be solved by introducing a new set of pagecache write
      aops in a subsequent patch.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: write iovec cleanup · 4a9e5ef1
      Nick Piggin committed
      Hide some of the open-coded nr_segs tests into the iovec helpers.  This is all
      to simplify generic_file_buffered_write, because that gets more complex in the
      next patch.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: buffered write cleanup · eb2be189
      Nick Piggin committed
      Quite a bit of code is used in maintaining these "cached pages" that are
      probably pretty unlikely to get used. It would require a narrow race where
      the page is inserted concurrently while this process is allocating a page
      in order to create the spare page, followed by a multi-page write into an
      uncached part of the file to make use of it.
      
      Next, the buffered write path (and others) uses its own LRU pagevec when it
      should be just using the per-CPU LRU pagevec (which will cut down on both data
      and code size cacheline footprint). Also, these private LRU pagevecs are
      emptied after just a very short time, in contrast with the per-CPU pagevecs
      that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
      to add the pages to pagecache for a bulk write (in 4K chunks).
      
      [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
       to clashes in -mm. What put them there, and why? ]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: trim more holes · 64649a58
      Nick Piggin committed
      If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
      we may have failed the write operation despite prepare_write having
      instantiated blocks past i_size.  Fix this, and consolidate the trimming into
      one place.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: debug write deadlocks · 5fe17237
      Nick Piggin committed
      Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, which makes the
      race much easier to hit.
      
      This is useful for demonstration and testing purposes, but is removed in a
      subsequent patch.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clean up buffered write code · ae37461c
      Andrew Morton committed
      Rename some variables and fix some types.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "[PATCH] generic_file_buffered_write(): deadlock on vectored write" · 6814d7a9
      Andrew Morton committed
      This reverts commit 6527c2bd, which
      fixed the following bug:
      
        When prefaulting in the pages in generic_file_buffered_write(), we only
        faulted in the pages for the first segment of the iovec.  If the second or
        a successive segment described a mmapping of the page into which we're
        write()ing, and that page is not up-to-date, the fault handler tries to lock
        the already-locked page (to bring it up to date) and deadlocks.
      
        An exploit for this bug is in writev-deadlock-demo.c, in
        http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.
      
        (These demos assume blocksize < PAGE_CACHE_SIZE).
      
      The problem with this fix is that it takes the kernel back to doing a single
      prepare_write()/commit_write() per iovec segment.  So in the worst case we'll
      run prepare_write+commit_write 1024 times where we previously would have run
      it once.  The other problem with the fix is that it didn't fix all the
      locking problems.
      
      <insert numbers obtained via ext3-tools's writev-speed.c here>
      
      And apparently this change killed NFS overwrite performance, because, I
      suppose, it talks to the server for each prepare_write+commit_write.
      
      So just back that patch out - we'll be fixing the deadlock by other means.
      
      Nick says: also it only ever actually papered over the bug, because after
      faulting in the pages, they might be unmapped or reclaimed.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "[PATCH] generic_file_buffered_write(): handle zero-length iovec segments" · 4b49643f
      Andrew Morton committed
      This reverts commit 81b0c871, which was
      a bugfix against 6527c2bd ("[PATCH]
      generic_file_buffered_write(): deadlock on vectored write"), which we
      also revert.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: revert KERNEL_DS buffered write optimisation · 41cb8ac0
      Nick Piggin committed
      Revert the patch from Neil Brown to optimise NFSD writev handling.
      
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use pagevec to rotate reclaimable page · 902aaed0
      Hisashi Hifumi committed
      While running some memory intensive load, system response deteriorated just
      after swap-out started.
      
      The cause of this problem is that when a PG_reclaim page is moved to the tail
      of the inactive LRU list in rotate_reclaimable_page(), the lru_lock spin lock
      is acquired on every page writeback.  This degrades system performance and
      lengthens interrupt hold-off time once swap-out has started.
      
      The following patch solves this problem.  It uses a pagevec when rotating
      reclaimable pages to mitigate LRU spin lock contention and reduce interrupt
      hold-off time.
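The batching idea can be illustrated with a toy user-space model that only counts how often the lock would be taken. Everything here is hypothetical: `toy_lru`, `toy_batch`, and the batch size of 14 are assumptions for illustration, not the kernel's pagevec implementation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_SIZE 14 /* assumption: batch capacity chosen for illustration */

struct toy_lru {
    unsigned long lock_acquisitions; /* how many times the lock was taken */
    unsigned long pages_moved;       /* pages moved to the list tail */
};

struct toy_batch {
    size_t nr;
    int pages[TOY_BATCH_SIZE]; /* page identifiers queued for rotation */
};

/* Unbatched: every page takes the lock on its own. */
static void toy_rotate_unbatched(struct toy_lru *lru, int page)
{
    (void)page;
    lru->lock_acquisitions++;
    lru->pages_moved++;
}

/* Move all queued pages under a single lock acquisition. */
static void toy_batch_drain(struct toy_lru *lru, struct toy_batch *b)
{
    if (b->nr == 0)
        return;
    lru->lock_acquisitions++; /* stands in for taking the LRU lock once */
    lru->pages_moved += b->nr;
    b->nr = 0;
}

/* Queue a page; drain automatically when the batch is full. */
static void toy_batch_add(struct toy_lru *lru, struct toy_batch *b, int page)
{
    assert(b->nr < TOY_BATCH_SIZE);
    b->pages[b->nr++] = page;
    if (b->nr == TOY_BATCH_SIZE)
        toy_batch_drain(lru, b);
}
```

In this model, rotating 100 pages costs 100 lock acquisitions unbatched, but only 8 when batched (seven full batches plus one final drain).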
      
      I ran a test that allocates and touches pages in multiple processes while
      flood-pinging the test machine, to measure response under memory-intensive
      load.
      
      The test result is:
      
      	-2.6.23-rc5
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
      	rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma 17.746/0.092 ms
      
      	-2.6.23-rc5-patched
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
      	rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms
      
      Max round-trip-time was improved.
      
      The test machine has 4 CPUs (3.16GHz, Hyper-threading enabled), 8GB of
      memory and 8GB of swap.

      I ran the ping test again to observe any performance deterioration caused by
      taking a ref.
      
      	-2.6.23-rc6-with-modifiedpatch
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
      	rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms
      
      The result for my original patch is as follows.
      
      	-2.6.23-rc5-with-originalpatch
      	--- testmachine ping statistics ---
      	3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
      	rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms
      
      The impact on response time was small.
      
      [akpm@linux-foundation.org: fix uninitialised var warning]
      [hugh@veritas.com: fix locking]
      [randy.dunlap@oracle.com: fix function declaration]
      [hugh@veritas.com: fix BUG at include/linux/mm.h:220!]
      [hugh@veritas.com: kill redundancy in rotate_reclaimable_page]
      [hugh@veritas.com: move_tail_pages into lru_add_drain]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag · 754af6f5
      Lee Schermerhorn committed
      Allow an application to query the memories allowed by its context.
      
      Updated numa_memory_policy.txt to mention that applications can use this to
      obtain allowed memories for constructing valid policies.
      
      TODO:  update out-of-tree libnuma wrapper[s], or maybe add a new
      wrapper--e.g.,  numa_get_mems_allowed() ?
      
      Also, update numa syscall man pages.
      
      Tested with memtoy V>=0.13.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: prevent kswapd from freeing excessive amounts of lowmem · 32a4330d
      Rik van Riel committed
      The current VM can get itself into trouble fairly easily on systems with a
      small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.
      
      On one side, page_alloc() will allocate down to zone->pages_low, while on
      the other side, kswapd() and balance_pgdat() will try to free memory from
      every zone, until every zone has more free pages than zone->pages_high.
      
      Highmem can be filled up to zone->pages_low with page tables, ramfs,
      vmalloc allocations and other unswappable things quite easily and without
      many bad side effects, since we still have a huge ZONE_NORMAL to do future
      allocations from.
      
      However, as long as the number of free pages in the highmem zone is below
      zone->pages_high, kswapd will continue swapping things out from
      ZONE_NORMAL, too!
      
      Sami Farin managed to get his system into a state where kswapd had freed
      about 700MB of low memory and was still "going strong".
      
      The attached patch will make kswapd stop paging out data from zones when
      there is more than enough memory free.  We do go above zone->pages_high in
      order to keep pressure between zones equal in normal circumstances, but the
      patch should prevent the kind of excesses that made Sami's computer totally
      unusable.
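A toy model of the decision described above, under stated assumptions: the `toy_zone` struct is a stand-in, and the "more than enough memory free" cutoff is modelled as a hypothetical multiple of `pages_high` (`TOY_EXCESS_FACTOR`), since the message does not give the exact threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: how far above pages_high counts as "more than enough". */
#define TOY_EXCESS_FACTOR 8

/* Hypothetical stand-in for the per-zone watermark fields. */
struct toy_zone {
    unsigned long free_pages;
    unsigned long pages_low;
    unsigned long pages_high;
};

/* A zone keeps kswapd busy while it has not risen above its high watermark. */
static bool toy_zone_below_high(const struct toy_zone *z)
{
    return z->free_pages <= z->pages_high;
}

/* A zone with far more free pages than pages_high needs no more reclaim. */
static bool toy_zone_has_excess_free(const struct toy_zone *z)
{
    return z->free_pages > TOY_EXCESS_FACTOR * z->pages_high;
}

/*
 * Old behaviour (as described): reclaim from every zone while any zone is
 * below its high watermark. Sketched new behaviour: same, but skip zones
 * that already have an excess of free pages.
 */
static bool toy_should_reclaim(const struct toy_zone *z, bool any_zone_below_high)
{
    assert(z->pages_low <= z->pages_high);
    return any_zone_below_high && !toy_zone_has_excess_free(z);
}
```

Going somewhat above `pages_high` before stopping keeps pressure roughly equal between zones in normal circumstances, while the cutoff stops the runaway case.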
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: no need to cast vmalloc() return value in zone_wait_table_init() · 8691f3a7
      Jesper Juhl committed
      vmalloc() returns a void pointer, so there's no need to cast its
      return value in mm/page_alloc.c::zone_wait_table_init().
      Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Slab allocators: fail if ksize is called with a NULL parameter · ef8b4520
      Christoph Lameter committed
      A NULL pointer means that the object was not allocated.  One cannot
      determine the size of an object that has not been allocated.  Currently we
      return 0 but we really should BUG() on attempts to determine the size of
      something nonexistent.
      
      krealloc() interprets NULL to mean a zero sized object.  Handle that
      separately in krealloc().
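The division of labour described above can be sketched in user space. This is a toy model built on `malloc` with a size header; the `toy_*` names are hypothetical, `assert` stands in for `BUG()`, and the zero-size handling is a simplification rather than the kernel's behaviour.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Header stored in front of each object; the union keeps the payload aligned. */
union toy_hdr {
    size_t size;
    long double align_ld;
    long long align_ll;
    void *align_ptr;
};

static void *toy_kmalloc(size_t size)
{
    union toy_hdr *h = malloc(sizeof(*h) + size);

    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;
}

static void toy_kfree(void *p)
{
    if (p != NULL)
        free((union toy_hdr *)p - 1);
}

/* Asking for the size of a non-existent object is a caller bug. */
static size_t toy_ksize(const void *p)
{
    assert(p != NULL); /* models BUG() on a NULL argument */
    return ((const union toy_hdr *)p - 1)->size;
}

/* The realloc wrapper treats NULL as a zero-sized object itself. */
static void *toy_krealloc(void *p, size_t new_size)
{
    size_t old_size = p ? toy_ksize(p) : 0;
    void *n;

    if (new_size == 0) { /* simplification: free and return NULL */
        toy_kfree(p);
        return NULL;
    }
    if (new_size <= old_size)
        return p;
    n = toy_kmalloc(new_size);
    if (n == NULL)
        return NULL;
    if (old_size)
        memcpy(n, p, old_size);
    toy_kfree(p);
    return n;
}
```

With the NULL check moved into the realloc wrapper, the size query can stay strict without breaking callers that grow an object starting from NULL.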
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • calculation of pgoff in do_linear_fault() uses mixed units · 0da7e01f
      Dean Nelson committed
      The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not
      PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not
      PAGE_CACHE_SIZE.  At the moment linux/pagemap.h has PAGE_CACHE_SHIFT
      defined as PAGE_SHIFT, but should that ever change this calculation would
      break.
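A small user-space sketch of the unit-consistent calculation, assuming 4 KiB pages. The `toy_vma` struct and `TOY_PAGE_*` constants are hypothetical stand-ins; the point is only that the shift must use the same unit as `vm_pgoff`.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

/* Hypothetical stand-in for the two VMA fields used here. */
struct toy_vma {
    unsigned long vm_start; /* first address of the mapping */
    unsigned long vm_pgoff; /* file offset of vm_start, in PAGE_SIZE units */
};

/*
 * File page offset of a faulting address. Both terms are in PAGE_SIZE
 * units, so the shift uses the page shift, not a page-cache shift.
 */
static unsigned long toy_linear_pgoff(const struct toy_vma *vma,
                                      unsigned long address)
{
    assert(address >= vma->vm_start);
    return (((address & TOY_PAGE_MASK) - vma->vm_start) >> TOY_PAGE_SHIFT)
           + vma->vm_pgoff;
}
```

For a mapping that starts at 0x10000 with `vm_pgoff` 5, a fault at 0x12345 lies two pages into the mapping and so resolves to file page 7.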
      Signed-off-by: Dean Nelson <dcn@sgi.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • {slub, slob}: use unlikely() for kfree(ZERO_OR_NULL_PTR) check · 2408c550
      Satyam Sharma committed
      kfree(NULL) would normally occur only in error paths, and
      kfree(ZERO_SIZE_PTR) is uncommon as well, so use unlikely() for the
      condition check in SLUB's and SLOB's kfree() to optimize for the common
      case.  SLAB has this already.
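A user-space sketch of the branch hint, assuming a GCC-compatible compiler for `__builtin_expect` (with a plain fallback otherwise). The `toy_` macros and the sentinel value 16 are illustrative assumptions that mirror the names in the message; this is not the allocators' code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__GNUC__)
#define toy_unlikely(x) __builtin_expect(!!(x), 0) /* hint: condition is rare */
#else
#define toy_unlikely(x) (x)
#endif

/* Assumption: a small non-NULL sentinel standing for zero-sized objects. */
#define TOY_ZERO_SIZE_PTR ((void *)16)

/* True for NULL and for the zero-size sentinel (one comparison covers both). */
#define TOY_ZERO_OR_NULL_PTR(p) \
    ((uintptr_t)(p) <= (uintptr_t)TOY_ZERO_SIZE_PTR)

/* Returns 1 if memory was actually released, 0 if there was nothing to free. */
static int toy_kfree(void *p)
{
    if (toy_unlikely(TOY_ZERO_OR_NULL_PTR(p)))
        return 0; /* rare path: NULL or zero-sized object */
    free(p);      /* common path stays on the fall-through branch */
    return 1;
}
```

The hint does not change behaviour; it only tells the compiler which branch to lay out as the fall-through path.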
      Signed-off-by: Satyam Sharma <satyam@infradead.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clarify __add_to_swap_cache locking · b55ed816
      Nick Piggin committed
      __add_to_swap_cache unconditionally sets the page locked, which can be a bit
      alarming to the unsuspecting reader: in the code paths where the page is
      visible to other CPUs, the page should be (and is) already locked.
      
      Instead, just add a check to ensure the page is locked here, and teach the one
      path relying on the old behaviour to call SetPageLocked itself.
      
      [hugh@veritas.com: locking fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: improve find_lock_page · 45726cb4
      Nick Piggin committed
      find_lock_page does not need to recheck ->index because if the page is in the
      right mapping then the index must be the same.  Also, tree_lock does not need
      to be retaken after the page is locked in order to test that ->mapping has not
      changed, because holding the page lock pins its mapping.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use lockless radix-tree probe · 00128188
      Nick Piggin committed
      Probing pages and radix_tree_tagged are lockless operations with the lockless
      radix-tree.  Convert these users to RCU locking rather than using tree_lock.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove ZERO_PAGE · 557ed1fa
      Nick Piggin committed
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- measuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
    • SLUB: direct pass through of page size or higher kmalloc requests · aadb4bc4
      Christoph Lameter committed
      This gets rid of all kmalloc caches larger than page size.  A kmalloc
      request larger than PAGE_SIZE / 2 is going to be passed through to the page
      allocator.  This works both inline where we will call __get_free_pages
      instead of kmem_cache_alloc and in __kmalloc.
      
      kfree is modified to check if the object is in a slab page. If not then
      the page is freed via the page allocator instead. Roughly similar to what
      SLOB does.
      
      Advantages:
      - Reduces memory overhead for kmalloc array
      - Large kmalloc operations are faster since they do not
        need to pass through the slab allocator to get to the
        page allocator.
      - Performance increase of 10%-20% on alloc and 50% on free for
        PAGE_SIZEd allocations.
        SLUB must call the page allocator for each alloc anyway, since the
        higher-order pages that previously allowed avoiding page allocator calls
        are no longer reliably available. So we are basically removing
        useless slab allocator overhead.
      - Large kmallocs yield page-aligned objects, which is what
        SLAB did. Bad things like using page-sized kmalloc allocations to
        stand in for page allocator allocs can be transparently handled and are not
        distinguishable from page allocator uses.
      - Checking for too large objects can be removed since
        it is done by the page allocator.
      
      Drawbacks:
      - No accounting for large kmalloc slab allocations anymore
      - No debugging of large kmalloc slab allocations.
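A toy dispatch sketch of the pass-through idea, assuming 4 KiB pages and assuming the cutoff is half a page. The enum, constants, and function names are hypothetical and only model which path a request would take, not how SLUB implements it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

enum toy_alloc_path {
    TOY_PATH_SLAB_CACHE,    /* served from a kmalloc cache */
    TOY_PATH_PAGE_ALLOCATOR /* passed straight to the page allocator */
};

/* Assumption: requests above half a page bypass the slab layer. */
static enum toy_alloc_path toy_kmalloc_path(size_t size)
{
    return size > TOY_PAGE_SIZE / 2 ? TOY_PATH_PAGE_ALLOCATOR
                                    : TOY_PATH_SLAB_CACHE;
}

/* Smallest order such that (TOY_PAGE_SIZE << order) covers the request. */
static unsigned int toy_get_order(size_t size)
{
    unsigned int order = 0;

    assert(size > 0 && size <= ((size_t)1 << 30));
    while (((size_t)TOY_PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

Because large requests get whole pages, the resulting objects are naturally page aligned, which matches the advantage listed above.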
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • filemap: convert some unsigned long to pgoff_t · 57f6b96c
      Fengguang Wu committed
      Convert some 'unsigned long' to pgoff_t.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • filemap: trivial code cleanups · b2c3843b
      Fengguang Wu committed
      - remove unused local next_index in do_generic_mapping_read()
      - remove a redundant page_cache_read() declaration
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: remove several readahead macros · 535443f5
      Fengguang Wu committed
      Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: remove the local copy of ra in do_generic_mapping_read() · 7ff81078
      Fengguang Wu committed
      The local copy of ra in do_generic_mapping_read() can now go away.
      
      It predates readahead(req_size), dating from a time when the readahead code
      was called on *every* single page.  Hence a local copy had to be made to
      reduce the chance of the readahead state being overwritten by a concurrent
      reader.  More details
      in: Linux: Random File I/O Regressions In 2.6
      <http://kerneltrap.org/node/3039>
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: basic support of interleaved reads · 6b10c6c9
      Fengguang Wu committed
      This is a simplified version of the pagecache context based readahead.  It
      handles the case of multiple threads reading on the same fd and invalidating
      each others' readahead state.  It does the trick by scanning the pagecache and
      recovering the current read stream's readahead status.
      
      The algorithm works in an opportunistic way, in that it does not try to detect
      interleaved reads _actively_, which requires a probe into the page cache
      (which means a little more overhead for random reads).  It only tries to
      handle a previously started sequential readahead whose state was overwritten
      by another concurrent stream, and it can do this job pretty well.
      
      Negative and positive examples (or what you can expect from it):
      
      1) it cannot detect and serve perfect request-by-request interleaved reads
         right:
      	time	stream 1  stream 2
      	0 	1
      	1 	          1001
      	2 	2
      	3 	          1002
      	4 	3
      	5 	          1003
      	6 	4
      	7 	          1004
      	8 	5
      	9	          1005
      
      Here not a single readahead will be carried out.

      2) However, if it's two concurrent reads by two threads, the chance of the
         initial sequential readahead being started is high. Once the first sequential
         readahead is started for a stream, this patch will ensure that the readahead
         window continues to ramp up and won't be disturbed by other streams.
      
      	time	stream 1  stream 2
      	0 	1
      	1 	2
      	2 	          1001
      	3 	3
      	4 	          1002
      	5 	          1003
      	6 	4
      	7 	5
      	8 	          1004
      	9 	6
      	10	          1005
      	11	7
      	12	          1006
      	13	          1007
      
      Here stream 1 will start a readahead at page 2, and stream 2 will start its
      first readahead at page 1003.  From then on the two streams will be served
      right.
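The two schedules above can be checked with a toy model of the initial sequential detection only. It assumes a single shared "previous page" per file descriptor and treats a request as sequential when it directly follows the previous one; the function name is hypothetical, and the pagecache scan that recovers an overwritten stream is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk a request schedule that shares one "previous page" state.
 * Records pages whose request directly follows the previous request
 * (page == prev + 1) into hits[] and returns how many were found.
 */
static size_t toy_sequential_hits(const unsigned long *pages, size_t n,
                                  unsigned long *hits, size_t max_hits)
{
    size_t i, nr = 0;

    assert(pages != NULL || n == 0);
    for (i = 1; i < n; i++) {
        if (pages[i] == pages[i - 1] + 1) {
            if (nr < max_hits)
                hits[nr] = pages[i];
            nr++;
        }
    }
    return nr;
}
```

Fed the first schedule, the model finds no sequential request at all; fed the second, its first two hits are pages 2 and 1003, matching where the two streams are said to start their readahead.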
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: combine file_ra_state.prev_index/prev_offset into prev_pos · f4e6b498
      Fengguang Wu committed
      Combine the file_ra_state members
      				unsigned long prev_index
      				unsigned int prev_offset
      into
      				loff_t prev_pos
      
      It is more consistent and better supports huge files.
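A user-space sketch of packing a page index and an in-page offset into one 64-bit position, assuming 4 KiB pages and modelling a 32-bit index with `uint32_t`. All `toy_` names are hypothetical; the truncating variant is included only to show why the index should be widened before shifting.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_CACHE_SHIFT 12 /* assumption: 4 KiB pages */
#define TOY_PAGE_CACHE_SIZE  (1u << TOY_PAGE_CACHE_SHIFT)

/* Combine index and offset; the index is widened before the shift. */
static int64_t toy_make_prev_pos(uint32_t index, uint32_t offset)
{
    assert(offset < TOY_PAGE_CACHE_SIZE);
    return ((int64_t)index << TOY_PAGE_CACHE_SHIFT) | offset;
}

/* Same formula with the shift done in 32 bits: high bits are lost. */
static int64_t toy_make_prev_pos_truncating(uint32_t index, uint32_t offset)
{
    assert(offset < TOY_PAGE_CACHE_SIZE);
    return (int64_t)((index << TOY_PAGE_CACHE_SHIFT) | offset);
}

static uint32_t toy_prev_index(int64_t pos)
{
    return (uint32_t)(pos >> TOY_PAGE_CACHE_SHIFT);
}

static uint32_t toy_prev_offset(int64_t pos)
{
    return (uint32_t)(pos & (TOY_PAGE_CACHE_SIZE - 1));
}
```

With index 0x200000 (an 8 GiB file position), the widened version round-trips, while the 32-bit shift drops the index entirely and leaves only the offset.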
      
      Thanks to Peter for the nice proposal!
      
      [akpm@linux-foundation.org: fix shift overflow]
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: mmap read-around simplification · 0bb7ba6b
      Fengguang Wu committed
      Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: compacting file_ra_state · 937085aa
      Fengguang Wu committed
      Use 'unsigned int' instead of 'unsigned long' for readahead sizes.
      
      This helps reduce memory consumption on 64bit CPU when a lot of files are
      opened.
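A rough illustration of the saving, using two hypothetical structs whose field names are assumptions and do not reproduce the real `file_ra_state` layout. On a typical 64-bit (LP64) target `unsigned long` is 8 bytes and `unsigned int` is 4, so the compact variant is smaller; on other targets the two may be the same size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "before" layout: sizes kept as unsigned long. */
struct toy_ra_wide {
    unsigned long start;      /* page index where the window starts */
    unsigned long size;       /* window size in pages */
    unsigned long async_size; /* pages left before the next readahead */
    unsigned long ra_pages;   /* maximum window size */
};

/* Hypothetical "after" layout: only the page index stays unsigned long. */
struct toy_ra_compact {
    unsigned long start;
    unsigned int size;
    unsigned int async_size;
    unsigned int ra_pages;
};

/* Bytes saved across a given number of open files. */
static size_t toy_ra_bytes_saved(size_t open_files)
{
    assert(sizeof(struct toy_ra_compact) <= sizeof(struct toy_ra_wide));
    return open_files *
           (sizeof(struct toy_ra_wide) - sizeof(struct toy_ra_compact));
}
```

Since one such state exists per open file, even a few bytes per instance add up when many files are open.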
      
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      937085aa
    • J
      Clean up duplicate includes in mm/ · 43fac94d
      Jesper Juhl committed
      This patch cleans up duplicate includes in mm/.
      Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43fac94d
    • A
      slub.c:early_kmem_cache_node_alloc() shouldn't be __init · 1cd7daa5
      Adrian Bunk committed
      WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes')
      ...
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cd7daa5
    • A
      vmemmap: generify initialisation via helpers · 29c71111
      Andy Whitcroft committed
      Convert the common vmemmap population into initialisation helpers for use by
      architecture vmemmap populators.  All architectures implementing the
      SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
      initialiser, which may make use of the helpers.
      
      This allows us to clean up and remove the initialisation Kconfig entries.
      With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
      indicate use of that variant.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29c71111
    • C
      Generic Virtual Memmap support for SPARSEMEM · 8f6aac41
      Christoph Lameter committed
      SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
      the arches.  It would be great if it could be the default so that we can get
      rid of various forms of DISCONTIG and other variations on memory maps.  So far
      what has hindered this are the additional lookups that SPARSEMEM introduces
      for virt_to_page and page_address.  This goes so far that the code to do this
      has to be kept in a separate function and cannot be used inline.
      
      This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
      is mapped into a virtually contiguous area and only the active sections are
      physically backed.  This allows virt_to_page, page_address and cohorts to become
      simple shift/add operations.  No page flag fields, no table lookups, nothing
      involving memory is required.
      
      The two key operations pfn_to_page and page_to_pfn become:
      
         #define __pfn_to_page(pfn)      (vmemmap + (pfn))
         #define __page_to_pfn(page)     ((page) - vmemmap)
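
The pointer arithmetic behind these two macros can be tried in user space. The sketch below is only an approximation: the `sketch_` function names, the placeholder `struct page` and the small backing array are assumptions for illustration, whereas in the kernel vmemmap is a fixed virtual address whose mapping is populated per section.

```c
#include <stddef.h>

/* Placeholder descriptor; the real struct page carries much more state. */
struct page {
    unsigned long flags;
};

/* Assumption: a tiny backing array stands in for the virtually mapped memmap. */
#define SKETCH_NR_PAGES 16
static struct page sketch_backing[SKETCH_NR_PAGES];

/* Constant base pointer, mirroring the fact that vmemmap is a constant. */
static struct page *const vmemmap = sketch_backing;

/* pfn -> descriptor: one addition scaled by sizeof(struct page). */
static inline struct page *sketch_pfn_to_page(unsigned long pfn)
{
    return vmemmap + pfn;
}

/* descriptor -> pfn: one pointer subtraction against the constant base. */
static inline unsigned long sketch_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - vmemmap);
}
```

Neither direction reads any memory beyond the constant base, which is the property the message contrasts with table-based lookups.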
      
      By having a virtual mapping for the memmap we allow simple access without
      wasting physical memory.  As kernel memory is typically already mapped 1:1
      this introduces no additional overhead.  The virtual mapping must be big
      enough to allow a struct page to be allocated and mapped for all valid
      physical pages.  This will make a virtual memmap difficult to use on 32 bit
      platforms that support 36 address bits.
      
      However, if there is enough virtual space available and the arch already maps
      its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
      technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
      FLATMEM needs to read the contents of the mem_map variable to get the start of
      the memmap and then add the offset to the required entry.  vmemmap is a
      constant to which we can simply add the offset.
      
      This patch has the potential to allow us to make SPARSEMEM the default (and
      even the only) option for most systems.  It should be optimal on UP, SMP and
      NUMA on most platforms.  Then we may even be able to remove the other memory
      models: FLATMEM, DISCONTIG etc.
      
      [apw@shadowen.org: config cleanups, resplit code etc]
      [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
      [apw@shadowen.org: vmemmap: remove excess debugging]
      [apw@shadowen.org: simplify initialisation code and reduce duplication]
      [apw@shadowen.org: pull out the vmemmap code into its own file]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f6aac41
    • A
      sparsemem: record when a section has a valid mem_map · 540557b9
      Andy Whitcroft committed
      We have flags to indicate whether a section actually has a valid mem_map
      associated with it.  This is never set and we rely solely on the present bit
      to indicate a section is valid.  By definition a section is not valid if it
      has no mem_map and there is a window during init where the present bit is set
      but there is no mem_map, during which pfn_valid() will return true
      incorrectly.
      
      Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
      mem_map.  Switch valid_section{,_nr} and pfn_valid() to this bit.  Add new
      present_section{,_nr} and pfn_present() interfaces for those users who care to
      know that a section is going to be valid.
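
The distinction between "present" and "valid" can be sketched with two flag bits. This is a simplified user-space model, not the kernel source: the flag values and the single-field `struct mem_section` are assumptions, and the real helpers also deal with section numbers and the encoded mem_map pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit assignments for the two section flags. */
#define SECTION_MARKED_PRESENT (1UL << 0)
#define SECTION_HAS_MEM_MAP    (1UL << 1)

/* Simplified section descriptor: only the flag bits are modelled here. */
struct mem_section {
    unsigned long section_mem_map;
};

/* Present: memory exists in this section, but mem_map may not be set up yet. */
static inline bool present_section(const struct mem_section *s)
{
    return s && (s->section_mem_map & SECTION_MARKED_PRESENT);
}

/* Valid: the section has a mem_map, so its page descriptors can be used. */
static inline bool valid_section(const struct mem_section *s)
{
    return s && (s->section_mem_map & SECTION_HAS_MEM_MAP);
}
```

During the init window described above a section is present but not yet valid, so a check built on valid_section() stays false until the mem_map has actually been attached.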
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      540557b9
    • A
      sparsemem: clean up spelling error in comments · cd881a6b
      Andy Whitcroft committed
      SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
      the arches.  It would be great if it could be the default so that we can get
      rid of various forms of DISCONTIG and other variations on memory maps.  So far
      what has hindered this are the additional lookups that SPARSEMEM introduces
      for virt_to_page and page_address.  This goes so far that the code to do this
      has to be kept in a separate function and cannot be used inline.
      
      This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
      is mapped into a virtually contiguous area and only the active sections are
      physically backed.  This allows virt_to_page, page_address and cohorts to become
      simple shift/add operations.  No page flag fields, no table lookups, nothing
      involving memory is required.
      
      The two key operations pfn_to_page and page_to_pfn become:
      
         #define __pfn_to_page(pfn)      (vmemmap + (pfn))
         #define __page_to_pfn(page)     ((page) - vmemmap)
      
      By having a virtual mapping for the memmap we allow simple access without
      wasting physical memory.  As kernel memory is typically already mapped 1:1
      this introduces no additional overhead.  The virtual mapping must be big
      enough to allow a struct page to be allocated and mapped for all valid
      physical pages.  This will make a virtual memmap difficult to use on 32 bit
      platforms that support 36 address bits.
      
      However, if there is enough virtual space available and the arch already maps
      its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
      technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
      FLATMEM needs to read the contents of the mem_map variable to get the start of
      the memmap and then add the offset to the required entry.  vmemmap is a
      constant to which we can simply add the offset.
      
      This patch has the potential to allow us to make SPARSEMEM the default (and
      even the only) option for most systems.  It should be optimal on UP, SMP and
      NUMA on most platforms.  Then we may even be able to remove the other memory
      models: FLATMEM, DISCONTIG etc.
      
      The current aim is to bring a common virtually mapped mem_map to all
      architectures.  This should facilitate the removal of the bespoke
      implementations from the architectures.  This also brings performance
      improvements for most architectures, making sparsemem vmemmap the more desirable
      memory model.  The ultimate aim of this work is to expand sparsemem support to
      encompass all the features of the other memory models.  This could allow us to
      drop support for and remove the other models in the longer term.
      
      Below are some comparative kernbench numbers for various architectures,
      comparing default memory model against SPARSEMEM VMEMMAP.  All but ia64 show
      marginal improvement; we expect the ia64 figures to be sorted out when the
      larger mapping support returns.
      
      x86-64 non-NUMA
                   Base   VMEMMAP    % change (-ve good)
      User        85.07     84.84    -0.26
      System      34.32     33.84    -1.39
      Total      119.38    118.68    -0.59
      
      ia64
                   Base   VMEMMAP    % change (-ve good)
      User      1016.41   1016.93     0.05
      System      50.83     51.02     0.36
      Total     1067.25   1067.95     0.07
      
      x86-64 NUMA
                   Base   VMEMMAP    % change (-ve good)
      User       430.77    431.73     0.22
      System      45.39     43.98    -3.11
      Total      476.17    475.71    -0.10
      
      ppc64
                   Base   VMEMMAP    % change (-ve good)
      User       488.77    488.35    -0.09
      System      56.92     56.37    -0.97
      Total      545.69    544.72    -0.18
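
For reference, the "% change" column appears to be the relative difference between the VMEMMAP and Base timings, so negative values mean the VMEMMAP kernel was faster. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the published figures may differ in the last digit because the raw timings were presumably more precise than the two decimals shown.

```c
/*
 * Relative change of the VMEMMAP timing against the Base timing, in percent.
 * Negative results mean VMEMMAP took less time than Base.
 */
static inline double pct_change(double base, double vmemmap_time)
{
    return (vmemmap_time - base) / base * 100.0;
}
```

Applied to the x86-64 non-NUMA System row, pct_change(34.32, 33.84) gives about -1.4, in line with the -1.39 listed in the table.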
      
      Below are some AIM benchmarks on IA64 and x86-64 (thanks Bob).  They seem
      pretty much flat, as you would expect.
      
      ia64 results 2 cpu non-numa 4Gb SCSI disk
      
      Benchmark	Version	Machine	Run Date
      AIM Multiuser Benchmark - Suite VII	"1.1"	extreme	Jun  1 07:17:24 2007
      
      Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
      1	98.9		100	58.9	1.3	1.6482
      101	5547.1		95	106.0	79.4	0.9154
      201	6377.7		95	183.4	158.3	0.5288
      301	6932.2		95	252.7	237.3	0.3838
      401	7075.8		93	329.8	316.7	0.2941
      501	7235.6		94	403.0	396.2	0.2407
      600	7387.5		94	472.7	475.0	0.2052
      
      Benchmark	Version	Machine	Run Date
      AIM Multiuser Benchmark - Suite VII	"1.1"	vmemmap	Jun  1 09:59:04 2007
      
      Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
      1	99.1		100	58.8	1.2	1.6509
      101	5480.9		95	107.2	79.2	0.9044
      201	6490.3		95	180.2	157.8	0.5382
      301	6886.6		94	254.4	236.8	0.3813
      401	7078.2		94	329.7	316.0	0.2942
      501	7250.3		95	402.2	395.4	0.2412
      600	7399.1		94	471.9	473.9	0.2055
      
      open power 710 2 cpu, 4 Gb, SCSI and configured physically
      
      Benchmark	Version	Machine	Run Date
      AIM Multiuser Benchmark - Suite VII	"1.1"	extreme	May 29 15:42:53 2007
      
      Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
      1	25.7		100	226.3	4.3	0.4286
      101	1096.0		97	536.4	199.8	0.1809
      201	1236.4		96	946.1	389.1	0.1025
      301	1280.5		96	1368.0	582.3	0.0709
      401	1270.2		95	1837.4	771.0	0.0528
      501	1251.4		96	2330.1	955.9	0.0416
      601	1252.6		96	2792.4	1139.2	0.0347
      701	1245.2		96	3276.5	1334.6	0.0296
      918	1229.5		96	4345.4	1728.7	0.0223
      
      Benchmark	Version	Machine	Run Date
      AIM Multiuser Benchmark - Suite VII	"1.1"	vmemmap	May 30 07:28:26 2007
      
      Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
      1	25.6		100	226.9	4.3	0.4275
      101	1049.3		97	560.2	198.1	0.1731
      201	1199.1		97	975.6	390.7	0.0994
      301	1261.7		96	1388.5	591.5	0.0699
      401	1256.1		96	1858.1	771.9	0.0522
      501	1220.1		96	2389.7	955.3	0.0406
      601	1224.6		96	2856.3	1133.4	0.0340
      701	1252.0		96	3258.7	1314.1	0.0298
      915	1232.8		96	4319.7	1704.0	0.0225
      
      amd64 2 2-core, 4Gb and SATA
      
      Benchmark	Version	Machine	Run Date
      AIM Multiuser Benchmark - Suite VII	"1.1"	extreme	Jun  2 03:59:48 2007
      
      Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
      1	13.0		100	446.4	2.1	0.2173
      101	533.4		97	1102.0	110.2	0.0880
      201	578.3		97	2022.8	220.8	0.0480
      301	583.8		97	3000.6	332.3	0.0323
      401	580.5		97	4020.1	442.2	0.0241
      501	574.8		98	5072.8	558.8	0.0191
      600	566.5		98	6163.8	671.0	0.0157
      
      Benchmark	Version	Machine	Run Date
      AIM Multiuser Benchmark - Suite VII	"1.1"	vmemmap	Jun  3 04:19:31 2007
      
      Tasks	Jobs/Min	JTI	Real	CPU	Jobs/sec/task
      1	13.0		100	447.8	2.0	0.2166
      101	536.5		97	1095.6	109.7	0.0885
      201	567.7		97	2060.5	219.3	0.0471
      301	582.1		96	3009.4	330.2	0.0322
      401	578.2		96	4036.4	442.4	0.0240
      501	585.1		98	4983.2	555.1	0.0195
      600	565.5		98	6175.2	660.6	0.0157
      
      This patch:
      
      Fix some spelling errors.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd881a6b
  2. 15 Oct, 2007 1 commit