1. 17 6月, 2009 8 次提交
    • M
      page allocator: do not check NUMA node ID when the caller knows the node is valid · 6484eb3e
      Mel Gorman 提交于
      Callers of alloc_pages_node() can optionally specify -1 as a node to mean
      "allocate from the current node".  However, a number of the callers in
      fast paths know for a fact their node is valid.  To avoid a comparison and
      branch, this patch adds alloc_pages_exact_node() that only checks the nid
      with VM_BUG_ON().  Callers that know their node is valid are then
      converted.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6484eb3e
    • W
      readahead: enforce full sync mmap readahead size · 7ffc59b4
      Wu Fengguang 提交于
      Now that we do readahead for sequential mmap reads, here is a simple
      evaluation of the impacts, and one further optimization.
      
      It's an NFS-root debian desktop system, readahead size = 60 pages.
      The numbers are grabbed after a fresh boot into console.
      
      approach        pgmajfault      RA miss ratio   mmap IO count   avg IO size(pages)
         A            383             31.6%           383             11
         B            225             32.4%           390             11
         C            224             32.6%           307             13
      
      case A: mmap sync/async readahead disabled
      case B: mmap sync/async readahead enabled, with enforced full async readahead size
      case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
      or:
      A = vanilla 2.6.30-rc1
      B = A plus mmap readahead
      C = B plus this patch
      
      The numbers show that
      - there are good possibilities for random mmap reads to trigger readahead
      - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
      - case C can further reduce IO count by 1/4
      - readahead miss ratios are not quite affected
      
      The theory is
      - readahead is _good_ for clustered random reads, and can perform
        _better_ than readaround because they could be _async_.
      - async readahead size is guaranteed to be larger than readaround
        size, and they are _async_, hence will mostly behave better
      However for B
      - sync readahead size could be smaller than readaround size, hence may
        make things worse by produce more smaller IOs
      which will be fixed by this patch.
      
      Final conclusion:
      - mmap readahead reduced major faults by 1/3 and no obvious overheads;
      - mmap io can be further reduced by 1/4 with this patch.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ffc59b4
    • W
    • W
      readahead: record mmap read-around states in file_ra_state · d30a1100
      Wu Fengguang 提交于
      Mmap read-around now shares the same code style and data structure with
      readahead code.
      
      This also removes do_page_cache_readahead().  Its last user, mmap
      read-around, has been changed to call ra_submit().
      
      The no-readahead-if-congested logic is dumped by the way.  Users will be
      pretty sensitive about the slow loading of executables.  So it's
      unfavorable to disabled mmap read-around on a congested queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d30a1100
    • W
      readahead: enforce full readahead size on async mmap readahead · 2fad6f5d
      Wu Fengguang 提交于
      We need this in one particular case and two more general ones.
      
      Now we do async readahead for sequential mmap reads, and do it with the
      help of PG_readahead.  For normal reads, PG_readahead is the sufficient
      condition to do a sequential readahead.  But unfortunately, for mmap
      reads, there is a tiny nuisance:
      
      [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
      [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
      [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
      [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                 ~~~~~~~~~~~~~
      
      An unfavorably small readahead.  The original dumb read-around size could
      be more efficient.
      
      That happened because ld-linux.so does a read(832) in L1 before mmap(),
      which triggers a 4-page readahead, with the second page tagged
      PG_readahead.
      
      L0: open("/lib/libc.so.6", O_RDONLY)        = 3
      L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
      L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
      L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
      L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
      L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
      L7: close(3)                                = 0
      
      In general, the PG_readahead flag will also be hit in cases
      
      - sequential reads
      
      - clustered random reads
      
      A full readahead size is desirable in both cases.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2fad6f5d
    • W
      readahead: sequential mmap readahead · 70ac23cf
      Wu Fengguang 提交于
      Auto-detect sequential mmap reads and do readahead for them.
      
      The sequential mmap readahead will be triggered when
      - sync readahead: it's a major fault and (prev_offset == offset-1);
      - async readahead: minor fault on PG_readahead page with valid readahead state.
      
      The benefits of doing readahead instead of read-around:
      - less I/O wait thanks to async readahead
      - double real I/O size and no more cache hits
      
      The single stream case is improved a little.
      For 100,000 sequential mmap reads:
      
                                          user       system    cpu        total
      (1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
      (1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
      (2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607
      
      The patched (2) has smallest total time, since it has no cache hit overheads
      and less I/O block time(thanks to async readahead). Here the I/O size
      makes no much difference, since there's only one single stream.
      
      Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
      since the half of the read-around pages will be readahead cache hits.
      
      This is going to make _real_ differences for _concurrent_ IO streams.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ac23cf
    • L
      readahead: clean up and simplify the code for filemap page fault readahead · ef00e08e
      Linus Torvalds 提交于
      This shouldn't really change behavior all that much, but the single rather
      complex function with read-ahead inside a loop etc is broken up into more
      manageable pieces.
      
      The behaviour is also less subtle, with the read-ahead being done up-front
      rather than inside some subtle loop and thus avoiding the now unnecessary
      extra state variables (ie "did_readaround" is gone).
      
      Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
      the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
      which case the original code will directly jump to no_cached_page reading.
      
      Cc: Pavel Levshin <lpk@581.spb.su>
      Cc: <wli@movementarian.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef00e08e
    • W
      readahead: move max_sane_readahead() calls into force_page_cache_readahead() · f7e839dd
      Wu Fengguang 提交于
      Impact: code simplification.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7e839dd
  2. 29 5月, 2009 1 次提交
  3. 16 4月, 2009 1 次提交
  4. 14 4月, 2009 1 次提交
  5. 04 4月, 2009 1 次提交
  6. 03 4月, 2009 2 次提交
  7. 02 3月, 2009 1 次提交
    • I
      x86, mm: dont use non-temporal stores in pagecache accesses · f1800536
      Ingo Molnar 提交于
      Impact: standardize IO on cached ops
      
      On modern CPUs it is almost always a bad idea to use non-temporal stores,
      as the regression in this commit has shown it:
      
        30d697fa: x86: fix performance regression in write() syscall
      
      The kernel simply has no good information about whether using non-temporal
      stores is a good idea or not - and trying to add heuristics only increases
      complexity and inserts fragility.
      
      The regression on cached write()s took very long to be found - over two
      years. So dont take any chances and let the hardware decide how it makes
      use of its caches.
      
      The only exception is drivers/gpu/drm/i915/i915_gem.c: there were we are
      absolutely sure that another entity (the GPU) will pick up the dirty
      data immediately and that the CPU will not touch that data before the
      GPU will.
      
      Also, keep the _nocache() primitives to make it easier for people to
      experiment with these details. There may be more clear-cut cases where
      non-cached copies can be used, outside of filemap.c.
      
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f1800536
  8. 25 2月, 2009 1 次提交
    • I
      x86, mm: pass in 'total' to __copy_from_user_*nocache() · 3255aa2e
      Ingo Molnar 提交于
      Impact: cleanup, enable future change
      
      Add a 'total bytes copied' parameter to __copy_from_user_*nocache(),
      and update all the callsites.
      
      The parameter is not used yet - architecture code can use it to
      more intelligently decide whether the copy should be cached or
      non-temporal.
      
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3255aa2e
  9. 14 1月, 2009 2 次提交
  10. 09 1月, 2009 1 次提交
    • K
      memcg: revert gfp mask fix · 2c26fdd7
      KAMEZAWA Hiroyuki 提交于
      My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
      of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
      happen at memory reclaim.
      
      But in recent discussion, it's NACKed because it sounds ugly.
      
      This patch is for reverting it and add some clean up to gfp_mask of
      callers of charge.  No behavior change but need review before generating
      HUNK in deep queue.
      
      This patch also adds explanation to meaning of gfp_mask passed to charge
      functions in memcontrol.h.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c26fdd7
  11. 07 1月, 2009 4 次提交
    • N
      mm: pagecache gfp flags fix · 67d58ac4
      Nick Piggin 提交于
      Frustratingly, gfp_t is really divided into two classes of flags.  One are
      the context dependent ones (can we sleep?  can we enter filesystem?  block
      subsystem?  should we use some extra reserves, etc.).  The other ones are
      the type of memory required and depend on how the algorithm is implemented
      rather than the point at which the memory is allocated (highmem?  dma
      memory?  etc).
      
      Some of the functions which allocate a page and add it to page cache take
      a gfp_t, but sometimes those functions or their callers aren't really
      doing the right thing: when allocating pagecache page, the memory type
      should be mapping_gfp_mask(mapping).  When allocating radix tree nodes,
      the memory type should be kernel mapped (not highmem) memory.  The gfp_t
      argument should only really be needed for context dependent options.
      
      This patch doesn't really solve that tangle in a nice way, but it does
      attempt to fix a couple of bugs.
      
      - find_or_create_page changes its radix-tree allocation to only include
        the main context dependent flags in order so the pagecache page may be
        allocated from arbitrary types of memory without affecting the
        radix-tree.  In practice, slab allocations don't come from highmem
        anyway, and radix-tree only uses slab allocations.  So there isn't a
        practical change (unless some fs uses GFP_DMA for pages).
      
      - grab_cache_page_nowait() is changed to allocate radix-tree nodes with
        GFP_NOFS, because it is not supposed to reenter the filesystem.  This
        bug could cause lock recursion if a filesystem is not expecting the
        function to reenter the fs (as-per documentation).
      
      Filesystems should be careful about exactly what semantics they want and
      what they get when fiddling with gfp_t masks to allocate pagecache.  One
      should be as liberal as possible with the type of memory that can be used,
      and same for the the context specific flags.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      67d58ac4
    • N
      mm: direct IO starvation improvement · 48b47c56
      Nick Piggin 提交于
      Direct IO can invalidate and sync a lot of pagecache pages in the mapping.
       A 4K direct IO will actually try to sync and/or invalidate the pagecache
      of the entire file, for example (which might be many GB or TB large).
      
      Improve this by doing range syncs.  Also, memory no longer has to be
      unmapped to catch the dirty bits for syncing, as dirty bits would remain
      coherent due to dirty mmap accounting.
      
      This fixes the immediate DM deadlocks when doing direct IO reads to block
      device with a mounted filesystem, if only by papering over the problem
      somewhat rather than addressing the fsync starvation cases.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48b47c56
    • N
      mm: write_cache_pages integrity fix · 05fe478d
      Nick Piggin 提交于
      In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
      so the function will return success after writing out nr_to_write pages,
      even if that was not sufficient to guarantee data integrity.
      
      The callers tend to set it to values that could break data interity
      semantics easily in practice.  For example, nr_to_write can be set to
      mapping->nr_pages * 2, however if a file has a single, dirty page, then
      fsync is called, subsequent pages might be concurrently added and dirtied,
      then write_cache_pages might writeout two of these newly dirty pages,
      while not writing out the old page that should have been written out.
      
      Fix this by ignoring nr_to_write if it is a data integrity sync.
      
      This is a data integrity bug.
      
      The reason this has been done in the past is to avoid stalling sync
      operations behind page dirtiers.
      
       "If a file has one dirty page at offset 1000000000000000 then someone
        does an fsync() and someone else gets in first and starts madly writing
        pages at offset 0, we want to write that page at 1000000000000000.
        Somehow."
      
      What we do today is return success after an arbitrary amount of pages are
      written, whether or not we have provided the data-integrity semantics that
      the caller has asked for.  Even this doesn't actually fix all stall cases
      completely: in the above situation, if the file has a huge number of pages
      in pagecache (but not dirty), then mapping->nrpages is going to be huge,
      even if pages are being dirtied.
      
      This change does indeed make the possibility of long stalls lager, and
      that's not a good thing, but lying about data integrity is even worse.  We
      have to either perform the sync, or return -ELINUXISLAME so at least the
      caller knows what has happened.
      
      There are subsequent competing approaches in the works to solve the stall
      problems properly, without compromising data integrity.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05fe478d
    • N
      mm: don't mark_page_accessed in fault path · bf3f3bc5
      Nick Piggin 提交于
      Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
      unmap-time if the pte is young has a number of problems.
      
      mark_page_accessed is supposed to be roughly the equivalent of a young pte
      for unmapped references. Unfortunately it doesn't come with any context:
      after being called, reclaim doesn't know who or why the page was touched.
      
      So calling mark_page_accessed not only adds extra lru or PG_referenced
      manipulations for pages that are already going to have pte_young ptes anyway,
      but it also adds these references which are difficult to work with from the
      context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
      wish to contribute to the page being referenced).
      
      Then, simply doing SetPageReferenced when zapping a pte and finding it is
      young, is not a really good solution either. SetPageReferenced does not
      correctly promote the page to the active list for example. So after removing
      mark_page_accessed from the fault path, several mmap()+touch+munmap() would
      have a very different result from several read(2) calls for example, which
      is not really desirable.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NJohannes Weiner <hannes@saeurebad.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf3f3bc5
  12. 06 1月, 2009 1 次提交
  13. 05 1月, 2009 1 次提交
    • N
      fs: symlink write_begin allocation context fix · 54566b2c
      Nick Piggin 提交于
      With the write_begin/write_end aops, page_symlink was broken because it
      could no longer pass a GFP_NOFS type mask into the point where the
      allocations happened.  They are done in write_begin, which would always
      assume that the filesystem can be entered from reclaim.  This bug could
      cause filesystem deadlocks.
      
      The funny thing with having a gfp_t mask there is that it doesn't really
      allow the caller to arbitrarily tinker with the context in which it can be
      called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
      take the page lock.  The only thing any callers care about is __GFP_FS
      anyway, so turn that into a single flag.
      
      Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
      this flag in their write_begin function.  Change __grab_cache_page to
      accept a nofs argument as well, to honour that flag (while we're there,
      change the name to grab_cache_page_write_begin which is more instructive
      and does away with random leading underscores).
      
      This is really a more flexible way to go in the end anyway -- if a
      filesystem happens to want any extra allocations aside from the pagecache
      ones in ints write_begin function, it may now use GFP_KERNEL (rather than
      GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
      random example).
      
      [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
      [kosaki.motohiro@jp.fujitsu.com: fix fuse]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Cleaned up the calling convention: just pass in the AOP flags
        untouched to the grab_cache_page_write_begin() function.  That
        just simplifies everybody, and may even allow future expansion of the
        logic.   - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54566b2c
  14. 31 10月, 2008 1 次提交
  15. 20 10月, 2008 3 次提交
  16. 17 10月, 2008 1 次提交
  17. 14 10月, 2008 1 次提交
    • O
      do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails · 85462323
      Oleg Nesterov 提交于
      If lock_page_killable() fails because the task was killed by SIGKILL or
      any other fatal signal, do_generic_file_read() returns -EIO.
      
      This seems to be OK, because in fact the userspace won't see this error,
      the task will dequeue SIGKILL and exit.
      
      However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and
      return to the user-space with the bogus -EIO.
      
      Change the code to return the error code from lock_page_killable(), -EINTR.
      This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this
      change the code looks a bit more logical, and the "good" init should handle
      the spurious EINTR or short read.
      
      Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR,
      but this can't prevent the short reads.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      85462323
  18. 03 9月, 2008 1 次提交
    • H
      VFS: fix dio write returning EIO when try_to_release_page fails · 6ccfa806
      Hisashi Hifumi 提交于
      Dio write returns EIO when try_to_release_page fails because bh is
      still referenced.
      
      The patch
      
          commit 3f31fddf
          Author: Mingming Cao <cmm@us.ibm.com>
          Date:   Fri Jul 25 01:46:22 2008 -0700
      
              jbd: fix race between free buffer and commit transaction
      
      was merged into 2.6.27-rc1, but I noticed that this patch is not enough
      to fix the race.
      
      I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
      sometimes got EIO through this test.
      
      The patch above fixed race between freeing buffer(dio) and committing
      transaction(jbd) but I discovered that there is another race, freeing
      buffer(dio) and ext3/4_ordered_writepage.
      
      : background_writeout()
           ->write_cache_pages()
             ->ext3_ordered_writepage()
           	   walk_page_buffers() -> take a bh ref
       	   block_write_full_page() -> unlock_page
      		: <- end_page_writeback
                      : <- race! (dio write->try_to_release_page fails)
            	   walk_page_buffers() ->release a bh ref
      
      ext3_ordered_writepage holds bh ref and does unlock_page remaining
      taking a bh ref, so this causes the race and failure of
      try_to_release_page.
      
      To fix this race, I used the approach of falling back to buffered
      writes if try_to_release_page() fails on a page.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ccfa806
  19. 05 8月, 2008 1 次提交
  20. 31 7月, 2008 1 次提交
    • L
      Fix off-by-one error in iov_iter_advance() · 94ad374a
      Linus Torvalds 提交于
      The iov_iter_advance() function would look at the iov->iov_len entry
      even though it might have iterated over the whole array, and iov was
      pointing past the end.  This would cause DEBUG_PAGEALLOC to trigger a
      kernel page fault if the allocation was at the end of a page, and the
      next page was unallocated.
      
      The quick fix is to just change the order of the tests: check that there
      is any iovec data left before we check the iov entry itself.
      
      Thanks to Alexey Dobriyan for finding this case, and testing the fix.
      Reported-and-tested-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94ad374a
  21. 29 7月, 2008 1 次提交
    • H
      vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Hisashi Hifumi 提交于
      When we read some part of a file through pagecache, if there is a
      pagecache of corresponding index but this page is not uptodate, read IO
      is issued and this page will be uptodate.
      
      I think this is good for pagesize == blocksize environment but there is
      room for improvement on pagesize != blocksize environment.  Because in
      this case a page can have multiple buffers and even if a page is not
      uptodate, some buffers can be uptodate.
      
      So I suggest that when all buffers which correspond to a part of a file
      that we want to read are uptodate, use this pagecache and copy data from
      this pagecache to user buffer even if a page is not uptodate.  This can
      reduce read IO and improve system throughput.
      
      I wrote a benchmark program and got result number with this program.
      
      This benchmark do:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close a file and umount.
      
        4: mount and again open a test file.
      
        5: pwrite randomly 300000 times on a test file.  offset is aligned
           by IO size(1024bytes).
      
        6: measure time of preading randomly 100000 times on a test file.
      
      The result was:
      	2.6.26
              330 sec
      
      	2.6.26-patched
              226 sec
      
      Arch:i386
      Filesystem:ext3
      Blocksize:1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffer/block.  So random read/write
      mixed workloads or random read after random write workloads are optimized
      with this patch under pagesize != blocksize environment.  This test result
      showed this.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024
      #define LOOP 1024*512 /* 512MB */
      
      main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      	for (i = 0; i < LOOP; i++)
      		write(fd, buf, LEN);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      
      	filesize = LEN * LOOP;
      	for (i = 0; i < 300000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pwrite(fd, buf, LEN, offset);
      	}
      	printf("start test\n");
      	time(&t1);
      	for (i = 0; i < 100000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pread(fd, buf, LEN, offset);
      	}
      	time(&t2);
      	printf("%ld sec\n", t2-t1);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      }
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ab22b9a
  22. 27 7月, 2008 4 次提交
  23. 26 7月, 2008 1 次提交
    • K
      memcg: remove refcnt from page_cgroup · 69029cd5
      KAMEZAWA Hiroyuki 提交于
      memcg: performance improvements
      
      Patch Description
       1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely ppatch
       5/5 ... remove redundunt check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from user's view.
      
      This patch also removes unnecessary calls in rmap.c which was used only for
      refcnt mangement.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69029cd5