1. 27 2月, 2011 6 次提交
    • T
      ext4: remove page_skipped hackery in ext4_da_writepages() · ee6ecbcc
      Theodore Ts'o 提交于
      Because the ext4 page writeback codepath had been prematurely calling
      clear_page_dirty_for_io(), if it turned out that a particular page
      couldn't be written out during a particular pass of
      write_cache_pages_da(), the page would have to get redirtied by
      calling redirty_pages_for_writeback().  Not only was this wasted work,
      but redirty_page_for_writeback() would increment wbc->pages_skipped to
      signal to writeback_sb_inodes() that buffers were locked, and that it
      should skip this inode until later.
      
      Since this signal was incorrect in ext4's case --- which was caused by
      ext4's historically incorrect use of write_cache_pages() ---
      ext4_da_writepages() saved and restored wbc->skipped_pages to avoid
      confusing writeback_sb_inodes().
      
      Now that we've fixed ext4 to call clear_page_dirty_for_io() right
      before initiating the page I/O, we can nuke the page_skipped
      save/restore hackery, and breathe a sigh of relief.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      ee6ecbcc
    • T
      ext4: clear the dirty bit for a page in writeback at the last minute · 97498956
      Theodore Ts'o 提交于
      Move when we call clear_page_dirty_for_io() to just before we actually
      write the page.  This simplifies the code somewhat, and avoids marking
      pages as clean and then needing to remark them as dirty later.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      97498956
    • T
      ext4: simple cleanups to write_cache_pages_da() · 4f01b02c
      Theodore Ts'o 提交于
      Eliminate duplicate code, unneeded variables, etc., to make it easier
      to understand the code.  No behavioral changes were made in this patch.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4f01b02c
    • T
      ext4: fold __mpage_da_writepage() into write_cache_pages_da() · 8eb9e5ce
      Theodore Ts'o 提交于
      Fold the __mpage_da_writepage() function into write_cache_pages_da().
      This will give us opportunities to clean up and simplify the resulting
      code.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8eb9e5ce
    • C
      ext4: fix ext4_da_block_invalidatepages() to handle page range properly · c7f5938a
      Curt Wohlgemuth 提交于
      If ext4_da_block_invalidatepages() is called because of a
      failure from ext4_map_blocks() in mpage_da_map_and_submit(),
      it's supposed to clean up -- including unlock -- all the
      pages in the mpd structure.  But these values may not match
      up, even on a system in which block size == page size:
      
         mpd->b_blocknr != mpd->first_page
         mpd->b_size != (mpd->next_page - mpd->first_page)
      
      ext4_da_block_invalidatepages() has been using b_blocknr and
      b_size; this patch changes it to use first_page and
      next_page.
      
      Tested:  I injected a small number (5%) of failures in
      ext4_map_blocks() in the case that the flags contain
      EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
      kernel.  Without this patch, I got hung tasks every time.
      With this patch, I see no hangs in many runs of fsstress.
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      c7f5938a
    • C
      ext4: mark multi-page IO complete on mapping failure · e0fd9b90
      Curt Wohlgemuth 提交于
      In mpage_da_map_and_submit(), if we have a delayed block
      allocation failure from ext4_map_blocks(), we need to mark
      the IO as complete, by setting
      
            mpd->io_done = 1;
      
      Otherwise, we could end up submitting the pages in an outer
      loop; since they are unlocked on mapping failure in
      ext4_da_block_invalidatepages(), this will cause a bug check
      in mpage_da_submit_io().
      
      I tested this by injected failures into ext4_map_blocks().
      Without this patch, a simple fsstress run will bug check;
      with the patch, it works fine.
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e0fd9b90
  2. 22 2月, 2011 1 次提交
  3. 14 1月, 2011 1 次提交
  4. 11 1月, 2011 6 次提交
  5. 17 12月, 2010 2 次提交
  6. 15 12月, 2010 1 次提交
    • T
      ext4: Turn off multiple page-io submission by default · 1449032b
      Theodore Ts'o 提交于
      Jon Nelson has found a test case which causes postgresql to fail with
      the error:
      
      psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
      
      Under memory pressure, it looks like part of a file can end up getting
      replaced by zero's.  Until we can figure out the cause, we'll roll
      back the change and use block_write_full_page() instead of
      ext4_bio_write_page().  The new, more efficient writing function can
      be used via the mount option mblk_io_submit, so we can test and fix
      the new page I/O code.
      
      To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
      memory such that the system just at the end of triggering writeback
      before running the following sql script:
      
      begin;
      create temporary table foo as select x as a, ARRAY[x] as b FROM
      generate_series(1, 10000000 ) AS x;
      create index foo_a_idx on foo (a);
      create index foo_b_idx on foo USING GIN (b);
      rollback;
      
      If the temporary table is created on a hard drive partition which is
      encrypted using dm_crypt, then under memory pressure, approximately
      30-40% of the time, pgsql will issue the above failure.
      
      This patch should fix this problem, and the problem will come back if
      the file system is mounted with the mblk_io_submit mount option.
      Reported-by: NJon Nelson <jnelson@jamponi.net>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1449032b
  7. 15 11月, 2010 1 次提交
  8. 09 11月, 2010 1 次提交
  9. 02 11月, 2010 2 次提交
  10. 29 10月, 2010 1 次提交
  11. 28 10月, 2010 17 次提交
    • D
      ext4: optimize orphan_list handling for ext4_setattr · 3d287de3
      Dmitry Monakhov 提交于
      Surprisingly chown() on ext4 is not SMP scalable operation. 
      Due to unconditional orphan_del(NULL, inode) in ext4_setattr()
      result in significant performance overhead because of global orphan
      mutex, especially in no-journal mode (where orphan_add() is noop).
      It is possible to skip explicit orphan_del if possible.
      Results of fchown() micro-benchmark in no-journal mode
      while (1) {
         iteration++;
         fchown(fd, uid, gid);
         fchown(fd, uid + 1, gid + 1)
      }
      measured: iterations per millisecond
      | nr_tasks | w/o patch | with patch |
      |        1 |       142 |        185 |
      |        4 |       109 |        642 |
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      3d287de3
    • T
      ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static · 4a873a47
      Theodore Ts'o 提交于
      Fix a namespace leak by moving the function to the file where it is
      used and making it static.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4a873a47
    • T
      ext4: make various ext4 functions be static · 1f109d5a
      Theodore Ts'o 提交于
      These functions have no need to be exported beyond file context.
      
      No functions needed to be moved for this commit; just some function
      declarations changed to be static and removed from header files.
      
      (A similar patch was submitted by Eric Sandeen, but I wanted to handle
      code movement in separate patches to make sure code changes didn't
      accidentally get dropped.)
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1f109d5a
    • E
      ext4: update writeback_index based on last page scanned · 72f84e65
      Eric Sandeen 提交于
      As pointed out in a prior patch, updating the mapping's
      writeback_index based on pages written isn't quite right;
      what the writeback index is really supposed to reflect is
      the next page which should be scanned for writeback during
      periodic flush.
      
      As in write_cache_pages(), write_cache_pages_da() does
      this scanning for us as we assemble the mpd for later
      writeout.  If we keep track of the next page after the
      current scan, we can easily update writeback_index without
      worrying about pages written vs. pages skipped, etc.
      
      Without this, an fsync will reset writeback_index to
      0 (its starting index) + however many pages it wrote, which
      can mess up the progress of periodic flush.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      72f84e65
    • E
      ext4: implement writeback livelock avoidance using page tagging · 5b41d924
      Eric Sandeen 提交于
      This is analogous to Jan Kara's commit,
      f446daae
      mm: implement writeback livelock avoidance using page tagging
      
      but since we forked write_cache_pages, we need to reimplement
      it there (and in ext4_da_writepages, since range_cyclic handling
      was moved to there)
      
      If you start a large buffered IO to a file, and then set
      fsync after it, you'll find that fsync does not complete
      until the other IO stops.
      
      If you continue re-dirtying the file (say, putting dd
      with conv=notrunc in a loop), when fsync finally completes
      (after all IO is done), it reports via tracing that
      it has written many more pages than the file contains;
      in other words it has synced and re-synced pages in
      the file multiple times.
      
      This then leads to problems with our writeback_index
      update, since it advances it by pages written, and
      essentially sets writeback_index off the end of the
      file...
      
      With the following patch, we only sync as much as was
      dirty at the time of the sync.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5b41d924
    • E
      ext4: tidy up a void argument in inode.c · bbd08344
      Eric Sandeen 提交于
      This doesn't fix anything at all, it just removes a vestige
      of prior use from __mpage_da_writepage()
      
      __mpage_da_writepage() had a *void argument leftover from
      its previous life as a callback; make it reflect the actual type.
      
      Fixing this up makes it slightly more obvious to read, and 
      enables proper typechecking.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bbd08344
    • N
      ext4: Check return value of sb_getblk() and friends · 87783690
      Namhyung Kim 提交于
      Fail block allocation if sb_getblk() returns NULL. In that case,
      sb_find_get_block() also likely to fail so that it should skip
      calling ext4_forget().
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      87783690
    • T
      ext4: use bio layer instead of buffer layer in mpage_da_submit_io · bd2d0210
      Theodore Ts'o 提交于
      Call the block I/O layer directly instad of going through the buffer
      layer.  This should give us much better performance and scalability,
      as well as lowering our CPU utilization when doing buffered writeback.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bd2d0210
    • T
      ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io() · 1de3e3df
      Theodore Ts'o 提交于
      This massively simplifies the ext4_da_writepages() code path by
      completely removing mpage_put_bnr_bhs(), which is almost 100 lines of
      code iterating over a set of pages using pagevec_lookup(), and folds
      that functionality into mpage_da_submit_io()'s existing
      pagevec_lookup() loop.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1de3e3df
    • T
      ext4: inline walk_page_buffers() into mpage_da_submit_io · 3ecdb3a1
      Theodore Ts'o 提交于
      Expand the call:
      
        if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
                              ext4_bh_delay_or_unwritten))
      	goto redirty_page
      
      into mpage_da_submit_io().
      
      This will allow us to merge in mpage_put_bnr_to_bhs() in the next
      patch.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      3ecdb3a1
    • T
      ext4: inline ext4_writepage() into mpage_da_submit_io() · cb20d518
      Theodore Ts'o 提交于
      As a prepratory step to switching to bio_submit, inline
      ext4_writepage() into mpage_da_submit() and then simplify things a
      bit.  This makes it clearer what mpage_da_submit needs to do.
      
      Also, move the ClearPageChecked(page) call into
      __ext4_journalled_writepage(), as a minor bit of cleanup refactoring.
      
      This also allows us to pull i_size_read() and
      ext4_should_journal_data() out of the loop, which should be a very
      minor CPU savings.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      cb20d518
    • T
      ext4: simplify ext4_writepage() · a42afc5f
      Theodore Ts'o 提交于
      The actual code in ext4_writepage() is unnecessarily convoluted.
      Simplify it so it is easier to understand, but otherwise logically
      equivalent.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a42afc5f
    • T
      ext4: call mpage_da_submit_io() from mpage_da_map_blocks() · 5a87b7a5
      Theodore Ts'o 提交于
      Eventually we need to completely reorganize the ext4 writepage
      callpath, but for now, we simplify things a little by calling
      mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
      places where we call mpage_da_map_blocks() it is followed up by a call
      to mpage_da_submit_io().
      
      We're also a wee bit better with respect to error handling, but there
      are still a number of issues where it's not clear what the right thing
      is to do with ext4 functions deep in the writeback codepath fails.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5a87b7a5
    • E
      ext4: queue conversion after adding to inode's completed IO list · c999af2b
      Eric Sandeen 提交于
      By queuing the io end on the unwritten workqueue before adding it
      to our inode's list of completed IOs, I think we run the risk
      of the work getting completed, and the IO freed, before we try
      to add it to the inode's i_completed_io_list.
      
      It should be safe to add it to the inode's list of completed
      IOs, and -then- queue it for completion, I think.
      
      Thanks to Dave Chinner for pointing out the race.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NJiaying Zhang <jiayingz@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      c999af2b
    • T
      ext4: fix potential infinite loop in ext4_da_writepages() · 0c9169cc
      Toshiyuki Okajima 提交于
      On linux-2.6.36-rc2, if we execute the following script, we can hang
      the system when the /bin/sync command is executed:
      
      ========================================================================
      #!/bin/sh
      
      echo -n "HANG UP TEST: "
      /bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
      /sbin/mkfs.ext4 -Fq /tmp/img
      /bin/mount -o loop -t ext4 /tmp/img /mnt
      /bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
      seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
      /bin/sync
      /bin/umount /mnt
      echo "DONE"
      exit 0
      ========================================================================
      
      We can see the following backtrace if we get the kdump when this
      hangup occurs:
      
      ======================================================================
      kthread()
      => bdi_writeback_thread()
         => wb_do_writeback()
            => wb_writeback()
               => writeback_inodes_wb()
                  => writeback_sb_inodes()
                     => writeback_single_inode()
                        => ext4_da_writepages()  ---+ 
                                      ^ infinite    |
                                      |   loop      |
                                      +-------------+
      ======================================================================
      
      The reason why this hangup happens is described as follows:
      1) We write the last extent block of the file whose size is the filesystem 
         maximum size.
      2) "BH_Delay" flag is set on the buffer_head of its block.
      3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
         - the member, "m_len" of struct mpage_da_data is 1
        mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
        cannot clear "BH_Delay" flag of the buffer_head because the type of
        m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.
      
        Therefore an infinite loop occurs because ext4_da_writepages()
        cannot write the page (which corresponds to the block) since
        "BH_Delay" flag isn't cleared.
      ----------------------------------------------------------------------
      static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
      				struct ext4_map_blocks *map)
      {
      ...
      	int blocks = map->m_len;
      ...
      		do {
      			// cur_logical = 4294967295
      			// map->m_lblk = 4294967295
      			// blocks = 1
      			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
      			// (cur_logical >= map->m_lblk + blocks) => true
      			if (cur_logical >= map->m_lblk + blocks)
      				break;
      ----------------------------------------------------------------------
      
      NOTE: Mounting with the nodelalloc option will avoid this codepath,
      and thus, avoid this hang
      Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0c9169cc
    • E
      ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 · b443e733
      Eric Sandeen 提交于
      I'm uneasy with lots of stuff going on in ext4_da_writepages(),
      but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
      making anything better, so avoid the multiplier in that case.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      b443e733
    • E
      ext4: stop looping in ext4_num_dirty_pages when max_pages reached · 659c6009
      Eric Sandeen 提交于
      Today we simply break out of the inner loop when we have accumulated
      max_pages; this keeps scanning forwad and doing pagevec_lookup_tag()
      in the while (!done) loop, this does potentially a lot of work
      with no net effect.
      
      When we have accumulated max_pages, just clean up and return.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      659c6009
  12. 26 10月, 2010 1 次提交