1. 21 Mar 2011, 3 commits
    • ext4: handle errors in ext4_rename · ef607893
      Amir Goldstein committed
      Checking return code from ext4_journal_get_write_access() is important
      with snapshots, because this function invokes COW, so may return new
      errors, such as ENOSPC.
      
      We move the call to ext4_journal_get_write_access earlier in the
      function, to simplify error handling in the case that this function
      returns an error.
      Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • jbd2: add COW fields to struct jbd2_journal_handle · 93737456
      Amir Goldstein committed
      Add fields needed for the copy-on-write ext4 development work.
      
      The h_cowing flag is used by ext4 snapshots code to mark the task in
      COWING state.
      
      The h_XXX_credits fields are used to track buffer credits usage
      (accounted by COW and non-COW operations).
      
      The h_cow_XXX fields are used as per-task debugging counters.
      
      Merging this commit into mainline will allow users to test ext4
      snapshots as a standalone module, without the need to patch and
      install a development kernel.
      Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • jbd2: add the b_cow_tid field to journal_head struct · c2cc7028
      Amir Goldstein committed
      The b_cow_tid field will be used by the ext4 snapshots code to store
      the transaction id when the buffer was last cowed.
      
      Merging this patch to mainline will allow users to test ext4 snapshots
      as a standalone module, without the need to patch and install a
      development kernel.
      
      On 64-bit machines this field fills in a padding "hole" and does
      not increase the size of the struct.  On a 32-bit machine this patch
      increases the size of the struct from 60 to 64 bytes.
      Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
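The size claim in the b_cow_tid commit follows from C struct alignment rules. A minimal sketch, assuming a hypothetical miniature layout (the `demo_head_before` and `demo_head_after` structs and `demo_size_growth()` are illustrative names, not the real struct journal_head): with 8-byte pointers a new 4-byte field can occupy existing padding, while with 4-byte pointers it grows the struct.

```c
#include <stddef.h>

/* Hypothetical miniature layout, NOT the real struct journal_head. */
struct demo_head_before {
    void *bh;       /* 8 bytes with 64-bit pointers, 4 with 32-bit */
    int   jcount;   /* 4 bytes */
    /* with 8-byte pointers, 4 bytes of padding sit here so that
     * the following pointer stays 8-byte aligned */
    void *next;
};

struct demo_head_after {
    void *bh;
    int   jcount;
    unsigned int cow_tid;   /* new 4-byte field, placed right after jcount */
    void *next;
};

/* Number of bytes the new field adds on the ABI this is compiled for. */
static size_t demo_size_growth(void)
{
    return sizeof(struct demo_head_after) - sizeof(struct demo_head_before);
}
```

On a typical 64-bit ABI `demo_size_growth()` reports no growth, because the field lands in the former padding; with 4-byte pointers it reports the size of the added field.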
  2. 17 Mar 2011, 1 commit
    • ext4: Initialize fsync transaction ids in ext4_new_inode() · 688f869c
      Theodore Ts'o committed
      When allocating a new inode, we need to make sure i_sync_tid and
      i_datasync_tid are initialized.  Otherwise, one or both of these two
      values could be left initialized to zero, which could potentially
      result in BUG_ON in jbd2_journal_commit_transaction.
      
      (This could happen by having journal->commit_request getting set to
      zero, which could wake up the kjournald process even though there is
      no running transaction, which then causes a BUG_ON via the
      J_ASSERT(j_running_transaction != NULL) statement.)
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  3. 06 Mar 2011, 1 commit
  4. 01 Mar 2011, 1 commit
    • ext4: optimize ext4_bio_write_page() when no extent conversion is needed · b6168443
      Theodore Ts'o committed
      If no extent conversion is required, wake up any processes waiting for
      the page's writeback to be complete and free the ext4_io_end structure
      directly in ext4_end_bio() instead of dropping it on the linked list
      (which requires taking a spinlock to queue and dequeue the io_end
      structure), and waiting for the workqueue to do this work.
      
      This removes an extra scheduling delay before a process waiting for an
      fsync() to complete gets woken up, and it also reduces the CPU
      overhead for a random write workload.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  5. 28 Feb 2011, 6 commits
    • ext4: skip orphan cleanup if fs has unknown ROCOMPAT features · d39195c3
      Amir Goldstein committed
      Orphan cleanup is currently executed even if the file system has some
      number of unknown ROCOMPAT features, which deletes inodes and frees
      blocks, which could be very bad for some RO_COMPAT features,
      especially the SNAPSHOT feature.
      
      This patch skips the orphan cleanup if the file system contains
      read-only compatible features not known by this ext4 implementation,
      which would prevent the fs from being mounted (or remounted) read-write.
      Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
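The check described in the ROCOMPAT commit amounts to a bitmask test against the set of features the implementation understands. A minimal sketch, assuming a made-up supported-feature mask and hypothetical helper names (`DEMO_ROCOMPAT_SUPPORTED`, `demo_has_unknown_rocompat()`, `demo_may_clean_orphans()`); the real kernel uses its own constants and superblock accessors:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask of read-only-compatible features this implementation
 * understands; the value is made up for illustration. */
#define DEMO_ROCOMPAT_SUPPORTED 0x0000007Fu

/* True if the on-disk flags contain RO_COMPAT bits we do not know. */
static bool demo_has_unknown_rocompat(uint32_t ro_compat_flags)
{
    return (ro_compat_flags & ~DEMO_ROCOMPAT_SUPPORTED) != 0;
}

/* Orphan cleanup deletes inodes and frees blocks, so allow it only when
 * every RO_COMPAT feature on disk is understood. */
static bool demo_may_clean_orphans(uint32_t ro_compat_flags)
{
    return !demo_has_unknown_rocompat(ro_compat_flags);
}
```

With this mask, a file system advertising only bit 0x01 would be cleaned, while one that also sets an unknown bit such as 0x80 would be left untouched.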
    • ext4: use the nblocks arg to ext4_truncate_restart_trans() · 8e8eaabe
      Amir Goldstein committed
      nblocks is passed into ext4_truncate_restart_trans() from
      ext4_ext_truncate_extend_restart() with a value different from the default
      blocks_for_truncate(), but is being ignored.
      
      The two other calls to ext4_truncate_restart_trans() already pass the
      default value, which is then being recalculated inside the function.
      
      Fix the problem by using the passed argument.
      Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
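The nblocks problem is the familiar "parameter accepted but ignored" bug. A minimal sketch of the before and after behavior, using hypothetical stand-ins (`demo_blocks_for_truncate()`, `demo_restart_credits_before()`, `demo_restart_credits_after()`) and a made-up default value; it does not reproduce the real journal credit logic:

```c
/* Hypothetical stand-in for the default credit estimate; the constant is
 * made up for illustration. */
static int demo_blocks_for_truncate(void)
{
    return 8;
}

/* Before the fix, as the commit describes it: the argument is accepted,
 * but the default is recalculated and the caller's value is ignored. */
static int demo_restart_credits_before(int nblocks)
{
    (void)nblocks;
    return demo_blocks_for_truncate();
}

/* After the fix: the passed argument decides how many credits are used. */
static int demo_restart_credits_after(int nblocks)
{
    return nblocks;
}
```

Callers that already pass the default see no change, which matches the commit's note about the two other call sites; only a caller passing a different value is affected.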
    • ext4: fix missing iput of root inode for some mount error paths · 32a9bb57
      Manish Katiyar committed
      This assures that the root inode is not leaked, and that sb->s_root is
      NULL, which will prevent generic_shutdown_super() from doing extra
      work, including calling sync_filesystem(), which ultimately results in
      ext4_sync_fs() getting called with an uninitialized struct super,
      which is the cause of the crash noted in Kernel Bugzilla #26752.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=26752
      Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: make FIEMAP and delayed allocation play well together · 6d9c85eb
      Yongqiang Yang committed
      Fix the FIEMAP ioctl so that it returns all of the page ranges which
      are still subject to delayed allocation.  We were missing some cases
      if the file was sparse.
      
      Reported by Chris Mason <chris.mason@oracle.com>:
      >We've had reports on btrfs that cp is giving us files full of zeros
      >instead of actually copying them.  It was tracked down to a bug with
      >the btrfs fiemap implementation where it was returning holes for
      >delalloc ranges.
      >
      >Newer versions of cp are trusting fiemap to tell it where the holes
      >are, which does seem like a pretty neat trick.
      >
      >I decided to give xfs and ext4 a shot with a few test cases too, xfs
      >passed with all the ones btrfs was getting wrong, and ext4 got the basic
      >delalloc case right.
      >$ mkfs.ext4 /dev/xxx
      >$ mount /dev/xxx /mnt
      >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1
      >$ fiemap-test foo
      >ext:   0 logical: [       0..     255] phys:        0..     255
      >flags: 0x007 tot: 256
      >
      >Hooray!  But once we throw a hole in, things go bad:
      >$ mkfs.ext4 /dev/xxx
      >$ mount /dev/xxx /mnt
      >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
      >$ fiemap-test foo
      >< no output >
      >
      >We've got a delalloc extent after the hole and ext4 fiemap didn't find
      >it.  If I run sync to kick the delalloc out:
      >$ sync
      >$ fiemap-test foo
      >ext:   0 logical: [     256..     511] phys:    34048..   34303
      >flags: 0x001 tot: 256
      >
      >fiemap-test is sitting in my /usr/local/bin, and I have no idea how it
      >got there.  It's full of pretty comments so I know it isn't mine, but
      >you can grab it here:
      >
      >http://oss.oracle.com/~mason/fiemap-test.c
      >
      >xfsqa has a fiemap program too.
      
      After the fix, the test results are as follows:
      ext:   0 logical: [     256..     511] phys:        0..     255
      flags: 0x007 tot: 256
      ext:   0 logical: [     256..     511] phys:    33280..   33535
      flags: 0x001 tot: 256
      
      $ mkfs.ext4 /dev/xxx
      $ mount /dev/xxx /mnt
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
      $ sync
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5
      $ fiemap-test foo
      ext:   0 logical: [     256..     511] phys:    33280..   33535
      flags: 0x000 tot: 256
      ext:   1 logical: [     768..    1023] phys:        0..     255
      flags: 0x006 tot: 256
      ext:   2 logical: [    1280..    1535] phys:        0..     255
      flags: 0x007 tot: 256
      Tested-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: suppress verbose debugging information if malloc-debug is off · 4dd89fc6
      Theodore Ts'o committed
      If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due
      to disk being full, a verbose debugging message is printed, even if
      the malloc-debug switch has not been enabled.  Suppress the debugging
      message so that nothing is printed unless malloc-debug has been turned
      on.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't leave PageWriteback set after memory failure · a54aa761
      Theodore Ts'o committed
      In ext4_bio_write_page(), if the memory allocation for the struct
      ext4_io_page fails, it returns with the page's PageWriteback flag set.
      This will end up causing the page not to skip writeback in
      WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or
      umount) the writeback daemon will get stuck forever on the
      wait_on_page_writeback() function in write_cache_pages_da().
      
      Or, if journalling is enabled and the file gets deleted, the
      journal thread can get stuck in the filemap_fdatawait() call made by
      journal_finish_inode_data_buffers().
      
      Another place where things can get hung up is in
      truncate_inode_pages(), called out of ext4_evict_inode().
      
      Fix this by not setting PageWriteback until after we have successfully
      allocated the struct ext4_io_page.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
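The ordering fix in the PageWriteback commit can be illustrated in user space. A minimal sketch, assuming a hypothetical `struct demo_page` that models only the writeback flag and a `fail_alloc` switch that simulates the allocation failure; `demo_bio_write_page()` is an illustrative name and not the kernel function:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page; only the writeback flag is modeled. */
struct demo_page {
    bool writeback;
};

/* Allocate the per-page I/O bookkeeping first, and mark the page as under
 * writeback only after that allocation has succeeded.
 * fail_alloc simulates an allocation failure. */
static int demo_bio_write_page(struct demo_page *page, bool fail_alloc)
{
    void *io_page = fail_alloc ? NULL : malloc(32);

    if (io_page == NULL)
        return -ENOMEM;     /* flag untouched, so no waiter can hang on it */

    page->writeback = true; /* set only after the allocation succeeded */

    /* Real code would submit the I/O here and clear the flag on completion;
     * this sketch just releases the bookkeeping. */
    free(io_page);
    return 0;
}
```

On the failure path the function returns `-ENOMEM` with the flag still clear, which is the property the commit restores.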
  6. 27 Feb 2011, 9 commits
    • ext4: move setup of the mpd structure to write_cache_pages_da() · 168fc022
      Theodore Ts'o committed
      Move the initialization of all of the fields of the mpd structure to
      write_cache_pages_da().  This simplifies the code considerably.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't lock the next page in write_cache_pages if not needed · 78aaced3
      Theodore Ts'o committed
      If we have accumulated a contiguous region of memory to be written
      out, and the next page can be added to this region, don't bother locking
      (and then unlocking) the page before writing out the memory.  In the
      unlikely event that the next page was being written back by some other
      CPU, we can also skip waiting for that page to finish writeback.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: remove page_skipped hackery in ext4_da_writepages() · ee6ecbcc
      Theodore Ts'o committed
      Because the ext4 page writeback codepath had been prematurely calling
      clear_page_dirty_for_io(), if it turned out that a particular page
      couldn't be written out during a particular pass of
      write_cache_pages_da(), the page would have to get redirtied by
      calling redirty_page_for_writepage().  Not only was this wasted work,
      but redirty_page_for_writepage() would increment wbc->pages_skipped to
      signal to writeback_sb_inodes() that buffers were locked, and that it
      should skip this inode until later.
      
      Since this signal was incorrect in ext4's case --- which was caused by
      ext4's historically incorrect use of write_cache_pages() ---
      ext4_da_writepages() saved and restored wbc->pages_skipped to avoid
      confusing writeback_sb_inodes().
      
      Now that we've fixed ext4 to call clear_page_dirty_for_io() right
      before initiating the page I/O, we can nuke the page_skipped
      save/restore hackery, and breathe a sigh of relief.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: clear the dirty bit for a page in writeback at the last minute · 97498956
      Theodore Ts'o committed
      Move when we call clear_page_dirty_for_io() to just before we actually
      write the page.  This simplifies the code somewhat, and avoids marking
      pages as clean and then needing to remark them as dirty later.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: simple cleanups to write_cache_pages_da() · 4f01b02c
      Theodore Ts'o committed
      Eliminate duplicate code, unneeded variables, etc., to make it easier
      to understand the code.  No behavioral changes were made in this patch.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fold __mpage_da_writepage() into write_cache_pages_da() · 8eb9e5ce
      Theodore Ts'o committed
      Fold the __mpage_da_writepage() function into write_cache_pages_da().
      This will give us opportunities to clean up and simplify the resulting
      code.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: enable mblk_io_submit by default · 6fd7a467
      Theodore Ts'o committed
      Now that we've fixed the file corruption bug in commit d50bdd5a,
      it's time to enable mblk_io_submit by default.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix ext4_da_block_invalidatepages() to handle page range properly · c7f5938a
      Curt Wohlgemuth committed
      If ext4_da_block_invalidatepages() is called because of a
      failure from ext4_map_blocks() in mpage_da_map_and_submit(),
      it's supposed to clean up -- including unlock -- all the
      pages in the mpd structure.  But these values may not match
      up, even on a system in which block size == page size:
      
         mpd->b_blocknr != mpd->first_page
         mpd->b_size != (mpd->next_page - mpd->first_page)
      
      ext4_da_block_invalidatepages() has been using b_blocknr and
      b_size; this patch changes it to use first_page and
      next_page.
      
      Tested:  I injected a small number (5%) of failures in
      ext4_map_blocks() in the case that the flags contain
      EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
      kernel.  Without this patch, I got hung tasks every time.
      With this patch, I see no hangs in many runs of fsstress.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: mark multi-page IO complete on mapping failure · e0fd9b90
      Curt Wohlgemuth committed
      In mpage_da_map_and_submit(), if we have a delayed block
      allocation failure from ext4_map_blocks(), we need to mark
      the IO as complete, by setting
      
            mpd->io_done = 1;
      
      Otherwise, we could end up submitting the pages in an outer
      loop; since they are unlocked on mapping failure in
      ext4_da_block_invalidatepages(), this will cause a bug check
      in mpage_da_submit_io().
      
      I tested this by injecting failures into ext4_map_blocks().
      Without this patch, a simple fsstress run will bug check;
      with the patch, it works fine.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  7. 25 Feb 2011, 5 commits
    • ext4: mballoc: don't replace the current preallocation group unnecessarily · 5a54b2f1
      Coly Li committed
      In ext4_mb_check_group_pa(), the current preallocation space is
      replaced with a new preallocation space when the two have the same
      distance from the goal block.
      
      This doesn't actually gain us anything, so change things so that the
      function only switches to the new preallocation group if its distance
      from the goal block is strictly smaller than the current preallocation
      group's distance from the goal block.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
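The selection rule from the mballoc preallocation commit can be sketched as a comparison that switches only on a strict improvement. This is an illustration under simplifying assumptions: `struct demo_pa`, `demo_distance()` and `demo_check_group_pa()` are hypothetical names, and the real function also deals with reference counts that are omitted here.

```c
#include <stddef.h>

/* Hypothetical, simplified preallocation descriptor. */
struct demo_pa {
    unsigned long pstart;   /* first physical block of the preallocation */
};

/* Absolute distance between the goal block and a preallocation start. */
static unsigned long demo_distance(unsigned long goal, unsigned long pstart)
{
    return goal > pstart ? goal - pstart : pstart - goal;
}

/* Return the preallocation to keep: the current one (cpa) unless the
 * candidate (pa) is strictly closer to the goal block. */
static const struct demo_pa *demo_check_group_pa(unsigned long goal,
                                                 const struct demo_pa *pa,
                                                 const struct demo_pa *cpa)
{
    if (cpa == NULL)
        return pa;      /* nothing selected yet */
    if (demo_distance(goal, pa->pstart) < demo_distance(goal, cpa->pstart))
        return pa;      /* strictly closer: switch */
    return cpa;         /* tie or farther: keep the current one */
}
```

With a goal of 100, a current preallocation at 90 is kept when the candidate starts at 110 (equal distance) and replaced when the candidate starts at 105.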
    • ext4: clarify description of ac_g_ex in struct ext4_allocation_context · 58696f3a
      Coly Li committed
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • mballoc: add comments to ext4_mb_mark_free_simple() · 7c786059
      Coly Li committed
      This patch adds comments to ext4_mb_mark_free_simple to make it more
      understandable.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • ext4: remove unncessary call mb_find_buddy() in debugging code · 235772da
      Coly Li committed
      In __mb_check_buddy(), look at the code below:
        591         fstart = -1;
        592         buddy = mb_find_buddy(e4b, 0, &max);
        593         for (i = 0; i < max; i++) {
        594                 if (!mb_test_bit(i, buddy)) {
        595                         MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
        596                         if (fstart == -1) {
        597                                 fragments++;
        598                                 fstart = i;
        599                         }
        600                         continue;
        601                 }
        602                 fstart = -1;
        603                 /* check used bits only */
        604                 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
        605                         buddy2 = mb_find_buddy(e4b, j, &max2);
        606                         k = i >> j;
        607                         MB_CHECK_ASSERT(k < max2);
        608                         MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
        609                 }
        610         }
        611         MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
        612         MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
        613
        614         grp = ext4_get_group_info(sb, e4b->bd_group);
        615         buddy = mb_find_buddy(e4b, 0, &max);
      
      On line 592, buddy is fetched by mb_find_buddy() with order 0.  Between
      line 593 and line 615, buddy is not changed, so there is no need to
      fetch it again from mb_find_buddy() with order 0.
      
      We can safely remove the second mb_find_buddy() on line 615.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • ext4: code cleanup in mb_find_buddy() · 84b775a3
      Coly Li committed
      The current code calculates max regardless of whether order is zero,
      which is unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits
      + 3)" only when order == 0.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
  8. 24 Feb 2011, 4 commits
  9. 22 Feb 2011, 4 commits
  10. 17 Feb 2011, 1 commit
    • vfs: fix BUG_ON() in fs/namei.c:1461 · 3abb17e8
      Linus Torvalds committed
      When Al moved the nameidata_dentry_drop_rcu_maybe() call into the
      do_follow_link function in commit 844a3917 ("nothing in
      do_follow_link() is going to see RCU"), he mistakenly left the
      
      	BUG_ON(inode != path->dentry->d_inode);
      
      behind.  Which would otherwise be ok, but that BUG_ON() really needs to
      be _after_ dropping RCU, since the dentry isn't necessarily stable
      otherwise.
      
      So complete the code movement in that commit, and move the BUG_ON() into
      do_follow_link() too.  This means that we need to pass in 'inode' as an
      argument (just for this one use), but that's a small thing.  And
      eventually we may be confident enough in our path lookup that we can
      just remove the BUG_ON() and the unnecessary inode argument.
      Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 16 Feb 2011, 5 commits