1. 27 Feb, 2011 8 commits
    • ext4: don't lock the next page in write_cache_pages if not needed · 78aaced3
      Committed by Theodore Ts'o
      If we have accumulated a contiguous region of memory to be written
      out, and the next page can be added to this region, don't bother locking
      (and then unlocking) the page before writing out the memory.  In the
      unlikely event that the next page was being written back by some other
      CPU, we can also skip waiting for that page to finish writeback.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: remove page_skipped hackery in ext4_da_writepages() · ee6ecbcc
      Committed by Theodore Ts'o
      Because the ext4 page writeback codepath had been prematurely calling
      clear_page_dirty_for_io(), if it turned out that a particular page
      couldn't be written out during a particular pass of
      write_cache_pages_da(), the page would have to get redirtied by
      calling redirty_page_for_writepage().  Not only was this wasted work,
      but redirty_page_for_writepage() would increment wbc->pages_skipped to
      signal to writeback_sb_inodes() that buffers were locked, and that it
      should skip this inode until later.
      
      Since this signal was incorrect in ext4's case --- which was caused by
      ext4's historically incorrect use of write_cache_pages() ---
      ext4_da_writepages() saved and restored wbc->pages_skipped to avoid
      confusing writeback_sb_inodes().
      
      Now that we've fixed ext4 to call clear_page_dirty_for_io() right
      before initiating the page I/O, we can nuke the page_skipped
      save/restore hackery, and breathe a sigh of relief.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: clear the dirty bit for a page in writeback at the last minute · 97498956
      Committed by Theodore Ts'o
      Move when we call clear_page_dirty_for_io() to just before we actually
      write the page.  This simplifies the code somewhat, and avoids marking
      pages as clean and then needing to re-mark them as dirty later.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: simple cleanups to write_cache_pages_da() · 4f01b02c
      Committed by Theodore Ts'o
      Eliminate duplicate code, unneeded variables, etc., to make it easier
      to understand the code.  No behavioral changes were made in this patch.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fold __mpage_da_writepage() into write_cache_pages_da() · 8eb9e5ce
      Committed by Theodore Ts'o
      Fold the __mpage_da_writepage() function into write_cache_pages_da().
      This will give us opportunities to clean up and simplify the resulting
      code.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: enable mblk_io_submit by default · 6fd7a467
      Committed by Theodore Ts'o
      Now that we've fixed the file corruption bug in commit d50bdd5a,
      it's time to enable mblk_io_submit by default.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix ext4_da_block_invalidatepages() to handle page range properly · c7f5938a
      Committed by Curt Wohlgemuth
      If ext4_da_block_invalidatepages() is called because of a
      failure from ext4_map_blocks() in mpage_da_map_and_submit(),
      it's supposed to clean up -- including unlocking -- all the
      pages in the mpd structure.  But these values may not match
      up, even on a system in which block size == page size:
      
         mpd->b_blocknr != mpd->first_page
         mpd->b_size != (mpd->next_page - mpd->first_page)
      
      ext4_da_block_invalidatepages() has been using b_blocknr and
      b_size; this patch changes it to use first_page and
      next_page.
      
      Tested:  I injected a small number (5%) of failures in
      ext4_map_blocks() in the case that the flags contain
      EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
      kernel.  Without this patch, I got hung tasks every time.
      With this patch, I see no hangs in many runs of fsstress.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: mark multi-page IO complete on mapping failure · e0fd9b90
      Committed by Curt Wohlgemuth
      In mpage_da_map_and_submit(), if we have a delayed block
      allocation failure from ext4_map_blocks(), we need to mark
      the IO as complete, by setting
      
            mpd->io_done = 1;
      
      Otherwise, we could end up submitting the pages in an outer
      loop; since they are unlocked on mapping failure in
      ext4_da_block_invalidatepages(), this will cause a bug check
      in mpage_da_submit_io().
      
      I tested this by injecting failures into ext4_map_blocks().
      Without this patch, a simple fsstress run will bug check;
      with the patch, it works fine.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  2. 25 Feb, 2011 5 commits
    • ext4: mballoc: don't replace the current preallocation group unnecessarily · 5a54b2f1
      Committed by Coly Li
      In ext4_mb_check_group_pa(), the current preallocation space is
      replaced with a new preallocation space when the two have the same
      distance from the goal block.
      
      This doesn't actually gain us anything, so change things so that the
      function only switches to the new preallocation group if its distance
      from the goal block is strictly smaller than the current preallocation
      group's distance from the goal block.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
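      The tie-breaking rule described in 5a54b2f1 can be sketched in
      isolation.  This is a minimal illustration, not the kernel code:
      struct pa_sketch, pa_distance() and pick_closer_pa() are hypothetical
      names, and a preallocation is reduced to a single start block.

      ```c
      #include <stddef.h>

      /* Hypothetical stand-in for a preallocation: only its start block. */
      struct pa_sketch {
          unsigned long long pstart;
      };

      /* Absolute distance between the goal block and a preallocation. */
      static unsigned long long pa_distance(unsigned long long goal,
                                            const struct pa_sketch *pa)
      {
          return goal > pa->pstart ? goal - pa->pstart : pa->pstart - goal;
      }

      /*
       * Return the preallocation to keep.  The current one is replaced only
       * when the candidate is strictly closer to the goal; on a tie the
       * current one is kept, so no pointless switch happens.
       */
      static const struct pa_sketch *pick_closer_pa(unsigned long long goal,
                                                    const struct pa_sketch *cur,
                                                    const struct pa_sketch *cand)
      {
          if (cur == NULL)
              return cand;
          if (pa_distance(goal, cand) < pa_distance(goal, cur))
              return cand;
          return cur;
      }
      ```

      With a goal of 100, candidates starting at 90 and 110 are equally
      distant, so whichever is current stays selected.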
    • ext4: clarify description of ac_g_ex in struct ext4_allocation_context · 58696f3a
      Committed by Coly Li
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • mballoc: add comments to ext4_mb_mark_free_simple() · 7c786059
      Committed by Coly Li
      This patch adds comments to ext4_mb_mark_free_simple to make it more
      understandable.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • ext4: remove unnecessary call to mb_find_buddy() in debugging code · 235772da
      Committed by Coly Li
      In __mb_check_buddy(), look at the code below:
        591         fstart = -1;
        592         buddy = mb_find_buddy(e4b, 0, &max);
        593         for (i = 0; i < max; i++) {
        594                 if (!mb_test_bit(i, buddy)) {
        595                         MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
        596                         if (fstart == -1) {
        597                                 fragments++;
        598                                 fstart = i;
        599                         }
        600                         continue;
        601                 }
        602                 fstart = -1;
        603                 /* check used bits only */
        604                 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
        605                         buddy2 = mb_find_buddy(e4b, j, &max2);
        606                         k = i >> j;
        607                         MB_CHECK_ASSERT(k < max2);
        608                         MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
        609                 }
        610         }
        611         MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
        612         MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
        613
        614         grp = ext4_get_group_info(sb, e4b->bd_group);
        615         buddy = mb_find_buddy(e4b, 0, &max);
      
      On line 592, buddy is fetched by mb_find_buddy() with order 0.  Between
      line 593 and line 615 buddy is not changed, so there is no need to
      fetch it again from mb_find_buddy() with order 0.
      
      We can safely remove the second mb_find_buddy() on line 615.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • ext4: code cleanup in mb_find_buddy() · 84b775a3
      Committed by Coly Li
      The current code calculates max regardless of whether order is zero, which
      is unnecessary.  This cleanup patch sets max to "1 << (e4b->bd_blkbits
      + 3)" only when order == 0.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
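      The shape of the cleanup in 84b775a3 can be sketched as follows.  This
      is only an illustration under assumptions: struct buddy_sketch, its
      per-order maxs table and buddy_max_for_order() are hypothetical, and
      only the expression "1 << (bd_blkbits + 3)" comes from the commit
      message.

      ```c
      /* Hypothetical, reduced view of the buddy state. */
      struct buddy_sketch {
          int blkbits;       /* log2 of the block size in bytes */
          const int *maxs;   /* assumed per-order limits, indexed by order */
      };

      /*
       * Return the bit limit for a given order.  The shift is evaluated
       * only on the order == 0 path instead of unconditionally up front.
       * 1 << (blkbits + 3) is the number of bits in one block.
       */
      static int buddy_max_for_order(const struct buddy_sketch *b, int order)
      {
          if (order == 0)
              return 1 << (b->blkbits + 3);
          return b->maxs[order];
      }
      ```

      For a 4 KiB block (blkbits of 12) the order-0 limit is 1 << 15 bits;
      higher orders read the table and never compute the shift.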
  3. 24 Feb, 2011 4 commits
  4. 22 Feb, 2011 3 commits
  5. 12 Feb, 2011 2 commits
    • ext4: serialize unaligned asynchronous DIO · e9e3bcec
      Committed by Eric Sandeen
      ext4 has a data corruption case when doing non-block-aligned
      asynchronous direct IO into a sparse file, as demonstrated
      by xfstest 240.
      
      The root cause is that while ext4 preallocates space in the
      hole, mappings of that space still look "new" and 
      dio_zero_block() will zero out the unwritten portions.  When
      more than one AIO thread is going, they both find this "new"
      block and race to zero out their portion; this is uncoordinated
      and causes data corruption.
      
      Dave Chinner fixed this for xfs by simply serializing all
      unaligned asynchronous direct IO.  I've done the same here.
      The difference is that we only wait on conversions, not all IO.
      This is a very big hammer, and I'm not very pleased with
      stuffing this into ext4_file_write().  But since ext4 is
      DIO_LOCKING, we need to serialize it at this high level.
      
      I tried to move this into ext4_ext_direct_IO, but by then
      we have the i_mutex already, and we will wait on the
      work queue to do conversions - which must also take the
      i_mutex.  So that won't work.
      
      This was originally exposed by qemu-kvm installing to
      a raw disk image with a normal sector-63 alignment.  I've
      tested a backport of this patch with qemu, and it does
      avoid the corruption.  It is also quite a lot slower
      (14 min for package installs, vs. 8 min for well-aligned)
      but I'll take slow correctness over fast corruption any day.
      
      Mingming suggested that we can track outstanding
      conversions, and wait on those so that non-sparse
      files won't be affected, and I've implemented that here;
      unaligned AIO to nonsparse files won't take a perf hit.
      
      [tytso@mit.edu: Keep the mutex as a hashed array instead
       of bloating the ext4 inode]
      
      [tytso@mit.edu: Fix up namespace issues so that global
       variables are protected with an "ext4_" prefix.]
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: make grpinfo slab cache names static · 2892c15d
      Committed by Eric Sandeen
      In 2.6.37 I was running into oopses with repeated module
      loads & unloads.  I tracked this down to:
      
      fb1813f4 ext4: use dedicated slab caches for group_info structures
      
      (this was in addition to the features advert unload problem)
      
      The kstrdup & subsequent kfree of the cache name was causing
      a double free.  In slub, at least, if I read it right it allocates
      & frees the name itself, slab seems to do something different...
      so in slub I think we were leaking -our- cachep->name, and double
      freeing the one allocated by slub.
      
      After getting lost in slab/slub/slob a bit, I just looked at other
      sized-caches that get allocated.  jbd2, biovec, sgpool all do it
      more or less the way jbd2 does.  Below patch follows the jbd2
      method of dynamically allocating a cache at mount time from
      a list of static names.
      
      (This might also possibly fix a race creating the caches with
      parallel mounts running).
      
      [Folded in a fix from Dan Carpenter which fixed an off-by-one error in
      the original patch]
      
      Cc: stable@kernel.org
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  6. 08 Feb, 2011 1 commit
    • ext4: Fix data corruption with multi-block writepages support · d50bdd5a
      Committed by Curt Wohlgemuth
      This fixes a corruption problem with the multi-block
      writepages submittal change for ext4, from commit
      bd2d0210 ("ext4: use bio
      layer instead of buffer layer in mpage_da_submit_io").
      
      (Note that this corruption is not present in 2.6.37 on
      ext4, because the corruption was detected after the
      feature was merged in 2.6.37-rc1, and so it was turned
      off by adding a non-default mount option,
      mblk_io_submit.  With this commit, which hopefully
      fixes the last of the bugs with this feature, we'll be
      able to turn on this performance feature by default in
      2.6.38, and remove the mblk_io_submit option.)
      
      The ext4 code path to bundle multiple pages for
      writeback in ext4_bio_write_page() had a bug: we should
      be clearing buffer head dirty flags *before* we submit
      the bio, not in the completion routine.
      
      The patch below was tested on 2.6.37 under KVM with the
      postgresql script which was submitted by Jon Nelson as
      documented in commit 1449032b.
      
      Without the patch, I'd hit the corruption problem about
      50-70% of the time.  With the patch, I executed the
      script > 100 times with no corruption seen.
      
      I also fixed a bug to make sure ext4_end_bio() doesn't
      dereference the bio after the bio_put() call.
      Reported-by: Jon Nelson <jnelson@jamponi.net>
      Reported-by: Matthias Bayer <jackdachef@gmail.com>
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
  7. 04 Feb, 2011 3 commits
  8. 17 Jan, 2011 2 commits
    • fallocate should be a file operation · 2fe17c10
      Committed by Christoph Hellwig
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand, always committing the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and removing the already unneeded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • make the feature checks in ->fallocate future proof · 64c23e86
      Committed by Christoph Hellwig
      Instead of various home grown checks that might need updates for new
      flags just check for any bit outside the mask of the features supported
      by the filesystem.  This makes the check future proof for any newly
      added flag.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
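      The mask-based check described in 64c23e86 can be sketched roughly as
      below.  This is an illustration, not the code from the patch: the
      SKETCH_ macros and sketch_check_fallocate_mode() are hypothetical
      names, the two flag values mirror the Linux FALLOC_FL_KEEP_SIZE and
      FALLOC_FL_PUNCH_HOLE definitions, and returning -EOPNOTSUPP is an
      assumption borrowed from the hole-punch commit d6dc8462 in this log.

      ```c
      #include <errno.h>

      /* Values mirror the Linux uapi flags; renamed so the sketch stands alone. */
      #define SKETCH_FALLOC_FL_KEEP_SIZE  0x01
      #define SKETCH_FALLOC_FL_PUNCH_HOLE 0x02

      /* Hypothetical: the set of flags this imaginary filesystem supports. */
      #define SKETCH_FS_SUPPORTED_FLAGS   SKETCH_FALLOC_FL_KEEP_SIZE

      /*
       * Reject any mode bit outside the supported mask.  A flag added in
       * the future is refused automatically, with no per-flag test to add.
       */
      static int sketch_check_fallocate_mode(int mode)
      {
          if (mode & ~SKETCH_FS_SUPPORTED_FLAGS)
              return -EOPNOTSUPP;
          return 0;
      }
      ```

      Supporting a new flag then means widening the mask rather than
      touching the check itself.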
  9. 14 Jan, 2011 1 commit
  10. 13 Jan, 2011 2 commits
    • Ext4: fail if we try to use hole punch · d6dc8462
      Committed by Josef Bacik
      Ext4 doesn't have the ability to punch holes yet, so make sure we return
      EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
      be added later.  Thanks,
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • quota: Fix deadlock during path resolution · f00c9e44
      Committed by Jan Kara
      As Al Viro pointed out, path resolution during Q_QUOTAON calls to quotactl
      is prone to deadlocks.  We hold the s_umount semaphore for reading during
      path resolution, and the resolution itself may need to acquire the semaphore
      for writing when, e.g., an autofs mountpoint is passed.
      
      Solve the problem by performing the resolution before we get hold of the
      superblock (and thus the s_umount semaphore).  The whole thing is complicated
      by the fact that some filesystems (OCFS2) ignore the path argument.  So to
      distinguish between filesystems which want the path and those which do not, we
      introduce a new .quota_on_meta callback which does not get the path.  OCFS2
      then uses this callback instead of the old .quota_on.
      
      CC: Al Viro <viro@ZenIV.linux.org.uk>
      CC: Christoph Hellwig <hch@lst.de>
      CC: Ted Ts'o <tytso@mit.edu>
      CC: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
  11. 12 Jan, 2011 2 commits
  12. 11 Jan, 2011 7 commits
    • ext4: don't pass entire map to check_eofblocks_fl · d002ebf1
      Committed by Eric Sandeen
      Since check_eofblocks_fl() only uses the m_lblk portion of the map
      structure, we may as well pass that directly, rather than passing the
      entire map, which IMHO obfuscates what parameters check_eofblocks_fl()
      cares about.  Not a big deal, but seems tidier and less confusing, to
      me.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix memory leak in ext4_free_branches · 1c5b9e90
      Committed by Theodore Ts'o
      Commit 40389687 moved a call to ext4_forget() out of
      ext4_free_branches and let ext4_free_blocks() handle calling
      bforget().  But that change unfortunately did not replace the call to
      ext4_forget() with brelse(), which was needed to drop the in-use count
      of the indirect block's buffer head, which led to a memory leak when
      deleting files that used indirect blocks.  Fix this.
      
      Thanks to Hugh Dickins for pointing this out.
      
      Cc: stable@kernel.org
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: remove ext4_mb_return_to_preallocation() · a5196f8c
      Committed by Theodore Ts'o
      This function was never implemented, except for a BUG_ON which was
      tripping when ext4 is run without a journal.  The problem is that
      although the comment asserts that "truncate (which is the only way to
      free block) discards all preallocations", ext4_free_blocks() is also
      called in various error recovery paths when blocks have been
      allocated, but for various reasons, we were not able to use those data
      blocks (for example, because we ran out of memory while trying to
      manipulate the extent tree, or some other similar situation).
      
      In addition to the fact that this function isn't implemented except
      for the incorrect BUG_ON, the single caller of this function,
      ext4_free_blocks(), doesn't use it at all if the journal is enabled.
      
      So remove the (stub) function entirely for now.  If we decide it's
      better to add it back, it's only going to be useful with a relatively
      large number of code changes anyway.
      
      Google-Bug-Id: 3236408
      
      Cc: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: flush the i_completed_io_list during ext4_truncate · 3889fd57
      Committed by Jiaying Zhang
      Ted first found the bug when running a 2.6.36 kernel with the dioread_nolock
      mount option: xfstests #13 complained about a wrong file size during fsck.
      However, the bug exists in older kernels as well, although it is
      somewhat harder to trigger.
      
      The problem is that ext4_end_io_work() can happen after we have truncated an
      inode to a smaller size. Then when ext4_end_io_work() calls 
      ext4_convert_unwritten_extents(), we may reallocate some blocks that have 
      been truncated, so the inode size becomes inconsistent with the allocated
      blocks. 
      
      The following patch flushes the i_completed_io_list during truncate to reduce 
      the risk that some pending end_io requests are executed later and convert 
      already truncated blocks to initialized. 
      
      Note that although the fix helps reduce the problem a lot there may still 
      be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
      problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
      held, it can race with an ongoing write request so that the io_end request is
      processed later when the corresponding blocks have been truncated.
      
      Ted and I have discussed the problem offline and we saw a few ways to fix
      the race completely:
      
      a) We guarantee that the i_mutex lock and the i_alloc_sem write lock are both held
      whenever vmtruncate() is called. The i_mutex lock prevents any new write
      requests from entering writeback and the i_alloc_sem prevents the race
      from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
      is called from do_truncate(), which is probably the most common case.
      However, there are places where we may call vmtruncate() without holding
      either i_mutex or i_alloc_sem. I would like to ask for other people's
      opinions on what locks are expected to be held before calling vmtruncate().
      There seems to be disagreement among the callers of that function.
      
      b) We change the ext4 write path so that we change the extent tree to contain 
      the newly allocated blocks and update i_size both at the same time --- when 
      the write of the data blocks is completed.
      
      c) We add some additional locking to synchronize vmtruncate() and 
      ext4_end_io_work(). This approach may have performance implications so we
      need to be careful.
      
      All of the above proposals may require more substantial changes, so
      we may consider taking the following patch as a band-aid.
      Signed-off-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: add error checking to calls to ext4_handle_dirty_metadata() · b4097142
      Committed by Theodore Ts'o
      Call ext4_std_error() in various places when we can't bail out
      cleanly, so the file system can be marked as in error.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix trimming of a single group · ca6e909f
      Committed by Jan Kara
      When ext4_trim_fs() is called to trim a part of a single group, the
      logic will wrongly set last block of the interval to 'len' instead
      of 'first_block + len'. Thus a shorter interval is possibly trimmed.
      Fix it.
      
      CC: Lukas Czerner <lczerner@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
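      The arithmetic slip fixed in ca6e909f is easy to see in a reduced
      form.  Both functions below are hypothetical illustrations, not the
      kernel code; they only copy the two expressions named in the commit
      message ('len' versus 'first_block + len').  Whether the kernel treats
      that bound as inclusive is not stated there, so the sketch does not
      assume either.

      ```c
      /* Buggy form: the end of the interval ignores where it starts. */
      static unsigned long trim_last_block_buggy(unsigned long first_block,
                                                 unsigned long len)
      {
          (void)first_block;   /* unused, which is exactly the bug */
          return len;
      }

      /* Fixed form: the end of the interval is relative to its start. */
      static unsigned long trim_last_block_fixed(unsigned long first_block,
                                                 unsigned long len)
      {
          return first_block + len;
      }
      ```

      For a request starting at block 1000 with a length of 500, the buggy
      form yields an end of 500, which lies before the start, while the
      fixed form yields 1500.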
    • ext4: fix uninitialized variable in ext4_register_li_request · 6c5a6cb9
      Committed by Andrew Morton
      fs/ext4/super.c: In function 'ext4_register_li_request':
      fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function
      
      It looks buggy to me, too.
      
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>