1. 17 8月, 2012 1 次提交
  2. 14 7月, 2012 2 次提交
  3. 05 3月, 2012 1 次提交
    • J
      ext4: fix race between sync and completed io work · 491caa43
      Jeff Moyer 提交于
      The following command line will leave the aio-stress process unkillable
      on an ext4 file system (in my case, mounted on /mnt/test):
      
      aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2
      
      This is using the aio-stress program from the xfstests test suite.
      That particular command line tells aio-stress to do random writes to
      20 files from 20 threads (one thread per file).  The files are NOT
      preallocated, so you will get writes to random offsets within the
      file, thus creating holes and extending i_size.  It also opens the
      file with O_DIRECT and O_SYNC.
      
      On to the problem.  When an I/O requires unwritten extent conversion,
      it is queued onto the completed_io_list for the ext4 inode.  Two code
      paths will pull work items from this list.  The first is the
      ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
      which is called via the fsync path (and O_SYNC handling, as well).
      There are two issues I've found in these code paths.  First, if the
      fsync path beats the work routine to a particular I/O, the work
      routine will free the io_end structure!  It does not take into account
      the fact that the io_end may still be in use by the fsync path.  I've
      fixed this issue by adding yet another IO_END flag, indicating that
      the io_end is being processed by the fsync path.
      
      The second problem is that the work routine will make an assignment to
      io->flag outside of the lock.  I have witnessed this result in a hang
      at umount.  Moving the flag setting inside the lock resolved that
      problem.
      
      The problem was introduced by commit b82e384c ("ext4: optimize
      locking for end_io extent conversion"), which first appeared in 3.2.
      As such, the fix should be backported to that release (probably along
      with the unwritten extent conversion race fix).
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      CC: stable@kernel.org
      491caa43
  4. 31 10月, 2011 2 次提交
    • T
      ext4: optimize locking for end_io extent conversion · b82e384c
      Theodore Ts'o 提交于
      Now that we are doing the locking correctly, we need to grab the
      i_completed_io_lock() twice per end_io.  We can clean this up by
      removing the structure from the i_complted_io_list, and use this as
      the locking mechanism to prevent ext4_flush_completed_IO() racing
      against ext4_end_io_work(), instead of clearing the
      EXT4_IO_END_UNWRITTEN in io->flag.
      
      In addition, if the ext4_convert_unwritten_extents() returns an error,
      we no longer keep the end_io structure on the linked list.  This
      doesn't help, because it tends to lock up the file system and wedges
      the system.  That's one way to call attention to the problem, but it
      doesn't help the overall robustness of the system.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      b82e384c
    • T
      ext4: Use correct locking for ext4_end_io_nolock() · d73d5046
      Tao Ma 提交于
      We must hold i_completed_io_lock when manipulating anything on the
      i_completed_io_list linked list.  This includes io->lock, which we
      were checking in ext4_end_io_nolock().
      
      So move this check to ext4_end_io_work().  This also has the bonus of
      avoiding extra work if it is already done without needing to take the
      mutex.
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d73d5046
  5. 18 10月, 2011 1 次提交
  6. 31 7月, 2011 1 次提交
    • T
      ext4: fix races in ext4_sync_parent() · d59729f4
      Theodore Ts'o 提交于
      Fix problems if fsync() races against a rename of a parent directory
      as pointed out by Al Viro in his own inimitable way:
      
      >While we are at it, could somebody please explain what the hell is ext4
      >doing in
      >static int ext4_sync_parent(struct inode *inode)
      >{
      >        struct writeback_control wbc;
      >        struct dentry *dentry = NULL;
      >        int ret = 0;
      >
      >        while (inode && ext4_test_inode_state(inode, EXT4_STATE_NEWENTRY)) {
      >                ext4_clear_inode_state(inode, EXT4_STATE_NEWENTRY);
      >                dentry = list_entry(inode->i_dentry.next,
      >                                    struct dentry, d_alias);
      >                if (!dentry || !dentry->d_parent || !dentry->d_parent->d_inode)
      >                        break;
      >                inode = dentry->d_parent->d_inode;
      >                ret = sync_mapping_buffers(inode->i_mapping);
      >                ...
      >Note that dentry obviously can't be NULL there.  dentry->d_parent is never
      >NULL.  And dentry->d_parent would better not be negative, for crying out
      >loud!  What's worse, there's no guarantees that dentry->d_parent will
      >remain our parent over that sync_mapping_buffers() *and* that inode won't
      >just be freed under us (after rename() and memory pressure leading to
      >eviction of what used to be our dentry->d_parent)......
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d59729f4
  7. 21 7月, 2011 1 次提交
    • J
      fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers · 02c24a82
      Josef Bacik 提交于
      Btrfs needs to be able to control how filemap_write_and_wait_range() is called
      in fsync to make it less of a painful operation, so push down taking i_mutex and
      the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
      file systems can drop taking the i_mutex altogether it seems, like ext3 and
      ocfs2.  For correctness sake I just pushed everything down in all cases to make
      sure that we keep the current behavior the same for everybody, and then each
      individual fs maintainer can make up their mind about what to do from there.
      Thanks,
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      02c24a82
  8. 25 5月, 2011 1 次提交
    • J
      ext4: fix waiting and sending of a barrier in ext4_sync_file() · 93628ffb
      Jan Kara 提交于
      jbd2_log_start_commit() returns 1 only when we really start a
      transaction.  But we also need to wait for a transaction when the
      commit is already running.  Fix this problem by waiting for
      transaction commit unconditionally (which is just a quick check if the
      transaction is already committed).
      
      Also we have to be more careful with sending of a barrier because when
      transaction is being committed in parallel to ext4_sync_file()
      running, we cannot be sure that the barrier the journalling code sends
      happens after we wrote all the data for fsync (note that not every
      data writeout needs to trigger metadata changes thus commit of some
      metadata changes can be running while other data is still written
      out). So use jbd2_will_send_data_barrier() helper to detect the common
      cases when we can be sure barrier will be issued by the commit code
      and issue the barrier ourselves in the remaining cases.
      Reported-by: NEdward Goggin <egoggin@vmware.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      93628ffb
  9. 09 5月, 2011 1 次提交
    • T
      ext4: use EXT4FS_DEBUG instead of EXT4_DEBUG in fsync.c · e8bbe8c4
      Tao Ma 提交于
      We have EXT4FS_DEBUG for some old debug and CONFIG_EXT4_DEBUG
      for the new mballoc debug, but there isn't any EXT4_DEBUG.
      
      As CONFIG_EXT4_DEBUG seems to be only used in mballoc, use
      EXT4FS_DEBUG in fsync.c.
      
      [ It doesn't really matter; although I'm including this commit for
        consistency's sake.  The whole point of the #ifdef's is to disable
        the debugging code.  In general you're not going to want to enable
        all of the code protected by EXT4FS_DEBUG at the same time.  -- Ted ]
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e8bbe8c4
  10. 11 4月, 2011 1 次提交
    • C
      ext4: sync the directory inode in ext4_sync_parent() · 0893ed45
      Curt Wohlgemuth 提交于
      ext4 has taken the stance that, in the absence of a journal,
      when an fsync/fdatasync of an inode is done, the parent
      directory should be sync'ed if this inode entry is new.
      ext4_sync_parent(), which implements this, does indeed sync
      the dirent pages for parent directories, but it does not
      sync the directory *inode*.  This patch fixes this.
      
      Also now return error status from ext4_sync_parent().
      
      I tested this using a power fail test, which panics a
      machine running a file server getting requests from a
      client.  Without this patch, on about every other test run,
      the server is missing many, many files that had been synced.
      With this patch, on > 6 runs, I see zero files being lost.
      
      Google-Bug-Id: 4179519
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0893ed45
  11. 31 3月, 2011 1 次提交
  12. 22 3月, 2011 1 次提交
  13. 11 1月, 2011 1 次提交
    • J
      ext4: flush the i_completed_io_list during ext4_truncate · 3889fd57
      Jiaying Zhang 提交于
      Ted first found the bug when running 2.6.36 kernel with dioread_nolock
      mount option that xfstests #13 complained about wrong file size during fsck.
      However, the bug exists in the older kernels as well although it is
      somehow harder to trigger.
      
      The problem is that ext4_end_io_work() can happen after we have truncated an
      inode to a smaller size. Then when ext4_end_io_work() calls 
      ext4_convert_unwritten_extents(), we may reallocate some blocks that have 
      been truncated, so the inode size becomes inconsistent with the allocated
      blocks. 
      
      The following patch flushes the i_completed_io_list during truncate to reduce 
      the risk that some pending end_io requests are executed later and convert 
      already truncated blocks to initialized. 
      
      Note that although the fix helps reduce the problem a lot there may still 
      be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
      problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
      held, it can race with an ongoing write request so that the io_end request is
      processed later when the corresponding blocks have been truncated.
      
      Ted and I have discussed the problem offline and we saw a few ways to fix
      the race completely:
      
      a) We guarantee that i_mutex lock and i_alloc_sem write lock are both hold 
      whenever vmtruncate() is called. The i_mutex lock prevents any new write
      requests from entering writeback and the i_alloc_sem prevents the race
      from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
      is called from do_truncate(), which is probably the most common case.
      However, there are places where we may call vmtruncate() without holding
      either i_mutex or i_alloc_sem. I would like to ask for other people's
      opinions on what locks are expected to be held before calling vmtruncate().
      There seems a disagreement among the callers of that function.
      
      b) We change the ext4 write path so that we change the extent tree to contain 
      the newly allocated blocks and update i_size both at the same time --- when 
      the write of the data blocks is completed.
      
      c) We add some additional locking to synchronize vmtruncate() and 
      ext4_end_io_work(). This approach may have performance implications so we
      need to be careful.
      
      All of the above proposals may require more substantial changes, so
      we may consider to take the following patch as a bandaid.
      Signed-off-by: NJiaying Zhang <jiayingz@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      3889fd57
  14. 28 10月, 2010 1 次提交
  15. 17 9月, 2010 1 次提交
    • C
      block: remove BLKDEV_IFL_WAIT · dd3932ed
      Christoph Hellwig 提交于
      All the blkdev_issue_* helpers can only sanely be used for synchronous
      caller.  To issue cache flushes or barriers asynchronously the caller needs
      to set up a bio by itself with a completion callback to move the asynchronous
      state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
      specified when calling blkdev_issue_* and also remove the now unused flags
      argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
      blkdev_issue_discard we need to keep it for the secure discard flag, which
      gains a more descriptive name and loses the bitops vs flag confusion.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      dd3932ed
  16. 28 5月, 2010 2 次提交
  17. 17 5月, 2010 2 次提交
  18. 10 5月, 2010 1 次提交
  19. 29 4月, 2010 1 次提交
  20. 03 3月, 2010 1 次提交
  21. 23 12月, 2009 1 次提交
    • T
      ext4, jbd2: Add barriers for file systems with exernal journals · cc3e1bea
      Theodore Ts'o 提交于
      This is a bit complicated because we are trying to optimize when we
      send barriers to the fs data disk.  We could just throw in an extra
      barrier to the data disk whenever we send a barrier to the journal
      disk, but that's not always strictly necessary.
      
      We only need to send a barrier during a commit when there are data
      blocks which are must be written out due to an inode written in
      ordered mode, or if fsync() depends on the commit to force data blocks
      to disk.  Finally, before we drop transactions from the beginning of
      the journal during a checkpoint operation, we need to guarantee that
      any blocks that were flushed out to the data disk are firmly on the
      rust platter before we drop the transaction from the journal.
      
      Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      cc3e1bea
  22. 09 12月, 2009 1 次提交
  23. 23 11月, 2009 1 次提交
  24. 29 9月, 2009 1 次提交
    • M
      ext4: async direct IO for holes and fallocate support · 8d5d02e6
      Mingming Cao 提交于
      For async direct IO that covers holes or fallocate, the end_io
      callback function now queued the convertion work on workqueue but
      don't flush the work rightaway as it might take too long to afford.
      
      But when fsync is called after all the data is completed, user expects
      the metadata also being updated before fsync returns.
      
      Thus we need to flush the conversion work when fsync() is called.
      This patch keep track of a listed of completed async direct io that
      has a work queued on workqueue.  When fsync() is called, it will go
      through the list and do the conversion.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      8d5d02e6
  25. 13 9月, 2009 1 次提交
  26. 06 9月, 2009 1 次提交
  27. 17 6月, 2009 1 次提交
  28. 06 10月, 2008 1 次提交
  29. 09 9月, 2008 1 次提交
  30. 12 7月, 2008 1 次提交
  31. 30 4月, 2008 1 次提交
  32. 17 4月, 2008 1 次提交
    • H
      ext4: fdatasync should skip metadata writeout when overwriting · 53c550e9
      Hisashi Hifumi 提交于
      Currently fdatasync is identical to fsync in ext3.
      
      I think fdatasync should skip journal flush in data=ordered and
      data=writeback mode when it overwrites to already-instantiated blocks on
      HDD.  When I_DIRTY_DATASYNC flag is not set, fdatasync should skip journal
      writeout because this indicates only atime or/and mtime updates.
      
      Following patch is the same approach of ext2's fsync code(ext2_sync_file).
      
      I did a performance test using the sysbench.
      
      #sysbench --num-threads=128 --max-requests=50000 --test=fileio --file-total-size=128G
      --file-test-mode=rndwr --file-fsync-mode=fdatasync run
      
      The result on ext3 was:
      
      	-2.6.24
      	Operations performed:  0 Read, 50080 Write, 59600 Other = 109680 Total
      	Read 0b  Written 782.5Mb  Total transferred 782.5Mb  (12.116Mb/sec)
      	  775.45 Requests/sec executed
      
      	Test execution summary:
      	    total time:                          64.5814s
      	    total number of events:              50080
      	    total time taken by event execution: 3713.9836
      	    per-request statistics:
      	         min:                            0.0000s
      	         avg:                            0.0742s
      	         max:                            0.9375s
      	         approx.  95 percentile:         0.2901s
      
      	Threads fairness:
      	    events (avg/stddev):           391.2500/23.26
      	    execution time (avg/stddev):   29.0155/1.99
      
      	-2.6.24-patched
      	Operations performed:  0 Read, 50009 Write, 61596 Other = 111605 Total
      	Read 0b  Written 781.39Mb  Total transferred 781.39Mb  (16.419Mb/sec)
      	1050.83 Requests/sec executed
      
      	Test execution summary:
      	    total time:                          47.5900s
      	    total number of events:              50009
      	    total time taken by event execution: 2934.5768
      	    per-request statistics:
       	         min:                            0.0000s
      	         avg:                            0.0587s
       	         max:                            0.8938s
      	         approx.  95 percentile:         0.1993s
      
      	Threads fairness:
      	    events (avg/stddev):           390.6953/22.64
      	    execution time (avg/stddev):   22.9264/1.17
      
      Filesystem I/O throughput was improved.
      
      Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      53c550e9
  33. 18 10月, 2007 1 次提交
  34. 12 10月, 2006 3 次提交