1. 18 8月, 2009 2 次提交
    • J
      ext4: Fix possible deadlock between ext4_truncate() and ext4_get_blocks() · 487caeef
      Jan Kara 提交于
      During truncate we are sometimes forced to start a new transaction as
      the amount of blocks to be journaled is both quite large and hard to
      predict. So far we restarted a transaction while holding i_data_sem
      and that violates lock ordering because i_data_sem ranks below a
      transaction start (and it can lead to a real deadlock with
      ext4_get_blocks() mapping blocks in some page while having a
      transaction open).
      
      We fix the problem by dropping the i_data_sem before restarting the
      transaction and acquire it afterwards. It's slightly subtle that this
      works:
      
      1) By the time ext4_truncate() is called, all the page cache for the
      truncated part of the file is dropped so get_block() should not be
      called on it (we only have to invalidate extent cache after we
      reacquire i_data_sem because some extent from not-truncated part could
      extend also into the part we are going to truncate).
      
      2) Writes, migrate or defrag hold i_mutex so they are stopped for all
      the time of the truncate.
      
      This bug has been found and analyzed by Theodore Tso <tytso@mit.edu>.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      487caeef
    • J
      jbd2: Annotate transaction start also for jbd2_journal_restart() · 9599b0e5
      Jan Kara 提交于
      lockdep annotation for a transaction start has been at the end of
      jbd2_journal_start(). But a transaction is also started from
      jbd2_journal_restart(). Move the lockdep annotation to start_this_handle()
      which covers both cases.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9599b0e5
  2. 19 9月, 2009 1 次提交
  3. 01 9月, 2009 1 次提交
    • M
      ext4: Compile warning fix when EXT_DEBUG enabled · 84fe3bef
      Mingming 提交于
      When EXT_DEBUG is enabled I received the following compile warning on
      PPC64:
      
        CC [M]  fs/ext4/inode.o
        CC [M]  fs/ext4/extents.o
      fs/ext4/extents.c: In function ‘ext4_ext_rm_leaf’:
      fs/ext4/extents.c:2097: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘ext4_lblk_t’
      fs/ext4/extents.c: In function ‘ext4_ext_get_blocks’:
      fs/ext4/extents.c:2789: warning: format ‘%u’ expects type ‘unsigned int’, but argument 4 has type ‘long unsigned int’
      fs/ext4/extents.c:2852: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘ext4_lblk_t’
      fs/ext4/extents.c:2953: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 4 has type ‘unsigned int’
        CC [M]  fs/ext4/migrate.o
      
      The patch fixes compile warning.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      
      Index: linux-2.6.31-rc4/fs/ext4/extents.c
      ===================================================================
      84fe3bef
  4. 19 9月, 2009 1 次提交
    • T
      ext4: Avoid group preallocation for closed files · 50797481
      Theodore Ts'o 提交于
      Currently the group preallocation code tries to find a large (512)
      free block from which to do per-cpu group allocation for small files.
      The problem with this scheme is that it leaves the filesystem horribly
      fragmented.  In the worst case, if the filesystem is unmounted and
      remounted (after a system shutdown, for example) we forget the fact
      that wee were using a particular (now-partially filled) 512 block
      extent.  So the next time we try to allocate space for a small file,
      we will find *another* completely free 512 block chunk to allocate
      small files.  Given that there are 32,768 blocks in a block group,
      after 64 iterations of "mount, write one 4k file in a directory,
      unmount", the block group will have 64 files, each separated by 511
      blocks, and the block group will no longer have any free 512
      completely free chunks of blocks for group preallocation space.
      
      So if we try to allocate blocks for a file that has been closed, such
      that we know the final size of the file, and the filesystem is not
      busy, avoid using group preallocation.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      50797481
  5. 10 8月, 2009 2 次提交
    • T
      ext4: Fix bugs in mballoc's stream allocation mode · 4ba74d00
      Theodore Ts'o 提交于
      The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all
      screwed up.  These fields were getting unconditionally all the time,
      set even when stream allocation had not taken place, and if they were
      being used when the file was smaller than s_mb_stream_request, which
      is when the allocation should _not_ be doing stream allocation.
      
      Fix this by determining whether or not we stream allocation should
      take place once, in ext4_mb_group_or_file(), and setting a flag which
      gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found().
      This simplifies the code and assures that we are consistently using
      (or not using) the stream allocation logic.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4ba74d00
    • T
      ext4: Display the mballoc flags in mb_history in hex instead of decimal · 0ef90db9
      Theodore Ts'o 提交于
      Displaying the flags in base 16 makes it easier to see which flags
      have been set.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0ef90db9
  6. 19 9月, 2009 1 次提交
  7. 11 8月, 2009 3 次提交
  8. 28 7月, 2009 1 次提交
  9. 06 7月, 2009 2 次提交
  10. 17 7月, 2009 2 次提交
    • C
      ext4: More buffer head reference leaks · 6487a9d3
      Curt Wohlgemuth 提交于
      After the patch I posted last week regarding buffer head ref leaks in
      no-journal mode, I looked at all the code that uses buffer heads and
      searched for more potential leaks.
      
      The patch below fixes the issues I found; these can occur even when a
      journal is present.
      
      The change to inode.c fixes a double release if
      ext4_journal_get_create_access() fails.
      
      The changes to namei.c are more complicated.  add_dirent_to_buf() will
      release the input buffer head EXCEPT when it returns -ENOSPC.  There are
      some callers of this routine that don't always do the brelse() in the event
      that -ENOSPC is returned.  Unfortunately, to put this fix into ext4_add_entry()
      required capturing the return value of make_indexed_dir() and
      add_dirent_to_buf().
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6487a9d3
    • J
      jbd2: Fail to load a journal if it is too short · f6f50e28
      Jan Kara 提交于
      Due to on disk corruption, it can happen that journal is too short. Fail
      to load it in such case so that we don't oops somewhere later.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      f6f50e28
  11. 28 7月, 2009 2 次提交
  12. 17 7月, 2009 1 次提交
  13. 16 9月, 2009 16 次提交
  14. 15 9月, 2009 4 次提交
    • J
      udf: Fix possible corruption when close races with write · cbc8cc33
      Jan Kara 提交于
      When we close a file, we remove preallocated blocks from it. But this
      truncation was not protected by i_mutex and thus it could have raced with a
      write through a different fd and cause crashes or even filesystem corruption.
      Signed-off-by: NJan Kara <jack@suse.cz>
      cbc8cc33
    • J
      udf: Perform preallocation only for regular files · 81056dd0
      Jan Kara 提交于
      So far we preallocated blocks also for directories but that brings a
      problem, when to get rid of preallocated blocks we don't need. So far
      we removed them in udf_clear_inode() which has a disadvantage that
      1) blocks are unavailable long after writing to a directory finished
         and thus one can get out of space unnecessarily early
      2) releasing blocks from udf_clear_inode is problematic because VFS
         does not expect us to redirty inode there and it also slows down
         memory reclaim.
      
      So preallocate blocks only for regular files where we can drop preallocation
      in udf_release_file.
      Signed-off-by: NJan Kara <jack@suse.cz>
      81056dd0
    • J
      udf: Remove wrong assignment in udf_symlink · 7c6e3d1a
      Jan Kara 提交于
      Recomputation of the pointer was wrong (it should have been just increment).
      Luckily, we never use the computed value. Remove it.
      Signed-off-by: NJan Kara <jack@suse.cz>
      7c6e3d1a
    • J
      udf: Remove dead code · 5891d9dd
      Jan Kara 提交于
      Remove code that gets never used.
      Signed-off-by: NJan Kara <jack@suse.cz>
      5891d9dd
  15. 14 9月, 2009 1 次提交
    • C
      fsync: wait for data writeout completion before calling ->fsync · 2daea67e
      Christoph Hellwig 提交于
      Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
      the data, the calls into ->fsync to write out the metadata and then finally
      calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
      like a clever micro-optimization actually is nast trap for many filesystems.
      
      For many modern filesystems i_size or other inode information is only
      updated on I/O completion and we need to wait for I/O to finish before
      we can write out the metadata.  For old fashionen filesystems that
      instanciate blocks during the actual write and also update the metadata
      at that point it opens up a large window were we could expose uninitialized
      blocks after a crash.  While a few filesystems that need it already wait
      for the I/O to finish inside their ->fsync methods it is rather suboptimal
      as it is done under the i_mutex and also always for the whole file instead
      of just a part as we could do for O_SYNC handling.
      
      Here is a small audit of all fsync instances in the tree:
      
       - spufs_mfc_fsync:
       - ps3flash_fsync:
       - vol_cdev_fsync:
       - printer_fsync:
       - fb_deferred_io_fsync:
       - bad_file_fsync:
       - simple_sync_file:
      
      	don't care - filesystems/drivers do't use the page cache or are
      	purely in-memory.
      
       - simple_fsync:
       - file_fsync:
       - affs_file_fsync:
       - fat_file_fsync:
       - jfs_fsync:
       - ubifs_fsync:
       - reiserfs_dir_fsync:
       - reiserfs_sync_file:
      
      	never touch pagecache themselves.  We need to wait before if we do
      	not want to expose stale data after an allocation.
      
       - afs_fsync:
       - fuse_fsync_common:
      
      	do the waiting writeback itself in awkward ways, would benefit from
      	proper semantics
      
       - block_fsync:
      
      	Does a filemap_write_and_wait on the block device inode.  Because we
      	now have f_mapping that is the same inode we call it on in vfs_fsync.
      	So just removing it and letting the VFS do the work in one go would
      	be an improvement.
      
       - btrfs_sync_file:
       - cifs_fsync:
       - xfs_file_fsync:
      
      	need the wait first and currently do it themselves. would benefit from
      	doing it outside i_mutex.
      
       - coda_fsync:
       - ecryptfs_fsync:
       - exofs_file_fsync:
       - shm_fsync:
      
      	only passes the fsync through to the lower layer
      
       - ext3_sync_file:
      
      	doesn't seem to care, comments are confusing.
      
       - ext4_sync_file:
      
      	would need the wait to work correctly for delalloc mode with late
      	i_size updates.  Otherwise the ext3 comment applies.
      
      	currently implemens it's own writeback and wait in an odd way,
      	could benefit from doing it properly.
      
       - gfs2_fsync:
      
      	not needed for journaled data mode, but probably harmless there.
      	Currently writes back data asynchronously itself.  Needs some
      	major audit.
      
       - hostfs_fsync:
      
      	just calls fsync/datasync on the host FD.  Without the wait before
      	data might not even be inflight yet if we're unlucky.
      
       - hpfs_file_fsync:
       - ncp_fsync:
      
      	no-ops.  Dangerous before and after.
      
       - jffs2_fsync:
      
      	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.
      
       - nfs_fsync_dir:
      
      	just increments stats, claims all directory operations are synchronous
      
       - nfs_file_fsync:
      
      	only writes out data???  Looks very odd.
      
       - nilfs_sync_file:
      
      	looks like it expects all data done, but not sure from the code
      
       - ntfs_dir_fsync:
       - ntfs_file_fsync:
      
      	appear to do their own data writeback.  Very convoluted code.
      
       - ocfs2_sync_file:
      
      	does it's own data writeback, but no wait.  probably needs the wait.
      
       - smb_fsync:
      
      	according to a comment expects all pages written already, probably needs
      	the wait before.
      
      This patch only changes vfs_fsync_range, removal of the wait in the methods
      that have it is left to the filesystem maintainers.  Note that most
      filesystems really do need an audit for their fsync methods given the
      gems found in this very brief audit.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      2daea67e