1. 22 Mar, 2011 1 commit
  2. 28 Feb, 2011 1 commit
    • ext4: make FIEMAP and delayed allocation play well together · 6d9c85eb
      Committed by Yongqiang Yang
      Fix the FIEMAP ioctl so that it returns all of the page ranges which
      are still subject to delayed allocation.  We were missing some cases
      if the file was sparse.
      
      Reported by Chris Mason <chris.mason@oracle.com>:
      >We've had reports on btrfs that cp is giving us files full of zeros
      >instead of actually copying them.  It was tracked down to a bug with
      >the btrfs fiemap implementation where it was returning holes for
      >delalloc ranges.
      >
      >Newer versions of cp are trusting fiemap to tell it where the holes
      >are, which does seem like a pretty neat trick.
      >
      >I decided to give xfs and ext4 a shot with a few test cases too, xfs
      >passed with all the ones btrfs was getting wrong, and ext4 got the basic
      >delalloc case right.
      >$ mkfs.ext4 /dev/xxx
      >$ mount /dev/xxx /mnt
      >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1
      >$ fiemap-test foo
      >ext:   0 logical: [       0..     255] phys:        0..     255
      >flags: 0x007 tot: 256
      >
      >Hooray!  But once we throw a hole in, things go bad:
      >$ mkfs.ext4 /dev/xxx
      >$ mount /dev/xxx /mnt
      >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
      >$ fiemap-test foo
      >< no output >
      >
      >We've got a delalloc extent after the hole and ext4 fiemap didn't find
      >it.  If I run sync to kick the delalloc out:
      >$ sync
      >$ fiemap-test foo
      >ext:   0 logical: [     256..     511] phys:    34048..   34303
      >flags: 0x001 tot: 256
      >
      >fiemap-test is sitting in my /usr/local/bin, and I have no idea how it
      >got there.  It's full of pretty comments so I know it isn't mine, but
      >you can grab it here:
      >
      >http://oss.oracle.com/~mason/fiemap-test.c
      >
      >xfsqa has a fiemap program too.
      
      After the fix, the test results are as follows:
      ext:   0 logical: [     256..     511] phys:        0..     255
      flags: 0x007 tot: 256
      ext:   0 logical: [     256..     511] phys:    33280..   33535
      flags: 0x001 tot: 256
      
      $ mkfs.ext4 /dev/xxx
      $ mount /dev/xxx /mnt
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
      $ sync
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5
      $ fiemap-test foo
      ext:   0 logical: [     256..     511] phys:    33280..   33535
      flags: 0x000 tot: 256
      ext:   1 logical: [     768..    1023] phys:        0..     255
      flags: 0x006 tot: 256
      ext:   2 logical: [    1280..    1535] phys:        0..     255
      flags: 0x007 tot: 256
      Tested-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
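The `flags:` values in the fiemap-test output above (0x007, 0x006, 0x001, 0x000) are bit masks. The following is a minimal sketch of how they can be decoded. The three flag values match `<linux/fiemap.h>`; the `DEMO_` names and the helper functions are hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Values match <linux/fiemap.h>; redefined with a DEMO_ prefix so the
 * sketch builds without kernel headers. */
#define DEMO_FIEMAP_EXTENT_LAST     0x00000001u /* last extent in the file */
#define DEMO_FIEMAP_EXTENT_UNKNOWN  0x00000002u /* data location unknown */
#define DEMO_FIEMAP_EXTENT_DELALLOC 0x00000004u /* location pending (delayed allocation) */

/* Hypothetical helper: nonzero if the extent is still delayed-allocated. */
int demo_extent_is_delalloc(uint32_t flags)
{
    return (flags & DEMO_FIEMAP_EXTENT_DELALLOC) != 0;
}

/* Hypothetical helper: nonzero if this is the last extent of the file. */
int demo_extent_is_last(uint32_t flags)
{
    return (flags & DEMO_FIEMAP_EXTENT_LAST) != 0;
}

/* Print one flags word the way a reader might annotate fiemap-test output. */
void demo_print_flags(uint32_t flags)
{
    printf("0x%03x:%s%s%s\n", (unsigned)flags,
           (flags & DEMO_FIEMAP_EXTENT_DELALLOC) ? " delalloc" : "",
           (flags & DEMO_FIEMAP_EXTENT_UNKNOWN)  ? " unknown"  : "",
           (flags & DEMO_FIEMAP_EXTENT_LAST)     ? " last"     : "");
}
```

Read this way, 0x007 is a delayed-allocation extent whose location is not yet known and which is the last extent in the file, while 0x001 after `sync` is an ordinary allocated last extent. A copy tool that trusts FIEMAP must treat a delalloc extent as data, not as a hole.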
  3. 22 Feb, 2011 1 commit
  4. 12 Feb, 2011 1 commit
    • ext4: serialize unaligned asynchronous DIO · e9e3bcec
      Committed by Eric Sandeen
      ext4 has a data corruption case when doing non-block-aligned
      asynchronous direct IO into a sparse file, as demonstrated
      by xfstest 240.
      
      The root cause is that while ext4 preallocates space in the
      hole, mappings of that space still look "new" and 
      dio_zero_block() will zero out the unwritten portions.  When
      more than one AIO thread is going, they both find this "new"
      block and race to zero out their portion; this is uncoordinated
      and causes data corruption.
      
      Dave Chinner fixed this for xfs by simply serializing all
      unaligned asynchronous direct IO.  I've done the same here.
      The difference is that we only wait on conversions, not all IO.
      This is a very big hammer, and I'm not very pleased with
      stuffing this into ext4_file_write().  But since ext4 is
      DIO_LOCKING, we need to serialize it at this high level.
      
      I tried to move this into ext4_ext_direct_IO, but by then
      we have the i_mutex already, and we will wait on the
      work queue to do conversions - which must also take the
      i_mutex.  So that won't work.
      
      This was originally exposed by qemu-kvm installing to
      a raw disk image with a normal sector-63 alignment.  I've
      tested a backport of this patch with qemu, and it does
      avoid the corruption.  It is also quite a lot slower
      (14 min for package installs, vs. 8 min for well-aligned)
      but I'll take slow correctness over fast corruption any day.
      
      Mingming suggested that we can track outstanding
      conversions, and wait on those so that non-sparse
      files won't be affected, and I've implemented that here;
      unaligned AIO to nonsparse files won't take a perf hit.
      
      [tytso@mit.edu: Keep the mutex as a hashed array instead
       of bloating the ext4 inode]
      
      [tytso@mit.edu: Fix up namespace issues so that global
       variables are protected with an "ext4_" prefix.]
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
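Two ideas from this commit message can be sketched in userspace: deciding whether an I/O is block-aligned, and choosing a lock from a small hashed array rather than storing one in every inode. This is an illustration under stated assumptions, not the kernel code: all names are hypothetical, the table size is an arbitrary choice, and the real check may include further conditions.

```c
#include <stdint.h>

/* Hypothetical table size; the real hashed mutex array may differ. */
#define DEMO_AIO_HASH_SZ 37u

/* Nonzero if [pos, pos + count) does not both start and end on a block
 * boundary. blocksize must be a power of two. */
int demo_is_unaligned(uint64_t pos, uint64_t count, uint32_t blocksize)
{
    uint64_t blockmask = (uint64_t)blocksize - 1;

    return ((pos & blockmask) != 0) || (((pos + count) & blockmask) != 0);
}

/* Pick a lock slot from an inode identifier instead of embedding a
 * mutex in every inode. Different inodes may share a slot, which only
 * costs some extra serialization. */
unsigned demo_lock_slot(uint64_t inode_id)
{
    return (unsigned)(inode_id % DEMO_AIO_HASH_SZ);
}
```

With 512-byte sectors, a partition starting at sector 63 begins at byte offset 32256, which is not a multiple of 4096, so `demo_is_unaligned(63 * 512, 4096, 4096)` reports the write as unaligned. That corresponds to the qemu-kvm case described above.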
  5. 17 Jan, 2011 2 commits
    • fallocate should be a file operation · 2fe17c10
      Committed by Christoph Hellwig
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always committing the transaction is a bad idea for fast-path
      uses of fallocate, for example in recent Samba versions.  Given
      that block allocation is a data plane operation anyway, change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and removing the already unneeded S_ISDIR checks, given that we only wire
      up fallocate for regular files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
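The motivation above is that a file operation can see the open flags while an inode operation cannot. A small userspace mock can illustrate this. The structures and function below are simplified, hypothetical stand-ins and not the kernel's types; only `O_SYNC` comes from a real header.

```c
#include <fcntl.h> /* O_SYNC, O_WRONLY */

/* Simplified stand-in for the kernel's struct file (assumption). */
struct demo_file {
    unsigned int f_flags; /* flags the file was opened with */
};

/* Records what the mock filesystem decided; purely illustrative. */
struct demo_result {
    int allocated;               /* blocks were reserved */
    int committed_synchronously; /* allocation was forced to disk */
};

/* Mock fallocate taking a file, so the open flags are available. */
struct demo_result demo_fallocate(const struct demo_file *file, int mode,
                                  long long offset, long long len)
{
    struct demo_result r = { 0, 0 };

    (void)mode;
    (void)offset;
    (void)len;

    r.allocated = 1;
    /* Only pay for a synchronous commit when the caller asked for it. */
    if (file->f_flags & O_SYNC)
        r.committed_synchronously = 1;
    return r;
}
```

A caller that opened the file with `O_SYNC` gets the allocation committed before the call returns, while a fast-path caller without `O_SYNC` avoids that cost.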
    • make the feature checks in ->fallocate future proof · 64c23e86
      Committed by Christoph Hellwig
      Instead of various home-grown checks that might need updates for new
      flags, just check for any bit outside the mask of the features supported
      by the filesystem.  This makes the check future-proof for any newly
      added flag.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
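The mask-based check described above can be sketched as follows. The two flag values match `<linux/falloc.h>`; the `DEMO_` names, the choice of supported mask, and the function name are assumptions made for this illustration.

```c
#include <errno.h>

/* Values match <linux/falloc.h>; renamed to avoid clashing with it. */
#define DEMO_FALLOC_FL_KEEP_SIZE  0x01
#define DEMO_FALLOC_FL_PUNCH_HOLE 0x02

/* Assumption: this hypothetical filesystem supports only KEEP_SIZE. */
#define DEMO_SUPPORTED_MODES DEMO_FALLOC_FL_KEEP_SIZE

/* Reject any bit outside the supported mask, including flags that do
 * not exist yet, instead of testing for specific unsupported flags. */
int demo_check_fallocate_mode(int mode)
{
    if (mode & ~DEMO_SUPPORTED_MODES)
        return -EOPNOTSUPP;
    return 0;
}
```

Because the test is phrased against the supported set, a flag added to the interface later is rejected automatically until the filesystem opts in by widening its mask.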
  6. 13 Jan, 2011 1 commit
  7. 11 Jan, 2011 5 commits
    • ext4: don't pass entire map to check_eofblocks_fl · d002ebf1
      Committed by Eric Sandeen
      Since check_eofblocks_fl() only uses the m_lblk portion of the map
      structure, we may as well pass that directly, rather than passing the
      entire map, which IMHO obfuscates what parameters check_eofblocks_fl()
      cares about.  Not a big deal, but seems tidier and less confusing, to
      me.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: flush the i_completed_io_list during ext4_truncate · 3889fd57
      Committed by Jiaying Zhang
      Ted first found the bug when running a 2.6.36 kernel with the dioread_nolock
      mount option: xfstests #13 complained about a wrong file size during fsck.
      However, the bug exists in older kernels as well, although it is
      somewhat harder to trigger.
      
      The problem is that ext4_end_io_work() can happen after we have truncated an
      inode to a smaller size. Then when ext4_end_io_work() calls 
      ext4_convert_unwritten_extents(), we may reallocate some blocks that have 
      been truncated, so the inode size becomes inconsistent with the allocated
      blocks. 
      
      The following patch flushes the i_completed_io_list during truncate to reduce 
      the risk that some pending end_io requests are executed later and convert 
      already truncated blocks to initialized. 
      
      Note that although the fix helps reduce the problem a lot there may still 
      be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
      problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
      held, it can race with an ongoing write request so that the io_end request is
      processed later when the corresponding blocks have been truncated.
      
      Ted and I have discussed the problem offline and we saw a few ways to fix
      the race completely:
      
      a) We guarantee that i_mutex lock and i_alloc_sem write lock are both held
      whenever vmtruncate() is called. The i_mutex lock prevents any new write
      requests from entering writeback and the i_alloc_sem prevents the race
      from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
      is called from do_truncate(), which is probably the most common case.
      However, there are places where we may call vmtruncate() without holding
      either i_mutex or i_alloc_sem. I would like to ask for other people's
      opinions on what locks are expected to be held before calling vmtruncate().
      There seems to be disagreement among the callers of that function.
      
      b) We change the ext4 write path so that we change the extent tree to contain 
      the newly allocated blocks and update i_size both at the same time --- when 
      the write of the data blocks is completed.
      
      c) We add some additional locking to synchronize vmtruncate() and 
      ext4_end_io_work(). This approach may have performance implications so we
      need to be careful.
      
      All of the above proposals may require more substantial changes, so
      we may consider taking the following patch as a band-aid.
      Signed-off-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: drop ec_type from the ext4_ext_cache structure · b05e6ae5
      Committed by Theodore Ts'o
      We can encode the ec_type information by using ee_len == 0 to denote
      EXT4_EXT_CACHE_NO, ee_start == 0 to denote EXT4_EXT_CACHE_GAP, and if
      neither is true, then the cache type must be EXT4_EXT_CACHE_EXTENT.
      This allows us to reduce the size of ext4_ext_inode by another 8
      bytes.  (ec_type is 4 bytes, plus another 4 bytes of padding)
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
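The encoding described above, where the cache type is derived from the length and start fields instead of being stored, can be sketched like this. The structure layout, field names, and enum names are hypothetical simplifications for illustration, not the kernel definitions.

```c
#include <stdint.h>

/* Hypothetical names standing in for the three cache states. */
enum demo_cache_type {
    DEMO_CACHE_NO,     /* nothing cached */
    DEMO_CACHE_GAP,    /* cached range is a hole */
    DEMO_CACHE_EXTENT  /* cached range is a real extent */
};

/* Simplified cache entry without an explicit type field (assumption). */
struct demo_ext_cache {
    uint64_t start; /* first physical block, 0 for a gap */
    uint32_t block; /* first logical block */
    uint32_t len;   /* length in blocks, 0 if the cache is empty */
};

/* Derive the type from the remaining fields, in the order given in the
 * commit message: zero length first, then zero start. */
enum demo_cache_type demo_cache_type_of(const struct demo_ext_cache *c)
{
    if (c->len == 0)
        return DEMO_CACHE_NO;
    if (c->start == 0)
        return DEMO_CACHE_GAP;
    return DEMO_CACHE_EXTENT;
}
```

The order of the two tests matters: an empty cache entry may also have a zero start, so the length has to be checked first.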
    • ext4: use ext4_lblk_t instead of sector_t for logical blocks · 01f49d0b
      Committed by Theodore Ts'o
      This fixes a number of places where we used sector_t instead of
      ext4_lblk_t for logical blocks, which for ext4 are still 32-bit data
      types.  No point wasting space in the ext4_inode_info structure, and
      requiring 64-bit arithmetic on 32-bit systems, when it isn't
      necessary.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix 32bit overflow in ext4_ext_find_goal() · ad4fb9ca
      Committed by Kazuya Mio
      ext4_ext_find_goal() returns an ideal physical block number that the block
      allocator tries to allocate first. However, if a required file offset is
      smaller than the existing extent's one, ext4_ext_find_goal() returns
      a wrong block number because it may overflow at
      "block - le32_to_cpu(ex->ee_block)". This patch fixes the problem.
      
      ext4_ext_find_goal() will also return a wrong block number in case
      a file offset of the existing extent is too big. In this case,
      the ideal physical block number is fixed in ext4_mb_initialize_context(),
      so it's no problem.
      
      reproduce:
      # dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync
      # dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync
      # filefrag -v /mnt/mp1/file
      Filesystem type is: ef53
      File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
       ext logical physical expected length flags
         0     128    67456             128 eof
      /mnt/mp1/file: 2 extents found
      # rm -rf /mnt/mp1/tmp
      # echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req
      # dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc
      
      result (linux-2.6.37-rc2 + ext4 patch queue):
      # filefrag -v /mnt/mp1/file
      Filesystem type is: ef53
      File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
       ext logical physical expected length flags
         0       0    33280             128 
         1     128    67456    33407    128 eof
      /mnt/mp1/file: 2 extents found
      
      result (with this patch applied):
      # filefrag -v /mnt/mp1/file
      Filesystem type is: ef53
      File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
       ext logical physical expected length flags
         0       0    66560             128 
         1     128    67456    66687    128 eof
      /mnt/mp1/file: 2 extents found
      Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
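The wraparound in "block - le32_to_cpu(ex->ee_block)" can be reproduced in userspace with fixed-width unsigned types. This is a sketch of the kind of overflow described above and of one way to avoid it; the function names are hypothetical and the fixed variant is not claimed to be identical to the patch.

```c
#include <stdint.h>

/* Buggy form: the 32-bit unsigned difference wraps around when the
 * requested block lies before the extent's first logical block. */
uint64_t demo_goal_buggy(uint64_t ext_pblk, uint32_t ext_block, uint32_t block)
{
    return ext_pblk + (uint32_t)(block - ext_block);
}

/* Guarded form: always subtract the smaller logical block from the
 * larger one, then move forward or backward from the physical start. */
uint64_t demo_goal_fixed(uint64_t ext_pblk, uint32_t ext_block, uint32_t block)
{
    if (block > ext_block)
        return ext_pblk + (block - ext_block);
    return ext_pblk - (ext_block - block);
}
```

Using the numbers from the reproducer (an extent at logical block 128 and physical block 67456, with a write to logical block 0), the guarded form yields 67328, just before the existing extent, while the buggy form yields a goal more than four billion blocks away. The goal is only a hint to the allocator, so the physical block that is finally chosen can still differ from it.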
  8. 02 Nov, 2010 1 commit
  9. 28 Oct, 2010 6 commits
  10. 27 Jul, 2010 1 commit
  11. 17 Jun, 2010 1 commit
  12. 15 Jun, 2010 1 commit
  13. 17 May, 2010 9 commits
  14. 16 May, 2010 1 commit
  15. 12 May, 2010 1 commit
  16. 04 Apr, 2010 1 commit
  17. 04 Mar, 2010 2 commits
  18. 03 Mar, 2010 1 commit
  19. 05 Mar, 2010 1 commit
    • ext4: use ext4_get_block_write in buffer write · 744692dc
      Committed by Jiaying Zhang
      Allocate uninitialized extent before ext4 buffer write and
      convert the extent to initialized after io completes.
      The purpose is to make sure an extent can only be marked
      initialized after it has been written with new data so
      we can safely drop the i_mutex lock in ext4 DIO read without
      exposing stale data. This helps to improve multi-thread DIO
      read performance on high-speed disks.
      
      Skip the nobh and data=journal mount cases to make things simple for now.
      Signed-off-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  20. 03 Mar, 2010 1 commit
  21. 24 Feb, 2010 1 commit