1. 12 2月, 2011 1 次提交
    • E
      ext4: serialize unaligned asynchronous DIO · e9e3bcec
      Eric Sandeen 提交于
      ext4 has a data corruption case when doing non-block-aligned
      asynchronous direct IO into a sparse file, as demonstrated
      by xfstest 240.
      
      The root cause is that while ext4 preallocates space in the
      hole, mappings of that space still look "new" and 
      dio_zero_block() will zero out the unwritten portions.  When
      more than one AIO thread is going, they both find this "new"
      block and race to zero out their portion; this is uncoordinated
      and causes data corruption.
      
      Dave Chinner fixed this for xfs by simply serializing all
      unaligned asynchronous direct IO.  I've done the same here.
      The difference is that we only wait on conversions, not all IO.
      This is a very big hammer, and I'm not very pleased with
      stuffing this into ext4_file_write().  But since ext4 is
      DIO_LOCKING, we need to serialize it at this high level.
      
      I tried to move this into ext4_ext_direct_IO, but by then
      we have the i_mutex already, and we will wait on the
      work queue to do conversions - which must also take the
      i_mutex.  So that won't work.
      
      This was originally exposed by qemu-kvm installing to
      a raw disk image with a normal sector-63 alignment.  I've
      tested a backport of this patch with qemu, and it does
      avoid the corruption.  It is also quite a lot slower
      (14 min for package installs, vs. 8 min for well-aligned)
      but I'll take slow correctness over fast corruption any day.
      
      Mingming suggested that we can track outstanding
      conversions, and wait on those so that non-sparse
      files won't be affected, and I've implemented that here;
      unaligned AIO to nonsparse files won't take a perf hit.
      
      [tytso@mit.edu: Keep the mutex as a hashed array instead
       of bloating the ext4 inode]
      
      [tytso@mit.edu: Fix up namespace issues so that global
       variables are protected with an "ext4_" prefix.]
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e9e3bcec
  2. 17 1月, 2011 1 次提交
    • C
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig 提交于
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
  3. 11 1月, 2011 1 次提交
  4. 28 10月, 2010 1 次提交
    • T
      ext4: improve llseek error handling for overly large seek offsets · e0d10bfa
      Toshiyuki Okajima 提交于
      The llseek system call should return EINVAL if passed a seek offset
      which results in a write error.  What this maximum offset should be
      depends on whether or not the huge_file file system feature is set,
      and whether or not the file is extent based or not.
      
      
      If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be 
      written (write systemcall) is different from the maximum size which can be 
      sought (lseek systemcall).
      
      For example, the following 2 cases demonstrates the differences
      between the maximum size which can be written, versus the seek offset
      allowed by the llseek system call:
      
      #1: mkfs.ext3 <dev>; mount -t ext4 <dev>
      #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
      
      Table. the max file size which we can write or seek
             at each filesystem feature tuning and file flag setting
      +============+===============================+===============================+
      | \ File flag|                               |                               |
      |      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
      |case       \|                               |                               |
      +------------+-------------------------------+-------------------------------+
      | #1         |   write:      2194719883264   | write:       --------------   |
      |            |   seek:       2199023251456   | seek:        --------------   |
      +------------+-------------------------------+-------------------------------+
      | #2         |   write:      4402345721856   | write:       17592186044415   |
      |            |   seek:      17592186044415   | seek:        17592186044415   |
      +------------+-------------------------------+-------------------------------+
      
      The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
      (= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped 
      maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
      (llseek of ext4_file_operations is generic_file_llseek which uses
      sb->s_maxbytes.)
      
      Therefore we create ext4 llseek function which uses 2 maxbytes.
      
      The new own function originates from generic_file_llseek().
      If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters 
      inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.
      Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      e0d10bfa
  5. 27 7月, 2010 1 次提交
  6. 12 6月, 2010 1 次提交
    • T
      ext4: Clean up s_dirt handling · a0375156
      Theodore Ts'o 提交于
      We don't need to set s_dirt in most of the ext4 code when journaling
      is enabled.  In ext3/4 some of the summary statistics for # of free
      inodes, blocks, and directories are calculated from the per-block
      group statistics when the file system is mounted or unmounted.  As a
      result the superblock doesn't have to be updated, either via the
      journal or by setting s_dirt.  There are a few exceptions, most
      notably when resizing the file system, where the superblock needs to
      be modified --- and in that case it should be done as a journalled
      operation if possible, and s_dirt set only in no-journal mode.
      
      This patch will optimize out some unneeded disk writes when using ext4
      with a journal.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a0375156
  7. 17 5月, 2010 1 次提交
  8. 05 3月, 2010 2 次提交
    • C
      dquot: cleanup dquot initialize routine · 871a2931
      Christoph Hellwig 提交于
      Get rid of the initialize dquot operation - it is now always called from
      the filesystem and if a filesystem really needs it's own (which none
      currently does) it can just call into it's own routine directly.
      
      Rename the now static low-level dquot_initialize helper to __dquot_initialize
      and vfs_dq_init to dquot_initialize to have a consistent namespace.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      871a2931
    • C
      dquot: move dquot initialization responsibility into the filesystem · 907f4554
      Christoph Hellwig 提交于
      Currently various places in the VFS call vfs_dq_init directly.  This means
      we tie the quota code into the VFS.  Get rid of that and make the
      filesystem responsible for the initialization.   For most metadata operations
      this is a straight forward move into the methods, but for truncate and
      open it's a bit more complicated.
      
      For truncate we currently only call vfs_dq_init for the sys_truncate case
      because open already takes care of it for ftruncate and open(O_TRUNC) - the
      new code causes an additional vfs_dq_init for those which is harmless.
      
      For open the initialization is moved from do_filp_open into the open method,
      which means it happens slightly earlier now, and only for regular files.
      The latter is fine because we don't need to initialize it for operations
      on special files, and we already do it as part of the namespace operations
      for directories.
      
      Add a dquot_file_open helper that filesystems that support generic quotas
      can use to fill in ->open.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      907f4554
  9. 04 3月, 2010 1 次提交
  10. 25 1月, 2010 1 次提交
    • T
      ext4: Use bitops to read/modify EXT4_I(inode)->i_state · 19f5fb7a
      Theodore Ts'o 提交于
      At several places we modify EXT4_I(inode)->i_state without holding
      i_mutex (ext4_release_file, ext4_bmap, ext4_journalled_writepage,
      ext4_do_update_inode, ...). These modifications are racy and we can
      lose updates to i_state. So convert handling of i_state to use bitops
      which are atomic.
      
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      19f5fb7a
  11. 28 9月, 2009 1 次提交
  12. 14 9月, 2009 1 次提交
  13. 09 9月, 2009 1 次提交
  14. 13 6月, 2009 1 次提交
    • T
      ext4: update the s_last_mounted field in the superblock · bc0b0d6d
      Theodore Ts'o 提交于
      This field can be very helpful when a system administrator is trying
      to sort through large numbers of block devices or filesystem images.
      What is stored in this field can be ambiguous if multiple filesystem
      namespaces are in play; what we store in practice is the mountpoint
      interpreted by the process's namespace which first opens a file in the
      filesystem.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bc0b0d6d
  15. 28 3月, 2009 1 次提交
  16. 24 2月, 2009 1 次提交
    • T
      ext4: Automatically allocate delay allocated blocks on close · 7d8f9f7d
      Theodore Ts'o 提交于
      When closing a file that had been previously truncated, force any
      delay allocated blocks that to be allocated so that if the filesystem
      is mounted with data=ordered, the data blocks will be pushed out to
      disk along with the journal commit.  Many application programs expect
      this, so we do this to avoid zero length files if the system crashes
      unexpectedly.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7d8f9f7d
  17. 23 11月, 2008 1 次提交
  18. 11 10月, 2008 1 次提交
  19. 07 10月, 2008 1 次提交
  20. 10 10月, 2008 1 次提交
  21. 09 9月, 2008 1 次提交
  22. 12 7月, 2008 2 次提交
    • M
      ext4: delayed allocation i_blocks fix for stat · 3e3398a0
      Mingming Cao 提交于
      Right now i_blocks is not getting updated until the blocks are actually
      allocaed on disk.  This means with delayed allocation, right after files
      are copied, "ls -sF" shoes the file as taking 0 blocks on disk.  "du"
      also shows the files taking zero space, which is highly confusing to the
      user.
      
      Since delayed allocation already keeps track of per-inode total
      number of blocks that are subject to delayed allocation, this patch fix
      this by using that to adjust the value returned by stat(2). When real
      block allocation is done, the i_blocks will get updated. Since the
      reserved blocks for delayed allocation will be decreased, this will be
      keep value returned by stat(2) consistent.
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      3e3398a0
    • A
      ext4: Use page_mkwrite vma_operations to get mmap write notification. · 2e9ee850
      Aneesh Kumar K.V 提交于
      We would like to get notified when we are doing a write on mmap section.
      This is needed with respect to preallocated area. We split the preallocated
      area into initialzed extent and uninitialzed extent in the call back. This
      let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
      that would result in data loss. The changes are also needed to handle ENOSPC
      when writing to an mmap section of files with holes.
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMingming Cao <cmm@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      2e9ee850
  23. 30 4月, 2008 2 次提交
  24. 29 1月, 2008 2 次提交
  25. 18 7月, 2007 1 次提交
    • A
      fallocate support in ext4 · a2df2a63
      Amit Arora 提交于
      This patch implements ->fallocate() inode operation in ext4. With this
      patch users of ext4 file systems will be able to use fallocate() system
      call for persistent preallocation. Current implementation only supports
      preallocation for regular files (directories not supported as of date)
      with extent maps. This patch does not support block-mapped files currently.
      Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
      now.
      Signed-off-by: NAmit Arora <aarora@in.ibm.com>
      a2df2a63
  26. 10 7月, 2007 1 次提交
  27. 13 2月, 2007 1 次提交
  28. 09 12月, 2006 1 次提交
  29. 12 10月, 2006 3 次提交
  30. 01 10月, 2006 3 次提交
  31. 27 9月, 2006 1 次提交
  32. 31 3月, 2006 1 次提交
    • J
      [PATCH] Introduce sys_splice() system call · 5274f052
      Jens Axboe 提交于
      This adds support for the sys_splice system call. Using a pipe as a
      transport, it can connect to files or sockets (latter as output only).
      
      From the splice.c comments:
      
         "splice": joining two ropes together by interweaving their strands.
      
         This is the "extended pipe" functionality, where a pipe is used as
         an arbitrary in-memory buffer. Think of a pipe as a small kernel
         buffer that you can use to transfer data from one end to the other.
      
         The traditional unix read/write is extended with a "splice()" operation
         that transfers data buffers to or from a pipe buffer.
      
         Named by Larry McVoy, original implementation from Linus, extended by
         Jens to support splicing to files and fixing the initial implementation
         bugs.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5274f052