1. 17 11月, 2011 1 次提交
  2. 28 7月, 2011 1 次提交
    • M
      ocfs2: serialize unaligned aio · a11f7e63
      Mark Fasheh 提交于
      Fix a corruption that can happen when we have (two or more) outstanding
      aio's to an overlapping unaligned region.  Ext4
      (e9e3bcec) and xfs recently had to fix
      similar issues.
      
      In our case what happens is that we can have an outstanding aio on a region
      and if a write comes in with some bytes overlapping the original aio we may
      decide to read that region into a page before continuing (typically because
      of buffered-io fallback).  Since we have no ordering guarantees with the
      aio, we can read stale or bad data into the page and then write it back out.
      
      If the i/o is page and block aligned, then we avoid this issue as there
      won't be any need to read data from disk.
      
      I took the same approach as Eric in the ext4 patch and introduced some
      serialization of unaligned async direct i/o.  I don't expect this to have an
      effect on the most common cases of AIO.  Unaligned aio will be slower
      though, but that's far more acceptable than data corruption.
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      Signed-off-by: NJoel Becker <jlbec@evilplan.org>
      a11f7e63
  3. 26 7月, 2011 1 次提交
    • S
      ocfs2: Implement llseek() · 93862d5e
      Sunil Mushran 提交于
      ocfs2 implements its own llseek() to provide the SEEK_HOLE/SEEK_DATA
      functionality.
      
      SEEK_HOLE sets the file pointer to the start of either a hole or an unwritten
      (preallocated) extent, that is greater than or equal to the supplied offset.
      
      SEEK_DATA sets the file pointer to the start of an allocated extent (not
      unwritten) that is greater than or equal to the supplied offset.
      
      If the supplied offset is on a desired region, then the file pointer is set
      to it. Offsets greater than or equal to the file size return -ENXIO.
      
      Unwritten (preallocated) extents are considered holes because the file system
      treats reads to such regions in the same way as it does to holes.
      Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com>
      93862d5e
  4. 21 7月, 2011 4 次提交
    • J
      fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers · 02c24a82
      Josef Bacik 提交于
      Btrfs needs to be able to control how filemap_write_and_wait_range() is called
      in fsync to make it less of a painful operation, so push down taking i_mutex and
      the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
      file systems can drop taking the i_mutex altogether it seems, like ext3 and
      ocfs2.  For correctness sake I just pushed everything down in all cases to make
      sure that we keep the current behavior the same for everybody, and then each
      individual fs maintainer can make up their mind about what to do from there.
      Thanks,
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      02c24a82
    • C
      fs: always maintain i_dio_count · df2d6f26
      Christoph Hellwig 提交于
      Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
      This these filesystems to also protect truncate against direct I/O requests
      by using common code.  Right now the only non-DIO_LOCKING filesystem that
      appears to do so is XFS, which uses an opencoded variant of the i_dio_count
      scheme.
      
      Behaviour doesn't change for filesystems never calling inode_dio_wait.
      For ext4 behaviour changes when using the dioread_nonlock option, which
      previously was missing any protection between truncate and direct I/O reads.
      For ocfs2 that handcrafted i_dio_count manipulations are replaced with
      the common code now enable.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      df2d6f26
    • C
      fs: move inode_dio_wait calls into ->setattr · 562c72aa
      Christoph Hellwig 提交于
      Let filesystems handle waiting for direct I/O requests themselves instead
      of doing it beforehand.  This means filesystem-specific locks to prevent
      new dio referenes from appearing can be held.  This is important to allow
      generalizing i_dio_count to non-DIO_LOCKING filesystems.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      562c72aa
    • C
      fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig 提交于
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and it's write side is always mirrored by
      real exclusion.  It's intended use it to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall way
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches non-zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bd5fe6c5
  5. 20 7月, 2011 3 次提交
  6. 26 5月, 2011 1 次提交
  7. 14 5月, 2011 1 次提交
  8. 22 2月, 2011 1 次提交
  9. 07 3月, 2011 1 次提交
    • T
      ocfs2: Remove EXIT from masklog. · c1e8d35e
      Tao Ma 提交于
      mlog_exit is used to record the exit status of a function.
      But because it is added in so many functions, if we enable it,
      the system logs get filled up quickly and cause too much I/O.
      So actually no one can open it for a production system or even
      for a test.
      
      This patch just try to remove it or change it. So:
      1. if all the error paths already use mlog_errno, it is just removed.
         Otherwise, it will be replaced by mlog_errno.
      2. if it is used to print some return value, it is replaced with
         mlog(0,...).
      mlog_exit_ptr is changed to mlog(0.
      All those mlog(0,...) will be replaced with trace events later.
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      c1e8d35e
  10. 21 2月, 2011 1 次提交
    • T
      ocfs2: Remove ENTRY from masklog. · ef6b689b
      Tao Ma 提交于
      ENTRY is used to record the entry of a function.
      But because it is added in so many functions, if we enable it,
      the system logs get filled up quickly and cause too much I/O.
      So actually no one can open it for a production system or even
      for a test.
      
      So for mlog_entry_void, we just remove it.
      for mlog_entry(...), we replace it with mlog(0,...), and they
      will be replace by trace event later.
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      ef6b689b
  11. 17 1月, 2011 2 次提交
    • C
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig 提交于
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
    • C
      make the feature checks in ->fallocate future proof · 64c23e86
      Christoph Hellwig 提交于
      Instead of various home grown checks that might need updates for new
      flags just check for any bit outside the mask of the features supported
      by the filesystem.  This makes the check future proof for any newly
      added flag.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      64c23e86
  12. 13 1月, 2011 1 次提交
  13. 07 1月, 2011 1 次提交
  14. 10 12月, 2010 1 次提交
  15. 26 10月, 2010 1 次提交
  16. 23 10月, 2010 1 次提交
  17. 12 10月, 2010 1 次提交
    • T
      ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes. · 7bdb0d18
      Tristan Ye 提交于
      Currently, the default behavior of O_DIRECT writes was allowing
      concurrent writing among nodes to the same file, with no cluster
      coherency guaranteed (no EX lock held).  This can leave stale data in
      the cache for buffered reads on other nodes.
      
      The new mount option introduce a chance to choose two different
      behaviors for O_DIRECT writes:
      
          * coherency=full, as the default value, will disallow
                            concurrent O_DIRECT writes by taking
                            EX locks.
      
          * coherency=buffered, allow concurrent O_DIRECT writes
                                without EX lock among nodes, which
                                gains high performance at risk of
                                getting stale data on other nodes.
      Signed-off-by: NTristan Ye <tristan.ye@oracle.com>
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      7bdb0d18
  18. 16 9月, 2010 1 次提交
  19. 10 9月, 2010 2 次提交
  20. 08 9月, 2010 3 次提交
  21. 12 8月, 2010 2 次提交
  22. 10 8月, 2010 2 次提交
    • C
      check ATTR_SIZE contraints in inode_change_ok · 2c27c65e
      Christoph Hellwig 提交于
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      
      As a fallout we don't have to call inode_newsize_ok from simple_setsize and
      simplify it down to a truncate_setsize which doesn't return an error.  This
      simplifies a lot of setattr implementations and means we use truncate_setsize
      almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
      ext2_setsize static to make the calling convention obvious.
      
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2c27c65e
    • C
      remove inode_setattr · 1025774c
      Christoph Hellwig 提交于
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
      
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed to trim down the opencoded variant.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1025774c
  23. 17 7月, 2010 1 次提交
  24. 09 7月, 2010 2 次提交
    • J
      ocfs2: Zero the tail cluster when extending past i_size. · 5693486b
      Joel Becker 提交于
      ocfs2's allocation unit is the cluster.  This can be larger than a block
      or even a memory page.  This means that a file may have many blocks in
      its last extent that are beyond the block containing i_size.  There also
      may be more unwritten extents after that.
      
      When ocfs2 grows a file, it zeros the entire cluster in order to ensure
      future i_size growth will see cleared blocks.  Unfortunately,
      block_write_full_page() drops the pages past i_size.  This means that
      ocfs2 is actually leaking garbage data into the tail end of that last
      cluster.  This is a bug.
      
      We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
      when a write or truncate is past i_size.  They will use
      ocfs2_zero_extend() to ensure the data is properly zeroed.
      
      Older versions of ocfs2_zero_extend() simply zeroed every block between
      i_size and the zeroing position.  This presumes three things:
      
      1) There is allocation for all of these blocks.
      2) The extents are not unwritten.
      3) The extents are not refcounted.
      
      (1) and (2) hold true for non-sparse filesystems, which used to be the
      only users of ocfs2_zero_extend().  (3) is another bug.
      
      Since we're now using ocfs2_zero_extend() for sparse filesystems as
      well, we teach ocfs2_zero_extend() to check every extent between
      i_size and the zeroing position.  If the extent is unwritten, it is
      ignored.  If it is refcounted, it is CoWed.  Then it is zeroed.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
      5693486b
    • J
      ocfs2: When zero extending, do it by page. · a4bfb4cf
      Joel Becker 提交于
      ocfs2_zero_extend() does its zeroing block by block, but it calls a
      function named ocfs2_write_zero_page().  Let's have
      ocfs2_write_zero_page() handle the page level.  From
      ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
      a4bfb4cf
  25. 28 5月, 2010 2 次提交
  26. 22 5月, 2010 2 次提交