1. 13 7月, 2010 1 次提交
    • J
      ocfs2: No need to zero pages past i_size. · 693c241a
      Joel Becker 提交于
      When ocfs2 fills a hole, it does so by allocating clusters.  When a
      cluster is larger than the write, ocfs2 must zero the portions of the
      cluster outside of the write.  If the clustersize is smaller than a
      pagecache page, this is handled by the normal pagecache mechanisms, but
      when the clustersize is larger than a page, ocfs2's write code will zero
      the pages adjacent to the write.  This makes sure the entire cluster is
      zeroed correctly.
      
      Currently ocfs2 behaves exactly the same when writing past i_size.
      However, this means ocfs2 is writing zeroed pages for portions of a new
      cluster that are beyond i_size.  The page writeback code isn't expecting
      this.  It treats all pages past the one containing i_size as left behind
      due to a previous truncate operation.
      
      Thankfully, ocfs2 calculates the number of pages it will be working on
      up front.  The rest of the write code merely honors the original
      calculation.  We can simply trim the number of pages to only cover the
      actual file data.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
      693c241a
  2. 09 7月, 2010 2 次提交
    • J
      ocfs2: Zero the tail cluster when extending past i_size. · 5693486b
      Joel Becker 提交于
      ocfs2's allocation unit is the cluster.  This can be larger than a block
      or even a memory page.  This means that a file may have many blocks in
      its last extent that are beyond the block containing i_size.  There also
      may be more unwritten extents after that.
      
      When ocfs2 grows a file, it zeros the entire cluster in order to ensure
      future i_size growth will see cleared blocks.  Unfortunately,
      block_write_full_page() drops the pages past i_size.  This means that
      ocfs2 is actually leaking garbage data into the tail end of that last
      cluster.  This is a bug.
      
      We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
      when a write or truncate is past i_size.  They will use
      ocfs2_zero_extend() to ensure the data is properly zeroed.
      
      Older versions of ocfs2_zero_extend() simply zeroed every block between
      i_size and the zeroing position.  This presumes three things:
      
      1) There is allocation for all of these blocks.
      2) The extents are not unwritten.
      3) The extents are not refcounted.
      
      (1) and (2) hold true for non-sparse filesystems, which used to be the
      only users of ocfs2_zero_extend().  (3) is another bug.
      
      Since we're now using ocfs2_zero_extend() for sparse filesystems as
      well, we teach ocfs2_zero_extend() to check every extent between
      i_size and the zeroing position.  If the extent is unwritten, it is
      ignored.  If it is refcounted, it is CoWed.  Then it is zeroed.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
      5693486b
    • J
      ocfs2: When zero extending, do it by page. · a4bfb4cf
      Joel Becker 提交于
      ocfs2_zero_extend() does its zeroing block by block, but it calls a
      function named ocfs2_write_zero_page().  Let's have
      ocfs2_write_zero_page() handle the page level.  From
      ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
      a4bfb4cf
  3. 06 5月, 2010 1 次提交
  4. 05 3月, 2010 1 次提交
    • C
      dquot: cleanup space allocation / freeing routines · 5dd4056d
      Christoph Hellwig 提交于
      Get rid of the alloc_space, free_space, reserve_space, claim_space and
      release_rsv dquot operations - they are always called from the filesystem
      and if a filesystem really needs their own (which none currently does)
      it can just call into it's own routine directly.
      
      Move shared logic into the common __dquot_alloc_space,
      dquot_claim_space_nodirty and __dquot_free_space low-level methods,
      and rationalize the wrappers around it to move as much as possible
      code into the common block for CONFIG_QUOTA vs not.  Also rename
      all these helpers to be named dquot_* instead of vfs_dq_*.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      5dd4056d
  5. 27 2月, 2010 1 次提交
  6. 26 1月, 2010 1 次提交
  7. 17 12月, 2009 1 次提交
    • C
      cleanup blockdev_direct_IO locking · 1e431f5c
      Christoph Hellwig 提交于
      Currently the locking in blockdev_direct_IO is a mess, we have three different
      locking types and very confusing checks for some of them.  The most
      complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
      used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case
      is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
      The difference is that DIO_NO_LOCKING always sets the create argument for
      the get_blocks callback to zero, but we can easily move that to the actual
      get_blocks callbacks.  There are four users of the DIO_NO_LOCKING mode:
      gfs already ignores the create argument and thus is fine with the new
      version, ocfs2 only errors out if create were ever set, and we can remove
      this dead code now, the block device code only ever uses create for an
      error message if we are fully beyond the device which can never happen,
      and last but not least XFS will need the new behavour for writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a separate
      flag, although for now both flags always get set at the same time.
      
      Also revamp the documentation of the locking scheme to actually make sense.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1e431f5c
  8. 16 12月, 2009 1 次提交
    • C
      direct-io: cleanup blockdev_direct_IO locking · 5fe878ae
      Christoph Hellwig 提交于
      Currently the locking in blockdev_direct_IO is a mess, we have three
      different locking types and very confusing checks for some of them.  The
      most complicated one is DIO_OWN_LOCKING for reads, which happens to not
      actually be used.
      
      This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
      case is unused anyway, and the write side is almost identical to
      DIO_NO_LOCKING.  The difference is that DIO_NO_LOCKING always sets the
      create argument for the get_blocks callback to zero, but we can easily
      move that to the actual get_blocks callbacks.  There are four users of the
      DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
      fine with the new version, ocfs2 only errors out if create were ever set,
      and we can remove this dead code now, the block device code only ever uses
      create for an error message if we are fully beyond the device which can
      never happen, and last but not least XFS will need the new behavour for
      writes.
      
      Now we can replace the lock_type variable with a flags one, where no flag
      means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
      flag.  Separate out the check for not allowing to fill holes into a
      separate flag, although for now both flags always get set at the same
      time.
      
      Also revamp the documentation of the locking scheme to actually make
      sense.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5fe878ae
  9. 23 9月, 2009 4 次提交
    • T
      ocfs2: Use buffer IO if we are appending a file. · b80474b4
      Tao Ma 提交于
      In ocfs2_file_aio_write, we will prevent direct io if
      we find that we are appending(changing i_size) and call
      generic_file_aio_write_nolock. But actually O_DIRECT flag
      is there and this function will call generic_file_direct_write
      eventually which will update i_size and leave di->i_size
      alone. The bug is
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173.
      
      So this patch let ocfs2_direct_IO returns 0 directly if we
      are appending so that buffered write will be called and
      di->i_size get updated successfully. And this is also
      what we want in ocfs2_file_aio_write.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      b80474b4
    • T
      ocfs2: CoW a reflinked cluster when it is truncated. · 37f8a2bf
      Tao Ma 提交于
      When we truncate a file to a specific size which resides in a reflinked
      cluster, we need to CoW it since ocfs2_zero_range_for_truncate will
      zero the space after the size(just another type of write).
      
      So we add a "max_cpos" in ocfs2_refcount_cow so that it will stop when
      it hit the max cluster offset.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      37f8a2bf
    • T
      ocfs2: Integrate CoW in file write. · 293b2f70
      Tao Ma 提交于
      When we use mmap, we CoW the refcountd clusters in
      ocfs2_write_begin_nolock. While for normal file
      io(including directio), we do CoW in
      ocfs2_prepare_inode_for_write.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      293b2f70
    • T
      ocfs2: Add CoW support. · 6f70fa51
      Tao Ma 提交于
      This patch try CoW support for a refcounted record.
      
      the whole process will be:
      1. Calculate how many clusters we need to CoW and where we start.
         Extents that are not completely encompassed by the write will
         be broken on 1MB boundaries.
      2. Do CoW for the clusters with the help of page cache.
      3. Change the b-tree structure with the new allocated clusters.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      6f70fa51
  10. 16 9月, 2009 1 次提交
    • A
      HWPOISON: Enable .remove_error_page for migration aware file systems · aa261f54
      Andi Kleen 提交于
      Enable removing of corrupted pages through truncation
      for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
      These should cover most server needs.
      
      I chose the set of migration aware file systems for this
      for now, assuming they have been especially audited.
      But in general it should be safe for all file systems
      on the data area that support read/write and truncate.
      
      Caveat: the hardware error handler does not take i_mutex
      for now before calling the truncate function. Is that ok?
      
      Cc: tytso@mit.edu
      Cc: hch@infradead.org
      Cc: mfasheh@suse.com
      Cc: aia21@cantab.net
      Cc: hugh.dickins@tiscali.co.uk
      Cc: swhiteho@redhat.com
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      aa261f54
  11. 05 9月, 2009 3 次提交
  12. 08 8月, 2009 1 次提交
  13. 21 7月, 2009 2 次提交
  14. 04 4月, 2009 1 次提交
    • H
      ocfs2: Pagecache usage optimization on ocfs2 · 1fca3a05
      Hisashi Hifumi 提交于
      A page can have multiple buffers and even if a page is not uptodate, some buffers
      can be uptodate on pagesize != blocksize environment.
      This aops checks that all buffers which correspond to a part of a file
      that we want to read are uptodate. If so, we do not have to issue actual
      read IO to HDD even if a page is not uptodate because the portion we
      want to read are uptodate.
      "block_is_partially_uptodate" function is already used by ext2/3/4.
      With the following patch random read/write mixed workloads or random read after
      random write workloads can be optimized and we can get performance improvement.
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      1fca3a05
  15. 13 3月, 2009 1 次提交
  16. 06 1月, 2009 4 次提交
    • J
      ocfs2: Use metadata-specific ocfs2_journal_access_*() functions. · 13723d00
      Joel Becker 提交于
      The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2
      commit triggers and allow us to compute metadata ecc right before the
      buffers are written out.  This commit provides ecc for inodes, extent
      blocks, group descriptors, and quota blocks.  It is not safe to use
      extened attributes and metaecc at the same time yet.
      
      The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide
      the type of block at their root.  Before, it didn't matter, but now the
      root block must use the appropriate ocfs2_journal_access_*() function.
      To keep this abstract, the structures now have a pointer to the matching
      journal_access function and a wrapper call to call it.
      
      A few places use naked ocfs2_write_block() calls instead of adding the
      blocks to the journal.  We make sure to calculate their checksum and ecc
      before the write.
      
      Since we pass around the journal_access functions.  Let's typedef them
      in ocfs2.h.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      13723d00
    • J
      ocfs2: Add quota calls for allocation and freeing of inodes and space · a90714c1
      Jan Kara 提交于
      Add quota calls for allocation and freeing of inodes and space, also update
      estimates on number of needed credits for a transaction. Move out inode
      allocation from ocfs2_mknod_locked() because vfs_dq_init() must be called
      outside of a transaction.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      a90714c1
    • M
      ocfs2: Remove JBD compatibility layer · 53ef99ca
      Mark Fasheh 提交于
      JBD2 is fully backwards compatible with JBD and it's been tested enough with
      Ocfs2 that we can clean this code up now.
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      53ef99ca
    • J
      ocfs2: Wrap inode block reads in a dedicated function. · b657c95c
      Joel Becker 提交于
      The ocfs2 code currently reads inodes off disk with a simple
      ocfs2_read_block() call.  Each place that does this has a different set
      of sanity checks it performs.  Some check only the signature.  A couple
      validate the block number (the block read vs di->i_blkno).  A couple
      others check for VALID_FL.  Only one place validates i_fs_generation.  A
      couple check nothing.  Even when an error is found, they don't all do
      the same thing.
      
      We wrap inode reading into ocfs2_read_inode_block().  This will validate
      all the above fields, going readonly if they are invalid (they never
      should be).  ocfs2_read_inode_block_full() is provided for the places
      that want to pass read_block flags.  Every caller is passing a struct
      inode with a valid ip_blkno, so we don't need a separate blkno argument
      either.
      
      We will remove the validation checks from the rest of the code in a
      later commit, as they are no longer necessary.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      b657c95c
  17. 15 10月, 2008 2 次提交
  18. 14 10月, 2008 9 次提交
    • M
      ocfs2: Don't check for NULL before brelse() · a81cb88b
      Mark Fasheh 提交于
      This is pointless as brelse() already does the check.
      
      Signed-off-by: Mark Fasheh
      a81cb88b
    • J
      ocfs2: Switch over to JBD2. · 2b4e30fb
      Joel Becker 提交于
      ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
      limiting our maximum filesystem size.
      
      It's a pretty trivial change.  Most functions are just renamed.  The
      only functional change is moving to Jan's inode-based ordered data mode.
      It's better, too.
      
      Because JBD2 reads and writes JBD journals, this is compatible with any
      existing filesystem.  It can even interact with JBD-based ocfs2 as long
      as the journal is formated for JBD.
      
      We provide a compatibility option so that paranoid people can still use
      JBD for the time being.  This will go away shortly.
      
      [ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
        ocfs2_truncate_for_delete(). --Mark ]
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      2b4e30fb
    • J
      ocfs2: Change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree() · 8d6220d6
      Joel Becker 提交于
      The original get/put_extent_tree() functions held a reference on
      et_root_bh.  However, every single caller already has a safe reference,
      making the get/put cycle irrelevant.
      
      We change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree().  It
      no longer gets a reference on et_root_bh.  ocfs2_put_extent_tree() is
      removed.  Callers now have a simpler init+use pattern.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      8d6220d6
    • J
      ocfs2: Make ocfs2_extent_tree the first-class representation of a tree. · f99b9b7c
      Joel Becker 提交于
      We now have three different kinds of extent trees in ocfs2: inode data
      (dinode), extended attributes (xattr_tree), and extended attribute
      values (xattr_value).  There is a nice abstraction for them,
      ocfs2_extent_tree, but it is hidden in alloc.c.  All the calling
      functions have to pick amongst a varied API and pass in type bits and
      often extraneous pointers.
      
      A better way is to make ocfs2_extent_tree a first-class object.
      Everyone converts their object to an ocfs2_extent_tree() via the
      ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
      tree calls to alloc.c.
      
      This simplifies a lot of callers, making for readability.  It also
      provides an easy way to add additional extent tree types, as they only
      need to be defined in alloc.c with a ocfs2_get_<new>_extent_tree()
      function.
      Signed-off-by: NJoel Becker <joel.becker@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      f99b9b7c
    • T
      ocfs2: Add extent tree operation for xattr value btrees · f56654c4
      Tao Ma 提交于
      Add some thin wrappers around ocfs2_insert_extent() for each of the 3
      different btree types, ocfs2_inode_insert_extent(),
      ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
      last is for the xattr index btree, which will be used in a followup patch.
      
      All the old callers in file.c etc will call ocfs2_dinode_insert_extent(),
      while the other two handle the xattr issue. And the init of extent tree are
      handled by these functions.
      
      When storing xattr value which is too large, we will allocate some clusters
      for it and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In
      order to re-use the b-tree operation code, a new parameter named "private"
      is added into ocfs2_extent_tree and it is used to indicate the root of
      ocfs2_exent_list. The reason is that we can't deduce the root from the
      buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse,
      in any place in an ocfs2_xattr_bucket.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      f56654c4
    • T
      ocfs2: Make high level btree extend code generic · 0eb8d47e
      Tao Ma 提交于
      Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic
      function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls
      ocfs2_do_cluster_allocation() now, but the latter can be used for other
      btree types as well.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      0eb8d47e
    • T
      ocfs2: Abstract ocfs2_extent_tree in b-tree operations. · e7d4cb6b
      Tao Ma 提交于
      In the old extent tree operation, we take the hypothesis that we
      are using the ocfs2_extent_list in ocfs2_dinode as the tree root.
      As xattr will also use ocfs2_extent_list to store large value
      for a xattr entry, we refactor the tree operation so that xattr
      can use it directly.
      
      The refactoring includes 4 steps:
      1. Abstract set/get of last_eb_blk and update_clusters since they may
         be stored in different location for dinode and xattr.
      2. Add a new structure named ocfs2_extent_tree to indicate the
         extent tree the operation will work on.
      3. Remove all the use of fe_bh and di, use root_bh and root_el in
         extent tree instead. So now all the fe_bh is replaced with
         et->root_bh, el with root_el accordingly.
      4. Make ocfs2_lock_allocators generic. Now it is limited to be only used
         in file extend allocation. But the whole function is useful when we want
         to store large EAs.
      
      Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
      for anything other than truncate inode data btrees.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      e7d4cb6b
    • T
      ocfs2: Use ocfs2_extent_list instead of ocfs2_dinode. · 811f933d
      Tao Ma 提交于
      ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
      ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
      they are all limited to an inode btree because they use a struct
      ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
      (the part of an ocfs2_dinode they actually use) so that the xattr btree code
      can use these functions.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      811f933d
    • T
      ocfs2: Modify ocfs2_num_free_extents for future xattr usage. · 231b87d1
      Tao Ma 提交于
      ocfs2_num_free_extents() is used to find the number of free extent records
      in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to
      use this for extended attribute trees in the future, so genericize the
      interface the take a buffer head. A future patch will allow that buffer_head
      to contain any structure rooting an ocfs2 btree.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      231b87d1
  19. 10 9月, 2008 1 次提交
    • T
      ocfs2: Fix a bug in direct IO read. · 0e116227
      Tao Ma 提交于
      ocfs2 will become read-only if we try to read the bytes which pass
      the end of i_size. This can be easily reproduced by following steps:
      1. mkfs a ocfs2 volume with bs=4k cs=4k and nosparse.
      2. create a small file(say less than 100 bytes) and we will create the file
         which is allocated 1 cluster.
      3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit.
      4. The ocfs2 volume becomes read-only and dmesg shows:
      OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks:
      Inode 66010 has a hole at block 1
      File system is now read-only due to the potential of on-disk corruption.
      Please run fsck.ocfs2 once the file system is unmounted.
      
      So suppress the ERROR message.
      Signed-off-by: NTao Ma <tao.ma@oracle.com>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      0e116227
  20. 01 8月, 2008 1 次提交
  21. 17 7月, 2008 1 次提交
    • C
      [PATCH] ocfs2: fix oops in mmap_truncate testing · c0420ad2
      Coly Li 提交于
      This patch fixes a mmap_truncate bug which was found by ocfs2 test suite.
      
      In an ocfs2 cluster more than 1 node, run program mmap_truncate, which races
      mmap writes and truncates from multiple processes. While the test is
      running, a stat from another node forces writeout, causing an oops in
      ocfs2_get_block() because it sees a buffer to write which isn't allocated.
      
      This patch fixed the bug by clear dirty and uptodate bits in buffer, leave
      the buffer unmapped and return.
      
      Fix is suggested by Mark Fasheh, and I code up the patch.
      Signed-off-by: NColy Li <coyli@suse.de>
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      c0420ad2