1. 02 10月, 2009 1 次提交
  2. 12 9月, 2009 1 次提交
    • C
      Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason 提交于
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page become
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accoutning is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8b62b72b
  3. 01 4月, 2009 1 次提交
    • C
      Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason 提交于
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many application use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordering, except it is only done on selected files.
      
      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which use a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5a3f23d5
  4. 09 12月, 2008 1 次提交
    • C
      Btrfs: move data checksumming into a dedicated tree · d20f7043
      Chris Mason 提交于
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have touch
      the subvolume and inodes as we cache extents.
      
      * There is potentitally one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by phyiscal extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d20f7043
  5. 31 10月, 2008 2 次提交
    • Y
      Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng 提交于
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      d899e052
    • Y
      Btrfs: update nodatacow code v2 · 80ff3856
      Yan Zheng 提交于
      This patch simplifies the nodatacow checker. If all references
      were created after the latest snapshot, then we can avoid COW
      safely. This patch also updates run_delalloc_nocow to do more
      fine-grained checking.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      80ff3856
  6. 30 10月, 2008 1 次提交
    • C
      Btrfs: Add zlib compression support · c8b97818
      Chris Mason 提交于
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts later.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption or the
      'other' field are currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit, the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically singled threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c8b97818
  7. 04 10月, 2008 1 次提交
    • C
      Btrfs: O_DIRECT writes via buffered writes + invaldiate · cb843a6f
      Chris Mason 提交于
      This reworks the btrfs O_DIRECT write code a bit.  It had always fallen
      back to buffered IO and done an invalidate, but needed to be updated
      for the data=ordered code.  The invalidate wasn't actually removing pages
      because they were still inside an ordered extent.
      
      This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
      off IO in the main btrfs_file_write loop to keep the pipe down the the
      disk full as we process long writes.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cb843a6f
  8. 25 9月, 2008 16 次提交
  9. 28 8月, 2007 1 次提交
  10. 11 8月, 2007 1 次提交
  11. 14 6月, 2007 1 次提交
    • A
      btrfs: Code cleanup · f1ace244
      Aneesh 提交于
      Attaching below is some of the code cleanups that i came across while
      reading the code.
      
      a) alloc_path already calls init_path.
      b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as
      the in memory copy ext4_inode as the disk copy
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      f1ace244
  12. 12 6月, 2007 1 次提交
  13. 01 5月, 2007 1 次提交
  14. 11 4月, 2007 1 次提交
  15. 07 4月, 2007 1 次提交
  16. 02 4月, 2007 1 次提交