  1. 31 Oct, 2008 3 commits
    • Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng committed
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Fix bookend extent race v2 · 6643558d
      Yan Zheng committed
      When dropping the middle part of an extent, btrfs_drop_extents first
      truncates the extent, then inserts a bookend extent.
      
      Since truncation and insertion can't be done atomically, there is a small
      window during which the bookend extent isn't in the tree. This causes
      problems for functions that search the tree for file extent items. The fix
      is to lock the range of the bookend extent before truncation.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: update hole handling v2 · 9036c102
      Yan Zheng committed
      This patch splits the hole insertion code out of btrfs_setattr
      into btrfs_cont_expand and updates btrfs_get_extent to properly
      handle the case where file extent items are not contiguous.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  2. 30 Oct, 2008 1 commit
    • Btrfs: Add zlib compression support · c8b97818
      Chris Mason committed
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software-only limit; the disk format supports u64-sized compressed extents.
      
      In order to limit the RAM consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software-only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the CPUs on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
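The two size limits quoted in the compression commit can be illustrated with a minimal user-space sketch. This is not the kernel code: the macro and function names are hypothetical, and the acceptance rule is a simplified reading of the commit message (keep compressed output only if it is smaller than the input and fits within the 128k cap).

```c
#include <assert.h>
#include <stdint.h>

/* Limits quoted in the commit message; the macro names are hypothetical. */
#define MAX_COMPRESSED_BYTES   (128u * 1024u)
#define MAX_UNCOMPRESSED_BYTES (256u * 1024u)

/*
 * Number of bytes of a dirty range to hand to the compressor in one go,
 * so that no compressed extent covers more than 256k of file data.
 */
static uint64_t next_chunk_len(uint64_t remaining)
{
    return remaining < MAX_UNCOMPRESSED_BYTES ? remaining
                                              : MAX_UNCOMPRESSED_BYTES;
}

/*
 * Decide whether compressed output is worth storing: it must be strictly
 * smaller than the input and must not exceed the 128k on-disk cap.
 * Returns 1 to keep the compressed form, 0 to fall back to plain data.
 */
static int keep_compressed(uint64_t in_len, uint64_t out_len)
{
    assert(in_len <= MAX_UNCOMPRESSED_BYTES);
    return out_len < in_len && out_len <= MAX_COMPRESSED_BYTES;
}
```

Under these assumptions a 1 MiB dirty range would be fed to the compressor as four 256k chunks, and any chunk whose output is not smaller than its input would be written uncompressed.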
  3. 09 Oct, 2008 2 commits
    • Btrfs: Remove offset field from struct btrfs_extent_ref · 3bb1a1bc
      Yan Zheng committed
      The offset field in struct btrfs_extent_ref records the position
      inside the file at which the file extent is referenced. In the new back
      reference system, the tree leaves holding references to a file extent
      are recorded explicitly. We can scan these tree leaves very quickly, so the
      offset field is not required.
      
      This patch also makes the back reference system check the objectid
      when extents are being deleted.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Count space allocated to file in bytes · a76a3cd4
      Yan Zheng committed
      This patch makes btrfs count the space allocated to a file in bytes
      instead of 512-byte sectors.
      
      Everything else in btrfs uses a byte count instead of sector sizes or
      block sizes, so this fits better.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
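As a small illustration of the unit change, the sketch below keeps the allocation counter in bytes and converts to 512-byte units only when such a count is needed. The helper name is hypothetical and not taken from the patch; the only outside fact relied on is that Linux stat(2) reports st_blocks in 512-byte units.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a byte count into the number of 512-byte units needed to hold it,
 * rounding up so that a partially used unit is still counted.
 * Hypothetical helper, for illustration only.
 */
static uint64_t bytes_to_512b_units(uint64_t nbytes)
{
    return (nbytes + 511) >> 9;
}
```

Keeping the counter in bytes means the conversion happens in one place instead of at every site that adds or removes allocated space.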
  4. 04 Oct, 2008 1 commit
    • Btrfs: O_DIRECT writes via buffered writes + invalidate · cb843a6f
      Chris Mason committed
      This reworks the btrfs O_DIRECT write code a bit.  It had always fallen
      back to buffered IO and done an invalidate, but needed to be updated
      for the data=ordered code.  The invalidate wasn't actually removing pages
      because they were still inside an ordered extent.
      
      This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
      off IO in the main btrfs_file_write loop to keep the pipe down to the
      disk full as we process long writes.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. 30 Sep, 2008 1 commit
    • Btrfs: add and improve comments · d352ac68
      Chris Mason committed
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work to do in cleaning and commenting the code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 26 Sep, 2008 2 commits
    • Btrfs: extent_map and data=ordered fixes for space balancing · 5b21f2ed
      Zheng Yan committed
      * Add an EXTENT_BOUNDARY state bit to keep the writepage code
      from merging data extents that are in the process of being
      relocated.  This allows us to do accounting for them properly.
      
      * The balancing code relocates data extents independent of the underlying
      inode.  The extent_map code was modified to properly account for
      things moving around (invalidating extent_map caches in the inode).
      
      * Don't take the drop_mutex in the create_subvol ioctl.  It isn't
      required.
      
      * Fix walking of the ordered extent list to avoid races with sys_unlink
      
      * Change the lock ordering rules.  Transaction start goes outside
      the drop_mutex.  This allows btrfs_commit_transaction to directly
      drop the relocation trees.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Remove Btrfs compat code for older kernels · 2b1f55b0
      Chris Mason committed
      Btrfs had compatibility code for kernels back to 2.6.18.  This code has
      been removed and will be maintained in a separate backport
      git tree from now on.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 25 Sep, 2008 30 commits
    • Btrfs: Full back reference support · 31840ae1
      Zheng Yan committed
      This patch makes the back reference system explicitly record the
      location of the parent node for all types of extents. The location of
      the parent node is placed into the offset field of the backref key. Every
      time a tree block is balanced, the back references for the affected
      lower level extents are updated.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Dir fsync optimizations · 49eb7e46
      Chris Mason committed
      Drop i_mutex during the commit
      
      Don't bother doing the fsync at all unless the dir is marked as dirtied
      and needing fsync in this transaction.  For directories, this means
      that someone has unlinked a file from the dir without fsyncing the
      file.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add a write ahead tree log to optimize synchronous operations · e02119d5
      Chris Mason committed
      File syncs and directory syncs are optimized by copying their
      items into a special (copy-on-write) log tree.  There is one log tree per
      subvolume and the btrfs super block points to a tree of log tree roots.
      
      After a crash, items are copied out of the log tree and back into the
      subvolume.  See tree-log.c for all the details.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • a1b32a59
    • Btrfs: Improve and cleanup locking done by walk_down_tree · f87f057b
      Chris Mason committed
      While dropping snapshots, walk_down_tree does most of the work of checking
      reference counts and limiting tree traversal to just the blocks that
      we are freeing.
      
      It dropped and held the allocation mutex in strange and confusing ways;
      this commit changes it to only hold the mutex while actually freeing a block.
      
      The rest of the checks around reference counts should be safe without the lock
      because we only allow one process in btrfs_drop_snapshot at a time.  Other
      processes dropping reference counts should not drop it to 1 because
      their tree roots already have an extra ref on the block.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • 3ce7e67a
    • Btrfs: Throttle tuning · 37d1aeee
      Chris Mason committed
      This avoids waiting for transactions with pages locked by breaking out
      the code to wait for the current transaction to close into a function
      called by btrfs_throttle.
      
      It also lowers the limits for where we start throttling.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add compatibility for kernels >= 2.6.27-rc1 · 0ee0fda0
      Sven Wegener committed
      Add a couple of #if's to follow API changes.
      Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: implement memory reclaim for leaf reference cache · bcc63abb
      Yan committed
      The memory reclaiming issue occurs when snapshots exist. In that
      case, some cache entries may not be used while an old snapshot is dropped,
      so they will remain in the cache until umount.
      
      The patch adds a field to struct btrfs_leaf_ref to record its creation time.
      It also links all dead roots of a given snapshot together in order of
      creation time. After an old snapshot has been completely dropped, we check the
      dead root list and remove all cache entries created before the oldest dead
      root in the list.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Throttle operations if the reference cache gets too large · ab78c84d
      Chris Mason committed
      A large reference cache is directly related to a lot of work pending
      for the cleaner thread.  This throttles back new operations based on
      the size of the reference cache so the cleaner thread will be able to keep
      up.
      
      Overall, this actually makes the FS faster because the cleaner thread will
      be more likely to find things in cache.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Leaf reference cache update · 017e5369
      Chris Mason committed
      This changes the reference cache to make a single cache per root
      instead of one cache per transaction, and to key by the byte number
      of the disk block instead of the keys inside.
      
      This makes it much less likely to have cache misses if a snapshot
      or something has an extra reference on a higher node or a leaf while
      the first transaction that added the leaf into the cache is dropping.
      
      Some throttling is added to functions that free blocks heavily so they
      wait for old transactions to drop.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix some data=ordered related data corruptions · f421950f
      Chris Mason committed
      Stress testing was showing data checksum errors, most of which were caused
      by a lookup bug in the extent_map tree.  The tree was caching the last
      pointer returned, and searches would check the last pointer first.
      
      But, search callers also expect the search to return the very first
      matching extent in the range, which wasn't always true with the last
      pointer usage.
      
      For now, the code to cache the last return value is just removed.  It is
      easy to fix, but I think lookups are rare enough that it isn't required anymore.
      
      This commit also replaces do_sync_mapping_range with a local copy of the
      related functions.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
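The lookup contract described in the data=ordered corruption fix (return the very first extent that overlaps the requested range) can be sketched outside the kernel as follows. This is an illustrative model only: the real extent_map code uses an rbtree, while this sketch assumes a sorted, non-overlapping array, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified extent record: covers [start, start + len). */
struct extent {
    uint64_t start;
    uint64_t len;
};

/*
 * Return the index of the first extent overlapping [start, start + len),
 * or -1 if none does.  'map' must be sorted by start and non-overlapping.
 */
static int lookup_first(const struct extent *map, int n,
                        uint64_t start, uint64_t len)
{
    int lo = 0, hi = n;

    /* Binary search for the first extent that ends after 'start'. */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        if (map[mid].start + map[mid].len <= start)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* That extent matches only if it begins before the range ends. */
    if (lo < n && map[lo].start < start + len)
        return lo;
    return -1;
}
```

A cached "last returned" extent may overlap the requested range without being the first one in it, which is the kind of mismatch the commit message describes; searching from scratch as above avoids that.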
    • Btrfs: Data ordered fixes · 4a096752
      Chris Mason committed
      * In btrfs_delete_inode, wait for ordered extents after calling
      truncate_inode_pages.  This is much faster and more correct.
      
      * Properly clear out the PageChecked bit everywhere we redirty the page.
      
      * Change the writepage fixup handler to lock the page range and check to
      see if an ordered extent had been inserted since the improperly dirtied
      page was discovered.
      
      * Wait for ordered extents outside the transaction.  This isn't required
      for locking rules but does improve transaction latencies
      
      * Reduce contention on the alloc_mutex by dropping it while incrementing
      refs on a node/leaf and while dropping refs on a leaf.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Keep extent mappings in ram until pending ordered extents are done · 7f3c74fb
      Chris Mason committed
      It was possible for stale mappings from disk to be used instead of the
      new pending ordered extent.  This adds a flag to the extent map struct
      to keep it pinned until the pending ordered extent is actually on disk.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Add a per-inode lock around btrfs_drop_extents · ee6e6504
      Chris Mason committed
      btrfs_drop_extents is always called with a range lock held on the inode.
      But, it may operate on extents outside that range as it drops and splits
      them.
      
      This patch adds a per-inode mutex that is held while calling
      btrfs_drop_extents and while inserting new extents into the tree.  It
      prevents races from two procs working against adjacent ranges in the tree.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Don't pin pages in ram until the entire ordered extent is on disk. · ba1da2f4
      Chris Mason committed
      Checksum items are not inserted until the entire ordered extent is on disk,
      but individual pages might be clean and available for reclaim long before
      the whole extent is on disk.
      
      In order to allow those pages to be freed, we need to be able to search
      the list of ordered extents to find the checksum that is going to be inserted
      in the tree.  This way if the page needs to be read back in before
      the checksums are in the btree, we'll be able to verify the checksum on
      the page.
      
      This commit adds the ability to search the pending ordered extents for
      a given offset in the file, and changes btrfs_releasepage to allow
      ordered pages to be freed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs_start_transaction: wait for commits in progress to finish · f9295749
      Chris Mason committed
      btrfs_commit_transaction has to loop waiting for any writers in the
      transaction to finish before it can proceed.  btrfs_start_transaction
      should be polite and not join a transaction that is in the process
      of being finished off.
      
      There are a few places that can't wait, basically the ones doing IO that
      might be needed to finish the transaction.  For them, btrfs_join_transaction
      is added.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Update on disk i_size only after pending ordered extents are done · dbe674a9
      Chris Mason committed
      This changes the ordered data code to update i_size after the extent
      is on disk.  An on disk i_size is maintained in the in-memory btrfs inode
      structures, and this is updated as extents finish.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Use async helpers to deal with pages that have been improperly dirtied · 247e743c
      Chris Mason committed
      Higher layers sometimes call set_page_dirty without asking the filesystem
      to help.  This causes many problems for the data=ordered and cow code.
      This commit detects pages that haven't been properly set up for IO and
      kicks off an async helper to deal with them.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: New data=ordered implementation · e6dcd2dc
      Chris Mason committed
      The old data=ordered code would force commit to wait until
      all the data extents from the transaction were fully on disk.  This
      introduced large latencies into the commit and stalled new writers
      in the transaction for a long time.
      
      The new code changes the way data allocations and extents work:
      
      * When delayed allocation is filled, data extents are reserved, and
        the extent bit EXTENT_ORDERED is set on the entire range of the extent.
        A struct btrfs_ordered_extent is allocated and inserted into a per-inode
        rbtree to track the pending extents.
      
      * As each page is written, EXTENT_ORDERED is cleared on the bytes corresponding
        to that page.
      
      * When all of the bytes corresponding to a single struct btrfs_ordered_extent
        are written, the previously reserved extent is inserted into the FS
        btree and into the extent allocation trees.  The checksums for the file
        data are also updated.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
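The three steps of the new data=ordered scheme can be modelled with a small user-space sketch. It is a simplification under stated assumptions: the struct below is a hypothetical stand-in for struct btrfs_ordered_extent, a plain byte counter replaces the EXTENT_ORDERED bits, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_ordered_extent. */
struct ordered_extent {
    uint64_t file_offset; /* start of the extent within the file */
    uint64_t len;         /* total bytes covered by the extent */
    uint64_t bytes_left;  /* bytes whose IO has not completed yet */
};

/* Step 1: the extent is reserved and every byte is marked as pending. */
static void ordered_extent_init(struct ordered_extent *oe,
                                uint64_t file_offset, uint64_t len)
{
    oe->file_offset = file_offset;
    oe->len = len;
    oe->bytes_left = len;
}

/*
 * Step 2: IO finished for [start, start + len).  Returns 1 once the whole
 * extent is on disk, which is the point (step 3) where the caller would
 * insert the extent into the btree and update the checksums.
 * This sketch does not detect the same range being completed twice.
 */
static int ordered_extent_io_done(struct ordered_extent *oe,
                                  uint64_t start, uint64_t len)
{
    assert(start >= oe->file_offset);
    assert(start + len <= oe->file_offset + oe->len);
    assert(len <= oe->bytes_left);

    oe->bytes_left -= len;
    return oe->bytes_left == 0;
}
```

Because completion is tracked per extent rather than per transaction, pages may finish in any order and the commit does not have to wait for every data extent of the transaction.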
    • Add btrfs_end_transaction_throttle to force writers to wait for pending commits · 89ce8a63
      Chris Mason committed
      The existing throttle mechanism was often not sufficient to prevent
      new writers from coming in and making a given transaction run forever.
      This adds an explicit wait at the end of most operations so they will
      allow the current transaction to close.
      
      There is no wait inside file_write, inode updates, or cow filling, all of
      which have different deadlock possibilities.
      
      This is a temporary measure until better asynchronous commit support is
      added.  This code leads to stalls as it waits for data=ordered
      writeback, and it really needs to be fixed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb
      Chris Mason committed
      This allows us to delete an unlinked inode with dirty pages from the list
      instead of forcing commit to write these out before deleting the inode.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Replace the big fs_mutex with a collection of other locks · a2135011
      Chris Mason committed
      Extent allocations are still protected by a large alloc_mutex.
      Objectid allocations are covered by an objectid mutex.
      Other btree operations are protected by a lock on individual btree nodes.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: transaction ioctls · 6bf13c0c
      Sage Weil committed
      These ioctls let a user application hold a transaction open while it
      performs a series of operations.  A final ioctl does a sync on the fs
      (closing the current transaction).  This is the main requirement for
      Ceph's OSD to be able to keep the data it's storing in a btrfs volume
      consistent, and AFAICS it works just fine.  The application would do
      something like
      
      	fd = ::open("some/file", O_RDONLY);
      	::ioctl(fd, BTRFS_IOC_TRANS_START);
      	/* do a bunch of stuff */
      	::ioctl(fd, BTRFS_IOC_TRANS_END);
      or just
      	::close(fd);
      
      And to ensure it commits to disk,
      
      	::ioctl(fd, BTRFS_IOC_SYNC);
      
      When a transaction is held open, the trans_handle is attached to the
      struct file (via private_data) so that it will get cleaned up if the
      process dies unexpectedly.  A held transaction is also ended on fsync() to
      avoid a deadlock.
      
      A misbehaving application could also deliberately hold a transaction open,
      effectively locking up the FS, so it may make sense to restrict these
      ioctls to root or similar.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs delete ordered inode handling fix · e1b81e67
      Mingming committed
      Use btrfs_release_file instead of a put_inode call.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Fix corners in writepage and btrfs_truncate_page · 211c17f5
      Chris Mason committed
      The extent_io writepage calls needed an extra check for discarding
      pages that started on the last byte in the file.
      
      btrfs_truncate_page needed checks to make sure the page was still part
      of the file after reading it, and most importantly, needed to wait for
      all IO to the page to finish before freeing the corresponding extents on
      disk.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add workaround for AppArmor changing remove_suid() · 12fa8ec6
      Jeff Mahoney committed
      In openSUSE 10.3, AppArmor modifies remove_suid to take a struct path
      rather than just a dentry. This patch tests that the kernel is openSUSE
      10.3 or newer and adjusts the call accordingly.
      
      Debian/Ubuntu with AppArmor applied will also need a similar patch.
      Maintainers of btrfs under those distributions should build on this
      patch or, alternatively, alter their package descriptions to add
      -DREMOVE_SUID_PATH to the compiler command line.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      --- /dev/null	1970-01-01 00:00:00.000000000 +0000
      +++ b/compat.h	2008-02-06 16:46:13.000000000 -0500
      @@ -0,0 +1,15 @@
      +#ifndef _COMPAT_H_
      +#define _COMPAT_H_
      +
      +
      +/*
      + * Even if AppArmor isn't enabled, it still has different prototypes.
      + * Add more distro/version pairs here to declare which has AppArmor applied.
      + */
      +#if defined(CONFIG_SUSE_KERNEL)
      +# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,22)
      +# define REMOVE_SUID_PATH 1
      +# endif
      +#endif
      +
      +#endif /* _COMPAT_H_ */
      --- a/file.c	2008-02-06 11:37:39.000000000 -0500
      +++ b/file.c	2008-02-06 16:46:23.000000000 -0500
      @@ -37,6 +37,7 @@
       #include "ordered-data.h"
       #include "ioctl.h"
       #include "print-tree.h"
      +#include "compat.h"
      
       static int btrfs_copy_from_user(loff_t pos, int num_pages, int write_bytes,
      @@ -790,7 +791,11 @@ static ssize_t btrfs_file_write(struct f
       		goto out_nolock;
       	if (count == 0)
       		goto out_nolock;
      +#ifdef REMOVE_SUID_PATH
      +	err = remove_suid(&file->f_path);
      +#else
       	err = remove_suid(fdentry(file));
      +#endif
       	if (err)
       		goto out_nolock;
       	file_update_time(file);
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix do_sync_file_range ifdefs (2.6.22) · bb8885cc
      Chris Mason committed
      Signed-off-by: Chris Mason <chris.mason@oracle.com>