1. 14 6月, 2014 1 次提交
  2. 13 6月, 2014 5 次提交
  3. 10 6月, 2014 17 次提交
    • F
      Btrfs: make fsync work after cloning into a file · 7ffbb598
      Filipe Manana 提交于
      When cloning into a file, we were correctly replacing the extent
      items in the target range and removing the extent maps. However
      we weren't replacing the extent maps with new ones that point to
      the new extents - as a consequence, an incremental fsync (when the
      inode doesn't have the full sync flag) was a NOOP, since it relies
      on the existence of extent maps in the modified list of the inode's
      extent map tree, which was empty. Therefore add new extent maps to
      reflect the target clone range.
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7ffbb598
    • A
      trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/ · 93915584
      Antonio Ospite 提交于
      Signed-off-by: NAntonio Ospite <ao2@ao2.it>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: linux-btrfs@vger.kernel.org
      Signed-off-by: NChris Mason <clm@fb.com>
      93915584
    • F
      Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled · f82a9901
      Filipe Manana 提交于
      If the NO_HOLES feature is enabled holes don't have file extent items in
      the btree that represent them anymore. This made the clone operation
      ignore the gaps that exist between consecutive file extent items and
      therefore not create the holes at the destination. When not using the
      NO_HOLES feature, the holes were created at the destination.
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f82a9901
    • G
      btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX · 902c68a4
      Gui Hecheng 提交于
      To be accurate about the error case,
      if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL.
      Signed-off-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      902c68a4
    • F
      Btrfs: update commit root on snapshot creation after orphan cleanup · 3821f348
      Filipe Manana 提交于
      On snapshot creation (either writable or read-only), we do orphan cleanup
      against the root of the snapshot. If the cleanup did remove any orphans,
      then the current root node will be different from the commit root node
      until the next transaction commit happens.
      
      A send operation always uses the commit root of a snapshot - this means
      it will see the orphans if it starts computing the send stream before the
      next transaction commit happens (triggered by a timer or sync() for .e.g),
      which is when the commit root gets assigned a reference to current root,
      where the orphans are not visible anymore. The consequence of send seeing
      the orphans is explained below.
      
      For example:
      
          mkfs.btrfs -f /dev/sdd
          mount -o commit=999 /dev/sdd /mnt
      
          # open a file with O_TMPFILE and leave it open
          # write some data to the file
          btrfs subvolume snapshot -r /mnt /mnt/snap1
      
          btrfs send /mnt/snap1 -f /tmp/send.data
      
      The send operation will fail with the following error:
      
          ERROR: send ioctl failed with -116: Stale file handle
      
      What happens here is that our snapshot has an orphan inode still visible
      through the commit root, that corresponds to the tmpfile. However send
      will attempt to call inode.c:btrfs_iget(), with the goal of reading the
      file's data, which will return -ESTALE because it will use the current
      root (and not the commit root) of the snapshot.
      
      Of course, there are other cases where we can get orphans, but this
      example using a tmpfile makes it much easier to reproduce the issue.
      
      Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
      the commit root is different from the current root, just commit the
      transaction associated with the snapshot's root (if it exists), so that
      a send will not see any orphans that don't exist anymore. This also
      guarantees a send will always see the same content regardless of whether
      a transaction commit happened already before the send was requested and
      after the orphan cleanup (meaning the commit root and current roots are
      the same) or it hasn't happened yet (commit and current roots are
      different).
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3821f348
    • F
      Btrfs: ioctl, don't re-lock extent range when not necessary · ff5df9b8
      Filipe Manana 提交于
      In ioctl.c:lock_extent_range(), after locking our target range, the
      ordered extent that btrfs_lookup_first_ordered_extent() returns us
      may not overlap our target range at all. In this case we would just
      unlock our target range, wait for any new ordered extents that overlap
      the range to complete, lock again the range and repeat all these steps
      until we don't get any ordered extent and the delalloc flag isn't set
      in the io tree for our target range.
      
      Therefore just stop if we get an ordered extent that doesn't overlap
      our target range and the dealalloc flag isn't set for the range in
      the inode's io tree.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ff5df9b8
    • F
      Btrfs: avoid visiting all extent items when cloning a range · 2c463823
      Filipe Manana 提交于
      When cloning a range of a file, we were visiting all the extent items in
      the btree that belong to our source inode. We don't need to visit those
      extent items that don't overlap the range we are cloning, as doing so only
      makes us waste time and do unnecessary btree navigations (btrfs_next_leaf)
      for inodes that have a large number of file extent items in the btree.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c463823
    • F
      Btrfs: set dead flag on the right root when destroying snapshot · c55bfa67
      Filipe Manana 提交于
      We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the
      parent of our target snapshot, instead of setting it in the target
      snapshot's root.
      
      This is easy to observe by running the following scenario:
      
          mkfs.btrfs -f /dev/sdd
          mount /dev/sdd /mnt
      
          btrfs subvolume create /mnt/first_subvol
          btrfs subvolume snapshot -r /mnt /mnt/mysnap1
      
          btrfs subvolume delete /mnt/first_subvol
          btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
          btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data
      
      The send command failed because the send ioctl returned -EPERM.
      A test case for xfstests follows.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      c55bfa67
    • F
      Btrfs: ensure readers see new data after a clone operation · c125b8bf
      Filipe Manana 提交于
      We were cleaning the clone target file range from the page cache before
      we did replace the file extent items in the fs tree. This was racy,
      as right after cleaning the relevant range from the page cache and before
      replacing the file extent items, a read against that range could be
      performed by another task and populate again the page cache with stale
      data (stale after the cloning finishes). This would result in reads after
      the clone operation successfully finishes to get old data (and potentially
      for a very long time). Therefore evict the pages after replacing the file
      extent items, so that subsequent reads will always get the new data.
      
      Similarly, we were prone to races while cloning the file extent items
      because we weren't locking the target range and wait for any existing
      ordered extents against that range to complete. It was possible that
      after cloning the extent items, a write operation that was performed
      before the clone operation and overlaps the same range, would end up
      undoing all or part of the work the clone operation did (a worker task
      running inode.c:btrfs_finish_ordered_io). Therefore lock the target
      range in the io tree, wait for all pending ordered extents against that
      range to finish and then safely perform the cloning.
      
      The issue of reading stale data after the clone operation is easy to
      reproduce by running the following C program in a loop until it exits
      with return value 1.
      
       #include <unistd.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <errno.h>
       #include <pthread.h>
       #include <fcntl.h>
       #include <assert.h>
       #include <asm/types.h>
       #include <linux/ioctl.h>
       #include <sys/stat.h>
       #include <sys/types.h>
       #include <sys/ioctl.h>
      
       #define SRC_FILE "/mnt/sdd/foo"
       #define DST_FILE "/mnt/sdd/bar"
       #define FILE_SIZE (16 * 1024)
       #define PATTERN_SRC 'X'
       #define PATTERN_DST 'Y'
      
      struct btrfs_ioctl_clone_range_args {
      	__s64 src_fd;
      	__u64 src_offset, src_length;
      	__u64 dest_offset;
      };
      
       #define BTRFS_IOCTL_MAGIC 0x94
       #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
      				   struct btrfs_ioctl_clone_range_args)
      
      static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
      static int clone_done = 0;
      static int reader_ready = 0;
      static int stale_data = 0;
      
      static void *reader_loop(void *arg)
      {
      	char buf[4096], want_buf[4096];
      
      	memset(want_buf, PATTERN_SRC, 4096);
      	pthread_mutex_lock(&mutex);
      	reader_ready = 1;
      	pthread_mutex_unlock(&mutex);
      
      	while (1) {
      		int done, fd, ret;
      
      		fd = open(DST_FILE, O_RDONLY);
      		assert(fd != -1);
      
      		pthread_mutex_lock(&mutex);
      		done = clone_done;
      		pthread_mutex_unlock(&mutex);
      
      		ret = read(fd, buf, 4096);
      		assert(ret == 4096);
      		close(fd);
      
      		if (done) {
      			ret = memcmp(buf, want_buf, 4096);
      			if (ret == 0) {
      				printf("Found new content\n");
      			} else {
      				printf("Found old content\n");
      				pthread_mutex_lock(&mutex);
      				stale_data = 1;
      				pthread_mutex_unlock(&mutex);
      			}
      			break;
      		}
      	}
      	return NULL;
      }
      
      int main(int argc, char *argv[])
      {
      	pthread_t reader;
      	int ret, i, fd;
      	struct btrfs_ioctl_clone_range_args clone_args;
      	int fd1, fd2;
      
      	ret = remove(SRC_FILE);
      	if (ret == -1 && errno != ENOENT) {
      		fprintf(stderr, "Error deleting src file: %s\n", strerror(errno));
      		return 1;
      	}
      	ret = remove(DST_FILE);
      	if (ret == -1 && errno != ENOENT) {
      		fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno));
      		return 1;
      	}
      
      	fd = open(SRC_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
      	assert(fd != -1);
      	for (i = 0; i < FILE_SIZE; i++) {
      		char c = PATTERN_SRC;
      		ret = write(fd, &c, 1);
      		assert(ret == 1);
      	}
      	close(fd);
      	fd = open(DST_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
      	assert(fd != -1);
      	for (i = 0; i < FILE_SIZE; i++) {
      		char c = PATTERN_DST;
      		ret = write(fd, &c, 1);
      		assert(ret == 1);
      	}
      	close(fd);
              sync();
      
      	ret = pthread_create(&reader, NULL, reader_loop, NULL);
      	assert(ret == 0);
      	while (1) {
      		int r;
      		pthread_mutex_lock(&mutex);
      		r = reader_ready;
      		pthread_mutex_unlock(&mutex);
      		if (r) break;
      	}
      
      	fd1 = open(SRC_FILE, O_RDONLY);
      	if (fd1 < 0) {
      		fprintf(stderr, "Error open src file: %s\n", strerror(errno));
      		return 1;
      	}
      	fd2 = open(DST_FILE, O_RDWR);
      	if (fd2 < 0) {
      		fprintf(stderr, "Error open dst file: %s\n", strerror(errno));
      		return 1;
      	}
      	clone_args.src_fd = fd1;
      	clone_args.src_offset = 0;
      	clone_args.src_length = 4096;
      	clone_args.dest_offset = 0;
      	ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args);
      	assert(ret == 0);
      	close(fd1);
      	close(fd2);
      
      	pthread_mutex_lock(&mutex);
      	clone_done = 1;
      	pthread_mutex_unlock(&mutex);
      	ret = pthread_join(reader, NULL);
      	assert(ret == 0);
      
      	pthread_mutex_lock(&mutex);
      	ret = stale_data ? 1 : 0;
      	pthread_mutex_unlock(&mutex);
      	return ret;
      }
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c125b8bf
    • Z
      btrfs: replace simple_strtoull() with kstrtoull() · 58dfae63
      ZhangZhen 提交于
      use the newer and more pleasant kstrtoull() to replace simple_strtoull(),
      because simple_strtoull() is marked for obsoletion.
      Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      58dfae63
    • J
      Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik 提交于
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if you say are defragging an extent with 5k snapshots pointing
      to it we will thrash the delayed ref lock because we need to go back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent, we
      free a reference for an extent, we add new references etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to lookup roots for bytenrs, this
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs we simply add new ref operations whenever we notice that we need to
      when we've modified the refs themselves.
      
      2) tree mod seq:  we no longer have this separation of major/minor counters.
      this makes the sequence number stuff much more sane and we can remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have it's own unique sequence
      number, rather whenever we go to lookup backrefs we inc the sequence number so
      we can make sure to keep any new operations from screwing up our world view at
      that given point.  This allows us to merge delayed refs during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fcebe456
    • M
    • D
      btrfs: assert that send is not in progres before root deletion · 61155aa0
      David Sterba 提交于
      CC: Miao Xie <miaox@cn.fujitsu.com>
      CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      61155aa0
    • D
      btrfs: protect snapshots from deleting during send · 521e0546
      David Sterba 提交于
      The patch "Btrfs: fix protection between send and root deletion"
      (18f687d5) does not actually prevent to delete the snapshot
      and just takes care during background cleaning, but this seems rather
      user unfriendly, this patch implements the idea presented in
      
      http://www.spinics.net/lists/linux-btrfs/msg30813.html
      
      - add an internal root_item flag to denote a dead root
      - check if the send_in_progress is set and refuse to delete, otherwise
        set the flag and proceed
      - check the flag in send similar to the btrfs_root_readonly checks, for
        all involved roots
      
      The root lookup in send via btrfs_read_fs_root_no_name will check if the
      root is really dead or not. If it is, ENOENT, aborted send. If it's
      alive, it's protected by send_in_progress, send can continue.
      
      CC: Miao Xie <miaox@cn.fujitsu.com>
      CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      521e0546
    • D
      btrfs: make FS_INFO ioctl available to anyone · e4ef90ff
      David Sterba 提交于
      This ioctl provides basic info about the filesystem that can be obtained
      in other ways (eg. sysfs), there's no reason to restrict it to
      CAP_SYSADMIN.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      e4ef90ff
    • D
      btrfs: make DEV_INFO ioctl available to anyone · 7d6213c5
      David Sterba 提交于
      This ioctl provides basic info about the devices that can be obtained in
      other ways (eg. sysfs), there's no reason to restrict it to
      CAP_SYSADMIN.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      7d6213c5
    • D
      btrfs: retrieve more info from FS_INFO ioctl · 80a773fb
      David Sterba 提交于
      Provide the basic information about filesystem through the ioctl:
      * b-tree node size (same as leaf size)
      * sector size
      * expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments
      
      Backward compatibility: if the values are 0, kernel does not provide
      this information, the applications should ignore them.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      80a773fb
  4. 21 5月, 2014 1 次提交
  5. 25 4月, 2014 1 次提交
  6. 08 4月, 2014 4 次提交
  7. 22 3月, 2014 1 次提交
    • L
      Btrfs: fix a crash of clone with inline extents's split · 00fdf13a
      Liu Bo 提交于
      xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
      of inline extents in __btrfs_drop_extents().
      
      For inline extents, we cannot duplicate another EXTENT_DATA item, because
      it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
      
      We have set limitations for the source inode's compressed inline extents,
      because it needs to decompress and recompress.  Now the destination inode's
      inline extents also need similar limitations.
      
      With this, xfstests btrfs/035 doesn't run into panic.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      00fdf13a
  8. 21 3月, 2014 3 次提交
    • F
      Btrfs: less fs tree lock contention when using autodefrag · f094c9bd
      Filipe Manana 提交于
      When finding new extents during an autodefrag, don't do so many fs tree
      lookups to find an extent with a size smaller then the target treshold.
      Instead, after each fs tree forward search immediately unlock upper
      levels and process the entire leaf while holding a read lock on the leaf,
      since our leaf processing is very fast.
      This reduces lock contention, allowing for higher concurrency when other
      tasks want to write/update items related to other inodes in the fs tree,
      as we're not holding read locks on upper tree levels while processing the
      leaf and we do less tree searches.
      
      Test:
      
          sysbench --test=fileio --file-num=512 --file-total-size=16G \
             --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
             --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
             --max-requests=10000000000 [prepare|run]
      
      (fileystem mounted with -o autodefrag, averages of 5 runs)
      
      Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb
      After this change:  63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb
      
      Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f094c9bd
    • G
      Btrfs: return EPERM when deleting a default subvolume · 72de6b53
      Guangyu Sun 提交于
      The error message is confusing:
      
       # btrfs sub delete /mnt/mysub/
       Delete subvolume '/mnt/mysub'
       ERROR: cannot delete '/mnt/mysub' - Directory not empty
      
      The error message does not make sense to me: It's not about deleting a
      directory but it's a subvolume, and it doesn't matter if the subvolume is
      empty or not.
      
      Maybe EPERM or is more appropriate in this case, combined with an explanatory
      kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
      it is configured as default subvolume.")
      Reported-by: NKoen De Wit <koen.de.wit@oracle.com>
      Signed-off-by: NGuangyu Sun <guangyu.sun@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      72de6b53
    • F
      Btrfs: cache extent states in defrag code path · 308d9800
      Filipe Manana 提交于
      When locking file ranges in the inode's io_tree, cache the first
      extent state that belongs to the target range, so that when unlocking
      the range we don't need to search in the io_tree again, reducing cpu
      time and making and therefore holding the io_tree's lock for a shorter
      period.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      308d9800
  9. 11 3月, 2014 6 次提交
    • M
      Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock · 6c255e67
      Miao Xie 提交于
      We needn't flush all delalloc inodes when we doesn't get s_umount lock,
      or we would make the tasks wait for a long time.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6c255e67
    • M
      Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume · 8257b2dc
      Miao Xie 提交于
      If the snapshot creation happened after the nocow write but before the dirty
      data flush, we would fail to flush the dirty data because of no space.
      
      So we must keep track of when those nocow write operations start and when they
      end, if there are nocow writers, the snapshot creators must wait. In order
      to implement this function, I introduce btrfs_{start, end}_nocow_write(),
      which is similar to mnt_{want,drop}_write().
      
      These two functions are only used for nocow file write operations.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      8257b2dc
    • F
      Btrfs: make defrag not fragment files when using prealloc extents · e2127cf0
      Filipe Manana 提交于
      When using prealloc extents, a file defragment operation may actually
      fragment the file and increase the amount of data space used by the file.
      This change fixes that behaviour.
      
      Example:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt
      $ cd /mnt
      $ xfs_io -f -c 'falloc 0 1048576' foobar && sync
      $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
      
      Before defragmenting the file:
      
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.25MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 0 nr 4096
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 4096 nr 102400 ram 1048576
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 106496 nr 90112
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 196608 nr 106496 ram 1048576
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 897024 nr 106496 ram 1048576
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 1003520 nr 45056
      (...)
      
      Now defragmenting the file results in more data space used than before:
      
      $ btrfs filesystem defragment -f foobar && sync
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.55MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      And the corresponding file extent items are now no longer perfectly sequential
      as before, and we're now needlessly using more space from data block groups:
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 0 nr 4096 ram 1048576
      		extent compression 0
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 13893632 nr 102400
      		extent data offset 0 nr 102400 ram 102400
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 106496 nr 90112 ram 1048576
      		extent compression 0
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 13996032 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 14102528 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 1003520 nr 45056 ram 1048576
      		extent compression 0
      (...)
      
      With this change, the above example will no longer cause allocation of new data
      space nor change the sequentiality of the file extents, that is, defragment will
      be effectless, leaving all extent items pointing to the extent starting at disk
      byte 12845056.
      
      In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
      400Mb each, initially consisting of a single prealloc extent of 400Mb, having
      random writes happening at a low rate, lead to a total of over ~17Gb of data
      space used, not far from eventually reaching an ENOSPC state.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      e2127cf0
    • F
      Btrfs: correctly flush data on defrag when compression is enabled · dec8ef90
      Filipe Manana 提交于
      When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
      enabled, we weren't flushing completely, as writing compressed extents
      is a 2 steps process, one to compress the data and another one to write
      the compressed data to disk.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      dec8ef90
    • H
      btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl · abccd00f
      Hugo Mills 提交于
      The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
      and 64-bit systems. This means that it is impossible to use btrfs
      receive on a system with a 64-bit kernel and 32-bit userspace, because
      the structure size (and hence the ioctl number) is different.
      
      This patch adds a compatibility structure and ioctl to deal with the
      above case.
      Signed-off-by: NHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      abccd00f
    • K
      btrfs: Return EXDEV for cross file system snapshot · 23ad5b17
      Kusanagi Kouichi 提交于
      EXDEV seems an appropriate error if an operation fails bacause it
      crosses file system boundaries.
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NKusanagi Kouichi <slash@ac.auone-net.jp>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      23ad5b17
  10. 15 2月, 2014 1 次提交