1. 25 Nov, 2014 2 commits
    • Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Committed by Filipe Manana
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was called just before the second write, we can
      often get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possible to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
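      
      As a worked check of the numbers above, the following minimal sketch recomputes
      the uncovered range from the offsets used in the example. The ALIGN_UP macro,
      the write_end_aligned() helper and the 4096-byte sector size are assumptions made
      for illustration only; they are not taken from the btrfs sources.

```c
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

/* Hypothetical helper: sector-aligned end offset of a write at [offset, offset + len). */
static uint64_t write_end_aligned(uint64_t offset, uint64_t len, uint64_t sector)
{
	return ALIGN_UP(offset + len, sector);
}

/*
 * With the values from the example:
 *   - the first extent item covers [0, 32768)
 *   - the second write covers [16384, 16384 + 32770), aligned end 53248
 *   - the hole extent item starts at 53248 with length 40960, so it ends at
 *     94208, which is the truncate size 90123 rounded up to 4096
 * The range left without any file extent item is therefore [32768, 53248).
 */
```

      The aligned end offset 53248 matches the key offset of item 123 in the dump above.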
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file in
      one of the following states:
      
      1) the file is empty
      2) only the first write was captured
      3) only the 2 writes were captured
      4) both writes and the truncation were captured
      
      It should never reflect a state where only the first write and the truncation
      were captured (since the second write was performed before the truncation).
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: ensure send always works on roots without orphans · e5fa8f86
      Committed by Filipe Manana
      Move the logic from the snapshot creation ioctl into send. This avoids
      doing the transaction commit if send isn't used, and ensures that if
      a crash/reboot happens after the transaction commit that created the
      snapshot and before the transaction commit that switched the commit
      root, send will not get a commit root that differs from the main root
      (that has orphan items).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 24 Oct, 2014 1 commit
  3. 17 Oct, 2014 1 commit
    • Revert "Btrfs: race free update of commit root for ro snapshots" · d3797308
      Committed by Chris Mason
      This reverts commit 9c3b306e.
      
      Switching only one commit root during a transaction is wrong because it
      leads the fs into an inconsistent state. All commit roots should be
      switched at once, at transaction commit time, otherwise backref walking
      can often miss important references that were only accessible through
      the old commit root.  Plus, the root item for the snapshot's root wasn't
      getting updated, which prevented the next transaction commit from doing it.
      
      This made several users get into random corruption issues after creation
      of readonly snapshots.
      
      A regression test for xfstests will follow soon.
      
      Cc: stable@vger.kernel.org # 3.17
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 09 Oct, 2014 1 commit
  5. 02 Oct, 2014 3 commits
  6. 18 Sep, 2014 10 commits
  7. 09 Sep, 2014 1 commit
    • Btrfs: kfree()ing ERR_PTRs · c47ca32d
      Committed by Dan Carpenter
      The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in
      btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them.
      
      This kind of bug is a "One Err Bug", where there is just one error
      label that does everything.  I could set the "inherit = NULL" and keep
      the single out label but it ends up being more complicated that way.  It
      makes the code simpler to re-order the unwind so it's in the mirror
      order of the allocation and introduce some new error labels.
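      
      The unwind ordering described above can be sketched in userspace as follows.
      This is only an illustration under stated assumptions: ERR_PTR(), PTR_ERR() and
      IS_ERR() are simplified stand-ins for the kernel helpers, and dup_or_err() and
      create_snapshot_sketch() are hypothetical names, not the code from this commit.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers (assumed). */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator that reports failure as an ERR_PTR instead of NULL. */
static void *dup_or_err(const void *src, size_t len, int fail)
{
	void *p;

	if (fail)
		return ERR_PTR(-EFAULT);
	p = malloc(len);
	if (!p)
		return ERR_PTR(-ENOMEM);
	memcpy(p, src, len);
	return p;
}

/* Unwind in the mirror order of the allocations; never free an ERR_PTR. */
static int create_snapshot_sketch(int fail_args, int fail_inherit)
{
	static const char data[8] = "example";
	void *vol_args, *inherit;
	int ret = 0;

	vol_args = dup_or_err(data, sizeof(data), fail_args);
	if (IS_ERR(vol_args))
		return (int)PTR_ERR(vol_args);	/* nothing allocated yet */

	inherit = dup_or_err(data, sizeof(data), fail_inherit);
	if (IS_ERR(inherit)) {
		ret = (int)PTR_ERR(inherit);
		goto free_args;			/* skip freeing "inherit" */
	}

	/* ... the real work would go here ... */

	free(inherit);
free_args:
	free(vol_args);
	return ret;
}
```

      Each failure path jumps to a label that frees only what was successfully
      allocated before it, so an ERR_PTR value never reaches free().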
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  8. 27 Aug, 2014 1 commit
    • Btrfs: fix autodefrag with compression · e9512d72
      Committed by Chris Mason
      The autodefrag code skips defrag when two extents are adjacent.  But one
      big advantage for autodefrag is cutting down on the number of small
      extents, even when they are adjacent.  This commit changes it to defrag
      all small extents.
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 21 Aug, 2014 2 commits
    • Btrfs: clone, don't create invalid hole extent map · 62e2390e
      Committed by Filipe Manana
      When cloning a file that consists of an inline extent, we were creating
      an extent map that represents a non-existing trailing hole starting at a
      file offset that isn't a multiple of the sector size. This happened because
      when processing an inline extent we weren't aligning the extent's length to
      the sector size, and therefore incorrectly treating the range
      [inline_extent_length; sector_size[ as a hole.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: race free update of commit root for ro snapshots · 9c3b306e
      Committed by Filipe Manana
      This is a better solution for the problem addressed in the following
      commit:
      
          Btrfs: update commit root on snapshot creation after orphan cleanup
          (3821f348)
      
      The previous solution wasn't the best for 2 reasons:
      
          1) It added another full transaction commit, which is more expensive
             than just swapping the commit root with the root;
      
          2) If a reboot happened after the first transaction commit (the one
             that creates the snapshot) and before the second transaction commit,
             then we would end up with the same problem if a send using that
             snapshot was requested before the first transaction commit after
             the reboot.
      
      This change addresses those 2 issues. The second issue is addressed by
      switching the commit root in the dentry lookup VFS callback, which is
      also called by the snapshot/subvol creation ioctl and performs orphan
      cleanup if needed. Like the VFS, the ioctl locks the parent inode too,
      preventing race issues between a dentry lookup and snapshot creation.
      
      Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 03 Jul, 2014 2 commits
  11. 14 Jun, 2014 1 commit
  12. 13 Jun, 2014 5 commits
  13. 10 Jun, 2014 10 commits
    • Btrfs: make fsync work after cloning into a file · 7ffbb598
      Committed by Filipe Manana
      When cloning into a file, we were correctly replacing the extent
      items in the target range and removing the extent maps. However
      we weren't replacing the extent maps with new ones that point to
      the new extents - as a consequence, an incremental fsync (when the
      inode doesn't have the full sync flag) was a NOOP, since it relies
      on the existence of extent maps in the modified list of the inode's
      extent map tree, which was empty. Therefore add new extent maps to
      reflect the target clone range.
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/ · 93915584
      Committed by Antonio Ospite
      Signed-off-by: Antonio Ospite <ao2@ao2.it>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: linux-btrfs@vger.kernel.org
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled · f82a9901
      Committed by Filipe Manana
      If the NO_HOLES feature is enabled holes don't have file extent items in
      the btree that represent them anymore. This made the clone operation
      ignore the gaps that exist between consecutive file extent items and
      therefore not create the holes at the destination. When not using the
      NO_HOLES feature, the holes were created at the destination.
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX · 902c68a4
      Committed by Gui Hecheng
      To be accurate about the error case,
      if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL.
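      
      A minimal sketch of this kind of check, assuming the new size is computed by
      adding a requested increment to the current size. The grow_size() helper and its
      parameters are hypothetical and only illustrate the error code choice.

```c
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical helper: grow old_size by delta, reporting -ERANGE when the
 * sum would not fit in an unsigned long long.
 */
static int grow_size(unsigned long long old_size, unsigned long long delta,
		     unsigned long long *new_size)
{
	/* Compare before adding so the check itself cannot overflow. */
	if (delta > ULLONG_MAX - old_size)
		return -ERANGE;	/* out of range, rather than an invalid argument */
	*new_size = old_size + delta;
	return 0;
}
```

      The input here is well formed but too large to represent, which is the
      situation ERANGE describes, while EINVAL suggests a malformed argument.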
      Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: update commit root on snapshot creation after orphan cleanup · 3821f348
      Committed by Filipe Manana
      On snapshot creation (either writable or read-only), we do orphan cleanup
      against the root of the snapshot. If the cleanup did remove any orphans,
      then the current root node will be different from the commit root node
      until the next transaction commit happens.
      
      A send operation always uses the commit root of a snapshot - this means
      it will see the orphans if it starts computing the send stream before the
      next transaction commit happens (triggered by a timer or sync(), for example),
      which is when the commit root gets assigned a reference to current root,
      where the orphans are not visible anymore. The consequence of send seeing
      the orphans is explained below.
      
      For example:
      
          mkfs.btrfs -f /dev/sdd
          mount -o commit=999 /dev/sdd /mnt
      
          # open a file with O_TMPFILE and leave it open
          # write some data to the file
          btrfs subvolume snapshot -r /mnt /mnt/snap1
      
          btrfs send /mnt/snap1 -f /tmp/send.data
      
      The send operation will fail with the following error:
      
          ERROR: send ioctl failed with -116: Stale file handle
      
      What happens here is that our snapshot has an orphan inode still visible
      through the commit root, that corresponds to the tmpfile. However send
      will attempt to call inode.c:btrfs_iget(), with the goal of reading the
      file's data, which will return -ESTALE because it will use the current
      root (and not the commit root) of the snapshot.
      
      Of course, there are other cases where we can get orphans, but this
      example using a tmpfile makes it much easier to reproduce the issue.
      
      Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
      the commit root is different from the current root, just commit the
      transaction associated with the snapshot's root (if it exists), so that
      a send will not see any orphans that don't exist anymore. This also
      guarantees a send will always see the same content regardless of whether
      a transaction commit happened already before the send was requested and
      after the orphan cleanup (meaning the commit root and current roots are
      the same) or it hasn't happened yet (commit and current roots are
      different).
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: ioctl, don't re-lock extent range when not necessary · ff5df9b8
      Committed by Filipe Manana
      In ioctl.c:lock_extent_range(), after locking our target range, the
      ordered extent that btrfs_lookup_first_ordered_extent() returns us
      may not overlap our target range at all. In this case we would just
      unlock our target range, wait for any new ordered extents that overlap
      the range to complete, lock again the range and repeat all these steps
      until we don't get any ordered extent and the delalloc flag isn't set
      in the io tree for our target range.
      
      Therefore just stop if we get an ordered extent that doesn't overlap
      our target range and the delalloc flag isn't set for the range in
      the inode's io tree.
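      
      The stop condition described above can be sketched as a pure function. The
      names ranges_overlap() and must_retry_lock() are hypothetical, and the ordered
      extent is reduced to a start offset and a length for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if [a_start, a_start + a_len) and [b_start, b_start + b_len) intersect. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_len,
			   uint64_t b_start, uint64_t b_len)
{
	return a_start < b_start + b_len && b_start < a_start + a_len;
}

/*
 * Hypothetical decision helper: unlock, wait and lock again only when the
 * ordered extent found overlaps the target range [off, off + len), or when
 * delalloc is still flagged for that range.
 */
static bool must_retry_lock(bool have_ordered, uint64_t ord_start, uint64_t ord_len,
			    uint64_t off, uint64_t len, bool delalloc_set)
{
	if (have_ordered && ranges_overlap(ord_start, ord_len, off, len))
		return true;
	return delalloc_set;
}
```

      An ordered extent that lies entirely outside the target range no longer forces
      another iteration on its own; only the delalloc flag can do that.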
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: avoid visiting all extent items when cloning a range · 2c463823
      Committed by Filipe Manana
      When cloning a range of a file, we were visiting all the extent items in
      the btree that belong to our source inode. We don't need to visit those
      extent items that don't overlap the range we are cloning, as doing so only
      makes us waste time and do unnecessary btree navigations (btrfs_next_leaf)
      for inodes that have a large number of file extent items in the btree.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: set dead flag on the right root when destroying snapshot · c55bfa67
      Committed by Filipe Manana
      We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the
      parent of our target snapshot, instead of setting it in the target
      snapshot's root.
      
      This is easy to observe by running the following scenario:
      
          mkfs.btrfs -f /dev/sdd
          mount /dev/sdd /mnt
      
          btrfs subvolume create /mnt/first_subvol
          btrfs subvolume snapshot -r /mnt /mnt/mysnap1
      
          btrfs subvolume delete /mnt/first_subvol
          btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
          btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data
      
      The send command failed because the send ioctl returned -EPERM.
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: ensure readers see new data after a clone operation · c125b8bf
      Committed by Filipe Manana
      We were cleaning the clone target file range from the page cache before
      we did replace the file extent items in the fs tree. This was racy,
      as right after cleaning the relevant range from the page cache and before
      replacing the file extent items, a read against that range could be
      performed by another task and populate again the page cache with stale
      data (stale after the cloning finishes). This would result in reads after
      the clone operation successfully finishes getting old data (and potentially
      for a very long time). Therefore evict the pages after replacing the file
      extent items, so that subsequent reads will always get the new data.
      
      Similarly, we were prone to races while cloning the file extent items
      because we weren't locking the target range and waiting for any existing
      ordered extents against that range to complete. It was possible that
      after cloning the extent items, a write operation that was performed
      before the clone operation and overlaps the same range, would end up
      undoing all or part of the work the clone operation did (a worker task
      running inode.c:btrfs_finish_ordered_io). Therefore lock the target
      range in the io tree, wait for all pending ordered extents against that
      range to finish and then safely perform the cloning.
      
      The issue of reading stale data after the clone operation is easy to
      reproduce by running the following C program in a loop until it exits
      with return value 1.
      
      #include <unistd.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <errno.h>
      #include <pthread.h>
      #include <fcntl.h>
      #include <assert.h>
      #include <asm/types.h>
      #include <linux/ioctl.h>
      #include <sys/stat.h>
      #include <sys/types.h>
      #include <sys/ioctl.h>
      
      #define SRC_FILE "/mnt/sdd/foo"
      #define DST_FILE "/mnt/sdd/bar"
      #define FILE_SIZE (16 * 1024)
      #define PATTERN_SRC 'X'
      #define PATTERN_DST 'Y'
      
      struct btrfs_ioctl_clone_range_args {
      	__s64 src_fd;
      	__u64 src_offset, src_length;
      	__u64 dest_offset;
      };
      
      #define BTRFS_IOCTL_MAGIC 0x94
      #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
      				   struct btrfs_ioctl_clone_range_args)
      
      static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
      static int clone_done = 0;
      static int reader_ready = 0;
      static int stale_data = 0;
      
      static void *reader_loop(void *arg)
      {
      	char buf[4096], want_buf[4096];
      
      	memset(want_buf, PATTERN_SRC, 4096);
      	pthread_mutex_lock(&mutex);
      	reader_ready = 1;
      	pthread_mutex_unlock(&mutex);
      
      	while (1) {
      		int done, fd, ret;
      
      		fd = open(DST_FILE, O_RDONLY);
      		assert(fd != -1);
      
      		pthread_mutex_lock(&mutex);
      		done = clone_done;
      		pthread_mutex_unlock(&mutex);
      
      		ret = read(fd, buf, 4096);
      		assert(ret == 4096);
      		close(fd);
      
      		if (done) {
      			ret = memcmp(buf, want_buf, 4096);
      			if (ret == 0) {
      				printf("Found new content\n");
      			} else {
      				printf("Found old content\n");
      				pthread_mutex_lock(&mutex);
      				stale_data = 1;
      				pthread_mutex_unlock(&mutex);
      			}
      			break;
      		}
      	}
      	return NULL;
      }
      
      int main(int argc, char *argv[])
      {
      	pthread_t reader;
      	int ret, i, fd;
      	struct btrfs_ioctl_clone_range_args clone_args;
      	int fd1, fd2;
      
      	ret = remove(SRC_FILE);
      	if (ret == -1 && errno != ENOENT) {
      		fprintf(stderr, "Error deleting src file: %s\n", strerror(errno));
      		return 1;
      	}
      	ret = remove(DST_FILE);
      	if (ret == -1 && errno != ENOENT) {
      		fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno));
      		return 1;
      	}
      
      	fd = open(SRC_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
      	assert(fd != -1);
      	for (i = 0; i < FILE_SIZE; i++) {
      		char c = PATTERN_SRC;
      		ret = write(fd, &c, 1);
      		assert(ret == 1);
      	}
      	close(fd);
      	fd = open(DST_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
      	assert(fd != -1);
      	for (i = 0; i < FILE_SIZE; i++) {
      		char c = PATTERN_DST;
      		ret = write(fd, &c, 1);
      		assert(ret == 1);
      	}
      	close(fd);
      	sync();
      
      	ret = pthread_create(&reader, NULL, reader_loop, NULL);
      	assert(ret == 0);
      	while (1) {
      		int r;
      		pthread_mutex_lock(&mutex);
      		r = reader_ready;
      		pthread_mutex_unlock(&mutex);
      		if (r) break;
      	}
      
      	fd1 = open(SRC_FILE, O_RDONLY);
      	if (fd1 < 0) {
      		fprintf(stderr, "Error open src file: %s\n", strerror(errno));
      		return 1;
      	}
      	fd2 = open(DST_FILE, O_RDWR);
      	if (fd2 < 0) {
      		fprintf(stderr, "Error open dst file: %s\n", strerror(errno));
      		return 1;
      	}
      	clone_args.src_fd = fd1;
      	clone_args.src_offset = 0;
      	clone_args.src_length = 4096;
      	clone_args.dest_offset = 0;
      	ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args);
      	assert(ret == 0);
      	close(fd1);
      	close(fd2);
      
      	pthread_mutex_lock(&mutex);
      	clone_done = 1;
      	pthread_mutex_unlock(&mutex);
      	ret = pthread_join(reader, NULL);
      	assert(ret == 0);
      
      	pthread_mutex_lock(&mutex);
      	ret = stale_data ? 1 : 0;
      	pthread_mutex_unlock(&mutex);
      	return ret;
      }
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: replace simple_strtoull() with kstrtoull() · 58dfae63
      Committed by ZhangZhen
      Use the newer and more pleasant kstrtoull() to replace simple_strtoull(),
      because simple_strtoull() is marked as obsolete.
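      
      To illustrate the difference in strictness, here is a userspace sketch that
      approximates the error reporting style of kstrtoull() on top of the C library's
      strtoull(). The parse_ull_strict() helper is hypothetical and is not the kernel
      implementation; it only shows the idea of rejecting malformed input and overflow
      instead of silently returning a partial value.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper: returns 0 and stores the value on success,
 * -EINVAL for malformed input and -ERANGE on overflow.
 */
static int parse_ull_strict(const char *s, unsigned int base, unsigned long long *res)
{
	char *end;
	unsigned long long val;

	if (s[0] == '+')
		s++;			/* a single leading plus sign is allowed */
	/* strtoull() would accept leading whitespace or a minus sign; reject them. */
	if (!isalnum((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, (int)base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;		/* no digits consumed */
	if (*end == '\n')
		end++;			/* tolerate one trailing newline */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage such as "12abc" */

	*res = val;
	return 0;
}
```

      A caller checks the return value first and uses the result only when it is 0,
      so input such as "12abc" is reported as an error rather than parsed as 12.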
      Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Signed-off-by: Chris Mason <clm@fb.com>