1. 20 8月, 2015 3 次提交
    • F
      Btrfs: fix file read corruption after extent cloning and fsync · b84b8390
      Filipe Manana 提交于
      If we partially clone one extent of a file into a lower offset of the
      file, fsync the file, power fail and then mount the fs to trigger log
      replay, we can get multiple checksum items in the csum tree that overlap
      each other and result in checksum lookup failures later. Those failures
      can make file data read requests assume a checksum value of 0, but they
      will not return an error (-EIO for example) to userspace exactly because
      the expected checksum value 0 is a special value that makes the read bio
      endio callback return success and set all the bytes of the corresponding
      page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
      From a userspace perspective this is equivalent to file corruption
      because we are not returning what was written to the file.
      
      Details about how this can happen, and why, are included inline in the
      following reproducer test case for fstests and the comment added to
      tree-log.c.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_cloner
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with a single 100K extent starting at file
        # offset 800K. We fsync the file here to make the fsync log tree gets
        # a single csum item that covers the whole 100K extent, which causes
        # the second fsync, done after the cloning operation below, to not
        # leave in the log tree two csum items covering two sub-ranges
        # ([0, 20K[ and [20K, 100K[)) of our extent.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                        -c "fsync"                     \
                         $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now clone part of our extent into file offset 400K. This adds a file
        # extent item to our inode's metadata that points to the 100K extent
        # we created before, using a data offset of 20K and a data length of
        # 20K, so that it refers to the sub-range [20K, 40K[ of our original
        # extent.
        $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
            -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        # Now fsync our file to make sure the extent cloning is durably
        # persisted. This fsync will not add a second csum item to the log
        # tree containing the checksums for the blocks in the sub-range
        # [20K, 40K[ of our extent, because there was already a csum item in
        # the log tree covering the whole extent, added by the first fsync
        # we did before.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        # Silently drop all writes and ummount to simulate a crash/power
        # failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        # The fsync log replay first processes the file extent item
        # corresponding to the file offset 400K (the one which refers to the
        # [20K, 40K[ sub-range of our 100K extent) and then processes the file
        # extent item for file offset 800K. It used to happen that when
        # processing the later, it erroneously left in the csum tree 2 csum
        # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
        # 1 for the whole range of our extent. This introduced a problem where
        # subsequent lookups for the checksums of blocks within the range
        # [40K, 100K[ of our extent would not find anything because lookups in
        # the csum tree ended up looking only at the smaller csum item, the
        # one covering the subrange [20K, 40K[. This made read requests assume
        # an expected checksum with a value of 0 for those blocks, which caused
        # checksum verification failure when the read operations finished.
        # However those checksum failure did not result in read requests
        # returning an error to user space (like -EIO for e.g.) because the
        # expected checksum value had the special value 0, and in that case
        # btrfs set all bytes of the corresponding pages with the value 0x01
        # and produce the following warning in dmesg/syslog:
        #
        #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
        #   1322675045 expected csum 0"
        #
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest he had after cloning the extent and
        # before the power failure happened.
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b84b8390
    • Z
      btrfs: Remove unused arguments in tree-log.c · 60d53eb3
      Zhaolei 提交于
      Following arguments are not used in tree-log.c:
       insert_one_name(): path, type
       wait_log_commit(): trans
       wait_for_writer(): trans
      
      This patch remove them.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      60d53eb3
    • Z
      btrfs: Remove useless condition in start_log_trans() · 34eb2a52
      Zhaolei 提交于
      Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
      for start_log_trans():
       fs/btrfs/tree-log.c:178 start_log_trans()
       warn: we tested 'root->log_root' before and it was 'false'
      
       fs/btrfs/tree-log.c
       147          if (root->log_root) {
       We test "root->log_root" here.
       ...
      
      Reason:
       Condition of:
       fs/btrfs/tree-log.c:178: if (!root->log_root) {
       is not necessary after commit: 7237f183
      
       It caused a smatch warning, and no functionally error.
      
      Fix:
       Deleting above condition will make smatch shut up,
       but a better way is to do cleanup for start_log_trans()
       to remove duplicated code and make code more readable.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      34eb2a52
  2. 09 8月, 2015 2 次提交
    • F
      Btrfs: fix stale dir entries after removing a link and fsync · 18aa0922
      Filipe Manana 提交于
      We have one more case where after a log tree is replayed we get
      inconsistent metadata leading to stale directory entries, due to
      some directories having entries pointing to some inode while the
      inode does not have a matching BTRFS_INODE_[REF|EXTREF]_KEY item.
      
      To trigger the problem we need to have a file with multiple hard links
      belonging to different parent directories. Then if one of those hard
      links is removed and we fsync the file using one of its other links
      that belongs to a different parent directory, we end up not logging
      the fact that the removed hard link doesn't exists anymore in the
      parent directory.
      
      Simple reproducer:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test directory and file.
        mkdir $SCRATCH_MNT/testdir
        touch $SCRATCH_MNT/foo
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo2
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo3
      
        # Make sure everything done so far is durably persisted.
        sync
      
        # Now we remove one of our file's hardlinks in the directory testdir.
        unlink $SCRATCH_MNT/testdir/foo3
      
        # We now fsync our file using the "foo" link, which has a parent that
        # is not the directory "testdir".
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Silently drop all writes and unmount to simulate a crash/power
        # failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger journal/log replay.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After the journal/log is replayed we expect to not see the "foo3"
        # link anymore and we should be able to remove all names in the
        # directory "testdir" and then remove it (no stale directory entries
        # left after the journal/log replay).
        echo "Entries in testdir:"
        ls -1 $SCRATCH_MNT/testdir
      
        rm -f $SCRATCH_MNT/testdir/*
        rmdir $SCRATCH_MNT/testdir
      
        _unmount_flakey
      
        status=0
        exit
      
      The test fails with:
      
        $ ./check generic/107
        FSTYP         -- btrfs
        PLATFORM      -- Linux/x86_64 debian3 4.1.0-rc6-btrfs-next-11+
        MKFS_OPTIONS  -- /dev/sdc
        MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
      
        generic/107 3s ... - output mismatch (see .../results/generic/107.out.bad)
          --- tests/generic/107.out	2015-08-01 01:39:45.807462161 +0100
          +++ /home/fdmanana/git/hub/xfstests/results//generic/107.out.bad
          @@ -1,3 +1,5 @@
           QA output created by 107
           Entries in testdir:
           foo2
          +foo3
          +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
          ...
          _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent \
            (see /home/fdmanana/git/hub/xfstests/results//generic/107.full)
          _check_dmesg: something found in dmesg (see .../results/generic/107.dmesg)
        Ran: generic/107
        Failures: generic/107
        Failed 1 of 1 tests
      
        $ cat /home/fdmanana/git/hub/xfstests/results//generic/107.full
        (...)
        checking fs roots
        root 5 inode 257 errors 200, dir isize wrong
      	unresolved ref dir 257 index 3 namelen 4 name foo3 filetype 1 errors 5, no dir item, no inode ref
        (...)
      
      And produces the following warning in dmesg:
      
        [127298.759064] BTRFS info (device dm-0): failed to delete reference to foo3, inode 258 parent 257
        [127298.762081] ------------[ cut here ]------------
        [127298.763311] WARNING: CPU: 10 PID: 7891 at fs/btrfs/inode.c:3956 __btrfs_unlink_inode+0x182/0x35a [btrfs]()
        [127298.767327] BTRFS: Transaction aborted (error -2)
        (...)
        [127298.788611] Call Trace:
        [127298.789137]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [127298.790090]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
        [127298.791157]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [127298.792323]  [<ffffffffa065ad09>] ? __btrfs_unlink_inode+0x182/0x35a [btrfs]
        [127298.793633]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
        [127298.794699]  [<ffffffffa065ad09>] __btrfs_unlink_inode+0x182/0x35a [btrfs]
        [127298.797640]  [<ffffffffa065be8f>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
        [127298.798876]  [<ffffffffa065bf11>] btrfs_unlink+0x60/0x9b [btrfs]
        [127298.800154]  [<ffffffff8116fb48>] vfs_unlink+0x9c/0xed
        [127298.801303]  [<ffffffff81173481>] do_unlinkat+0x12b/0x1fb
        [127298.802450]  [<ffffffff81253855>] ? lockdep_sys_exit_thunk+0x12/0x14
        [127298.803797]  [<ffffffff81174056>] SyS_unlinkat+0x29/0x2b
        [127298.805017]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
        [127298.806310] ---[ end trace bbfddacb7aaada7b ]---
        [127298.807325] BTRFS warning (device dm-0): __btrfs_unlink_inode:3956: Aborting unused transaction(No such entry).
      
      So fix this by logging all parent inodes, current and old ones, to make
      sure we do not get stale entries after log replay. This is not a simple
      solution such as triggering a full transaction commit because it would
      imply full transaction commit when an inode is fsynced in the same
      transaction that modified it and reloaded it after eviction (because its
      last_unlink_trans is set to the same value as its last_trans as of the
      commit with the title "Btrfs: fix stale dir entries after unlink, inode
      eviction and fsync"), and it would also make fstest generic/066 fail
      since one of the fsyncs triggers a full commit and the next fsync will
      not find the inode in the log anymore (therefore not removing the xattr).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      18aa0922
    • F
      Btrfs: fix stale directory entries after fsync log replay · bb53eda9
      Filipe Manana 提交于
      We have another case where after an fsync log replay we get an inode with
      a wrong link count (smaller than it should be) and a number of directory
      entries greater than its link count. This happens when we add a new link
      hard link to our inode A and then we fsync some other inode B that has
      the side effect of logging the parent directory inode too. In this case
      at log replay time we add the new hard link to our inode (the item with
      key BTRFS_INODE_REF_KEY) when processing the parent directory but we
      never adjust the link count of our inode A. As a result we get stale dir
      entries for our inode A that can never be deleted and therefore it makes
      it impossible to remove the parent directory (as its i_size can never
      decrease back to 0).
      
      A simple reproducer for fstests that triggers this issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test directory and files.
        mkdir $SCRATCH_MNT/testdir
        touch $SCRATCH_MNT/testdir/foo
        touch $SCRATCH_MNT/testdir/bar
      
        # Make sure everything done so far is durably persisted.
        sync
      
        # Create one hard link for file foo and another one for file bar. After
        # that fsync only the file bar.
        ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link
        ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar
      
        # Silently drop all writes on scratch device to simulate power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount the fs to trigger log/journal replay.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now verify both our files have a link count of 2.
        echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)"
        echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)"
      
        # We should be able to remove all the links of our files in testdir, and
        # after that the parent directory should become empty and therefore
        # possible to remove it.
        rm -f $SCRATCH_MNT/testdir/*
        rmdir $SCRATCH_MNT/testdir
      
        _unmount_flakey
      
        # The fstests framework will call fsck against our filesystem which will verify
        # that all metadata is in a consistent state.
      
        status=0
        exit
      
      The test fails with:
      
       -Link count for file foo: 2
       +Link count for file foo: 1
        Link count for file bar: 2
       +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle
       +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
       (...)
       _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
      
      And fsck's output:
      
        (...)
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
            unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref
        Checking filesystem on /dev/sdc
        (...)
      
      So fix this by marking inodes for link count fixup at log replay time
      whenever a directory entry is replayed if the entry was created in the
      transaction where the fsync was made and if it points to a non-directory
      inode.
      
      This isn't a new problem/regression, the issue exists for a long time,
      possibly since the log tree feature was added (2008).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bb53eda9
  3. 02 7月, 2015 1 次提交
    • F
      Btrfs: fix fsync after truncate when no_holes feature is enabled · a89ca6f2
      Filipe Manana 提交于
      When we have the no_holes feature enabled, if a we truncate a file to a
      smaller size, truncate it again but to a size greater than or equals to
      its original size and fsync it, the log tree will not have any information
      about the hole covering the range [truncate_1_offset, new_file_size[.
      Which means if the fsync log is replayed, the file will remain with the
      state it had before both truncate operations.
      
      Without the no_holes feature this does not happen, since when the inode
      is logged (full sync flag is set) it will find in the fs/subvol tree a
      leaf with a generation matching the current transaction id that has an
      explicit extent item representing the hole.
      
      Fix this by adding an explicit extent item representing a hole between
      the last extent and the inode's i_size if we are doing a full sync.
      
      The issue is easy to reproduce with the following test case for fstests:
      
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test files and make sure everything is durably persisted.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K"         \
                        -c "pwrite -S 0xbb 64K 61K"       \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K"         \
                        -c "pwrite -S 0xff 64K 61K"       \
                        $SCRATCH_MNT/bar | _filter_xfs_io
        sync
      
        # Now truncate our file foo to a smaller size (64Kb) and then truncate
        # it to the size it had before the shrinking truncate (125Kb). Then
        # fsync our file. If a power failure happens after the fsync, we expect
        # our file to have a size of 125Kb, with the first 64Kb of data having
        # the value 0xaa and the second 61Kb of data having the value 0x00.
        $XFS_IO_PROG -c "truncate 64K" \
                     -c "truncate 125K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo
      
        # Do something similar to our file bar, but the first truncation sets
        # the file size to 0 and the second truncation expands the size to the
        # double of what it was initially.
        $XFS_IO_PROG -c "truncate 0" \
                     -c "truncate 253K" \
                     -c "fsync" \
                     $SCRATCH_MNT/bar
      
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # We expect foo to have a size of 125Kb, the first 64Kb of data all
        # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
        # with value 0x00).
        echo "File foo content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # We expect bar to have a size of 253Kb and no extents (any byte read
        # from bar has the value 0x00).
        echo "File bar content after log replay:"
        od -t x1 $SCRATCH_MNT/bar
      
        status=0
        exit
      
      The expected file contents in the golden output are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0372000
        File bar content after log replay:
        0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      Without this fix, their contents are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0372000
        File bar content after log replay:
        0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
        *
        0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      A test case submission for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a89ca6f2
  4. 01 7月, 2015 2 次提交
    • F
      Btrfs: fix fsync xattr loss in the fast fsync path · 36283bf7
      Filipe Manana 提交于
      After commit 4f764e51 ("Btrfs: remove deleted xattrs on fsync log
      replay"), we can end up in a situation where during log replay we end up
      deleting xattrs that were never deleted when their file was last fsynced.
      
      This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
      not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
      set, the xattr was added in a past transaction and the leaf where the
      xattr is located was not updated (COWed or created) in the current
      transaction. In this scenario the xattr item never ends up in the log
      tree and therefore at log replay time, which makes the replay code delete
      the xattr from the fs/subvol tree as it thinks that xattr was deleted
      prior to the last fsync.
      
      Fix this by always logging all xattrs, which is the simplest and most
      reliable way to detect deleted xattrs and replay the deletes at log replay
      time.
      
      This issue is reproducible with the following test case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        here=`pwd`
        tmp=/tmp/$$
        status=1	# failure is the default!
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
        . ./common/attr
      
        # real QA test starts here
      
        # We create a lot of xattrs for a single file. Only btrfs and xfs are currently
        # able to store such a large mount of xattrs per file, other filesystems such
        # as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
        # less than 1000 xattrs with very small values.
        _supported_fs btrfs xfs
        _supported_os Linux
        _need_to_be_root
        _require_scratch
        _require_dm_flakey
        _require_attrs
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create the test file with some initial data and make sure everything is
        # durably persisted.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
        sync
      
        # Add many small xattrs to our file.
        # We create such a large amount because it's needed to trigger the issue found
        # in btrfs - we need to have an amount that causes the fs to have at least 3
        # btree leafs with xattrs stored in them, and it must work on any leaf size
        # (maximum leaf/node size is 64Kb).
        num_xattrs=2000
        for ((i = 1; i <= $num_xattrs; i++)); do
            name="user.attr_$(printf "%04d" $i)"
            $SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
        done
      
        # Sync the filesystem to force a commit of the current btrfs transaction, this
        # is a necessary condition to trigger the bug on btrfs.
        sync
      
        # Now update our file's data and fsync the file.
        # After a successful fsync, if the fsync log/journal is replayed we expect to
        # see all the xattrs we added before with the same values (and the updated file
        # data of course). Btrfs used to delete some of these xattrs when it replayed
        # its fsync log/journal.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount. This makes the fs replay its fsync log.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File content after crash and log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        echo "File xattrs after crash and log replay:"
        for ((i = 1; i <= $num_xattrs; i++)); do
            name="user.attr_$(printf "%04d" $i)"
            echo -n "$name="
            $GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
            echo
        done
      
        status=0
        exit
      
      The golden output expects all xattrs to be available, and with the correct
      values, after the fsync log is replayed.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      36283bf7
    • F
      Btrfs: fix fsync data loss after append write · e4545de5
      Filipe Manana 提交于
      If we do an append write to a file (which increases its inode's i_size)
      that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
      and the previous transaction added a new hard link to the file, which sets
      the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
      the file, the inode's new i_size isn't logged. This has the consequence
      that after the fsync log is replayed, the file size remains what it was
      before the append write operation, which means users/applications will
      not be able to read the data that was successsfully fsync'ed before.
      
      This happens because neither the inode item nor the delayed inode get
      their i_size updated when the append write is made - doing so would
      require starting a transaction in the buffered write path, something that
      we do not do intentionally for performance reasons.
      
      Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
      set the inode is logged with its current i_size (log the in-memory inode
      into the log tree).
      
      This issue is not a recent regression and is easy to reproduce with the
      following test case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        here=`pwd`
        tmp=/tmp/$$
        status=1	# failure is the default!
      
        _cleanup()
        {
                _cleanup_flakey
                rm -f $tmp.*
        }
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _supported_fs generic
        _supported_os Linux
        _need_to_be_root
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        _crash_and_mount()
        {
                # Simulate a crash/power loss.
                _load_flakey_table $FLAKEY_DROP_WRITES
                _unmount_flakey
                # Allow writes again and mount. This makes the fs replay its fsync log.
                _load_flakey_table $FLAKEY_ALLOW_WRITES
                _mount_flakey
        }
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create the test file with some initial data and then fsync it.
        # The fsync here is only needed to trigger the issue in btrfs, as it causes the
        # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                        -c "fsync" \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        sync
      
        # Add a hard link to our file.
        # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
        # which is a necessary condition to trigger the issue.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
      
        # Sync the filesystem to force a commit of the current btrfs transaction, this
        # is a necessary condition to trigger the bug on btrfs.
        sync
      
        # Now append more data to our file, increasing its size, and fsync the file.
        # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
        # write path did not update the inode item in the btree nor the delayed inode
        # item (in memory struture) in the current transaction (created by the fsync
        # handler), the fsync did not record the inode's new i_size in the fsync
        # log/journal. This made the data unavailable after the fsync log/journal is
        # replayed.
        $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo | _filter_xfs_io
      
        echo "File content after fsync and before crash:"
        od -t x1 $SCRATCH_MNT/foo
      
        _crash_and_mount
      
        echo "File content after crash and log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected file output before and after the crash/power failure expects the
      appended data to be available, which is:
      
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0200000
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e4545de5
  5. 03 6月, 2015 1 次提交
    • L
      Btrfs: remove csum_bytes_left · 0c304304
      Liu Bo 提交于
      After commit 8407f553
      ("Btrfs: fix data corruption after fast fsync and writeback error"),
      during wait_ordered_extents(), we wait for ordered extent setting
      BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've
      already got checksum information, so we don't need to check
      (csum_bytes_left == 0) in the whole logging path.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0c304304
  6. 16 4月, 2015 1 次提交
  7. 11 4月, 2015 1 次提交
    • C
      btrfs: actively run the delayed refs while deleting large files · 28ed1345
      Chris Mason 提交于
      When we are deleting large files with large extents, we are building up
      a huge set of delayed refs for processing.  Truncate isn't checking
      often enough to see if we need to back off and process those, or let
      a commit proceed.
      
      The end result is long stalls after the rm, and very long commit times.
      During the commits, other processes back up waiting to start new
      transactions and we get into trouble.
      Signed-off-by: NChris Mason <clm@fb.com>
      28ed1345
  8. 27 3月, 2015 2 次提交
    • F
      Btrfs: fix metadata inconsistencies after directory fsync · 2f2ff0ee
      Filipe Manana 提交于
      We can get into inconsistency between inodes and directory entries
      after fsyncing a directory. The issue is that while a directory gets
      the new dentries persisted in the fsync log and replayed at mount time,
      the link count of the inode that directory entries point to doesn't
      get updated, staying with an incorrect link count (smaller then the
      correct value). This later leads to stale file handle errors when
      accessing (including attempt to delete) some of the links if all the
      other ones are removed, which also implies impossibility to delete the
      parent directories, since the dentries can not be removed.
      
      Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
      when fsyncing a directory, new files aren't logged (their metadata and
      dentries) nor any child directories. So this patch fixes this issue too,
      since it has the same resolution as the incorrect inode link count issue
      mentioned before.
      
      This is very easy to reproduce, and the following excerpt from my test
      case for xfstests shows how:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file and directory.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
        mkdir $SCRATCH_MNT/mydir
      
        # Make sure all metadata and data are durably persisted.
        sync
      
        # Add a hard link to 'foo' inside our test directory and fsync only the
        # directory. The btrfs fsync implementation had a bug that caused the new
        # directory entry to be visible after the fsync log replay but, the inode
        # of our file remained with a link count of 1.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
      
        # Add a few more links and new files.
        # This is just to verify nothing breaks or gives incorrect results after the
        # fsync log is replayed.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
        $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
        ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
      
        # Add some subdirectories and new files and links to them. This is to verify
        # that after fsyncing our top level directory 'mydir', all the subdirectories
        # and their files/links are registered in the fsync log and exist after the
        # fsync log is replayed.
        mkdir -p $SCRATCH_MNT/mydir/x/y/z
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
        touch $SCRATCH_MNT/mydir/x/y/z/qwerty
      
        # Now fsync only our top directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
      
        # And fsync now our new file named 'hello', just to verify later that it has
        # the expected content and that the previous fsync on the directory 'mydir' had
        # no bad influence on this fsync.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
        # all with the value 0xaa.
        echo "File 'foo' content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Remove the first name of our inode. Because of the directory fsync bug, the
        # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
        # deleting the inode and the other names became stale directory entries (still
        # visible to applications). Attempting to remove or access the remaining
        # dentries pointing to that inode resulted in stale file handle errors and
        # made it impossible to remove the parent directories since it was impossible
        # for them to become empty.
        echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
        rm -f $SCRATCH_MNT/foo
      
        # Now verify that all files, links and directories created before fsyncing our
        # directory exist after the fsync log was replayed.
        [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
        [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
        [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
        [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
            echo "Link mydir/x/y/foo_y_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
            echo "Link mydir/x/y/z/foo_z_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
            echo "File mydir/x/y/z/qwerty is missing"
      
        # We expect our file here to have a size of 64Kb and all the bytes having the
        # value 0xff.
        echo "file 'hello' content after log replay:"
        od -t x1 $SCRATCH_MNT/hello
      
        # Now remove all files/links, under our test directory 'mydir', and verify we
        # can remove all the directories.
        rm -f $SCRATCH_MNT/mydir/x/y/z/*
        rmdir $SCRATCH_MNT/mydir/x/y/z
        rm -f $SCRATCH_MNT/mydir/x/y/*
        rmdir $SCRATCH_MNT/mydir/x/y
        rmdir $SCRATCH_MNT/mydir/x
        rm -f $SCRATCH_MNT/mydir/*
        rmdir $SCRATCH_MNT/mydir
      
        # An fsck, run by the fstests framework everytime a test finishes, also detected
        # the inconsistency and printed the following error message:
        #
        # root 5 inode 257 errors 2001, no inode item, link count wrong
        #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
        #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
        status=0
        exit
      
      The expected golden output for the test is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 5
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
      
      Which is the output after this patch and when running the test against
      ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
      output is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 1
        Link mydir/foo_2 is missing
        Link mydir/foo_3 is missing
        Link mydir/x/y/foo_y_link is missing
        Link mydir/x/y/z/foo_z_link is missing
        File mydir/x/y/z/qwerty is missing
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
      
      Fsck, without this fix, also complains about the wrong link count:
      
        root 5 inode 257 errors 2001, no inode item, link count wrong
            unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
            unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
      So fix this by logging the inodes that the dentries point to when
      fsyncing a directory.
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2f2ff0ee
    • F
      Btrfs: remove deleted xattrs on fsync log replay · 4f764e51
      Filipe Manana 提交于
      If we deleted xattrs from a file and fsynced the file, after a log replay
      the xattrs would remain associated to the file. This was an unexpected
      behaviour and differs from what other filesystems do, such as for example
      xfs and ext3/4.
      
      Fix this by, on fsync log replay, check if every xattr in the fs/subvol
      tree (that belongs to a logged inode) has a matching xattr in the log,
      and if it does not, delete it from the fs/subvol tree. This is a similar
      approach to what we do for dentries when we replay a directory from the
      fsync log.
      
      This issue is trivial to reproduce, and the following excerpt from my
      test for xfstests triggers the issue:
      
        _crash_and_mount()
        {
             # Simulate a crash/power loss.
             _load_flakey_table $FLAKEY_DROP_WRITES
             _unmount_flakey
             _load_flakey_table $FLAKEY_ALLOW_WRITES
             _mount_flakey
        }
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create out test file and add 3 xattrs to it.
        touch $SCRATCH_MNT/foobar
        $SETFATTR_PROG -n user.attr1 -v val1 $SCRATCH_MNT/foobar
        $SETFATTR_PROG -n user.attr2 -v val2 $SCRATCH_MNT/foobar
        $SETFATTR_PROG -n user.attr3 -v val3 $SCRATCH_MNT/foobar
      
        # Make sure everything is durably persisted.
        sync
      
        # Now delete the second xattr and fsync the inode.
        $SETFATTR_PROG -x user.attr2 $SCRATCH_MNT/foobar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        _crash_and_mount
      
        # After the fsync log is replayed, the file should have only 2 xattrs, the ones
        # named user.attr1 and user.attr3. The btrfs fsync log replay bug left the file
        # with the 3 xattrs that we had before deleting the second one and fsyncing the
        # file.
        echo "xattr names and values after first fsync log replay:"
        $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch
      
        # Now write some data to our file, fsync it, remove the first xattr, add a new
        # hard link to our file and commit the fsync log by fsyncing some other new
        # file. This is to verify that after log replay our first xattr does not exist
        # anymore.
        echo "hello world!" >> $SCRATCH_MNT/foobar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
        $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar
        ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link
        touch $SCRATCH_MNT/qwerty
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty
      
        _crash_and_mount
      
        # Now only the xattr with name user.attr3 should be set in our file.
        echo "xattr names and values after second fsync log replay:"
        $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch
      
        status=0
        exit
      
      The expected golden output, which is produced with this patch applied or
      when testing against xfs or ext3/4, is:
      
        xattr names and values after first fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr1="val1"
        user.attr3="val3"
      
        xattr names and values after second fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr3="val3"
      
      Without this patch applied, the output is:
      
        xattr names and values after first fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr1="val1"
        user.attr2="val2"
        user.attr3="val3"
      
        xattr names and values after second fsync log replay:
        # file: SCRATCH_MNT/foobar
        user.attr1="val1"
        user.attr2="val2"
        user.attr3="val3"
      
      A patch with a test case for xfstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4f764e51
  9. 06 3月, 2015 1 次提交
  10. 17 2月, 2015 1 次提交
    • D
      Btrfs: disk-io: replace root args iff only fs_info used · 01d58472
      Daniel Dressler 提交于
      This is the 3rd independent patch of a larger project to cleanup btrfs's
      internal usage of btrfs_root. Many functions take btrfs_root only to
      grab the fs_info struct.
      
      By requiring a root these functions cause programmer overhead. That
      these functions can accept any valid root is not obvious until
      inspection.
      
      This patch reduces the specificity of such functions to accept the
      fs_info directly.
      
      These patches can be applied independently and thus are not being
      submitted as a patch series. There should be about 26 patches by the
      project's completion. Each patch will cleanup between 1 and 34 functions
      apiece.  Each patch covers a single file's functions.
      
      This patch affects the following function(s):
        1) csum_tree_block
        2) csum_dirty_buffer
        3) check_tree_block_fsid
        4) btrfs_find_tree_block
        5) clean_tree_block
      Signed-off-by: NDaniel Dressler <danieru.dressler@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      01d58472
  11. 15 2月, 2015 3 次提交
    • F
      Btrfs: don't remove extents and xattrs when logging new names · a742994a
      Filipe Manana 提交于
      If we are recording in the tree log that an inode has new names (new hard
      links were added), we would drop items, belonging to the inode, that we
      shouldn't:
      
      1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
         flags, we ended up dropping all the extent and xattr items that were
         previously logged. This was done only in memory, since logging a new
         name doesn't imply syncing the log;
      
      2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
         flags, we ended up dropping all the xattr items that were previously
         logged. Like the case before, this was done only in memory because
         logging a new name doesn't imply syncing the log.
      
      This led to some surprises in scenarios such as the following:
      
      1) write some extents to an inode;
      2) fsync the inode;
      3) truncate the inode or delete/modify some of its xattrs
      4) add a new hard link for that inode
      5) fsync some other file, to force the log tree to be durably persisted
      6) power failure happens
      
      The next time the fs is mounted, the fsync log replay code is executed,
      and the resulting file doesn't have the content it had when the last fsync
      against it was performed, instead if has a content matching what it had
      when the last transaction commit happened.
      
      So change the behaviour such that when a new name is logged, only the inode
      item and reference items are processed.
      
      This is easy to reproduce with the test I just made for xfstests, whose
      main body is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with some data.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Make sure the file is durably persisted.
        sync
      
        # Append some data to our file, to increase its size.
        $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, so from this point on if a crash/power failure happens, our
        # new data is guaranteed to be there next time the fs is mounted.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Now shrink our file to 5000 bytes.
        $XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo
      
        # Now do an expanding truncate to a size larger than what we had when we last
        # fsync'ed our file. This is just to verify that after power failure and
        # replaying the fsync log, our file matches what it was when we last fsync'ed
        # it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
        # data had a value of 0xcc.
        $XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo
      
        # Add one hard link to our file. This made btrfs drop all of our file's
        # metadata from the fsync log, including the metadata relative to the
        # extent we just wrote and fsync'ed. This change was made only to the fsync
        # log in memory, so adding the hard link alone doesn't change the persisted
        # fsync log. This happened because the previous truncates set the runtime
        # flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now make sure the in memory fsync log is durably persisted.
        # Creating and fsync'ing another file will do it.
        # After this our persisted fsync log will no longer have metadata for our file
        # foo that points to the extent we wrote and fsync'ed before.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # As expected, before the crash/power failure, we should be able to see a file
        # with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
        # the remaining bytes with value 0x00.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After mounting the fs again, the fsync log was replayed.
        # The expected result is to see a file with a size of 12Kb, with its first 8Kb
        # of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
        # The btrfs bug used to leave the file as it used te be as of the last
        # transaction commit - that is, with a size of 8Kb with all bytes having a
        # value of 0xaa.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      The test case for xfstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a742994a
    • F
      Btrfs: fix fsync data loss after adding hard link to inode · 1a4bcf47
      Filipe Manana 提交于
      We have a scenario where after the fsync log replay we can lose file data
      that had been previously fsync'ed if we added an hard link for our inode
      and after that we sync'ed the fsync log (for example by fsync'ing some
      other file or directory).
      
      This is because when adding an hard link we updated the inode item in the
      log tree with an i_size value of 0. At that point the new inode item was
      in memory only and a subsequent fsync log replay would not make us lose
      the file data. However if after adding the hard link we sync the log tree
      to disk, by fsync'ing some other file or directory for example, we ended
      up losing the file data after log replay, because the inode item in the
      persisted log tree had an an i_size of zero.
      
      This is easy to reproduce, and the following excerpt from my test for
      xfstests shows this:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create one file with data and fsync it.
        # This made the btrfs fsync log persist the data and the inode metadata with
        # a correct inode->i_size (4096 bytes).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now add one hard link to our file. This made the btrfs code update the fsync
        # log, in memory only, with an inode metadata having a size of 0.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now force persistence of the fsync log to disk, for example, by fsyncing some
        # other file.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # Before a power loss or crash, we could read the 4Kb of data from our file as
        # expected.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After the fsync log replay, because the fsync log had a value of 0 for our
        # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
        # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      Another alternative test, that doesn't need to fsync an inode in the same
      transaction it was created, is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with some data.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Make sure the file is durably persisted.
        sync
      
        # Append some data to our file, to increase its size.
        $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, so from this point on if a crash/power failure happens, our
        # new data is guaranteed to be there next time the fs is mounted.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Add one hard link to our file. This made btrfs write into the in memory fsync
        # log a special inode with generation 0 and an i_size of 0 too. Note that this
        # didn't update the inode in the fsync log on disk.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now make sure the in memory fsync log is durably persisted.
        # Creating and fsync'ing another file will do it.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # As expected, before the crash/power failure, we should be able to read the
        # 12Kb of file data.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After mounting the fs again, the fsync log was replayed.
        # The btrfs fsync log replay code didn't update the i_size of the persisted
        # inode because the inode item in the log had a special generation with a
        # value of 0 (and it couldn't know the correct i_size, since that inode item
        # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
        # effectively lost.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      This isn't a new issue/regression. This problem has been around since the
      log tree code was added in 2008:
      
        Btrfs: Add a write ahead tree log to optimize synchronous operations
        (commit e02119d5)
      
      Test cases for xfstests follow soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1a4bcf47
    • F
      Btrfs: fix scheduler warning when syncing log · 575849ec
      Filipe Manana 提交于
      We try to lock a mutex while the current task state is not TASK_RUNNING,
      which results in the following warning when CONFIG_DEBUG_LOCK_ALLOC=y:
      
      [30736.772501] ------------[ cut here ]------------
      [30736.774545] WARNING: CPU: 9 PID: 19972 at kernel/sched/core.c:7300 __might_sleep+0x8b/0xa8()
      [30736.783453] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff8107499b>] prepare_to_wait+0x43/0x89
      [30736.786261] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc psmouse parport pcspkr microcode serio_raw evdev processor thermal_sys i2c_piix4 i2c_core button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring e1000 virtio scsi_mod
      [30736.794323] CPU: 9 PID: 19972 Comm: fsstress Not tainted 3.19.0-rc7-btrfs-next-5+ #1
      [30736.795821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [30736.798788]  0000000000000009 ffff88042743fbd8 ffffffff814248ed ffff88043d32f2d8
      [30736.800504]  ffff88042743fc28 ffff88042743fc18 ffffffff81045338 0000000000000001
      [30736.802131]  ffffffff81064514 ffffffff817c52d1 000000000000026d 0000000000000000
      [30736.803676] Call Trace:
      [30736.804256]  [<ffffffff814248ed>] dump_stack+0x4c/0x65
      [30736.805245]  [<ffffffff81045338>] warn_slowpath_common+0xa1/0xbb
      [30736.806360]  [<ffffffff81064514>] ? __might_sleep+0x8b/0xa8
      [30736.807391]  [<ffffffff81045398>] warn_slowpath_fmt+0x46/0x48
      [30736.808511]  [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89
      [30736.809620]  [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89
      [30736.810691]  [<ffffffff81064514>] __might_sleep+0x8b/0xa8
      [30736.811703]  [<ffffffff81426eaf>] mutex_lock_nested+0x2f/0x3a0
      [30736.812889]  [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab
      [30736.814138]  [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf
      [30736.819878]  [<ffffffffa038cfff>] wait_for_writer.isra.12+0x91/0xaa [btrfs]
      [30736.821260]  [<ffffffff810748bd>] ? signal_pending_state+0x31/0x31
      [30736.822410]  [<ffffffffa0391f0a>] btrfs_sync_log+0x160/0x947 [btrfs]
      [30736.823574]  [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab
      [30736.824847]  [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf
      [30736.825972]  [<ffffffffa036e555>] btrfs_sync_file+0x2b0/0x319 [btrfs]
      [30736.827684]  [<ffffffff8117901a>] vfs_fsync_range+0x21/0x23
      [30736.828932]  [<ffffffff81179038>] vfs_fsync+0x1c/0x1e
      [30736.829917]  [<ffffffff8117928b>] do_fsync+0x34/0x4e
      [30736.830862]  [<ffffffff811794b3>] SyS_fsync+0x10/0x14
      [30736.831819]  [<ffffffff8142a512>] system_call_fastpath+0x12/0x17
      [30736.832982] ---[ end trace c0b57df60d32ae5c ]---
      
      Fix this my acquiring the mutex after calling finish_wait(), which sets the
      task's state to TASK_RUNNING.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      575849ec
  12. 05 2月, 2015 1 次提交
  13. 03 2月, 2015 1 次提交
  14. 22 1月, 2015 3 次提交
    • F
      Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs · df8d116f
      Filipe Manana 提交于
      If we have an inode with a large number of hard links, some of which may
      be extrefs, turn a regular ref into an extref, fsync the inode and then
      replay the fsync log (after a crash/reboot), we can endup with an fsync
      log that makes the replay code always fail with -EOVERFLOW when processing
      the inode's references.
      
      This is easy to reproduce with the test case I made for xfstests. Its steps
      are the following:
      
         _scratch_mkfs "-O extref" >> $seqres.full 2>&1
         _init_flakey
         _mount_flakey
      
         # Create a test file with 3001 hard links. This number is large enough to
         # make btrfs start using extrefs at some point even if the fs has the maximum
         # possible leaf/node size (64Kb).
         echo "hello world" > $SCRATCH_MNT/foo
         for i in `seq 1 3000`; do
             ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
         done
      
         # Make sure all metadata and data are durably persisted.
         sync
      
         # Now remove one link, add a new one with a new name, add another new one with
         # the same name as the one we just removed and fsync the inode.
         rm -f $SCRATCH_MNT/foo_link_0001
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
         rm -f $SCRATCH_MNT/foo_link_0002
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
         $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
         # Simulate a crash/power loss. This makes sure the next mount
         # will see an fsync log and will replay that log.
      
         _load_flakey_table $FLAKEY_DROP_WRITES
         _unmount_flakey
      
         _load_flakey_table $FLAKEY_ALLOW_WRITES
         _mount_flakey
      
         # Check that the number of hard links is correct, we are able to remove all
         # the hard links and read the file's data. This is just to verify we don't
         # get stale file handle errors (due to dangling directory index entries that
         # point to inodes that no longer exist).
         echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
         [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
         for ((i = 1; i <= 3003; i++)); do
             name=foo_link_`printf "%04d" $i`
             if [ $i -eq 2 ]; then
                 [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
             else
                 [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
             fi
         done
         rm -f $SCRATCH_MNT/foo_link_*
         cat $SCRATCH_MNT/foo
         rm -f $SCRATCH_MNT/foo
      
         status=0
         exit
      
      The fix is simply to correct the overflow condition when overwriting a
      reference item because it was wrong, trying to increase the item in the
      fs/subvol tree by an impossible amount. Also ensure that we don't insert
      one normal ref and one ext ref for the same dentry - this happened because
      processing a dir index entry from the parent in the log happened when
      the normal ref item was full, which made the logic insert an extref and
      later when the normal ref had enough room, it would be inserted again
      when processing the ref item from the child inode in the log.
      
      This issue has been present since the introduction of the extrefs feature
      (2012).
      
      A test case for xfstests follows soon. This test only passes if the previous
      patch titled "Btrfs: fix fsync when extend references are added to an inode"
      is applied too.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      df8d116f
    • F
      Btrfs: fix fsync when extend references are added to an inode · 2c2c452b
      Filipe Manana 提交于
      If we added an extended reference to an inode and fsync'ed it, the log
      replay code would make our inode have an incorrect link count, which
      was lower then the expected/correct count.
      This resulted in stale directory index entries after deleting some of
      the hard links, and any access to the dangling directory entries resulted
      in -ESTALE errors because the entries pointed to inode items that don't
      exist anymore.
      
      This is easy to reproduce with the test case I made for xfstests, and
      the bulk of that test is:
      
          _scratch_mkfs "-O extref" >> $seqres.full 2>&1
          _init_flakey
          _mount_flakey
      
          # Create a test file with 3001 hard links. This number is large enough to
          # make btrfs start using extrefs at some point even if the fs has the maximum
          # possible leaf/node size (64Kb).
          echo "hello world" > $SCRATCH_MNT/foo
          for i in `seq 1 3000`; do
              ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
          done
      
          # Make sure all metadata and data are durably persisted.
          sync
      
          # Add one more link to the inode that ends up being a btrfs extref and fsync
          # the inode.
          ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
          $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
          # Simulate a crash/power loss. This makes sure the next mount
          # will see an fsync log and will replay that log.
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          # Now after the fsync log replay btrfs left our inode with a wrong link count N,
          # which was smaller than the correct link count M (N < M).
          # So after removing N hard links, the remaining M - N directory entries were
          # still visible to user space but it was impossible to do anything with them
          # because they pointed to an inode that didn't exist anymore. This resulted in
          # stale file handle errors (-ESTALE) when accessing those dentries for example.
          #
          # So remove all hard links except the first one and then attempt to read the
          # file, to verify we don't get an -ESTALE error when accessing the inodel
          #
          # The btrfs fsck tool also detected the incorrect inode link count and it
          # reported an error message like the following:
          #
          # root 5 inode 257 errors 2001, no inode item, link count wrong
          #   unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref
          #
          # The fstests framework automatically calls fsck after a test is run, so we
          # don't need to call fsck explicitly here.
      
          rm -f $SCRATCH_MNT/foo_link_*
          cat $SCRATCH_MNT/foo
      
          status=0
          exit
      
      So make sure an fsync always flushes the delayed inode item, so that the
      fsync log contains it (needed in order to trigger the link count fixup
      code) and fix the extref counting function, which always return -ENOENT
      to its caller (and made it assume there were always 0 extrefs).
      
      This issue has been present since the introduction of the extrefs feature
      (2012).
      
      A test case for xfstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c2c452b
    • F
      Btrfs: fix directory inconsistency after fsync log replay · d36808e0
      Filipe Manana 提交于
      If we have an inode (file) with a link count greater than 1, remove
      one of its hard links, fsync the inode, power fail/crash and then
      replay the fsync log on the next mount, we end up getting the parent
      directory's metadata inconsistent - its i_size still reflects the
      deleted hard link and has dangling index entries (with no matching
      inode reference entries). This prevents the directory from ever being
      deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
      even if all of its children inodes are deleted, and the dangling index
      entries can never be removed (as they point to an inode that does not
      exist anymore).
      
      This is easy to reproduce with the following excerpt from the test case
      for xfstests that I just made:
      
          _scratch_mkfs >> $seqres.full 2>&1
      
          _init_flakey
          _mount_flakey
      
          # Create a test file with 2 hard links in the same directory.
          mkdir -p $SCRATCH_MNT/a/b
          echo "hello world" > $SCRATCH_MNT/a/b/foo
          ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar
      
          # Make sure all metadata and data are durably persisted.
          sync
      
          # Now remove one of the hard links and fsync the inode.
          rm -f $SCRATCH_MNT/a/b/bar
          $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo
      
          # Simulate a crash/power loss. This makes sure the next mount
          # will see an fsync log and will replay that log.
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          # Remove the last hard link of the file and attempt to remove its parent
          # directory - this failed in btrfs because the fsync log and replay code
          # didn't decrement the parent directory's i_size and left dangling directory
          # index entries - this made the btrfs rmdir implementation always fail with
          # the error -ENOTEMPTY.
          #
          # The dangling directory index entries were visible to user space, but it was
          # impossible to do anything on them (unlink, open, read, write, stat, etc)
          # because the inode they pointed to did not exist anymore.
          #
          # The parent directory's metadata inconsistency (stale index entries) was
          # also detected by btrfs' fsck tool, which is run automatically by the fstests
          # framework when the test finishes. The error message reported by fsck was:
          #
          # root 5 inode 259 errors 2001, no inode item, link count wrong
          #   unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref
          #
          rm -f $SCRATCH_MNT/a/b/*
          rmdir $SCRATCH_MNT/a/b
          rmdir $SCRATCH_MNT/a
      
      To fix this just make sure that after an unlink, if the inode is fsync'ed,
      he parent inode is fully logged in the fsync log.
      
      A test case for xfstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d36808e0
  15. 15 1月, 2015 2 次提交
  16. 13 12月, 2014 1 次提交
    • D
      btrfs: sink blocksize parameter to btrfs_find_create_tree_block · a83fffb7
      David Sterba 提交于
      Finally it's clear that the requested blocksize is always equal to
      nodesize, with one exception, the superblock.
      
      Superblock has fixed size regardless of the metadata block size, but
      uses the same helpers to initialize sys array/chunk tree and to work
      with the chunk items. So it pretends to be an extent_buffer for a
      moment, btrfs_read_sys_array is full of special cases, we're adding one
      more.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      a83fffb7
  17. 22 11月, 2014 4 次提交
    • F
      Btrfs: ensure ordered extent errors aren't missed on fsync · b38ef71c
      Filipe Manana 提交于
      When doing a fsync with a fast path we have a time window where we can miss
      the fact that writeback of some file data failed, and therefore we endup
      returning success (0) from fsync when we should return an error.
      The steps that lead to this are the following:
      
      1) We start all ordered extents by calling filemap_fdatawrite_range();
      
      2) We do some other work like locking the inode's i_mutex, start a transaction,
         start a log transaction, etc;
      
      3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
         ordered extents from inode's ordered tree into a list;
      
      4) But by the time we do ordered extent collection, some ordered extents we started
         at step 1) might have already completed with an error, and therefore we didn't
         found them in the ordered tree and had no idea they finished with an error. This
         makes our fsync return success (0) to userspace, but has no bad effects on the log
         like for example insertion of file extent items into the log that point to unwritten
         extents, because the invalid extent maps were removed before the ordered extent
         completed (in inode.c:btrfs_finish_ordered_io).
      
      So after collecting the ordered extents just check if the inode's i_mapping has any
      error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
      writeback fails for a page of an ordered extent, we call mapping_set_error (done in
      extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
      that sets one of those error flags in the inode's i_mapping flags.
      
      This change also has the side effect of fixing the issue where for fast fsyncs we
      never checked/cleared the error flags from the inode's i_mapping flags, which means
      that a full fsync performed after a fast fsync could get such errors that belonged
      to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
      calls filemap_fdatawait_range(), and the later checks for and clears those flags,
      while for fast fsyncs we never call filemap_fdatawait_range() or anything else
      that checks for and clears the error flags from the inode's i_mapping.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b38ef71c
    • F
      Btrfs: collect only the necessary ordered extents on ranged fsync · 0870295b
      Filipe Manana 提交于
      Instead of collecting all ordered extents from the inode's ordered tree
      and then wait for all of them to complete, just collect the ones that
      overlap the fsync range.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0870295b
    • F
      Btrfs: don't ignore log btree writeback errors · 5ab5e44a
      Filipe Manana 提交于
      If an error happens during writeback of log btree extents, make sure the
      error is returned to the caller (fsync), so that it takes proper action
      (commit current transaction) instead of writing a superblock that points
      to log btrees with all or some nodes that weren't durably persisted.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5ab5e44a
    • J
      Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Josef Bacik 提交于
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are save, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      50d9aa99
  18. 21 11月, 2014 1 次提交
    • J
      Btrfs: make sure we wait on logged extents when fsycning two subvols · 9dba8cf1
      Josef Bacik 提交于
      If we have two fsync()'s race on different subvols one will do all of its work
      to get into the log_tree, wait on it's outstanding IO, and then allow the
      log_tree to finish it's commit.  The problem is we were just free'ing that
      subvols logged extents instead of waiting on them, so whoever lost the race
      wouldn't really have their data on disk.  Fix this by waiting properly instead
      of freeing the logged extents.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9dba8cf1
  19. 28 10月, 2014 1 次提交
    • F
      Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() · 1a4ed8fd
      Filipe Manana 提交于
      If we couldn't find our extent item, we accessed the current slot
      (path->slots[0]) to check if it corresponds to an equivalent skinny
      metadata item. However this slot could be beyond our last item in the
      leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
      we shouldn't process it.
      
      Since btrfs_lookup_extent() is only used to find extent items for data
      extents, fix this by removing completely the logic that looks up for an
      equivalent skinny metadata item, since it can not exist.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1a4ed8fd
  20. 19 9月, 2014 1 次提交
    • F
      Btrfs: fix data corruption after fast fsync and writeback error · 8407f553
      Filipe Manana 提交于
      When we do a fast fsync, we start all ordered operations and then while
      they're running in parallel we visit the list of modified extent maps
      and construct their matching file extent items and write them to the
      log btree. After that, in btrfs_sync_log() we wait for all the ordered
      operations to finish (via btrfs_wait_logged_extents).
      
      The problem with this is that we were completely ignoring errors that
      can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
      or -EIO errors for example. When such error happens, it means we have parts
      of the on disk extent that weren't written to, and so we end up logging
      file extent items that point to these extents that contain garbage/random
      data - so after a crash/reboot plus log replay, we get our inode's metadata
      pointing to those extents.
      
      This worked in contrast with the full (non-fast) fsync path, where we
      start all ordered operations, wait for them to finish and then write
      to the log btree. In this path, after each ordered operation completes
      we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
      -EIO if so (via btrfs_wait_ordered_range).
      
      So if an error happens with any ordered operation, just return a -EIO
      error to userspace, so that it knows that not all of its previous writes
      were durably persisted and the application can take proper action (like
      redo the writes for e.g.) - and definitely not leave any file extent items
      in the log refer to non fully written extents.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8407f553
  21. 18 9月, 2014 4 次提交
    • F
      Btrfs: fix directory recovery from fsync log · a2cc11db
      Filipe Manana 提交于
      When replaying a directory from the fsync log, if a directory entry
      exists both in the fs/subvol tree and in the log, the directory's inode
      got its i_size updated incorrectly, accounting for the dentry's name
      twice.
      
      Reproducer, from a test for xfstests:
      
          _scratch_mkfs >> $seqres.full 2>&1
          _init_flakey
          _mount_flakey
      
          touch $SCRATCH_MNT/foo
          sync
      
          touch $SCRATCH_MNT/bar
          xfs_io -c "fsync" $SCRATCH_MNT
          xfs_io -c "fsync" $SCRATCH_MNT/bar
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          [ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
          [ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"
      
          _unmount_flakey
          _check_scratch_fs $FLAKEY_DEV
      
      The filesystem check at the end failed with the message:
      "root 5 root dir 256 error".
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a2cc11db
    • F
      Btrfs: make btrfs_search_forward return with nodes unlocked · f98de9b9
      Filipe Manana 提交于
      None of the uses of btrfs_search_forward() need to have the path
      nodes (level >= 1) read locked, only the leaf needs to be locked
      while the caller processes it. Therefore make it return a path
      with all nodes unlocked, except for the leaf.
      
      This change is motivated by the observation that during a file
      fsync we repeatdly call btrfs_search_forward() and process the
      returned leaf while upper nodes of the returned path (level >= 1)
      are read locked, which unnecessarily blocks other tasks that want
      to write to the same fs/subvol btree.
      Therefore instead of modifying the fsync code to unlock all nodes
      with level >= 1 immediately after calling btrfs_search_forward(),
      change btrfs_search_forward() to do it, so that it benefits all
      callers.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f98de9b9
    • D
      btrfs: use nodesize everywhere, kill leafsize · 707e8a07
      David Sterba 提交于
      The nodesize and leafsize were never of different values. Unify the
      usage and make nodesize the one. Cleanup the redundant checks and
      helpers.
      
      Shaves a few bytes from .text:
      
        text    data     bss     dec     hex filename
      852418   24560   23112  900090   dbbfa btrfs.ko.before
      851074   24584   23112  898770   db6d2 btrfs.ko.after
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      707e8a07
    • D
      btrfs: kill the key type accessor helpers · 962a298f
      David Sterba 提交于
      btrfs_set_key_type and btrfs_key_type are used inconsistently along with
      open coded variants. Other members of btrfs_key are accessed directly
      without any helpers anyway.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      962a298f
  22. 17 9月, 2014 1 次提交
    • F
      Btrfs: set inode's logged_trans/last_log_commit after ranged fsync · 125c4cf9
      Filipe Manana 提交于
      When a ranged fsync finishes if there are still extent maps in the modified
      list, still set the inode's logged_trans and last_log_commit. This is important
      in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
      inode ref gets deleted from the log and the respective dentries in its parent
      are deleted too from the log (if the parent directory was fsync'ed in the same
      transaction).
      
      Instead make btrfs_inode_in_log() return false if the list of modified extent
      maps isn't empty.
      
      This is an incremental on top of the v4 version of the patch:
      
          "Btrfs: fix fsync data loss after a ranged fsync"
      
      which was added to its v5, but didn't make it on time.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      125c4cf9
  23. 09 9月, 2014 1 次提交
    • F
      Btrfs: fix fsync data loss after a ranged fsync · 49dae1bc
      Filipe Manana 提交于
      While we're doing a full fsync (when the inode has the flag
      BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
      portion of the file), we might have ordered operations that are started
      before or while we're logging the inode and that fall outside the fsync
      range.
      
      Therefore when a full ranged fsync finishes don't remove every extent
      map from the list of modified extent maps - as for some of them, that
      fall outside our fsync range, their respective ordered operation hasn't
      finished yet, meaning the corresponding file extent item wasn't inserted
      into the fs/subvol tree yet and therefore we didn't log it, and we must
      let the next fast fsync (one that checks only the modified list) see this
      extent map and log a matching file extent item to the log btree and wait
      for its ordered operation to finish (if it's still ongoing).
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      49dae1bc
  24. 26 8月, 2014 1 次提交