1. 20 1月, 2016 1 次提交
  2. 16 12月, 2015 1 次提交
    • C
      Btrfs: check prepare_uptodate_page() error code earlier · bb1591b4
      Chris Mason 提交于
      prepare_pages() may end up calling prepare_uptodate_page() twice if our
      write only spans a single page.  But if the first call returns an error,
      our page will be unlocked and its not safe to call it again.
      
      This bug goes all the way back to 2011, and it's not something commonly
      hit.
      
      While we're here, add a more explicit check for the page being truncated
      away.  The bare lock_page() alone is protected only by good thoughts and
      i_mutex, which we're sure to regret eventually.
      Reported-by: NDave Jones <dsj@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bb1591b4
  3. 03 12月, 2015 1 次提交
  4. 25 11月, 2015 1 次提交
    • D
      btrfs: fix signed overflows in btrfs_sync_file · 9dcbeed4
      David Sterba 提交于
      The calculation of range length in btrfs_sync_file leads to signed
      overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin.
      
      https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      
      The fsync call passes 0 and LLONG_MAX, the range length does not fit to
      loff_t and overflows, but the value is converted to u64 so it silently
      works as expected.
      
      The minimal fix is a typecast to u64, switching functions to take
      (start, end) instead of (start, len) would be more intrusive.
      
      Coccinelle script found that there's one more opencoded calculation of
      the length.
      
      <smpl>
      @@
      loff_t start, end;
      @@
      * end - start
      </smpl>
      
      CC: stable@vger.kernel.org
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9dcbeed4
  5. 09 11月, 2015 1 次提交
    • F
      Btrfs: fix race leading to incorrect item deletion when dropping extents · aeafbf84
      Filipe Manana 提交于
      While running a stress test I got the following warning triggered:
      
        [191627.672810] ------------[ cut here ]------------
        [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
        (...)
        [191627.701485] Call Trace:
        [191627.702037]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [191627.702992]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
        [191627.704091]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [191627.705380]  [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.706637]  [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
        [191627.707789]  [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.709155]  [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
        [191627.712444]  [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
        [191627.714162]  [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
        [191627.715887]  [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
        [191627.717287]  [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
        [191627.728865]  [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
        [191627.730045]  [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
        [191627.731256]  [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [191627.732661]  [<ffffffff81061119>] process_one_work+0x24c/0x4ae
        [191627.733822]  [<ffffffff810615b0>] worker_thread+0x206/0x2c2
        [191627.734857]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.736052]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.737349]  [<ffffffff810669a6>] kthread+0xef/0xf7
        [191627.738267]  [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
        [191627.739330]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.741976]  [<ffffffff81465592>] ret_from_fork+0x42/0x70
        [191627.743080]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.744206] ---[ end trace bbfddacb7aaada8d ]---
      
        $ cat -n fs/btrfs/file.c
        691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
        (...)
        758                  btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
        759                  if (key.objectid > ino ||
        760                      key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
        761                          break;
        762
        763                  fi = btrfs_item_ptr(leaf, path->slots[0],
        764                                      struct btrfs_file_extent_item);
        765                  extent_type = btrfs_file_extent_type(leaf, fi);
        766
        767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
        768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
        (...)
        774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
        (...)
        778                  } else {
        779                          WARN_ON(1);
        780                          extent_end = search_start;
        781                  }
        (...)
      
      This happened because the item we were processing did not match a file
      extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
      case we cast the item to a struct btrfs_file_extent_item pointer and
      then find a type field value that does not match any of the expected
      values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
      due to a tiny time window where a race can happen as exemplified below.
      For example, consider the following scenario where we're using the
      NO_HOLES feature and we have the following two neighbour leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
      [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                slot N - 2         slot N - 1              slot 0
      
      Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
      than explicit because NO_HOLES is enabled). Now if our inode has an
      ordered extent for the range [4K, 8K[ that is finishing, the following
      can happen:
      
                CPU 1                                       CPU 2
      
        btrfs_finish_ordered_io()
          insert_reserved_file_extent()
            __btrfs_drop_extents()
               Searches for the key
                (257 EXTENT_DATA 4096) through
                btrfs_lookup_file_extent()
      
               Key not found and we get a path where
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
               Because path->slots[0] is >=
               btrfs_header_nritems(leaf X), we call
               btrfs_next_leaf()
      
               btrfs_next_leaf() releases the path
      
                                                        inserts key
                                                        (257 INODE_REF 4096)
                                                        at the end of leaf X,
                                                        leaf X now has N + 1 keys,
                                                        and the new key is at
                                                        slot N
      
               btrfs_next_leaf() searches for
               key (257 INODE_REF 256), with
               path->keep_locks set to 1,
               because it was the last key it
               saw in leaf X
      
                 finds it in leaf X again and
                 notices it's no longer the last
                 key of the leaf, so it returns 0
                 with path->nodes[0] == leaf X and
                 path->slots[0] == N (which is now
                 < btrfs_header_nritems(leaf X)),
                 pointing to the new key
                 (257 INODE_REF 4096)
      
               __btrfs_drop_extents() casts the
               item at path->nodes[0], slot
               path->slots[0], to a struct
               btrfs_file_extent_item - it does
               not skip keys for the target
               inode with a type less than
               BTRFS_EXTENT_DATA_KEY
               (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
      
               sees a bogus value for the type
               field triggering the WARN_ON in
               the trace shown above, and sets
               extent_end = search_start (4096)
      
               does the if-then-else logic to
               fixup 0 length extent items created
               by a past bug from hole punching:
      
                 if (extent_end == key.offset &&
                     extent_end >= search_start)
                     goto delete_extent_item;
      
               that evaluates to true and it ends
               up deleting the key pointed to by
               path->slots[0], (257 INODE_REF 4096),
               from leaf X
      
      The same could happen for example for a xattr that ends up having a key
      with an offset value that matches search_start (very unlikely but not
      impossible).
      
      So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
      skipped, never casted to struct btrfs_file_extent_item and never deleted
      by accident. Also protect against the unexpected case of getting a key
      for a lower inode number by skipping that key and issuing a warning.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      aeafbf84
  6. 03 11月, 2015 2 次提交
    • F
      Btrfs: fix hole punching when using the no-holes feature · 2959a32a
      Filipe Manana 提交于
      When we are using the no-holes feature, if we punch a hole into a file
      range that already contains a hole which overlaps the range we are passing
      to fallocate(), we end up removing the extent map that represents the
      existing hole without adding a new one. This happens because with the
      no-holes feature we do not have explicit extent items to represent holes
      and therefore the call to __btrfs_drop_extents(), made from
      btrfs_punch_hole(), returns an end offset to the variable drop_end that
      is smaller than the end of the range passed to fallocate(), while it
      drops all existing extent maps in that range.
      Normally having a missing extent map is not a problem, for example for
      a readpages() operation we just end up building the extent map by
      looking at the fs/subvol tree for a matching extent item (or a lack of
      one for implicit holes). However for an fsync that uses the fast path,
      which needs to look at the list of modified extent maps, this means
      the fsync will not record information about the complete hole we had
      before the fallocate() call into the log tree, resulting in a file with
      content/layout that does not match what we had neither before nor after
      the hole punch operation.
      
      The following test case for fstests reproduces the issue. It fails without
      this change because we get a file with a different digest after the fsync
      log replay and also with a different extent/hole layout.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
           _cleanup_flakey
           rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/punch
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "fpunch"
        _require_xfs_io_command "fiemap"
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create out test file with some data and then fsync it.
        # We do the fsync only to make sure the last fsync we do in this test
        # triggers the fast code path of btrfs' fsync implementation, a
        # condition necessary to trigger the bug btrfs had.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
                        -c "fsync"                  \
                        $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Now punch a hole against the range [96K, 128K[.
        $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the previous range
        # and ends beyond eof.
        $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the first range
        # ([96K, 128K[) and ends at eof.
        $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar
      
        # Fsync our file. We want to verify that, after a power failure and
        # mounting the filesystem again, the file content reflects all the hole
        # punch operations.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap before power failure:"
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        # Silently drop all writes and umount to simulate a crash/power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest we got before the power failure.
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap after log replay:"
        # Must match the same extent listing we got before the power failure.
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2959a32a
    • Q
      btrfs: Fix a data space underflow warning · 485290a7
      Qu Wenruo 提交于
      Even with quota disabled, generic/127 will trigger a kernel warning by
      underflow data space info.
      
      The bug is caused by buffered write, which in case of short copy, the
      start parameter for btrfs_delalloc_release_space() is wrong, and
      round_up/down() in btrfs_delalloc_release() extents the range to page
      aligned, decreasing one more page than expected.
      
      This patch will fix it by passing correct start.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      485290a7
  7. 26 10月, 2015 1 次提交
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NElias Probst <mail@eliasprobst.eu>
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
  8. 22 10月, 2015 6 次提交
  9. 21 10月, 2015 1 次提交
    • Q
      btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size · 0f6925fa
      Qu Wenruo 提交于
      Current code will always truncate tailing page if its alloc_start is
      smaller than inode size.
      
      For example, the file extent layout is like:
      0	4K	8K	16K	32K
      |<-----Extent A---------------->|
      |<--Inode size: 18K---------->|
      
      But if calling fallocate even for range [0,4K), it will cause btrfs to
      re-truncate the range [16,32K), causing COW and a new extent.
      
      0	4K	8K	16K	32K
      |///////|	<- Fallocate call range
      |<-----Extent A-------->|<--B-->|
      
      The cause is quite easy, just a careless btrfs_truncate_inode() in a
      else branch without extra judgment.
      Fix it by add judgment on whether the fallocate range is beyond isize.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0f6925fa
  10. 24 6月, 2015 1 次提交
  11. 10 6月, 2015 1 次提交
    • F
      Btrfs: avoid syncing log in the fast fsync path when not necessary · b659ef02
      Filipe Manana 提交于
      Commit 3a8b36f3 ("Btrfs: fix data loss in the fast fsync path") added
      a performance regression for that causes an unnecessary sync of the log
      trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
      against a file, without no writes or any metadata updates to the inode in
      between them and if a transaction is committed before the second fsync is
      called.
      
      Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
      after a test sysbench test that measured a -62% decrease of file io
      requests per second for that tests' workload.
      
      The test is:
      
        echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
        mkfs -t btrfs /dev/sda2
        mount -t btrfs /dev/sda2 /fs/sda2
        cd /fs/sda2
        for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
        sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
          --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
          --file-num=1024 run
      
      A test on kvm guest, running a debug kernel gave me the following results:
      
      Without 3a8b36f3:             16.01 reqs/sec
      With 3a8b36f3:                 3.39 reqs/sec
      With 3a8b36f3 and this patch: 16.04 reqs/sec
      Reported-by: NHuang Ying <ying.huang@intel.com>
      Tested-by: NHuang, Ying <ying.huang@intel.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b659ef02
  12. 16 4月, 2015 1 次提交
  13. 13 4月, 2015 2 次提交
    • D
      btrfs: qgroup: do a reservation in a higher level. · e2d1f923
      Dongsheng Yang 提交于
      There are two problems in qgroup:
      
      a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
      qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
      is not available to user.
      
      b). When user is writing a inline data, qgroup will not reserve it,
      it means this is a window we can exceed the limit of a qgroup.
      
      The main idea of this patch is reserving the data size of write_bytes
      rather than the reserve_bytes. It means qgroup will not care about
      the data size btrfs will reserve for user, but only care about the
      data size user is going to write. Then reserve it when user want to
      write and release it in transaction committed.
      
      In this way, qgroup can be released from the complex procedure in
      btrfs and only do the reserve when user want to write and account
      when the data is written in commit_transaction().
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e2d1f923
    • D
      Btrfs: qgroup, Account data space in more proper timings. · 237c0e9f
      Dongsheng Yang 提交于
      Currenly, in data writing, ->reserved is accounted in
      fill_delalloc(), but ->may_use is released in clear_bit_hook()
      which is called by btrfs_finish_ordered_io(). That's too late,
      that said, between fill_delalloc() and btrfs_finish_ordered_io(),
      the data is doublely accounted by qgroup. It will cause some
      unexpected -EDQUOT.
      
      Example:
      	# btrfs quota enable /root/btrfs-auto-test/
      	# btrfs subvolume create /root/btrfs-auto-test//sub
      	Create subvolume '/root/btrfs-auto-test/sub'
      	# btrfs qgroup limit 1G /root/btrfs-auto-test//sub
      	dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=1500000
      	dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded
      	681353+0 records in
      	681352+0 records out
      	697704448 bytes (698 MB) copied, 8.15563 s, 85.5 MB/s
      It's (698 MB) when we got an -EDQUOT, but we limit it by 1G.
      
      This patch move the btrfs_qgroup_reserve/free() for data from
      btrfs_delalloc_reserve/release_metadata() to btrfs_check_data_free_space()
      and btrfs_free_reserved_data_space(). Then the accounter in qgroup
      will be updated at the same time with the accounter in space_info updated.
      In this way, the unexpected -EDQUOT will be killed.
      Reported-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      237c0e9f
  14. 12 4月, 2015 4 次提交
  15. 27 3月, 2015 2 次提交
    • F
      Btrfs: fix metadata inconsistencies after directory fsync · 2f2ff0ee
      Filipe Manana 提交于
      We can get into inconsistency between inodes and directory entries
      after fsyncing a directory. The issue is that while a directory gets
      the new dentries persisted in the fsync log and replayed at mount time,
      the link count of the inode that directory entries point to doesn't
      get updated, staying with an incorrect link count (smaller then the
      correct value). This later leads to stale file handle errors when
      accessing (including attempt to delete) some of the links if all the
      other ones are removed, which also implies impossibility to delete the
      parent directories, since the dentries can not be removed.
      
      Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
      when fsyncing a directory, new files aren't logged (their metadata and
      dentries) nor any child directories. So this patch fixes this issue too,
      since it has the same resolution as the incorrect inode link count issue
      mentioned before.
      
      This is very easy to reproduce, and the following excerpt from my test
      case for xfstests shows how:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file and directory.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
        mkdir $SCRATCH_MNT/mydir
      
        # Make sure all metadata and data are durably persisted.
        sync
      
        # Add a hard link to 'foo' inside our test directory and fsync only the
        # directory. The btrfs fsync implementation had a bug that caused the new
        # directory entry to be visible after the fsync log replay but, the inode
        # of our file remained with a link count of 1.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
      
        # Add a few more links and new files.
        # This is just to verify nothing breaks or gives incorrect results after the
        # fsync log is replayed.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
        $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
        ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
      
        # Add some subdirectories and new files and links to them. This is to verify
        # that after fsyncing our top level directory 'mydir', all the subdirectories
        # and their files/links are registered in the fsync log and exist after the
        # fsync log is replayed.
        mkdir -p $SCRATCH_MNT/mydir/x/y/z
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
        touch $SCRATCH_MNT/mydir/x/y/z/qwerty
      
        # Now fsync only our top directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
      
        # And fsync now our new file named 'hello', just to verify later that it has
        # the expected content and that the previous fsync on the directory 'mydir' had
        # no bad influence on this fsync.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
        # all with the value 0xaa.
        echo "File 'foo' content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Remove the first name of our inode. Because of the directory fsync bug, the
        # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
        # deleting the inode and the other names became stale directory entries (still
        # visible to applications). Attempting to remove or access the remaining
        # dentries pointing to that inode resulted in stale file handle errors and
        # made it impossible to remove the parent directories since it was impossible
        # for them to become empty.
        echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
        rm -f $SCRATCH_MNT/foo
      
        # Now verify that all files, links and directories created before fsyncing our
        # directory exist after the fsync log was replayed.
        [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
        [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
        [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
        [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
            echo "Link mydir/x/y/foo_y_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
            echo "Link mydir/x/y/z/foo_z_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
            echo "File mydir/x/y/z/qwerty is missing"
      
        # We expect our file here to have a size of 64Kb and all the bytes having the
        # value 0xff.
        echo "file 'hello' content after log replay:"
        od -t x1 $SCRATCH_MNT/hello
      
        # Now remove all files/links, under our test directory 'mydir', and verify we
        # can remove all the directories.
        rm -f $SCRATCH_MNT/mydir/x/y/z/*
        rmdir $SCRATCH_MNT/mydir/x/y/z
        rm -f $SCRATCH_MNT/mydir/x/y/*
        rmdir $SCRATCH_MNT/mydir/x/y
        rmdir $SCRATCH_MNT/mydir/x
        rm -f $SCRATCH_MNT/mydir/*
        rmdir $SCRATCH_MNT/mydir
      
        # An fsck, run by the fstests framework everytime a test finishes, also detected
        # the inconsistency and printed the following error message:
        #
        # root 5 inode 257 errors 2001, no inode item, link count wrong
        #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
        #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
        status=0
        exit
      
      The expected golden output for the test is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 5
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
      
      Which is the output after this patch and when running the test against
      ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
      output is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 1
        Link mydir/foo_2 is missing
        Link mydir/foo_3 is missing
        Link mydir/x/y/foo_y_link is missing
        Link mydir/x/y/z/foo_z_link is missing
        File mydir/x/y/z/qwerty is missing
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
      
      Fsck, without this fix, also complains about the wrong link count:
      
        root 5 inode 257 errors 2001, no inode item, link count wrong
            unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
            unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
      So fix this by logging the inodes that the dentries point to when
      fsyncing a directory.
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2f2ff0ee
    • F
      Btrfs: add missing inode item update in fallocate() · 3d850dd4
      Filipe Manana 提交于
      If we fallocate(), without the keep size flag, into an area already covered
      by an extent previously fallocated, we were updating the inode's i_size but
      we weren't updating the inode item in the fs/subvol tree. A following umount
      + mount would result in a loss of the inode's size (and an fsync would miss
      too the fact that the inode changed).
      
      Reproducer:
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt
        $ fallocate -n -l 1M /mnt/foobar
        $ fallocate -l 512K /mnt/foobar
        $ umount /mnt
        $ mount /dev/sdd /mnt
        $ od -t x1 /mnt/foobar
        0000000
      
      The expected result is:
      
        $ od -t x1 /mnt/foobar
        0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        2000000
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3d850dd4
  16. 26 3月, 2015 1 次提交
  17. 06 3月, 2015 1 次提交
    • F
      Btrfs: fix data loss in the fast fsync path · 3a8b36f3
      Filipe Manana 提交于
      When using the fast file fsync code path we can miss the fact that new
      writes happened since the last file fsync and therefore return without
      waiting for the IO to finish and write the new extents to the fsync log.
      
      Here's an example scenario where the fsync will miss the fact that new
      file data exists that wasn't yet durably persisted:
      
      1. fs_info->last_trans_committed == N - 1 and current transaction is
         transaction N (fs_info->generation == N);
      
      2. do a buffered write;
      
      3. fsync our inode, this clears our inode's full sync flag, starts
         an ordered extent and waits for it to complete - when it completes
         at btrfs_finish_ordered_io(), the inode's last_trans is set to the
         value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
         btrfs_set_inode_last_trans);
      
      4. transaction N is committed, so fs_info->last_trans_committed is now
         set to the value N and fs_info->generation remains with the value N;
      
      5. do another buffered write, when this happens btrfs_file_write_iter
         sets our inode's last_trans to the value N + 1 (that is
         fs_info->generation + 1 == N + 1);
      
      6. transaction N + 1 is started and fs_info->generation now has the
         value N + 1;
      
      7. transaction N + 1 is committed, so fs_info->last_trans_committed
         is set to the value N + 1;
      
      8. fsync our inode - because it doesn't have the full sync flag set,
         we only start the ordered extent, we don't wait for it to complete
         (only in a later phase) therefore its last_trans field has the
         value N + 1 set previously by btrfs_file_write_iter(), and so we
         have:
      
             inode->last_trans <= fs_info->last_trans_committed
                 (N + 1)              (N + 1)
      
         Which made us not log the last buffered write and exit the fsync
         handler immediately, returning success (0) to user space and resulting
         in data loss after a crash.
      
      This can actually be triggered deterministically and the following excerpt
      from a testcase I made for xfstests triggers the issue. It moves a dummy
      file across directories and then fsyncs the old parent directory - this
      is just to trigger a transaction commit, so moving files around isn't
      directly related to the issue but it was chosen because running 'sync' for
      example does more than just committing the current transaction, as it
      flushes/waits for all file data to be persisted. The issue can also happen
      at random periods, since the transaction kthread periodicaly commits the
      current transaction (about every 30 seconds by default).
      The body of the test is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file 'foo', the one we check for data loss.
        # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
        # bit from its flags (btrfs inode specific flags).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                        -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now create one other file and 2 directories. We will move this second file
        # from one directory to the other later because it forces btrfs to commit its
        # currently open transaction if we fsync the old parent directory. This is
        # necessary to trigger the data loss bug that affected btrfs.
        mkdir $SCRATCH_MNT/testdir_1
        touch $SCRATCH_MNT/testdir_1/bar
        mkdir $SCRATCH_MNT/testdir_2
      
        # Make sure everything is durably persisted.
        sync
      
        # Write more 8Kb of data to our file.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Move our 'bar' file into a new directory.
        mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
      
        # Fsync our first directory. Because it had a file moved into some other
        # directory, this made btrfs commit the currently open transaction. This is
        # a condition necessary to trigger the data loss bug.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
      
        # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
        # data we wrote previously to be persisted and available if a crash happens.
        # This did not happen with btrfs, because of the transaction commit that
        # happened when we fsynced the parent directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now check that all data we wrote before are available.
        echo "File content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected golden output for the test, which is what we get with this
      fix applied (or when running against ext3/4 and xfs), is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0040000
      
      Without this fix applied, the output shows the test file does not have
      the second 8Kb extent that we successfully fsynced:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
      
      So fix this by skipping the fsync only if we're doing a full sync and
      if the inode's last_trans is <= fs_info->last_trans_committed, or if
      the inode is already in the log. Also remove setting the inode's
      last_trans in btrfs_file_write_iter since it's useless/unreliable.
      
      Also because btrfs_file_write_iter no longer sets inode->last_trans to
      fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
      bail out if last_trans is 0, otherwise something as simple as the following
      example wouldn't log the second write on the last fsync:
      
        1. write to file
      
        2. fsync file
      
        3. fsync file
             |--> btrfs_inode_in_log() returns true and it set last_trans to 0
      
        4. write to file
             |--> btrfs_file_write_iter() no longers sets last_trans, so it
                  remained with a value of 0
        5. fsync
             |--> inode->last_trans == 0, so it bails out without logging the
                  second write
      
      A test case for xfstests will be sent soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3a8b36f3
  18. 04 3月, 2015 3 次提交
  19. 03 3月, 2015 1 次提交
    • F
      Btrfs: add missing inode update when punching hole · e8c1c76e
      Filipe Manana 提交于
      When punching a file hole if we endup only zeroing parts of a page,
      because the start offset isn't a multiple of the sector size or the
      start offset and length fall within the same page, we were not updating
      the inode item. This prevented an fsync from doing anything, if no other
      file changes happened in the current transaction, because the fields
      in btrfs_inode used to check if the inode needs to be fsync'ed weren't
      updated.
      
      This issue is easy to reproduce and the following excerpt from the
      xfstest case I made shows how to trigger it:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file.
        $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, this makes btrfs update some btrfs inode specific fields
        # that are used to track if the inode needs to be written/updated to the fsync
        # log or not. After this fsync, the new values for those fields indicate that
        # a subsequent fsync does not need to touch the fsync log.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Force a commit of the current transaction. After this point, any operation
        # that modifies the data or metadata of our file, should update those fields in
        # the btrfs inode with values that make the next fsync operation write to the
        # fsync log.
        sync
      
        # Punch a hole in our file. This small range affects only 1 page.
        # This made the btrfs hole punching implementation write only some zeroes in
        # one page, but it did not update the btrfs inode fields used to determine if
        # the next fsync needs to write to the fsync log.
        $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo
      
        # Another variation of the previously mentioned case.
        $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo
      
        # Now fsync the file. This was a no-operation because the previous hole punch
        # operation didn't update the inode's fields mentioned before, so they remained
        # with the values they had after the first fsync - that is, they indicate that
        # it is not needed to write to fsync log.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Enable writes and mount the fs. This makes the fsync log replay code run.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Because the last fsync didn't do anything, here the file content matched what
        # it was after the first fsync, before the holes were punched, and not what it
        # was after the holes were punched.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      This issue has been around since 2012, when the punch hole implementation
      was added, commit 2aaa6655 ("Btrfs: add hole punching").
      
      A test case for xfstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e8c1c76e
  20. 17 2月, 2015 1 次提交
    • D
      Btrfs: ctree: reduce args where only fs_info used · b7a0365e
      Daniel Dressler 提交于
      This patch is part of a larger project to cleanup btrfs's internal usage
      of struct btrfs_root. Many functions take btrfs_root only to grab a
      pointer to fs_info.
      
      This causes programmers to ponder which root can be passed. Since only
      the fs_info is read affected functions can accept any root, except this
      is only obvious upon inspection.
      
      This patch reduces the specificty of such functions to accept the
      fs_info directly.
      
      This patch does not address the two functions in ctree.c (insert_ptr,
      and split_item) which only use root for BUG_ONs in ctree.c
      
      This patch affects the following functions:
        1) fixup_low_keys
        2) btrfs_set_item_key_safe
      Signed-off-by: NDaniel Dressler <danieru.dressler@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      b7a0365e
  21. 11 2月, 2015 1 次提交
  22. 21 1月, 2015 1 次提交
  23. 25 11月, 2014 1 次提交
    • F
      Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Filipe Manana 提交于
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was just called before the second write, we often
      can get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possibe to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      >From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file that
      either:
      
      1) is empty
      2) only the first write was captured
      3) only the 2 writes were captured
      4) both writes and the truncation were captured
      
      But never capture a state where only the first write and the truncation were
      captured (since the second write was performed before the truncation).
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9ea24bbe
  24. 21 11月, 2014 2 次提交
  25. 02 10月, 2014 1 次提交
  26. 19 9月, 2014 1 次提交
    • F
      Btrfs: fix data corruption after fast fsync and writeback error · 8407f553
      Filipe Manana 提交于
      When we do a fast fsync, we start all ordered operations and then while
      they're running in parallel we visit the list of modified extent maps
      and construct their matching file extent items and write them to the
      log btree. After that, in btrfs_sync_log() we wait for all the ordered
      operations to finish (via btrfs_wait_logged_extents).
      
      The problem with this is that we were completely ignoring errors that
      can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
      or -EIO errors for example. When such error happens, it means we have parts
      of the on disk extent that weren't written to, and so we end up logging
      file extent items that point to these extents that contain garbage/random
      data - so after a crash/reboot plus log replay, we get our inode's metadata
      pointing to those extents.
      
      This worked in contrast with the full (non-fast) fsync path, where we
      start all ordered operations, wait for them to finish and then write
      to the log btree. In this path, after each ordered operation completes
      we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
      -EIO if so (via btrfs_wait_ordered_range).
      
      So if an error happens with any ordered operation, just return a -EIO
      error to userspace, so that it knows that not all of its previous writes
      were durably persisted and the application can take proper action (like
      redo the writes for e.g.) - and definitely not leave any file extent items
      in the log refer to non fully written extents.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8407f553