1. 29 9月, 2015 7 次提交
  2. 23 9月, 2015 1 次提交
    • J
      Btrfs: keep dropped roots in cache until transaction commit · 2b9dbef2
      Josef Bacik 提交于
      When dropping a snapshot we need to account for the qgroup changes.  If we drop
      the snapshot in all one go then the backref code will fail to find blocks from
      the snapshot we dropped since it won't be able to find the root in the fs root
      cache.  This can lead to us failing to find refs from other roots that pointed
      at blocks in the now deleted root.  To handle this we need to not remove the fs
      roots from the cache until after we process the qgroup operations.  Do this by
      adding dropped roots to a list on the transaction, and letting the transaction
      remove the roots at the same time it drops the commit roots.  This will keep all
      of the backref searching code in sync properly, and fixes a problem Mark was
      seeing with snapshot delete and qgroups.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2b9dbef2
  3. 22 9月, 2015 1 次提交
    • C
      Btrfs: Direct I/O: Fix space accounting · 50745b0a
      chandan 提交于
      The following call trace is seen when generic/095 test is executed,
      
      WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
      Modules linked in:
      CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
       ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
       0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
       ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
      Call Trace:
       [<ffffffff81984058>] dump_stack+0x45/0x57
       [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
       [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
       [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
       [<ffffffff8117ce07>] destroy_inode+0x37/0x60
       [<ffffffff8117cf39>] evict+0x109/0x170
       [<ffffffff8117cfd5>] dispose_list+0x35/0x50
       [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
       [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
       [<ffffffff81165951>] kill_anon_super+0x11/0x20
       [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
       [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
       [<ffffffff811660cf>] deactivate_super+0x5f/0x70
       [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
       [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
       [<ffffffff81069c06>] task_work_run+0x96/0xb0
       [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
       [<ffffffff8198cbc2>] int_signal+0x12/0x17
      
      This means that the inode had non-zero "outstanding extents" during
      eviction. This occurs because, during direct I/O a task which successfully
      used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does
      not clear the bit after finishing the DIO write. A future DIO write could
      actually fail and the unused reserve space won't be freed because of the
      previously set BTRFS_INODE_DIO_READY bit.
      
      Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
      following issue,
      |-----------------------------------+-------------------------------------|
      | Task A                            | Task B                              |
      |-----------------------------------+-------------------------------------|
      | Start direct i/o write on inode X.|                                     |
      | reserve space                     |                                     |
      | Allocate ordered extent           |                                     |
      | release reserved space            |                                     |
      | Set BTRFS_INODE_DIO_READY bit.    |                                     |
      |                                   | splice()                            |
      |                                   | Transfer data from pipe buffer to   |
      |                                   | destination file.                   |
      |                                   | - kmap(pipe buffer page)            |
      |                                   | - Start direct i/o write on         |
      |                                   |   inode X.                          |
      |                                   |   - reserve space                   |
      |                                   |   - dio_refill_pages()              |
      |                                   |     - sdio->blocks_available == 0   |
      |                                   |     - Since a kernel address is     |
      |                                   |       being passed instead of a     |
      |                                   |       user space address,           |
      |                                   |       iov_iter_get_pages() returns  |
      |                                   |       -EFAULT.                      |
      |                                   |   - Since BTRFS_INODE_DIO_READY is  |
      |                                   |     set, we don't release reserved  |
      |                                   |     space.                          |
      |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
      | -EIOCBQUEUED is returned.         |                                     |
      |-----------------------------------+-------------------------------------|
      
      Hence this commit introduces "struct btrfs_dio_data" to track the usage of
      reserved data space. The remaining unused "reserve space" can now be freed
      reliably.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      50745b0a
  4. 15 9月, 2015 2 次提交
    • J
      btrfs: skip waiting on ordered range for special files · a30e577c
      Jeff Mahoney 提交于
      In btrfs_evict_inode, we properly truncate the page cache for evicted
      inodes but then we call btrfs_wait_ordered_range for every inode as well.
      It's the right thing to do for regular files but results in incorrect
      behavior for device inodes for block devices.
      
      filemap_fdatawrite_range gets called with inode->i_mapping which gets
      resolved to the block device inode before getting passed to
      wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
      next depends on whether there's an open file handle associated with the
      inode.  If there is, we write to the block device, which is unexpected
      behavior.  If there isn't, we through normally and inode->i_data is used.
      We can also end up racing against open/close which can result in crashes
      when i_mapping points to a block device inode that has been closed.
      
      Since there can't be any page cache associated with special file inodes,
      it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
      the problem.
      
      Cc: <stable@vger.kernel.org>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911Tested-by: NChristoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      a30e577c
    • F
      Btrfs: fix read corruption of compressed and shared extents · 005efedf
      Filipe Manana 提交于
      If a file has a range pointing to a compressed extent, followed by
      another range that points to the same compressed extent and a read
      operation attempts to read both ranges (either completely or part of
      them), the pages that correspond to the second range are incorrectly
      filled with zeroes.
      
      Consider the following example:
      
        File layout
        [0 - 8K]                      [8K - 24K]
            |                             |
            |                             |
         points to extent X,         points to extent X,
         offset 4K, length of 8K     offset 0, length 16K
      
        [extent X, compressed length = 4K uncompressed length = 16K]
      
      If a readpages() call spans the 2 ranges, a single bio to read the extent
      is submitted - extent_io.c:submit_extent_page() would only create a new
      bio to cover the second range pointing to the extent if the extent it
      points to had a different logical address than the extent associated with
      the first range. This has a consequence of the compressed read end io
      handler (compression.c:end_compressed_bio_read()) finish once the extent
      is decompressed into the pages covering the first range, leaving the
      remaining pages (belonging to the second range) filled with zeroes (done
      by compression.c:btrfs_clear_biovec_end()).
      
      So fix this by submitting the current bio whenever we find a range
      pointing to a compressed extent that was preceded by a range with a
      different extent map. This is the simplest solution for this corner
      case. Making the end io callback populate both ranges (or more, if we
      have multiple pointing to the same extent) is a much more complex
      solution since each bio is tightly coupled with a single extent map and
      the extent maps associated to the ranges pointing to the shared extent
      can have different offsets and lengths.
      
      The following test case for fstests triggers the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        test_clone_and_read_compressed_extent()
        {
            local mount_opts=$1
      
            _scratch_mkfs >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # Create a test file with a single extent that is compressed (the
            # data we write into it is highly compressible no matter which
            # compression algorithm is used, zlib or lzo).
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
                            -c "pwrite -S 0xbb 4K 8K"        \
                            -c "pwrite -S 0xcc 12K 4K"       \
                            $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Now clone our extent into an adjacent offset.
            $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
                $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
            # Same as before but for this file we clone the extent into a lower
            # file offset.
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
                            -c "pwrite -S 0xbb 12K 8K"        \
                            -c "pwrite -S 0xcc 20K 4K"        \
                            $SCRATCH_MNT/bar | _filter_xfs_io
      
            $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
                $SCRATCH_MNT/bar $SCRATCH_MNT/bar
      
            echo "File digests before unmounting filesystem:"
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
      
            # Evicting the inode or clearing the page cache before reading
            # again the file would also trigger the bug - reads were returning
            # all bytes in the range corresponding to the second reference to
            # the extent with a value of 0, but the correct data was persisted
            # (it was a bug exclusively in the read path). The issue happened
            # only if the same readpages() call targeted pages belonging to the
            # first and second ranges that point to the same compressed extent.
            _scratch_remount
      
            echo "File digests after mounting filesystem again:"
            # Must match the same digests we got before.
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
        }
      
        echo -e "\nTesting with zlib compression..."
        test_clone_and_read_compressed_extent "-o compress=zlib"
      
        _scratch_unmount
      
        echo -e "\nTesting with lzo compression..."
        test_clone_and_read_compressed_extent "-o compress=lzo"
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      005efedf
  5. 10 9月, 2015 1 次提交
    • F
      Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock · 85e0a0f2
      Filipe Manana 提交于
      After commmit e44163e1 ("btrfs: explictly delete unused block groups
      in close_ctree and ro-remount"), added in the 4.3 merge window, we have
      calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex.
      This can cause a deadlock with a concurrent block group relocation (when
      a filesystem balance or shrink operation is in progress for example)
      because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the
      relocation path locks first delete_unused_bgs_mutex and then it locks
      cleaner_mutex, resulting in a classic ABBA deadlock:
      
               CPU 0                                        CPU 1
      
      lock fs_info->cleaner_mutex
      
                                                 __btrfs_balance() || btrfs_shrink_device()
                                                   lock fs_info->delete_unused_bgs_mutex
                                                   btrfs_relocate_chunk()
                                                     btrfs_relocate_block_group()
                                                       lock fs_info->cleaner_mutex
      btrfs_delete_unused_bgs()
        lock fs_info->delete_unused_bgs_mutex
      
      Fix this by not taking the cleaner_mutex before calling
      btrfs_delete_unused_bgs() because it's no longer needed after
      commit 67c5e7d4 ("Btrfs: fix race between balance and unused block
      group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the
      spinlock fs_info->unused_bgs_lock and a block group's spinlock are
      enough to get correct serialization between tasks running relocation
      and unused block group deletion (as well as between multiple tasks
      concurrently calling btrfs_delete_unused_bgs()).
      
      This issue was discussed (in the mailing list) during the review of
      the patch titled "btrfs: explictly delete unused block groups in
      close_ctree and ro-remount" and it was agreed that acquiring the
      cleaner mutex had to be dropped after the patch titled
      "Btrfs: fix race between balance and unused block group deletion"
      got merged (both patches were submitted at about the same time, but
      one landed in kernel 4.2 and the other in the 4.3 merge window).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      85e0a0f2
  6. 08 9月, 2015 1 次提交
    • F
      Btrfs: don't initialize a space info as full to prevent ENOSPC · 6af3e3ad
      Filipe Manana 提交于
      Commit 2e6e5183 ("Btrfs: fix block group ->space_info null pointer
      dereference") accidently marked a space info as full when initializing
      it with a value of 0 total bytes. This introduces an ENOSPC problem when
      writing file data if we mount a filesystem that has no data block groups
      allocated, because the data space info is initialized with 0 total bytes,
      marked as full, and it never gets its total bytes incremented by a
      (positive) value to unmark it as full (because there are no data block
      groups loaded when the fs is mounted).
      For metadata and system spaces this issue can never happen since we always
      have at least one metadata block group and one system block group (even
      for an empty filesystem).
      
      So fix this by just not initializing a space info as full, reverting the
      offending part of the commit mentioned above.
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
      
        # Mount our filesystem without space caches enabled so that we do not
        # get any space used from the initial data block group that mkfs creates
        # (space caches used space from data block groups).
        _scratch_mount "-o nospace_cache"
      
        # Need an fs with at least 2Gb to make sure mkfs.btrfs does not create
        # an fs using mixed block groups (used both for data and metadata). We
        # really need to have dedicated block groups for data to reproduce the
        # issue and mkfs.btrfs defaults to mixed block groups only for small
        # filesystems (up to 1Gb).
        _require_fs_space $SCRATCH_MNT $((2 * 1024 * 1024))
      
        # Run balance with the purpose of deleting the unused data block group
        # that mkfs created. We could also wait for the background kthread to
        # automatically delete the unused block group, but we do not have a way
        # to make it run and wait for it to complete, so just do a balance
        # instead of some unreliable sleep
        _run_btrfs_util_prog balance start -dusage=0 $SCRATCH_MNT
      
        # Now unmount the filesystem, mount it again (either with or with space
        # caches enabled, it does not matter to trigger the problem) and attempt
        # to create a file with some data - this used to fail with ENOSPC
        # because there were no data block groups when the filesystem was
        # mounted and the data space info object was marked as full when
        # initialized (because it had 0 total bytes), which prevented the file
        # write path from attempting to allocate a data block group and fail
        # immediately with ENOSPC.
        _scratch_remount
        echo "hello world" > $SCRATCH_MNT/foobar
      
        echo "Silence is golden"
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      6af3e3ad
  7. 01 9月, 2015 6 次提交
  8. 22 8月, 2015 1 次提交
    • C
      btrfs: fix compile when block cgroups are not enabled · 3a9508b0
      Chris Mason 提交于
      bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
      This adds an ifdef around them.  It's not perfect, but our
      use of bi_ioc is being removed in the 4.3 merge window.
      
      The bi_css usage really should go into bio_clone, but I want to make
      sure that doesn't introduce problems for other bio_clone use cases.
      Signed-off-by: NChris Mason <clm@fb.com>
      3a9508b0
  9. 20 8月, 2015 6 次提交
    • F
      Btrfs: fix file read corruption after extent cloning and fsync · b84b8390
      Filipe Manana 提交于
      If we partially clone one extent of a file into a lower offset of the
      file, fsync the file, power fail and then mount the fs to trigger log
      replay, we can get multiple checksum items in the csum tree that overlap
      each other and result in checksum lookup failures later. Those failures
      can make file data read requests assume a checksum value of 0, but they
      will not return an error (-EIO for example) to userspace exactly because
      the expected checksum value 0 is a special value that makes the read bio
      endio callback return success and set all the bytes of the corresponding
      page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
      From a userspace perspective this is equivalent to file corruption
      because we are not returning what was written to the file.
      
      Details about how this can happen, and why, are included inline in the
      following reproducer test case for fstests and the comment added to
      tree-log.c.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_cloner
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with a single 100K extent starting at file
        # offset 800K. We fsync the file here to make the fsync log tree gets
        # a single csum item that covers the whole 100K extent, which causes
        # the second fsync, done after the cloning operation below, to not
        # leave in the log tree two csum items covering two sub-ranges
        # ([0, 20K[ and [20K, 100K[)) of our extent.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                        -c "fsync"                     \
                         $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now clone part of our extent into file offset 400K. This adds a file
        # extent item to our inode's metadata that points to the 100K extent
        # we created before, using a data offset of 20K and a data length of
        # 20K, so that it refers to the sub-range [20K, 40K[ of our original
        # extent.
        $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
            -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        # Now fsync our file to make sure the extent cloning is durably
        # persisted. This fsync will not add a second csum item to the log
        # tree containing the checksums for the blocks in the sub-range
        # [20K, 40K[ of our extent, because there was already a csum item in
        # the log tree covering the whole extent, added by the first fsync
        # we did before.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        # Silently drop all writes and ummount to simulate a crash/power
        # failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        # The fsync log replay first processes the file extent item
        # corresponding to the file offset 400K (the one which refers to the
        # [20K, 40K[ sub-range of our 100K extent) and then processes the file
        # extent item for file offset 800K. It used to happen that when
        # processing the later, it erroneously left in the csum tree 2 csum
        # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
        # 1 for the whole range of our extent. This introduced a problem where
        # subsequent lookups for the checksums of blocks within the range
        # [40K, 100K[ of our extent would not find anything because lookups in
        # the csum tree ended up looking only at the smaller csum item, the
        # one covering the subrange [20K, 40K[. This made read requests assume
        # an expected checksum with a value of 0 for those blocks, which caused
        # checksum verification failure when the read operations finished.
        # However those checksum failure did not result in read requests
        # returning an error to user space (like -EIO for e.g.) because the
        # expected checksum value had the special value 0, and in that case
        # btrfs set all bytes of the corresponding pages with the value 0x01
        # and produce the following warning in dmesg/syslog:
        #
        #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
        #   1322675045 expected csum 0"
        #
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest he had after cloning the extent and
        # before the power failure happened.
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b84b8390
    • F
      Btrfs: check if previous transaction aborted to avoid fs corruption · 1f9b8c8f
      Filipe Manana 提交于
      While we are committing a transaction, it's possible the previous one is
      still finishing its commit and therefore we wait for it to finish first.
      However we were not checking if that previous transaction ended up getting
      aborted after we waited for it to commit, so we ended up committing the
      current transaction which can lead to fs corruption because the new
      superblock can point to trees that have had one or more nodes/leafs that
      were never durably persisted.
      The following sequence diagram exemplifies how this is possible:
      
                CPU 0                                                        CPU 1
      
        transaction N starts
      
        (...)
      
        btrfs_commit_transaction(N)
      
          cur_trans->state = TRANS_STATE_COMMIT_START;
          (...)
          cur_trans->state = TRANS_STATE_COMMIT_DOING;
          (...)
      
          cur_trans->state = TRANS_STATE_UNBLOCKED;
          root->fs_info->running_transaction = NULL;
      
                                                                    btrfs_start_transaction()
                                                                       --> starts transaction N + 1
      
          btrfs_write_and_wait_transaction(trans, root);
            --> starts writing all new or COWed ebs created
                at transaction N
      
                                                                    creates some new ebs, COWs some
                                                                    existing ebs but doesn't COW or
                                                                    deletes eb X
      
                                                                    btrfs_commit_transaction(N + 1)
                                                                      (...)
                                                                      cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                      (...)
                                                                      wait_for_commit(root, prev_trans);
                                                                        --> prev_trans == transaction N
      
          btrfs_write_and_wait_transaction() continues
          writing ebs
             --> fails writing eb X, we abort transaction N
                 and set bit BTRFS_FS_STATE_ERROR on
                 fs_info->fs_state, so no new transactions
                 can start after setting that bit
      
             cleanup_transaction()
               btrfs_cleanup_one_transaction()
                 wakes up task at CPU 1
      
                                                                      continues, doesn't abort because
                                                                      cur_trans->aborted (transaction N + 1)
                                                                      is zero, and no checks for bit
                                                                      BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                      are made
      
                                                                      btrfs_write_and_wait_transaction(trans, root);
                                                                        --> succeeds, no errors during writeback
      
                                                                      write_ctree_super(trans, root, 0);
                                                                        --> succeeds
                                                                        --> we have now a superblock that points us
                                                                            to some root that uses eb X, which was
                                                                            never written to disk
      
      In this scenario future attempts to read eb X from disk results in an
      error message like "parent transid verify failed on X wanted Y found Z".
      
      So fix this by aborting the current transaction if after waiting for the
      previous transaction we verify that it was aborted.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1f9b8c8f
    • M
      btrfs: use __GFP_NOFAIL in alloc_btrfs_bio · 277fb5fc
      Michal Hocko 提交于
      alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
      transaction but this allocation context is rather weak wrt. reclaim
      capabilities. The page allocator currently tries hard to not fail these
      allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
      still fail if the _current_ process is the OOM killer victim. Moreover
      there is an attempt to move away from the default no-fail behavior and
      allow these allocation to fail more eagerly. This would lead to:
      
      [   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
      
      which is clearly undesirable and the nofail behavior should be explicit
      if the allocation failure cannot be tolerated.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      277fb5fc
    • M
      btrfs: Prevent from early transaction abort · d1b5c567
      Michal Hocko 提交于
      Btrfs relies on GFP_NOFS allocation when committing the transaction but
      this allocation context is rather weak wrt. reclaim capabilities. The
      page allocator currently tries hard to not fail these allocations if
      they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
      currently but there is an attempt to move away from the default no-fail
      behavior and allow these allocation to fail more eagerly. And this would
      lead to a pre-mature transaction abort as follows:
      
      [   55.328093] Call Trace:
      [   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
      [   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
      [   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
      [   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
      [   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
      [   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
      [   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
      [   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
      [   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
      [   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
      [   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
      [   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
      [   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
      [   55.384280] ------------[ cut here ]------------
      [   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
      [...]
      [   55.384337] Call Trace:
      [   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
      [   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
      [   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
      [   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
      [   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
      [   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [...]
      [   55.384608] ---[ end trace c29799da1d4dd621 ]---
      [   55.437323] BTRFS info (device hdb1): forced readonly
      [   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry
      
      Fix this by being explicit about the no-fail behavior of this allocation
      path and use __GFP_NOFAIL.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d1b5c567
    • Z
      btrfs: Remove unused arguments in tree-log.c · 60d53eb3
      Zhaolei 提交于
      Following arguments are not used in tree-log.c:
       insert_one_name(): path, type
       wait_log_commit(): trans
       wait_for_writer(): trans
      
      This patch remove them.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      60d53eb3
    • Z
      btrfs: Remove useless condition in start_log_trans() · 34eb2a52
      Zhaolei 提交于
      Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
      for start_log_trans():
       fs/btrfs/tree-log.c:178 start_log_trans()
       warn: we tested 'root->log_root' before and it was 'false'
      
       fs/btrfs/tree-log.c
       147          if (root->log_root) {
       We test "root->log_root" here.
       ...
      
      Reason:
       Condition of:
       fs/btrfs/tree-log.c:178: if (!root->log_root) {
       is not necessary after commit: 7237f183
      
       It caused a smatch warning, and no functionally error.
      
      Fix:
       Deleting above condition will make smatch shut up,
       but a better way is to do cleanup for start_log_trans()
       to remove duplicated code and make code more readable.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      34eb2a52
  10. 15 8月, 2015 1 次提交
  11. 14 8月, 2015 2 次提交
  12. 09 8月, 2015 11 次提交
    • C
      Btrfs: add support for blkio controllers · da2f0f74
      Chris Mason 提交于
      This attaches accounting information to bios as we submit them so the
      new blkio controllers can throttle on btrfs filesystems.
      
      Not much is required, we're just associating bios with blkcgs during clone,
      calling wbc_init_bio()/wbc_account_io() during writepages submission,
      and attaching the bios to the current context during direct IO.
      
      Finally if we are splitting bios during btrfs_map_bio, this attaches
      accounting information to the split.
      
      The end result is able to throttle nicely on single disk filesystems.  A
      little more work is required for multi-device filesystems.
      Signed-off-by: NChris Mason <clm@fb.com>
      da2f0f74
    • B
      Btrfs: remove unused mutex from struct 'btrfs_fs_info' · a4027a20
      Byongho Lee 提交于
      The code using 'ordered_extent_flush_mutex' mutex has removed by below
      commit.
       - 8d875f95
         btrfs: disable strict file flushes for renames and truncates
      But the mutex still lives in struct 'btrfs_fs_info'.
      
      So, this patch removes the mutex from struct 'btrfs_fs_info' and its
      initialization code.
      Signed-off-by: NByongho Lee <bhlee.kernel@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a4027a20
    • O
      Btrfs: fix parity scrub of RAID 5/6 with missing device · 4a770891
      Omar Sandoval 提交于
      When testing the previous patch, Zhao Lei reported a similar bug when
      attempting to scrub a degraded RAID 5/6 filesystem with a missing
      device, leading to NULL pointer dereferences from the RAID 5/6 parity
      scrubbing code.
      
      The first cause was the same as in the previous patch: attempting to
      call bio_add_page() on a missing block device. To fix this,
      scrub_extent_for_parity() can just mark the sectors on the missing
      device as errors instead of attempting to read from it.
      
      Additionally, the code uses scrub_remap_extent() to map the extent of
      the corresponding data stripe, but the extent wasn't already mapped. If
      scrub_remap_extent() finds a missing block device, it doesn't initialize
      extent_dev, so we're left with a NULL struct btrfs_device. The solution
      is to use btrfs_map_block() directly.
      Reported-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4a770891
    • O
      Btrfs: fix device replace of a missing RAID 5/6 device · 73ff61db
      Omar Sandoval 提交于
      The original implementation of device replace on RAID 5/6 seems to have
      missed support for replacing a missing device. When this is attempted,
      we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
      crashes when we try to dereference it. This happens because
      btrfs_map_block() has no choice but to return us the missing device
      because RAID 5/6 don't have any alternate mirrors to read from, and a
      missing device has a NULL bdev.
      
      The idea implemented here is to handle the missing device case
      separately, which better only happen when we're replacing a missing RAID
      5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
      reconstruct the data from parity, check it with
      scrub_recheck_block_checksum(), and write it out with
      scrub_write_block_to_dev_replace().
      Reported-by: NPhilip <bugzilla@philip-seeger.de>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      73ff61db
    • O
      Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation · b4ee1782
      Omar Sandoval 提交于
      The current RAID 5/6 recovery code isn't quite prepared to handle
      missing devices. In particular, it expects a bio that we previously
      attempted to use in the read path, meaning that it has valid pages
      allocated. However, missing devices have a NULL blkdev, and we can't
      call bio_add_page() on a bio with a NULL blkdev. We could do manual
      manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
      a separate path that allows us to manually add pages to the rbio.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b4ee1782
    • O
      Btrfs: count devices correctly in readahead during RAID 5/6 replace · 7cb2c420
      Omar Sandoval 提交于
      Commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for raid5/6
      degraded mounting") fixed a problem where we would skip a missing device
      when we shouldn't have because there are no other mirrors to read from
      in RAID 5/6. After commit 2c8cdd6e ("Btrfs, replace: write dirty
      pages into the replace target device"), the fix doesn't work when we're
      doing a missing device replace on RAID 5/6 because the replace device is
      counted as a mirror so we're tricked into thinking we can safely skip
      the missing device. The fix is to count only the real stripes and decide
      based on that.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7cb2c420
    • O
      Btrfs: remove misleading handling of missing device scrub · 03679ade
      Omar Sandoval 提交于
      scrub_submit() claims that it can handle a bio with a NULL block device,
      but this is misleading, as calling bio_add_page() on a bio with a NULL
      ->bi_bdev would've already crashed. Delete this, as we're about to
      properly handle a missing block device.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      03679ade
    • M
      btrfs: fix clone / extent-same deadlocks · 293a8489
      Mark Fasheh 提交于
      Clone and extent same lock their source and target inodes in opposite order.
      In addition to this, the range locking in clone doesn't take ordering into
      account. Fix this by having clone use the same locking helpers as
      btrfs-extent-same.
      
      In addition, I do a small cleanup of the locking helpers, removing a case
      (both inodes being the same) which was poorly accounted for and never
      actually used by the callers.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      293a8489
    • L
      Btrfs: fix defrag to merge tail file extent · 4a3560c4
      Liu Bo 提交于
      The file layout is
      
      [extent 1]...[extent n][4k extent][HOLE][extent x]
      
      extent 1~n and 4k extent can be merged during defrag, and the whole
      defrag bytes is larger than our defrag thresh(256k), 4k extent as a
      tail is left unmerged since we check if its next extent can be merged
      (the next one is a hole, so the check will fail), the layout thus can
      be
      
      [new extent][4k extent][HOLE][extent x]
       (1~n)
      
      To fix it, beside looking at the next one, this also looks at the
      previous one by checking @defrag_end, which is set to 0 when we
      decide to stop merging contiguous extents, otherwise, we can merge
      the previous one with our extent.
      
      Also, this makes btrfs behave consistent with how xfs and ext4 do.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4a3560c4
    • L
      Btrfs: fix warning in backref walking · acdf898d
      Liu Bo 提交于
      When we do backref walking, we search firstly in queued delayed refs
      and then the on-disk backrefs, but we parse differently for shared
      references, for delayed refs we also add 'ref->root' while for on-disk
      backrefs we don't, this can prevent us from merging refs indexed
      by the same bytenr and cause find_parent_nodes() to throw a warning at
      'WARN_ON(ref->count < 0)', for example, when we have a shared data extent
      with 'ref_cnt=1' and a delayed shared data with a BTRFS_DROP_DELAYED_REF,
      that happens.
      
      For shared references, no matter if it's delayed or on-disk, ref->root is
      not at all used, instead it's ref->parent that really matters, so this has
      delayed refs handled as the same way as on-disk refs.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      acdf898d
    • Z
      btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() · 166f66d0
      Zhaolei 提交于
      When a task trying to double lock a extent buffer, there are no
      lockdep warning about it because this lock may be in "blocking_lock"
      state, and make us hard to debug.
      
      This patch add a WARN_ON() for above condition, it can not report
      all deadlock cases(as lock between tasks), but at least helps us
      some.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      166f66d0