1. 22 10月, 2015 3 次提交
  2. 11 10月, 2015 1 次提交
  3. 08 10月, 2015 1 次提交
  4. 06 10月, 2015 1 次提交
    • F
      Btrfs: fix deadlock when finalizing block group creation · d9a0540a
      Filipe Manana 提交于
      Josef ran into a deadlock while a transaction handle was finalizing the
      creation of its block groups, which produced the following trace:
      
        [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
        [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
        [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
        [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
        [260445.593137] Call Trace:
        [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
        [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
        [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
        [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
        [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
        [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
        [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
        [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
        [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      This happened because the same transaction handle created a large number
      of block groups and while finalizing their creation (inserting new items
      and updating existing items in the chunk and device trees) a new metadata
      extent had to be allocated and no free space was found in the current
      metadata block groups, which made find_free_extent() attempt to allocate
      a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
      ended up allocating a new system chunk too and exceeded the threshold
      of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
      final part of block group creation again (at
      btrfs_create_pending_block_groups()) and attempt to lock again the root
      of the chunk tree when it's already write locked by the same task.
      
      Similarly we can deadlock on extent tree nodes/leafs if while we are
      running delayed references we end up creating a new metadata block group
      in order to allocate a new node/leaf for the extent tree (as part of
      a CoW operation or growing the tree), as btrfs_create_pending_block_groups
      inserts items into the extent tree as well. In this case we get the
      following trace:
      
        [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
        [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
        [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
        [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
        [14242.773606] Call Trace:
        [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
        [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
        [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
        [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
        [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
        [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
        [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
        [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
        [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
        [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
        [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
        [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
        [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
        [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
        [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
        [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
        [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      Fix this by never recursing into the finalization phase of block group
      creation and making sure we never trigger the finalization of block group
      creation while running delayed references.
      Reported-by: NJosef Bacik <jbacik@fb.com>
      Fixes: 00d80e34 ("Btrfs: fix quick exhaustion of the system array in the superblock")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      d9a0540a
  5. 29 9月, 2015 1 次提交
  6. 23 9月, 2015 1 次提交
    • J
      Btrfs: keep dropped roots in cache until transaction commit · 2b9dbef2
      Josef Bacik 提交于
      When dropping a snapshot we need to account for the qgroup changes.  If we drop
      the snapshot in all one go then the backref code will fail to find blocks from
      the snapshot we dropped since it won't be able to find the root in the fs root
      cache.  This can lead to us failing to find refs from other roots that pointed
      at blocks in the now deleted root.  To handle this we need to not remove the fs
      roots from the cache until after we process the qgroup operations.  Do this by
      adding dropped roots to a list on the transaction, and letting the transaction
      remove the roots at the same time it drops the commit roots.  This will keep all
      of the backref searching code in sync properly, and fixes a problem Mark was
      seeing with snapshot delete and qgroups.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2b9dbef2
  7. 08 9月, 2015 1 次提交
    • F
      Btrfs: don't initialize a space info as full to prevent ENOSPC · 6af3e3ad
      Filipe Manana 提交于
      Commit 2e6e5183 ("Btrfs: fix block group ->space_info null pointer
      dereference") accidently marked a space info as full when initializing
      it with a value of 0 total bytes. This introduces an ENOSPC problem when
      writing file data if we mount a filesystem that has no data block groups
      allocated, because the data space info is initialized with 0 total bytes,
      marked as full, and it never gets its total bytes incremented by a
      (positive) value to unmark it as full (because there are no data block
      groups loaded when the fs is mounted).
      For metadata and system spaces this issue can never happen since we always
      have at least one metadata block group and one system block group (even
      for an empty filesystem).
      
      So fix this by just not initializing a space info as full, reverting the
      offending part of the commit mentioned above.
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
      
        # Mount our filesystem without space caches enabled so that we do not
        # get any space used from the initial data block group that mkfs creates
        # (space caches used space from data block groups).
        _scratch_mount "-o nospace_cache"
      
        # Need an fs with at least 2Gb to make sure mkfs.btrfs does not create
        # an fs using mixed block groups (used both for data and metadata). We
        # really need to have dedicated block groups for data to reproduce the
        # issue and mkfs.btrfs defaults to mixed block groups only for small
        # filesystems (up to 1Gb).
        _require_fs_space $SCRATCH_MNT $((2 * 1024 * 1024))
      
        # Run balance with the purpose of deleting the unused data block group
        # that mkfs created. We could also wait for the background kthread to
        # automatically delete the unused block group, but we do not have a way
        # to make it run and wait for it to complete, so just do a balance
        # instead of some unreliable sleep
        _run_btrfs_util_prog balance start -dusage=0 $SCRATCH_MNT
      
        # Now unmount the filesystem, mount it again (either with or with space
        # caches enabled, it does not matter to trigger the problem) and attempt
        # to create a file with some data - this used to fail with ENOSPC
        # because there were no data block groups when the filesystem was
        # mounted and the data space info object was marked as full when
        # initialized (because it had 0 total bytes), which prevented the file
        # write path from attempting to allocate a data block group and fail
        # immediately with ENOSPC.
        _scratch_remount
        echo "hello world" > $SCRATCH_MNT/foobar
      
        echo "Silence is golden"
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      6af3e3ad
  8. 09 8月, 2015 3 次提交
  9. 29 7月, 2015 5 次提交
    • J
      btrfs: add missing discards when unpinning extents with -o discard · e33e17ee
      Jeff Mahoney 提交于
      When we clear the dirty bits in btrfs_delete_unused_bgs for extents
      in the empty block group, it results in btrfs_finish_extent_commit being
      unable to discard the freed extents.
      
      The block group removal patch added an alternate path to forget extents
      other than btrfs_finish_extent_commit.  As a result, any extents that
      would be freed when the block group is removed aren't discarded.  In my
      test run, with a large copy of mixed sized files followed by removal, it
      left nearly 2/3 of extents undiscarded.
      
      To clean up the block groups, we add the removed block group onto a list
      that will be discarded after transaction commit.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e33e17ee
    • J
      btrfs: iterate over unused chunk space in FITRIM · 499f377f
      Jeff Mahoney 提交于
      Since we now clean up block groups automatically as they become
      empty, iterating over block groups is no longer sufficient to discard
      unused space.
      
      This patch iterates over the unused chunk space and discards any regions
      that are unallocated, regardless of whether they were ever used.  This is
      a change for btrfs but is consistent with other file systems.
      
      We do this in a transactionless manner since the discard process can take
      a substantial amount of time and a transaction would need to be started
      before the acquisition of the device list lock.  That would mean a
      transaction would be held open across /all/ of the discards collectively.
      In order to prevent other threads from allocating or freeing chunks, we
      hold the chunks lock across the search and discard calls.  We release it
      between searches to allow the file system to perform more-or-less
      normally.  Since the running transaction can commit and disappear while
      we're using the transaction pointer, we take a reference to it and
      release it after the search.  This is safe since it would happen normally
      at the end of the transaction commit after any locks are released anyway.
      We also take the commit_root_sem to protect against a transaction starting
      and committing while we're running.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      499f377f
    • J
      btrfs: skip superblocks during discard · 86557861
      Jeff Mahoney 提交于
      Btrfs doesn't track superblocks with extent records so there is nothing
      persistent on-disk to indicate that those blocks are in use.  We track
      the superblocks in memory to ensure they don't get used by removing them
      from the free space cache when we load a block group from disk.  Prior
      to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
      was fine since the block group would never be reclaimed so the superblock
      was always safe.  Once we started removing the empty block groups, we
      were protected by the fact that discards weren't being properly issued
      for unused space either via FITRIM or -odiscard.  The block groups were
      still being released, but the blocks remained on disk.
      
      In order to properly discard unused block groups, we need to filter out
      the superblocks from the discard range.  Superblocks are located at fixed
      locations on each device, so it makes sense to filter them out in
      btrfs_issue_discard, which is used by both -odiscard and FITRIM.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      86557861
    • J
      btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries · 4d89d377
      Jeff Mahoney 提交于
      It's possible, though unexpected, to pass unaligned offsets and lengths
      to btrfs_issue_discard.  We then shift the offset/length values to sector
      units.  If an unaligned offset has been passed, it will result in the
      entire sector being discarded, possibly losing data.  An unaligned
      length is safe but we'll end up returning an inaccurate number of
      discarded bytes.
      
      This patch aligns the offset to the 512B boundary, adjusts the length,
      and warns, since we shouldn't be discarding on an offset that isn't
      aligned with our sector size.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4d89d377
    • J
      btrfs: make btrfs_issue_discard return bytes discarded · d04c6b88
      Jeff Mahoney 提交于
      Initially this will just be the length argument passed to it,
      but the following patches will adjust that to reflect re-alignment
      and skipped blocks.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d04c6b88
  10. 23 7月, 2015 1 次提交
    • F
      Btrfs: fix quick exhaustion of the system array in the superblock · 00d80e34
      Filipe Manana 提交于
      Omar reported that after commit 4fbcdf66 ("Btrfs: fix -ENOSPC when
      finishing block group creation"), introduced in 4.2-rc1, the following
      test was failing due to exhaustion of the system array in the superblock:
      
        #!/bin/bash
      
        truncate -s 100T big.img
        mkfs.btrfs big.img
        mount -o loop big.img /mnt/loop
      
        num=5
        sz=10T
        for ((i = 0; i < $num; i++)); do
            echo fallocate $i $sz
            fallocate -l $sz /mnt/loop/testfile$i
        done
        btrfs filesystem sync /mnt/loop
      
        for ((i = 0; i < $num; i++)); do
              echo rm $i
              rm /mnt/loop/testfile$i
              btrfs filesystem sync /mnt/loop
        done
        umount /mnt/loop
      
      This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
      allocation of system block groups. This happened because the test creates
      a large number of data block groups per transaction and when committing
      the transaction we start the writeout of the block group caches for all
      the new new (dirty) block groups, which results in pre-allocating space
      for each block group's free space cache using the same transaction handle.
      That in turn often leads to creation of more block groups, and all get
      attached to the new_bgs list of the same transaction handle to the point
      of getting a list with over 1500 elements, and creation of new block groups
      leads to the need of reserving space in the chunk block reserve and often
      creating a new system block group too.
      
      So that made us quickly exhaust the chunk block reserve/system space info,
      because as of the commit mentioned before, we do reserve space for each
      new block group in the chunk block reserve, unlike before where we would
      not and would at most allocate one new system block group and therefore
      would only ensure that there was enough space in the system space info to
      allocate 1 new block group even if we ended up allocating thousands of
      new block groups using the same transaction handle. That worked most of
      the time because the computed required space at check_system_chunk() is
      very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
      that all nodes/leafs in a path will be COWed and split) and since the
      updates to the chunk tree all happen at btrfs_create_pending_block_groups
      it is unlikely that a path needs to be COWed more than once (unless
      writepages() for the btree inode is called by mm in between) and that
      compensated for the need of creating any new nodes/leads in the chunk
      tree.
      
      So fix this by ensuring we don't accumulate a too large list of new block
      groups in a transaction's handles new_bgs list, inserting/updating the
      chunk tree for all accumulated new block groups and releasing the unused
      space from the chunk block reserve whenever the list becomes sufficiently
      large. This is a generic solution even though the problem currently can
      only happen when starting the writeout of the free space caches for all
      dirty block groups (btrfs_start_dirty_block_groups()).
      Reported-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      00d80e34
  11. 12 7月, 2015 1 次提交
    • F
      Btrfs: fix order by which delayed references are run · cffc3374
      Filipe Manana 提交于
      When we have an extent that got N references removed and N new references
      added in the same transaction, we must run the insertion of the references
      first because otherwise the last removed reference will remove the extent
      item from the extent tree, resulting in a failure for the insertions.
      
      This is a regression introduced in the 4.2-rc1 release and this fix just
      brings back the behaviour of selecting reference additions before any
      reference removals.
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_cloner
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create prealloc extent covering range [160K, 620K[
        $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo
      
        # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
        # space that immediately follows it. This creates a new extent of 40K that spans
        # the range [620K, 660K[.
        $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # At this point, there are now 2 back references to the prealloc extent in our
        # extent tree. Both are for our file offset 160K and one relates to a file
        # extent item with a data offset of 0 and a length of 380K, while the other
        # relates to a file extent item with a data offset of 380K and a length of 80K.
      
        # Make sure everything done so far is durably persisted (all back references are
        # in the extent tree, etc).
        sync
      
        # Now clone all extents of our file that cover the offset 160K up to its eof
        # (660K at this point) into itself at offset 2M. This leaves a hole in the file
        # covering the range [660K, 2M[. The prealloc extent will now be referenced by
        # the file twice, once for offset 160K and once for offset 2M. The 40K extent
        # that follows the prealloc extent will also be referenced twice by our file,
        # once for offset 620K and once for offset 2M + 460K.
        $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
      	$SCRATCH_MNT/foo
      
        # Now create one new extent in our file with a size of 100Kb. It will span the
        # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
        # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
        $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # At this point, there are now (in memory) 4 back references to the prealloc
        # extent.
        #
        # Two of them are for file offset 160K, related to file extent items
        # matching the file offsets 160K and 540K respectively, with data offsets of
        # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
        #
        # The other two references are for file offset 2M, related to file extent items
        # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
        # 0 and 380K respectively, and with lengths of 389K and 80K respectively.
        #
        # The 40K extent has 2 back references, one for file offset 620K and the other
        # for file offset 2M + 460K.
        #
        # The 100K extent has a single back reference and it relates to file offset 3M.
      
        # Now clone our 100K extent into offset 600K. That offset covers the last 20K
        # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
        # offset 660K.
        $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
            $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        # At this point there's only one reference to the 40K extent, at file offset
        # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
        # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
        # file offset 3M and a new one for file offset 600K).
      
        # Now fsync our file to make all its new data and metadata updates are durably
        # persisted and present if a power failure/crash happens after a successful
        # fsync and before the next transaction commit.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        # Silently drop all writes and ummount to simulate a crash/power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file contents.
        # During log replay, the btrfs delayed references implementation used to run the
        # deletion of back references before the addition of new back references, which
        # made the addition fail as it didn't find the key in the extent tree that it
        # was looking for. The failure triggered by this test was related to the 40K
        # extent, which got 1 reference dropped and 1 reference added during the fsync
        # log replay - when running the delayed references at transaction commit time,
        # btrfs was applying the deletion before the insertion, resulting in a failure
        # of the insertion that ended up turning the fs into read-only mode.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        _unmount_flakey
      
        status=0
        exit
      
      This issue turned the filesystem into read-only mode (current transaction
      aborted) and produced the following traces:
      
        [ 8247.578385] ------------[ cut here ]------------
        [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
        (...)
        [ 8247.601697] Call Trace:
        [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [ 8247.605488]  [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
        [ 8247.608226]  [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
        [ 8247.617061]  [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
        [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
        [ 8247.624366]  [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
        [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
        [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
        [ 8247.628531]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
        (...)
        [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---
      
        [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
        [ 8247.728954] BTRFS: Transaction aborted (error -5)
        (...)
        [ 8247.760866] Call Trace:
        [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [ 8247.767582]  [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
        [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
        [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
        [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
        [ 8247.773664]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
        [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
        [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
        [ 8247.777427]  [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
        [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
        [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
        [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
        [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
        (...)
        [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
        [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
        [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Acked-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      cffc3374
  12. 01 7月, 2015 1 次提交
    • F
      Btrfs: fix race between balance and unused block group deletion · 67c5e7d4
      Filipe Manana 提交于
      We have a race between deleting an unused block group and balancing the
      same block group that leads to an assertion failure/BUG(), producing the
      following trace:
      
      [181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
      [181631.220591] ------------[ cut here ]------------
      [181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
      [181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
      [181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
      [181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
      [181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
      [181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
      [181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
      [181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
      [181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
      [181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
      [181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
      [181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
      [181631.224566] Stack:
      [181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
      [181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
      [181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
      [181631.224566] Call Trace:
      [181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
      [181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
      [181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
      [181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
      [181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      [181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
      [181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
      [181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
      [181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
      [181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
      (...)
      
      The sequence of steps leading to this are:
      
                 CPU 0                                         CPU 1
      
        btrfs_balance()
          btrfs_relocate_chunk()
      
            btrfs_relocate_block_group(bg X)
              btrfs_lookup_block_group(bg X)
      
                                                     cleaner_kthread
                                                        locks fs_info->cleaner_mutex
      
                                                        btrfs_delete_unused_bgs()
                                                          finds bg X, which became
                                                          unused in the previous
                                                          transaction
      
                                                          checks bg X ->ro == 0,
                                                          so it proceeds
              sets bg X ->ro to 1
              (btrfs_set_block_group_ro(bg X))
      
              blocks on fs_info->cleaner_mutex
                                                          btrfs_remove_chunk(bg X)
                                                        unlocks fs_info->cleaner_mutex
      
              acquires fs_info->cleaner_mutex
              relocate_block_group()
                --> does nothing, no extents found in
                    the extent tree from bg X
              unlocks fs_info->cleaner_mutex
      
            btrfs_relocate_block_group(bg X) returns
      
          btrfs_remove_chunk(bg X)
             extent map not found
                --> ASSERT(0)
      
      Fix this by using a new mutex to make sure these 2 operations, block
      group relocation and removal, are serialized.
      
      This issue is reproducible by running fstests generic/038 (which stresses
      chunk allocation and automatic removal of unused block groups) together
      with the following balance loop:
      
          while true; do btrfs balance start -dusage=0 <mountpoint> ; done
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      67c5e7d4
  13. 11 6月, 2015 5 次提交
  14. 10 6月, 2015 1 次提交
    • F
      Btrfs: fix necessary chunk tree space calculation when allocating a chunk · 4617ea3a
      Filipe Manana 提交于
      When allocating a new chunk or removing one we need to update num_devs
      device items and insert or remove a chunk item in the chunk tree, so
      in the worst case the space needed in the chunk space_info is:
      
        btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
           btrfs_calc_trans_metadata_size(chunk_root, 1)
      
      That is, in the worst case we need to cow num_devs paths and cow 1 other
      path that can result in splitting every node and leaf, and each path
      consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
      some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
      which were unnecessary since updating the existing device items does
      not result in splitting the nodes and leaf since after updating them
      they remain with the same size.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4617ea3a
  15. 03 6月, 2015 5 次提交
    • L
      Btrfs: fix up read_tree_block to return proper error · 64c043de
      Liu Bo 提交于
      The return value of read_tree_block() can confuse callers as it always
      returns NULL for either -ENOMEM or -EIO, so it's likely that callers
      parse it to a wrong error, for instance, in btrfs_read_tree_root().
      
      This fixes the above issue.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      64c043de
    • L
      Btrfs: add missing free_extent_buffer · 8635eda9
      Liu Bo 提交于
      read_tree_block may take a reference on the 'eb', a following
      free_extent_buffer is necessary.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      8635eda9
    • F
      Btrfs: fix -ENOSPC on block group removal · 39c2d7fa
      Filipe Manana 提交于
      Unlike when attempting to allocate a new block group, where we check
      that we have enough space in the system space_info to update the device
      items and insert a new chunk item in the chunk tree, we were not checking
      if the system space_info had enough space for updating the device items
      and deleting the chunk item in the chunk tree. This often lead to -ENOSPC
      error when attempting to allocate blocks for the chunk tree (during btree
      node/leaf COW operations) while updating the device items or deleting the
      chunk item, which resulted in the current transaction being aborted and
      turning the filesystem into read-only mode.
      
      While running fstests generic/038, which stresses allocation of block
      groups and removal of unused block groups, with a large scratch device
      (750Gb) this happened often, despite more than enough unallocated space,
      and resulted in the following trace:
      
      [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [68663.600407] BTRFS: Transaction aborted (error -28)
      (...)
      [68663.730829] Call Trace:
      [68663.732585]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [68663.734334]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [68663.739980]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [68663.757153]  [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [68663.760925]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [68663.762854]  [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
      [68663.764073]  [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [68663.765130]  [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
      [68663.765998]  [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
      [68663.767068]  [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
      [68663.768227]  [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
      [68663.769081]  [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
      [68663.799485]  [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
      [68663.809208]  [<ffffffff8105f367>] kthread+0xef/0xf7
      [68663.828795]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
      [68663.844942]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
      [68663.846486]  [<ffffffff81435a88>] ret_from_fork+0x58/0x90
      [68663.847760]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
      [68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
      [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left
      
      So fix this by verifying that enough space exists in system space_info,
      and reserving the space in the chunk block reserve, before attempting to
      delete the block group and allocate a new system chunk if we don't have
      enough space to perform the necessary updates and delete in the chunk
      tree. Like for the block group creation case, we don't error our if we
      fail to allocate a new system chunk, since we might end up not needing
      it (no node/leaf splits happen during the COW operations and/or we end
      up not needing to COW any btree nodes or leafs because they were already
      COWed in the current transaction and their writeback didn't start yet).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      39c2d7fa
    • F
      Btrfs: fix -ENOSPC when finishing block group creation · 4fbcdf66
      Filipe Manana 提交于
      While creating a block group, we often end up getting ENOSPC while updating
      the chunk tree, which leads to a transaction abortion that produces a trace
      like the following:
      
      [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [30670.117777] BTRFS: Transaction aborted (error -28)
      (...)
      [30670.163567] Call Trace:
      [30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
      [30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
      [30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
      [30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
      [30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
      
      This is because we don't do proper space reservation for the chunk block
      reserve when we have multiple tasks allocating chunks in parallel.
      
      So block group creation has 2 phases, and the first phase essentially
      checks if there is enough space in the system space_info, allocating a
      new system chunk if there isn't, while the second phase updates the
      device, extent and chunk trees. However, because the updates to the
      chunk tree happen in the second phase, if we have N tasks, each with
      its own transaction handle, allocating new chunks in parallel and if
      there is only enough space in the system space_info to allocate M chunks,
      where M < N, none of the tasks ends up allocating a new system chunk in
      the first phase and N - M tasks will get -ENOSPC when attempting to
      update the chunk tree in phase 2 if they need to COW any nodes/leafs
      from the chunk tree.
      
      Fix this by doing proper reservation in the chunk block reserve.
      
      The issue could be reproduced by running fstests generic/038 in a loop,
      which eventually triggered the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4fbcdf66
    • F
      Btrfs: fix block group ->space_info null pointer dereference · 2e6e5183
      Filipe Manana 提交于
      When we create a block group we add it to the rbtree of block groups
      before setting its ->space_info field (while it's NULL). This is
      problematic since other tasks can access the block group from the
      rbtree and attempt to use its ->space_info before it is set by
      btrfs_make_block_group().
      
      This can happen for example when a concurrent fitrim ioctl operation
      is ongoing, which produces a trace like the following when
      CONFIG_DEBUG_PAGEALLOC is set.
      
      [11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0
      [11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou
      [11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000
      [11509.608179] RIP: 0010:[<ffffffff8107d675>]  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] RSP: 0018:ffff8801b3edf9e8  EFLAGS: 00010002
      [11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
      [11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
      [11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000
      [11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000
      [11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0
      [11509.608179] FS:  00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
      [11509.608179] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0
      [11509.608179] Stack:
      [11509.608179]  0000000000000000 0000000000000000 0000000000000004 0000000000000000
      [11509.608179]  ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000
      [11509.608179]  0000000000000001 0000000000000000 ffff880100000000 00000000000006c4
      [11509.608179] Call Trace:
      [11509.608179]  [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02
      [11509.608179]  [<ffffffff8107e806>] lock_acquire+0xa5/0x116
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs]
      [11509.608179]  [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs]
      [11509.608179]  [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs]
      [11509.608179]  [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs]
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e
      [11509.608179]  [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479
      [11509.608179]  [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e
      [11509.608179]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [11509.608179]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [11509.608179]  [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f
      [11509.608179]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00
      [11509.608179] RIP  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179]  RSP <ffff8801b3edf9e8>
      [11509.608179] CR2: 0000000000000018
      [11509.608179] ---[ end trace 570a5c6769f0e49a ]---
      
      Which corresponds to the following access in fs/btrfs/free-space-cache.c:
      
        static int do_trimming(struct btrfs_block_group_cache *block_group,
                               u64 *total_trimmed, u64 start, u64 bytes,
                               u64 reserved_start, u64 reserved_bytes,
                               struct btrfs_trim_range *trim_entry)
        {
             struct btrfs_space_info *space_info = block_group->space_info;
        (...)
             spin_lock(&space_info->lock);
             ^^^^^ - block_group->space_info is NULL...
      
      Fix this by ensuring the block group's ->space_info is set before adding
      the block group to the rbtree.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2e6e5183
  16. 21 5月, 2015 1 次提交
  17. 20 5月, 2015 1 次提交
    • F
      Btrfs: fix racy system chunk allocation when setting block group ro · a9629596
      Filipe Manana 提交于
      If while setting a block group read-only we end up allocating a system
      chunk, through check_system_chunk(), we were not doing it while holding
      the chunk mutex which is a problem if a concurrent chunk allocation is
      happening, through do_chunk_alloc(), as it means both block groups can
      end up using the same logical addresses and physical regions in the
      device(s). So make sure we hold the chunk mutex.
      
      Cc: stable@vger.kernel.org  # 4.0+
      Fixes: 2f081088 ("btrfs: delete chunk allocation attemp when
                            setting block group ro")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a9629596
  18. 11 5月, 2015 1 次提交
    • F
      Btrfs: fix race between block group creation and their cache writeout · ff1f8250
      Filipe Manana 提交于
      So creating a block group has 2 distinct phases:
      
      Phase 1 - creates the btrfs_block_group_cache item and adds it to the
      rbtree fs_info->block_group_cache_tree and to the corresponding list
      space_info->block_groups[];
      
      Phase 2 - adds the block group item to the extent tree and corresponding
      items to the chunk tree.
      
      The first phase adds the block_group_cache_item to a list of pending block
      groups in the transaction handle, and phase 2 happens when
      btrfs_end_transaction() is called against the transaction handle.
      
      It happens that once phase 1 completes, other concurrent tasks that use
      their own transaction handle, but points to the same running transaction
      (struct btrfs_trans_handle->transaction), can use this block group for
      space allocations and therefore mark it dirty. Dirty block groups are
      tracked in a list belonging to the currently running transaction (struct
      btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).
      
      This is a problem because once a task calls btrfs_commit_transaction(),
      it calls btrfs_start_dirty_block_groups() which will see all dirty block
      groups and attempt to start their writeout, including those that are
      still attached to the transaction handle of some concurrent task that
      hasn't called btrfs_end_transaction() yet - which means those block
      groups haven't gone through phase 2 yet and therefore when
      write_one_cache_group() is called, it won't find the block group items
      in the extent tree and abort the current transaction with -ENOENT,
      turning the fs into readonly mode and require a remount.
      
      Fix this by ignoring -ENOENT when looking for block group items in the
      extent tree when we attempt to start the writeout of the block group
      caches outside the critical section of the transaction commit. We will
      try again later during the critical section and if there we still don't
      find the block group item in the extent tree, we then abort the current
      transaction.
      
      This issue happened twice, once while running fstests btrfs/067 and once
      for btrfs/078, which produced the following trace:
      
      [ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [ 3278.707329] BTRFS: Transaction aborted (error -2)
      (...)
      [ 3278.731555] Call Trace:
      [ 3278.732396]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [ 3278.733860]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [ 3278.735312]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [ 3278.736874]  [<ffffffffa03ada6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.738302]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [ 3278.739520]  [<ffffffffa03ada6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.741222]  [<ffffffffa03b9e56>] write_one_cache_group+0xae/0xbf [btrfs]
      [ 3278.742797]  [<ffffffffa03c487b>] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
      [ 3278.744492]  [<ffffffffa03d309c>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [ 3278.746084]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [ 3278.747249]  [<ffffffffa03e5660>] btrfs_sync_file+0x313/0x387 [btrfs]
      [ 3278.748744]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [ 3278.749958]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [ 3278.751218]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [ 3278.754197]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [ 3278.755192]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [ 3278.756236]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ff1f8250
  19. 26 4月, 2015 4 次提交
    • O
      btrfs: handle ENOMEM in btrfs_alloc_tree_block · 67b7859e
      Omar Sandoval 提交于
      This is one of the first places to give out when memory is tight. Handle
      it properly rather than with a BUG_ON.
      
      Also fix the comment about the return value, which is an ERR_PTR, not
      NULL, on error.
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      67b7859e
    • C
      Btrfs: don't check for delalloc_bytes in cache_save_setup · e4c88f00
      Chris Mason 提交于
      Now that we're doing free space cache writeback outside the critical
      section in the commit, there is a bigger window for delalloc_bytes to
      be added after a cache has been written.  find_free_extent may do this
      without putting the block group back into the dirty list, and also
      without a transaction running.
      
      Checking for delalloc_bytes in cache_save_setup means we might leave the
      cache marked as written without invalidating it.  Consistency checks
      during mount will toss the cache, but it's better to get rid of the
      check in cache_save_setup and let it get invalidated by the checks
      already done during cache write out.
      Signed-off-by: NChris Mason <clm@fb.com>
      e4c88f00
    • F
      Btrfs: fix deadlock when starting writeback of bg caches · 24b89d08
      Filipe Manana 提交于
      While starting the writes of the dirty block group caches, if we don't
      find a block group item in the extent tree we were leaving without
      releasing our path, running delayed references and then looping again to
      process any new dirty block groups. However this second iteration of the
      loop could cause a deadlock because it tries to lock some other extent
      tree node/leaf which another task already locked and it's blocked because
      it's waiting for a lock on some node/leaf that is in our path that was not
      released before.
      We could also deadlock when running the delayed references - as we could
      end up trying to lock the same nodes/leafs that we have in our local path
      (with a different lock type).
      
      Got into such case when running xfstests:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      (...)
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      [20892.298222] ------------[ cut here ]------------
      [20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]()
      (...)
      [20892.326253] Call Trace:
      [20892.326904]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.329503]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.330815]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.332556]  [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.333955]  [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c
      [20892.335562]  [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.336849]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [20892.338222]  [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs]
      [20892.339823]  [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs]
      [20892.341275]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
      [20892.342810]  [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs]
      [20892.344184]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.347162]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.361015] ---[ end trace 597f77e664245374 ]---
      [21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds.
      [21120.689881]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.703696] Call Trace:
      [21120.704310]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.705490]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.706757]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.708156]  [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs]
      [21120.709892]  [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs]
      [21120.711605]  [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs]
      [21120.723440]  [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs]
      [21120.724943]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
      [21120.726008]  [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa
      [21120.727230]  [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b
      [21120.728526]  [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b
      [21120.729701]  [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b
      (...)
      [21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds.
      [21120.749459]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.768457] Call Trace:
      [21120.769039]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.770107]  [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs]
      [21120.771558]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.773659]  [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs]
      [21120.776257]  [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs]
      [21120.777755]  [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs]
      [21120.779459]  [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs]
      [21120.781153]  [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs]
      [21120.783918]  [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [21120.785436]  [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f
      [21120.786434]  [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      (...)
      [21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds.
      [21120.890526]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.899960] Call Trace:
      [21120.900743]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.903004]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.904383]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.905608]  [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs]
      [21120.906812]  [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c
      [21120.907874]  [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
      [21120.909551]  [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [21120.910914]  [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs]
      [21120.912181]  [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs]
      [21120.913784]  [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs]
      [21120.915374]  [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [21120.916735]  [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [21120.917996]  [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [21120.919478]  [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs]
      [21120.921226]  [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs]
      [21120.923121]  [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs]
      [21120.924449]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.926602]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [21120.927769]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.929324]  [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
      [21120.930723]  [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs]
      [21120.931897]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
      [21120.934446]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
      [21120.935528]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
      (...)
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      24b89d08
    • F
      Btrfs: fix race between start dirty bg cache writeout and bg deletion · b58d1a9e
      Filipe Manana 提交于
      While running xfstests I ran into the following:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      [20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
      [20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [20892.264738]  0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
      [20892.266244]  ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
      [20892.267761]  ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [20892.281781]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [20892.282873]  [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
      [20892.284111]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [20892.285203]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
      [20892.286290]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [20892.287469]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [20892.288412]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [20892.289348]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [20892.290255]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      
      This happens because in btrfs_start_dirty_block_groups() we splice the
      transaction's list of dirty block groups into a local list and then we
      keep extracting the first element of the list without holding the
      cache_write_mutex mutex. This means that before we acquire that mutex
      the first block group on the list might be removed by a conurrent task
      running btrfs_remove_block_group(). So make sure we extract the first
      element (and test the list emptyness) while holding that mutex.
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b58d1a9e
  20. 13 4月, 2015 2 次提交
    • D
      btrfs: qgroup: do a reservation in a higher level. · e2d1f923
      Dongsheng Yang 提交于
      There are two problems in qgroup:
      
      a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
      qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
      is not available to user.
      
      b). When user is writing a inline data, qgroup will not reserve it,
      it means this is a window we can exceed the limit of a qgroup.
      
      The main idea of this patch is reserving the data size of write_bytes
      rather than the reserve_bytes. It means qgroup will not care about
      the data size btrfs will reserve for user, but only care about the
      data size user is going to write. Then reserve it when user want to
      write and release it in transaction committed.
      
      In this way, qgroup can be released from the complex procedure in
      btrfs and only do the reserve when user want to write and account
      when the data is written in commit_transaction().
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e2d1f923
    • D
      Btrfs: qgroup, Account data space in more proper timings. · 237c0e9f
      Dongsheng Yang 提交于
      Currenly, in data writing, ->reserved is accounted in
      fill_delalloc(), but ->may_use is released in clear_bit_hook()
      which is called by btrfs_finish_ordered_io(). That's too late,
      that said, between fill_delalloc() and btrfs_finish_ordered_io(),
      the data is doublely accounted by qgroup. It will cause some
      unexpected -EDQUOT.
      
      Example:
      	# btrfs quota enable /root/btrfs-auto-test/
      	# btrfs subvolume create /root/btrfs-auto-test//sub
      	Create subvolume '/root/btrfs-auto-test/sub'
      	# btrfs qgroup limit 1G /root/btrfs-auto-test//sub
      	dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=1500000
      	dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded
      	681353+0 records in
      	681352+0 records out
      	697704448 bytes (698 MB) copied, 8.15563 s, 85.5 MB/s
      It's (698 MB) when we got an -EDQUOT, but we limit it by 1G.
      
      This patch move the btrfs_qgroup_reserve/free() for data from
      btrfs_delalloc_reserve/release_metadata() to btrfs_check_data_free_space()
      and btrfs_free_reserved_data_space(). Then the accounter in qgroup
      will be updated at the same time with the accounter in space_info updated.
      In this way, the unexpected -EDQUOT will be killed.
      Reported-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      237c0e9f