1. 22 May, 2019 1 commit
  2. 03 Apr, 2019 1 commit
  3. 10 Jan, 2019 1 commit
    • J
      btrfs: run delayed items before dropping the snapshot · 6911b074
      Josef Bacik committed
      commit 0568e82dbe2510fc1fa664f58e5c997d3f1e649e upstream.
      
      With my delayed refs patches in place we started seeing a large amount
      of aborts in __btrfs_free_extent:
      
       BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964  owner 1 offset 0
       Call Trace:
        ? btrfs_merge_delayed_refs+0xaf/0x340
        __btrfs_run_delayed_refs+0x6ea/0xfc0
        ? btrfs_set_path_blocking+0x31/0x60
        btrfs_run_delayed_refs+0xeb/0x180
        btrfs_commit_transaction+0x179/0x7f0
        ? btrfs_check_space_for_delayed_refs+0x30/0x50
        ? should_end_transaction.isra.19+0xe/0x40
        btrfs_drop_snapshot+0x41c/0x7c0
        btrfs_clean_one_deleted_snapshot+0xb5/0xd0
        cleaner_kthread+0xf6/0x120
        kthread+0xf8/0x130
        ? btree_invalidatepage+0x90/0x90
        ? kthread_bind+0x10/0x10
        ret_from_fork+0x35/0x40
      
      This was because btrfs_drop_snapshot depends on the root not being
      modified while it's dropping the snapshot.  It will unlock the root node
      (and really every node) as it walks down the tree, only to re-lock it
      when it needs to do something.  This is a problem because if we modify
      the tree we could cow a block in our path, which frees our reference to
      that block.  Then once we get back to that shared block we'll free our
      reference to it again, and get ENOENT when trying to look up our extent
      reference to that block in __btrfs_free_extent.
      
      This is ultimately happening because we have delayed items left to be
      processed for our deleted snapshot _after_ all of the inodes are closed
      for the snapshot.  We only run the delayed inode item if we're deleting
      the inode, and even then we do not run the delayed insertions or delayed
      removals.  These can be run at any point after our final inode does its
      last iput, which is what triggers the snapshot deletion.  We can end up
      with the snapshot deletion happening and then have the delayed items run
      on that file system, resulting in the above problem.
      
      This problem has existed forever, however my patches made it much easier
      to hit as I wake up the cleaner much more often to deal with delayed
      iputs, which made us more likely to start the snapshot dropping work
      before the transaction commits, which is when the delayed items would
      generally be run.  Before, generally speaking, we would run the delayed
      items, commit the transaction, and wake up the cleaner thread to start
      deleting snapshots, which means we were less likely to hit this problem.
      You could still hit it if you had multiple snapshots to be deleted and
      ended up with lots of delayed items, but it was definitely harder.
      
      Fix for now by simply running all the delayed items before starting to
      drop the snapshot.  We could make this smarter in the future by making
      the delayed items per-root, and then simply drop any delayed items for
      roots that we are going to delete.  But for now just a quick and easy
      solution is the safest.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6911b074
  4. 14 Nov, 2018 13 commits
    • J
      btrfs: fix insert_reserved error handling · 186b5248
      Josef Bacik committed
      commit 80ee54bf upstream.
      
      We were not handling the reserved byte accounting properly for data
      references.  Metadata was mostly fine: if it errored out, the error paths
      would free the bytes_reserved count and pin the extent, although even it
      missed one of the error cases.  So instead move this handling up into
      run_one_delayed_ref so we are sure that both cases are properly cleaned
      up in case of a transaction abort.
      
      CC: stable@vger.kernel.org # 4.18+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      186b5248
    • J
      btrfs: don't use ctl->free_space for max_extent_size · a746cfd0
      Josef Bacik committed
      commit fb5c39d7 upstream.
      
      max_extent_size is supposed to be the largest contiguous range for the
      space info, and ctl->free_space is the total free space in the block
      group.  We need to keep track of these separately and _only_ use the
      max_free_space if we don't have a max_extent_size, as that means our
      original request was too large to search any of the block groups for and
      therefore wouldn't have a max_extent_size set.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a746cfd0
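
The difference between total free space and the largest contiguous free range can be illustrated with a small userspace model. This is only a sketch, not btrfs code: the free ranges and all function names are hypothetical, and the two helpers merely stand in for the roles that ctl->free_space and max_extent_size play in the message above.

```python
# Hypothetical model of one block group's free space as (start, length) ranges.
free_ranges = [(0, 4096), (16384, 8192), (65536, 4096)]


def total_free_space(ranges):
    """Sum of all free bytes; plays the role of ctl->free_space."""
    return sum(length for _, length in ranges)


def largest_contiguous(ranges):
    """Largest single free range; plays the role of max_extent_size."""
    return max((length for _, length in ranges), default=0)


def can_allocate(ranges, num_bytes):
    """An allocation needs one contiguous range that is large enough."""
    return largest_contiguous(ranges) >= num_bytes


total = total_free_space(free_ranges)
largest = largest_contiguous(free_ranges)
```

In this model a 12 KiB request fails even though 16 KiB is free in total, which is why reporting the total as the maximum extent size would mislead the caller.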
    • J
      btrfs: reset max_extent_size properly · e982beca
      Josef Bacik committed
      commit 21a94f7a upstream.
      
      If we use up our block group before allocating a new one we'll easily
      get a max_extent_size that's set really really low, which will result in
      a lot of fragmentation.  We need to make sure we're resetting the
      max_extent_size when we add a new chunk or add new space.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e982beca
    • F
      Btrfs: fix deadlock when writing out free space caches · ea9c846f
      Filipe Manana committed
      commit 5ce55557 upstream.
      
      When writing out a block group free space cache we can end up deadlocking
      with ourselves on an extent buffer lock resulting in a warning like the
      following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G        W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallow the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block group creation
      modifies. Once the new extent buffer is allocated, it allows creation of
      pending block groups to happen again.
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
      Reported-by: E V <eliventer@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea9c846f
    • J
      btrfs: make sure we create all new block groups · 1d6d4a03
      Josef Bacik committed
      commit 545e3366db823dc3342ca9d7fea803f829c9062f upstream.
      
      Allocating new chunks modifies both the extent and chunk tree, which can
      trigger new chunk allocations.  So instead of doing list_for_each_safe,
      just do while (!list_empty()) so we make sure we don't exit with other
      pending bg's still on our list.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d6d4a03
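
The reason for draining the list until it is empty can be shown with a userspace analogy. The sketch below is not kernel code and does not reproduce the exact semantics of list_for_each_safe; it only contrasts a single pass over the entries known at the start with a loop that keeps going until nothing is pending. The block group names and the rule that creating "bg-1" queues "bg-3" are made up for illustration.

```python
from collections import deque


def process_snapshot(pending, create):
    """Single pass over the entries present at the start; later ones are missed."""
    for bg in list(pending):
        pending.remove(bg)
        create(bg, pending)


def process_until_empty(pending, create):
    """Keep draining until the list is empty, so late additions are handled too."""
    while pending:
        bg = pending.popleft()
        create(bg, pending)


def make_creator(created):
    def create(bg, pending):
        created.append(bg)
        # Hypothetical trigger: creating "bg-1" forces one more allocation.
        if bg == "bg-1":
            pending.append("bg-3")
    return create


created_a, pending_a = [], deque(["bg-1", "bg-2"])
process_snapshot(pending_a, make_creator(created_a))

created_b, pending_b = [], deque(["bg-1", "bg-2"])
process_until_empty(pending_b, make_creator(created_b))
```

After the single pass, "bg-3" is still sitting on the list; the drain loop leaves nothing behind.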
    • J
      btrfs: wait on caching when putting the bg cache · 2974abff
      Josef Bacik committed
      commit 3aa7c7a31c26321696b92841d5103461c6f3f517 upstream.
      
      While testing my backport I noticed there was a panic if I ran
      generic/416 generic/417 generic/418 all in a row.  This just happened to
      uncover a race where we had outstanding IO after we destroy all of our
      workqueues, and then we'd go to queue the endio work on those free'd
      workqueues.
      
      This is because we aren't waiting for the caching threads to be done
      before freeing everything up, so to fix this make sure we wait on any
      outstanding caching that's being done before we free up the block group,
      so we're sure to be done with all IO by the time we get to
      btrfs_stop_all_workers().  This fixes the panic I was seeing
      consistently in testing.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:6112!
      SMP PTI
      Modules linked in:
      CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: btrfs-cache btrfs_cache_helper
      RIP: 0010:btrfs_map_bio+0x346/0x370
      RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
      RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
      RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
      R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
      FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btree_submit_bio_hook+0x8a/0xd0
       submit_one_bio+0x5d/0x80
       read_extent_buffer_pages+0x18a/0x320
       btree_read_extent_buffer_pages+0xbc/0x200
       ? alloc_extent_buffer+0x359/0x3e0
       read_tree_block+0x3d/0x60
       read_block_for_search.isra.30+0x1a5/0x360
       btrfs_search_slot+0x41b/0xa10
       btrfs_next_old_leaf+0x212/0x470
       caching_thread+0x323/0x490
       normal_work_helper+0xc5/0x310
       process_one_work+0x141/0x340
       worker_thread+0x44/0x3c0
       kthread+0xf8/0x130
       ? process_one_work+0x340/0x340
       ? kthread_bind+0x10/0x10
       ret_from_fork+0x35/0x40
      RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
      ---[ end trace 827eb13e50846033 ]---
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2974abff
    • J
      btrfs: keep trim from interfering with transaction commits · 0a7f6c7e
      Jeff Mahoney committed
      commit fee7acc361314df6561208c2d3c0882d663dd537 upstream.
      
      Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      fixed free space trimming, but introduced latency when it was running.
      This is due to it pinning the transaction using both an incremented
      refcount and holding the commit root sem for the duration of a single
      trim operation.

      This was to ensure safety but it's unnecessary.  We already hold the
      chunk mutex so we know that the chunk we're using can't be allocated
      while we're trimming it.

      In order to check against chunks allocated already in this transaction,
      we need to check the pending chunks list.  To do that safely without
      joining the transaction (or attaching and then having to commit it) we
      need to ensure that the dev root's commit root doesn't change underneath
      us and the pending chunks list stays around until we're done with it.
      
      We can ensure the former by holding the commit root sem and the latter
      by pinning the transaction.  We do this now, but the critical section
      covers the trim operation itself and we don't need to do that.
      
      This patch moves the pinning and unpinning logic into helpers and unpins
      the transaction after performing the search and check for pending
      chunks.
      
      Limiting the critical section of the transaction pinning improves the
      latency substantially on slower storage (e.g. image files over NFS).
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a7f6c7e
    • J
      btrfs: don't attempt to trim devices that don't support it · 4d0dfd8f
      Jeff Mahoney committed
      commit 0be88e367fd8fbdb45257615d691f4675dda062f upstream.
      
      We check whether any device the file system is using supports discard in
      the ioctl call, but then we attempt to trim free extents on every device
      regardless of whether discard is supported.  Due to the way we mask off
      EOPNOTSUPP, we can end up issuing the trim operations on each free range
      on devices that don't support it, just wasting time.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d0dfd8f
    • J
      btrfs: iterate all devices during trim, instead of fs_devices::alloc_list · 76e59a62
      Jeff Mahoney committed
      commit d4e329de5e5e21594df2e0dd59da9acee71f133b upstream.
      
      btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
      device_list_mutex.  The problem is that ->alloc_list is protected by the
      chunk mutex.  We don't want to hold the chunk mutex over the trim of the
      entire file system.  Fortunately, the ->dev_list list is protected by
      the dev_list mutex and while it will give us all devices, including
      read-only devices, we already just skip the read-only devices.  Then we
      can continue to take and release the chunk mutex while scanning each
      device.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      76e59a62
    • Q
      btrfs: Ensure btrfs_trim_fs can trim the whole filesystem · d147f4dc
      Qu Wenruo committed
      commit 6ba9fc8e upstream.
      
      [BUG]
      fstrim on some btrfs only trims the unallocated space, not trimming any
      space in existing block groups.
      
      [CAUSE]
      Before fstrim_range is passed to btrfs_trim_fs(), it gets truncated to
      the range [0, super->total_bytes).  So later btrfs_trim_fs() will only
      be able to trim block groups in the range [0, super->total_bytes).

      For btrfs, however, any bytenr aligned to sectorsize is valid.  Since
      btrfs uses its own logical address space, nothing limits where block
      groups can be placed.

      For a filesystem that is balanced frequently, it's quite easy to
      relocate all block groups so that their bytenrs start beyond
      super->total_bytes.

      In that case, btrfs will not trim existing block groups.
      
      [FIX]
      Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
      can get the unmodified range, which is normally set to [0, U64_MAX].
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
      CC: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d147f4dc
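
The effect of the truncation described under [CAUSE] can be modelled in a few lines. This is a sketch under stated assumptions rather than the btrfs implementation: the 10 GiB filesystem size, the block group layout, and the helper name are all hypothetical, chosen only to show block groups living at logical addresses beyond total_bytes.

```python
U64_MAX = 2**64 - 1
GiB = 1 << 30


def block_groups_in_range(block_groups, start, length):
    """Return block groups (start, len) overlapping [start, start + length)."""
    end = min(start + length, U64_MAX)
    return [bg for bg in block_groups if bg[0] < end and bg[0] + bg[1] > start]


# Hypothetical layout after repeated balance: every block group sits at a
# logical address beyond the filesystem's total_bytes.
total_bytes = 10 * GiB
block_groups = [(20 * GiB, GiB), (21 * GiB, GiB)]

truncated = block_groups_in_range(block_groups, 0, total_bytes)
unmodified = block_groups_in_range(block_groups, 0, U64_MAX)
```

With the truncated range no block group is selected for trimming, while the unmodified range selects both.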
    • Q
      btrfs: Enhance btrfs_trim_fs function to handle error better · c9ee7109
      Qu Wenruo committed
      commit 93bba24d upstream.
      
      Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
      an error happens when trimming existing block groups, it will skip the
      remaining block groups and continue to trim unallocated space for each
      device.

      The return value will only reflect the final error from device trimming.

      This patch will fix such behavior by:

      1) Recording the last error from block group or device trimming
         The return value will also reflect the last error during trimming,
         making developers more aware of the problem.
      
      2) Continuing trimming if possible
         If we failed to trim one block group or device, we could still try
         the next block group or device.
      
      3) Report number of failures during block group and device trimming
         It would be less noisy, but still gives user a brief summary of
         what's going wrong.
      
      Such behavior can avoid confusion for cases like failure to trim the
      first block group and then only unallocated space is trimmed.
      Reported-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add bg_ret and dev_ret to the messages ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9ee7109
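
The three points of that message (record the last error, keep going, count the failures) form a common error-handling pattern that can be sketched outside the kernel. The code below is an illustration only: trim_all, fake_trim, and the item names are hypothetical, and the EIO failure on "bg1" is invented to exercise the loop.

```python
import errno


def trim_all(items, trim_one):
    """Trim every item, remembering the last error and counting failures."""
    failed = 0
    last_error = 0
    trimmed = 0
    for item in items:
        ret, nbytes = trim_one(item)
        if ret:
            failed += 1
            last_error = ret
            continue  # keep going with the next item instead of bailing out
        trimmed += nbytes
    return last_error, failed, trimmed


def fake_trim(item):
    # Hypothetical device behaviour: "bg1" fails with EIO, the others succeed.
    if item == "bg1":
        return -errno.EIO, 0
    return 0, 4096


ret, failed, trimmed = trim_all(["bg0", "bg1", "bg2"], fake_trim)
```

The caller still sees a non-zero return value, but "bg2" is trimmed despite the earlier failure, and the failure count gives a short summary of how much went wrong.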
    • Q
      btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock · d4f56c44
      Qu Wenruo committed
      commit b72c3aba upstream.
      
      [BUG]
      For a certain crafted image whose csum root leaf has a missing backref,
      triggering a write with data csum could cause a deadlock with the
      following kernel WARN_ON():
      
        WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
        CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
        Workqueue: btrfs-endio-write btrfs_endio_write_helper
        RIP: 0010:btrfs_tree_lock+0x3e2/0x400
        Call Trace:
         btrfs_alloc_tree_block+0x39f/0x770
         __btrfs_cow_block+0x285/0x9e0
         btrfs_cow_block+0x191/0x2e0
         btrfs_search_slot+0x492/0x1160
         btrfs_lookup_csum+0xec/0x280
         btrfs_csum_file_blocks+0x2be/0xa60
         add_pending_csums+0xaf/0xf0
         btrfs_finish_ordered_io+0x74b/0xc90
         finish_ordered_fn+0x15/0x20
         normal_work_helper+0xf6/0x500
         btrfs_endio_write_helper+0x12/0x20
         process_one_work+0x302/0x770
         worker_thread+0x81/0x6d0
         kthread+0x180/0x1d0
         ret_from_fork+0x35/0x40
      
      [CAUSE]
      That crafted image has a missing backref for the csum tree root leaf.
      When we try to allocate a new tree block, since there is no
      EXTENT/METADATA_ITEM for the csum tree root, btrfs considers it a free
      slot and uses it.
      
      The extent tree of the image looks like:
      
        Normal image                      |       This fuzzed image
        ----------------------------------+--------------------------------
        BG 29360128                       | BG 29360128
         One empty slot                   |  One empty slot
        29364224: backref to UUID tree    | 29364224: backref to UUID tree
         Two empty slots                  |  Two empty slots
        29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
        29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
        ...                               | ...
      
      Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs tries to
      allocate a tree block, it's a valid slot for btrfs.

      And for finish_ordered_write, when we need to insert a csum, we try to
      CoW the csum tree root.

      By accident, the empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K, and
      BG_OFFSET + 12K are already used by tree block COW for other trees, so
      the next empty slot is BG_OFFSET + 16K, which should be the backref for
      the CSUM tree.

      But due to the bad type, btrfs can't recognize it and still considers it
      an empty slot, and will try to use it for csum tree CoW.
      
      Then in the following call trace, we will try to lock the new tree
      block, which turns out to be the old csum tree root which is already
      locked:
      
      btrfs_search_slot() called on csum tree root, which is at 29376512
      |- btrfs_cow_block()
         |- btrfs_set_lock_block()
         |  |- Now locks tree block 29376512 (old csum tree root)
         |- __btrfs_cow_block()
            |- btrfs_alloc_tree_block()
               |- btrfs_reserve_extent()
                  | Now it returns tree block 29376512, which the extent tree
                  | shows as an empty slot, but it's already held by the csum tree
                  |- btrfs_init_new_buffer()
                     |- btrfs_tree_lock()
                        | Triggers WARN_ON(eb->lock_owner == current->pid)
                        |- wait_event()
                           Waits for the lock owner to release the lock, but it's
                           locked by ourselves, so it will deadlock
      
      [FIX]
      This patch will do the lock_owner and current->pid check at
      btrfs_init_new_buffer().
      So the above deadlock can be avoided.

      Since such a problem can only happen with a crafted image, we will still
      trigger a kernel warning for the later aborted transaction, but with a
      slightly more meaningful warning message.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
      Reported-by: Xu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4f56c44
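
The idea of checking the lock owner before taking a non-recursive lock can be shown with a toy model. This is a userspace sketch, not the kernel patch: TreeBlockLock, init_new_buffer, and the ERR_ALREADY_LOCKED value are hypothetical names, and a thread id stands in for current->pid.

```python
import threading

ERR_ALREADY_LOCKED = -1  # hypothetical error code, used only in this sketch


class TreeBlockLock:
    """Toy non-recursive lock that remembers its owner (cf. eb->lock_owner)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.lock_owner = None

    def lock(self):
        self._lock.acquire()
        self.lock_owner = threading.get_ident()

    def unlock(self):
        self.lock_owner = None
        self._lock.release()


def init_new_buffer(eb_lock):
    """Refuse to lock a block this thread already holds instead of hanging."""
    if eb_lock.lock_owner == threading.get_ident():
        return ERR_ALREADY_LOCKED
    eb_lock.lock()
    return 0


old_root = TreeBlockLock()
old_root.lock()                          # the CoW path locks the old root first
result_same = init_new_buffer(old_root)  # allocator hands back the same block

fresh = TreeBlockLock()
result_fresh = init_new_buffer(fresh)    # a genuinely new block locks normally
```

Without the owner check, the second acquisition of old_root would block forever on a lock held by the same thread, which mirrors the deadlock in the call tree above.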
    • Q
      btrfs: Handle owner mismatch gracefully when walking up tree · 4f2a4e02
      Qu Wenruo committed
      commit 65c6e82b upstream.
      
      [BUG]
      When mounting a certain crafted image, btrfs will trigger a kernel BUG_ON()
      when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, reloc tree root's
      owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
      corrupted and we failed the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we need to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      a possible NULL dereference, as root->node can be NULL in this case.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
      Reported-by: Xu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f2a4e02
  5. 23 Aug, 2018 1 commit
  6. 06 Aug, 2018 23 commits
    • Q
      btrfs: Verify that every chunk has corresponding block group at mount time · 7ef49515
      Qu Wenruo committed
      If a crafted image has missing block group items, it could cause
      unexpected behavior and break the assumption of a 1:1 chunk<->block
      group mapping.

      Although we have the block group -> chunk mapping check, we still need
      a chunk -> block group mapping check.

      This patch adds an extra check to ensure each chunk has its
      corresponding block group.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
      Reported-by: Xu Wen <wen.xu@gatech.edu>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7ef49515
    • Q
      btrfs: Check that each block group has corresponding chunk at mount time · 514c7dca
      Qu Wenruo committed
      A crafted btrfs image with incorrect chunk<->block group mapping will
      trigger a lot of unexpected things as the mapping is essential.
      
      Although the problem can be caught by block group item checker
      added in "btrfs: tree-checker: Verify block_group_item", it's still not
      sufficient.  A sufficiently valid block group item can pass the check
      added by the mentioned patch but could fail to match the existing chunk.
      
      This patch will add extra block group -> chunk mapping check, to ensure
      we have a completely matching (start, len, flags) chunk for each block
      group at mount time.
      
      Here we reuse the original helper find_first_block_group(), which is
      already doing the basic bg -> chunk checks, adding further checks of the
      start/len and type flags.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837
      Reported-by: Xu Wen <wen.xu@gatech.edu>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      514c7dca
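
A two-way consistency check between chunks and block groups, as described in the two mount-time verification commits above, can be sketched as follows. This is a simplified model and not the btrfs implementation: the dictionary layout, the check_mapping name, and the sample data are assumptions made for illustration, with (start, len, flags) taken from the matching criteria named in the message.

```python
def check_mapping(chunks, block_groups):
    """Return a list of problems; an empty list means a 1:1 mapping."""
    problems = []
    chunk_map = {c["start"]: c for c in chunks}
    bg_map = {b["start"]: b for b in block_groups}

    # Block group -> chunk: each block group needs a fully matching chunk.
    for start, bg in sorted(bg_map.items()):
        chunk = chunk_map.get(start)
        if chunk is None:
            problems.append(f"block group {start} has no chunk")
        elif (chunk["len"], chunk["flags"]) != (bg["len"], bg["flags"]):
            problems.append(f"block group {start} does not match its chunk")

    # Chunk -> block group: no chunk may be left without a block group.
    for start in sorted(chunk_map):
        if start not in bg_map:
            problems.append(f"chunk {start} has no block group")
    return problems


# Hypothetical crafted layout: one mismatched length, one missing block group.
chunks = [
    {"start": 0, "len": 8, "flags": "DATA"},
    {"start": 8, "len": 8, "flags": "METADATA"},
    {"start": 16, "len": 8, "flags": "DATA"},
]
block_groups = [
    {"start": 0, "len": 8, "flags": "DATA"},
    {"start": 8, "len": 4, "flags": "METADATA"},
]

problems = check_mapping(chunks, block_groups)
```

Checking only one direction would miss one of the two problems in this sample, which is why both checks are needed.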
    • M
      btrfs: extent-tree: Remove unused __btrfs_free_block_rsv · 85c39548
      Misono Tomohiro committed
      There is no user of this function anymore.

      Its removal was overlooked in commit a575ceeb
      ("Btrfs: get rid of unused orphan infrastructure").
      Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      85c39548
    • L
      btrfs: Remove fs_info from btrfs_del_root · ab9ce7d4
      Lu Fengqi committed
      It can be referenced from the passed transaction handle.
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ab9ce7d4
    • D
      btrfs: extent-tree: remove unused member walk_control::for_reloc · b5851021
      David Sterba committed
      A leftover from fix e339a6b0 ("Btrfs: __btrfs_mod_ref should always
      use no_quota"), which removed it from the function calls but not from
      the structure.
      Signed-off-by: David Sterba <dsterba@suse.com>
      b5851021
    • J
      btrfs: don't leak ret from do_chunk_alloc · 4559b0a7
      Josef Bacik committed
      If we're trying to make a data reservation and have to allocate a
      data chunk, we could leak ret == 1, as do_chunk_alloc() returns 1 if
      it allocated a chunk.  Since the end of the function is the success
      path, just return 0.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4559b0a7
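      The return-value leak described in the commit above can be shown with
      a small userspace sketch. Both functions are hypothetical stand-ins
      (they are not the kernel's do_chunk_alloc() or its caller), and the
      error handling is deliberately simplified.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical allocator: returns 1 if it allocated a chunk, 0 if no
 * allocation was needed, and a negative errno on failure.
 */
static int sketch_chunk_alloc(bool need_chunk, bool fail)
{
	if (fail)
		return -ENOSPC;
	return need_chunk ? 1 : 0;
}

/* Hypothetical reservation path whose callers expect 0 on success. */
static int sketch_reserve_data(bool need_chunk, bool fail)
{
	int ret = sketch_chunk_alloc(need_chunk, fail);

	if (ret < 0)
		return ret;

	/*
	 * Success path: return 0 explicitly instead of "return ret", so the
	 * allocator's positive 1 never reaches the caller.
	 */
	return 0;
}
```

      With "return ret" at the end instead, a caller that treats any
      non-zero value as failure would misread a successful allocation.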
    • N
      btrfs: Remove fs_info from btrfs_finish_chunk_alloc · 97aff912
      Nikolay Borisov committed
      It can be referenced from the passed transaction handle.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      97aff912
    • Q
      btrfs: extent-tree: Remove dead alignment check · 5e23a6fe
      Qu Wenruo committed
      In find_free_extent(), under the checks: label, we have the following code:
      
      		search_start = ALIGN(offset, fs_info->stripesize);
      		/* move on to the next group */
      		if (search_start + num_bytes >
      		    block_group->key.objectid + block_group->key.offset) {
      			btrfs_add_free_space(block_group, offset, num_bytes);
      			goto loop;
      		}
      		if (offset < search_start)
      			btrfs_add_free_space(block_group, offset,
      					     search_start - offset);
      		BUG_ON(offset > search_start);
      
      However, ALIGN() rounds up, so @search_start >= @offset and that
      BUG_ON() can never be triggered.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5e23a6fe
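      The round-up argument in the commit above can be checked with a short
      userspace sketch. SKETCH_ALIGN is a simplified stand-in for the
      kernel's ALIGN() macro, assuming a power-of-two alignment and ignoring
      arithmetic overflow near the top of the 64-bit range.

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define SKETCH_ALIGN(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

/*
 * Returns non-zero if the removed BUG_ON() condition would hold, i.e. if
 * the aligned search start ended up below the original offset.
 */
static int sketch_offset_exceeds_aligned(uint64_t offset, uint64_t stripesize)
{
	uint64_t search_start = SKETCH_ALIGN(offset, stripesize);

	return offset > search_start;
}
```

      Because rounding up never yields a smaller value, the function
      returns 0 for every offset that does not overflow.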
    • D
      btrfs: refactor block group replication factor calculation to a helper · 46df06b8
      David Sterba committed
      Many places open-code the replication factor of the block group
      profiles; create a common helper for it. The helper can easily be
      extended to cover more copies.
      Signed-off-by: David Sterba <dsterba@suse.com>
      46df06b8
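      A helper of the kind described in the commit above might look like the
      following userspace sketch. The flag names and bit values are
      assumptions made for illustration, as is the choice of which profiles
      count as two copies; consult the kernel headers for the real
      definitions.

```c
#include <stdint.h>

/* Illustrative profile bits; names and values are assumptions of this sketch. */
#define SKETCH_BG_RAID0  (1ULL << 3)
#define SKETCH_BG_RAID1  (1ULL << 4)
#define SKETCH_BG_DUP    (1ULL << 5)
#define SKETCH_BG_RAID10 (1ULL << 6)

/* Number of copies kept for each byte of data in a block group. */
static int sketch_bg_type_to_factor(uint64_t flags)
{
	if (flags & (SKETCH_BG_DUP | SKETCH_BG_RAID1 | SKETCH_BG_RAID10))
		return 2;
	return 1;
}

/* Raw on-disk bytes consumed by @bytes of logical data. */
static uint64_t sketch_raw_bytes(uint64_t bytes, uint64_t flags)
{
	return bytes * (uint64_t)sketch_bg_type_to_factor(flags);
}
```

      Callers then multiply or divide by the factor instead of repeating
      the flag test, and a profile with more copies only needs a new branch
      in one place.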
    • L
      btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtree · deb40627
      Lu Fengqi committed
      The fs_info can be fetched from the transaction handle directly.
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      deb40627
    • L
      btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_items · 8d38d7eb
      Lu Fengqi committed
      It can be fetched from the transaction handle.
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d38d7eb
    • Q
      btrfs: Use btrfs_mark_bg_unused to replace open code · 031f24da
      Qu Wenruo committed
      Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
      add a block group to the unused_bgs list.
      
      No functional change; only 3 callers are involved.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      031f24da
    • N
      btrfs: Rewrite retry logic in do_chunk_alloc · 2556fbb0
      Nikolay Borisov committed
      do_chunk_alloc implements logic to detect whether a chunk allocation
      is currently pending (by means of space_info->chunk_alloc being
      set) and, if so, it loops around to the 'again' label. Additionally,
      based on the state of the space_info (e.g. whether it's full or not)
      and the return value of should_alloc_chunk(), it decides whether this
      is a "hard" error (ENOSPC) or we can just return 0.
      
      This patch refactors all of this:
      
      1. Organize the scattered ifs handling the various cases into
      easy-to-read if {} else if {} branches. This makes clear the various
      cases we are interested in handling.
      
      2. Call should_alloc_chunk only once and use the result in the
      if/else if constructs. All of this is done under space_info->lock, so
      even before this patch, multiple calls to should_alloc_chunk were
      unnecessary.
      
      3. Rewrite the "do {} while()" loop, currently implemented via a
      label, into an explicit loop construct.
      
      4. Move the mutex locking to the case where the caller is the one doing
      the allocation. For the case where the caller needs to wait for a
      concurrent allocation, introduce a mutex_lock/mutex_unlock pair to act
      as a barrier and reword the comment.
      
      5. Switch local vars to bool type where pertinent.
      
      All in all this shouldn't introduce any functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2556fbb0
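      The control flow that the commit above describes (one decision chain
      plus an explicit loop) can be sketched in userspace C. Everything
      here is hypothetical: the types, the sketch_ names, and the exact
      mapping of states to return values are assumptions for illustration,
      and the real locking is replaced by comments.

```c
#include <errno.h>
#include <stdbool.h>

enum sketch_action {
	SKETCH_RETURN,     /* hand *ret back to the caller */
	SKETCH_WAIT_RETRY, /* someone else is allocating: wait, then loop */
	SKETCH_ALLOCATE,   /* this caller performs the allocation */
};

struct sketch_space_info {
	bool full;        /* no more space to allocate chunks from */
	bool chunk_alloc; /* an allocation is already in flight */
};

/* Single if / else if chain; should_alloc is evaluated once by the caller. */
static enum sketch_action
sketch_decide(struct sketch_space_info *si, bool should_alloc, int *ret)
{
	if (si->full) {
		*ret = should_alloc ? -ENOSPC : 0;
		return SKETCH_RETURN;
	} else if (!should_alloc) {
		*ret = 0;
		return SKETCH_RETURN;
	} else if (si->chunk_alloc) {
		return SKETCH_WAIT_RETRY;
	}
	si->chunk_alloc = true;
	return SKETCH_ALLOCATE;
}

/* Explicit do {} while loop instead of jumping back to a label. */
static int sketch_do_chunk_alloc(struct sketch_space_info *si, bool should_alloc)
{
	bool wait_for_alloc = false;
	int ret = 0;

	do {
		switch (sketch_decide(si, should_alloc, &ret)) {
		case SKETCH_RETURN:
			return ret;
		case SKETCH_WAIT_RETRY:
			/* Stand-in for the lock/unlock barrier: the other allocation ends. */
			si->chunk_alloc = false;
			wait_for_alloc = true;
			break;
		default:
			wait_for_alloc = false;
			break;
		}
	} while (wait_for_alloc);

	/* ... the actual allocation would happen here ... */
	si->chunk_alloc = false;
	return 1; /* a chunk was allocated */
}
```

      Each case the commit lists maps to one branch, which is what makes
      the chain easier to audit than scattered ifs around a goto.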
    • E
      btrfs: use customized batch size for total_bytes_pinned · dec59fa3
      Ethan Lien committed
      In commit b150a4f1 ("Btrfs: use a percpu to keep track of possibly
      pinned bytes") we use total_bytes_pinned to track how many bytes we are
      going to free in this transaction. When we are close to ENOSPC, we check
      it to learn whether committing the current transaction would let the
      allocation succeed. For every data/metadata extent we are going to free,
      we add to total_bytes_pinned in btrfs_free_extent() and
      btrfs_free_tree_block(), and subtract from it in unpin_extent_range()
      when we finish the transaction. So this is a variable we update
      frequently but read rarely, which is exactly the use case for
      percpu_counter. But that commit updates total_bytes_pinned with the
      default batch size of 32, making every update essentially a
      spin-lock-protected update. Since every spin lock/unlock operation
      involves syncing a globally used variable and some kind of barrier on an
      SMP system, this is more expensive than using total_bytes_pinned as a
      simple atomic64_t.
      
      So fix this by using a customized batch size. Since we only read
      total_bytes_pinned when we are close to ENOSPC and fail to allocate a
      new chunk, we can use a really large batch size and pay nearly no
      penalty in most cases.
      
      [Test]
      We tested the patch on a 4-core x86 machine:
      
      1. fallocate a 16GiB size test file
      2. take snapshot (so all following writes will be COW)
      3. run a 180 sec, 4 jobs, 4K random write fio on test file
      
      We also added a temporary lockdep class on percpu_counter's spin lock
      used by total_bytes_pinned to track it by lock_stat.
      
      [Results]
      lock_stat version 0.4, class total_bytes_pinned_percpu
      (unpatched and patched runs side by side):
      
      metric               unpatched      patched
      ---------------------------------------------
      con-bounces                 82            1
      contentions                 82            1
      waittime-min              0.21         0.62
      waittime-max              0.61         0.62
      waittime-total           29.46         0.62
      waittime-avg              0.36         0.62
      acq-bounces             298340        13601
      acquisitions            635973        31542
      holdtime-min              0.09         0.14
      holdtime-max             11.01         9.61
      holdtime-total       173476.25     11016.90
      holdtime-avg              0.27         0.35
      
      [Analysis]
      Since the spin lock only protects a single in-memory variable, the
      contentions (number of lock acquisitions that had to wait) in both the
      unpatched and patched versions are low. But when we look at
      acquisitions and acq-bounces, we see much lower counts in the patched
      version. The most important metric here is acq-bounces: it counts how
      many times the lock is transferred between different CPUs, so the patch
      really does reduce cacheline bouncing of the spin lock (and of the
      global counter of percpu_counter) on an SMP system.
      
      Fixes: b150a4f1 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
      Signed-off-by: Ethan Lien <ethanlien@synology.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dec59fa3
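      The effect of the batch size discussed in the commit above can be
      modelled with a single-threaded userspace sketch. This is not the
      kernel's percpu_counter: the struct, the sketch_ names and the 128 MiB
      "large" batch used in the usage note are assumptions, and the lock is
      only represented by a counter of how often it would be taken.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS 4

struct sketch_percpu_counter {
	int64_t count;                 /* shared total, lock-protected in real code */
	int64_t local[SKETCH_NR_CPUS]; /* per-CPU deltas, updated without the lock */
	uint64_t locked_updates;       /* how often the shared total was touched */
};

/* Add @amount on @cpu; fold into the shared total once the delta reaches @batch. */
static void sketch_counter_add(struct sketch_percpu_counter *c, int cpu,
			       int64_t amount, int64_t batch)
{
	int64_t pending = c->local[cpu] + amount;

	if (llabs(pending) >= batch) {
		/* The spin lock would be taken here. */
		c->count += pending;
		c->local[cpu] = 0;
		c->locked_updates++;
	} else {
		c->local[cpu] = pending;
	}
}

/* Exact value: shared total plus every CPU's pending delta. */
static int64_t sketch_counter_sum(const struct sketch_percpu_counter *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < SKETCH_NR_CPUS; i++)
		sum += c->local[i];
	return sum;
}
```

      Adding 4096-byte extents with a batch of 32 takes the locked path on
      every call, because each single addition already exceeds the batch.
      With a batch such as 128 MiB the same additions stay in the per-CPU
      deltas, and sketch_counter_sum() still returns the same total.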
    • D
      btrfs: simplify pointer chasing of local fs_info variables · 3ffbd68c
      David Sterba committed
      Functions that get a btrfs inode can simply reach the fs_info by
      dereferencing the root, which looks a bit more straightforward than
      the btrfs_sb(...) indirection.
      
      If the transaction handle is available and not NULL, it's used instead.
      Signed-off-by: David Sterba <dsterba@suse.com>
      3ffbd68c
    • D
      btrfs: annotate unlikely branches after V0 extent type removal · 6d8ff4e4
      David Sterba committed
      The v0 extent type checks are a good fit for the unlikely
      annotations, as we don't expect to ever see them, so let's give the
      compiler a hint.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d8ff4e4
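      A branch hint of the kind mentioned in the commit above can be
      sketched in userspace with the GCC/Clang builtin that the kernel's
      unlikely() is built on. The macro name, the key value 180 and the
      returned error code are assumptions of this sketch.

```c
#include <errno.h>

/* Userspace stand-in for the kernel's unlikely(); needs GCC or Clang. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Assumed key value for the obsolete v0 extent reference type. */
#define SKETCH_EXTENT_REF_V0_KEY 180

/* Reject the obsolete type; the hint marks this branch as the cold path. */
static int sketch_check_ref_type(int key_type)
{
	if (sketch_unlikely(key_type == SKETCH_EXTENT_REF_V0_KEY))
		return -EINVAL;
	return 0;
}
```

      The annotation does not change what the function returns; it only
      tells the compiler which branch to lay out as the expected one.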
    • N
      btrfs: Add graceful handling of V0 extents · ba3c2b19
      Nikolay Borisov committed
      Following the removal of the v0 handling code, let's be courteous and
      print an error message when such extents are encountered. Where we
      have a transaction, just abort it; otherwise just call
      btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.
      
      Where the error handling would be too intrusive, leave the BUG_ON in
      place, as in extent_data_ref_count; other proper handling would catch
      that earlier.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ba3c2b19
    • N
      btrfs: Remove V0 extent support · a79865c6
      Nikolay Borisov committed
      The v0 compat code was introduced in commit 5d4f98a2
      ("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)") 9
      years ago, which was merged in 2.6.31. This means that the code is
      there to support filesystems which are _VERY_ old and if you are using
      btrfs on such an old kernel, you have much bigger problems. This,
      coupled with the fact that no one is likely testing/maintaining this
      code, likely means it has bugs lurking. All things considered, I think
      that 43 kernel releases later it's high time this remnant of the past
      got removed.
      
      This patch removes all code wrapped in #ifdefs but leaves intact the
      BUG_ONs that fire when a v0 extent is found without support, as a sort
      of safety net.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a79865c6
    • S
      btrfs: return EUCLEAN if extent_inline_ref type is invalid · af431dcb
      Su Yue committed
      If the type of the extent_inline_ref found is not an expected one, the
      filesystem may have been corrupted, so return EUCLEAN instead of EINVAL.
      Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af431dcb
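      The error-code choice in the commit above can be illustrated with a
      small userspace validator. The enum and function are hypothetical;
      only the idea (an unexpected on-disk type signals corruption, so
      report EUCLEAN rather than EINVAL) comes from the commit message.

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; fallback for platforms that lack it */
#endif

/* Hypothetical set of inline reference types this sketch accepts. */
enum sketch_ref_type {
	SKETCH_REF_TREE_BLOCK = 1,
	SKETCH_REF_SHARED_BLOCK,
	SKETCH_REF_EXTENT_DATA,
	SKETCH_REF_SHARED_DATA,
};

/* Return 0 for a known type, -EUCLEAN for anything read from disk that is not. */
static int sketch_check_inline_ref_type(int type)
{
	switch (type) {
	case SKETCH_REF_TREE_BLOCK:
	case SKETCH_REF_SHARED_BLOCK:
	case SKETCH_REF_EXTENT_DATA:
	case SKETCH_REF_SHARED_DATA:
		return 0;
	default:
		/* The value came from disk, not from a caller: treat it as corruption. */
		return -EUCLEAN;
	}
}
```

      EINVAL would suggest that the caller passed a bad argument, whereas
      EUCLEAN points at the on-disk structure itself.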
    • Q
      btrfs: Don't remove block group that still has pinned down bytes · 43794446
      Qu Wenruo committed
      [BUG]
      Under certain KVM load and LTP tests, it is possible to hit the
      following calltrace if quota is enabled:
      
      BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
      BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
      
      WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
      CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
      task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
      RIP: 0010:blk_status_to_errno+0x1a/0x30
      Call Trace:
       submit_extent_page+0x191/0x270 [btrfs]
       ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
       __do_readpage+0x2d2/0x810 [btrfs]
       ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
       ? run_one_async_done+0xc0/0xc0 [btrfs]
       __extent_read_full_page+0xe7/0x100 [btrfs]
       ? run_one_async_done+0xc0/0xc0 [btrfs]
       read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
       ? run_one_async_done+0xc0/0xc0 [btrfs]
       btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
       read_tree_block+0x31/0x60 [btrfs]
       read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
       btrfs_search_slot+0x46b/0xa00 [btrfs]
       ? kmem_cache_alloc+0x1a8/0x510
       ? btrfs_get_token_32+0x5b/0x120 [btrfs]
       find_parent_nodes+0x11d/0xeb0 [btrfs]
       ? leaf_space_used+0xb8/0xd0 [btrfs]
       ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
       ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
       btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
       btrfs_find_all_roots+0x45/0x60 [btrfs]
       btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
       btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
       btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
       insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
       btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
       ? pick_next_task_fair+0x2cd/0x530
       ? __switch_to+0x92/0x4b0
       btrfs_worker_helper+0x81/0x300 [btrfs]
       process_one_work+0x1da/0x3f0
       worker_thread+0x2b/0x3f0
       ? process_one_work+0x3f0/0x3f0
       kthread+0x11a/0x130
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x35/0x40
      
      BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
      BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
      BTRFS info (device vda2): forced readonly
      BTRFS error (device vda2): pending csums is 2887680
      
      [CAUSE]
      It's caused by a race with block group auto removal:
      
      - There is a meta block group X, which has only one tree block.
        The tree block belongs to fs tree 257.
      - In the current transaction, some operation modifies fs tree 257.
        The tree block gets COWed, so block group X is empty, marked
        as unused, and queued for deletion.
      - Some workload (like fsync) wakes up cleaner_kthread(),
        which calls btrfs_delete_unused_bgs() to remove unused block
        groups.
        So block group X, along with its chunk map, gets removed.
      - Some delalloc work finishes for fs tree 257.
        Quota needs to get the original reference of the extent, which
        reads tree blocks of the commit root of 257.
        Since the chunk map has been removed, the above warning is
        triggered.
      
      [FIX]
      Just let btrfs_delete_unused_bgs() skip any block group that still has
      pinned bytes.
      
      However, there is a minor side effect: currently we only queue empty
      block groups at update_block_group(), and an empty block group with
      pinned bytes won't go through update_block_group() again, so such a
      block group won't be removed until it gets a new extent allocated and
      removed.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      43794446
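      The fix described in the commit above amounts to one more condition in
      the "may this block group be deleted" test. The sketch below is a
      userspace illustration with a hypothetical struct and predicate; the
      other conditions are assumptions added to give the check some
      context.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified accounting for one block group. */
struct sketch_bg {
	uint64_t used;     /* bytes holding live extents */
	uint64_t reserved; /* bytes reserved for pending allocations */
	uint64_t pinned;   /* bytes freed in the running transaction, not yet unpinned */
	bool ro;           /* block group is read-only */
};

/* Decide whether an "unused" block group may really be removed now. */
static bool sketch_can_delete_unused_bg(const struct sketch_bg *bg)
{
	if (bg->used || bg->reserved || bg->ro)
		return false;
	/*
	 * The fix: pinned bytes mean the commit root may still reference
	 * blocks in this group, so its chunk map must stay for now.
	 */
	if (bg->pinned)
		return false;
	return true;
}
```

      In the race from the commit message, block group X has used == 0 but
      pinned != 0 after the COW, so this predicate keeps it and its chunk
      map alive.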
    • N
      btrfs: Deduplicate extent_buffer init code · bc877d28
      Nikolay Borisov committed
      When a new extent buffer is allocated there are a few mandatory fields
      which need to be set in order for the buffer to be sane: level,
      generation, bytenr, backref_rev, owner and FSID/UUID. Currently this
      is open coded in the callers of btrfs_alloc_tree_block, meaning it's
      fairly high in the abstraction hierarchy of operations. This patch
      solves this by simply moving this init code into btrfs_init_new_buffer,
      since this is the function which initializes a newly allocated
      extent buffer. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc877d28
    • N
      btrfs: Remove fs_info from btrfs_force_chunk_alloc · 43a7e99d
      Nikolay Borisov committed
      It can be referenced from the passed transaction handle.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      43a7e99d
    • N
      btrfs: Remove fs_info from btrfs_inc_block_group_ro · c83488af
      Nikolay Borisov committed
      It can be referenced from the passed bg cache.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c83488af