1. 17 Dec 2018, 15 commits
    • btrfs: add new flushing states for the delayed refs rsv · 413df725
      Josef Bacik authored
      A nice thing we gain with the delayed refs rsv is the ability to flush
      the delayed refs on demand to deal with enospc pressure.  Add states to
      flush delayed refs on demand, and this will allow us to remove a lot of
      ad-hoc work around checking to see if we should commit the transaction
      to run our delayed refs.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      413df725
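      For illustration, a minimal sketch of how such on-demand states could
      slot into the existing flush-state enum, mirroring the NR/full pattern
      of the delayed-items states. The names and ordering here are
      assumptions, not taken verbatim from the patch:

        enum btrfs_flush_state {
                FLUSH_DELAYED_ITEMS_NR  = 1,    /* flush a bounded number of delayed items */
                FLUSH_DELAYED_ITEMS     = 2,    /* flush all delayed items */
                FLUSH_DELAYED_REFS_NR   = 3,    /* run a bounded number of delayed refs */
                FLUSH_DELAYED_REFS      = 4,    /* run all pending delayed refs */
                FLUSH_DELALLOC          = 5,
                FLUSH_DELALLOC_WAIT     = 6,
                ALLOC_CHUNK             = 7,
                COMMIT_TRANS            = 8,
        };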
    • btrfs: update may_commit_transaction to use the delayed refs rsv · 4c8edbc7
      Josef Bacik authored
      Any space used in the delayed_refs_rsv will be freed up by a transaction
      commit, so instead of just counting the pinned space we also need to
      account for any space in the delayed_refs_rsv when deciding if it will
      make a difference to commit the transaction to satisfy our space
      reservation.  If we have enough bytes to satisfy our reservation ticket
      then we are good to go, otherwise subtract out what space we would gain
      back by committing the transaction and compare that against the pinned
      space to make our decision.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c8edbc7
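      The decision above boils down to simple arithmetic. A self-contained
      sketch (helper name and parameters are illustrative, not the kernel's):

        #include <stdbool.h>
        #include <stdint.h>

        /* A transaction commit frees both the pinned bytes and everything
         * held in the delayed_refs_rsv, so count both against the ticket. */
        static bool commit_frees_enough(uint64_t ticket_bytes,
                                        uint64_t bytes_pinned,
                                        uint64_t delayed_refs_reserved)
        {
                /* The delayed refs reservation alone may cover the ticket. */
                if (delayed_refs_reserved >= ticket_bytes)
                        return true;
                /* Otherwise pinned space must cover the remainder. */
                return bytes_pinned >= ticket_bytes - delayed_refs_reserved;
        }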
    • btrfs: introduce delayed_refs_rsv · ba2c4d4e
      Josef Bacik authored
      Traditionally we've had voodoo in btrfs to account for the space that
      delayed refs may take up by having a global_block_rsv.  This works most
      of the time, except when it doesn't.  We've had issues reported and seen
      in production where sometimes the global reserve is exhausted during
      transaction commit before we can run all of our delayed refs, resulting
      in an aborted transaction.  Because of this voodoo we have equally
      dubious flushing semantics around throttling delayed refs which we often
      get wrong.
      
      So instead give them their own block_rsv.  This way we can always know
      exactly how much outstanding space we need for delayed refs.  This
      allows us to make sure we are constantly filling that reservation up
      with space, and allows us to put more precise pressure on the enospc
      system.  Instead of doing math to see if it's a good time to throttle,
      the normal enospc code will be invoked if we have a lot of delayed refs
      pending, and they will be run via the normal flushing mechanism.
      
      For now the delayed_refs_rsv will hold the reservations for the delayed
      refs, the block group updates, and deleting csums.  We could have a
      separate rsv for the block group updates, but the csum deletion stuff is
      still handled via the delayed_refs so that will stay there.
      
      Historical background:
      
      The global reserve has grown to cover everything we don't reserve space
      explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
      if we're running short on space and when it's time to force a commit.  A
      failure rate of 20-40 file systems when we run hundreds of thousands of
      them isn't super high, but cleaning up this code will make things less
      ugly and more predictable.
      
      Thus the delayed refs rsv.  We always know how many delayed refs we have
      outstanding, and although running them generates more we can use the
      global reserve for that spill over, which fits better into its desired
      use than a full blown reservation.  This first approach is to simply
      take the number of times we're reserving space and multiply that by 2 in
      order to save enough space for the delayed refs that could be generated.
      This is a naive approach and will probably evolve, but for now it works.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
      [ added background notes from the cover letter ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      ba2c4d4e
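      A sketch of the "multiply by 2" sizing described above. Only the factor
      of two is from the changelog; the per-ref byte cost is an assumed
      parameter (the kernel derives it from nodesize and tree height):

        #include <stdint.h>

        /* Reserve for each outstanding ref twice over, so the refs that are
         * generated while running the current batch are covered as well. */
        static uint64_t delayed_refs_rsv_size(uint64_t num_refs,
                                              uint64_t bytes_per_ref)
        {
                return 2 * num_refs * bytes_per_ref;
        }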
    • btrfs: cleanup extent_op handling · bedc6617
      Josef Bacik authored
      The cleanup_extent_op function actually would run the extent_op if it
      needed running, which made the name sort of a misnomer.  Change it to
      run_and_cleanup_extent_op, and move the actual cleanup work to
      cleanup_extent_op so it can be used by check_ref_cleanup() in order to
      unify the extent op handling.
      Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bedc6617
    • btrfs: add cleanup_ref_head_accounting helper · 07c47775
      Josef Bacik authored
      We were missing some quota cleanups in check_ref_cleanup, so break the
      ref head accounting cleanup into a helper and call that from both
      check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
      we don't screw up accounting in the future for other things that we add.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      07c47775
    • btrfs: add btrfs_delete_ref_head helper · d7baffda
      Josef Bacik authored
      We do this dance in cleanup_ref_head and check_ref_cleanup, so unify it
      into a helper and clean up the calling functions.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d7baffda
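      A kernel-style sketch of the unified "dance", assuming the delayed-ref
      fields (href_node, href_root, num_entries, num_heads, num_heads_ready)
      named elsewhere in this series; the exact bookkeeping may differ:

        static void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
                                          struct btrfs_delayed_ref_head *head)
        {
                lockdep_assert_held(&delayed_refs->lock);
                lockdep_assert_held(&head->lock);

                /* Unlink the head and fix up the counters in one place. */
                rb_erase_cached(&head->href_node, &delayed_refs->href_root);
                RB_CLEAR_NODE(&head->href_node);
                atomic_dec(&delayed_refs->num_entries);
                delayed_refs->num_heads--;
                if (head->processing == 0)
                        delayed_refs->num_heads_ready--;
        }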
    • Btrfs: allow clear_extent_dirty() to receive a cached extent state record · 0e6ec385
      Filipe Manana authored
      We can have a lot of freed extents during the life span of a transaction, so
      the red black tree that keeps track of the ranges of each freed extent
      (fs_info->freed_extents[]) can get quite big. When finishing a
      transaction commit we find each range, process it (discard the extents,
      unpin them) and then remove it from the red black tree.
      
      We can use an extent state record as a cache when searching for a range,
      so that when we clean the range we can use the cached extent state we
      passed to the search function instead of iterating the red black tree
      again. Doing things as fast as possible when finishing a transaction (in
      state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
      block another task that wants to commit the next transaction.
      
      So change clear_extent_dirty() to allow an optional extent state record to
      be passed as an argument, which will be passed down to __clear_extent_bit.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0e6ec385
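      A sketch of the resulting interface (kernel-style; the exact cleared
      bit set is an assumption): the caller passes back the extent state it
      got from the search so __clear_extent_bit can skip the rbtree walk:

        static inline int clear_extent_dirty(struct extent_io_tree *tree,
                                             u64 start, u64 end,
                                             struct extent_state **cached)
        {
                /* 'cached' lets the clear skip the red-black tree lookup. */
                return clear_extent_bit(tree, start, end,
                                        EXTENT_DIRTY | EXTENT_DELALLOC |
                                        EXTENT_DO_ACCOUNTING, 0, 0, cached);
        }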
    • btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov authored
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      de37aa51
    • btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov authored
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      as well as a new incompat flag is set METADATA_UUID. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. Is the UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         datastructures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised with the FSID when the METADATA_UUID incompat
      flag is not set, or with the metadata_uuid of the superblock otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      7239ff4b
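      A self-contained sketch of the mount-time selection described above
      (the incompat bit value and the helper are illustrative):

        #include <stdint.h>

        #define BTRFS_FSID_SIZE 16
        #define METADATA_UUID_INCOMPAT_FLAG (1ULL << 10)   /* bit value assumed */

        struct superblock {
                uint8_t  fsid[BTRFS_FSID_SIZE];            /* user-visible UUID */
                uint8_t  metadata_uuid[BTRFS_FSID_SIZE];   /* stamped into metadata */
                uint64_t incompat_flags;
        };

        /* Which UUID should device scanning compare metadata against? */
        static const uint8_t *effective_metadata_uuid(const struct superblock *sb)
        {
                if (sb->incompat_flags & METADATA_UUID_INCOMPAT_FLAG)
                        return sb->metadata_uuid;
                return sb->fsid;   /* without the flag both UUIDs are equal */
        }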
    • btrfs: Refactor find_free_extent loops update into find_free_extent_update_loop · e72d79d6
      Qu Wenruo authored
      We have a complex loop design in find_free_extent(), with different
      behavior for each loop stage; some stages even include new chunk
      allocation.

      Instead of putting such long code into find_free_extent() and making it
      harder to read, extract it into find_free_extent_update_loop().
      
      With all the cleanups, the main find_free_extent() should be pretty
      bare-bones:
      
      find_free_extent()
      |- Iterate through all block groups
      |  |- Get a valid block group
      |  |- Try to do clustered allocation in that block group
      |  |- Try to do unclustered allocation in that block group
      |  |- Check if the result is valid
      |  |  |- If valid, then exit
      |  |- Jump to next block group
      |
      |- Push harder to find free extents
         |- If not found, re-iterate all block groups
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      [ copy callchain from changelog to function comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      e72d79d6
    • btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered() · e1a41848
      Qu Wenruo authored
      This patch extracts the unclustered extent allocation code into
      find_free_extent_unclustered().

      This helper function uses its return value to indicate what to do
      next.
      
      This should make find_free_extent() a little easier to read.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      [Update merge conflict with fb5c39d7 ("btrfs: don't use ctl->free_space for max_extent_size")]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e1a41848
    • btrfs: Refactor clustered extent allocation into find_free_extent_clustered · d06e3bb6
      Qu Wenruo authored
      We have two main methods to find free extents inside a block group:
      
      1) clustered allocation
      2) unclustered allocation
      
      This patch will extract the clustered allocation into
      find_free_extent_clustered() to make it a little easier to read.
      
      Instead of jumping between different labels in find_free_extent(), the
      helper function uses its return value to indicate different behavior.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d06e3bb6
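      Both this helper and find_free_extent_unclustered() above replace
      label-jumping with a return-value protocol. A self-contained sketch of
      what such a convention could look like (names and values illustrative):

        /* Verdicts a find_free_extent_*() helper can hand back to the main
         * block group loop instead of jumping to labels itself. */
        enum ffe_verdict {
                FFE_ERROR = -1,         /* hard failure, abort the search */
                FFE_FOUND = 0,          /* allocation found, offset is valid */
                FFE_NEXT_BLOCK_GROUP,   /* nothing here, move to the next bg */
                FFE_RETRY_BLOCK_GROUP,  /* re-run the search on this bg */
        };

      The caller then owns all control flow, which is what makes the
      extracted helpers easier to read in isolation.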
    • btrfs: Introduce find_free_extent_ctl structure for later rework · b4bd745d
      Qu Wenruo authored
      Instead of tons of different local variables in find_free_extent(),
      extract them into find_free_extent_ctl structure, and add better
      explanation for them.
      
      Some modifications may look redundant, but they will later greatly
      simplify the function parameter list during the find_free_extent()
      refactor.
      
      Also add two comments to co-operate with fb5c39d7 ("btrfs: don't use
      ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size
      update more reader-friendly.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b4bd745d
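      An illustrative subset of such a control structure; the grouping into
      inputs, loop state and outputs follows the changelog, but the exact
      fields here are assumptions:

        #include <stdbool.h>
        #include <stdint.h>

        struct find_free_extent_ctl {
                /* Inputs, fixed for the whole search */
                uint64_t num_bytes;        /* size of the requested allocation */
                uint64_t empty_size;       /* extra slack to search for */

                /* Loop state threaded through the helpers */
                int      loop;             /* which fallback stage we are in */
                bool     have_caching_bg;  /* saw a block group still caching */
                bool     retry_clustered;
                bool     retry_unclustered;

                /* Outputs */
                uint64_t found_offset;     /* start of the found extent */
                uint64_t max_extent_size;  /* best-effort hint on failure */
        };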
    • btrfs: extent-tree: Detect bytes_pinned underflow earlier · e2907c1a
      Lu Fengqi authored
      Introduce a new wrapper update_bytes_pinned to replace open coded
      bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned
      get detected and reported.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e2907c1a
    • btrfs: extent-tree: Detect bytes_may_use underflow earlier · 9f9b8e8d
      Qu Wenruo authored
      Although we have space_info::bytes_may_use underflow detection in
      btrfs_free_reserved_data_space_noquota(), we have more callers who are
      subtracting number from space_info::bytes_may_use.
      
      So instead of doing underflow detection for every caller, introduce a
      new wrapper update_bytes_may_use() to replace open coded bytes_may_use
      modifiers.
      
      This also introduces a macro to declare more wrappers, but currently
      space_info::bytes_may_use is the most interesting one.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9f9b8e8d
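      A self-contained sketch of the wrapper-declaring macro described by the
      two commits above (the kernel version would WARN and clamp; names
      follow the changelogs, the printing is illustrative):

        #include <stdint.h>
        #include <stdio.h>

        struct space_info { uint64_t bytes_may_use; uint64_t bytes_pinned; };

        #define DECLARE_SPACE_INFO_UPDATE(name)                             \
        static void update_##name(struct space_info *sinfo, int64_t bytes) \
        {                                                                   \
                if (bytes < 0 && sinfo->name < (uint64_t)-bytes) {          \
                        fprintf(stderr, "underflow of " #name "\n");        \
                        sinfo->name = 0; /* clamp instead of wrapping */    \
                        return;                                             \
                }                                                           \
                sinfo->name += bytes;                                       \
        }

        DECLARE_SPACE_INFO_UPDATE(bytes_may_use);
        DECLARE_SPACE_INFO_UPDATE(bytes_pinned);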
  2. 19 Oct 2018, 3 commits
  3. 17 Oct 2018, 1 commit
    • Btrfs: fix deadlock when writing out free space caches · 5ce55557
      Filipe Manana authored
      When writing out a block group free space cache we can end up
      deadlocking with ourselves on an extent buffer lock, resulting in a
      warning like the following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G        W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallow the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block group
      creation modifies. Once the new extent buffer is allocated, creation of
      pending block groups is allowed again.
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
      Reported-by: E V <eliventer@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5ce55557
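      A kernel-style sketch of the generic fix (the argument list of
      btrfs_alloc_tree_block is approximate): pending block group creation is
      forbidden exactly while allocating a tree block for the trees that such
      creation itself modifies:

        /* In __btrfs_cow_block(): the extent, chunk and dev trees are the
         * ones that pending block group creation writes to. */
        if (root == fs_info->extent_root ||
            root == fs_info->chunk_root ||
            root == fs_info->dev_root)
                trans->can_flush_pending_bgs = false;

        cow = btrfs_alloc_tree_block(trans, root, parent_start,
                                     root->root_key.objectid, &disk_key,
                                     level, search_start, empty_size);
        trans->can_flush_pending_bgs = true;
        if (IS_ERR(cow))
                return PTR_ERR(cow);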
  4. 15 Oct 2018, 21 commits
    • btrfs: remove fs_info from btrfs_should_throttle_delayed_refs · 7c861627
      Lu Fengqi authored
      The avg_delayed_ref_runtime can be referenced from the transaction
      handle.
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7c861627
    • btrfs: remove fs_info from btrfs_check_space_for_delayed_refs · af9b8a0e
      Lu Fengqi authored
      It can be referenced from the transaction handle.
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af9b8a0e
    • btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock · 9e920a6f
      Lu Fengqi authored
      Since trans is only used for referring to delayed_refs, there is no need
      to pass it instead of delayed_refs to btrfs_delayed_ref_lock().
      
      No functional change.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9e920a6f
    • btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head · 5637c74b
      Lu Fengqi authored
      Since trans is only used for referring to delayed_refs, there is no need
      to pass it instead of delayed_refs to btrfs_select_ref_head().  No
      functional change.
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5637c74b
    • btrfs: make sure we create all new block groups · 545e3366
      Josef Bacik authored
      Allocating new chunks modifies both the extent and chunk tree, which can
      trigger new chunk allocations.  So instead of doing list_for_each_safe,
      just do while (!list_empty()) so we make sure we don't exit with other
      pending bg's still on our list.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      545e3366
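      A kernel-style sketch of the loop change (the item-insertion step is
      condensed into an illustrative helper): list_for_each_entry_safe
      snapshots the next pointer, so entries appended by a nested chunk
      allocation could be missed, while re-checking list_empty() cannot:

        void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
        {
                struct btrfs_block_group_cache *bg;

                while (!list_empty(&trans->new_bgs)) {
                        bg = list_first_entry(&trans->new_bgs,
                                              struct btrfs_block_group_cache,
                                              bg_list);
                        /* May allocate a chunk and append to trans->new_bgs. */
                        create_one_pending_bg(trans, bg);   /* illustrative */
                        list_del_init(&bg->bg_list);
                }
        }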
    • btrfs: qgroup: Don't trace subtree if we're dropping reloc tree · 2cd86d30
      Qu Wenruo authored
      Reloc tree doesn't contribute to qgroup numbers, as we have accounted
      them at balance time (see replace_path()).
      
      Skipping the unneeded subtree tracing should reduce the overhead.
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22900          | -0.1%
      qgroup dirty extents | 227757       | 167139         | -26.6%
      time (sys)           | 65.253s      | 50.123s        | -23.2%
      time (real)          | 74.032s      | 52.551s        | -29.0%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2cd86d30
    • btrfs: refactor __btrfs_run_delayed_refs loop · 0110a4c4
      Nikolay Borisov authored
      Refactor the delayed refs loop by using the newly introduced
      btrfs_run_delayed_refs_for_head function. This greatly simplifies
      __btrfs_run_delayed_refs and makes it more obvious what is happening.
      
      We now have 1 loop which iterates the existing delayed_heads and then
      each selected ref head is processed by the new helper. All existing
      semantics of the code are preserved so no functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0110a4c4
    • btrfs: Factor out loop processing all refs of a head · e7261386
      Nikolay Borisov authored
      This patch introduces a new helper encompassing the implicit inner loop
      in __btrfs_run_delayed_refs which processes all the refs for a given
      head. The code is mostly copy/paste, the only difference is that if we
      detect a newer reference then -EAGAIN is returned so that callers can
      react correctly.
      
      Also, at the end of the loop the head is relocked and
      btrfs_merge_delayed_refs is run again to retain the pre-refactoring
      semantics.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e7261386
    • btrfs: Factor out ref head locking code in __btrfs_run_delayed_refs · b1cdbcb5
      Nikolay Borisov authored
      This is in preparation to refactor the giant loop in
      __btrfs_run_delayed_refs. As a first step define a new function
      which implements acquiring a reference to a btrfs_delayed_refs_head and
      use it. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b1cdbcb5
    • Btrfs: delayed-refs: use rb_first_cached for ref_tree · e3d03965
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href->ref_tree need to get the first entry, this
      converts href->ref_tree to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e3d03965
    • Btrfs: delayed-refs: use rb_first_cached for href_root · 5c9d028b
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href_root need to get the first entry, this
      converts href_root to use rb_first_cached().
      
      This patch is first in a sequence of similar updates to other rbtrees,
      and this is an analysis of the expected behaviour and improvements.
      
      There's a common pattern:
      
      while (node = rb_first) {
              entry = rb_entry(node)
              next = rb_next(node)
              rb_erase(node)
              cleanup(entry)
      }
      
      rb_first needs to traverse the tree up to logN depth, rb_erase can
      completely reshuffle the tree. With the caching we'll skip the traversal
      in rb_first.  That's a cached memory access vs looped pointer
      dereference trade-off that IMHO has a clear winner.
      
      Measurements show there's not much difference in a sample tree with
      10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
      caching and pointer chasing are unpredictable though.
      
      Further optimizations can be done to avoid the expensive rb_erase step.
      In some cases it's ok to process the nodes in any order, so the tree can
      be traversed in post-order, not rebalancing the children nodes and just
      calling free. Care must be taken regarding the next node.
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog from mail discussions ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      5c9d028b
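      A self-contained toy showing the trade-off: rb_first_cached() amounts
      to keeping a leftmost pointer next to the root, so the minimum lookup
      is O(1) instead of an O(log N) walk down the left spine:

        #include <stddef.h>

        struct node { struct node *left, *right; };
        struct cached_root { struct node *root; struct node *leftmost; };

        static struct node *first_uncached(const struct cached_root *t)
        {
                struct node *n = t->root;

                if (!n)
                        return NULL;
                while (n->left)         /* O(log N) left-spine walk */
                        n = n->left;
                return n;
        }

        static struct node *first_cached(const struct cached_root *t)
        {
                return t->leftmost;     /* O(1), maintained on insert/erase */
        }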
    • btrfs: wait on caching when putting the bg cache · 3aa7c7a3
      Josef Bacik authored
      While testing my backport I noticed there was a panic if I ran
      generic/416 generic/417 generic/418 all in a row.  This just happened to
      uncover a race where we had outstanding IO after we destroyed all of
      our workqueues, and then we'd go to queue the endio work on those freed
      workqueues.
      
      This is because we aren't waiting for the caching threads to be done
      before freeing everything up, so to fix this make sure we wait on any
      outstanding caching that's being done before we free up the block group,
      so we're sure to be done with all IO by the time we get to
      btrfs_stop_all_workers().  This fixes the panic I was seeing
      consistently in testing.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:6112!
      SMP PTI
      Modules linked in:
      CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: btrfs-cache btrfs_cache_helper
      RIP: 0010:btrfs_map_bio+0x346/0x370
      RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
      RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
      RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
      R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
      FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btree_submit_bio_hook+0x8a/0xd0
       submit_one_bio+0x5d/0x80
       read_extent_buffer_pages+0x18a/0x320
       btree_read_extent_buffer_pages+0xbc/0x200
       ? alloc_extent_buffer+0x359/0x3e0
       read_tree_block+0x3d/0x60
       read_block_for_search.isra.30+0x1a5/0x360
       btrfs_search_slot+0x41b/0xa10
       btrfs_next_old_leaf+0x212/0x470
       caching_thread+0x323/0x490
       normal_work_helper+0xc5/0x310
       process_one_work+0x141/0x340
       worker_thread+0x44/0x3c0
       kthread+0xf8/0x130
       ? process_one_work+0x340/0x340
       ? kthread_bind+0x10/0x10
       ret_from_fork+0x35/0x40
      RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
      ---[ end trace 827eb13e50846033 ]---
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3aa7c7a3
    • btrfs: keep trim from interfering with transaction commits · fee7acc3
      Jeff Mahoney authored
      Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      fixed free space trimming, but introduced latency when it was running.
      This is due to it pinning the transaction using both an incremented
      refcount and holding the commit root sem for the duration of a single
      trim operation.

      This was to ensure safety, but it's unnecessary.  We already hold the
      chunk mutex so we know that the chunk we're using can't be allocated
      while we're trimming it.

      In order to check against chunks allocated already in this transaction,
      we need to check the pending chunks list.  To do that safely without
      joining the transaction (or attaching and then having to commit it) we
      need to ensure that the dev root's commit root doesn't change underneath
      us and the pending chunks list stays around until we're done with it.
      
      We can ensure the former by holding the commit root sem and the latter
      by pinning the transaction.  We do this now, but the critical section
      covers the trim operation itself and we don't need to do that.
      
      This patch moves the pinning and unpinning logic into helpers and unpins
      the transaction after performing the search and check for pending
      chunks.
      
      Limiting the critical section of the transaction pinning improves the
      latency substantially on slower storage (e.g. image files over NFS).
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fee7acc3
    • btrfs: don't attempt to trim devices that don't support it · 0be88e36
      Jeff Mahoney authored
      We check whether any device the file system is using supports discard in
      the ioctl call, but then we attempt to trim free extents on every device
      regardless of whether discard is supported.  Due to the way we mask off
      EOPNOTSUPP, we can end up issuing the trim operations on each free range
      on devices that don't support it, just wasting time.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0be88e36
    • btrfs: iterate all devices during trim, instead of fs_devices::alloc_list · d4e329de
      Jeff Mahoney authored
      btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
      device_list_mutex.  The problem is that ->alloc_list is protected by the
      chunk mutex.  We don't want to hold the chunk mutex over the trim of the
      entire file system.  Fortunately, the ->dev_list list is protected by
      the dev_list mutex and while it will give us all devices, including
      read-only devices, we already just skip the read-only devices.  Then we
      can continue to take and release the chunk mutex while scanning each
      device.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d4e329de
    • btrfs: Ensure btrfs_trim_fs can trim the whole filesystem · 6ba9fc8e
      Qu Wenruo authored
      [BUG]
      fstrim on some btrfs only trims the unallocated space, not trimming any
      space in existing block groups.
      
      [CAUSE]
      Before fstrim_range is passed to btrfs_trim_fs(), it gets truncated to
      range [0, super->total_bytes).  So later btrfs_trim_fs() will only be
      able to trim block groups in range [0, super->total_bytes).
      
      While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
      uses its logical address space, there is nothing limiting the location
      where we put block groups.
      
      For filesystem with frequent balance, it's quite easy to relocate all
      block groups and bytenr of block groups will start beyond
      super->total_bytes.
      
      In that case, btrfs will not trim existing block groups.
      
      [FIX]
      Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
      can get the unmodified range, which is normally set to [0, U64_MAX].
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
      CC: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ba9fc8e
    • btrfs: Enhance btrfs_trim_fs function to handle error better · 93bba24d
      Qu Wenruo authored
      Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
      an error happens when trimming existing block groups, it will skip the
      remaining blocks and continue to trim unallocated space for each device.
      
      The return value will only reflect the final error from device trimming.
      
      This patch will fix such behavior by:
      
      1) Recording the last error from block group or device trimming
         The return value will also reflect the last error during trimming,
         making developers more aware of the problem.
      
      2) Continuing trimming if possible
         If we failed to trim one block group or device, we could still try
         the next block group or device.
      
      3) Reporting the number of failures during block group and device trimming
         It would be less noisy, but still gives the user a brief summary of
         what's going wrong.
      
      Such behavior can avoid confusion for cases like a failure to trim the
      first block group, after which only unallocated space would be trimmed.
      Reported-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add bg_ret and dev_ret to the messages ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      93bba24d
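      A self-contained sketch of the error policy in points 1-3 (names are
      illustrative; the kernel version reports separate bg_ret/dev_ret
      values):

        struct trim_target {
                int (*trim)(void *ctx);
                void *ctx;
        };

        /* Keep trimming after individual failures, remember the last error
         * for the return value and count the failures for the summary. */
        static int trim_all(struct trim_target *targets, int count,
                            int *failures)
        {
                int i, ret = 0;

                *failures = 0;
                for (i = 0; i < count; i++) {
                        int err = targets[i].trim(targets[i].ctx);

                        if (err) {
                                ret = err;
                                (*failures)++;
                        }
                }
                return ret;
        }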
    • btrfs: remove redundant variable from btrfs_cross_ref_exist · 380fd066
      Misono Tomohiro authored
      Since commit d7df2c79 ("Btrfs: attach delayed ref updates to delayed
      ref heads"), check_delayed_ref() won't return -ENOENT.
      
      In btrfs_cross_ref_exist(), two variables 'ret' and 'ret2' were
      originally used to handle the -ENOENT error case.  Since the code is
      not needed anymore, let's just remove 'ret2'.
      Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      380fd066
    • Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes · 6aadd9eb
      Liu Bo authored
      Here we're not releasing any space, but transferring bytes from
      ->bytes_may_use to ->bytes_reserved. The last change to the code in
      commit 18513091 ("btrfs: update btrfs_space_info's
      bytes_may_use timely") removed a conditional tracepoint and the logic
      changed too, but the tracepoint remained.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      6aadd9eb
    • btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock · b72c3aba
      Qu Wenruo authored
      [BUG]
      For a certain crafted image, whose csum root leaf has a missing
      backref, if we try to trigger a write with data csum, it could cause a
      deadlock with the following kernel WARN_ON():
      
        WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
        CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
        Workqueue: btrfs-endio-write btrfs_endio_write_helper
        RIP: 0010:btrfs_tree_lock+0x3e2/0x400
        Call Trace:
         btrfs_alloc_tree_block+0x39f/0x770
         __btrfs_cow_block+0x285/0x9e0
         btrfs_cow_block+0x191/0x2e0
         btrfs_search_slot+0x492/0x1160
         btrfs_lookup_csum+0xec/0x280
         btrfs_csum_file_blocks+0x2be/0xa60
         add_pending_csums+0xaf/0xf0
         btrfs_finish_ordered_io+0x74b/0xc90
         finish_ordered_fn+0x15/0x20
         normal_work_helper+0xf6/0x500
         btrfs_endio_write_helper+0x12/0x20
         process_one_work+0x302/0x770
         worker_thread+0x81/0x6d0
         kthread+0x180/0x1d0
         ret_from_fork+0x35/0x40
      
      [CAUSE]
      That crafted image has a missing backref for the csum tree root leaf.
      And when we try to allocate a new tree block, since there is no
      EXTENT/METADATA_ITEM for the csum tree root, btrfs considers it a free
      slot and uses it.
      
      The extent tree of the image looks like:
      
        Normal image                      |       This fuzzed image
        ----------------------------------+--------------------------------
        BG 29360128                       | BG 29360128
         One empty slot                   |  One empty slot
        29364224: backref to UUID tree    | 29364224: backref to UUID tree
         Two empty slots                  |  Two empty slots
        29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
        29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
        ...                               | ...
      
      Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs tries to
      allocate a tree block, it's a valid slot for btrfs.
      
      And for finish_ordered_write, when we need to insert a csum, we try to
      CoW the csum tree root.
      
      By accident, the empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K and
      BG_OFFSET + 12K are already used by tree block COW for other trees, and
      the next empty slot is BG_OFFSET + 16K, which should be the backref for
      the CSUM tree.

      But due to the bad type, btrfs can't recognize it and still considers
      it an empty slot, and will try to use it for csum tree CoW.
      
      Then in the following call trace, we will try to lock the new tree
      block, which turns out to be the old csum tree root which is already
      locked:
      
      btrfs_search_slot() called on csum tree root, which is at 29376512
      |- btrfs_cow_block()
         |- btrfs_set_lock_block()
         |  |- Now locks tree block 29376512 (old csum tree root)
         |- __btrfs_cow_block()
            |- btrfs_alloc_tree_block()
               |- btrfs_reserve_extent()
                | Now it returns tree block 29376512, which the extent tree
                | shows as an empty slot, but it's already held by the csum tree
                  |- btrfs_init_new_buffer()
                     |- btrfs_tree_lock()
                        | Triggers WARN_ON(eb->lock_owner == current->pid)
                        |- wait_event()
                            Waits for the lock owner to release the lock, but
                            it's locked by ourselves, so it will deadlock
      
      [FIX]
      This patch adds the lock_owner and current->pid check at
      btrfs_init_new_buffer(), so the above deadlock can be avoided.
      
      Since such a problem can only happen with a crafted image, we will
      still trigger a kernel warning for the later aborted transaction, but
      with a little more meaningful warning message.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
      Reported-by: Xu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b72c3aba
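      A kernel-style sketch of the added check (error code and message are
      approximations): if the "new" buffer is already write-locked by us, the
      extent tree handed out a block that is in use, so bail out instead of
      deadlocking in btrfs_tree_lock():

        /* In btrfs_init_new_buffer(), before taking the tree lock: */
        if (buf->lock_owner == current->pid) {
                btrfs_err_rl(fs_info,
        "tree block %llu owner %llu already locked by pid=%d, extent tree corruption detected",
                             buf->start, btrfs_header_owner(buf),
                             current->pid);
                free_extent_buffer(buf);
                return ERR_PTR(-EUCLEAN);
        }
        btrfs_tree_lock(buf);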
    • btrfs: Handle owner mismatch gracefully when walking up tree · 65c6e82b
      Qu Wenruo authored
      [BUG]
      When mounting a certain crafted image, btrfs will trigger a kernel
      BUG_ON() when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, the reloc tree root's
      owner is DATA_RELOC_TREE (it should be TREE_RELOC), thus its backref is
      corrupted and we fail the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we need to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      a possible NULL dereference, as root->node can be NULL in this case.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
      Reported-by: Xu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      65c6e82b