1. 10 May, 2022 1 commit
  2. 06 May, 2022 1 commit
    • btrfs: force v2 space cache usage for subpage mount · 9f73f1ae
      Committed by Qu Wenruo
      [BUG]
      For a 4K sector size btrfs with the v1 space cache enabled that was
      previously only mounted on systems with 4K page size, mounting it on a
      subpage (64K page size) system can cause the following warnings from
      the v1 space cache:
      
       BTRFS error (device dm-1): csum mismatch on free space cache
       BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now
      
      Although this is not a big deal, as the kernel can rebuild the cache
      without problems, such warnings will bother end users, especially if
      they want to switch the same btrfs seamlessly between systems with
      different page sizes.
      
      [CAUSE]
      The v1 free space cache still uses the fixed PAGE_SIZE for various
      bitmap constants, like BITS_PER_BITMAP.
      
      Such hard-coded PAGE_SIZE usage causes various mismatches, from the v1
      cache size to the checksum.
      
      Thus the kernel will always reject a v1 cache created with a different
      PAGE_SIZE, reporting a csum mismatch.
      
      [FIX]
      Although the v1 cache should ideally be fixed, it is already going to
      be marked deprecated soon.
      
      And we have the v2 cache, which is metadata based (and already fully
      subpage compatible) and superior to the v1 cache in almost every way.
      
      So just force subpage mounts to use the v2 cache.
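      A minimal sketch of the idea, assuming the usual mount-option helpers
      (btrfs_fs_compat_ro(), btrfs_set_opt()); this is illustrative, not the
      literal patch:
      
        /* On a subpage mount, clear any stale v1 cache and force the v2
         * (free space tree) cache. */
        if (fs_info->sectorsize < PAGE_SIZE &&
            !btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
                btrfs_info(fs_info,
                           "forcing free space tree for sector size %u with page size %lu",
                           fs_info->sectorsize, PAGE_SIZE);
                btrfs_set_opt(fs_info->mount_opt, CLEAR_CACHE);
                btrfs_set_opt(fs_info->mount_opt, FREE_SPACE_TREE);
        }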
      Reported-by: Matt Corallo <blnxfsl@bluematt.me>
      CC: stable@vger.kernel.org # 5.15+
      Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9f73f1ae
  3. 21 April, 2022 1 commit
    • btrfs: zoned: use dedicated lock for data relocation · 5f0addf7
      Committed by Naohiro Aota
      Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive
      writeback of the relocation data inode in
      btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock
      in the following path.
      
      Thread A takes btrfs_inode_lock() and waits for a metadata reservation,
      e.g., by waiting for writeback:
      
      prealloc_file_extent_cluster()
        - btrfs_inode_lock(&inode->vfs_inode, 0);
        - btrfs_prealloc_file_range()
        ...
          - btrfs_replace_file_extents()
            - btrfs_start_transaction
            ...
              - btrfs_reserve_metadata_bytes()
      
      Thread B (e.g., a writeback worker) needs to take the inode lock to
      continue the writeback process:
      
      do_writepages
        - btrfs_writepages
            - extent_writepages
            - btrfs_zoned_data_reloc_lock(BTRFS_I(inode));
              - btrfs_inode_lock()
      
      The deadlock is caused by relying on the vfs_inode's lock. By using it,
      we introduced an unnecessary exclusion between writeback and
      btrfs_prealloc_file_range(). Also, the lock at this point is useless as
      we don't have any dirty pages in the inode yet.
      
      Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive
      writeback.
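      A minimal sketch of the approach, using the field and helper names from
      the commit text; the exact lock type and surrounding code are
      assumptions, not the literal patch:
      
        /* New dedicated lock, assumed to live in struct btrfs_fs_info:
         *     struct mutex zoned_data_reloc_io_lock;
         * (initialized with mutex_init() during fs_info setup)
         */
        static inline void btrfs_zoned_data_reloc_lock(struct btrfs_inode *inode)
        {
                struct btrfs_root *root = inode->root;
        
                if (btrfs_is_data_reloc_root(root) && btrfs_is_zoned(root->fs_info))
                        mutex_lock(&root->fs_info->zoned_data_reloc_io_lock);
        }
        
        static inline void btrfs_zoned_data_reloc_unlock(struct btrfs_inode *inode)
        {
                struct btrfs_root *root = inode->root;
        
                if (btrfs_is_data_reloc_root(root) && btrfs_is_zoned(root->fs_info))
                        mutex_unlock(&root->fs_info->zoned_data_reloc_io_lock);
        }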
      
      Fixes: 35156d85 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
      CC: stable@vger.kernel.org # 5.16.x: 869f4cdc: btrfs: zoned: encapsulate inode locking for zoned relocation
      CC: stable@vger.kernel.org # 5.16.x
      CC: stable@vger.kernel.org # 5.17
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5f0addf7
  4. 06 April, 2022 1 commit
  5. 15 March, 2022 2 commits
  6. 14 March, 2022 7 commits
  7. 02 March, 2022 1 commit
    • btrfs: do not start relocation until in progress drops are done · b4be6aef
      Committed by Josef Bacik
      We hit a bug with a recovering relocation on mount for one of our file
      systems in production.  I reproduced this locally by injecting errors
      into snapshot delete with balance running at the same time.  This
      presented as an error while looking up an extent item
      
        WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
        CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
        RIP: 0010:lookup_inline_extent_backref+0x647/0x680
        RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
        RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
        R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
        R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
        FS:  0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
        Call Trace:
         <TASK>
         insert_inline_extent_backref+0x46/0xd0
         __btrfs_inc_extent_ref.isra.0+0x5f/0x200
         ? btrfs_merge_delayed_refs+0x164/0x190
         __btrfs_run_delayed_refs+0x561/0xfa0
         ? btrfs_search_slot+0x7b4/0xb30
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_run_delayed_refs+0x73/0x1f0
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_commit_transaction+0x50/0xa50
         ? btrfs_update_reloc_root+0x122/0x220
         prepare_to_merge+0x29f/0x320
         relocate_block_group+0x2b8/0x550
         btrfs_relocate_block_group+0x1a6/0x350
         btrfs_relocate_chunk+0x27/0xe0
         btrfs_balance+0x777/0xe60
         balance_kthread+0x35/0x50
         ? btrfs_balance+0xe60/0xe60
         kthread+0x16b/0x190
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x22/0x30
         </TASK>
      
      Normally snapshot deletion and relocation are excluded from running at
      the same time by the fs_info->cleaner_mutex.  However if we had a
      pending balance waiting to get the ->cleaner_mutex, and a snapshot
      deletion was running, and then the box crashed, we would come up in a
      state where we have a half deleted snapshot.
      
      Again, in the normal case the snapshot deletion needs to complete before
      relocation can start, but in this case relocation could very well start
      before the snapshot deletion completes, as we simply add the root to the
      dead roots list and wait for the next time the cleaner runs to clean up
      the snapshot.
      
      Fix this by checking whether any DEAD_ROOT has a pending drop_progress
      key. If one does, then we know we were in the middle of the drop
      operation, so set a flag on the fs_info. Balance can then wait until
      this flag is cleared before starting up again.
      
      If a DEAD_ROOT doesn't have a drop_progress set then we're safe to
      start balance right away, as we'll be properly protected by the
      cleaner_mutex.
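      A rough sketch of that idea under the stated assumptions (the flag name
      and its placement are illustrative, not the literal patch):
      
        /* While queueing dead roots at mount, note any interrupted drop. */
        struct btrfs_key key;
        
        btrfs_disk_key_to_cpu(&key, &root->root_item.drop_progress);
        if (key.objectid || key.type || key.offset)
                set_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags);
        
        /* Balance/relocation would then wait for the bit to clear, e.g.: */
        wait_on_bit(&fs_info->flags, BTRFS_FS_UNFINISHED_DROPS,
                    TASK_INTERRUPTIBLE);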
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b4be6aef
  8. 02 February, 2022 2 commits
  9. 07 January, 2022 3 commits
    • btrfs: output more debug messages for uncommitted transaction · 36c86a9e
      Committed by Qu Wenruo
      Print extra information about how many dirty bytes an uncommitted
      transaction has at the end of mount.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36c86a9e
    • btrfs: remove reada infrastructure · f26c9238
      Committed by Qu Wenruo
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for that single user, it's not providing the functionality it
      needs, as scrub needs reada for the commit root, which the current
      readahead can't provide (although it would be pretty easy to add such a
      feature).
      
      Despite this, there are some extra problems related to metadata
      readahead:
      
      - Duplicated feature with btrfs_path::reada
      
      - Partly duplicated feature of btrfs_fs_info::buffer_radix
        Btrfs already caches its metadata in buffer_radix, while readahead
        tries to read the tree block no matter if it's already cached.
      
      - Poor layer separation
        Metadata readahead works more or less at the device level.
        This is definitely not the correct layer, since metadata lives in the
        btrfs logical address space and should not bother the device layer at
        all.
      
        This brings extra chances for bugs to sneak in, while adding
        unnecessary complexity.
      
      - Dead code
        In the very beginning of scrub.c we have #undef DEBUG, rendering all
        the debug related code useless and impossible to test.
      
      Thus I propose to remove the metadata readahead mechanism completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
      The numbers are the reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
      	Old		New		Diff
      SSD	455.3		466.332		+2.42%
      HDD	103.927 	98.012		-5.69%
      
      Comprehensive test methodology is in the cover letter of the patch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f26c9238
    • btrfs: make send work with concurrent block group relocation · d96b3424
      Committed by Filipe Manana
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (especially metadata), or is
      about to read a metadata extent expecting it to belong to a specific
      parent node, relocation can run, the transaction used for the
      relocation is committed and the extent gets reallocated while send is
      still using it, so it ends up with different content than expected.
      This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystems we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it, or send can end
      up delaying the block group relocation for too long. In the future we
      might also get automatic block group relocation for non-zoned
      filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root
         semaphore, the leaf is cloned and placed in the search path (struct
         btrfs_path) - see the sketch after this list;
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove or update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
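      A minimal sketch of steps 1) and 2) above, assuming the existing
      btrfs_clone_extent_buffer() helper and the commit_root_sem; error
      handling is trimmed and this is not the literal patch:
      
        down_read(&fs_info->commit_root_sem);
        ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
        if (ret >= 0) {
                struct extent_buffer *clone;
        
                /* Work on a private copy so the real leaf can be freed by a
                 * later transaction commit without affecting send. */
                clone = btrfs_clone_extent_buffer(path->nodes[0]);
                if (clone) {
                        free_extent_buffer(path->nodes[0]);
                        path->nodes[0] = clone;
                } else {
                        ret = -ENOMEM;
                }
        }
        up_read(&fs_info->commit_root_sem);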
      
      This leaves some room for optimizations for send to have fewer path
      releases and less re-searching of the trees when there's relocation
      running, but
      for now it's kept simple as it performs quite well (on very large trees
      with resulting send streams in the order of a few hundred gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d96b3424
  10. 03 January, 2022 12 commits
  11. 14 December, 2021 1 commit
    • btrfs: fix double free of anon_dev after failure to create subvolume · 33fab972
      Committed by Filipe Manana
      When creating a subvolume, at create_subvol(), we allocate an anonymous
      device and later call btrfs_get_new_fs_root(), which in turn just calls
      btrfs_get_root_ref(). There we call btrfs_init_fs_root(), which assigns
      the anonymous device to the root, but if there's an error after that
      call, when we jump to the 'fail' label, we call btrfs_put_root(), which
      frees the anonymous device and then returns an error that is propagated
      back to create_subvol(). Then create_subvol() frees the anonymous
      device again.
      
      When this happens, if the anonymous device was not reallocated after
      the first time it was freed with btrfs_put_root(), we get a kernel
      message like the following:
      
        (...)
        [13950.282466] BTRFS: error (device dm-0) in create_subvol:663: errno=-5 IO failure
        [13950.283027] ida_free called for id=65 which is not allocated.
        [13950.285974] BTRFS info (device dm-0): forced readonly
        (...)
      
      If the anonymous device gets reallocated by another btrfs filesystem
      or any other kernel subsystem, then bad things can happen.
      
      So fix this by setting the root's anonymous device to 0 at
      btrfs_get_root_ref(), before we call btrfs_put_root(), if an error
      happened.
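      A small sketch of that error path in btrfs_get_root_ref(), under the
      stated assumptions (label and variable names are illustrative):
      
        fail:
                /* The caller owns the anonymous device it passed in, so clear
                 * it from the root before btrfs_put_root() can free it, to
                 * avoid a second free by the caller. */
                if (anon_dev)
                        root->anon_dev = 0;
                btrfs_put_root(root);
                return ERR_PTR(ret);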
      
      Fixes: 2dfb1e43 ("btrfs: preallocate anon block device at first phase of snapshot creation")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      33fab972
  12. 16 November, 2021 1 commit
    • btrfs: check-integrity: fix a warning on write caching disabled disk · a91cf0ff
      Committed by Wang Yugui
      When a disk has write caching disabled, we skip submission of a bio
      with flush and sync requests before writing the superblock, since it's
      not needed. However, when the integrity checker is enabled, this
      results in reports that there are metadata blocks referred to by a
      superblock that were not properly flushed. So, when the integrity
      checker is enabled, don't skip the bio submission, for the sake of
      simplicity, since this is a debug tool and not meant for use in
      non-debug builds.
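      A hedged sketch of the resulting check in write_dev_flush(); the exact
      predicate for "write cache enabled" varies between kernel versions, so
      take this as illustrative only:
      
        #ifndef CONFIG_BTRFS_FS_CHECK_INTEGRITY
                /* Devices without a writeback cache don't need a flush bio,
                 * but keep sending it when the integrity checker is built in
                 * so it can track the flush generation. */
                if (!test_bit(QUEUE_FLAG_WC,
                              &bdev_get_queue(device->bdev)->queue_flags))
                        return;
        #endif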
      
      fstests btrfs/220 triggers a check-integrity warning like the following
      when CONFIG_BTRFS_FS_CHECK_INTEGRITY=y and the disk has write caching
      disabled (WCE=0).
      
        btrfs: attempt to write superblock which references block M @5242880 (sdb2/5242880/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
        ------------[ cut here ]------------
        WARNING: CPU: 28 PID: 843680 at fs/btrfs/check-integrity.c:2196 btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
        CPU: 28 PID: 843680 Comm: umount Not tainted 5.15.0-0.rc5.39.el8.x86_64 #1
        Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019
        RIP: 0010:btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
        RSP: 0018:ffffb642afb47940 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
        RDX: 00000000ffffffff RSI: ffff8b722fc97d00 RDI: ffff8b722fc97d00
        RBP: ffff8b5601c00000 R08: 0000000000000000 R09: c0000000ffff7fff
        R10: 0000000000000001 R11: ffffb642afb476f8 R12: ffffffffffffffff
        R13: ffffb642afb47974 R14: ffff8b5499254c00 R15: 0000000000000003
        FS:  00007f00a06d4080(0000) GS:ffff8b722fc80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fff5cff5ff0 CR3: 00000001c0c2a006 CR4: 00000000001706e0
        Call Trace:
         btrfsic_process_written_block+0x2f7/0x850 [btrfs]
         __btrfsic_submit_bio.part.19+0x310/0x330 [btrfs]
         ? bio_associate_blkg_from_css+0xa4/0x2c0
         btrfsic_submit_bio+0x18/0x30 [btrfs]
         write_dev_supers+0x81/0x2a0 [btrfs]
         ? find_get_pages_range_tag+0x219/0x280
         ? pagevec_lookup_range_tag+0x24/0x30
         ? __filemap_fdatawait_range+0x6d/0xf0
         ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
         ? find_first_extent_bit+0x9b/0x160 [btrfs]
         ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
         write_all_supers+0x1b3/0xa70 [btrfs]
         ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
         btrfs_commit_transaction+0x59d/0xac0 [btrfs]
         close_ctree+0x11d/0x339 [btrfs]
         generic_shutdown_super+0x71/0x110
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0xb8/0x140
         task_work_run+0x6d/0xb0
         exit_to_user_mode_prepare+0x1f0/0x200
         syscall_exit_to_user_mode+0x12/0x30
         do_syscall_64+0x46/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f009f711dfb
        RSP: 002b:00007fff5cff7928 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 000055b68c6c9970 RCX: 00007f009f711dfb
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b68c6c9b50
        RBP: 0000000000000000 R08: 000055b68c6ca900 R09: 00007f009f795580
        R10: 0000000000000000 R11: 0000000000000246 R12: 000055b68c6c9b50
        R13: 00007f00a04bf184 R14: 0000000000000000 R15: 00000000ffffffff
        ---[ end trace 2c4b82abcef9eec4 ]---
        S-65536(sdb2/65536/1)
         -->
        M-1064960(sdb2/1064960/1)
      Reviewed-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a91cf0ff
  13. 29 October, 2021 1 commit
    • btrfs: call btrfs_check_rw_degradable only if there is a missing device · 5c78a5e7
      Committed by Anand Jain
      In open_ctree(), btrfs_check_rw_degradable() [1] checks each block
      group individually to see whether at least the minimum number of
      devices is available for that profile. If all the devices are
      available, then we don't have to check degradability at all.
      
      [1]
      open_ctree()
      ::
      3559 if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) {
      
      Also, before calling btrfs_check_rw_degradable() in open_ctree(), at
      the line shown below [2], we call btrfs_read_chunk_tree(), which goes
      down to add_missing_dev() to record the number of missing devices.
      
      [2]
      open_ctree()
      ::
      3454         ret = btrfs_read_chunk_tree(fs_info);
      
      btrfs_read_chunk_tree()
        read_one_chunk() / read_one_dev()
          add_missing_dev()
      
      So, check if there is any missing device before calling
      btrfs_check_rw_degradable() in open_ctree().
      
      Also, with this the mount command can save ~16ms [3] in the most common
      case, that is, when no device is missing.
      
      [3]
       1) * 16934.96 us | btrfs_check_rw_degradable [btrfs]();
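      A short sketch of the changed condition in open_ctree(), assuming the
      existing fs_devices->missing_devices counter; the error handling shown
      is illustrative:
      
        if (!sb_rdonly(sb) && fs_info->fs_devices->missing_devices &&
            !btrfs_check_rw_degradable(fs_info, NULL)) {
                btrfs_warn(fs_info,
                "writable mount is not allowed due to too many missing devices");
                err = -EINVAL;
                goto fail_sysfs;
        }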
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5c78a5e7
  14. 27 October, 2021 6 commits
    • btrfs: fix comment about sector sizes supported in 64K systems · 50780d9b
      Committed by Anand Jain
      Commit 95ea0486 ("btrfs: allow read-write for 4K sectorsize on 64K
      page size systems") added write support for 4K sectorsize on 64K page
      size systems. Fix the now stale comments.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      50780d9b
    • btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Committed by Josef Bacik
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
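      A minimal sketch of such a helper, assuming the existing
      BTRFS_FS_STATE_ERROR bit in fs_info->fs_state:
      
        /* Single check for the fs-wide error state. */
        #define BTRFS_FS_ERROR(fs_info)                                        \
                (unlikely(test_bit(BTRFS_FS_STATE_ERROR, &(fs_info)->fs_state)))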
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      84961539
    • btrfs: assert that extent buffers are write locked instead of only locked · 49d0c642
      Committed by Filipe Manana
      We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and
      that checks that we hold a lock either in read mode or write mode.
      
      However in all contexts we use btrfs_assert_tree_locked(), we actually
      want to check if we are holding a write lock on the extent buffer's rw
      semaphore - it would be a bug if in any of those contexts we were holding
      a read lock instead.
      
      So change btrfs_assert_tree_locked() to use lockdep_assert_held_write()
      instead and, to make it more explicit, rename btrfs_assert_tree_locked()
      to btrfs_assert_tree_write_locked(), so that it's clear we want to check
      we are holding a write lock.
      
      For now there are no contexts where we want to assert that we must have
      a read lock, but in case that is needed in the future, we can add a new
      helper function that just calls out lockdep_assert_held_read().
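      A sketch of the renamed assertion, relying on lockdep's write-held
      check for the extent buffer's rw semaphore:
      
        static inline void btrfs_assert_tree_write_locked(struct extent_buffer *eb)
        {
                lockdep_assert_held_write(&eb->lock);
        }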
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      49d0c642
    • btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Committed by Qu Wenruo
      Previously we had "struct btrfs_bio", which records the IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs specific info for a logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The meaning of struct btrfs_bio changes with this commit. A name like
      btrfs_logical_bio was suggested but it's a bit long and we'd prefer to
      use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3a3b19b
    • btrfs: zoned: add a dedicated data relocation block group · c2707a25
      Committed by Johannes Thumshirn
      Relocation in a zoned filesystem can fail with a transaction abort with
      error -22 (EINVAL). This happens because the relocation code assumes
      that the extents we relocate the data to have the same size as the
      source extents, and ensures this by preallocating the extents.
      
      But in a zoned filesystem we currently can't preallocate the extents as
      this would break the sequential write required rule. Therefore it can
      happen that the writeback process kicks in while we're still adding pages
      to a delalloc range and starts writing out dirty pages.
      
      This then creates destination extents that are smaller than the source
      extents, triggering the following safety check in get_new_location():
      
       1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
       1035                 ret = -EINVAL;
       1036                 goto out;
       1037         }
      
      Temporarily create a dedicated block group for the relocation process, so
      no non-relocation data writes can interfere with the relocation writes.
      
      This is needed so that we can switch the relocation process on a zoned
      filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a
      scheme like that of a non-zoned filesystem, using REQ_OP_WRITE and
      preallocation.
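      A rough sketch of the allocator-side idea, assuming hypothetical
      data_reloc_bg/relocation_bg_lock fields in btrfs_fs_info and a
      for_data_reloc flag passed down by the caller; the real integration
      with the extent allocator is more involved:
      
        bool skip;
        
        spin_lock(&fs_info->relocation_bg_lock);
        if (block_group->start == fs_info->data_reloc_bg)
                skip = !for_data_reloc;   /* reserved for relocation writes */
        else
                skip = for_data_reloc;    /* relocation only uses its own BG */
        spin_unlock(&fs_info->relocation_bg_lock);
        if (skip)
                return 1;                 /* try another block group */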
      
      Fixes: 32430c61 ("btrfs: zoned: enable relocation on a zoned filesystem")
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c2707a25
    • btrfs: introduce btrfs_is_data_reloc_root · 37f00a6d
      Committed by Johannes Thumshirn
      There are several places in our codebase where we check if a root is the
      root of the data reloc tree and subsequent patches will introduce more.
      
      Factor out the check into a small helper function instead of open coding
      it multiple times.
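      A sketch of what such a helper can look like, assuming the data
      relocation tree is identified by BTRFS_DATA_RELOC_TREE_OBJECTID as
      elsewhere in the codebase:
      
        static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
        {
                return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID;
        }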
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      37f00a6d