1. 16 May 2022, 10 commits
  2. 06 April 2022, 1 commit
  3. 25 March 2022, 1 commit
    • btrfs: remove device item and update super block in the same transaction · bbac5869
      Authored by Qu Wenruo
      [BUG]
      There is a report that a btrfs has a bad super block num devices.
      
      This makes btrfs reject the fs completely.
      
        BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
        BTRFS error (device sdd3): failed to read chunk tree: -22
        BTRFS error (device sdd3): open_ctree failed
      
      [CAUSE]
      During btrfs device removal, chunk tree and super block num devs are
      updated in two different transactions:
      
        btrfs_rm_device()
        |- btrfs_rm_dev_item(device)
        |  |- trans = btrfs_start_transaction()
        |  |  Now we got transaction X
        |  |
        |  |- btrfs_del_item()
        |  |  Now device item is removed from chunk tree
        |  |
        |  |- btrfs_commit_transaction()
        |     Transaction X got committed, super num devs untouched,
        |     but device item removed from chunk tree.
        |     (AKA, super num devs is already incorrect)
        |
        |- cur_devices->num_devices--;
        |- cur_devices->total_devices--;
        |- btrfs_set_super_num_devices()
           All those operations are not in transaction X, thus it will
           only be written back to disk in next transaction.
      
      So if a power loss happens after the transaction X in btrfs_rm_dev_item()
      is committed, but before transaction X+1 (which can be minutes away), we
      get the super num devices mismatch.
      
      [FIX]
      Instead of starting and committing a transaction inside
      btrfs_rm_dev_item(), start a transaction inside btrfs_rm_device() and
      pass it to btrfs_rm_dev_item().
      
      And only commit the transaction after everything is done.
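
      Below is a minimal sketch of the restructured flow; the function
      signatures and surrounding details are assumptions, only the ordering
      matters:

        trans = btrfs_start_transaction(fs_info->chunk_root, 0);

        ret = btrfs_rm_dev_item(trans, device);  /* delete device item, no commit */

        cur_devices->num_devices--;
        cur_devices->total_devices--;
        btrfs_set_super_num_devices(fs_info->super_copy, num_devices);

        btrfs_commit_transaction(trans);         /* all updates land atomically */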
      Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 14 March 2022, 10 commits
  5. 07 January 2022, 3 commits
    • btrfs: remove reada infrastructure · f26c9238
      Authored by Qu Wenruo
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for that single user, it's not providing the functionality
      scrub needs, as scrub needs reada for the commit root, which the current
      readahead can't provide (although it would be pretty easy to add such a
      feature).
      
      Despite this, there are some extra problems related to metadata
      readahead:
      
      - Duplicated feature with btrfs_path::reada
      
      - Partly duplicated feature of btrfs_fs_info::buffer_radix
        Btrfs already caches its metadata in buffer_radix, while readahead
        tries to read the tree block no matter if it's already cached.
      
      - Poor layer separation
        Metadata readahead works kinda at device level.
        This is definitely not the correct layer for it, since metadata is in
        the btrfs logical address space and should not bother the device at
        all.
      
        This brings extra chances for bugs to sneak in, while adding
        unnecessary complexity.
      
      - Dead code
        In the very beginning of scrub.c we have #undef DEBUG, rendering all
        the debug-related code useless and impossible to test.
      
      Thus here I propose to remove the metadata readahead mechanism
      completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
      The numbers are the reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
      	Old		New		Diff
      SSD	455.3		466.332		+2.42%
      HDD	103.927 	98.012		-5.69%
      
      Comprehensive test methodology is in the cover letter of the patch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: sink zone check into btrfs_repair_one_zone · 554aed7d
      Authored by Johannes Thumshirn
      Sink zone check into btrfs_repair_one_zone() so we don't need to do it
      in all callers.
      
      Also, as btrfs_repair_one_zone() doesn't return a sensible error, make
      it a boolean function that returns false when called on a non-zoned
      filesystem and true on a zoned one.
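
      A minimal sketch of the resulting shape, with the repair body itself
      elided:

        bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical)
        {
                if (!btrfs_is_zoned(fs_info))
                        return false;

                /* ... the actual zone repair ... */
                return true;
        }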
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce exclusive operation BALANCE_PAUSED state · efc0e69c
      Authored by Nikolay Borisov
      The current set of exclusive operation states is not sufficient to
      handle all practical use cases. In particular, there is a need to be
      able to add a device to a filesystem that has a paused balance.
      Currently there is no way to distinguish between a running and a paused
      balance. Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED, which is
      going to be set on 2 occasions (see the sketch after the list):
      
      1. When a filesystem is mounted with skip_balance and there is an
         unfinished balance, it will now be in the BALANCE_PAUSED state
         instead of simply BALANCE.
      
      2. When a running balance is paused.
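
      A hedged sketch of where the new state could sit in the exclusive
      operation enum; the ordering and the neighboring values are assumptions,
      not taken from the patch:

        enum btrfs_exclusive_operation {
                BTRFS_EXCLOP_NONE,
                BTRFS_EXCLOP_BALANCE_PAUSED,    /* new: balance exists but is paused */
                BTRFS_EXCLOP_BALANCE,
                BTRFS_EXCLOP_DEV_ADD,           /* may now proceed while paused */
                BTRFS_EXCLOP_DEV_REMOVE,
                BTRFS_EXCLOP_DEV_REPLACE,
                BTRFS_EXCLOP_RESIZE,
                BTRFS_EXCLOP_SWAP_ACTIVATE,
        };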
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 03 January 2022, 6 commits
    • btrfs: don't use the extent root in btrfs_chunk_alloc_add_chunk_item · fd51eb2f
      Authored by Josef Bacik
      We're just using the extent_root to set the chunk owner to
      root_key->objectid, which is BTRFS_EXTENT_TREE_OBJECTID, so use that
      directly instead of using the root.
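
      An illustrative before/after, hedged: the setter name follows the
      stack-chunk accessor convention and is an assumption here:

        - btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
        + btrfs_set_stack_chunk_owner(chunk, BTRFS_EXTENT_TREE_OBJECTID);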
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't check stripe length if the profile is not stripe based · bf08387f
      Authored by Qu Wenruo
      [BUG]
      When debugging calc_bio_boundaries(), I found that even for RAID1
      metadata, we're following stripe length to calculate stripe boundary.
      
        # mkfs.btrfs -m raid1 -d raid1 /dev/test/scratch[12]
        # mount /dev/test/scratch /mnt/btrfs
        # xfs_io -f -c "pwrite 0 64K" /mnt/btrfs/file
        # umount
      
      The above very basic operations will make calc_bio_boundaries() report
      the following result:
      
        submit_extent_page: r/i=1/1 file_offset=22036480 len_to_stripe_boundary=49152
        submit_extent_page: r/i=1/1 file_offset=30474240 len_to_stripe_boundary=65536
        ...
        submit_extent_page: r/i=1/1 file_offset=30523392 len_to_stripe_boundary=16384
        submit_extent_page: r/i=1/1 file_offset=30457856 len_to_stripe_boundary=16384
        submit_extent_page: r/i=5/257 file_offset=0 len_to_stripe_boundary=65536
        submit_extent_page: r/i=5/257 file_offset=65536 len_to_stripe_boundary=65536
        submit_extent_page: r/i=1/1 file_offset=30490624 len_to_stripe_boundary=49152
        submit_extent_page: r/i=1/1 file_offset=30507008 len_to_stripe_boundary=32768
      
      Where "r/i" is the rootid and inode number; 1/1 means they are metadata.
      The remaining names match the members used in the kernel.
      
      Even though all data/metadata are using RAID1, we're still following the
      stripe length.
      
      [CAUSE]
      This behavior is caused by a wrong condition in btrfs_get_io_geometry():
      
      	if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
      		/* Fill using stripe_len */
      		len = min_t(u64, em->len - offset, max_len);
      	} else {
      		len = em->len - offset;
      	}
      
      This means that only for SINGLE will we not follow stripe_len.

      However, profiles like RAID1* and DUP don't need to bother with
      stripe_len.
      
      This can lead to unnecessary bio splits for RAID1*/DUP profiles, and can
      even be a blocker for future zoned RAID support.
      
      [FIX]
      Introduce one single-use macro, BTRFS_BLOCK_GROUP_STRIPE_MASK, and
      change the condition to only calculate the length using stripe length
      for stripe based profiles.
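
      A hedged sketch of the macro and the fixed condition; the exact set of
      profiles in the mask is an assumption (the stripe based ones):

        #define BTRFS_BLOCK_GROUP_STRIPE_MASK   (BTRFS_BLOCK_GROUP_RAID0 | \
                                                 BTRFS_BLOCK_GROUP_RAID10 | \
                                                 BTRFS_BLOCK_GROUP_RAID56_MASK)

        if (map->type & BTRFS_BLOCK_GROUP_STRIPE_MASK) {
                /* Only stripe based profiles need the stripe_len boundary */
                len = min_t(u64, em->len - offset, max_len);
        } else {
                len = em->len - offset;
        }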
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: cache reported zone during mount · 16beac87
      Authored by Naohiro Aota
      When mounting a device, we are reporting the zones twice: once for
      checking the zone attributes in btrfs_get_dev_zone_info() and once for
      loading block groups' zone info in
      btrfs_load_block_group_zone_info(). With a lot of block groups, that
      leads to a lot of REPORT ZONE commands and slows down the mount
      process.
      
      This patch introduces a zone info cache in struct
      btrfs_zoned_device_info. The cache is populated in
      btrfs_get_dev_zone_info() and used by
      btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
      commands. The zone cache is then released after loading the block
      groups, as it will not be very effective at run time.
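
      A minimal sketch of the caching idea; the field name and placement are
      assumptions, not the actual struct layout:

        struct btrfs_zoned_device_info {
                /* ... existing zone attributes ... */
                struct blk_zone *zone_cache;    /* cached REPORT ZONE results */
        };

        /*
         * Populated once in btrfs_get_dev_zone_info(); consulted instead of
         * issuing another REPORT ZONE in btrfs_load_block_group_zone_info();
         * freed after all block groups are loaded.
         */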
      
      Benchmark: Mount an HDD with 57,007 block groups
      Before patch: 171.368 seconds
      After patch: 64.064 seconds
      
      While it still takes a minute due to the slowness of loading all the
      block groups, the patch cuts the mount time to roughly 1/3 (64 vs 171
      seconds).
      
      Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/
      Fixes: 5b316468 ("btrfs: get zone information of zoned block devices")
      CC: stable@vger.kernel.org
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: consolidate device_list_mutex in prepare_sprout to its parent · 849eae5e
      Authored by Anand Jain
      btrfs_prepare_sprout() splices seed devices into its own struct fs_devices,
      so that its parent function btrfs_init_new_device() can add the new sprout
      device to fs_info->fs_devices.
      
      Both btrfs_prepare_sprout() and btrfs_init_new_device() need
      device_list_mutex. But they are holding it separately, thus creating a
      small race window. Close it and hold device_list_mutex across both
      functions btrfs_init_new_device() and btrfs_prepare_sprout().
      
      Split btrfs_prepare_sprout() into btrfs_init_sprout() and
      btrfs_setup_sprout(). This split is essential because device_list_mutex
      must not be held for allocations in btrfs_init_sprout() but must be held
      for btrfs_setup_sprout(). So now a common device_list_mutex can be used
      between btrfs_init_new_device() and btrfs_setup_sprout().
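
      A hedged sketch of the resulting shape of btrfs_init_new_device(); the
      surrounding variables and exact call sequence are assumptions:

        if (seeding_dev) {
                seed_devices = btrfs_init_sprout(fs_info);  /* allocations, no mutex */
                if (IS_ERR(seed_devices))
                        goto error;
        }

        mutex_lock(&fs_devices->device_list_mutex);
        if (seeding_dev)
                btrfs_setup_sprout(fs_info, seed_devices); /* splice under the mutex */
        /* ... add the new device to fs_info->fs_devices, still under the mutex ... */
        mutex_unlock(&fs_devices->device_list_mutex);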
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch seeding_dev in init_new_device to bool · fd880809
      Authored by Anand Jain
      Declare seeding_dev as a bool instead of an int. Also, move its
      declaration a line below to adjust packing.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop the _nr from the item helpers · 3212fa14
      Authored by Josef Bacik
      Now that all call sites are using the slot number to modify item values,
      rename the SETGET helpers to raw_item_*(), rework the _nr() helpers to
      be the btrfs_item_*() and btrfs_set_item_*() helpers, and then rename
      all of the callers to the new helpers.
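
      An illustrative before/after, hedged to the naming pattern only:

        /* before: slot-based access used the _nr suffix */
        u32 size = btrfs_item_size_nr(leaf, slot);

        /* after: the plain name takes the slot number */
        u32 size = btrfs_item_size(leaf, slot);

        /* helpers taking a struct btrfs_item pointer become raw_item_*() */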
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 16 December 2021, 1 commit
  8. 16 November 2021, 1 commit
    • btrfs: silence lockdep when reading chunk tree during mount · 4d9380e0
      Authored by Filipe Manana
      Test cases like btrfs/161 often trigger lockdep splats that complain
      about a possible unsafe locking scenario, due to the fact that during
      mount, when reading the chunk tree, we end up calling
      blkdev_get_by_path() while holding a read lock on a leaf of the chunk
      tree. That produces a lockdep splat like the following:
      
      [ 3653.683975] ======================================================
      [ 3653.685148] WARNING: possible circular locking dependency detected
      [ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted
      [ 3653.687239] ------------------------------------------------------
      [ 3653.688400] mount/447465 is trying to acquire lock:
      [ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.691054]
                     but task is already holding lock:
      [ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.693978]
                     which lock already depends on the new lock.
      
      [ 3653.695510]
                     the existing dependency chain (in reverse order) is:
      [ 3653.696915]
                     -> #3 (btrfs-chunk-00){++++}-{3:3}:
      [ 3653.698053]        down_read_nested+0x4b/0x140
      [ 3653.698893]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.699988]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 3653.701205]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 3653.702234]        btrfs_insert_empty_items+0x32/0x70 [btrfs]
      [ 3653.703332]        btrfs_init_new_device+0x563/0x15b0 [btrfs]
      [ 3653.704439]        btrfs_ioctl+0x2110/0x3530 [btrfs]
      [ 3653.705405]        __x64_sys_ioctl+0x83/0xb0
      [ 3653.706215]        do_syscall_64+0x3b/0xc0
      [ 3653.706990]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.708040]
                     -> #2 (sb_internal#2){.+.+}-{0:0}:
      [ 3653.708994]        lock_release+0x13d/0x4a0
      [ 3653.709533]        up_write+0x18/0x160
      [ 3653.710017]        btrfs_sync_file+0x3f3/0x5b0 [btrfs]
      [ 3653.710699]        __loop_update_dio+0xbd/0x170 [loop]
      [ 3653.711360]        lo_ioctl+0x3b1/0x8a0 [loop]
      [ 3653.711929]        block_ioctl+0x48/0x50
      [ 3653.712442]        __x64_sys_ioctl+0x83/0xb0
      [ 3653.712991]        do_syscall_64+0x3b/0xc0
      [ 3653.713519]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.714233]
                     -> #1 (&lo->lo_mutex){+.+.}-{3:3}:
      [ 3653.715026]        __mutex_lock+0x92/0x900
      [ 3653.715648]        lo_open+0x28/0x60 [loop]
      [ 3653.716275]        blkdev_get_whole+0x28/0x90
      [ 3653.716867]        blkdev_get_by_dev.part.0+0x142/0x320
      [ 3653.717537]        blkdev_open+0x5e/0xa0
      [ 3653.718043]        do_dentry_open+0x163/0x390
      [ 3653.718604]        path_openat+0x3f0/0xa80
      [ 3653.719128]        do_filp_open+0xa9/0x150
      [ 3653.719652]        do_sys_openat2+0x97/0x160
      [ 3653.720197]        __x64_sys_openat+0x54/0x90
      [ 3653.720766]        do_syscall_64+0x3b/0xc0
      [ 3653.721285]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.721986]
                     -> #0 (&disk->open_mutex){+.+.}-{3:3}:
      [ 3653.722775]        __lock_acquire+0x130e/0x2210
      [ 3653.723348]        lock_acquire+0xd7/0x310
      [ 3653.723867]        __mutex_lock+0x92/0x900
      [ 3653.724394]        blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.725041]        blkdev_get_by_path+0xb8/0xd0
      [ 3653.725614]        btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
      [ 3653.726332]        open_fs_devices+0xd7/0x2c0 [btrfs]
      [ 3653.726999]        btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
      [ 3653.727739]        open_ctree+0xb8e/0x17bf [btrfs]
      [ 3653.728384]        btrfs_mount_root.cold+0x12/0xde [btrfs]
      [ 3653.729130]        legacy_get_tree+0x30/0x50
      [ 3653.729676]        vfs_get_tree+0x28/0xc0
      [ 3653.730192]        vfs_kern_mount.part.0+0x71/0xb0
      [ 3653.730800]        btrfs_mount+0x11d/0x3a0 [btrfs]
      [ 3653.731427]        legacy_get_tree+0x30/0x50
      [ 3653.731970]        vfs_get_tree+0x28/0xc0
      [ 3653.732486]        path_mount+0x2d4/0xbe0
      [ 3653.732997]        __x64_sys_mount+0x103/0x140
      [ 3653.733560]        do_syscall_64+0x3b/0xc0
      [ 3653.734080]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.734782]
                     other info that might help us debug this:
      
      [ 3653.735784] Chain exists of:
                       &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00
      
      [ 3653.737123]  Possible unsafe locking scenario:
      
      [ 3653.737865]        CPU0                    CPU1
      [ 3653.738435]        ----                    ----
      [ 3653.739007]   lock(btrfs-chunk-00);
      [ 3653.739449]                                lock(sb_internal#2);
      [ 3653.740193]                                lock(btrfs-chunk-00);
      [ 3653.740955]   lock(&disk->open_mutex);
      [ 3653.741431]
                      *** DEADLOCK ***
      
      [ 3653.742176] 3 locks held by mount/447465:
      [ 3653.742739]  #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0
      [ 3653.744114]  #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs]
      [ 3653.745563]  #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.747066]
                     stack backtrace:
      [ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1
      [ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 3653.750592] Call Trace:
      [ 3653.750967]  dump_stack_lvl+0x57/0x72
      [ 3653.751526]  check_noncircular+0xf3/0x110
      [ 3653.752136]  ? stack_trace_save+0x4b/0x70
      [ 3653.752748]  __lock_acquire+0x130e/0x2210
      [ 3653.753356]  lock_acquire+0xd7/0x310
      [ 3653.753898]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.754596]  ? lock_is_held_type+0xe8/0x140
      [ 3653.755125]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.755729]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.756338]  __mutex_lock+0x92/0x900
      [ 3653.756794]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.757400]  ? do_raw_spin_unlock+0x4b/0xa0
      [ 3653.757930]  ? _raw_spin_unlock+0x29/0x40
      [ 3653.758437]  ? bd_prepare_to_claim+0x129/0x150
      [ 3653.758999]  ? trace_module_get+0x2b/0xd0
      [ 3653.759508]  ? try_module_get.part.0+0x50/0x80
      [ 3653.760072]  blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.760661]  ? devcgroup_check_permission+0xc1/0x1f0
      [ 3653.761288]  blkdev_get_by_path+0xb8/0xd0
      [ 3653.761797]  btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
      [ 3653.762454]  open_fs_devices+0xd7/0x2c0 [btrfs]
      [ 3653.763055]  ? clone_fs_devices+0x8f/0x170 [btrfs]
      [ 3653.763689]  btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
      [ 3653.764370]  ? kvm_sched_clock_read+0x14/0x40
      [ 3653.764922]  open_ctree+0xb8e/0x17bf [btrfs]
      [ 3653.765493]  ? super_setup_bdi_name+0x79/0xd0
      [ 3653.766043]  btrfs_mount_root.cold+0x12/0xde [btrfs]
      [ 3653.766780]  ? rcu_read_lock_sched_held+0x3f/0x80
      [ 3653.767488]  ? kfree+0x1f2/0x3c0
      [ 3653.767979]  legacy_get_tree+0x30/0x50
      [ 3653.768548]  vfs_get_tree+0x28/0xc0
      [ 3653.769076]  vfs_kern_mount.part.0+0x71/0xb0
      [ 3653.769718]  btrfs_mount+0x11d/0x3a0 [btrfs]
      [ 3653.770381]  ? rcu_read_lock_sched_held+0x3f/0x80
      [ 3653.771086]  ? kfree+0x1f2/0x3c0
      [ 3653.771574]  legacy_get_tree+0x30/0x50
      [ 3653.772136]  vfs_get_tree+0x28/0xc0
      [ 3653.772673]  path_mount+0x2d4/0xbe0
      [ 3653.773201]  __x64_sys_mount+0x103/0x140
      [ 3653.773793]  do_syscall_64+0x3b/0xc0
      [ 3653.774333]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.775094] RIP: 0033:0x7f648bc45aaa
      
      This happens because btrfs_read_chunk_tree(), which is called only
      during mount, ends up acquiring the open_mutex of a block device while
      holding a read lock on a leaf of the chunk tree, while other paths need
      to acquire other locks before locking extent buffers of the chunk tree.
      
      Since at mount time when we call btrfs_read_chunk_tree() we know that
      we don't have other tasks running in parallel and modifying the chunk
      tree, we can simply skip locking of chunk tree extent buffers. So do
      that and move the assertion that checks the fs is not yet mounted to the
      top block of btrfs_read_chunk_tree(), with a comment before doing it.
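
      A minimal sketch of the idea; the exact assertion is an assumption:

        /*
         * At mount time no other task can be modifying the chunk tree, so
         * taking read locks on its extent buffers is not needed.
         */
        ASSERT(!test_bit(BTRFS_FS_OPEN, &fs_info->flags));
        path->skip_locking = 1;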
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 29 October 2021, 1 commit
    • btrfs: clear MISSING device status bit in btrfs_close_one_device · 5d03dbeb
      Authored by Li Zhang
      Reported bug: https://github.com/kdave/btrfs-progs/issues/389
      
      There's a problem with scrub reporting an aborted status but returning
      error code 0, on a filesystem with a missing and re-added device.
      
      Roughly these steps:
      
      - mkfs -d raid1 dev1 dev2
      - fill with data
      - unmount
      - make dev1 disappear
      - mount -o degraded
      - copy more data
      - make dev1 appear again
      
      Running scrub afterwards reports that the command was aborted, but the
      system log message says the exit code was 0.
      
      It seems that the cause of the error is decrementing
      fs_devices->missing_devices without clearing device->dev_state. Every
      time we unmount the filesystem, close_ctree is called, which eventually
      invokes btrfs_close_one_device to close the device; it decrements
      fs_devices->missing_devices but does not clear the device's
      BTRFS_DEV_STATE_MISSING bit. Worse, this bug causes an unsigned integer
      underflow, because fs_devices->missing_devices decreases on every
      unmount. Once fs_devices->missing_devices hits 0, the next decrement
      wraps around.
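
      A minimal sketch of the corresponding fix in btrfs_close_one_device(),
      hedged against the actual patch:

        if (device->devid != BTRFS_DEV_REPLACE_DEVID &&
            test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
                clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
                fs_devices->missing_devices--;
        }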
      
      With added debugging:
      
         loop1: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 0
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
      
      If fs_devices->missing_devices is 0, the next decrement wraps it around
      to 18446744073709551615.
      
      After applying this patch, fs_devices->missing_devices stays correct:
      
        $ truncate -s 10g test1
        $ truncate -s 10g test2
        $ losetup /dev/loop1 test1
        $ losetup /dev/loop2 test2
        $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
        $ losetup -d /dev/loop2
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ dmesg
      
         loop1: detected capacity change from 0 to 20971520
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): checking UUID tree
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 27 October 2021, 6 commits
    • btrfs: update device path inode time instead of bd_inode · 54fde91f
      Authored by Josef Bacik
      Christoph pointed out that I'm updating bdev->bd_inode for the device
      time when we remove block devices from a btrfs file system; however,
      this isn't actually exposed to anything.  The inode we want to update is
      the one that's associated with the path to the device, usually on
      devtmpfs, so that blkid notices the difference.
      
      We still don't want to do the blkdev_open, so use kern_path() to get the
      path to the given device and update the time on that inode.
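
      A hedged sketch of the approach; the helper name is illustrative, while
      kern_path() and inode_update_time() are existing kernel APIs:

        static void update_dev_time(const char *device_path)
        {
                struct path path;
                struct timespec64 now;
                int ret;

                ret = kern_path(device_path, LOOKUP_FOLLOW, &path);
                if (ret)
                        return;

                now = current_time(d_inode(path.dentry));
                inode_update_time(d_inode(path.dentry), &now, S_MTIME | S_CTIME);
                path_put(&path);
        }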
      
      Fixes: 8f96a5bf ("btrfs: update the bdev time directly when closing")
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
      Authored by Filipe Manana
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      So fix this by making sure that whenever we try to modify the chunk btree
      and we are neither in a chunk allocation context nor in a chunk remove
      context, we reserve system space before modifying the chunk btree.
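
      A hedged sketch of the pattern applied at such call sites; the reserve
      helper's name and signature are assumptions here, while
      btrfs_trans_release_chunk_metadata() is a pre-existing helper:

        /* e.g. around a device item update outside chunk alloc/removal */
        ret = btrfs_reserve_chunk_metadata(trans, false);
        if (ret)
                return ret;
        ret = btrfs_update_device(trans, device);
        btrfs_trans_release_chunk_metadata(trans);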
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls · 1a15eb72
      Authored by Josef Bacik
      For device removal and replace we call btrfs_find_device_by_devspec,
      which, if given a device path and nothing else, will call
      btrfs_get_dev_args_from_path; that opens the block device, reads the
      super block, and then looks up our device based on that.
      
      However, at this point we're holding the sb write "lock", so reading the
      block device pulls in a dependency on ->open_mutex, which produces the
      following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #405 Not tainted
      ------------------------------------------------------
      losetup/11576 is trying to acquire lock:
      ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_get_by_path+0x98/0xa0
             btrfs_get_bdev_and_sb+0x1b/0xb0
             btrfs_find_device_by_devspec+0x12b/0x1c0
             btrfs_rm_device+0x127/0x610
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11576:
       #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f31b02404cb
      
      Instead what we want to do is populate our device lookup args before we
      grab any locks, and then pass these args into btrfs_rm_device().  From
      there we can find the device and do the appropriate removal.
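
      A hedged sketch of the reordered ioctl flow; the helper and macro names
      follow the companion patches in this series, and the exact signatures
      are assumptions:

        BTRFS_DEV_LOOKUP_ARGS(args);

        /* resolve the path to devid/uuid before taking any locks */
        ret = btrfs_get_dev_args_from_path(fs_info, &args, vol_args->name);
        if (ret)
                goto out;

        /* ... take the exclusive op / sb write lock ... */
        ret = btrfs_rm_device(fs_info, &args, &bdev, &mode);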
      Suggested-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a btrfs_get_dev_args_from_path helper · faa775c4
      Authored by Josef Bacik
      We are going to want to populate our device lookup args outside of any
      locks and then do the actual device lookup later, so add a helper to do
      this work and make btrfs_find_device_by_devspec() use this helper for
      now.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
      Authored by Josef Bacik
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
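
      A hedged sketch of the criteria struct; the field names are assumptions
      based on the description above:

        struct btrfs_dev_lookup_args {
                u64 devid;
                u8 *uuid;
                u8 *fsid;
                bool missing;
        };

        struct btrfs_device *btrfs_find_device(struct btrfs_fs_devices *fs_devices,
                                const struct btrfs_dev_lookup_args *args);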
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not call close_fs_devices in btrfs_rm_device · 8b41393f
      Authored by Josef Bacik
      There's a subtle case where if we're removing the seed device from a
      file system we need to free its private copy of the fs_devices.  However
      we do not need to call close_fs_devices(), because at this point there
      are no devices left to close as we've closed the last one.  The only
      thing that close_fs_devices() does is decrement ->opened, which should
      be 1.  We want to avoid calling close_fs_devices() here because it has a
      lockdep_assert_held(&uuid_mutex), and we are going to stop holding the
      uuid_mutex in this path.
      
      So simply decrement the ->opened counter like we should, and then clean
      up as normal.  Also add a comment explaining what we're doing here, as I
      initially removed this code erroneously.
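
      A hedged sketch of the seed cleanup path, simplified from the reasoning
      above:

        if (cur_devices->open_devices == 0) {
                /*
                 * Don't call close_fs_devices(): the last device is already
                 * closed and it asserts uuid_mutex, which this path will
                 * stop holding.
                 */
                list_del_init(&cur_devices->seed_list);
                ASSERT(cur_devices->opened == 1);
                cur_devices->opened--;
                free_fs_devices(cur_devices);
        }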
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>