1. 14 Mar 2022, 10 commits
  2. 07 Jan 2022, 3 commits
• btrfs: remove reada infrastructure · f26c9238
  Qu Wenruo authored
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for the single user, it's not providing the correct
functionality it needs: scrub needs readahead for the commit root, which
the current readahead code can't provide (although it would be pretty
easy to add such a feature).
      
      Despite this, there are some extra problems related to metadata
      readahead:
      
      - Duplicated feature with btrfs_path::reada
      
- Partly duplicated feature of btrfs_fs_info::buffer_radix
  Btrfs already caches its metadata in buffer_radix, while readahead
  tries to read the tree block regardless of whether it's already cached.
      
- Poor layer separation
  Metadata readahead works mostly at the device level.
  This is definitely not the correct layer for it, since metadata lives
  in the btrfs logical address space and should not bother with devices
  at all.
      
  This brings extra chances for bugs to sneak in, while adding
  unnecessary complexity.
      
      - Dead code
  In the very beginning of scrub.c we have #undef DEBUG, rendering all
  the debug-related code useless and impossible to test.
      
Thus here I propose to remove the metadata readahead mechanism
completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
The numbers are the reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
      	Old		New		Diff
      SSD	455.3		466.332		+2.42%
      HDD	103.927 	98.012		-5.69%
      
      Comprehensive test methodology is in the cover letter of the patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: zoned: sink zone check into btrfs_repair_one_zone · 554aed7d
  Johannes Thumshirn authored
      Sink zone check into btrfs_repair_one_zone() so we don't need to do it
      in all callers.
      
Also, as btrfs_repair_one_zone() doesn't return a sensible error, make
it a boolean function, returning false if it is called on a non-zoned
filesystem and true on a zoned filesystem.
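A minimal sketch of the resulting shape (the body is elided; only the
zone check and the boolean return come from the description above):

  bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical)
  {
          if (!btrfs_is_zoned(fs_info))
                  return false;

          /* ... reset and rewrite the zone containing @logical ... */
          return true;
  }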
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: introduce exclusive operation BALANCE_PAUSED state · efc0e69c
  Nikolay Borisov authored
The current set of exclusive operation states is not sufficient to
handle all practical use cases. In particular there is a need to be able
to add a device to a filesystem that has a paused balance. Currently
there is no way to distinguish between a running and a paused balance.
Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED, which is going to
be set on two occasions (a sketch of the extended enum follows the
list):
      
1. When a filesystem is mounted with skip_balance and there is an
   unfinished balance, it will now be in the BALANCE_PAUSED state
   instead of simply the BALANCE state.
      
      2. When a running balance is paused.
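A hedged sketch of the extended state enum (the members other than the
new one, and their ordering, are assumptions):

  enum btrfs_exclusive_operation {
          BTRFS_EXCLOP_NONE,
          BTRFS_EXCLOP_BALANCE_PAUSED,    /* new: balance exists but is paused */
          BTRFS_EXCLOP_BALANCE,
          BTRFS_EXCLOP_DEV_ADD,
          BTRFS_EXCLOP_DEV_REMOVE,
          BTRFS_EXCLOP_DEV_REPLACE,
          BTRFS_EXCLOP_RESIZE,
          BTRFS_EXCLOP_SWAP_ACTIVATE,
  };

With a distinct paused state, operations such as device add can be
allowed while BTRFS_EXCLOP_BALANCE_PAUSED is set.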
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
3. 03 Jan 2022, 6 commits
• btrfs: don't use the extent root in btrfs_chunk_alloc_add_chunk_item · fd51eb2f
  Josef Bacik authored
      We're just using the extent_root to set the chunk owner to
      root_key->objectid, which is BTRFS_EXTENT_TREE_OBJECTID, so use that
      directly instead of using the root.
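In diff form, the change amounts to something like this (the setter name
exists in the volume code; the exact call site is assumed):

  -     btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
  +     btrfs_set_stack_chunk_owner(chunk, BTRFS_EXTENT_TREE_OBJECTID);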
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: don't check stripe length if the profile is not stripe based · bf08387f
  Qu Wenruo authored
      [BUG]
      When debugging calc_bio_boundaries(), I found that even for RAID1
      metadata, we're following stripe length to calculate stripe boundary.
      
        # mkfs.btrfs -m raid1 -d raid1 /dev/test/scratch[12]
        # mount /dev/test/scratch /mnt/btrfs
        # xfs_io -f -c "pwrite 0 64K" /mnt/btrfs/file
        # umount
      
The above very basic operations will make calc_bio_boundaries() report
the following results:
      
        submit_extent_page: r/i=1/1 file_offset=22036480 len_to_stripe_boundary=49152
        submit_extent_page: r/i=1/1 file_offset=30474240 len_to_stripe_boundary=65536
        ...
        submit_extent_page: r/i=1/1 file_offset=30523392 len_to_stripe_boundary=16384
        submit_extent_page: r/i=1/1 file_offset=30457856 len_to_stripe_boundary=16384
        submit_extent_page: r/i=5/257 file_offset=0 len_to_stripe_boundary=65536
        submit_extent_page: r/i=5/257 file_offset=65536 len_to_stripe_boundary=65536
        submit_extent_page: r/i=1/1 file_offset=30490624 len_to_stripe_boundary=49152
        submit_extent_page: r/i=1/1 file_offset=30507008 len_to_stripe_boundary=32768
      
      Where "r/i" is the rootid and inode, 1/1 means they metadata.
      The remaining names match the member used in kernel.
      
Even though all data/metadata are using RAID1, we're still following the
stripe length.
      
      [CAUSE]
      This behavior is caused by a wrong condition in btrfs_get_io_geometry():
      
      	if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
      		/* Fill using stripe_len */
      		len = min_t(u64, em->len - offset, max_len);
      	} else {
      		len = em->len - offset;
      	}
      
This means that only for the SINGLE profile will we not follow
stripe_len.
      
However, profiles like RAID1* and DUP don't need to bother with
stripe_len.
      
This can lead to unnecessary bio splits for RAID1*/DUP profiles, and can
even be a blocker for future zoned RAID support.
      
      [FIX]
Introduce a single-use macro, BTRFS_BLOCK_GROUP_STRIPE_MASK, and change
the condition to calculate the length using stripe length only for
stripe based profiles, as sketched below.
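A sketch of the macro and the adjusted condition from [CAUSE] (the exact
profile bits grouped by the mask are an assumption):

  #define BTRFS_BLOCK_GROUP_STRIPE_MASK   (BTRFS_BLOCK_GROUP_RAID0 | \
                                           BTRFS_BLOCK_GROUP_RAID10 | \
                                           BTRFS_BLOCK_GROUP_RAID56_MASK)

        if (map->type & BTRFS_BLOCK_GROUP_STRIPE_MASK) {
                /* Fill using stripe_len */
                len = min_t(u64, em->len - offset, max_len);
        } else {
                len = em->len - offset;
        }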
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: zoned: cache reported zone during mount · 16beac87
  Naohiro Aota authored
      When mounting a device, we are reporting the zones twice: once for
      checking the zone attributes in btrfs_get_dev_zone_info and once for
      loading block groups' zone info in
      btrfs_load_block_group_zone_info(). With a lot of block groups, that
      leads to a lot of REPORT ZONE commands and slows down the mount
      process.
      
      This patch introduces a zone info cache in struct
btrfs_zoned_device_info. The cache is populated in
btrfs_get_dev_zone_info() and used by
btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
commands. The zone cache is then released after loading the block
groups, as it will not be very effective at run time.
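A hedged sketch of where the cache lives (the member name is an
assumption):

  struct btrfs_zoned_device_info {
          ...
          /*
           * Populated in btrfs_get_dev_zone_info(), consulted by
           * btrfs_load_block_group_zone_info(), freed once the block
           * groups are loaded.
           */
          struct blk_zone *zone_cache;
  };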
      
      Benchmark: Mount an HDD with 57,007 block groups
      Before patch: 171.368 seconds
      After patch: 64.064 seconds
      
While it still takes a minute due to the slowness of loading all the
block groups, the patch reduces the mount time to roughly a third of
what it was.
      
      Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/
      Fixes: 5b316468 ("btrfs: get zone information of zoned block devices")
      CC: stable@vger.kernel.org
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: consolidate device_list_mutex in prepare_sprout to its parent · 849eae5e
  Anand Jain authored
      btrfs_prepare_sprout() splices seed devices into its own struct fs_devices,
      so that its parent function btrfs_init_new_device() can add the new sprout
      device to fs_info->fs_devices.
      
Both btrfs_prepare_sprout() and btrfs_init_new_device() need
device_list_mutex. But they are holding it separately, thus creating a
small race window. Close it by holding device_list_mutex across both
functions, btrfs_init_new_device() and btrfs_prepare_sprout().
      
      Split btrfs_prepare_sprout() into btrfs_init_sprout() and
      btrfs_setup_sprout(). This split is essential because device_list_mutex
      must not be held for allocations in btrfs_init_sprout() but must be held
      for btrfs_setup_sprout(). So now a common device_list_mutex can be used
      between btrfs_init_new_device() and btrfs_setup_sprout().
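A rough sketch of the resulting locking shape in btrfs_init_new_device()
(argument lists and surrounding details are assumptions):

  if (seeding_dev) {
          /* Allocations: must not hold device_list_mutex. */
          seed_devices = btrfs_init_sprout(fs_info);
          ...
  }

  mutex_lock(&fs_devices->device_list_mutex);
  if (seeding_dev)
          /* Splice the seed devices under the mutex. */
          btrfs_setup_sprout(fs_info, seed_devices);
  /* ... add the new device under the same mutex ... */
  mutex_unlock(&fs_devices->device_list_mutex);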
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: switch seeding_dev in init_new_device to bool · fd880809
  Anand Jain authored
      Declare int seeding_dev as a bool. Also, move its declaration a line
      below to adjust packing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: drop the _nr from the item helpers · 3212fa14
  Josef Bacik authored
Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), rework the _nr() helpers to
be the btrfs_item_*()/btrfs_set_item_*() helpers, and then rename all of
the callers to the new helpers.
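For example, with the item-size getter (assumed to be one of the
affected helpers):

  /* Before: slot-based access carried the _nr suffix. */
  u32 size = btrfs_item_size_nr(leaf, slot);

  /* After: the _nr suffix is dropped; raw struct access is raw_item_*(). */
  u32 size = btrfs_item_size(leaf, slot);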
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4. 16 Dec 2021, 1 commit
  5. 16 Nov 2021, 1 commit
• btrfs: silence lockdep when reading chunk tree during mount · 4d9380e0
  Filipe Manana authored
Often test cases like btrfs/161 trigger lockdep splats complaining about
a possible unsafe locking scenario, due to the fact that during mount,
when reading the chunk tree, we end up calling blkdev_get_by_path()
while holding a read lock on a leaf of the chunk tree. That produces a
lockdep splat like the following:
      
      [ 3653.683975] ======================================================
      [ 3653.685148] WARNING: possible circular locking dependency detected
      [ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted
      [ 3653.687239] ------------------------------------------------------
      [ 3653.688400] mount/447465 is trying to acquire lock:
      [ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.691054]
                     but task is already holding lock:
      [ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.693978]
                     which lock already depends on the new lock.
      
      [ 3653.695510]
                     the existing dependency chain (in reverse order) is:
      [ 3653.696915]
                     -> #3 (btrfs-chunk-00){++++}-{3:3}:
      [ 3653.698053]        down_read_nested+0x4b/0x140
      [ 3653.698893]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.699988]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 3653.701205]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 3653.702234]        btrfs_insert_empty_items+0x32/0x70 [btrfs]
      [ 3653.703332]        btrfs_init_new_device+0x563/0x15b0 [btrfs]
      [ 3653.704439]        btrfs_ioctl+0x2110/0x3530 [btrfs]
      [ 3653.705405]        __x64_sys_ioctl+0x83/0xb0
      [ 3653.706215]        do_syscall_64+0x3b/0xc0
      [ 3653.706990]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.708040]
                     -> #2 (sb_internal#2){.+.+}-{0:0}:
      [ 3653.708994]        lock_release+0x13d/0x4a0
      [ 3653.709533]        up_write+0x18/0x160
      [ 3653.710017]        btrfs_sync_file+0x3f3/0x5b0 [btrfs]
      [ 3653.710699]        __loop_update_dio+0xbd/0x170 [loop]
      [ 3653.711360]        lo_ioctl+0x3b1/0x8a0 [loop]
      [ 3653.711929]        block_ioctl+0x48/0x50
      [ 3653.712442]        __x64_sys_ioctl+0x83/0xb0
      [ 3653.712991]        do_syscall_64+0x3b/0xc0
      [ 3653.713519]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.714233]
                     -> #1 (&lo->lo_mutex){+.+.}-{3:3}:
      [ 3653.715026]        __mutex_lock+0x92/0x900
      [ 3653.715648]        lo_open+0x28/0x60 [loop]
      [ 3653.716275]        blkdev_get_whole+0x28/0x90
      [ 3653.716867]        blkdev_get_by_dev.part.0+0x142/0x320
      [ 3653.717537]        blkdev_open+0x5e/0xa0
      [ 3653.718043]        do_dentry_open+0x163/0x390
      [ 3653.718604]        path_openat+0x3f0/0xa80
      [ 3653.719128]        do_filp_open+0xa9/0x150
      [ 3653.719652]        do_sys_openat2+0x97/0x160
      [ 3653.720197]        __x64_sys_openat+0x54/0x90
      [ 3653.720766]        do_syscall_64+0x3b/0xc0
      [ 3653.721285]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.721986]
                     -> #0 (&disk->open_mutex){+.+.}-{3:3}:
      [ 3653.722775]        __lock_acquire+0x130e/0x2210
      [ 3653.723348]        lock_acquire+0xd7/0x310
      [ 3653.723867]        __mutex_lock+0x92/0x900
      [ 3653.724394]        blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.725041]        blkdev_get_by_path+0xb8/0xd0
      [ 3653.725614]        btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
      [ 3653.726332]        open_fs_devices+0xd7/0x2c0 [btrfs]
      [ 3653.726999]        btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
      [ 3653.727739]        open_ctree+0xb8e/0x17bf [btrfs]
      [ 3653.728384]        btrfs_mount_root.cold+0x12/0xde [btrfs]
      [ 3653.729130]        legacy_get_tree+0x30/0x50
      [ 3653.729676]        vfs_get_tree+0x28/0xc0
      [ 3653.730192]        vfs_kern_mount.part.0+0x71/0xb0
      [ 3653.730800]        btrfs_mount+0x11d/0x3a0 [btrfs]
      [ 3653.731427]        legacy_get_tree+0x30/0x50
      [ 3653.731970]        vfs_get_tree+0x28/0xc0
      [ 3653.732486]        path_mount+0x2d4/0xbe0
      [ 3653.732997]        __x64_sys_mount+0x103/0x140
      [ 3653.733560]        do_syscall_64+0x3b/0xc0
      [ 3653.734080]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.734782]
                     other info that might help us debug this:
      
      [ 3653.735784] Chain exists of:
                       &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00
      
      [ 3653.737123]  Possible unsafe locking scenario:
      
      [ 3653.737865]        CPU0                    CPU1
      [ 3653.738435]        ----                    ----
      [ 3653.739007]   lock(btrfs-chunk-00);
      [ 3653.739449]                                lock(sb_internal#2);
      [ 3653.740193]                                lock(btrfs-chunk-00);
      [ 3653.740955]   lock(&disk->open_mutex);
      [ 3653.741431]
                      *** DEADLOCK ***
      
      [ 3653.742176] 3 locks held by mount/447465:
      [ 3653.742739]  #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0
      [ 3653.744114]  #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs]
      [ 3653.745563]  #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 3653.747066]
                     stack backtrace:
      [ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1
      [ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 3653.750592] Call Trace:
      [ 3653.750967]  dump_stack_lvl+0x57/0x72
      [ 3653.751526]  check_noncircular+0xf3/0x110
      [ 3653.752136]  ? stack_trace_save+0x4b/0x70
      [ 3653.752748]  __lock_acquire+0x130e/0x2210
      [ 3653.753356]  lock_acquire+0xd7/0x310
      [ 3653.753898]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.754596]  ? lock_is_held_type+0xe8/0x140
      [ 3653.755125]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.755729]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.756338]  __mutex_lock+0x92/0x900
      [ 3653.756794]  ? blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.757400]  ? do_raw_spin_unlock+0x4b/0xa0
      [ 3653.757930]  ? _raw_spin_unlock+0x29/0x40
      [ 3653.758437]  ? bd_prepare_to_claim+0x129/0x150
      [ 3653.758999]  ? trace_module_get+0x2b/0xd0
      [ 3653.759508]  ? try_module_get.part.0+0x50/0x80
      [ 3653.760072]  blkdev_get_by_dev.part.0+0xe7/0x320
      [ 3653.760661]  ? devcgroup_check_permission+0xc1/0x1f0
      [ 3653.761288]  blkdev_get_by_path+0xb8/0xd0
      [ 3653.761797]  btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
      [ 3653.762454]  open_fs_devices+0xd7/0x2c0 [btrfs]
      [ 3653.763055]  ? clone_fs_devices+0x8f/0x170 [btrfs]
      [ 3653.763689]  btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
      [ 3653.764370]  ? kvm_sched_clock_read+0x14/0x40
      [ 3653.764922]  open_ctree+0xb8e/0x17bf [btrfs]
      [ 3653.765493]  ? super_setup_bdi_name+0x79/0xd0
      [ 3653.766043]  btrfs_mount_root.cold+0x12/0xde [btrfs]
      [ 3653.766780]  ? rcu_read_lock_sched_held+0x3f/0x80
      [ 3653.767488]  ? kfree+0x1f2/0x3c0
      [ 3653.767979]  legacy_get_tree+0x30/0x50
      [ 3653.768548]  vfs_get_tree+0x28/0xc0
      [ 3653.769076]  vfs_kern_mount.part.0+0x71/0xb0
      [ 3653.769718]  btrfs_mount+0x11d/0x3a0 [btrfs]
      [ 3653.770381]  ? rcu_read_lock_sched_held+0x3f/0x80
      [ 3653.771086]  ? kfree+0x1f2/0x3c0
      [ 3653.771574]  legacy_get_tree+0x30/0x50
      [ 3653.772136]  vfs_get_tree+0x28/0xc0
      [ 3653.772673]  path_mount+0x2d4/0xbe0
      [ 3653.773201]  __x64_sys_mount+0x103/0x140
      [ 3653.773793]  do_syscall_64+0x3b/0xc0
      [ 3653.774333]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3653.775094] RIP: 0033:0x7f648bc45aaa
      
This happens because btrfs_read_chunk_tree(), which is called only
during mount, ends up acquiring the open_mutex of a block device while
holding a read lock on a leaf of the chunk tree, while other paths need
to acquire other locks before locking extent buffers of the chunk tree.
      
Since at mount time, when we call btrfs_read_chunk_tree(), we know that
no other tasks are running in parallel and modifying the chunk tree, we
can simply skip locking the chunk tree extent buffers. So do that, as
sketched below, and move the assertion that checks the fs is not yet
mounted to the top of btrfs_read_chunk_tree(), with a comment before
doing it.
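A minimal sketch of the locking skip (the path setup details are
assumptions):

  path = btrfs_alloc_path();
  if (!path)
          return -ENOMEM;
  /*
   * At mount time nothing else can concurrently modify the chunk tree,
   * so reading it without locking the extent buffers is safe.
   */
  path->skip_locking = 1;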
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
6. 29 Oct 2021, 1 commit
• btrfs: clear MISSING device status bit in btrfs_close_one_device · 5d03dbeb
  Li Zhang authored
      Reported bug: https://github.com/kdave/btrfs-progs/issues/389
      
      There's a problem with scrub reporting aborted status but returning
      error code 0, on a filesystem with missing and readded device.
      
      Roughly these steps:
      
      - mkfs -d raid1 dev1 dev2
      - fill with data
      - unmount
      - make dev1 disappear
      - mount -o degraded
      - copy more data
      - make dev1 appear again
      
      Running scrub afterwards reports that the command was aborted, but the
      system log message says the exit code was 0.
      
It seems that the cause of the error is decrementing
fs_devices->missing_devices but not clearing device->dev_state.  Every
time we unmount the filesystem it calls close_ctree(), which eventually
reaches btrfs_close_one_device() to close the device, but that only
decrements fs_devices->missing_devices and does not clear the device's
BTRFS_DEV_STATE_MISSING bit. Worse, this bug causes an integer
underflow, because every unmount decreases fs_devices->missing_devices;
once it hits 0, the next decrement wraps around.
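A hedged sketch of the fix in btrfs_close_one_device(), clearing the bit
together with the decrement (the dev-replace guard is an assumption):

  if (device->devid != BTRFS_DEV_REPLACE_DEVID &&
      test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
          clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
          fs_devices->missing_devices--;
  }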
      
      With added debugging:
      
         loop1: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 0
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
      
If fs_devices->missing_devices is 0, the next decrement wraps it around
to 18446744073709551615.
      
After applying this patch, fs_devices->missing_devices stays correct:
      
        $ truncate -s 10g test1
        $ truncate -s 10g test2
        $ losetup /dev/loop1 test1
        $ losetup /dev/loop2 test2
        $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
        $ losetup -d /dev/loop2
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ dmesg
      
         loop1: detected capacity change from 0 to 20971520
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): checking UUID tree
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
      
      CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7. 27 Oct 2021, 16 commits
• btrfs: update device path inode time instead of bd_inode · 54fde91f
  Josef Bacik authored
Christoph pointed out that I'm updating bdev->bd_inode for the device
time when we remove block devices from a btrfs file system, but this
isn't actually exposed to anything.  The inode we want to update is the
      one that's associated with the path to the device, usually on devtmpfs,
      so that blkid notices the difference.
      
We still don't want to do the blkdev_open, so use kern_path() to get the
path to the given device and update the time on that inode.
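A hedged sketch of the idea (kern_path() and path_put() are standard VFS
helpers; the exact time-update call is an assumption):

  static void update_dev_time(const char *device_path)
  {
          struct path path;
          struct timespec64 now;

          if (kern_path(device_path, LOOKUP_FOLLOW, &path))
                  return;

          /* Bump c/mtime on the inode behind the path, e.g. the
           * devtmpfs node, so blkid notices the change. */
          now = current_time(d_inode(path.dentry));
          inode_update_time(d_inode(path.dentry), &now, S_MTIME | S_CTIME);
          path_put(&path);
  }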
      
      Fixes: 8f96a5bf ("btrfs: update the bdev time directly when closing")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
  Filipe Manana authored
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
So fix this by making sure that whenever we try to modify the chunk
btree and we are neither in a chunk allocation context nor in a chunk
removal context, we reserve system space before modifying the chunk
btree.
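A hedged sketch of the resulting pattern (the reservation helper is
believed to be added by this patch, but its exact name and signature are
assumptions):

  btrfs_reserve_chunk_metadata(trans, false);   /* reserve system space */
  ret = btrfs_update_device(trans, device);     /* chunk btree modification */
  btrfs_trans_release_chunk_metadata(trans);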
Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls · 1a15eb72
  Josef Bacik authored
For device removal and replace we call btrfs_find_device_by_devspec,
which, if we give it a device path and nothing else, will call
btrfs_get_dev_args_from_path, which opens the block device, reads the
super block and then looks up our device based on that.
      
      However at this point we're holding the sb write "lock", so reading the
      block device pulls in the dependency of ->open_mutex, which produces the
      following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #405 Not tainted
      ------------------------------------------------------
      losetup/11576 is trying to acquire lock:
      ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_get_by_path+0x98/0xa0
             btrfs_get_bdev_and_sb+0x1b/0xb0
             btrfs_find_device_by_devspec+0x12b/0x1c0
             btrfs_rm_device+0x127/0x610
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11576:
       #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f31b02404cb
      
      Instead what we want to do is populate our device lookup args before we
      grab any locks, and then pass these args into btrfs_rm_device().  From
      there we can find the device and do the appropriate removal.
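A rough sketch of the resulting flow in the removal ioctl (variable
names, the initializer, and surrounding details are assumptions):

  struct btrfs_dev_lookup_args args = { .devid = (u64)-1 };

  /* Resolve the path to devid/uuid before taking any locks. */
  ret = btrfs_get_dev_args_from_path(fs_info, &args, vol_args->name);
  if (ret)
          goto out;
  ...
  ret = btrfs_rm_device(fs_info, &args, &bdev, &mode);
  ...
  btrfs_put_dev_args_from_path(&args);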
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: add a btrfs_get_dev_args_from_path helper · faa775c4
  Josef Bacik authored
      We are going to want to populate our device lookup args outside of any
      locks and then do the actual device lookup later, so add a helper to do
      this work and make btrfs_find_device_by_devspec() use this helper for
      now.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
  Josef Bacik authored
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
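A hedged sketch of the lookup struct (the exact field set is an
assumption):

  struct btrfs_dev_lookup_args {
          u64 devid;
          u8 *uuid;
          u8 *fsid;
          bool missing;
  };

btrfs_find_device() then matches a device against whichever criteria are
set in the args.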
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: do not call close_fs_devices in btrfs_rm_device · 8b41393f
  Josef Bacik authored
      There's a subtle case where if we're removing the seed device from a
      file system we need to free its private copy of the fs_devices.  However
      we do not need to call close_fs_devices(), because at this point there
      are no devices left to close as we've closed the last one.  The only
      thing that close_fs_devices() does is decrement ->opened, which should
      be 1.  We want to avoid calling close_fs_devices() here because it has a
      lockdep_assert_held(&uuid_mutex), and we are going to stop holding the
      uuid_mutex in this path.
      
So simply decrement the ->opened counter like we should, and then clean
up like normal.  Also add a comment explaining what we're doing here, as
I initially removed this code erroneously.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: use num_device to check for the last surviving seed device · 8e906945
  Anand Jain authored
For both sprout and seed fsids:
 btrfs_fs_devices::num_devices provides the device count including missing
 btrfs_fs_devices::open_devices provides the device count excluding missing
      
      We create a dummy struct btrfs_device for the missing device, so
      num_devices != open_devices when there is a missing device.
      
In btrfs_rm_device() we wrongly check %cur_devices->open_devices before
freeing the seed fs_devices. Instead we should check
%cur_devices->num_devices, roughly as in the diff below.
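In diff form, roughly:

  -     if (cur_devices->open_devices == 0) {
  +     if (cur_devices->num_devices == 0) {
                /* ... free the seed fs_devices ... */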
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: remove btrfs_raid_bio::fs_info member · 6a258d72
  Qu Wenruo authored
We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is
always passed into alloc_rbio(), and is only released when the raid bio
is released.

Remove the btrfs_raid_bio::fs_info member, and clean up all the @fs_info
parameters of alloc_rbio() callers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: make sure btrfs_io_context::fs_info is always initialized · 731ccf15
  Qu Wenruo authored
      Currently btrfs_io_context::fs_info is only initialized in
      btrfs_map_bio, but there are call sites like btrfs_map_sblock() which
      calls __btrfs_map_block() directly, leaving bioc::fs_info uninitialized
      (NULL).
      
Currently this is fine, but later cleanups will rely on bioc::fs_info to
grab fs_info, and this can be a hidden problem for such usage.

Remove this hidden uninitialized member by always assigning
bioc::fs_info in alloc_btrfs_io_context().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: do not take the uuid_mutex in btrfs_rm_device · 8ef9dc0f
  Josef Bacik authored
      We got the following lockdep splat while running fstests (specifically
      btrfs/003 and btrfs/020 in a row) with the new rc.  This was uncovered
      by 87579e9b ("loop: use worker per cgroup instead of kworker") which
      converted loop to using workqueues, which comes with lockdep
      annotations that don't exist with kworkers.  The lockdep splat is as
      follows:
      
        WARNING: possible circular locking dependency detected
        5.14.0-rc2-custom+ #34 Not tainted
        ------------------------------------------------------
        losetup/156417 is trying to acquire lock:
        ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600
      
        but task is already holding lock:
        ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #5 (&lo->lo_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 lo_open+0x28/0x60 [loop]
      	 blkdev_get_whole+0x28/0xf0
      	 blkdev_get_by_dev.part.0+0x168/0x3c0
      	 blkdev_open+0xd2/0xe0
      	 do_dentry_open+0x163/0x3a0
      	 path_openat+0x74d/0xa40
      	 do_filp_open+0x9c/0x140
      	 do_sys_openat2+0xb1/0x170
      	 __x64_sys_openat+0x54/0x90
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #4 (&disk->open_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 blkdev_get_by_dev.part.0+0xd1/0x3c0
      	 blkdev_get_by_path+0xc0/0xd0
      	 btrfs_scan_one_device+0x52/0x1f0 [btrfs]
      	 btrfs_control_ioctl+0xac/0x170 [btrfs]
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #3 (uuid_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 btrfs_rm_device+0x48/0x6a0 [btrfs]
      	 btrfs_ioctl+0x2d1c/0x3110 [btrfs]
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #2 (sb_writers#11){.+.+}-{0:0}:
      	 lo_write_bvec+0x112/0x290 [loop]
      	 loop_process_work+0x25f/0xcb0 [loop]
      	 process_one_work+0x28f/0x5d0
      	 worker_thread+0x55/0x3c0
      	 kthread+0x140/0x170
      	 ret_from_fork+0x22/0x30
      
        -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
      	 process_one_work+0x266/0x5d0
      	 worker_thread+0x55/0x3c0
      	 kthread+0x140/0x170
      	 ret_from_fork+0x22/0x30
      
        -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
      	 __lock_acquire+0x1130/0x1dc0
      	 lock_acquire+0xf5/0x320
      	 flush_workqueue+0xae/0x600
      	 drain_workqueue+0xa0/0x110
      	 destroy_workqueue+0x36/0x250
      	 __loop_clr_fd+0x9a/0x650 [loop]
      	 lo_ioctl+0x29d/0x780 [loop]
      	 block_ioctl+0x3f/0x50
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        other info that might help us debug this:
        Chain exists of:
          (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
         Possible unsafe locking scenario:
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&lo->lo_mutex);
      				 lock(&disk->open_mutex);
      				 lock(&lo->lo_mutex);
          lock((wq_completion)loop0);
      
         *** DEADLOCK ***
        1 lock held by losetup/156417:
         #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
      
        stack backtrace:
        CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x57/0x72
         check_noncircular+0x10a/0x120
         __lock_acquire+0x1130/0x1dc0
         lock_acquire+0xf5/0x320
         ? flush_workqueue+0x84/0x600
         flush_workqueue+0xae/0x600
         ? flush_workqueue+0x84/0x600
         drain_workqueue+0xa0/0x110
         destroy_workqueue+0x36/0x250
         __loop_clr_fd+0x9a/0x650 [loop]
         lo_ioctl+0x29d/0x780 [loop]
         ? __lock_acquire+0x3a0/0x1dc0
         ? update_dl_rq_load_avg+0x152/0x360
         ? lock_is_held_type+0xa5/0x120
         ? find_held_lock.constprop.0+0x2b/0x80
         block_ioctl+0x3f/0x50
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f645884de6b
      
      Usually the uuid_mutex exists to protect the fs_devices that map
      together all of the devices that match a specific uuid.  In rm_device
      we're messing with the uuid of a device, so it makes sense to protect
      that here.
      
However in doing that it pulls in a whole host of lockdep dependencies,
as we call mnt_want_write() on the sb before we grab the uuid_mutex,
thus we end up with the dependency chain under the uuid_mutex being
added under the normal sb write dependency chain, which causes problems
with loop devices.
      
      We don't need the uuid mutex here however.  If we call
      btrfs_scan_one_device() before we scratch the super block we will find
      the fs_devices and not find the device itself and return EBUSY because
      the fs_devices is open.  If we call it after the scratch happens it will
      not appear to be a valid btrfs file system.
      
We do not need to worry about other operations modifying fs_devices here
because we're protected by the exclusive operations locking.
      
      So drop the uuid_mutex here in order to fix the lockdep splat.
      
      A more detailed explanation from the discussion:
      
We are worried about rm and scan racing with each other. Before this
change we zero the device out under the uuid_mutex, so when scan does
run it can go through the whole device scan thing without rm messing
with us.
      
      We aren't worried if the scratch happens first, because the result is we
      don't think this is a btrfs device and we bail out.
      
      The only case we are concerned with is we scratch _after_ scan is able
      to read the superblock and gets a seemingly valid super block, so lets
      consider this case.
      
      Scan will call device_list_add() with the device we're removing.  We'll
      call find_fsid_with_metadata_uuid() and get our fs_devices for this
      UUID.  At this point we lock the fs_devices->device_list_mutex.  This is
      what protects us in this case, but we have two cases here.
      
1. We haven't reached the device removal part of the RM ioctl yet.  We
   found our device, and the device name matches our path; we go down
   and set total_devices to our super block's number of devices, which
   doesn't affect anything because we haven't done the remove yet.
      
      2. We are past the device removal part, which is protected by the
         device_list_mutex.  Scan doesn't find the device, it goes down and
         does the
      
         if (fs_devices->opened)
      	   return -EBUSY;
      
         check and we bail out.
      
Nothing about this situation is ideal, but the lockdep splat is real,
and the fix is safe, though admittedly a bit scary looking.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more from the discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
  Qu Wenruo authored
Previously we had "struct btrfs_bio", which records IO context for
mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
btrfs specific info for logical bytenr bios.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
The meaning of struct btrfs_bio changes with this commit. There was a
suggested name like btrfs_logical_bio but it's a bit long and we'd
prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
  Qu Wenruo authored
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
  For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records
  how many mirrors are still pending, and saves the original endio
  function of the bio.
      
      - RAID56 code
  In that case, RAID56 only utilizes the stripe info, and no longer uses
  it to track the pending mirrors.
      
So btrfs_bio is not always bound to a bio, and contains more info for IO
context; thus renaming it will make the naming less confusing.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: remove stale comment about the btrfs_show_devname · cdccc03a
  Anand Jain authored
There were a few lockdep warnings because btrfs_show_devname() was using
device_list_mutex, as recorded in the commits:
      
        0ccd0528 ("btrfs: fix a possible umount deadlock")
        779bf3fe ("btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex")
      
      And finally, commit 88c14590 ("btrfs: use RCU in btrfs_show_devname
      for device list traversal") removed the device_list_mutex from
      btrfs_show_devname for performance reasons.
      
      This patch removes a stale comment about the function
      btrfs_show_devname and device_list_mutex.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: update latest_dev when we create a sprout device · b7cb29e6
  Anand Jain authored
When we add a device to a seed filesystem (sprouting), the result is a
new filesystem (and fsid) on the added device. Update the latest_dev so
that /proc/self/mounts shows the correct device.
      
      Example:
      
        $ btrfstune -S1 /dev/vg/seed
        $ mount /dev/vg/seed /btrfs
        mount: /btrfs: WARNING: device write-protected, mounted read-only.
      
        $ cat /proc/self/mounts | grep btrfs
        /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
      
        $ btrfs dev add -f /dev/vg/new /btrfs
      
      Before:
      
        $ cat /proc/self/mounts | grep btrfs
        /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
      
      After:
      
        $ cat /proc/self/mounts | grep btrfs
        /dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: convert latest_bdev type to btrfs_device and rename · d24fa5c1
  Anand Jain authored
      In preparation to fix a bug in btrfs_show_devname().
      
Convert the fs_devices::latest_bdev type from struct block_device to
struct btrfs_device and rename the member to fs_devices::latest_dev, so
that btrfs_show_devname() can use fs_devices::latest_dev::name.
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: rename and switch to bool btrfs_chunk_readonly · a09f23c3
  Anand Jain authored
btrfs_chunk_readonly() checks if the given chunk is writeable. It
returns 1 for readonly and 0 for writeable, so a return type of bool
suffices instead of the current int.

Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable(), as we
check if the block group is writeable; this keeps the logic at the
parent function simpler to understand.
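A sketch of the new helper's shape (the exact signature is an
assumption):

  /* Before: int btrfs_chunk_readonly(), 1 = readonly, 0 = writeable. */
  /* After: */
  bool btrfs_chunk_writeable(struct btrfs_fs_info *fs_info, u64 chunk_offset);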
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8. 26 Oct 2021, 2 commits