1. 07 10月, 2020 25 次提交
    • A
      btrfs: skip devices without magic signature when mounting · 96c2e067
      Anand Jain 提交于
      Many things can happen after the device is scanned and before the device
      is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
      If it happens we still won't free that device from the memory and cause
      the userland confusion.
      
      For example: As the BTRFS_IOC_DEV_INFO still carries the device path
      which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
      device which does not belong to the filesystem anymore:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
        $ wipefs -a /dev/sdb
        # /dev/sdb does not contain magic signature
        $ mount -o degraded /dev/sda /btrfs
        $ btrfs fi show -m
        Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
      	  Total devices 2 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
      	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb
      
      We need to distinguish the missing signature and invalid superblock, so
      add a specific error code ENODATA for that. This also fixes failure of
      fstest btrfs/198.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96c2e067
    • J
      btrfs: return error if we're unable to read device stats · 92e26df4
      Josef Bacik 提交于
      I noticed when fixing device stats for seed devices that we simply threw
      away the return value from btrfs_search_slot().  This is because we may
      not have stat items, but we could very well get an error, and thus miss
      reporting the error up the chain.
      
      Fix this by returning ret if it's an actual error, and then stop trying
      to init the rest of the devices stats and return the error up the chain.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92e26df4
    • J
      btrfs: init device stats for seed devices · 124604eb
      Josef Bacik 提交于
      We recently started recording device stats across the fleet, and noticed
      a large increase in messages such as this
      
        BTRFS warning (device dm-0): get dev_stats failed, not yet valid
      
      on our tiers that use seed devices for their root devices.  This is
      because we do not initialize the device stats for any seed devices if we
      have a sprout device and mount using that sprout device.  The basic
      steps for reproducing are:
      
        $ mkfs seed device
        $ mount seed device
        # fill seed device
        $ umount seed device
        $ btrfstune -S 1 seed device
        $ mount seed device
        $ btrfs device add -f sprout device /mnt/wherever
        $ umount /mnt/wherever
        $ mount sprout device /mnt/wherever
        $ btrfs device stats /mnt/wherever
      
      This will fail with the above message in dmesg.
      
      Fix this by iterating over the fs_devices->seed if they exist in
      btrfs_init_dev_stats.  This fixed the problem and properly reports the
      stats for both devices.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_device_init_dev_stats ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      124604eb
    • A
      btrfs: simplify gotos in open_seed_device · c83b60c0
      Anand Jain 提交于
      The function does not have a common exit block and returns immediatelly
      so there's no point having the goto. Remove the two cases.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c83b60c0
    • A
      btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device() · e493e8f9
      Anand Jain 提交于
      We can check the argument value directly, no need for the temporary
      variable.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e493e8f9
    • A
      btrfs: use sprout device_list_mutex in btrfs_init_devices_late · e17125b5
      Anand Jain 提交于
      On a mounted sprout filesystem, all threads now are using the
      sprout::device_list_mutex, and this is the only code using the
      seed::device_list_mutex. This patch converts to use the sprouts
      fs_info->fs_devices->device_list_mutex.
      
      The same reasoning holds true here, that device delete is holding
      the sprout::device_list_mutex.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e17125b5
    • A
      btrfs: split and refactor btrfs_sysfs_remove_devices_dir · 53f8a74c
      Anand Jain 提交于
      Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
      btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
      argument to indicate whether to free all devices or just one device.
      
      Export btrfs_sysfs_remove_device() as device operations outside of
      sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir().
      
      btrfs_sysfs_remove_devices_dir() is renamed to
      btrfs_sysfs_remove_fs_devices() to suite its new role.
      
      Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices()
      so it is redeclared s static. And the same function had to be moved
      before its first caller.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      53f8a74c
    • A
      btrfs: simplify parameters of btrfs_sysfs_add_devices_dir · cd36da2e
      Anand Jain 提交于
      When we add a device we need to add it to sysfs, so instead of using the
      btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
      add a device or all of fs_devices, call the helper function directly
      btrfs_sysfs_add_device() and thus make it non-static.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cd36da2e
    • A
      btrfs: improve device scanning messages · 79dae17d
      Anand Jain 提交于
      Systems booting without the initramfs seems to scan an unusual kind
      of device path (/dev/root). And at a later time, the device is updated
      to the correct path. We generally print the process name and PID of the
      process scanning the device but we don't capture the same information if
      the device path is rescanned with a different pathname.
      
      The current message is too long, so drop the unnecessary UUID and add
      process name and PID.
      
      While at this also update the duplicate device warning to include the
      process name and PID so the messages are consistent
      
      CC: stable@vger.kernel.org # 4.19+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79dae17d
    • G
      btrfs: enumerate the type of exclusive operation in progress · c3e1f96c
      Goldwyn Rodrigues 提交于
      Instead of using a flag bit for exclusive operation, use a variable to
      store which exclusive operation is being performed.  Introduce an API
      to start and finish an exclusive operation.
      
      This would enable another way for tools to check which operation is
      running on why starting an exclusive operation failed. The followup
      patch adds a sysfs_notify() to alert userspace when the state changes, so
      userspace can perform select() on it to get notified of the change.
      
      This would enable us to enqueue a command which will wait for current
      exclusive operation to complete before issuing the next exclusive
      operation. This has been done synchronously as opposed to a background
      process, or else error collection (if any) will become difficult.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c3e1f96c
    • J
      btrfs: sysfs: init devices outside of the chunk_mutex · ca10845a
      Josef Bacik 提交于
      While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #4 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_link+0x63/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x81/0x130
      	 btrfs_init_new_device+0x67f/0x1250
      	 btrfs_ioctl+0x1ef/0x2e20
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_insert_delayed_items+0x90/0x4f0
      	 btrfs_commit_inode_delayed_items+0x93/0x140
      	 btrfs_log_inode+0x5de/0x2020
      	 btrfs_log_inode_parent+0x429/0xc90
      	 btrfs_log_new_name+0x95/0x9b
      	 btrfs_rename2+0xbb9/0x1800
      	 vfs_rename+0x64f/0x9f0
      	 do_renameat2+0x320/0x4e0
      	 __x64_sys_rename+0x1f/0x30
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we are holding the chunk_mutex at the time of
      adding in a new device.  However we only need to hold the
      device_list_mutex, as we're going to iterate over the fs_devices
      devices.  Move the sysfs init stuff outside of the chunk_mutex to get
      rid of this lockdep splat.
      
      CC: stable@vger.kernel.org # 4.4.x: f3cd2c58: btrfs: sysfs, rename device_link add/remove functions
      CC: stable@vger.kernel.org # 4.4.x
      Reported-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ca10845a
    • N
      btrfs: don't opencode sync_blockdev in btrfs_init_new_device · b9ba017f
      Nikolay Borisov 提交于
      Instead of opencoding filemap_write_and_wait simply call syncblockdev as
      it makes it abundantly clear what's going on and why this is used. No
      semantics changes.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b9ba017f
    • N
      btrfs: remove redundant code from btrfs_free_stale_devices · 4ae312e9
      Nikolay Borisov 提交于
      Following the refactor of btrfs_free_stale_devices in
      7bcb8164 ("btrfs: use device_list_mutex when removing stale devices")
      fs_devices are freed after they have been iterated by the inner
      list_for_each so the use-after-free fixed by introducing the break in
      fd649f10 ("btrfs: Fix use-after-free when cleaning up fs_devs with
      a single stale device") is no longer necessary. Just remove it
      altogether. No functional changes.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4ae312e9
    • N
      btrfs: refactor locked condition in btrfs_init_new_device · 44cab9ba
      Nikolay Borisov 提交于
      Invert unlocked to locked and exploit the fact it can only ever be
      modified if we are adding a new device to a seed filesystem. This allows
      to simplify the check in error: label. No semantics changes.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      44cab9ba
    • N
      btrfs: use RCU for quick device check in btrfs_init_new_device · f4cfa9bd
      Nikolay Borisov 提交于
      When adding a new device there's a mandatory check to see if a device is
      being duplicated to the filesystem it's added to. Since this is a
      read-only operations not necessary to take device_list_mutex and can simply
      make do with an rcu-readlock.
      
      Using just RCU is safe because there won't be another device add delete
      running in parallel as btrfs_init_new_device is called only from
      btrfs_ioctl_add_dev.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f4cfa9bd
    • J
      btrfs: do not hold device_list_mutex when closing devices · 425c6ed6
      Josef Bacik 提交于
      The following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
      ------------------------------------------------------
      fsstress/8739 is trying to acquire lock:
      ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
      
      but task is already holding lock:
      ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #10 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x129/0x210
             btrfs_page_mkwrite+0x6a/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x68/0x90
             _copy_to_user+0x1e/0x80
             perf_read+0x141/0x2c0
             vfs_read+0xad/0x1b0
             ksys_read+0x5f/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x88/0x150
             perf_event_init+0x1db/0x20b
             start_kernel+0x3ae/0x53c
             secondary_startup_64+0xa4/0xb0
      
      -> #7 (pmus_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x4f/0x150
             cpuhp_invoke_callback+0xb1/0x900
             _cpu_up.constprop.26+0x9f/0x130
             cpu_up+0x7b/0xc0
             bringup_nonboot_cpus+0x4f/0x60
             smp_init+0x26/0x71
             kernel_init_freeable+0x110/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #6 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x39/0xb0
             kmem_cache_create_usercopy+0x28/0x230
             kmem_cache_create+0x12/0x20
             bioset_init+0x15e/0x2b0
             init_bio+0xa3/0xaa
             do_one_initcall+0x5a/0x2e0
             kernel_init_freeable+0x1f4/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #5 (bio_slab_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             bioset_init+0xbc/0x2b0
             __blk_alloc_queue+0x6f/0x2d0
             blk_mq_init_queue_data+0x1b/0x70
             loop_add+0x110/0x290 [loop]
             fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
             do_one_initcall+0x5a/0x2e0
             do_init_module+0x5a/0x220
             load_module+0x2459/0x26e0
             __do_sys_finit_module+0xba/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             lo_open+0x18/0x50 [loop]
             __blkdev_get+0xec/0x570
             blkdev_get+0xe8/0x150
             do_dentry_open+0x167/0x410
             path_openat+0x7c9/0xa80
             do_filp_open+0x93/0x100
             do_sys_openat2+0x22a/0x2e0
             do_sys_open+0x4b/0x80
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             blkdev_put+0x1d/0x120
             close_fs_devices.part.31+0x84/0x130
             btrfs_close_devices+0x44/0xb0
             close_ctree+0x296/0x2b2
             generic_shutdown_super+0x69/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_run_dev_stats+0x49/0x480
             commit_cowonly_roots+0xb5/0x2a0
             btrfs_commit_transaction+0x516/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_commit_transaction+0x4bb/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1272/0x2310
             lock_acquire+0x9e/0x360
             __mutex_lock+0x9f/0x930
             btrfs_record_root_in_trans+0x43/0x70
             start_transaction+0xd1/0x5d0
             btrfs_dirty_inode+0x42/0xd0
             file_update_time+0xc8/0x110
             btrfs_page_mkwrite+0x10c/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_pagefaults);
                                     lock(&mm->mmap_lock#2);
                                     lock(sb_pagefaults);
        lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by fsstress/8739:
       #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
       #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
       #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
      
      stack backtrace:
      CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
      Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x165/0x180
       __lock_acquire+0x1272/0x2310
       ? btrfs_get_alloc_profile+0x150/0x210
       lock_acquire+0x9e/0x360
       ? btrfs_record_root_in_trans+0x43/0x70
       __mutex_lock+0x9f/0x930
       ? btrfs_record_root_in_trans+0x43/0x70
       ? lock_acquire+0x9e/0x360
       ? join_transaction+0x5d/0x450
       ? find_held_lock+0x2d/0x90
       ? btrfs_record_root_in_trans+0x43/0x70
       ? join_transaction+0x3d5/0x450
       ? btrfs_record_root_in_trans+0x43/0x70
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       ? handle_mm_fault+0x5e/0x1730
       do_page_mkwrite+0x4d/0xc0
       ? __do_fault+0x32/0x150
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7faa6c9969c4
      
      Was seen in testing.  The fix is similar to that of
      
        btrfs: open device without device_list_mutex
      
      where we're holding the device_list_mutex and then grab the bd_mutex,
      which pulls in a bunch of dependencies under the bd_mutex.  We only ever
      call btrfs_close_devices() on mount failure or unmount, so we're save to
      not have the device_list_mutex here.  We're already holding the
      uuid_mutex which keeps us safe from any external modification of the
      fs_devices.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      425c6ed6
    • J
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · 62cf5391
      Josef Bacik 提交于
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      62cf5391
    • N
      btrfs: remove alloc_list splice in btrfs_prepare_sprout · 68abf360
      Nikolay Borisov 提交于
      btrfs_prepare_sprout is called when the first rw device is added to a
      seed filesystem. This means the filesystem can't have its alloc_list
      be non-empty, since seed filesystems are read only. Simply remove the
      code altogether.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      68abf360
    • N
      btrfs: document some invariants of seed code · 427c8fdd
      Nikolay Borisov 提交于
      Without good understanding of how seed devices works it's hard to grok
      some of what the code in open_seed_devices or btrfs_prepare_sprout does.
      
      Add comments hopefully reducing some of the cognitive load.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      427c8fdd
    • N
      btrfs: switch seed device to list api · 944d3f9f
      Nikolay Borisov 提交于
      While this patch touches a bunch of files the conversion is
      straighforward. Instead of using the implicit linked list anchored at
      btrfs_fs_devices::seed the code is switched to using
      list_for_each_entry.
      
      Previous patches in the series already factored out code that processed
      both main and seed devices so in those cases the factored out functions
      are called on the main fs_devices and then on every seed dev inside
      list_for_each_entry.
      
      Using list api also allows to simplify deletion from the seed dev list
      performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
      substituting a while() loop with a simple list_del_init.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      944d3f9f
    • N
      btrfs: simplify setting/clearing fs_info to btrfs_fs_devices · c4989c2f
      Nikolay Borisov 提交于
      It makes no sense to have sysfs-related routines be responsible for
      properly initialising the fs_info pointer of struct btrfs_fs_device.
      Instead this can be streamlined by making it the responsibility of
      btrfs_init_devices_late to initialize it. That function already
      initializes fs_info of every individual device in btrfs_fs_devices.
      
      As far as clearing it is concerned it makes sense to move it to
      close_fs_devices. That function is only called when struct
      btrfs_fs_devices is no longer in use - either for holding seeds or
      main devices for a mounted filesystem.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c4989c2f
    • N
      btrfs: make close_fs_devices return void · 54eed6ae
      Nikolay Borisov 提交于
      The return value of this function conveys absolutely no information.
      All callers already check the state of fs_devices->opened to decide how
      to proceed. So convert the function to returning void. While at it make
      btrfs_close_devices also return void.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      54eed6ae
    • N
      btrfs: factor out loop logic from btrfs_free_extra_devids · 3712ccb7
      Nikolay Borisov 提交于
      This prepares the code to switching seeds devices to a proper list.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3712ccb7
    • Q
      btrfs: add owner and fs_info to alloc_state io_tree · 154f7cb8
      Qu Wenruo 提交于
      Commit 1c11b63e ("btrfs: replace pending/pinned chunks lists with io
      tree") introduced btrfs_device::alloc_state extent io tree, but it
      doesn't initialize the fs_info and owner member.
      
      This means the following features are not properly supported:
      
      - Fs owner report for insert_state() error
        Without fs_info initialized, although btrfs_err() won't panic, it
        won't output which fs is causing the error.
      
      - Wrong owner for trace events
        alloc_state will get the owner as pinned extents.
      
      Fix this by assiging proper fs_info and owner for
      btrfs_device::alloc_state.
      
      Fixes: 1c11b63e ("btrfs: replace pending/pinned chunks lists with io tree")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      154f7cb8
    • N
      btrfs: remove fsid argument from btrfs_sysfs_update_sprout_fsid · 8e560081
      Nikolay Borisov 提交于
      It can be accessed from 'fs_devices' as it's identical to
      fs_info->fs_devices. Also add a comment about why we are calling the
      function. No semantic changes.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8e560081
  2. 01 10月, 2020 1 次提交
    • J
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · a466c85e
      Josef Bacik 提交于
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      
      There's a report from syzbot probably hitting one of the cases where
      the bd_mutex and device_list_mutex are taken in the wrong order, however
      it's not with device replace, like this patch fixes. As there's no
      reproducer available so far, we can't verify the fix.
      
      https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
      dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419
      
        WARNING: possible circular locking dependency detected
        5.9.0-rc5-syzkaller #0 Not tainted
        ------------------------------------------------------
        syz-executor.0/6878 is trying to acquire lock:
        ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804
      
        but task is already holding lock:
        ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
      	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
      	 btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
      	 btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
      	 __btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
      	 find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
      	 find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
      	 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
      	 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
      	 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
      	 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
      	 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
      	 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
      	 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
      	 do_writepages+0xec/0x290 mm/page-writeback.c:2352
      	 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
      	 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
      	 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
      	 wb_do_writeback fs/fs-writeback.c:2039 [inline]
      	 wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
      	 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
      	 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
      	 kthread+0x3b5/0x4a0 kernel/kthread.c:292
      	 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
        -> #3 (sb_internal#2){.+.+}-{0:0}:
      	 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
      	 __sb_start_write+0x234/0x470 fs/super.c:1672
      	 sb_start_intwrite include/linux/fs.h:1690 [inline]
      	 start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
      	 find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
      	 find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
      	 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
      	 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
      	 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
      	 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
      	 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
      	 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
      	 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
      	 do_writepages+0xec/0x290 mm/page-writeback.c:2352
      	 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
      	 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
      	 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
      	 wb_do_writeback fs/fs-writeback.c:2039 [inline]
      	 wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
      	 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
      	 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
      	 kthread+0x3b5/0x4a0 kernel/kthread.c:292
      	 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
        -> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
      	 __flush_work+0x60e/0xac0 kernel/workqueue.c:3041
      	 wb_shutdown+0x180/0x220 mm/backing-dev.c:355
      	 bdi_unregister+0x174/0x590 mm/backing-dev.c:872
      	 del_gendisk+0x820/0xa10 block/genhd.c:933
      	 loop_remove drivers/block/loop.c:2192 [inline]
      	 loop_control_ioctl drivers/block/loop.c:2291 [inline]
      	 loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
      	 vfs_ioctl fs/ioctl.c:48 [inline]
      	 __do_sys_ioctl fs/ioctl.c:753 [inline]
      	 __se_sys_ioctl fs/ioctl.c:739 [inline]
      	 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
      	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (loop_ctl_mutex){+.+.}-{3:3}:
      	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
      	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
      	 lo_open+0x19/0xd0 drivers/block/loop.c:1893
      	 __blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
      	 blkdev_get fs/block_dev.c:1639 [inline]
      	 blkdev_open+0x227/0x300 fs/block_dev.c:1753
      	 do_dentry_open+0x4b9/0x11b0 fs/open.c:817
      	 do_open fs/namei.c:3251 [inline]
      	 path_openat+0x1b9a/0x2730 fs/namei.c:3368
      	 do_filp_open+0x17e/0x3c0 fs/namei.c:3395
      	 do_sys_openat2+0x16d/0x420 fs/open.c:1168
      	 do_sys_open fs/open.c:1184 [inline]
      	 __do_sys_open fs/open.c:1192 [inline]
      	 __se_sys_open fs/open.c:1188 [inline]
      	 __x64_sys_open+0x119/0x1c0 fs/open.c:1188
      	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
      	 check_prev_add kernel/locking/lockdep.c:2496 [inline]
      	 check_prevs_add kernel/locking/lockdep.c:2601 [inline]
      	 validate_chain kernel/locking/lockdep.c:3218 [inline]
      	 __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
      	 lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
      	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
      	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
      	 blkdev_put+0x30/0x520 fs/block_dev.c:1804
      	 btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
      	 btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
      	 btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
      	 close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
      	 close_fs_devices fs/btrfs/volumes.c:1193 [inline]
      	 btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
      	 close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
      	 generic_shutdown_super+0x144/0x370 fs/super.c:464
      	 kill_anon_super+0x36/0x60 fs/super.c:1108
      	 btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
      	 deactivate_locked_super+0x94/0x160 fs/super.c:335
      	 deactivate_super+0xad/0xd0 fs/super.c:366
      	 cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
      	 task_work_run+0xdd/0x190 kernel/task_work.c:141
      	 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
      	 exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
      	 exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
      	 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_devs->device_list_mutex);
      				 lock(sb_internal#2);
      				 lock(&fs_devs->device_list_mutex);
          lock(&bdev->bd_mutex);
      
         *** DEADLOCK ***
      
        3 locks held by syz-executor.0/6878:
         #0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
         #1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
         #2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
      
        stack backtrace:
        CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x198/0x1fd lib/dump_stack.c:118
         check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
         check_prev_add kernel/locking/lockdep.c:2496 [inline]
         check_prevs_add kernel/locking/lockdep.c:2601 [inline]
         validate_chain kernel/locking/lockdep.c:3218 [inline]
         __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
         lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
         __mutex_lock_common kernel/locking/mutex.c:956 [inline]
         __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
         blkdev_put+0x30/0x520 fs/block_dev.c:1804
         btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
         btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
         btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
         close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
         close_fs_devices fs/btrfs/volumes.c:1193 [inline]
         btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
         close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
         generic_shutdown_super+0x144/0x370 fs/super.c:464
         kill_anon_super+0x36/0x60 fs/super.c:1108
         btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
         deactivate_locked_super+0x94/0x160 fs/super.c:335
         deactivate_super+0xad/0xd0 fs/super.c:366
         cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
         task_work_run+0xdd/0x190 kernel/task_work.c:141
         tracehook_notify_resume include/linux/tracehook.h:188 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
         exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
         syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x460027
        RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
        RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
        RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
        R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
        R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add syzbot reference ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a466c85e
  3. 25 9月, 2020 1 次提交
  4. 07 9月, 2020 1 次提交
    • J
      btrfs: fix lockdep splat in add_missing_dev · fccc0007
      Josef Bacik 提交于
      Nikolay reported a lockdep splat in generic/476 that I could reproduce
      with btrfs/187.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc2+ #1 Tainted: G        W
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc_trace+0x3a/0x1a0
      	 btrfs_alloc_device+0x43/0x210
      	 add_missing_dev+0x20/0x90
      	 read_one_chunk+0x301/0x430
      	 btrfs_read_sys_array+0x17b/0x1b0
      	 open_ctree+0xa62/0x1896
      	 btrfs_mount_root.cold+0x12/0xea
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 vfs_kern_mount.part.0+0x71/0xb0
      	 btrfs_mount+0x10d/0x379
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_lookup_inode+0x2a/0x8f
      	 __btrfs_update_delayed_inode+0x80/0x240
      	 btrfs_commit_inode_delayed_inode+0x119/0x120
      	 btrfs_evict_inode+0x357/0x500
      	 evict+0xcf/0x1f0
      	 vfs_rmdir.part.0+0x149/0x160
      	 do_rmdir+0x136/0x1a0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1184/0x1fa0
      	 lock_acquire+0xa4/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(&fs_info->chunk_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 1 PID: 100 Comm: kswapd0 Tainted: G        W         5.9.0-rc2+ #1
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x92/0xc8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x1184/0x1fa0
         lock_acquire+0xa4/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa4/0x3d0
         ? btrfs_evict_inode+0x11e/0x500
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x46/0x60
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This is because we are holding the chunk_mutex when we call
      btrfs_alloc_device, which does a GFP_KERNEL allocation.  We don't want
      to switch that to a GFP_NOFS lock because this is the only place where
      it matters.  So instead use memalloc_nofs_save() around the allocation
      in order to avoid the lockdep splat.
      Reported-by: NNikolay Borisov <nborisov@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fccc0007
  5. 27 8月, 2020 1 次提交
    • J
      btrfs: drop path before adding new uuid tree entry · 9771a5cf
      Josef Bacik 提交于
      With the conversion of the tree locks to rwsem I got the following
      lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
        ------------------------------------------------------
        btrfs-uuid/7955 is trying to acquire lock:
        ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        but task is already holding lock:
        ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-uuid-00){++++}-{3:3}:
      	 down_read_nested+0x3e/0x140
      	 __btrfs_tree_read_lock+0x39/0x180
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x4bd/0x990
      	 btrfs_uuid_tree_add+0x89/0x2d0
      	 btrfs_uuid_scan_kthread+0x330/0x390
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 down_read_nested+0x3e/0x140
      	 __btrfs_tree_read_lock+0x39/0x180
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x4bd/0x990
      	 btrfs_find_root+0x45/0x1b0
      	 btrfs_read_tree_root+0x61/0x100
      	 btrfs_get_root_ref.part.50+0x143/0x630
      	 btrfs_uuid_tree_iterate+0x207/0x314
      	 btrfs_uuid_rescan_kthread+0x12/0x50
      	 kthread+0x133/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-uuid-00);
      				 lock(btrfs-root-00);
      				 lock(btrfs-uuid-00);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        1 lock held by btrfs-uuid/7955:
         #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
      
        stack backtrace:
        CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? __btrfs_tree_read_lock+0x39/0x180
         ? btrfs_root_node+0x1c/0x1d0
         down_read_nested+0x3e/0x140
         ? __btrfs_tree_read_lock+0x39/0x180
         __btrfs_tree_read_lock+0x39/0x180
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x4bd/0x990
         btrfs_find_root+0x45/0x1b0
         btrfs_read_tree_root+0x61/0x100
         btrfs_get_root_ref.part.50+0x143/0x630
         btrfs_uuid_tree_iterate+0x207/0x314
         ? btree_readpage+0x20/0x20
         btrfs_uuid_rescan_kthread+0x12/0x50
         kthread+0x133/0x150
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This problem exists because we have two different rescan threads,
      btrfs_uuid_scan_kthread which creates the uuid tree, and
      btrfs_uuid_tree_iterate that goes through and updates or deletes any out
      of date roots.  The problem is they both do things in different order.
      btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
      into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
      then does a btrfs_get_fs_root() which can read from the tree_root.
      
      It's actually easy enough to not be holding the path in
      btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
      it further down and re-start the search when we loop.  So simply move
      the path release before we add our entry to the uuid tree.
      
      This also fixes a problem where we're holding a path open after we do
      btrfs_end_transaction(), which has it's own problems.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9771a5cf
  6. 12 8月, 2020 1 次提交
    • Q
      btrfs: trim: fix underflow in trim length to prevent access beyond device boundary · c57dd1f2
      Qu Wenruo 提交于
      [BUG]
      The following script can lead to tons of beyond device boundary access:
      
        mkfs.btrfs -f $dev -b 10G
        mount $dev $mnt
        trimfs $mnt
        btrfs filesystem resize 1:-1G $mnt
        trimfs $mnt
      
      [CAUSE]
      Since commit 929be17a ("btrfs: Switch btrfs_trim_free_extents to
      find_first_clear_extent_bit"), we try to avoid trimming ranges that's
      already trimmed.
      
      So we check device->alloc_state by finding the first range which doesn't
      have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
      
      But if we shrunk the device, that bits are not cleared, thus we could
      easily got a range starts beyond the shrunk device size.
      
      This results the returned @start and @end are all beyond device size,
      then we call "end = min(end, device->total_bytes -1);" making @end
      smaller than device size.
      
      Then finally we goes "len = end - start + 1", totally underflow the
      result, and lead to the beyond-device-boundary access.
      
      [FIX]
      This patch will fix the problem in two ways:
      
      - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
        This is the root fix
      
      - Add extra safety check when trimming free device extents
        We check and warn if the returned range is already beyond current
        device.
      
      Link: https://github.com/kdave/btrfs-progs/issues/282
      Fixes: 929be17a ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c57dd1f2
  7. 27 7月, 2020 8 次提交
    • J
      btrfs: move the chunk_mutex in btrfs_read_chunk_tree · 01d01caf
      Josef Bacik 提交于
      We are currently getting this lockdep splat in btrfs/161:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G            E
        ------------------------------------------------------
        mount/678048 is trying to acquire lock:
        ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
      
        but task is already holding lock:
        ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x8b/0x8f0
      	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
      	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
      	 ksys_ioctl+0x87/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 __mutex_lock+0x8b/0x8f0
      	 clone_fs_devices+0x4d/0x170 [btrfs]
      	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
      	 open_ctree+0xb7c/0x18ce [btrfs]
      	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 do_mount+0x7de/0xb30
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->chunk_mutex);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->chunk_mutex);
          lock(&fs_devs->device_list_mutex);
      
         *** DEADLOCK ***
      
        3 locks held by mount/678048:
         #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
         #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
         #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? asm_sysvec_apic_timer_interrupt+0x12/0x20
         lock_acquire+0xab/0x360
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         __mutex_lock+0x8b/0x8f0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? rcu_read_lock_sched_held+0x52/0x60
         ? cpumask_next+0x16/0x20
         ? module_assert_mutex_or_preempt+0x14/0x40
         ? __module_address+0x28/0xf0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? static_obj+0x4f/0x60
         ? lockdep_init_map_waits+0x43/0x200
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         clone_fs_devices+0x4d/0x170 [btrfs]
         btrfs_read_chunk_tree+0x330/0x800 [btrfs]
         open_ctree+0xb7c/0x18ce [btrfs]
         ? super_setup_bdi_name+0x79/0xd0
         btrfs_mount_root.cold+0x13/0xfa [btrfs]
         ? vfs_parse_fs_string+0x84/0xb0
         ? rcu_read_lock_sched_held+0x52/0x60
         ? kfree+0x2b5/0x310
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         fc_mount+0xe/0x40
         vfs_kern_mount.part.0+0x71/0x90
         btrfs_mount+0x13b/0x3e0 [btrfs]
         ? cred_has_capability+0x7c/0x120
         ? rcu_read_lock_sched_held+0x52/0x60
         ? legacy_get_tree+0x30/0x50
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         do_mount+0x7de/0xb30
         ? memdup_user+0x4e/0x90
         __x64_sys_mount+0x8e/0xd0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
      then read the device, which takes the device_list_mutex.  The
      device_list_mutex needs to be taken before the chunk_mutex, so this is a
      problem.  We only really need the chunk mutex around adding the chunk,
      so move the mutex around read_one_chunk.
      
      An argument could be made that we don't even need the chunk_mutex here
      as it's during mount, and we are protected by various other locks.
      However we already have special rules for ->device_list_mutex, and I'd
      rather not have another special case for ->chunk_mutex.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      01d01caf
    • J
      btrfs: open device without device_list_mutex · 18c850fd
      Josef Bacik 提交于
      There's long existed a lockdep splat because we open our bdev's under
      the ->device_list_mutex at mount time, which acquires the bd_mutex.
      Usually this goes unnoticed, but if you do loopback devices at all
      suddenly the bd_mutex comes with a whole host of other dependencies,
      which results in the splat when you mount a btrfs file system.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
      ------------------------------------------------------
      systemd-journal/509 is trying to acquire lock:
      ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
      
      but task is already holding lock:
      ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
       -> #6 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x13e/0x220
             btrfs_page_mkwrite+0x59/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
       -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x60/0x80
             _copy_from_user+0x20/0xb0
             get_sg_io_hdr+0x9a/0xb0
             scsi_cmd_ioctl+0x1ea/0x2f0
             cdrom_ioctl+0x3c/0x12b4
             sr_block_ioctl+0xa4/0xd0
             block_ioctl+0x3f/0x50
             ksys_ioctl+0x82/0xc0
             __x64_sys_ioctl+0x16/0x20
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #4 (&cd->lock){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             sr_block_open+0xa2/0x180
             __blkdev_get+0xdd/0x550
             blkdev_get+0x38/0x150
             do_dentry_open+0x16b/0x3e0
             path_openat+0x3c9/0xa00
             do_filp_open+0x75/0x100
             do_sys_openat2+0x8a/0x140
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             __blkdev_get+0x6a/0x550
             blkdev_get+0x85/0x150
             blkdev_get_by_path+0x2c/0x70
             btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
             open_fs_devices+0x88/0x240 [btrfs]
             btrfs_open_devices+0x92/0xa0 [btrfs]
             btrfs_mount_root+0x250/0x490 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             vfs_kern_mount.part.0+0x71/0xb0
             btrfs_mount+0x119/0x380 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             do_mount+0x8c6/0xca0
             __x64_sys_mount+0x8e/0xd0
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_run_dev_stats+0x36/0x420 [btrfs]
             commit_cowonly_roots+0x91/0x2d0 [btrfs]
             btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1241/0x20c0
             lock_acquire+0xb0/0x400
             __mutex_lock+0x7b/0x820
             btrfs_record_root_in_trans+0x44/0x70 [btrfs]
             start_transaction+0xd2/0x500 [btrfs]
             btrfs_dirty_inode+0x44/0xd0 [btrfs]
             file_update_time+0xc6/0x120
             btrfs_page_mkwrite+0xda/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
      Possible unsafe locking scenario:
      
           CPU0                    CPU1
           ----                    ----
       lock(sb_pagefaults);
                                   lock(&mm->mmap_lock#2);
                                   lock(sb_pagefaults);
       lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by systemd-journal/509:
       #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
       #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
       #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x92/0xc8
       check_noncircular+0x134/0x150
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       __mutex_lock+0x7b/0x820
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? kvm_sched_clock_read+0x14/0x30
       ? sched_clock+0x5/0x10
       ? sched_clock_cpu+0xc/0xb0
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       ? sched_clock+0x5/0x10
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7fa3972fdbfe
      Code: Bad RIP value.
      
      Fix this by not holding the ->device_list_mutex at this point.  The
      device_list_mutex exists to protect us from modifying the device list
      while the file system is running.
      
      However it can also be modified by doing a scan on a device.  But this
      action is specifically protected by the uuid_mutex, which we are holding
      here.  We cannot race with opening at this point because we have the
      ->s_mount lock held during the mount.  Not having the
      ->device_list_mutex here is perfectly safe as we're not going to change
      the devices at this point.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add some comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      18c850fd
    • Q
      btrfs: relocation: review the call sites which can be interrupted by signal · 44d354ab
      Qu Wenruo 提交于
      Since most metadata reservation calls can return -EINTR when get
      interrupted by fatal signal, we need to review the all the metadata
      reservation call sites.
      
      In relocation code, the metadata reservation happens in the following
      sites:
      
      - btrfs_block_rsv_refill() in merge_reloc_root()
        merge_reloc_root() is a pretty critical section, we don't want to be
        interrupted by signal, so change the flush status to
        BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
        Since such change can be ENPSPC-prone, also shrink the amount of
        metadata to reserve least amount avoid deadly ENOSPC there.
      
      - btrfs_block_rsv_refill() in reserve_metadata_space()
        It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
        by signal.
      
      - btrfs_block_rsv_refill() in prepare_to_relocate()
      
      - btrfs_block_rsv_add() in prepare_to_relocate()
      
      - btrfs_block_rsv_refill() in relocate_block_group()
      
      - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
      
      - btrfs_start_transaction() in relocate_block_group()
      
      - btrfs_start_transaction() in create_reloc_inode()
        Can be interrupted by fatal signal and we can handle it easily.
        For these call sites, just catch the -EINTR value in btrfs_balance()
        and count them as canceled.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      44d354ab
    • D
      btrfs: prefetch chunk tree leaves at mount · d85327b1
      David Sterba 提交于
      The whole chunk tree is read at mount time so we can utilize readahead
      to get the tree blocks to memory before we read the items. The idea is
      from Robbie, but instead of updating search slot readahead, this patch
      implements the chunk tree readahead manually from nodes on level 1.
      
      We've decided to do specific readahead optimizations and then unify them
      under a common API so we don't break everything by changing the search
      slot readahead logic.
      
      Higher chunk trees grow on large filesystems (many terabytes), and
      prefetching just level 1 seems to be sufficient. Provided example was
      from a 200TiB filesystem with chunk tree level 2.
      
      CC: Robbie Ko <robbieko@synology.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d85327b1
    • N
      btrfs: always initialize btrfs_bio::tgtdev_map/raid_map pointers · 608769a4
      Nikolay Borisov 提交于
      Since btrfs_bio always contains the extra space for the tgtdev_map and
      raid_maps it's pointless to make the assignment iff specific conditions
      are met.
      
      Instead, always assign the pointers to their correct value at allocation
      time. To accommodate this change also move code a bit in
      __btrfs_map_block so that btrfs_bio::stripes array is always initialized
      before the raid_map, subsequently move the call to sort_parity_stripes
      in the 'if' building the raid_map, retaining the old behavior.
      
      To better understand the change, there are 2 aspects to this:
      
      1. The original code is harder to grasp because the calculations for
         initializing raid_map/tgtdev ponters are apart from the initial
         allocation of memory. Having them predicated on 2 separate checks
         doesn't help that either... So by moving the initialisation in
         alloc_btrfs_bio puts everything together.
      
      2. tgtdev/raid_maps are now always initialized despite sometimes they
         might be equal i.e __btrfs_map_block_for_discard calls
         alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated
         on external checks i.e. just because those pointers are non-null
         doesn't mean they are valid per-se. And actually while taking another
         look at __btrfs_map_block I saw a discrepancy:
      
         Original code initialised tgtdev_map if the following check is true:
      
      	   if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
      
         However, further down tgtdev_map is only used if the following check
         is true:
      
      	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))
      
        e.g. the additional need_full_stripe(op) predicate is there.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ copy more details from mail discussion ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      608769a4
    • N
      btrfs: don't check for btrfs_device::bdev in btrfs_end_bio · 3eee86c8
      Nikolay Borisov 提交于
      btrfs_map_bio ensures that all submitted bios to devices have valid
      btrfs_device::bdev so this check can be removed from btrfs_end_bio. This
      check was added in june 2012 597a60fa ("Btrfs: don't count I/O
      statistic read errors for missing devices") but then in October of the
      same year another commit de1ee92a ("Btrfs: recheck bio against
      block device when we map the bio") started checking for the presence of
      btrfs_device::bdev before actually issuing the bio.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3eee86c8
    • N
      btrfs: record btrfs_device directly in btrfs_io_bio · c31efbdf
      Nikolay Borisov 提交于
      Instead of recording stripe_index and using that to access correct
      btrfs_device from btrfs_bio::stripes record the btrfs_device in
      btrfs_io_bio. This will enable endio handlers to increment device
      error counters on checksum errors.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c31efbdf
    • D
      btrfs: allow use of global block reserve for balance item deletion · 3502a8c0
      David Sterba 提交于
      On a filesystem with exhausted metadata, but still enough to start
      balance, it's possible to hit this error:
      
      [324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
      [324402.060769] BTRFS info (device loop0): balance: ended with status: -28
      [324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
      
      It fails inside reset_balance_state and turns the filesystem to
      read-only, which is unnecessary and should be fixed too, but the problem
      is caused by lack for space when the balance item is deleted. This is a
      one-time operation and from the same rank as unlink that is allowed to
      use the global block reserve. So do the same for the balance item.
      
      Status of the filesystem (100GiB) just after the balance fails:
      
      $ btrfs fi df mnt
      Data, single: total=80.01GiB, used=38.58GiB
      System, single: total=4.00MiB, used=16.00KiB
      Metadata, single: total=19.99GiB, used=19.48GiB
      GlobalReserve, single: total=512.00MiB, used=50.11MiB
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3502a8c0
  8. 22 7月, 2020 1 次提交
    • B
      btrfs: fix mount failure caused by race with umount · 48cfa61b
      Boris Burkov 提交于
      It is possible to cause a btrfs mount to fail by racing it with a slow
      umount. The crux of the sequence is generic_shutdown_super not yet
      calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
      If that occurs, btrfs_open_devices will decide the opened counter is
      non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
      0. From here, mount will call sget which will result in grab_super
      trying to take the super block umount semaphore. That semaphore will be
      held by the slow umount, so mount will block. Before up-ing the
      semaphore, umount will delete the super block, resulting in mount's sget
      reliably allocating a new one, which causes the mount path to dutifully
      fill it out, and increment total_rw_bytes a second time, which causes
      the mount to fail, as we see double the expected bytes.
      
      Here is the sequence laid out in greater detail:
      
      CPU0                                                    CPU1
      down_write sb->s_umount
      btrfs_kill_super
        kill_anon_super(sb)
          generic_shutdown_super(sb);
            shrink_dcache_for_umount(sb);
            sync_filesystem(sb);
            evict_inodes(sb); // SLOW
      
                                                    btrfs_mount_root
                                                      btrfs_scan_one_device
                                                      fs_devices = device->fs_devices
                                                      fs_info->fs_devices = fs_devices
                                                      // fs_devices-opened makes this a no-op
                                                      btrfs_open_devices(fs_devices, mode, fs_type)
                                                      s = sget(fs_type, test, set, flags, fs_info);
                                                        find sb in s_instances
                                                        grab_super(sb);
                                                          down_write(&s->s_umount); // blocks
      
            sop->put_super(sb)
              // sb->fs_devices->opened == 2; no-op
            spin_lock(&sb_lock);
            hlist_del_init(&sb->s_instances);
            spin_unlock(&sb_lock);
            up_write(&sb->s_umount);
                                                          return 0;
                                                        retry lookup
                                                        don't find sb in s_instances (deleted by CPU0)
                                                        s = alloc_super
                                                        return s;
                                                      btrfs_fill_super(s, fs_devices, data)
                                                        open_ctree // fs_devices total_rw_bytes improperly set!
                                                          btrfs_read_chunk_tree
                                                            read_one_dev // increment total_rw_bytes again!!
                                                            super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
      
      To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
      before the calls to read_one_dev, while holding the sb umount semaphore
      and the uuid mutex.
      
      To reproduce, it is sufficient to dirty a decent number of inodes, then
      quickly umount and mount.
      
        for i in $(seq 0 500)
        do
          dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
        done
        umount /mnt/foo&
        mount /mnt/foo
      
      does the trick for me.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      48cfa61b
  9. 25 5月, 2020 1 次提交