1. 12 August 2020, 1 commit
    • btrfs: trim: fix underflow in trim length to prevent access beyond device boundary · c57dd1f2
      Committed by Qu Wenruo
      [BUG]
      The following script can lead to tons of beyond device boundary access:
      
        mkfs.btrfs -f $dev -b 10G
        mount $dev $mnt
        trimfs $mnt
        btrfs filesystem resize 1:-1G $mnt
        trimfs $mnt
      
      [CAUSE]
      Since commit 929be17a ("btrfs: Switch btrfs_trim_free_extents to
      find_first_clear_extent_bit"), we try to avoid trimming ranges that
      are already trimmed.
      
      So we check device->alloc_state by finding the first range that has
      neither the CHUNK_TRIMMED nor the CHUNK_ALLOCATED bit set.
      
      But if the device has been shrunk, those bits are not cleared, so we
      can easily get a range that starts beyond the shrunk device size.
      
      This means the returned @start and @end are both beyond the device
      size; the subsequent "end = min(end, device->total_bytes - 1);" then
      clamps @end below the device size, and below @start.
      
      The final "len = end - start + 1" therefore underflows, and the huge
      resulting length leads to the beyond-device-boundary access.
      
      [FIX]
      This patch will fix the problem in two ways:
      
      - Clear the CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking the device
        This is the root fix.
      
      - Add an extra safety check when trimming free device extents
        We check and warn if the returned range is already beyond the
        current device.
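
      The arithmetic described above can be reproduced in a small standalone
      C sketch (editor's illustration with made-up numbers, not the kernel
      code): once the clamped @end drops below @start, the unsigned length
      computation wraps around to a huge value.

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            const uint64_t total_bytes = 9ULL << 30; /* device shrunk to 9 GiB   */
            uint64_t start = 19ULL << 29;            /* stale range from 9.5 GiB */
            uint64_t end = 10ULL << 30;              /* stale range up to 10 GiB */

            /* The clamp from the commit message pulls @end below @start... */
            if (end > total_bytes - 1)
                end = total_bytes - 1;

            /* ...so the length computation wraps around. */
            uint64_t len = end - start + 1;
            printf("start=%llu end=%llu len=%llu\n",
                   (unsigned long long)start, (unsigned long long)end,
                   (unsigned long long)len);

            /* Guard in the spirit of the fix: never trust a range that
             * starts beyond the current device size. */
            if (start >= total_bytes)
                fprintf(stderr, "range starts beyond device, skip it\n");
            return 0;
        }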
      
      Link: https://github.com/kdave/btrfs-progs/issues/282
      Fixes: 929be17a ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 27 July 2020, 8 commits
    • btrfs: move the chunk_mutex in btrfs_read_chunk_tree · 01d01caf
      Committed by Josef Bacik
      We are currently getting this lockdep splat in btrfs/161:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc5+ #20 Tainted: G            E
        ------------------------------------------------------
        mount/678048 is trying to acquire lock:
        ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
      
        but task is already holding lock:
        ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x8b/0x8f0
      	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
      	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
      	 ksys_ioctl+0x87/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1240/0x2460
      	 lock_acquire+0xab/0x360
      	 __mutex_lock+0x8b/0x8f0
      	 clone_fs_devices+0x4d/0x170 [btrfs]
      	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
      	 open_ctree+0xb7c/0x18ce [btrfs]
      	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 do_mount+0x7de/0xb30
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x52/0xb0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->chunk_mutex);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->chunk_mutex);
          lock(&fs_devs->device_list_mutex);
      
         *** DEADLOCK ***
      
        3 locks held by mount/678048:
         #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
         #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
         #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
        Call Trace:
         dump_stack+0x96/0xd0
         check_noncircular+0x162/0x180
         __lock_acquire+0x1240/0x2460
         ? asm_sysvec_apic_timer_interrupt+0x12/0x20
         lock_acquire+0xab/0x360
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         __mutex_lock+0x8b/0x8f0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? rcu_read_lock_sched_held+0x52/0x60
         ? cpumask_next+0x16/0x20
         ? module_assert_mutex_or_preempt+0x14/0x40
         ? __module_address+0x28/0xf0
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         ? static_obj+0x4f/0x60
         ? lockdep_init_map_waits+0x43/0x200
         ? clone_fs_devices+0x4d/0x170 [btrfs]
         clone_fs_devices+0x4d/0x170 [btrfs]
         btrfs_read_chunk_tree+0x330/0x800 [btrfs]
         open_ctree+0xb7c/0x18ce [btrfs]
         ? super_setup_bdi_name+0x79/0xd0
         btrfs_mount_root.cold+0x13/0xfa [btrfs]
         ? vfs_parse_fs_string+0x84/0xb0
         ? rcu_read_lock_sched_held+0x52/0x60
         ? kfree+0x2b5/0x310
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         fc_mount+0xe/0x40
         vfs_kern_mount.part.0+0x71/0x90
         btrfs_mount+0x13b/0x3e0 [btrfs]
         ? cred_has_capability+0x7c/0x120
         ? rcu_read_lock_sched_held+0x52/0x60
         ? legacy_get_tree+0x30/0x50
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         do_mount+0x7de/0xb30
         ? memdup_user+0x4e/0x90
         __x64_sys_mount+0x8e/0xd0
         do_syscall_64+0x52/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is because btrfs_read_chunk_tree() can come upon DEV_EXTENTs and
      then read the device, which takes the device_list_mutex.  The
      device_list_mutex needs to be taken before the chunk_mutex, so this is
      a problem.  We only really need the chunk mutex around adding the
      chunk, so move the mutex to just around read_one_chunk.
      
      An argument could be made that we don't even need the chunk_mutex here
      as it's during mount, and we are protected by various other locks.
      However we already have special rules for ->device_list_mutex, and I'd
      rather not have another special case for ->chunk_mutex.
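
      As a rough userspace analogy of the resulting shape (editor's sketch
      with pthread mutexes and invented names, not the actual kernel diff):
      the coarse lock is no longer held across the whole tree walk, only
      around the step that really needs it.

        #include <pthread.h>
        #include <stdio.h>

        /* chunk_lock stands in for fs_info->chunk_mutex, process_chunk()
         * for read_one_chunk(); other per-item work (such as reading a
         * device) may take locks that must be ordered before chunk_lock,
         * so chunk_lock must not be held across the whole loop. */
        static pthread_mutex_t chunk_lock = PTHREAD_MUTEX_INITIALIZER;

        static void process_chunk(int key)
        {
            printf("adding chunk %d under chunk_lock\n", key);
        }

        static void read_chunk_tree(const int *keys, int nr)
        {
            for (int i = 0; i < nr; i++) {
                /* per-item work that needs other locks would go here */
                pthread_mutex_lock(&chunk_lock);
                process_chunk(keys[i]);
                pthread_mutex_unlock(&chunk_lock);
            }
        }

        int main(void)
        {
            int keys[] = { 1, 2, 3 };
            read_chunk_tree(keys, 3);
            return 0;
        }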
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: open device without device_list_mutex · 18c850fd
      Committed by Josef Bacik
      There has long existed a lockdep splat because we open our bdevs under
      the ->device_list_mutex at mount time, which acquires the bd_mutex.
      Usually this goes unnoticed, but if you use loopback devices at all,
      the bd_mutex suddenly comes with a whole host of other dependencies,
      which results in the splat when you mount a btrfs file system.
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
      ------------------------------------------------------
      systemd-journal/509 is trying to acquire lock:
      ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
      
      but task is already holding lock:
      ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
       -> #6 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x13e/0x220
             btrfs_page_mkwrite+0x59/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
       -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x60/0x80
             _copy_from_user+0x20/0xb0
             get_sg_io_hdr+0x9a/0xb0
             scsi_cmd_ioctl+0x1ea/0x2f0
             cdrom_ioctl+0x3c/0x12b4
             sr_block_ioctl+0xa4/0xd0
             block_ioctl+0x3f/0x50
             ksys_ioctl+0x82/0xc0
             __x64_sys_ioctl+0x16/0x20
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #4 (&cd->lock){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             sr_block_open+0xa2/0x180
             __blkdev_get+0xdd/0x550
             blkdev_get+0x38/0x150
             do_dentry_open+0x16b/0x3e0
             path_openat+0x3c9/0xa00
             do_filp_open+0x75/0x100
             do_sys_openat2+0x8a/0x140
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             __blkdev_get+0x6a/0x550
             blkdev_get+0x85/0x150
             blkdev_get_by_path+0x2c/0x70
             btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
             open_fs_devices+0x88/0x240 [btrfs]
             btrfs_open_devices+0x92/0xa0 [btrfs]
             btrfs_mount_root+0x250/0x490 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             vfs_kern_mount.part.0+0x71/0xb0
             btrfs_mount+0x119/0x380 [btrfs]
             legacy_get_tree+0x30/0x50
             vfs_get_tree+0x28/0xc0
             do_mount+0x8c6/0xca0
             __x64_sys_mount+0x8e/0xd0
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_run_dev_stats+0x36/0x420 [btrfs]
             commit_cowonly_roots+0x91/0x2d0 [btrfs]
             btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7b/0x820
             btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
             btrfs_sync_file+0x38a/0x480 [btrfs]
             __x64_sys_fdatasync+0x47/0x80
             do_syscall_64+0x52/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1241/0x20c0
             lock_acquire+0xb0/0x400
             __mutex_lock+0x7b/0x820
             btrfs_record_root_in_trans+0x44/0x70 [btrfs]
             start_transaction+0xd2/0x500 [btrfs]
             btrfs_dirty_inode+0x44/0xd0 [btrfs]
             file_update_time+0xc6/0x120
             btrfs_page_mkwrite+0xda/0x560 [btrfs]
             do_page_mkwrite+0x4f/0x130
             do_wp_page+0x3b0/0x4f0
             handle_mm_fault+0xf47/0x1850
             do_user_addr_fault+0x1fc/0x4b0
             exc_page_fault+0x88/0x300
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
      Possible unsafe locking scenario:
      
           CPU0                    CPU1
           ----                    ----
       lock(sb_pagefaults);
                                   lock(&mm->mmap_lock#2);
                                   lock(sb_pagefaults);
       lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by systemd-journal/509:
       #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
       #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
       #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x92/0xc8
       check_noncircular+0x134/0x150
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? lock_acquire+0xb0/0x400
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       __mutex_lock+0x7b/0x820
       ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       ? kvm_sched_clock_read+0x14/0x30
       ? sched_clock+0x5/0x10
       ? sched_clock_cpu+0xc/0xb0
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       ? sched_clock+0x5/0x10
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7fa3972fdbfe
      Code: Bad RIP value.
      
      Fix this by not holding the ->device_list_mutex at this point.  The
      device_list_mutex exists to protect us from modifying the device list
      while the file system is running.
      
      However, it can also be modified by doing a scan on a device, and that
      action is specifically protected by the uuid_mutex, which we are
      holding here.  We cannot race with opening at this point because we
      have the ->s_umount lock held during the mount.  Not having the
      ->device_list_mutex here is perfectly safe as we're not going to
      change the devices at this point.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add some comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: relocation: review the call sites which can be interrupted by signal · 44d354ab
      Committed by Qu Wenruo
      Since most metadata reservation calls can return -EINTR when
      interrupted by a fatal signal, we need to review all the metadata
      reservation call sites.
      
      In relocation code, the metadata reservation happens in the following
      sites:
      
      - btrfs_block_rsv_refill() in merge_reloc_root()
        merge_reloc_root() is a pretty critical section, and we don't want
        to be interrupted by a signal, so change the flush status to
        BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted by a signal.
        Since such a change can be ENOSPC-prone, also shrink the amount of
        metadata to reserve to the least amount needed, to avoid a deadly
        ENOSPC there.
      
      - btrfs_block_rsv_refill() in reserve_metadata_space()
        It already uses BTRFS_RESERVE_FLUSH_LIMIT, which won't get
        interrupted by a signal.
      
      - btrfs_block_rsv_refill() in prepare_to_relocate()
      
      - btrfs_block_rsv_add() in prepare_to_relocate()
      
      - btrfs_block_rsv_refill() in relocate_block_group()
      
      - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
      
      - btrfs_start_transaction() in relocate_block_group()
      
      - btrfs_start_transaction() in create_reloc_inode()
        These can be interrupted by a fatal signal and we can handle that
        easily.  For these call sites, just catch the -EINTR value in
        btrfs_balance() and count them as canceled (see the sketch below).
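
      As a tiny standalone illustration of "catch -EINTR and count it as
      canceled" (editor's sketch; the helper name and the use of ECANCELED
      as the "canceled" status are assumptions, not taken from this patch):

        #include <errno.h>
        #include <stdio.h>

        /* Remap an interruption into a cancellation status for the caller. */
        static int handle_relocation_ret(int ret)
        {
            if (ret == -EINTR)      /* a fatal signal interrupted the reservation */
                return -ECANCELED;  /* stand-in for "balance was canceled" */
            return ret;
        }

        int main(void)
        {
            printf("-EINTR  -> %d\n", handle_relocation_ret(-EINTR));
            printf("-ENOSPC -> %d\n", handle_relocation_ret(-ENOSPC));
            return 0;
        }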
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: prefetch chunk tree leaves at mount · d85327b1
      Committed by David Sterba
      The whole chunk tree is read at mount time so we can utilize readahead
      to get the tree blocks to memory before we read the items. The idea is
      from Robbie, but instead of updating search slot readahead, this patch
      implements the chunk tree readahead manually from nodes on level 1.
      
      We've decided to do specific readahead optimizations and then unify them
      under a common API so we don't break everything by changing the search
      slot readahead logic.
      
      Higher chunk trees grow on large filesystems (many terabytes), and
      prefetching just level 1 seems to be sufficient. The provided example
      was from a 200TiB filesystem with a chunk tree of level 2.
      
      CC: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always initialize btrfs_bio::tgtdev_map/raid_map pointers · 608769a4
      Committed by Nikolay Borisov
      Since btrfs_bio always contains the extra space for tgtdev_map and
      raid_map, it's pointless to make the assignment only when specific
      conditions are met.
      
      Instead, always assign the pointers to their correct value at
      allocation time. To accommodate this change, also move code a bit in
      __btrfs_map_block so that the btrfs_bio::stripes array is always
      initialized before the raid_map, and move the call to
      sort_parity_stripes into the 'if' that builds the raid_map, retaining
      the old behavior.
      
      To better understand the change, there are two aspects to this:
      
      1. The original code is harder to grasp because the calculations for
         initializing the raid_map/tgtdev pointers are apart from the
         initial allocation of memory. Having them predicated on two
         separate checks doesn't help either. Moving the initialisation
         into alloc_btrfs_bio puts everything together.
      
      2. tgtdev_map/raid_map are now always initialized even though they
         might sometimes be equal, i.e. __btrfs_map_block_for_discard calls
         alloc_btrfs_bio with tgtdev = 0, so their usage should be
         predicated on external checks, i.e. just because those pointers
         are non-null doesn't mean they are valid per se. And while taking
         another look at __btrfs_map_block I saw a discrepancy:
      
         The original code initialised tgtdev_map if the following check
         was true:
      
            if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
      
         However, further down tgtdev_map is only used if the following
         check is true:
      
            if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))
      
         i.e. there is an additional need_full_stripe(op) predicate. (A
         small standalone sketch of the allocation pattern follows below.)
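
      A minimal standalone C sketch of the allocation pattern relied on here
      (editor's illustration with an invented struct, not the kernel's
      btrfs_bio): allocate one buffer with the extra space and aim the map
      pointers into it unconditionally at allocation time.

        #include <stdio.h>
        #include <stdlib.h>

        struct demo_bio {
            int num_stripes;
            int *raid_map;    /* always points into the trailing space */
            int *tgtdev_map;  /* always points into the trailing space */
            int stripes[];    /* trailing space: stripes + raid_map + tgtdev_map */
        };

        static struct demo_bio *alloc_demo_bio(int num_stripes, int real_stripes)
        {
            struct demo_bio *b;

            b = calloc(1, sizeof(*b) +
                          sizeof(int) * (2 * num_stripes + real_stripes));
            if (!b)
                return NULL;

            b->num_stripes = num_stripes;
            /* Assign the pointers unconditionally; whether they are
             * meaningful is decided by external checks later. */
            b->raid_map = b->stripes + num_stripes;
            b->tgtdev_map = b->raid_map + num_stripes;
            return b;
        }

        int main(void)
        {
            struct demo_bio *b = alloc_demo_bio(4, 4);

            if (!b)
                return 1;
            b->raid_map[0] = 42;    /* valid memory even if a caller ignores it */
            printf("raid_map[0] = %d\n", b->raid_map[0]);
            free(b);
            return 0;
        }
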
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ copy more details from mail discussion ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't check for btrfs_device::bdev in btrfs_end_bio · 3eee86c8
      Committed by Nikolay Borisov
      btrfs_map_bio ensures that all bios submitted to devices have a valid
      btrfs_device::bdev, so this check can be removed from btrfs_end_bio.
      The check was added in June 2012 by 597a60fa ("Btrfs: don't count I/O
      statistic read errors for missing devices"), but then in October of
      the same year another commit, de1ee92a ("Btrfs: recheck bio against
      block device when we map the bio"), started checking for the presence
      of btrfs_device::bdev before actually issuing the bio.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: record btrfs_device directly in btrfs_io_bio · c31efbdf
      Committed by Nikolay Borisov
      Instead of recording stripe_index and using that to access the correct
      btrfs_device from btrfs_bio::stripes, record the btrfs_device directly
      in btrfs_io_bio. This will enable endio handlers to increment the
      device error counters on checksum errors.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow use of global block reserve for balance item deletion · 3502a8c0
      Committed by David Sterba
      On a filesystem with exhausted metadata space, but still enough to
      start a balance, it's possible to hit this error:
      
      [324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
      [324402.060769] BTRFS info (device loop0): balance: ended with status: -28
      [324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
      
      It fails inside reset_balance_state and turns the filesystem
      read-only, which is unnecessary and should be fixed too, but the
      problem here is caused by lack of space when the balance item is
      deleted. This is a one-time operation of the same rank as unlink,
      which is allowed to use the global block reserve, so do the same for
      the balance item.
      
      Status of the filesystem (100GiB) just after the balance fails:
      
      $ btrfs fi df mnt
      Data, single: total=80.01GiB, used=38.58GiB
      System, single: total=4.00MiB, used=16.00KiB
      Metadata, single: total=19.99GiB, used=19.48GiB
      GlobalReserve, single: total=512.00MiB, used=50.11MiB
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 22 July 2020, 1 commit
    • btrfs: fix mount failure caused by race with umount · 48cfa61b
      Committed by Boris Burkov
      It is possible to cause a btrfs mount to fail by racing it with a slow
      umount. The crux of the sequence is generic_shutdown_super not yet
      calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
      If that occurs, btrfs_open_devices will decide the opened counter is
      non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
      0. From here, mount will call sget which will result in grab_super
      trying to take the super block umount semaphore. That semaphore will be
      held by the slow umount, so mount will block. Before up-ing the
      semaphore, umount will delete the super block, resulting in mount's sget
      reliably allocating a new one, which causes the mount path to dutifully
      fill it out, and increment total_rw_bytes a second time, which causes
      the mount to fail, as we see double the expected bytes.
      
      Here is the sequence laid out in greater detail:
      
      CPU0                                                    CPU1
      down_write sb->s_umount
      btrfs_kill_super
        kill_anon_super(sb)
          generic_shutdown_super(sb);
            shrink_dcache_for_umount(sb);
            sync_filesystem(sb);
            evict_inodes(sb); // SLOW
      
                                                    btrfs_mount_root
                                                      btrfs_scan_one_device
                                                      fs_devices = device->fs_devices
                                                      fs_info->fs_devices = fs_devices
                                                      // fs_devices-opened makes this a no-op
                                                      btrfs_open_devices(fs_devices, mode, fs_type)
                                                      s = sget(fs_type, test, set, flags, fs_info);
                                                        find sb in s_instances
                                                        grab_super(sb);
                                                          down_write(&s->s_umount); // blocks
      
            sop->put_super(sb)
              // sb->fs_devices->opened == 2; no-op
            spin_lock(&sb_lock);
            hlist_del_init(&sb->s_instances);
            spin_unlock(&sb_lock);
            up_write(&sb->s_umount);
                                                          return 0;
                                                        retry lookup
                                                        don't find sb in s_instances (deleted by CPU0)
                                                        s = alloc_super
                                                        return s;
                                                      btrfs_fill_super(s, fs_devices, data)
                                                        open_ctree // fs_devices total_rw_bytes improperly set!
                                                          btrfs_read_chunk_tree
                                                            read_one_dev // increment total_rw_bytes again!!
                                                            super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
      
      To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
      before the calls to read_one_dev, while holding the sb umount semaphore
      and the uuid mutex.
      
      To reproduce, it is sufficient to dirty a decent number of inodes, then
      quickly umount and mount.
      
        for i in $(seq 0 500)
        do
          dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
        done
        umount /mnt/foo&
        mount /mnt/foo
      
      does the trick for me.
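
      The accounting problem can be modelled with a tiny standalone C sketch
      (editor's illustration with made-up names and numbers, not the kernel
      structures): if the accumulated counter is not cleared before the
      device items are read again, the sanity check sees double the real
      size; clearing it first, as the fix does in btrfs_read_chunk_tree,
      keeps the invariant.

        #include <stdint.h>
        #include <stdio.h>

        static uint64_t total_rw_bytes;   /* toy stand-in for the fs_devices counter */

        static void read_one_dev(uint64_t dev_bytes)
        {
            total_rw_bytes += dev_bytes;  /* incremented on every chunk tree read */
        }

        static int read_chunk_tree(uint64_t dev_bytes, int clear_first)
        {
            uint64_t super_total_bytes = dev_bytes;

            if (clear_first)
                total_rw_bytes = 0;       /* the fix: reset before re-reading */
            read_one_dev(dev_bytes);

            if (super_total_bytes < total_rw_bytes) {
                fprintf(stderr, "mount fails: %llu < %llu\n",
                        (unsigned long long)super_total_bytes,
                        (unsigned long long)total_rw_bytes);
                return -1;
            }
            printf("mount ok: total_rw_bytes=%llu\n",
                   (unsigned long long)total_rw_bytes);
            return 0;
        }

        int main(void)
        {
            /* Racing mounts without the reset: the second read double-counts. */
            read_chunk_tree(1ULL << 30, 0);
            read_chunk_tree(1ULL << 30, 0);   /* fails the check */

            /* With the reset, repeated reads stay consistent. */
            read_chunk_tree(1ULL << 30, 1);
            read_chunk_tree(1ULL << 30, 1);   /* passes */
            return 0;
        }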
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 25 May 2020, 5 commits
    • btrfs: drop stale reference to volume_mutex · ae3e715f
      Committed by Anand Jain
      Commit dccdb07b ("btrfs: kill btrfs_fs_info::volume_mutex") removed
      the last use of the volume_mutex, forgetting to update the comment.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: free alien device after device add · 7f551d96
      Committed by Anand Jain
      When an old device gets a new fsid through 'btrfs device add -f <dev>',
      our fs_devices list has an alien device in one of the fs_devices lists.
      
      By having an alien device in fs_devices, we have two issues so far:
      
      1. a missing device does not show as missing in userland
      
      2. a degraded mount will fail
      
      Both issues are caused by the fact that there's an alien device in the
      fs_devices list. (Alien means that it does not belong to the filesystem,
      identified by fsid, or does not contain btrfs filesystem at all, eg. due
      to overwrite).
      
      A device can be scanned/added through the control device ioctls
      SCAN_DEV, DEVICES_READY or by ADD_DEV.
      
      A device coming through the control device is checked against all the
      other devices in the lists, but this was not the case for ADD_DEV.
      
      This patch fixes both issues above by removing the alien device.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: include non-missing as a qualifier for the latest_bdev · 998a0671
      Committed by Anand Jain
      btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
      the bdev with the greatest device::generation number.  For a typical
      missing device the generation number is zero, so
      fs_devices::latest_bdev will never point to it.
      
      But if the missing device is due to alienation [1], then
      device::generation is not zero, and if it is greater than or equal to
      the rest of the device generations in the list, then
      fs_devices::latest_bdev ends up pointing to the missing device and we
      report an error like [2].
      
      [1] We maintain devices of an fsid (as in fs_device::fsid) in the
      fs_devices::devices list; a device is considered an alien device
      if its fsid does not match the fs_device::fsid.
      
      Consider a working filesystem with raid1:
      
        $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
        $ mount /dev/sda /mnt-raid1
        $ umount /mnt-raid1
      
      While mnt-raid1 was unmounted the user force-adds one of its devices to
      another btrfs filesystem:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt-single
        $ btrfs dev add -f /dev/sda /mnt-single
      
      Now the original mnt-raid1 fails to mount in degraded mode, because
      fs_devices::latest_bdev is pointing to the alien device.
      
        $ mount -o degraded /dev/sdb /mnt-raid1
      
      [2]
      mount: wrong fs type, bad option, bad superblock on /dev/sdb,
             missing codepage or helper program, or other error
      
             In some cases useful info is found in syslog - try
             dmesg | tail or so.
      
        kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
        kernel: BTRFS error (device sdb): failed to read devices
        kernel: BTRFS error (device sdb): open_ctree failed
      
      Fix the root cause by checking if the device is not missing before it
      can be considered for the fs_devices::latest_bdev.
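
      A standalone C sketch of the selection rule described above (editor's
      illustration with invented types, not the kernel code): pick the
      highest-generation device, but only among devices that are not
      missing, so an alienated entry can never become latest_bdev.

        #include <stdbool.h>
        #include <stdio.h>

        struct demo_device {
            const char *name;
            unsigned long long generation;
            bool missing;   /* no usable bdev behind this entry */
        };

        /* Return the non-missing device with the greatest generation, or NULL. */
        static const struct demo_device *pick_latest(const struct demo_device *devs,
                                                     int nr)
        {
            const struct demo_device *latest = NULL;

            for (int i = 0; i < nr; i++) {
                if (devs[i].missing)    /* the added qualifier: skip missing */
                    continue;
                if (!latest || devs[i].generation > latest->generation)
                    latest = &devs[i];
            }
            return latest;
        }

        int main(void)
        {
            /* sda was alienated to another filesystem, so it shows up as
             * missing here even though its generation is the highest. */
            struct demo_device devs[] = {
                { "sda", 100, true  },
                { "sdb",  90, false },
            };
            const struct demo_device *latest = pick_latest(devs, 2);

            printf("latest_bdev candidate: %s\n", latest ? latest->name : "none");
            return 0;
        }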
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop useless goto in open_fs_devices · 1ed802c9
      Committed by Anand Jain
      There is no need for the goto out in open_fs_devices() as nothing
      special is done there.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_read_disk_super return struct btrfs_disk_super · b335eab8
      Committed by Nikolay Borisov
      Instead of returning both the page and the super block structure, make
      btrfs_read_disk_super just return a pointer to struct btrfs_disk_super.
      As a result the function signature is simplified. Also,
      read_cache_page_gfp can never return NULL so check its return value only
      for IS_ERR.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 24 March 2020, 25 commits