1. 21 June 2021, 3 commits
    • btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE · 8df507cb
      Committed by Qu Wenruo
      [BUG]
      For the following file layout, scrub will not be able to repair both of
      these repairable errors, and will in fact make one corruption
      unrepairable:
      
      	  inode offset 0      4k     8K
      Mirror 1               |XXXXXX|      |
      Mirror 2               |      |XXXXXX|
      
      [CAUSE]
      The root cause is the hard coded PAGE_SIZE, which makes scrub repair
      misbehave on subpage filesystems.
      
      For the above case, when reading the first sector, we use PAGE_SIZE
      instead of sectorsize, which makes us read the full range [0, 64K).
      In fact, after 8K there may be no data at all, so we may just read
      garbage.
      
      Then when doing the repair, we also write back a full page from mirror 2.
      This means we write the corrupted data in mirror 2 back to mirror 1,
      leaving the range [4K, 8K) unrepairable.
      
      [FIX]
      This patch replaces the following PAGE_SIZE uses with sectorsize:
      
      - scrub_print_warning_inode()
        Remove the min() and replace PAGE_SIZE with sectorsize.
        The min() makes no sense, as csum is done for the full sector with
        padding.
      
        This fixes a bug where subpage reports extra length like:
         checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
         physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
      
        Where the error is only 1 sector.
      
      - scrub_handle_errored_block()
        All comments mentioning PAGE/page are changed to sector.
      
      - scrub_setup_recheck_block()
      - scrub_repair_page_from_good_copy()
      - scrub_add_page_to_wr_bio()
      - scrub_wr_submit()
      - scrub_add_page_to_rd_bio()
      - scrub_block_complete()
        Replace PAGE_SIZE with sectorsize.
        This solves several problems where we read/write an extra range in the
        subpage case.
      
      RAID56 code is intentionally excluded, as RAID56 has extra PAGE_SIZE
      usage and is not yet safe enough.
      Thus we will reject RAID56 for subpage in a later commit.
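      To illustrate the granularity change for the layout above, here is a
      small user-space sketch (plain C, not the kernel code): with per-sector
      repair the good sector of mirror 1 is left alone, while page-sized
      repair would have copied mirror 2's corrupted sector over it.

        #include <stdio.h>
        #include <string.h>

        #define SECTORSIZE 4096u
        #define PAGE_SZ    65536u  /* 64K pages, as on some arm64/ppc64 hosts */

        int main(void)
        {
                char mirror1[PAGE_SZ], mirror2[PAGE_SZ];

                memset(mirror1, 'A', sizeof(mirror1));
                memset(mirror2, 'A', sizeof(mirror2));
                memset(mirror1, 'X', SECTORSIZE);              /* mirror 1: bad sector 0 */
                memset(mirror2 + SECTORSIZE, 'X', SECTORSIZE); /* mirror 2: bad sector 1 */

                /* Fixed behavior: repair exactly one sector from the good mirror.
                 * With the old PAGE_SIZE granularity this copy would have covered
                 * the whole page and brought mirror 2's bad sector 1 along. */
                memcpy(mirror1, mirror2, SECTORSIZE);

                printf("sector 1 of mirror 1 still intact: %s\n",
                       mirror1[SECTORSIZE] == 'A' ? "yes" : "no");
                return 0;
        }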
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: factor out common scrub_stripe constraints · 7735cd75
      Committed by David Sterba
      There are common values set for the stripe constraints, some of them
      are already factored out. Do that for increment and mirror_num as well.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: per-device bandwidth control · eb3b5053
      Committed by David Sterba
      Add a sysfs interface to limit IO during scrub. We used to rely on the
      ionice interface to do that, e.g. the idle class kept the system usable
      while scrub was running. This changed when mq-deadline became
      widespread, as it did not implement the scheduling classes; that was a
      CFQ feature, and CFQ got deleted. We've got numerous complaints from
      users about degraded performance.
      
      Currently only BFQ supports that but it's not a common scheduler and we
      can't ask everybody to switch to it.
      
      Alternatively, cgroup IO limiting can be used, but that is also a
      non-trivial setup (v2 required, the controller must be enabled on the
      system). This can still be used if desired.
      
      Another idea that has been explored: piggy-back on ionice (which is set
      per-process and is accessible) and interpret the class and classdata as
      bandwidth limits. But this does not have enough flexibility, as there
      are only 8 allowed values and we'd have to map fixed limits to each of
      them. Also, adjusting the value would require looking up the process
      that currently runs scrub on the given device, and the value is not
      sticky, so it would have to be adjusted each time scrub runs.
      
      Running out of options, sysfs does not look that bad:
      
      - it's accessible from scripts, or udev rules
      - the name is similar to what MD-RAID has
        (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
      - the value is sticky at least for filesystem mount time
      - adjusting the value has immediate effect
      - sysfs is available in constrained environments (eg. system rescue)
      - the limit also applies to device replace
      
      Sysfs:
      
      - raw value is in bytes
      - values written to the file accept suffixes like K, M
      - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
      - 0 means use default priority of IO
      
      The scheduler is a simple deadline one and the accuracy is up to the
      nearest 128K.
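      As a rough user-space sketch of such a deadline throttle (the names, the
      one-second slice and the reset logic are illustrative assumptions, not
      the kernel implementation), the idea is to account submitted bytes and
      sleep once the per-slice budget is exhausted:

        #define _POSIX_C_SOURCE 200809L
        #include <stdint.h>
        #include <time.h>

        struct throttle {
                uint64_t limit;           /* bytes per second, 0 = unlimited */
                uint64_t consumed;        /* bytes accounted in the current slice */
                struct timespec deadline; /* end of the current slice; a zeroed
                                             deadline is in the past, so the first
                                             sleep returns immediately */
        };

        void throttle_io(struct throttle *t, uint64_t bytes)
        {
                struct timespec now;

                if (t->limit == 0)
                        return;

                t->consumed += bytes;
                if (t->consumed < t->limit)
                        return;

                /* Budget used up: wait for the slice to end, then start a new one. */
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &t->deadline, NULL);
                clock_gettime(CLOCK_MONOTONIC, &now);
                t->deadline.tv_sec = now.tv_sec + 1;
                t->deadline.tv_nsec = now.tv_nsec;
                t->consumed = 0;
        }

      In practice, writing a value such as 300M to
      /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max caps scrub on that
      device, and writing 0 removes the limit again.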
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 21 April 2021, 1 commit
  3. 19 April 2021, 1 commit
  4. 11 March 2021, 1 commit
  5. 23 February 2021, 1 commit
    • btrfs: fix race between writes to swap files and scrub · 195a49ea
      Committed by Filipe Manana
      When we activate a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents from being changed by operations such as balance and device
      replace/resize/remove. There we also call can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents being
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group RO before it processes it and
      then back to RW mode after processing it. That means an attempt to
      write into a swap file extent while scrub is processing the respective
      block group will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents used by
      active swap files can not be turned into RO mode, therefore making it
      impossible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a
      warning message that mentions the block group was skipped due to active
      swap files - this is the same approach we currently use for balance.
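      A minimal sketch of that idea (field and function names are
      illustrative, not the final kernel API): the block group counts the
      extents that belong to active swap files, and the RO transition is
      refused while that count is non-zero.

        #include <stdbool.h>
        #include <stdio.h>

        struct block_group {
                int ro;            /* RO reference count */
                int swap_extents;  /* extents used by active swap files */
        };

        bool block_group_try_set_ro(struct block_group *bg)
        {
                if (bg->swap_extents > 0) {
                        fprintf(stderr,
                                "skipping block group with active swap file extents\n");
                        return false;
                }
                bg->ro++;
                return true;
        }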
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 09 February 2021, 4 commits
    • btrfs: zoned: relocate block group to repair IO failure in zoned filesystems · f7ef5287
      Committed by Naohiro Aota
      When a bad checksum is found and the filesystem has a mirror of the
      damaged data, we read the correct data from the mirror and write it to
      the damaged blocks. This, however, violates the sequential write
      constraint of a zoned block device.
      
      We can consider three methods to repair an IO failure in zoned filesystems:
      
      (1) Reset and rewrite the damaged zone
      (2) Allocate a new device extent and replace the damaged device extent
          with the new one
      (3) Relocate the corresponding block group
      
      Method (1) is the most similar to the behavior on regular devices.
      However, it also wipes non-damaged data in the same device extent, and
      so it unnecessarily degrades non-damaged data.
      
      Method (2) is much like device replace but done within the same device.
      It is safe because it keeps the device extent until the replacing
      finishes. However, extending device replace is non-trivial. It assumes
      "src_dev->physical == dst_dev->physical". Also, the extent mapping
      replacing function would need to be extended to support replacing a
      device extent position within one device.
      
      Method (3) invokes relocation of the damaged block group and is
      straightforward to implement. It relocates all the mirrored device
      extents, so it is potentially a more costly operation than method (1)
      or (2). But it relocates only used extents, which reduces the total IO
      size.
      
      Let's apply method (3) for now. In the future, we can extend device-replace
      and apply method (2).
      
      To protect a block group from being relocated multiple times due to
      multiple IO errors, this commit introduces a "relocating_repair" bit to
      show that it is now being relocated to repair IO failures. It also uses
      a new kthread, "btrfs-relocating-repair", so that the relocation does
      not block the IO path.
      
      This commit also supports repairing in the scrub process.
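      A sketch of that flow (the names follow the commit description, the rest
      is illustrative): the IO completion path only sets the bit once and
      defers the actual relocation to the background worker.

        #include <stdatomic.h>
        #include <stdbool.h>

        struct block_group {
                unsigned long long start;      /* logical address of the group */
                atomic_flag relocating_repair; /* initialize with ATOMIC_FLAG_INIT */
        };

        bool repair_one_zone(struct block_group *bg)
        {
                /* Relocate each damaged block group at most once at a time. */
                if (atomic_flag_test_and_set(&bg->relocating_repair))
                        return false;

                /*
                 * The kernel queues the real work to the dedicated
                 * "btrfs-relocating-repair" kthread, which marks the group
                 * read-only, relocates it and clears the flag when done.
                 */
                return true;
        }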
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: support dev-replace in zoned filesystems · 7db1c5d1
      Committed by Naohiro Aota
      This is the 4/4 patch to implement device-replace on zoned filesystems.
      
      Even after the copying is done, the write pointers of the source device
      and the destination device may not be synchronized. For example, when
      the last allocated extent is freed before the device-replace process,
      the extent is not copied, leaving a hole there.
      
      Synchronize the write pointers by writing zeroes to the destination
      device.
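      A sketch of that synchronization (illustrative, operating on an
      in-memory buffer instead of a real zone): if the destination write
      pointer lags behind the source one after the copy, the difference is
      padded with zeroes.

        #include <stdint.h>
        #include <string.h>

        void sync_write_pointers(uint8_t *dst_zone, uint64_t dst_wp, uint64_t src_wp)
        {
                if (dst_wp < src_wp)
                        memset(dst_zone + dst_wp, 0, src_wp - dst_wp);
        }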
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: implement copying for zoned device-replace · de17addc
      Committed by Naohiro Aota
      This is the 3/4 patch to implement device-replace on zoned filesystems.
      
      This commit implements copying. To do this, it tracks the write pointer
      during the device replace process. As device-replace's copy process is
      smart enough to only copy used extents on the source device, we have to
      fill the gaps to honor the sequential write requirement on the target
      device.
      
      The device-replace process on zoned filesystems must copy or clone all
      the extents in the source device exactly once. So, we need to ensure
      that allocations started just before the dev-replace process have their
      corresponding extent information in the B-trees.
      finish_extent_writes_for_zoned() implements that functionality, which is
      basically the code removed in commit 042528f8 ("Btrfs: fix block group
      remaining RO forever after error during device replace").
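      A sketch of the copy step (illustrative, in-memory buffers instead of
      real zones): used extents are visited in physical order and any hole
      before the next extent is zero-filled so the target zone is still
      written strictly sequentially.

        #include <stdint.h>
        #include <string.h>

        struct extent {
                uint64_t physical;  /* offset inside the zone */
                uint64_t len;
        };

        uint64_t copy_extent_sequentially(uint8_t *dst, uint64_t write_pointer,
                                          const uint8_t *src, const struct extent *e)
        {
                if (e->physical > write_pointer)
                        memset(dst + write_pointer, 0, e->physical - write_pointer);

                memcpy(dst + e->physical, src + e->physical, e->len);
                return e->physical + e->len;  /* the new write pointer */
        }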
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: mark block groups to copy for device-replace · 78ce9fc2
      Committed by Naohiro Aota
      This is the 1/4 patch to support device-replace on zoned filesystems.
      
      We have two types of IOs during the device replace process. One is an IO
      to "copy" (by the scrub functions) all the device extents from the source
      device to the destination device. The other one is an IO to "clone" (by
      handle_ops_on_dev_replace()) new incoming write IOs from users to the
      source device into the target device.
      
      Cloning incoming IOs can break the sequential write rule on the target
      device. When a write is mapped into the middle of a block group, the IO
      is directed to the middle of a target device zone, which breaks the
      sequential write requirement.
      
      However, the cloning function cannot be disabled since incoming IOs
      targeting already copied device extents must be cloned so that the IO is
      executed on the target device.
      
      We cannot use dev_replace->cursor_{left,right} to determine whether a bio
      is going to a not yet copied region. Since we have a time gap between
      finishing btrfs_scrub_dev() and rewriting the mapping tree in
      btrfs_dev_replace_finishing(), we can have a newly allocated device extent
      which is never cloned nor copied.
      
      So the point is to copy only already existing device extents. This patch
      introduces mark_block_group_to_copy() to mark existing block groups as a
      target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
      check the flag to do their job.
      
      Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
      is the last stripe in the block group. With the last stripe copied,
      the to_copy flag is finally disabled. Afterwards we can safely clone
      incoming IOs on this block group.
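      A sketch of that gating (illustrative names): writes to a block group
      that is still marked to_copy are not cloned, and the flag is dropped
      once the last stripe has been copied.

        #include <stdbool.h>

        struct block_group {
                bool to_copy;  /* still owned by the scrub/copy phase */
        };

        bool should_clone_incoming_write(const struct block_group *bg)
        {
                return !bg->to_copy;
        }

        void finish_block_group_copy(struct block_group *bg, bool last_stripe)
        {
                if (last_stripe)
                        bg->to_copy = false;  /* cloning may start now */
        }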
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 25 January 2021, 1 commit
  8. 10 December 2020, 6 commits
    • btrfs: scrub: allow scrub to work with subpage sectorsize · b42fe98c
      Committed by Qu Wenruo
      Since btrfs scrub uses its own infrastructure to submit reads and
      writes, scrub is independent from all other routines.
      
      This brings one very neat feature: it allows us to read 4K of data into
      offset 0 of a 64K page.  The same applies to the writeback routine.
      
      This makes scrub on subpage sector sizes much easier to implement, and
      thanks to the previous commits which changed the implementation to
      always do scrub based on sector size, scrub can now handle subpage
      filesystems without any problem.
      
      This patch will just remove the restriction on
      (sectorsize != PAGE_SIZE), to make scrub finally work on subpage
      filesystems.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: support subpage data scrub · b29dca44
      Committed by Qu Wenruo
      Btrfs scrub is more flexible than the buffered data write path, as we
      can read unaligned subpage data into page offset 0.
      
      This ability makes subpage support much easier: we just need to check
      each scrub_page::page_len and ensure we only calculate the hash for
      [0, page_len) of a page.
      
      There is a small thing to notice: for the subpage case, we still do
      sector-by-sector scrub.  This means we will submit a read bio for each
      sector to scrub, resulting in the same number of read bios as on 4K
      page systems.
      
      This behavior can be considered a good thing, if we want everything to
      be the same as on 4K page systems.  But it also means we're missing the
      opportunity to submit larger bios using the 64K page size.  This is
      another problem to consider in the future.
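      A sketch of the per-length checksum (a toy hash stands in for the real
      crc32c, and the structure is illustrative): each unit records its own
      length and the digest covers exactly [0, page_len).

        #include <stddef.h>
        #include <stdint.h>

        struct scrub_unit {
                const uint8_t *kaddr;  /* start of the data inside the page */
                size_t page_len;       /* sectorsize on subpage, PAGE_SIZE otherwise */
        };

        uint32_t scrub_data_csum(const struct scrub_unit *u)
        {
                uint32_t sum = 0;

                for (size_t i = 0; i < u->page_len; i++)
                        sum = sum * 31 + u->kaddr[i];  /* stand-in for the real hash */
                return sum;
        }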
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: support subpage tree block scrub · 53f3251d
      Committed by Qu Wenruo
      To support subpage tree block scrub, scrub_checksum_tree_block() only
      needs to learn 2 new tricks:
      
      - Follow sector size
        Now that scrub_page only represents one sector, we need to follow it
        properly.
      
      - Run checksum on all sectors
        Since scrub_page only represents one sector, we need to run the
        checksum on all sectors, not only over the first
        (nodesize >> PAGE_SHIFT) pages (see the sketch below).
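      A sketch of the second point (toy hash, illustrative structure): the
      loop feeds every sector of the tree block into the hash, skipping only
      the embedded checksum area at the start of the first sector.

        #include <stddef.h>
        #include <stdint.h>

        #define CSUM_SIZE 32u  /* checksum area at the start of a tree block */

        struct scrub_unit {
                const uint8_t *kaddr;
                size_t len;  /* sectorsize */
        };

        uint32_t scrub_tree_block_csum(const struct scrub_unit *units,
                                       size_t nodesize, size_t sectorsize)
        {
                uint32_t sum = 0;
                size_t nr = nodesize / sectorsize;

                for (size_t i = 0; i < nr; i++) {
                        size_t start = (i == 0) ? CSUM_SIZE : 0;

                        for (size_t j = start; j < units[i].len; j++)
                                sum = sum * 31 + units[i].kaddr[j];
                }
                return sum;
        }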
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: always allocate one full page for one sector for RAID56 · d0a7a9c0
      Committed by Qu Wenruo
      For scrub_pages() and scrub_pages_for_parity(), we currently allocate
      one scrub_page structure for one page.
      
      This is fine if we only read/write one sector at a time.  But for cases
      like scrubbing RAID56, we need to read/write the full stripe, which is
      64K in size for now.
      
      For subpage, we would submit the read in just one page, which is
      normally a good thing, but the RAID56 endio function expects to see only
      one sector, not the full stripe.
      This could lead to a wrong parity checksum for RAID56 on subpage.
      
      To make the existing code work well for the subpage case, we take a
      shortcut here by always allocating a full page for one sector.
      
      This should provide the base to make RAID56 work for subpage case.
      
      The cost is pretty obvious now: for one RAID56 stripe we always need 16
      pages.  For the supported subpage setup (64K page size, 4K sector size),
      this means we need a full megabyte to scrub just one RAID56 stripe.
      
      And for data scrub, each 4K sector will also need one 64K page.
      
      This is mostly just a workaround; the proper fix is a much larger
      project: replacing scrub_page with scrub_block, and allowing scrub_block
      to handle multiple pages, csums, and a csum_bitmap to avoid allocating
      one page for each sector.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits · fa485d21
      Committed by Qu Wenruo
      The btrfs on-disk format chose to use u64 for almost everything, but
      there are other restrictions that won't let us use more than u32 for
      things like extent length (the maximum length is 128MiB for non-hole
      extents) or stripe length (we have a device number limit).
      
      This means if we don't have extra handling to convert u64 to u32, we
      will always have some questionable operations like
      "u32 = u64 >> sectorsize_bits" in the code.
      
      This patch tries to address the problem by reducing the width of the
      following members/parameters:
      
      - scrub_parity::stripe_len
      - @len of scrub_pages()
      - @extent_len of scrub_remap_extent()
      - @len of scrub_parity_mark_sectors_error()
      - @len of scrub_parity_mark_sectors_data()
      - @len of scrub_extent()
      - @len of scrub_pages_for_parity()
      - @len of scrub_extent_for_parity()
      
      Members extracted from on-disk structures, like map->stripe_len, are
      kept as is, since changing them would require an on-disk format change.
      
      There will be cases like "u32 = u64 - u64" or "u32 = u64"; for such call
      sites, an extra ASSERT() is added to be extra safe in debug builds.
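      The narrowing pattern looks roughly like this (user-space sketch; the
      helper name is illustrative and assert() stands in for the kernel's
      debug-only ASSERT()):

        #include <assert.h>
        #include <stdint.h>

        uint32_t u64_to_u32_len(uint64_t len)
        {
                assert(len <= UINT32_MAX);  /* ASSERT() in the kernel code */
                return (uint32_t)len;
        }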
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: implement log-structured superblock for ZONED mode · 12659251
      Committed by Naohiro Aota
      The superblock (and its copies) is the only data structure in btrfs
      which has a fixed location on a device. Since we cannot overwrite in a
      sequential write required zone, we cannot place the superblock in such a
      zone. One easy solution is to limit the superblock and its copies to
      conventional zones only.  However, this method has two downsides. One is
      a reduced number of superblock copies: the location of the second copy
      of the superblock is at 256GB, which is in a sequential write required
      zone on typical devices on the market today, so the number of superblock
      copies would be limited to two.  The second downside is that we cannot
      support devices which have no conventional zones at all.
      
      To solve these two problems, we employ superblock log writing. It uses
      two adjacent zones as a circular buffer to write updated superblocks.
      Once the first zone is filled up, we start writing into the second one.
      Then, when both zones are filled up and before starting to write to the
      first zone again, we reset the first zone.
      
      We can determine the position of the latest superblock by reading write
      pointer information from a device. One corner case is when both zones
      are full. For this situation, we read out the last superblock of each
      zone, and compare them to determine which zone is older.
      
      The following zones are reserved as the circular buffer on ZONED btrfs.
      
      - The primary superblock: zones 0 and 1
      - The first copy: zones 16 and 17
      - The second copy: zone 1024 or the zone at 256GB, whichever is smaller,
        and the zone next to it
      
      If these reserved zones are conventional, the superblock is written at a
      fixed position at the start of the zone, without logging.
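      A simplified sketch of the "which copy is newest" decision described
      above (field names are illustrative; it assumes at least one superblock
      has been written and does not model the wrap-around after a zone
      reset):

        #include <stdint.h>

        struct sb_zone {
                uint64_t start, wp, end;  /* byte offsets on the device */
                uint64_t last_gen;        /* generation of the last superblock */
        };

        uint64_t newest_sb_offset(const struct sb_zone *z0, const struct sb_zone *z1,
                                  uint64_t sb_size)
        {
                int full0 = (z0->wp == z0->end);
                int full1 = (z1->wp == z1->end);

                if (full0 && full1)      /* corner case: compare generations */
                        return (z0->last_gen > z1->last_gen ? z0->end : z1->end)
                                - sb_size;
                if (z1->wp > z1->start)  /* zone 0 filled up, writing in zone 1 */
                        return z1->wp - sb_size;
                return z0->wp - sb_size; /* still filling zone 0 */
        }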
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 08 December 2020, 9 commits
  10. 05 November 2020, 1 commit
  11. 07 October 2020, 1 commit
  12. 27 August 2020, 1 commit
    • btrfs: allocate scrub workqueues outside of locks · e89c4a9c
      Committed by Josef Bacik
      I got the following lockdep splat while testing:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00172-g021118712e59 #932 Not tainted
        ------------------------------------------------------
        btrfs/229626 is trying to acquire lock:
        ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
      
        but task is already holding lock:
        ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_scrub_dev+0x11c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_run_dev_stats+0x49/0x480
      	 commit_cowonly_roots+0xb5/0x2a0
      	 btrfs_commit_transaction+0x516/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_commit_transaction+0x4bb/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_record_root_in_trans+0x43/0x70
      	 start_transaction+0xd1/0x5d0
      	 btrfs_dirty_inode+0x42/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 perf_read+0x141/0x2c0
      	 vfs_read+0xad/0x1b0
      	 ksys_read+0x5f/0xe0
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x88/0x150
      	 perf_event_init+0x1db/0x20b
      	 start_kernel+0x3ae/0x53c
      	 secondary_startup_64+0xa4/0xb0
      
        -> #1 (pmus_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x4f/0x150
      	 cpuhp_invoke_callback+0xb1/0x900
      	 _cpu_up.constprop.26+0x9f/0x130
      	 cpu_up+0x7b/0xc0
      	 bringup_nonboot_cpus+0x4f/0x60
      	 smp_init+0x26/0x71
      	 kernel_init_freeable+0x110/0x258
      	 kernel_init+0xa/0x103
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (cpu_hotplug_lock){++++}-{0:0}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 cpus_read_lock+0x39/0xb0
      	 alloc_workqueue+0x378/0x450
      	 __btrfs_alloc_workqueue+0x15d/0x200
      	 btrfs_alloc_workqueue+0x51/0x160
      	 scrub_workers_get+0x5a/0x170
      	 btrfs_scrub_dev+0x18c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->scrub_lock);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->scrub_lock);
          lock(cpu_hotplug_lock);
      
         *** DEADLOCK ***
      
        2 locks held by btrfs/229626:
         #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
         #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        stack backtrace:
        CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? alloc_workqueue+0x378/0x450
         cpus_read_lock+0x39/0xb0
         ? alloc_workqueue+0x378/0x450
         alloc_workqueue+0x378/0x450
         ? rcu_read_lock_sched_held+0x52/0x80
         __btrfs_alloc_workqueue+0x15d/0x200
         btrfs_alloc_workqueue+0x51/0x160
         scrub_workers_get+0x5a/0x170
         btrfs_scrub_dev+0x18c/0x630
         ? start_transaction+0xd1/0x5d0
         btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
         btrfs_ioctl+0x2799/0x30a0
         ? do_sigaction+0x102/0x250
         ? lockdep_hardirqs_on_prepare+0xca/0x160
         ? _raw_spin_unlock_irq+0x24/0x30
         ? trace_hardirqs_on+0x1c/0xe0
         ? _raw_spin_unlock_irq+0x24/0x30
         ? do_sigaction+0x102/0x250
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because we're allocating the scrub workqueues under the
      scrub and device list mutex, which brings in a whole host of other
      dependencies.
      
      Because the workqueue allocation is done with GFP_KERNEL, it can trigger
      reclaim, which can lead to a transaction commit, which in turn needs the
      device_list_mutex, and that can lead to a deadlock. That is a different
      problem, for which this fix is also a solution.
      
      Fix this by moving the actual allocation outside of the scrub lock, and
      then only take the lock once we're ready to actually assign the
      workqueues to the fs_info.  We now have to clean up the workqueues in a
      few more places, so I've added a helper to do the refcount dance to
      safely free them.
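      The shape of the fix, sketched with POSIX primitives (the kernel code
      uses btrfs workqueues, scrub_lock and a refcount; names here are
      illustrative): allocate with no locks held, publish under the lock, and
      free the spare allocation if another scrub already installed one.

        #include <pthread.h>
        #include <stdlib.h>

        struct workqueue { int unused; };

        struct fs_like {
                pthread_mutex_t scrub_lock;
                struct workqueue *scrub_workers;
                int scrub_workers_refcnt;
        };

        int scrub_workers_get(struct fs_like *fs)
        {
                struct workqueue *wq = malloc(sizeof(*wq));  /* may sleep/reclaim */

                if (!wq)
                        return -1;

                pthread_mutex_lock(&fs->scrub_lock);
                if (fs->scrub_workers_refcnt++ == 0)
                        fs->scrub_workers = wq;              /* we won the race */
                else
                        free(wq);                            /* someone beat us */
                pthread_mutex_unlock(&fs->scrub_lock);
                return 0;
        }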
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 27 July 2020, 10 commits
    • btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Committed by Josef Bacik
      Eric reported seeing this message while running generic/475
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If we get an aborted
      transaction via EIO before, we'll see it in btree_write_cache_pages()
      and return EUCLEAN, which gets printed as "Filesystem corrupted".
      
      Except we shouldn't be returning EUCLEAN here; we need to be returning
      EROFS, because EUCLEAN is reserved for actual corruption, not IO errors.
      
      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction; all subsequent actions just need to return
      EROFS, because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
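      The error mapping boils down to something like this (illustrative
      helper, not the kernel code; the bit number stands in for
      BTRFS_FS_STATE_ERROR):

        #include <errno.h>

        #define FS_STATE_ERROR_BIT 0  /* stand-in for BTRFS_FS_STATE_ERROR */

        /* Once an abort has flipped the fs into the error state, report EROFS,
         * not EUCLEAN: the fs is read-only, not known to be corrupted. */
        int errno_for_errored_fs(unsigned long fs_state)
        {
                if (fs_state & (1UL << FS_STATE_ERROR_BIT))
                        return -EROFS;
                return 0;
        }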
      Reported-by: Eric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_tree_block · 100aa5d9
      Committed by David Sterba
      Add a proper variable for the scrub page and use it instead of
      repeatedly dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify tree block checksum calculation · 521e1022
      Committed by David Sterba
      Use a simpler iteration over the tree block pages, the same as what
      csum_tree_block does: the first page always exists, then loop over the
      rest.
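      A sketch of that iteration (a toy hash stands in for the real one; the
      helper is illustrative): hash the first page starting after the
      embedded checksum, then loop over whatever additional pages the tree
      block spans.

        #include <stddef.h>
        #include <stdint.h>

        #define CSUM_SIZE 32u  /* checksum area at the start of the tree block */

        uint32_t csum_tree_pages(const uint8_t *const *pages, size_t nr_pages,
                                 size_t page_size)
        {
                uint32_t sum = 0;

                for (size_t j = CSUM_SIZE; j < page_size; j++)  /* first page */
                        sum = sum * 31 + pages[0][j];

                for (size_t i = 1; i < nr_pages; i++)           /* the rest */
                        for (size_t j = 0; j < page_size; j++)
                                sum = sum * 31 + pages[i][j];
                return sum;
        }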
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_data · d41ebef2
      Committed by David Sterba
      Add a proper variable for the scrub page and use it instead of
      repeatedly dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify data block checksum calculation · 771aba0d
      Committed by David Sterba
      We have sectorsize equal to PAGE_SIZE here, so the checksum can be
      calculated in one go.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_super · c7460541
      Committed by David Sterba
      Add a proper variable for the scrub page and use it instead of
      repeatedly dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove temporary csum array in scrub_checksum_super · 74710cf1
      Committed by David Sterba
      The page contents with the checksum are available during the entire
      function, so we don't need to make a copy.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify superblock checksum calculation · 83cf6d5e
      Committed by David Sterba
      BTRFS_SUPER_INFO_SIZE is 4096 and fits in a page on all supported
      architectures, so we can calculate the checksum in one go.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: unify naming of page address variables · b0485252
      Committed by David Sterba
      As the page mapping has been removed, rename the variables to 'kaddr',
      which is what we use everywhere else. The type is changed to 'char *' so
      pointer arithmetic works without casts.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove kmap/kunmap of pages · a8b3a890
      Committed by David Sterba
      All pages that scrub uses in the scrub_block::pagev array are allocated
      with GFP_KERNEL and are never part of any mapping, so kmap is not
      necessary; we only need to know the page address.
      
      In scrub_write_page_to_dev_replace we don't even need to call
      flush_dcache_page, for the same reason as above.
      Signed-off-by: David Sterba <dsterba@suse.com>