1. 19 4月, 2021 1 次提交
  2. 11 3月, 2021 1 次提交
  3. 23 2月, 2021 1 次提交
    • F
      btrfs: fix race between writes to swap files and scrub · 195a49ea
      Filipe Manana 提交于
      When we active a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents to be changed by operations such as balance and device
      replace/resize/remove. We also call there can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents to be
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group into RO before it processes it
      and then back again to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group, will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents that are used
      by active swap files can not be turned into RO mode, therefore making it
      not possible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a warning
      message that mentions the block group was skipped due to active swap
      files - this is the same approach we currently use for balance.
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      195a49ea
  4. 09 2月, 2021 4 次提交
    • N
      btrfs: zoned: relocate block group to repair IO failure in zoned filesystems · f7ef5287
      Naohiro Aota 提交于
      When a bad checksum is found and if the filesystem has a mirror of the
      damaged data, we read the correct data from the mirror and writes it to
      damaged blocks. This however, violates the sequential write constraints
      of a zoned block device.
      
      We can consider three methods to repair an IO failure in zoned filesystems:
      
      (1) Reset and rewrite the damaged zone
      (2) Allocate new device extent and replace the damaged device extent to
          the new extent
      (3) Relocate the corresponding block group
      
      Method (1) is most similar to a behavior done with regular devices.
      However, it also wipes non-damaged data in the same device extent, and
      so it unnecessary degrades non-damaged data.
      
      Method (2) is much like device replacing but done in the same device. It
      is safe because it keeps the device extent until the replacing finish.
      However, extending device replacing is non-trivial. It assumes
      "src_dev->physical == dst_dev->physical". Also, the extent mapping
      replacing function should be extended to support replacing device extent
      position in one device.
      
      Method (3) invokes relocation of the damaged block group and is
      straightforward to implement. It relocates all the mirrored device
      extents, so it potentially is a more costly operation than method (1) or
      (2). But it relocates only used extents which reduce the total IO size.
      
      Let's apply method (3) for now. In the future, we can extend device-replace
      and apply method (2).
      
      For protecting a block group gets relocated multiple time with multiple
      IO errors, this commit introduces "relocating_repair" bit to show it's
      now relocating to repair IO failures. Also it uses a new kthread
      "btrfs-relocating-repair", not to block IO path with relocating process.
      
      This commit also supports repairing in the scrub process.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f7ef5287
    • N
      btrfs: zoned: support dev-replace in zoned filesystems · 7db1c5d1
      Naohiro Aota 提交于
      This is 4/4 patch to implement device-replace on zoned filesystems.
      
      Even after the copying is done, the write pointers of the source device
      and the destination device may not be synchronized. For example, when
      the last allocated extent is freed before device-replace process, the
      extent is not copied, leaving a hole there.
      
      Synchronize the write pointers by writing zeroes to the destination
      device.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7db1c5d1
    • N
      btrfs: zoned: implement copying for zoned device-replace · de17addc
      Naohiro Aota 提交于
      This is 3/4 patch to implement device-replace on zoned filesystems.
      
      This commit implements copying. To do this, it tracks the write pointer
      during the device replace process. As device-replace's copy process is
      smart enough to only copy used extents on the source device, we have to
      fill the gap to honor the sequential write requirement in the target
      device.
      
      The device-replace process on zoned filesystems must copy or clone all
      the extents in the source device exactly once. So, we need to ensure
      allocations started just before the dev-replace process to have their
      corresponding extent information in the B-trees.
      finish_extent_writes_for_zoned() implements that functionality, which
      basically is the removed code in the commit 042528f8 ("Btrfs: fix
      block group remaining RO forever after error during device replace").
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de17addc
    • N
      btrfs: zoned: mark block groups to copy for device-replace · 78ce9fc2
      Naohiro Aota 提交于
      This is the 1/4 patch to support device-replace on zoned filesystems.
      
      We have two types of IOs during the device replace process. One is an IO
      to "copy" (by the scrub functions) all the device extents from the source
      device to the destination device. The other one is an IO to "clone" (by
      handle_ops_on_dev_replace()) new incoming write IOs from users to the
      source device into the target device.
      
      Cloning incoming IOs can break the sequential write rule in on target
      device. When a write is mapped in the middle of a block group, the IO is
      directed to the middle of a target device zone, which breaks the
      sequential write requirement.
      
      However, the cloning function cannot be disabled since incoming IOs
      targeting already copied device extents must be cloned so that the IO is
      executed on the target device.
      
      We cannot use dev_replace->cursor_{left,right} to determine whether a bio
      is going to a not yet copied region. Since we have a time gap between
      finishing btrfs_scrub_dev() and rewriting the mapping tree in
      btrfs_dev_replace_finishing(), we can have a newly allocated device extent
      which is never cloned nor copied.
      
      So the point is to copy only already existing device extents. This patch
      introduces mark_block_group_to_copy() to mark existing block groups as a
      target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
      check the flag to do their job.
      
      Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
      is the last stripe in the block group. With the last stripe copied,
      the to_copy flag is finally disabled. Afterwards we can safely clone
      incoming IOs on this block group.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      78ce9fc2
  5. 25 1月, 2021 1 次提交
  6. 10 12月, 2020 6 次提交
    • Q
      btrfs: scrub: allow scrub to work with subpage sectorsize · b42fe98c
      Qu Wenruo 提交于
      Since btrfs scrub is utilizing its own infrastructure to submit
      read/write, scrub is independent from all other routines.
      
      This brings one very neat feature, allow us to read 4K data into offset
      0 of a 64K page.  So is the writeback routine.
      
      This makes scrub on subpage sector size much easier to implement, and
      thanks to previous commits which just changed the implementation to
      always do scrub based on sector size, now scrub can handle subpage
      filesystem without any problem.
      
      This patch will just remove the restriction on
      (sectorsize != PAGE_SIZE), to make scrub finally work on subpage
      filesystems.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b42fe98c
    • Q
      btrfs: scrub: support subpage data scrub · b29dca44
      Qu Wenruo 提交于
      Btrfs scrub is more flexible than buffered data write path, as we can
      read an unaligned subpage data into page offset 0.
      
      This ability makes subpage support much easier, we just need to check
      each scrub_page::page_len and ensure we only calculate hash for [0,
      page_len) of a page.
      
      There is a small thing to notice: for subpage case, we still do sector
      by sector scrub.  This means we will submit a read bio for each sector
      to scrub, resulting in the same amount of read bios, just like on the 4K
      page systems.
      
      This behavior can be considered as a good thing, if we want everything
      to be the same as 4K page systems.  But this also means, we're wasting
      the possibility to submit larger bio using 64K page size.  This is
      another problem to consider in the future.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b29dca44
    • Q
      btrfs: scrub: support subpage tree block scrub · 53f3251d
      Qu Wenruo 提交于
      To support subpage tree block scrub, scrub_checksum_tree_block() only
      needs to learn 2 new tricks:
      
      - Follow sector size
        Now scrub_page only represents one sector, we need to follow it
        properly.
      
      - Run checksum on all sectors
        Since scrub_page only represents one sector, we need to run checksum
        on all sectors, not only (nodesize >> PAGE_SIZE).
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      53f3251d
    • Q
      btrfs: scrub: always allocate one full page for one sector for RAID56 · d0a7a9c0
      Qu Wenruo 提交于
      For scrub_pages() and scrub_pages_for_parity(), we currently allocate
      one scrub_page structure for one page.
      
      This is fine if we only read/write one sector one time.  But for cases
      like scrubbing RAID56, we need to read/write the full stripe, which is
      in 64K size for now.
      
      For subpage size, we will submit the read in just one page, which is
      normally a good thing, but for RAID56 case, it only expects to see one
      sector, not the full stripe in its endio function.
      This could lead to wrong parity checksum for RAID56 on subpage.
      
      To make the existing code work well for subpage case, here we take a
      shortcut by always allocating a full page for one sector.
      
      This should provide the base to make RAID56 work for subpage case.
      
      The cost is pretty obvious now, for one RAID56 stripe now we always need
      16 pages. For support subpage situation (64K page size, 4K sector size),
      this means we need full one megabyte to scrub just one RAID56 stripe.
      
      And for data scrub, each 4K sector will also need one 64K page.
      
      This is mostly just a workaround, the proper fix for this is a much
      larger project, using scrub_block to replace scrub_page, and allow
      scrub_block to handle multi pages, csums, and csum_bitmap to avoid
      allocating one page for each sector.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d0a7a9c0
    • Q
      btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits · fa485d21
      Qu Wenruo 提交于
      Btrfs on-disk format chose to use u64 for almost everything, but there
      are a other restrictions that won't let us use more than u32 for things
      like extent length (the maximum length is 128MiB for non-hole extents),
      or stripe length (we have device number limit).
      
      This means if we don't have extra handling to convert u64 to u32, we
      will always have some questionable operations like
      "u32 = u64 >> sectorsize_bits" in the code.
      
      This patch will try to address the problem by reducing the width for the
      following members/parameters:
      
      - scrub_parity::stripe_len
      - @len of scrub_pages()
      - @extent_len of scrub_remap_extent()
      - @len of scrub_parity_mark_sectors_error()
      - @len of scrub_parity_mark_sectors_data()
      - @len of scrub_extent()
      - @len of scrub_pages_for_parity()
      - @len of scrub_extent_for_parity()
      
      For members extracted from on-disk structure, like map->stripe_len, they
      will be kept as is. Since that modification would require on-disk format
      change.
      
      There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call
      sites, extra ASSERT() is added to be extra safe for debug builds.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa485d21
    • N
      btrfs: implement log-structured superblock for ZONED mode · 12659251
      Naohiro Aota 提交于
      Superblock (and its copies) is the only data structure in btrfs which
      has a fixed location on a device. Since we cannot overwrite in a
      sequential write required zone, we cannot place superblock in the zone.
      One easy solution is limiting superblock and copies to be placed only in
      conventional zones.  However, this method has two downsides: one is
      reduced number of superblock copies. The location of the second copy of
      superblock is 256GB, which is in a sequential write required zone on
      typical devices in the market today.  So, the number of superblock and
      copies is limited to be two.  Second downside is that we cannot support
      devices which have no conventional zones at all.
      
      To solve these two problems, we employ superblock log writing. It uses
      two adjacent zones as a circular buffer to write updated superblocks.
      Once the first zone is filled up, start writing into the second one.
      Then, when both zones are filled up and before starting to write to the
      first zone again, it reset the first zone.
      
      We can determine the position of the latest superblock by reading write
      pointer information from a device. One corner case is when both zones
      are full. For this situation, we read out the last superblock of each
      zone, and compare them to determine which zone is older.
      
      The following zones are reserved as the circular buffer on ZONED btrfs.
      
      - The primary superblock: zones 0 and 1
      - The first copy: zones 16 and 17
      - The second copy: zones 1024 or zone at 256GB which is minimum, and
        next to it
      
      If these reserved zones are conventional, superblock is written fixed at
      the start of the zone without logging.
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      12659251
  7. 08 12月, 2020 9 次提交
  8. 05 11月, 2020 1 次提交
  9. 07 10月, 2020 1 次提交
  10. 27 8月, 2020 1 次提交
    • J
      btrfs: allocate scrub workqueues outside of locks · e89c4a9c
      Josef Bacik 提交于
      I got the following lockdep splat while testing:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00172-g021118712e59 #932 Not tainted
        ------------------------------------------------------
        btrfs/229626 is trying to acquire lock:
        ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
      
        but task is already holding lock:
        ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_scrub_dev+0x11c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_run_dev_stats+0x49/0x480
      	 commit_cowonly_roots+0xb5/0x2a0
      	 btrfs_commit_transaction+0x516/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_commit_transaction+0x4bb/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_record_root_in_trans+0x43/0x70
      	 start_transaction+0xd1/0x5d0
      	 btrfs_dirty_inode+0x42/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 perf_read+0x141/0x2c0
      	 vfs_read+0xad/0x1b0
      	 ksys_read+0x5f/0xe0
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x88/0x150
      	 perf_event_init+0x1db/0x20b
      	 start_kernel+0x3ae/0x53c
      	 secondary_startup_64+0xa4/0xb0
      
        -> #1 (pmus_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x4f/0x150
      	 cpuhp_invoke_callback+0xb1/0x900
      	 _cpu_up.constprop.26+0x9f/0x130
      	 cpu_up+0x7b/0xc0
      	 bringup_nonboot_cpus+0x4f/0x60
      	 smp_init+0x26/0x71
      	 kernel_init_freeable+0x110/0x258
      	 kernel_init+0xa/0x103
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (cpu_hotplug_lock){++++}-{0:0}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 cpus_read_lock+0x39/0xb0
      	 alloc_workqueue+0x378/0x450
      	 __btrfs_alloc_workqueue+0x15d/0x200
      	 btrfs_alloc_workqueue+0x51/0x160
      	 scrub_workers_get+0x5a/0x170
      	 btrfs_scrub_dev+0x18c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->scrub_lock);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->scrub_lock);
          lock(cpu_hotplug_lock);
      
         *** DEADLOCK ***
      
        2 locks held by btrfs/229626:
         #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
         #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        stack backtrace:
        CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? alloc_workqueue+0x378/0x450
         cpus_read_lock+0x39/0xb0
         ? alloc_workqueue+0x378/0x450
         alloc_workqueue+0x378/0x450
         ? rcu_read_lock_sched_held+0x52/0x80
         __btrfs_alloc_workqueue+0x15d/0x200
         btrfs_alloc_workqueue+0x51/0x160
         scrub_workers_get+0x5a/0x170
         btrfs_scrub_dev+0x18c/0x630
         ? start_transaction+0xd1/0x5d0
         btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
         btrfs_ioctl+0x2799/0x30a0
         ? do_sigaction+0x102/0x250
         ? lockdep_hardirqs_on_prepare+0xca/0x160
         ? _raw_spin_unlock_irq+0x24/0x30
         ? trace_hardirqs_on+0x1c/0xe0
         ? _raw_spin_unlock_irq+0x24/0x30
         ? do_sigaction+0x102/0x250
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because we're allocating the scrub workqueues under the
      scrub and device list mutex, which brings in a whole host of other
      dependencies.
      
      Because the work queue allocation is done with GFP_KERNEL, it can
      trigger reclaim, which can lead to a transaction commit, which in turns
      needs the device_list_mutex, it can lead to a deadlock. A different
      problem for which this fix is a solution.
      
      Fix this by moving the actual allocation outside of the
      scrub lock, and then only take the lock once we're ready to actually
      assign them to the fs_info.  We'll now have to cleanup the workqueues in
      a few more places, so I've added a helper to do the refcount dance to
      safely free the workqueues.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e89c4a9c
  11. 27 7月, 2020 10 次提交
    • J
      btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Josef Bacik 提交于
      Eric reported seeing this message while running generic/475
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If we get an aborted
      transaction via EIO before, we'll see it in btree_write_cache_pages()
      and return EUCLEAN, which gets printed as "Filesystem corrupted".
      
      Except we shouldn't be returning EUCLEAN here, we need to be returning
      EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
      
      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction, all subsequent actions just need to return
      EROFS because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
      Reported-by: NEric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbabd4a3
    • D
      btrfs: scrub: clean up temporary page variables in scrub_checksum_tree_block · 100aa5d9
      David Sterba 提交于
      Add proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      100aa5d9
    • D
      btrfs: scrub: simplify tree block checksum calculation · 521e1022
      David Sterba 提交于
      Use a simpler iteration over tree block pages, same what csum_tree_block
      does: first page always exists, loop over the rest.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      521e1022
    • D
      btrfs: scrub: clean up temporary page variables in scrub_checksum_data · d41ebef2
      David Sterba 提交于
      Add proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d41ebef2
    • D
      btrfs: scrub: simplify data block checksum calculation · 771aba0d
      David Sterba 提交于
      We have sectorsize same as PAGE_SIZE, the checksum can be calculated in
      one go.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      771aba0d
    • D
      btrfs: scrub: clean up temporary page variables in scrub_checksum_super · c7460541
      David Sterba 提交于
      Add proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c7460541
    • D
      btrfs: scrub: remove temporary csum array in scrub_checksum_super · 74710cf1
      David Sterba 提交于
      The page contents with the checksum is available during the entire
      function so we don't need to make a copy.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      74710cf1
    • D
      btrfs: scrub: simplify superblock checksum calculation · 83cf6d5e
      David Sterba 提交于
      BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported
      architectures, so we can calculate the checksum in one go.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      83cf6d5e
    • D
      btrfs: scrub: unify naming of page address variables · b0485252
      David Sterba 提交于
      As the page mapping has been removed, rename the variables to 'kaddr'
      that we use everywhere else. The type is changed to 'char *' so pointer
      arithmetic works without casts.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b0485252
    • D
      btrfs: scrub: remove kmap/kunmap of pages · a8b3a890
      David Sterba 提交于
      All pages that scrub uses in the scrub_block::pagev array are allocated
      with GFP_KERNEL and never part of any mapping, so kmap is not necessary,
      we only need to know the page address.
      
      In scrub_write_page_to_dev_replace we don't even need to call
      flush_dcache_page because of the same reason as above.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a8b3a890
  12. 25 5月, 2020 4 次提交
    • D
      btrfs: simplify root lookup by id · 56e9357a
      David Sterba 提交于
      The main function to lookup a root by its id btrfs_get_fs_root takes the
      whole key, while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root that does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56e9357a
    • F
      btrfs: scrub, only lookup for csums if we are dealing with a data extent · 89490303
      Filipe Manana 提交于
      When scrubbing a stripe, whenever we find an extent we lookup for its
      checksums in the checksum tree. However we do it even for metadata extents
      which don't have checksum items stored in the checksum tree, that is
      only for data extents.
      
      So make the lookup for checksums only if we are processing with a data
      extent.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      89490303
    • F
      btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Filipe Manana 提交于
      Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress for a block group, so that if another task
      deletes the block group and then another task allocates a new block group
      that gets the same logical address and device extents while the trimming
      task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b7304af
    • F
      btrfs: fix a race between scrub and block group removal/allocation · 2473d24f
      Filipe Manana 提交于
      When scrub is verifying the extents of a block group for a device, it is
      possible that the corresponding block group gets removed and its logical
      address and device extents get used for a new block group allocation.
      When this happens scrub incorrectly reports that errors were detected
      and, if the the new block group has a different profile then the old one,
      deleted block group, we can crash due to a null pointer dereference.
      Possibly other unexpected and weird consequences can happen as well.
      
      Consider the following sequence of actions that leads to the null pointer
      dereference crash when scrub is running in parallel with balance:
      
      1) Balance sets block group X to read-only mode and starts relocating it.
         Block group X is a metadata block group, has a raid1 profile (two
         device extents, each one in a different device) and a logical address
         of 19424870400;
      
      2) Scrub is running and finds device extent E, which belongs to block
         group X. It enters scrub_stripe() to find all extents allocated to
         block group X, the search is done using the extent tree;
      
      3) Balance finishes relocating block group X and removes block group X;
      
      4) Balance starts relocating another block group and when trying to
         commit the current transaction as part of the preparation step
         (prepare_to_relocate()), it blocks because scrub is running;
      
      5) The scrub task finds the metadata extent at the logical address
         19425001472 and marks the pages of the extent to be read by a bio
         (struct scrub_bio). The extent item's flags, which have the bit
         BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
         scrub_page). It is these flags in the scrub pages that tells the
         bio's end io function (scrub_bio_end_io_worker) which type of extent
         it is dealing with. At this point we end up with 4 pages in a bio
         which is ready for submission (the metadata extent has a size of
         16Kb, so that gives 4 pages on x86);
      
      6) At the next iteration of scrub_stripe(), scrub checks that there is a
         pause request from the relocation task trying to commit a transaction,
         therefore it submits the pending bio and pauses, waiting for the
         transaction commit to complete before resuming;
      
      7) The relocation task commits the transaction. The device extent E, that
         was used by our block group X, is now available for allocation, since
         the commit root for the device tree was swapped by the transaction
         commit;
      
      8) Another task doing a direct IO write allocates a new data block group Y
         which ends using device extent E. This new block group Y also ends up
         getting the same logical address that block group X had: 19424870400.
         This happens because block group X was the block group with the highest
         logical address and, when allocating Y, find_next_chunk() returns the
         end offset of the current last block group to be used as the logical
         address for the new block group, which is
      
              18351128576 + 1073741824 = 19424870400
      
         So our new block group Y has the same logical address and device extent
         that block group X had. However Y is a data block group, while X was
         a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
      
      9) After allocating block group Y, the direct IO submits a bio to write
         to device extent E;
      
      10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
          extent E, which now correspond to the data written by the task that
          did a direct IO write. Then at the end io function associated with
          the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
          which calls scrub_checksum(). This later function checks the flags
          of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
          is set in the flags, so it assumes it has a metadata extent and
          then calls scrub_checksum_tree_block(). That functions returns an
          error, since interpreting data as a metadata extent causes the
          checksum verification to fail.
      
          So this makes scrub_checksum() call scrub_handle_errored_block(),
          which determines 'failed_mirror_index' to be 1, since the device
          extent E was allocated as the second mirror of block group X.
      
          It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
          'sblocks_for_recheck', and all the memory is initialized to zeroes by
          kcalloc().
      
          After that it calls scrub_setup_recheck_block(), which is responsible
          for filling each of those structures. However, when that function
          calls btrfs_map_sblock() against the logical address of the metadata
          extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
          the current block group Y. However block group Y has a raid0 profile
          and not a raid1 profile like X had, so the following call returns 1:
      
             scrub_nr_raid_mirrors(bbio)
      
          And as a result scrub_setup_recheck_block() only initializes the
          first (index 0) scrub_block structure in 'sblocks_for_recheck'.
      
          Then scrub_recheck_block() is called by scrub_handle_errored_block()
          with the second (index 1) scrub_block structure as the argument,
          because 'failed_mirror_index' was previously set to 1.
          This scrub_block was not initialized by scrub_setup_recheck_block(),
          so it has zero pages, its 'page_count' member is 0 and its 'pagev'
          page array has all members pointing to NULL.
      
          Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
          we have a NULL pointer dereference when accessing the flags of the first
          page, as pavev[0] is NULL:
      
          static void scrub_recheck_block_checksum(struct scrub_block *sblock)
          {
              (...)
              if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
                  scrub_checksum_data(sblock);
              (...)
          }
      
          Producing a stack trace like the following:
      
          [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
          [542998.010238] #PF: supervisor read access in kernel mode
          [542998.010878] #PF: error_code(0x0000) - not-present page
          [542998.011516] PGD 0 P4D 0
          [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
          [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G    B   W         5.6.0-rc7-btrfs-next-58 #1
          [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
          [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
          [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
          [542998.018474] Code: 4c 89 e6 ...
          [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
          [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
          [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
          [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
          [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
          [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
          [542998.027237] FS:  0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
          [542998.028327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
          [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          [542998.032380] Call Trace:
          [542998.032752]  scrub_recheck_block+0x162/0x400 [btrfs]
          [542998.033500]  ? __alloc_pages_nodemask+0x31e/0x460
          [542998.034228]  scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
          [542998.035170]  scrub_bio_end_io_worker+0x100/0x520 [btrfs]
          [542998.035991]  btrfs_work_helper+0xaa/0x720 [btrfs]
          [542998.036735]  process_one_work+0x26d/0x6a0
          [542998.037275]  worker_thread+0x4f/0x3e0
          [542998.037740]  ? process_one_work+0x6a0/0x6a0
          [542998.038378]  kthread+0x103/0x140
          [542998.038789]  ? kthread_create_worker_on_cpu+0x70/0x70
          [542998.039419]  ret_from_fork+0x3a/0x50
          [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
          [542998.047288] CR2: 0000000000000028
          [542998.047724] ---[ end trace bde186e176c7f96a ]---
      
      This issue has been around for a long time, possibly since scrub exists.
      The last time I ran into it was over 2 years ago. After recently fixing
      fstests to pass the "--full-balance" command line option to btrfs-progs
      when doing balance, several tests started to more heavily exercise balance
      with fsstress, scrub and other operations in parallel, and therefore
      started to hit this issue again (with btrfs/061 for example).
      
      Fix this by having scrub increment the 'trimming' counter of the block
      group, which pins the block group in such a way that it guarantees neither
      its logical address nor device extents can be reused by future block group
      allocations until we decrement the 'trimming' counter. Also make sure that
      on each iteration of scrub_stripe() we stop scrubbing the block group if
      it was removed already.
      
      A later patch in the series will rename the block group's 'trimming'
      counter and its helpers to a more generic name, since now it is not used
      exclusively for pinning while trimming anymore.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2473d24f