1. 10 Dec, 2020 4 commits
    • btrfs: scrub: support subpage tree block scrub · 53f3251d
      Committed by Qu Wenruo
      To support subpage tree block scrub, scrub_checksum_tree_block() only
      needs to learn 2 new tricks:
      
      - Follow sector size
        Now that scrub_page only represents one sector, we need to follow it
        properly.
      
      - Run checksum on all sectors
        Since scrub_page only represents one sector, we need to run checksum
        on all sectors, not only (nodesize >> PAGE_SHIFT).
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: always allocate one full page for one sector for RAID56 · d0a7a9c0
      Committed by Qu Wenruo
      For scrub_pages() and scrub_pages_for_parity(), we currently allocate
      one scrub_page structure for one page.
      
      This is fine if we only read/write one sector at a time.  But for cases
      like scrubbing RAID56, we need to read/write the full stripe, which is
      in 64K size for now.
      
      For subpage size, we will submit the read in just one page, which is
      normally a good thing, but for RAID56 case, it only expects to see one
      sector, not the full stripe in its endio function.
      This could lead to wrong parity checksum for RAID56 on subpage.
      
      To make the existing code work well for subpage case, here we take a
      shortcut by always allocating a full page for one sector.
      
      This should provide the base to make RAID56 work for subpage case.
      
      The cost is pretty obvious now: for one RAID56 stripe we always need
      16 pages. For the supported subpage situation (64K page size, 4K sector
      size), this means we need a full megabyte to scrub just one RAID56 stripe.
      
      And for data scrub, each 4K sector will also need one 64K page.
      
      This is mostly just a workaround, the proper fix for this is a much
      larger project, using scrub_block to replace scrub_page, and allow
      scrub_block to handle multi pages, csums, and csum_bitmap to avoid
      allocating one page for each sector.
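
As a rough check of the cost figures above, the arithmetic can be written out as a small user-space C sketch. The helper names are hypothetical and are not part of scrub.c; the 64K stripe, 4K sector and 64K page values are the ones quoted in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: with one page allocated per sector, a stripe
 * needs as many pages as it has sectors.
 */
static uint32_t pages_per_stripe(uint32_t stripe_len, uint32_t sectorsize)
{
	assert(sectorsize != 0 && stripe_len % sectorsize == 0);
	return stripe_len / sectorsize;
}

/* Hypothetical helper: bytes of page memory used to scrub one stripe. */
static uint64_t stripe_scrub_bytes(uint32_t stripe_len, uint32_t sectorsize,
				   uint32_t page_size)
{
	return (uint64_t)pages_per_stripe(stripe_len, sectorsize) * page_size;
}
```

With a 64K stripe, 4K sectors and 64K pages this gives 16 pages and 1048576 bytes; with 4K pages the same stripe needs 65536 bytes.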
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits · fa485d21
      Committed by Qu Wenruo
      Btrfs on-disk format chose to use u64 for almost everything, but there
      are other restrictions that won't let us use more than u32 for things
      like extent length (the maximum length is 128MiB for non-hole extents),
      or stripe length (we have device number limit).
      
      This means if we don't have extra handling to convert u64 to u32, we
      will always have some questionable operations like
      "u32 = u64 >> sectorsize_bits" in the code.
      
      This patch will try to address the problem by reducing the width for the
      following members/parameters:
      
      - scrub_parity::stripe_len
      - @len of scrub_pages()
      - @extent_len of scrub_remap_extent()
      - @len of scrub_parity_mark_sectors_error()
      - @len of scrub_parity_mark_sectors_data()
      - @len of scrub_extent()
      - @len of scrub_pages_for_parity()
      - @len of scrub_extent_for_parity()
      
      For members extracted from on-disk structure, like map->stripe_len, they
      will be kept as is, since that modification would require an on-disk
      format change.
      
      There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call
      sites, extra ASSERT() is added to be extra safe for debug builds.
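
The narrowing pattern described above might look roughly like the following user-space sketch. The helper names are hypothetical, and the standard assert() stands in for the kernel's ASSERT(); this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the "u32 = u64" case, checked in debug builds. */
static uint32_t narrow_len(uint64_t len)
{
	assert(len <= UINT32_MAX);
	return (uint32_t)len;
}

/* Hypothetical helper: the "u32 = u64 - u64" case. */
static uint32_t range_len(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return narrow_len(end - start);
}

/* Once the length is u32, the shift no longer mixes widths. */
static uint32_t len_to_sectors(uint32_t len, unsigned int sectorsize_bits)
{
	return len >> sectorsize_bits;
}
```

For a 128MiB extent and 4K sectors (sectorsize_bits of 12), range_len() returns 134217728 and len_to_sectors() returns 32768.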
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: implement log-structured superblock for ZONED mode · 12659251
      Committed by Naohiro Aota
      Superblock (and its copies) is the only data structure in btrfs which
      has a fixed location on a device. Since we cannot overwrite in a
      sequential write required zone, we cannot place superblock in the zone.
      One easy solution is limiting superblock and copies to be placed only in
      conventional zones.  However, this method has two downsides: one is
      reduced number of superblock copies. The location of the second copy of
      superblock is 256GB, which is in a sequential write required zone on
      typical devices in the market today.  So, the number of superblock and
      copies is limited to be two.  Second downside is that we cannot support
      devices which have no conventional zones at all.
      
      To solve these two problems, we employ superblock log writing. It uses
      two adjacent zones as a circular buffer to write updated superblocks.
      Once the first zone is filled up, start writing into the second one.
      Then, when both zones are filled up and before starting to write to the
      first zone again, it resets the first zone.
      
      We can determine the position of the latest superblock by reading write
      pointer information from a device. One corner case is when both zones
      are full. For this situation, we read out the last superblock of each
      zone, and compare them to determine which zone is older.
      
      The following zones are reserved as the circular buffer on ZONED btrfs.
      
      - The primary superblock: zones 0 and 1
      - The first copy: zones 16 and 17
      - The second copy: zone 1024 or the zone at 256GB, whichever is smaller,
        and the zone next to it
      
      If these reserved zones are conventional, superblock is written fixed at
      the start of the zone without logging.
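
The circular-buffer behaviour described above can be sketched in user-space C as follows. This is a simplified illustration under stated assumptions, not the kernel implementation: struct sb_zone, zone_state() and next_sb_write_pos() are hypothetical names, zone positions are plain byte offsets, and the generation of the last superblock in each zone is passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

enum zone_state { ZONE_EMPTY, ZONE_PARTIAL, ZONE_FULL };

/* Hypothetical view of one zone: byte offsets of its start and write pointer. */
struct sb_zone {
	uint64_t start;
	uint64_t wp;
	uint64_t len;
};

static enum zone_state zone_state(const struct sb_zone *z)
{
	if (z->wp == z->start)
		return ZONE_EMPTY;
	if (z->wp >= z->start + z->len)
		return ZONE_FULL;
	return ZONE_PARTIAL;
}

/*
 * Pick the position of the next superblock write. *reset_idx is set to the
 * index of the zone that must be reset before writing, or -1 if none.
 * Returns 0 on success and -1 if the two write pointers are inconsistent.
 */
static int next_sb_write_pos(const struct sb_zone z[2],
			     const uint64_t last_gen[2],
			     uint64_t *pos, int *reset_idx)
{
	enum zone_state s0 = zone_state(&z[0]);
	enum zone_state s1 = zone_state(&z[1]);

	*reset_idx = -1;
	if (s0 == ZONE_FULL && s1 == ZONE_FULL) {
		/* Corner case: the zone whose last superblock is older is reused. */
		int older = last_gen[0] < last_gen[1] ? 0 : 1;

		*reset_idx = older;
		*pos = z[older].start;
		return 0;
	}
	if (s0 != ZONE_FULL && s1 != ZONE_PARTIAL) {
		/* The first zone is the active one. */
		*pos = z[0].wp;
		return 0;
	}
	if (s0 == ZONE_FULL) {
		/* The first zone is filled up, continue in the second one. */
		*pos = z[1].wp;
		return 0;
	}
	return -1;
}
```

The latest superblock would then sit just before the returned write position in the active zone, or at the end of the newer zone when both zones are full.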
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 08 Dec, 2020 9 commits
  3. 05 Nov, 2020 1 commit
  4. 07 Oct, 2020 1 commit
  5. 27 Aug, 2020 1 commit
    • btrfs: allocate scrub workqueues outside of locks · e89c4a9c
      Committed by Josef Bacik
      I got the following lockdep splat while testing:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc7-00172-g021118712e59 #932 Not tainted
        ------------------------------------------------------
        btrfs/229626 is trying to acquire lock:
        ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
      
        but task is already holding lock:
        ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_scrub_dev+0x11c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_run_dev_stats+0x49/0x480
      	 commit_cowonly_roots+0xb5/0x2a0
      	 btrfs_commit_transaction+0x516/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_commit_transaction+0x4bb/0xa60
      	 sync_filesystem+0x6b/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0xe/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x29/0x60
      	 cleanup_mnt+0xb8/0x140
      	 task_work_run+0x6d/0xb0
      	 __prepare_exit_to_usermode+0x1cc/0x1e0
      	 do_syscall_64+0x5c/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 btrfs_record_root_in_trans+0x43/0x70
      	 start_transaction+0xd1/0x5d0
      	 btrfs_dirty_inode+0x42/0xd0
      	 touch_atime+0xa1/0xd0
      	 btrfs_file_mmap+0x3f/0x60
      	 mmap_region+0x3a4/0x640
      	 do_mmap+0x376/0x580
      	 vm_mmap_pgoff+0xd5/0x120
      	 ksys_mmap_pgoff+0x193/0x230
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
      	 __might_fault+0x68/0x90
      	 _copy_to_user+0x1e/0x80
      	 perf_read+0x141/0x2c0
      	 vfs_read+0xad/0x1b0
      	 ksys_read+0x5f/0xe0
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x88/0x150
      	 perf_event_init+0x1db/0x20b
      	 start_kernel+0x3ae/0x53c
      	 secondary_startup_64+0xa4/0xb0
      
        -> #1 (pmus_lock){+.+.}-{3:3}:
      	 __mutex_lock+0x9f/0x930
      	 perf_event_init_cpu+0x4f/0x150
      	 cpuhp_invoke_callback+0xb1/0x900
      	 _cpu_up.constprop.26+0x9f/0x130
      	 cpu_up+0x7b/0xc0
      	 bringup_nonboot_cpus+0x4f/0x60
      	 smp_init+0x26/0x71
      	 kernel_init_freeable+0x110/0x258
      	 kernel_init+0xa/0x103
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (cpu_hotplug_lock){++++}-{0:0}:
      	 __lock_acquire+0x1272/0x2310
      	 lock_acquire+0x9e/0x360
      	 cpus_read_lock+0x39/0xb0
      	 alloc_workqueue+0x378/0x450
      	 __btrfs_alloc_workqueue+0x15d/0x200
      	 btrfs_alloc_workqueue+0x51/0x160
      	 scrub_workers_get+0x5a/0x170
      	 btrfs_scrub_dev+0x18c/0x630
      	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
      	 btrfs_ioctl+0x2799/0x30a0
      	 ksys_ioctl+0x83/0xc0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->scrub_lock);
      				 lock(&fs_devs->device_list_mutex);
      				 lock(&fs_info->scrub_lock);
          lock(cpu_hotplug_lock);
      
         *** DEADLOCK ***
      
        2 locks held by btrfs/229626:
         #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
         #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
      
        stack backtrace:
        CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
        Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x165/0x180
         __lock_acquire+0x1272/0x2310
         lock_acquire+0x9e/0x360
         ? alloc_workqueue+0x378/0x450
         cpus_read_lock+0x39/0xb0
         ? alloc_workqueue+0x378/0x450
         alloc_workqueue+0x378/0x450
         ? rcu_read_lock_sched_held+0x52/0x80
         __btrfs_alloc_workqueue+0x15d/0x200
         btrfs_alloc_workqueue+0x51/0x160
         scrub_workers_get+0x5a/0x170
         btrfs_scrub_dev+0x18c/0x630
         ? start_transaction+0xd1/0x5d0
         btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
         btrfs_ioctl+0x2799/0x30a0
         ? do_sigaction+0x102/0x250
         ? lockdep_hardirqs_on_prepare+0xca/0x160
         ? _raw_spin_unlock_irq+0x24/0x30
         ? trace_hardirqs_on+0x1c/0xe0
         ? _raw_spin_unlock_irq+0x24/0x30
         ? do_sigaction+0x102/0x250
         ? ksys_ioctl+0x83/0xc0
         ksys_ioctl+0x83/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x50/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because we're allocating the scrub workqueues under the
      scrub and device list mutex, which brings in a whole host of other
      dependencies.
      
      Because the work queue allocation is done with GFP_KERNEL, it can
      trigger reclaim, which can lead to a transaction commit, which in turn
      needs the device_list_mutex, so it can lead to a deadlock. That is a
      different problem, which this fix also solves.
      
      Fix this by moving the actual allocation outside of the
      scrub lock, and then only take the lock once we're ready to actually
      assign them to the fs_info.  We'll now have to clean up the workqueues in
      a few more places, so I've added a helper to do the refcount dance to
      safely free the workqueues.
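
The "allocate outside the lock, assign under the lock" pattern described above can be sketched in user-space C with a pthread mutex. All names here (struct workers, struct fs_state, scrub_workers_get_sketch) are hypothetical stand-ins, and a plain counter replaces the kernel's refcount handling; this is not the code from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the set of scrub workqueues. */
struct workers {
	int refs;
};

/* Hypothetical stand-in for the relevant parts of fs_info. */
struct fs_state {
	pthread_mutex_t scrub_lock;
	struct workers *scrub_workers;
};

static int scrub_workers_get_sketch(struct fs_state *fs)
{
	struct workers *w;

	assert(fs != NULL);

	/* Fast path: workers already exist, only take a reference. */
	pthread_mutex_lock(&fs->scrub_lock);
	if (fs->scrub_workers) {
		fs->scrub_workers->refs++;
		pthread_mutex_unlock(&fs->scrub_lock);
		return 0;
	}
	pthread_mutex_unlock(&fs->scrub_lock);

	/* The allocation happens with no lock held. */
	w = calloc(1, sizeof(*w));
	if (!w)
		return -1;
	w->refs = 1;

	/* Take the lock only to publish the result. */
	pthread_mutex_lock(&fs->scrub_lock);
	if (!fs->scrub_workers) {
		fs->scrub_workers = w;
		w = NULL;
	} else {
		/* Another caller won the race, use its workers instead. */
		fs->scrub_workers->refs++;
	}
	pthread_mutex_unlock(&fs->scrub_lock);

	/* Drop our unused copy if we lost the race; free(NULL) is a no-op. */
	free(w);
	return 0;
}
```

The cost of this design is the extra cleanup path for the loser of the race, which is the kind of cleanup the commit message says the new helper takes care of.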
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 27 Jul, 2020 10 commits
    • btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases · fbabd4a3
      Committed by Josef Bacik
      Eric reported seeing this message while running generic/475:
      
        BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      Full stack trace:
      
        BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        ------------[ cut here ]------------
        BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
        BTRFS: Transaction aborted (error -117)
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
        WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
        CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
        RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
        RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
        RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
        R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
        FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
        Call Trace:
         ? finish_wait+0x90/0x90
         ? __mutex_unlock_slowpath+0x45/0x2a0
         ? lock_acquire+0xa3/0x440
         ? lockref_put_or_lock+0x9/0x30
         ? dput+0x20/0x4a0
         ? dput+0x20/0x4a0
         ? do_raw_spin_unlock+0x4b/0xc0
         ? _raw_spin_unlock+0x1f/0x30
         btrfs_sync_file+0x335/0x490 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x50/0xe0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f6a6ef1b6e3
        Code: Bad RIP value.
        RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
        RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
        R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace af146e0e38433456 ]---
        BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
      
      This ret came from btrfs_write_marked_extents().  If we get an aborted
      transaction via EIO before, we'll see it in btree_write_cache_pages()
      and return EUCLEAN, which gets printed as "Filesystem corrupted".
      
      Except we shouldn't be returning EUCLEAN here, we need to be returning
      EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
      
      We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
      elsewhere, but we want to use EROFS for this particular case.  The
      original transaction abort has the real error code for why we ended up
      with an aborted transaction, all subsequent actions just need to return
      EROFS because they may not have a trans handle and have no idea about
      the original cause of the abort.
      
      After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
      stacktrace will not be dumped either.
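
The error-code policy described above can be illustrated with a minimal user-space C sketch. The struct and function names are hypothetical, and the boolean flag stands in for BTRFS_FS_STATE_ERROR; the real check lives elsewhere in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the filesystem state. */
struct fs_sketch {
	bool error_state;	/* set once a transaction has been aborted */
};

/*
 * Later operations do not know why the transaction was aborted, so they
 * report a read-only filesystem instead of claiming corruption.
 */
static int start_metadata_write(const struct fs_sketch *fs)
{
	assert(fs != NULL);
	if (fs->error_state)
		return -EROFS;
	return 0;
}
```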
      Reported-by: Eric Sandeen <esandeen@redhat.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add full test stacktrace ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_tree_block · 100aa5d9
      Committed by David Sterba
      Add a proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify tree block checksum calculation · 521e1022
      Committed by David Sterba
      Use a simpler iteration over tree block pages, the same way csum_tree_block
      does: first page always exists, loop over the rest.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_data · d41ebef2
      Committed by David Sterba
      Add a proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify data block checksum calculation · 771aba0d
      Committed by David Sterba
      Since sectorsize is the same as PAGE_SIZE, the checksum can be calculated
      in one go.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up temporary page variables in scrub_checksum_super · c7460541
      Committed by David Sterba
      Add a proper variable for the scrub page and use it instead of repeatedly
      dereferencing the other structures.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove temporary csum array in scrub_checksum_super · 74710cf1
      Committed by David Sterba
      The page contents with the checksum are available during the entire
      function so we don't need to make a copy.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: simplify superblock checksum calculation · 83cf6d5e
      Committed by David Sterba
      BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported
      architectures, so we can calculate the checksum in one go.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: unify naming of page address variables · b0485252
      Committed by David Sterba
      As the page mapping has been removed, rename the variables to 'kaddr'
      that we use everywhere else. The type is changed to 'char *' so pointer
      arithmetic works without casts.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove kmap/kunmap of pages · a8b3a890
      Committed by David Sterba
      All pages that scrub uses in the scrub_block::pagev array are allocated
      with GFP_KERNEL and never part of any mapping, so kmap is not necessary,
      we only need to know the page address.
      
      In scrub_write_page_to_dev_replace we don't even need to call
      flush_dcache_page because of the same reason as above.
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 25 May, 2020 4 commits
    • btrfs: simplify root lookup by id · 56e9357a
      Committed by David Sterba
      The main function to look up a root by its id, btrfs_get_fs_root, takes the
      whole key, while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root that does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub, only lookup for csums if we are dealing with a data extent · 89490303
      Committed by Filipe Manana
      When scrubbing a stripe, whenever we find an extent we look up its
      checksums in the checksum tree. However we do it even for metadata extents,
      which don't have checksum items stored in the checksum tree; those exist
      only for data extents.
      
      So do the lookup for checksums only if we are processing a data
      extent.
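
The check described above amounts to testing the extent item's flags before doing the lookup. A minimal user-space sketch follows; the macro and function names are hypothetical, with the bit values chosen to mirror BTRFS_EXTENT_FLAG_DATA and BTRFS_EXTENT_FLAG_TREE_BLOCK as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, mirroring the btrfs extent flags. */
#define SKETCH_EXTENT_FLAG_DATA		(1ULL << 0)
#define SKETCH_EXTENT_FLAG_TREE_BLOCK	(1ULL << 1)

/* Only data extents have items in the checksum tree worth looking up. */
static bool needs_csum_lookup(uint64_t extent_flags)
{
	/* An extent is either data or a tree block, never both. */
	assert(!((extent_flags & SKETCH_EXTENT_FLAG_DATA) &&
		 (extent_flags & SKETCH_EXTENT_FLAG_TREE_BLOCK)));
	return (extent_flags & SKETCH_EXTENT_FLAG_DATA) != 0;
}
```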
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Committed by Filipe Manana
      Back in 2014, in commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress for a block group. Without it, one task
      could delete the block group and another task could allocate a new block
      group that gets the same logical address and device extents while the
      trimming task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix a race between scrub and block group removal/allocation · 2473d24f
      Committed by Filipe Manana
      When scrub is verifying the extents of a block group for a device, it is
      possible that the corresponding block group gets removed and its logical
      address and device extents get used for a new block group allocation.
      When this happens scrub incorrectly reports that errors were detected
      and, if the new block group has a different profile than the old,
      deleted block group, we can crash due to a null pointer dereference.
      Possibly other unexpected and weird consequences can happen as well.
      
      Consider the following sequence of actions that leads to the null pointer
      dereference crash when scrub is running in parallel with balance:
      
      1) Balance sets block group X to read-only mode and starts relocating it.
         Block group X is a metadata block group, has a raid1 profile (two
         device extents, each one in a different device) and a logical address
         of 19424870400;
      
      2) Scrub is running and finds device extent E, which belongs to block
         group X. It enters scrub_stripe() to find all extents allocated to
         block group X, the search is done using the extent tree;
      
      3) Balance finishes relocating block group X and removes block group X;
      
      4) Balance starts relocating another block group and when trying to
         commit the current transaction as part of the preparation step
         (prepare_to_relocate()), it blocks because scrub is running;
      
      5) The scrub task finds the metadata extent at the logical address
         19425001472 and marks the pages of the extent to be read by a bio
         (struct scrub_bio). The extent item's flags, which have the bit
         BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
         scrub_page). It is these flags in the scrub pages that tell the
         bio's end io function (scrub_bio_end_io_worker) which type of extent
         it is dealing with. At this point we end up with 4 pages in a bio
         which is ready for submission (the metadata extent has a size of
         16KiB, so that gives 4 pages on x86);
      
      6) At the next iteration of scrub_stripe(), scrub checks that there is a
         pause request from the relocation task trying to commit a transaction,
         therefore it submits the pending bio and pauses, waiting for the
         transaction commit to complete before resuming;
      
      7) The relocation task commits the transaction. The device extent E, that
         was used by our block group X, is now available for allocation, since
         the commit root for the device tree was swapped by the transaction
         commit;
      
      8) Another task doing a direct IO write allocates a new data block group Y
         which ends using device extent E. This new block group Y also ends up
         getting the same logical address that block group X had: 19424870400.
         This happens because block group X was the block group with the highest
         logical address and, when allocating Y, find_next_chunk() returns the
         end offset of the current last block group to be used as the logical
         address for the new block group, which is
      
              18351128576 + 1073741824 = 19424870400
      
         So our new block group Y has the same logical address and device extent
         that block group X had. However Y is a data block group, while X was
         a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
      
      9) After allocating block group Y, the direct IO submits a bio to write
         to device extent E;
      
      10) The read bio submitted by scrub reads the 4 pages (16KiB) from device
          extent E, which now correspond to the data written by the task that
          did a direct IO write. Then at the end io function associated with
          the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
          which calls scrub_checksum(). This latter function checks the flags
          of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
          is set in the flags, so it assumes it has a metadata extent and
          then calls scrub_checksum_tree_block(). That function returns an
          error, since interpreting data as a metadata extent causes the
          checksum verification to fail.
      
          So this makes scrub_checksum() call scrub_handle_errored_block(),
          which determines 'failed_mirror_index' to be 1, since the device
          extent E was allocated as the second mirror of block group X.
      
          It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
          'sblocks_for_recheck', and all the memory is initialized to zeroes by
          kcalloc().
      
          After that it calls scrub_setup_recheck_block(), which is responsible
          for filling each of those structures. However, when that function
          calls btrfs_map_sblock() against the logical address of the metadata
          extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
          the current block group Y. However block group Y has a raid0 profile
          and not a raid1 profile like X had, so the following call returns 1:
      
             scrub_nr_raid_mirrors(bbio)
      
          And as a result scrub_setup_recheck_block() only initializes the
          first (index 0) scrub_block structure in 'sblocks_for_recheck'.
      
          Then scrub_recheck_block() is called by scrub_handle_errored_block()
          with the second (index 1) scrub_block structure as the argument,
          because 'failed_mirror_index' was previously set to 1.
          This scrub_block was not initialized by scrub_setup_recheck_block(),
          so it has zero pages, its 'page_count' member is 0 and its 'pagev'
          page array has all members pointing to NULL.
      
          Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
          we have a NULL pointer dereference when accessing the flags of the first
          page, as pagev[0] is NULL:
      
          static void scrub_recheck_block_checksum(struct scrub_block *sblock)
          {
              (...)
              if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
                  scrub_checksum_data(sblock);
              (...)
          }
      
          Producing a stack trace like the following:
      
          [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
          [542998.010238] #PF: supervisor read access in kernel mode
          [542998.010878] #PF: error_code(0x0000) - not-present page
          [542998.011516] PGD 0 P4D 0
          [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
          [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G    B   W         5.6.0-rc7-btrfs-next-58 #1
          [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
          [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
          [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
          [542998.018474] Code: 4c 89 e6 ...
          [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
          [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
          [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
          [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
          [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
          [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
          [542998.027237] FS:  0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
          [542998.028327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
          [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          [542998.032380] Call Trace:
          [542998.032752]  scrub_recheck_block+0x162/0x400 [btrfs]
          [542998.033500]  ? __alloc_pages_nodemask+0x31e/0x460
          [542998.034228]  scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
          [542998.035170]  scrub_bio_end_io_worker+0x100/0x520 [btrfs]
          [542998.035991]  btrfs_work_helper+0xaa/0x720 [btrfs]
          [542998.036735]  process_one_work+0x26d/0x6a0
          [542998.037275]  worker_thread+0x4f/0x3e0
          [542998.037740]  ? process_one_work+0x6a0/0x6a0
          [542998.038378]  kthread+0x103/0x140
          [542998.038789]  ? kthread_create_worker_on_cpu+0x70/0x70
          [542998.039419]  ret_from_fork+0x3a/0x50
          [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
          [542998.047288] CR2: 0000000000000028
          [542998.047724] ---[ end trace bde186e176c7f96a ]---
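The failure path described above can be replayed in a small userspace model. This is only a sketch: the `model_*` names, `MAX_PAGES` and the `-1` return value are hypothetical and not part of btrfs, and the guard in `model_recheck_checksum()` exists only to make the uninitialized slot observable without crashing. It is not the fix this commit applies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_MIRRORS 3      /* stand-in for BTRFS_MAX_MIRRORS */
#define MAX_PAGES   16     /* hypothetical per-block page limit */
#define FLAG_DATA   0x1u   /* stand-in for BTRFS_EXTENT_FLAG_DATA */

struct model_page  { unsigned int flags; };
struct model_block { int page_count; struct model_page *pagev[MAX_PAGES]; };

/* Fill only the first nr_mirrors blocks, mirroring what the commit text
 * says scrub_setup_recheck_block() does with scrub_nr_raid_mirrors(). */
static void model_setup_recheck(struct model_block *blocks, int nr_mirrors,
                                struct model_page *page)
{
    for (int i = 0; i < nr_mirrors; i++) {
        blocks[i].pagev[0] = page;
        blocks[i].page_count = 1;
    }
}

/* Returns 1 for a data page, 0 otherwise, and -1 when the block was never
 * populated. The kernel code in the trace had no such check and
 * dereferenced pagev[0] directly. */
static int model_recheck_checksum(const struct model_block *b)
{
    if (b->page_count == 0 || b->pagev[0] == NULL)
        return -1;
    return (b->pagev[0]->flags & FLAG_DATA) ? 1 : 0;
}

/* Replay: allocate zeroed blocks, set up nr_mirrors of them, then recheck
 * the block at failed_mirror_index. */
static int model_replay(int nr_mirrors, int failed_mirror_index)
{
    struct model_page page = { .flags = FLAG_DATA };
    struct model_block *blocks;
    int ret;

    assert(failed_mirror_index >= 0 && failed_mirror_index < MAX_MIRRORS);
    blocks = calloc(MAX_MIRRORS, sizeof(*blocks));
    if (!blocks)
        return -2;
    model_setup_recheck(blocks, nr_mirrors, &page);
    ret = model_recheck_checksum(&blocks[failed_mirror_index]);
    free(blocks);
    return ret;
}
```

With one mirror reported (the raid0 block group Y) and a failed mirror index of 1 (remembered from the raid1 block group X), `model_replay(1, 1)` lands on the block that setup never touched.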
      
      This issue has been around for a long time, possibly for as long as scrub has existed.
      The last time I ran into it was over 2 years ago. After recently fixing
      fstests to pass the "--full-balance" command line option to btrfs-progs
      when doing balance, several tests started to more heavily exercise balance
      with fsstress, scrub and other operations in parallel, and therefore
      started to hit this issue again (with btrfs/061 for example).
      
      Fix this by having scrub increment the 'trimming' counter of the block
      group, which pins the block group in such a way that it guarantees neither
      its logical address nor device extents can be reused by future block group
      allocations until we decrement the 'trimming' counter. Also make sure that
      on each iteration of scrub_stripe() we stop scrubbing the block group if
      it was removed already.
      
      A later patch in the series will rename the block group's 'trimming'
      counter and its helpers to a more generic name, since now it is not used
      exclusively for pinning while trimming anymore.
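The pinning idea can be sketched as a plain counter that blocks reuse while it is non-zero. The sketch assumes a much simplified block group with only a pin count and a removed flag; every `model_*` name is hypothetical, and real btrfs code takes locks that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>

struct model_bg {
    int  trimming;  /* pin count, named after the counter in the commit text */
    bool removed;   /* set once the block group has been deleted */
};

static void model_pin(struct model_bg *bg)
{
    bg->trimming++;
}

static void model_unpin(struct model_bg *bg)
{
    assert(bg->trimming > 0);
    bg->trimming--;
}

/* The logical address and device extents may be handed out again only
 * after removal and only once nobody pins the block group. */
static bool model_can_reuse(const struct model_bg *bg)
{
    return bg->removed && bg->trimming == 0;
}

/* Scrub nr_stripes stripes while pinned. The block group is removed
 * concurrently just before iteration remove_at. Returns the number of
 * stripes that were scrubbed. */
static int model_scrub(struct model_bg *bg, int nr_stripes, int remove_at)
{
    int scrubbed = 0;

    model_pin(bg);
    for (int i = 0; i < nr_stripes; i++) {
        if (i == remove_at)
            bg->removed = true;
        if (bg->removed)
            break;                      /* stop scrubbing a removed block group */
        assert(!model_can_reuse(bg));   /* pinned, so nothing can be reallocated */
        scrubbed++;
    }
    model_unpin(bg);
    return scrubbed;
}
```

In this model the block group only becomes reusable after `model_unpin()` runs, which is the guarantee the fix relies on.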
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2473d24f
  8. 24 Mar, 2020 5 commits
  9. 24 Jan, 2020 1 commit
    • Q
      btrfs: scrub: Require mandatory block group RO for dev-replace · 1bbb97b8
      Qu Wenruo committed
      [BUG]
      For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
      looped runs can lead to random failures, where scrub finds csum errors.
      
      The probability is not high, around 1/20 to 1/100, but it's causing data
      corruption.
      
      The bug is observable after commit b12de528 ("btrfs: scrub: Don't
      check free space before marking a block group RO")
      
      [CAUSE]
      Dev-replace has two sources of writes:
      
      - Write duplication
        All writes to source device will also be duplicated to target device.
      
        Content:	Not yet persisted data/meta
      
      - Scrub copy
        Dev-replace reused scrub code to iterate through existing extents, and
        copy the verified data to target device.
      
        Content:	Previously persisted data and metadata
      
      The difference in contents makes the following race possible:
      	Regular Writer		|	Dev-replace
      -----------------------------------------------------------------
        ^                             |
        | Preallocate one data extent |
        | at bytenr X, len 1M		|
        v				|
        ^ Commit transaction		|
        | Now extent [X, X+1M) is in  |
        v commit root			|
       ================== Dev replace starts =========================
        				| ^
      				| | Scrub extent [X, X+1M)
      				| | Read [X, X+1M)
      				| | (The content is mostly garbage
      				| |  since it's preallocated)
        ^				| v
        | Write back happens for	|
        | extent [X, X+512K)		|
        | New data writes to both	|
        | source and target dev.	|
        v				|
      				| ^
      				| | Scrub writes back extent [X, X+1M)
      				| | to target device.
      				| | This will overwrite the new data in
      				| | [X, X+512K)
      				| v
      
      This race can only happen for nocow writes. Thus metadata and data cow
      writes are safe, as COW will never overwrite extents of the previous
      transaction (in commit root).
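The timeline above can be replayed with two byte arrays standing in for the source and target devices. This is a toy model under stated assumptions: the extent is 8 bytes instead of 1M, the `model_*` names are hypothetical, and marking the block group RO is modeled simply by delaying the nocow write until after the scrub copy has finished.

```c
#include <assert.h>
#include <string.h>

#define EXTENT_LEN 8   /* toy extent size; the commit's example uses 1M */

struct model_devs {
    char source[EXTENT_LEN];
    char target[EXTENT_LEN];
};

/* Write duplication: a regular write lands on both devices. */
static void model_dup_write(struct model_devs *d, int off, int len, char fill)
{
    assert(off >= 0 && len >= 0 && off + len <= EXTENT_LEN);
    memset(d->source + off, fill, (size_t)len);
    memset(d->target + off, fill, (size_t)len);
}

/* Replay the race and return how many bytes of the target differ from the
 * source afterwards. With block_group_ro set, the nocow write cannot start
 * between scrub's read and scrub's write-back, so it runs afterwards. */
static int model_replace_race(int block_group_ro)
{
    struct model_devs d;
    char scrub_buf[EXTENT_LEN];
    int diff = 0;

    memset(d.source, 'G', EXTENT_LEN);  /* preallocated extent: garbage */
    memset(d.target, 0, EXTENT_LEN);    /* nothing copied yet */

    memcpy(scrub_buf, d.source, EXTENT_LEN);          /* scrub reads [X, X+len) */
    if (!block_group_ro)
        model_dup_write(&d, 0, EXTENT_LEN / 2, 'N');  /* racing nocow writeback */
    memcpy(d.target, scrub_buf, EXTENT_LEN);          /* scrub copies stale data */
    if (block_group_ro)
        model_dup_write(&d, 0, EXTENT_LEN / 2, 'N');  /* write waits for RW again */

    for (int i = 0; i < EXTENT_LEN; i++)
        diff += d.source[i] != d.target[i];
    return diff;
}
```

Without the RO step, the first half of the target ends up holding the stale preallocated content while the source holds the new data, which is what scrub later reports as a csum error.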
      
      This behavior can be confirmed by disabling all fallocate related calls
      in fsstress (*), then all related tests can pass a 2000 run loop.
      
      *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
      		   -f collapse=0 -f punch=0 -f resvsp=0"
         I didn't expect that the resvsp ioctl would fall back to fallocate in VFS...
      
      [FIX]
      Make dev-replace require mandatory block group RO, and wait for current
      nocow writes before calling scrub_chunk().
      
      This patch will mostly revert commit 76a8efa1 ("btrfs: Continue replace
      when set_block_ro failed") for dev-replace path.
      
      The side effect is that dev-replace can be stricter about available space,
      but that is definitely worth it to avoid data corruption.
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 76a8efa1 ("btrfs: Continue replace when set_block_ro failed")
      Fixes: b12de528 ("btrfs: scrub: Don't check free space before marking a block group RO")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1bbb97b8
  10. 20 Jan, 2020 1 commit
    • D
      btrfs: handle empty block_group removal for async discard · 6e80d4f8
      Dennis Zhou committed
      block_group removal is a little tricky. It can race with the extent
      allocator, the cleaner thread, and balancing. The current path is for a
      block_group to be added to the unused_bgs list. Then, when the cleaner
      thread comes around, it starts a transaction and then proceeds with
      removing the block_group. Extents that are pinned are subsequently
      removed from the pinned trees and then eventually a discard is issued
      for the entire block_group.
      
      Async discard introduces another player into the game, the discard
      workqueue. While it has none of the racing issues, the new problem is
      ensuring we don't leave free space untrimmed prior to forgetting the
      block_group.  This is handled by placing fully free block_groups on a
      separate discard queue. This is necessary to maintain discarding order
      as in the future we will slowly trim even fully free block_groups. The
      ordering helps us make progress on the same block_group rather than say
      the last fully freed block_group or needing to search through the fully
      freed block groups at the beginning of a list and insert after.
      
      The new order of events is that a fully freed block group gets placed on
      the unused discard queue first. Once it's processed, it will be placed on
      the unused_bgs list and then the original sequence of events will
      happen, just without the final whole block_group discard.
      
      The mount flags can change when processing unused_bgs, so when flipping
      from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
      discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
      free block groups on the discard_list to the unused_bg queue which will
      do the final discard for us.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6e80d4f8
  11. 19 Nov, 2019 3 commits
    • F
      Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Filipe Manana committed
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop without calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
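The control-flow problem in steps 1) to 3) and the shape of the fix can be sketched as follows. This is a simplified model, not the kernel code: the `model_*` functions are hypothetical, the wait result and the join/commit error are passed in as plain integers, and the actual scrubbing of the chunk is left out.

```c
#include <assert.h>

struct model_bg { int ro; };   /* ro mirrors the block group's ->ro field */

static int model_inc_ro(struct model_bg *bg)
{
    bg->ro++;
    return 0;
}

static void model_dec_ro(struct model_bg *bg)
{
    assert(bg->ro > 0);
    bg->ro--;
}

/* One loop iteration as described before the fix. wait_ret models the
 * return value of the wait for ordered writes, commit_err models a
 * failure of joining or committing the transaction (0 means success). */
static int model_buggy_iteration(struct model_bg *bg, int wait_ret, int commit_err)
{
    int ret = model_inc_ro(bg);   /* step 1: ret == 0, block group is RO */

    if (ret)
        return ret;
    ret = wait_ret;               /* step 2: 'ret' is overwritten */
    if (ret > 0 && commit_err)
        return commit_err;        /* step 3: early exit, RO is never dropped */

    /* ... the chunk would be scrubbed here ... */
    model_dec_ro(bg);
    return 0;
}

/* After the fix the wait is gone, so nothing sits between marking the
 * block group RO and dropping RO again that could exit early. */
static int model_fixed_iteration(struct model_bg *bg)
{
    int ret = model_inc_ro(bg);

    if (ret)
        return ret;
    /* ... the chunk would be scrubbed here ... */
    model_dec_ro(bg);
    return 0;
}
```

In the buggy variant, a positive wait result followed by a transaction error leaves `ro` at 1 when the function returns, which corresponds to the block group staying read-only forever.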
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      042528f8
    • Q
      btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Qu Wenruo committed
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is, we have the following bad loop of creating tons of
      system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. New SYSTEM bg will be allocated
         btrfs_inc_block_group_ro() checks whether we have enough space
         after marking the current bg RO. If not, it allocates a new chunk.
      3. New SYSTEM bg is still empty, will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg gets scrubbed
         We go back to step 2, as the bg is already marked RO but has not
         been cleaned up yet.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Scrub/dev-replace doesn't always need to allocate new extents,
      especially chunk tree extents, so we don't really need to do chunk
      pre-allocation.
      
      To break the above spiral, here we introduce a new parameter to
      btrfs_inc_block_group_ro(), @do_chunk_alloc, which indicates whether we
      need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
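The spiral and the effect of the new parameter can be illustrated with a toy model. It assumes, as a simplification, that every RO marking with chunk allocation enabled creates one new SYSTEM chunk, that the cleaner never gets to run, and that the array holds 8 entries. The capacity and all `model_*` names are hypothetical, not the real size of `btrfs_super_block::sys_chunk_array` or real btrfs functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_SYS_ARRAY_SLOTS 8   /* toy capacity, not the real array size */

struct model_fs { int sys_chunks; };

/* Mark one SYSTEM block group RO. With do_chunk_alloc the model allocates
 * a replacement chunk each time, failing with -EFBIG once the array is
 * full. Without it, no chunk is allocated. */
static int model_inc_block_group_ro(struct model_fs *fs, bool do_chunk_alloc)
{
    if (!do_chunk_alloc)
        return 0;
    if (fs->sys_chunks >= MODEL_SYS_ARRAY_SLOTS)
        return -EFBIG;
    fs->sys_chunks++;
    return 0;
}

/* Scrub visits every SYSTEM chunk, including the ones that were allocated
 * while this pass was running. */
static int model_scrub_system_chunks(struct model_fs *fs, bool do_chunk_alloc)
{
    assert(fs->sys_chunks >= 1);
    for (int i = 0; i < fs->sys_chunks; i++) {
        int ret = model_inc_block_group_ro(fs, do_chunk_alloc);

        if (ret)
            return ret;
    }
    return 0;
}
```

Starting from a single SYSTEM chunk, the pass with `do_chunk_alloc` set keeps feeding itself new chunks until the array is exhausted, while the pass without it leaves the chunk count unchanged.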
      
      Also, since there are two parameters for btrfs_inc_block_group_ro(),
      add more comments for it.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b12de528
    • D
      btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba committed
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      32da5386