1. 01 Jun, 2023 1 commit
    • btrfs: zoned: fix dev-replace after the scrub rework · b675df02
      Qu Wenruo committed
      [BUG]
      After commit e02ee89b ("btrfs: scrub: switch scrub_simple_mirror()
      to scrub_stripe infrastructure"), scrub no longer works on zoned
      devices at all.
      
      Even an empty zoned btrfs cannot be replaced:
      
        # mkfs.btrfs -f /dev/nvme0n1
        # mount /dev/nvme0n1 /mnt/btrfs
        # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
        Resetting device zones /dev/nvme1n1 (160 zones) ...
        ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
      
      And we can hit a related kernel crash:
      
        BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
        BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
        nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
        I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
        BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
        BUG: kernel NULL pointer dereference, address: 00000000000000a8
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
        RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
        Call Trace:
         <IRQ>
         btrfs_lookup_ordered_extent+0x31/0x190
         btrfs_record_physical_zoned+0x18/0x40
         btrfs_simple_end_io+0xaf/0xc0
         blk_update_request+0x153/0x4c0
         blk_mq_end_request+0x15/0xd0
         nvme_poll_cq+0x1d3/0x360
         nvme_irq+0x39/0x80
         __handle_irq_event_percpu+0x3b/0x190
         handle_irq_event+0x2f/0x70
         handle_edge_irq+0x7c/0x210
         __common_interrupt+0x34/0xa0
         common_interrupt+0x7d/0xa0
         </IRQ>
         <TASK>
         asm_common_interrupt+0x22/0x40
      
      [CAUSE]
      Dev-replace reuses scrub code to iterate all extents and write the
      existing content back to the new device.
      
      And for zoned devices, we call fill_writer_pointer_gap() to make sure
      all the writes into the zoned device are sequential, even if there may
      be some gaps between the writes.
      
      However we have several different bugs all related to zoned dev-replace:
      
      - We are using the ZONE_APPEND operation for metadata-style writeback
        For zoned devices, btrfs has two ways to write data:
      
        * ZONE_APPEND for data
          This allows a higher queue depth, but the submitter does not know
          where the write will land.
          Thus the real on-disk physical location has to be grabbed in its
          endio.
      
        * WRITE for metadata
          This requires a queue depth of one (a new write can only be
          submitted after the previous one has finished), and all writes
          must be sequential.
      
        Scrub uses a queue depth of one but still submits ZONE_APPEND,
        which requires btrfs_bio::inode to be populated.
        This is the cause of the crash.
      
      - No correct tracking of write_pointer
        After a write finishes, we should forward sctx->write_pointer.
        Otherwise fill_writer_pointer_gap() does not work properly: it
        zeroes out more than necessary and fills the whole zone prematurely.
      
      - Incorrect physical bytenr passed to fill_writer_pointer_gap()
        In scrub_write_sectors(), one call site passes a logical address,
        which is completely wrong.
      
        The other call site passes the physical address of the current
        sector, but we should pass the physical address of the btrfs_bio
        we're submitting.
      
        This is the cause of the -EIO errors.
      
      [FIX]
      - Do not use ZONE_APPEND for btrfs_submit_repair_write().
      
      - Manually forward sctx->write_pointer after successful writeback
      
      - Use the physical address of the to-be-submitted btrfs_bio for
        fill_writer_pointer_gap()
      
      With these changes, dev-replace on zoned devices works as expected.
      
      Reported-by: Christoph Hellwig <hch@lst.de>
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
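
The write pointer handling described in this commit can be illustrated with a small toy model. This is only a sketch, not the kernel code: the `ZonedTarget` class, `replace_write()` and the 16-sector zone size are hypothetical names and values chosen for illustration. It shows why every write has to start exactly at the write pointer, and how a gap is zero-filled before the real write so that the pointer keeps moving forward.

```python
ZONE_SIZE = 16  # toy zone size in sectors (assumption, not a real device value)


class ZonedTarget:
    """Toy sequential-only zone: a write must start exactly at the write pointer."""

    def __init__(self, size=ZONE_SIZE):
        self.data = [None] * size
        self.write_pointer = 0

    def write(self, physical, sectors):
        if physical != self.write_pointer:
            raise IOError("non-sequential write")
        if physical + len(sectors) > len(self.data):
            raise IOError("zone is full")
        self.data[physical:physical + len(sectors)] = sectors
        # Forward the pointer after a successful write.
        self.write_pointer += len(sectors)


def replace_write(zone, physical, sectors):
    """Zero-fill the gap up to `physical` (a physical address), then write."""
    gap = physical - zone.write_pointer
    if gap < 0:
        raise IOError("write behind the write pointer")
    if gap:
        zone.write(zone.write_pointer, [0] * gap)
    zone.write(physical, sectors)
```

If the pointer were not forwarded after each write, the next gap calculation would start from a stale position and the zone would either reject the write or be zero-filled far more than needed.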
  2. 17 May, 2023 1 commit
    • btrfs: scrub: try harder to mark RAID56 block groups read-only · 7561551e
      Qu Wenruo committed
      Currently scrub is allowed to proceed even if a block group cannot be
      marked read-only.
      
      But if we require RAID56 block groups to be read-only, we can use the
      cached content from the scrub stripe to reduce unnecessary RAID56 reads.
      
      So this patch would:
      
      - Make btrfs_inc_block_group_ro() try harder
        During my tests, for cases like btrfs/061 and btrfs/064, we can hit
        ENOSPC from btrfs_inc_block_group_ro() calls during scrub.
      
        The reason is that if we have only a single data chunk and try to
        scrub it, we won't have any space left for newer data writes.
      
        But this check should be done by the caller, especially since scrub
        only marks the chunk read-only temporarily.
        And newer data writes would always try to allocate a new data chunk
        when needed.
      
      - Return error for scrub if we failed to mark a RAID56 chunk read-only
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 18 Apr, 2023 24 commits
    • btrfs: dev-replace: error out if we have unrepaired metadata error during · 8eb3dd17
      Qu Wenruo committed
      [BUG]
      Even before the scrub rework, if some corrupted metadata fails to be
      repaired during replace, we still continue replacing and let it
      finish as if nothing were wrong:
      
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
       BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
       BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
       BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
      
      This can lead to unexpected problems for the resulting filesystem.
      
      [CAUSE]
      Btrfs reuses the scrub code path for dev-replace to iterate all dev extents.
      But unlike scrub, dev-replace doesn't bother to check the scrub
      progress, which records all the errors found during replace.
      
      And even if we checked the progress, we could not determine from the
      plain numbers which errors are minor and which are critical
      (remember that we don't treat metadata and data checksum errors differently).
      
      This behavior has been there from the very beginning.
      
      [FIX]
      Instead of continuing the replace, just error out if we hit an
      unrepaired metadata sector.
      
      Now the dev-replace is rejected with -EIO to let the user know.
      This also means the filesystem has a metadata error which cannot be
      repaired, so the user would be upset anyway.
      
      The new dmesg would look like this:
      
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
       BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
       BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
       BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
       BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
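
The new error-out rule can be sketched as a bitmap check per stripe. This is a simplified illustration under assumptions: `check_replace_stripe()`, `error_bitmap` and `meta_bitmap` are hypothetical names, with one bit per sector in the stripe. The only value taken from the log above is the `-5` (`-EIO`) return code.

```python
EIO = 5  # errno value matching the "failed -5" line in the dmesg output


def check_replace_stripe(error_bitmap, meta_bitmap, is_replace):
    """Return 0, or -EIO if dev-replace still has an unrepaired metadata sector.

    error_bitmap: bits set for sectors that are still bad after repair.
    meta_bitmap:  bits set for sectors that belong to metadata.
    """
    unrepaired_meta = error_bitmap & meta_bitmap
    if is_replace and unrepaired_meta:
        return -EIO
    return 0
```

An unrepaired data sector, or any unrepaired sector during a plain scrub, does not abort the operation in this sketch; only unrepaired metadata during dev-replace does.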
    • btrfs: scrub: remove scrub_bio structure · 13a62fd9
      Qu Wenruo committed
      Since the scrub path has been fully moved to scrub_stripe based
      facilities, no scrub_bio is submitted anymore.
      Thus we can remove it completely. This involves:
      
      - SCRUB_SECTORS_PER_BIO macro
      - SCRUB_BIOS_PER_SCTX macro
      - SCRUB_MAX_PAGES macro
      - BTRFS_MAX_MIRRORS macro
      - scrub_bio structure
      - scrub_ctx::bios member
      - scrub_ctx::curr member
      - scrub_ctx::bios_in_flight member
      - scrub_ctx::workers_pending member
      - scrub_ctx::list_lock member
      - scrub_ctx::list_wait member
      
      - function scrub_bio_end_io_worker()
      - function scrub_pending_bio_inc()
      - function scrub_pending_bio_dec()
      - function scrub_throttle()
      - function scrub_submit()
      
      - function scrub_find_csum()
      - function drop_csum_range()
      
      - Some unnecessary flush and scrub pauses
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_block and scrub_sector structures · 001e3fc2
      Qu Wenruo committed
      Those two structures are used to represent a bunch of sectors for scrub,
      but now they are fully replaced by scrub_stripe in one go, so we can
      remove them. This involves:
      
      - structure scrub_block
      - structure scrub_sector
      
      - structure scrub_page_private
      - function attach_scrub_page_private()
      - function detach_scrub_page_private()
        Now we no longer need to use page::private to handle subpage.
      
      - function alloc_scrub_block()
      - function alloc_scrub_sector()
      - function scrub_sector_get_page()
      - function scrub_sector_get_page_offset()
      - function scrub_sector_get_kaddr()
      - function bio_add_scrub_sector()
      
      - function scrub_checksum_data()
      - function scrub_checksum_tree_block()
      - function scrub_checksum_super()
      - function scrub_check_fsid()
      - function scrub_block_get()
      - function scrub_block_put()
      - function scrub_sector_get()
      - function scrub_sector_put()
      - function scrub_bio_end_io()
      - function scrub_block_complete()
      - function scrub_add_sector_to_rd_bio()
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove the old scrub recheck code · e9255d6c
      Qu Wenruo committed
      The old scrub code has a different entry point to verify the content,
      and since we have removed the writeback path, we can now start
      removing the re-check part, including:
      
      - scrub_recover structure
      - scrub_sector::recover member
      - function scrub_setup_recheck_block()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_group_good_copy()
      - function scrub_repair_sector_from_good_copy()
      - function scrub_is_page_on_raid56()
      
      - function full_stripe_lock()
      - function search_full_stripe_lock()
      - function get_full_stripe_logical()
      - function insert_full_stripe_lock()
      - function lock_full_stripe()
      - function unlock_full_stripe()
      - btrfs_block_group::full_stripe_locks_root member
      - btrfs_full_stripe_locks_tree structure
        This infrastructure ensured that RAID56 scrub handles recovery and
        P/Q scrub correctly.
      
        It is no longer needed: before P/Q scrub we wait for all the
        involved data stripes to be scrubbed first, and the RAID56 code has
        an internal lock to ensure there is no race in the same full stripe.
      
      - function scrub_print_warning()
      - function scrub_get_recover()
      - function scrub_put_recover()
      - function scrub_handle_errored_block()
      - function scrub_setup_recheck_block()
      - function scrub_bio_wait_endio()
      - function scrub_submit_raid56_bio_wait()
      - function scrub_recheck_block_on_raid56()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_from_good_copy()
      - function scrub_repair_sector_from_good_copy()
      
      And two more functions exported temporarily for later cleanup:
      
      - alloc_scrub_sector()
      - alloc_scrub_block()
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove the old writeback infrastructure · 16f93993
      Qu Wenruo committed
      Since the whole scrub path has been switched to scrub_stripe based
      solution, the old writeback path can be removed completely, which
      involves:
      
      - scrub_ctx::wr_curr_bio member
      - scrub_ctx::flush_all_writes member
      - function scrub_write_block_to_dev_replace()
      - function scrub_write_sector_to_dev_replace()
      - function scrub_add_sector_to_wr_bio()
      - function scrub_wr_submit()
      - function scrub_wr_bio_end_io()
      - function scrub_wr_bio_end_io_worker()
      
      And one more function needs to be exported temporarily:
      
      - scrub_sector_get()
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_parity structure · 5dc96f8d
      Qu Wenruo committed
      The structure scrub_parity is used to indicate that some extents are
      scrubbed for the purpose of RAID56 P/Q scrubbing.
      
      Since the whole RAID56 P/Q scrubbing path has been replaced with new
      scrub_stripe infrastructure, and we no longer need to use scrub_parity
      to modify the behavior of data stripes, we can remove it completely.
      
      This removal involves:
      
      - scrub_parity_workers
        Now only one worker would be utilized, scrub_workers, to do the read
        and repair.
        All writeback would happen at the main scrub thread.
      
      - scrub_block::sparity member
      - scrub_parity structure
      - function scrub_parity_get()
      - function scrub_parity_put()
      - function scrub_free_parity()
      
      - function __scrub_mark_bitmap()
      - function scrub_parity_mark_sectors_error()
      - function scrub_parity_mark_sectors_data()
        These helpers are no longer needed, scrub_stripe has its bitmaps and
        we can use bitmap helpers to get the error/data status.
      
      - scrub_parity_bio_endio()
      - scrub_parity_check_and_repair()
      - function scrub_sectors_for_parity()
      - function scrub_extent_for_parity()
      - function scrub_raid56_data_stripe_for_parity()
      - function scrub_raid56_parity()
        The new code would reuse the scrub read-repair and writeback path.
        Just skip the dev-replace phase.
        And scrub_stripe infrastructure allows us to submit and wait for those
        data stripes before scrubbing P/Q, without extra infrastructure.
      
      The following two functions are temporarily exported for later cleanup:
      
      - scrub_find_csum()
      - scrub_add_sector_to_rd_bio()
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub · 1009254b
      Qu Wenruo committed
      Implement the only missing part for scrub: RAID56 P/Q stripe scrub.
      
      The workflow is pretty straightforward for the new function,
      scrub_raid56_parity_stripe():
      
      - Go through the regular scrub path for each data stripe
      
      - Wait for the verification and repair to finish
      
      - Write back the repaired sectors to data stripes
      
      - Make sure all stripes are properly repaired
        If we have sectors unrepaired, we cannot continue, or we could further
        corrupt the P/Q stripe.
      
      - Submit the rbio for P/Q stripe
        The dev-replace would be handled inside
        raid56_parity_submit_scrub_rbio() path.
      
      - Wait for the above bio to finish
      
      Although the old code is no longer used, we still keep the declaration,
      as the cleanup can be several times larger than this patch itself.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure · e02ee89b
      Qu Wenruo committed
      Switch scrub_simple_mirror() to the new scrub_stripe infrastructure.
      
      Since scrub_simple_mirror() is the core part of scrub (only RAID56
      P/Q stripes don't utilize it), we can get rid of a big chunk of code,
      mostly scrub_extent(), scrub_sectors() and directly called functions.
      
      There is a functionality change:
      
      - Scrub speed throttle now only affects reads on the scrubbing device
        Writes (for repair and replace) and reads from other mirrors won't
        be limited by the set limits.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce helper to queue a stripe for scrub · 54765392
      Qu Wenruo committed
      The new helper, queue_scrub_stripe(), tries to queue a stripe for
      scrub.  If all stripes are already in use, we submit all the
      existing ones and wait for them to finish.
      
      Currently we queue up to 8 stripes, which enlarges the blocksize to
      512KiB and improves performance. Sectors repaired on zoned devices
      need to be relocated instead of being fixed in place.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
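
The queueing behavior can be modeled roughly as follows. This is a sketch, not the kernel implementation: `StripeQueue` and its `submit` callback are hypothetical, and the 64KiB stripe length is inferred from the commit text (8 stripes giving 512KiB).

```python
STRIPE_LEN = 64 * 1024  # inferred from "8 stripes ... 512KiB" in the commit text
MAX_QUEUED = 8          # number of stripes queued before a forced flush


class StripeQueue:
    """Collect stripes and hand them to `submit` in batches of at most MAX_QUEUED."""

    def __init__(self, submit):
        self.pending = []
        self.submit = submit  # hypothetical callback: submit a batch and wait for it

    def queue(self, logical):
        # All slots in use: submit the existing stripes and wait before reusing them.
        if len(self.pending) == MAX_QUEUED:
            self.flush()
        self.pending.append(logical)

    def flush(self):
        if self.pending:
            self.submit(list(self.pending))
            self.pending.clear()
```

A caller would queue one stripe per `STRIPE_LEN` step of the scrubbed range and call `flush()` once at the end to drain whatever is left.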
    • btrfs: scrub: introduce error reporting functionality for scrub_stripe · 00965807
      Qu Wenruo committed
      The new helper, scrub_stripe_report_errors(), will report the result of
      the scrub to system log.
      
      The main reporting is done by introducing a new helper,
      scrub_print_common_warning(), which has mostly the same content as
      scrub_print_warning(), but without the need for a scrub_block.
      
      Since we're reporting the errors, it's the perfect time to update the
      scrub stats too.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce a writeback helper for scrub_stripe · 058e09e6
      Qu Wenruo committed
      Add a new helper, scrub_write_sectors(), to submit write bios for
      specified sectors to the target disk.
      
      There are several differences compared to the read path:
      
      - Utilize btrfs_submit_scrub_write()
        Now we still rely on the @mirror_num based writeback, but the
        requirement is also a little different than regular writeback or read,
        thus we have to call btrfs_submit_scrub_write().
      
      - We cannot write the full stripe back
        We can only write the sectors we have.  There will be two call sites
        later, one for repaired sectors, one for all utilized sectors of
        dev-replace.
      
        Thus the callers should specify their own write_bitmap.
      
      This function only submits the bios and does not wait for them,
      except in the zoned case.
      
      The caller must explicitly wait for the IO to finish.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce the main read repair worker for scrub_stripe · 9ecb5ef5
      Qu Wenruo committed
      The new helper, scrub_stripe_read_repair_worker(), would handle the
      read-repair part:
      
      - Wait for the previously submitted read IO to finish
      
      - Verify the contents of the stripe
      
      - Go through the remaining mirrors, using as large a blocksize as possible
        At this stage, we just read out all the failed sectors from each
        mirror and re-verify.
        If there are no more failed sectors, we can exit.
      
      - Go through all mirrors again, sector-by-sector
        This time we read sector by sector, to address cases where one bad
        sector mismatches the drive's internal checksum and causes the
        whole read range to fail.
      
        We keep this recovery method as the last resort, as sector-by-sector
        reading is slow, and reading from other mirrors may have already fixed
        the errors.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
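
The two repair passes can be sketched with a toy mirror model. Everything here is an assumption made for illustration: a mirror is a dict with `"data"` (sector index to content) and `"unreadable"` (sectors that make any read covering them fail), and `read_range()`, `read_repair()` and `is_good()` are hypothetical names, not kernel functions.

```python
def read_range(mirror, sectors):
    """One read request: it fails as a whole if any requested sector is unreadable."""
    if any(s in mirror["unreadable"] for s in sectors):
        raise IOError("read failed")
    return {s: mirror["data"][s] for s in sectors}


def read_repair(bad, mirrors, is_good):
    """Try to recover the sectors in `bad`; return (repaired, still_bad)."""
    bad = set(bad)
    repaired = {}

    # Pass 1: one large read per mirror covering all remaining bad sectors.
    for mirror in mirrors:
        if not bad:
            break
        try:
            got = read_range(mirror, sorted(bad))
        except IOError:
            continue
        for s, data in got.items():
            if is_good(s, data):
                repaired[s] = data
                bad.discard(s)

    # Pass 2 (last resort): retry every mirror sector by sector, so that one
    # unreadable sector no longer fails its readable neighbours.
    for mirror in mirrors:
        for s in sorted(bad):
            try:
                data = read_range(mirror, [s])[s]
            except IOError:
                continue
            if is_good(s, data):
                repaired[s] = data
                bad.discard(s)

    return repaired, bad
```

In this model the slow second pass only runs for sectors that the large reads could not recover, which mirrors the ordering argument in the commit text.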
    • btrfs: scrub: introduce a helper to verify one scrub_stripe · 97cf8f37
      Qu Wenruo committed
      The new helper, scrub_verify_stripe(), shares the same main workflow of
      the old scrub code.
      
      The major differences are:
      
      - How pages/page_offset is grabbed
        Everything can be grabbed from scrub_stripe easily.
      
      - When error report happens
        Currently the helper only verifies the sectors, not really doing any
        error reporting.
        The error reporting would be done after we have done the repair.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce a helper to verify one metadata block · a3ddbaeb
      Qu Wenruo committed
      The new helper, scrub_verify_one_metadata(), is almost the same as
      scrub_checksum_tree_block().
      
      The difference is in how we grab the pages from other structures.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe · b9795475
      Qu Wenruo committed
      The new helper will search the extent tree to find the first extent of a
      logical range, then fill the sectors array by two loops:
      
      - Loop 1 to fill common bits and metadata generation
      
      - Loop 2 to fill csum data (only for data bgs)
        This loop will use the new btrfs_lookup_csums_bitmap() to fill
        the full csum buffer, and set scrub_sector_verification::csum.
      
      With all the needed info filled by this function, later we only need to
      submit and verify the stripe.
      
      Here we temporarily export the helper to avoid warning on unused static
      function.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce structure for new BTRFS_STRIPE_LEN based interface · 2af2aaf9
      Qu Wenruo committed
      This patch introduces the following structures:
      
      - scrub_sector_verification
        Contains all the needed info to verify one sector (data or metadata).
      
      - scrub_stripe
        Contains all needed members (mostly bitmap based) to scrub one stripe
        (with a length of BTRFS_STRIPE_LEN).
      
      The basic idea is, we keep the existing per-device scrub behavior, but
      merge all the scrub_bio/scrub_block into one generic structure, and read
      the full BTRFS_STRIPE_LEN stripe on the first try.
      
      This means we will read some sectors which are not scrub targets, but
      that's fine. At dev-replace time we only write back the utilized and good
      sectors, and for read-repair we only write back the repaired sectors.
      
      With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio
      form shaping is gone.
      However, to get the same performance as the old scrub behavior, we would
      need to submit the initial read for two stripes at once.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
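
The bitmap-based selection of sectors to write back can be sketched as below. The function and parameter names are hypothetical, and the 4KiB sector size and 64KiB stripe length are assumptions used only to size the toy bitmaps (one bit per sector in the stripe).

```python
SECTORSIZE = 4096                       # assumed sector size
STRIPE_LEN = 64 * 1024                  # assumed value of BTRFS_STRIPE_LEN
NR_SECTORS = STRIPE_LEN // SECTORSIZE   # 16 sectors per stripe
STRIPE_MASK = (1 << NR_SECTORS) - 1


def replace_write_bitmap(used_bitmap, error_bitmap):
    """Dev-replace: write back sectors that are utilized and verified good."""
    return used_bitmap & ~error_bitmap & STRIPE_MASK


def repair_write_bitmap(init_error_bitmap, error_bitmap):
    """Read-repair: write back sectors that were bad at first but are now repaired."""
    return init_error_bitmap & ~error_bitmap & STRIPE_MASK
```

Sectors that were read only because they sit inside the stripe, but hold no extent, never appear in either result, which is why over-reading the full stripe is harmless.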
    • btrfs: scrub: use dedicated super block verification function to scrub one super block · 2a2dc22f
      Qu Wenruo committed
      There is really no need to go through the super complex scrub_sectors()
      just to handle super blocks.  Introduce a dedicated function to handle
      super block scrubbing.
      
      This new function introduces a behavior change: instead of using the
      complex but concurrent scrub_bio system, we simply submit and wait.
      
      There is not much point in caring about the performance of super block
      scrubbing. There are at most 3 super blocks per device, and they are all
      scattered around the devices already.
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove root and csum_root arguments from scrub_simple_mirror() · 6b4d375a
      Qu Wenruo committed
      We don't need to pass the roots as arguments, reading them from the
      rb-tree is cheap.  Thus there is really not much need to pre-fetch it
      and pass it all the way from scrub_stripe().
      
      And we already have more than enough arguments in scrub_simple_mirror()
      and scrub_simple_stripe(), it's better to remove them and only grab
      those roots in scrub_simple_mirror().
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove unused path inside scrub_stripe() · 1d403297
      Qu Wenruo committed
      The variable @path is no longer passed into any call sites after commit
      18d30ab9 ("btrfs: scrub: use scrub_simple_mirror() to handle RAID56
      data stripe scrub"), thus we can remove the variable completely.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dev-replace: properly follow its read mode · 7b31e045
      Qu Wenruo committed
      [BUG]
      Although dev replace ioctl has a way to specify the mode on whether we
      should read from the source device, it's not properly followed.
      
       # mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
       # mount $dev1 $mnt
       # xfs_io -f -c "pwrite 0 32M" $mnt/file
       # sync
       # btrfs replace start -r -f 1 $dev3 $mnt
      
      And one extra trace is added to scrub_submit(), showing the detail about
      the bio:
      
        btrfs-11569 [005] ...  37.0270: scrub_submit.part.0: devid=1 logical=22036480 phy=22036480 len=16384
        btrfs-11569 [005] ...  37.0273: scrub_submit.part.0: devid=1 logical=30457856 phy=30457856 len=32768
        btrfs-11569 [005] ...  37.0274: scrub_submit.part.0: devid=1 logical=30507008 phy=30507008 len=49152
        btrfs-11569 [005] ...  37.0274: scrub_submit.part.0: devid=1 logical=30605312 phy=30605312 len=32768
        btrfs-11569 [005] ...  37.0275: scrub_submit.part.0: devid=1 logical=30703616 phy=30703616 len=65536
        btrfs-11569 [005] ...  37.0281: scrub_submit.part.0: devid=1 logical=298844160 phy=298844160 len=131072
        ...
        btrfs-11569 [005] ...  37.0762: scrub_submit.part.0: devid=1 logical=322961408 phy=322961408 len=131072
        btrfs-11569 [005] ...  37.0762: scrub_submit.part.0: devid=1 logical=323092480 phy=323092480 len=131072
      
      One can see that all the reads are submitted to devid 1, even if we have
      specified "-r" option to avoid reading from the source device.
      
      [CAUSE]
      The dev-replace read mode is only set but not followed by the scrub code
      at all.  In fact, only the common read path properly follows the read
      mode; scrub has its own read path, which does not follow the mode.
      
      [FIX]
      Here we enhance scrub_find_good_copy() to also follow the read mode.
      
      The idea is pretty simple, in the first loop, we avoid the following
      devices:
      
      - Missing devices
        This is the existing condition
      
      - The source device if the replace wants to avoid it.
      
      And if the above loop finds no candidate (e.g. when replacing a single
      device), then we discard the 2nd condition and try again.
      
      Since we're here, also enhance the function scrub_find_good_copy() by:
      
      - Removing the forward declaration
      
      - Making it return int
        To indicate errors, e.g. no good mirror found.
      
      - Adding extra error messages
      
      Now with the same trace, "btrfs replace start -r" works as expected:
      
        btrfs-1213 [000] ...  991.9059: scrub_submit.part.0: devid=2 logical=22036480 phy=1064960 len=16384
        btrfs-1213 [000] ...  991.9062: scrub_submit.part.0: devid=2 logical=30457856 phy=9486336 len=32768
        btrfs-1213 [000] ...  991.9063: scrub_submit.part.0: devid=2 logical=30507008 phy=9535488 len=49152
        btrfs-1213 [000] ...  991.9064: scrub_submit.part.0: devid=2 logical=30605312 phy=9633792 len=32768
        btrfs-1213 [000] ...  991.9065: scrub_submit.part.0: devid=2 logical=30703616 phy=9732096 len=65536
        btrfs-1213 [000] ...  991.9073: scrub_submit.part.0: devid=2 logical=298844160 phy=277872640 len=131072
        btrfs-1213 [000] ...  991.9075: scrub_submit.part.0: devid=2 logical=298975232 phy=278003712 len=131072
        btrfs-1213 [000] ...  991.9078: scrub_submit.part.0: devid=2 logical=299106304 phy=278134784 len=131072
        ...
        btrfs-1213 [000] ...  991.9474: scrub_submit.part.0: devid=2 logical=318504960 phy=297533440 len=131072
        btrfs-1213 [000] ...  991.9476: scrub_submit.part.0: devid=2 logical=318636032 phy=297664512 len=131072
        btrfs-1213 [000] ...  991.9479: scrub_submit.part.0: devid=2 logical=318767104 phy=297795584 len=131072
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7b31e045
    • Q
      btrfs: replace btrfs_io_context::raid_map with a fixed u64 value · 18d758a2
      Qu Wenruo committed on
      In btrfs_io_context structure, we have a pointer raid_map, which
      indicates the logical bytenr for each stripe.
      
      But considering we always call sort_parity_stripes(), the result
      raid_map[] is always sorted, thus raid_map[0] is always the logical
      bytenr of the full stripe.
      
      So why waste the space and time (for sorting) on raid_map?
      
      This patch will replace btrfs_io_context::raid_map with a single u64
      number, full_stripe_start, by:
      
      - Replace btrfs_io_context::raid_map with full_stripe_start
      
      - Replace call sites using raid_map[0] to use full_stripe_start
      
      - Replace call sites using raid_map[i] with a comparison against
        nr_data_stripes.
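
      As a rough illustration of the last point, a stripe can be classified
      from its index alone once the full stripe start is known. The sketch
      below assumes the sorted layout described above (data stripes first,
      then P, then Q) and that data stripes are laid out back to back at
      64KiB intervals; stripe_role() is a hypothetical helper, not a kernel
      function.

```python
BTRFS_STRIPE_LEN = 64 * 1024  # fixed stripe length (64KiB)


def stripe_role(full_stripe_start: int, index: int,
                nr_data_stripes: int) -> str:
    """Describe stripe `index` of a sorted RAID56 full stripe."""
    if index < nr_data_stripes:
        # Data stripes follow each other from the start of the full stripe.
        logical = full_stripe_start + index * BTRFS_STRIPE_LEN
        return f"data@{logical}"
    # Indexes at or past nr_data_stripes are parity: P first, then Q.
    return "P" if index == nr_data_stripes else "Q"
```

      No per-stripe array is needed: the single value full_stripe_start plus
      the stripe index carries the same information.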
      
      The benefits are:
      
      - Less memory wasted on raid_map
        It's sizeof(u64) * num_stripes vs sizeof(u64).
        It'll always save at least one u64, and the benefit grows larger with
        num_stripes.
      
      - No more weird alloc_btrfs_io_context() behavior
        As there is only one fixed size + one variable length array.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18d758a2
    • Q
      btrfs: use an efficient way to represent source of duplicated stripes · 1faf3885
      Qu Wenruo committed on
      For btrfs dev-replace, we have to duplicate writes to the source
      device into the target device.
      
      For non-RAID56, all writes into the same mapped range share the same
      content, so no extra bookkeeping is needed.
      (E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
      same write to all involved devices).
      
      But for RAID56, all stripes contain different content, thus we must
      have a clear mapping of which stripe is duplicated from which original
      stripe.
      
      Currently we use a complex scheme based on the tgtdev_map[] array, e.g.:
      
       num_tgtdevs = 1
       tgtdev_map[0] = 0    <- Means stripes[0] is not involved in replace.
       tgtdev_map[1] = 3    <- Means stripes[1] is involved in replace,
      			 and it's duplicated to stripes[3].
       tgtdev_map[2] = 0    <- Means stripes[2] is not involved in replace.
      
      But this wastes some space and ignores one important property of
      dev-replace: there is at most one running replace.
      
      Thus we can change it to a fixed-size representation of the mapping:
      
       replace_nr_stripes = 1
       replace_stripe_src = 1    <- Means stripes[1] is involved in replace.
      			      thus the extra stripe is a copy of
      			      stripes[1]
      
      By this we can save some space for bioc on RAID56 chunks with many
      devices.  And we get rid of one variable sized array from bioc.
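
      To make the relationship between the two representations concrete, here
      is a small sketch that converts the old per-stripe array from the
      example into the new pair of values. It only covers the single-source
      RAID56 case shown above, and the -1 value for "no source stripe" is an
      assumption made for this illustration.

```python
def convert_tgtdev_map(tgtdev_map):
    """Convert tgtdev_map[] into (replace_nr_stripes, replace_stripe_src)."""
    # A non-zero entry means that stripe is duplicated to another stripe.
    sources = [i for i, target in enumerate(tgtdev_map) if target != 0]
    # At most one dev-replace runs at a time, so at most one source stripe.
    assert len(sources) <= 1
    if not sources:
        return 0, -1  # -1: no stripe is involved (assumed sentinel)
    return 1, sources[0]
```

      For the example above, convert_tgtdev_map([0, 3, 0]) yields one replace
      stripe whose source is stripes[1], regardless of how many devices the
      chunk spans.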
      
      Thus the patch involves the following changes:
      
      - Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
        and @replace_stripe_src.
      
        @num_tgtdevs is just renamed to @replace_nr_stripes.
        While the mapping is completely changed.
      
      - Add extra ASSERT()s for RAID56 code
      
      - Only add two extra stripes for dev-replace cases.
        As we have an upper limit on how many dev-replace stripes we can have.
      
      - Unify the behavior of handle_ops_on_dev_replace()
        Previously handle_ops_on_dev_replace() took two different paths for
        WRITE and GET_READ_MIRRORS.
        Now unify them by always taking the WRITE path first (with at most 2
        replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
        extra stripes, just drop one stripe.
      
      - Remove the @real_stripes argument from alloc_btrfs_io_context()
        As we don't need the old variable length array any more.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1faf3885
    • Q
      btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32 · 6ded22c1
      Qu Wenruo committed on
      There are quite a few div64 calls inside btrfs_map_block() and its
      variants.
      
      Such calls are for @stripe_nr, where @stripe_nr is the number of
      stripes before our logical bytenr inside a chunk.
      
      However we can eliminate such div64 calls by just reducing the width of
      @stripe_nr from 64 to 32.
      
      This can be done because our chunk size limit is already 10G, with a
      fixed stripe length of 64K.
      Thus a u32 is definitely enough to contain the number of stripes.
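
      The arithmetic behind this claim can be checked directly. The sketch
      below uses only the two figures quoted above (10G chunk limit, 64K
      stripe length); the helper names are made up for this example.

```python
SZ_64K = 64 * 1024        # fixed stripe length
SZ_10G = 10 * 1024 ** 3   # chunk size limit quoted above
U32_MAX = 2 ** 32 - 1


def stripes_in_chunk(chunk_len: int) -> int:
    """Number of whole 64K stripes that fit in a chunk of chunk_len bytes."""
    return chunk_len // SZ_64K


def fits_u32(value: int) -> bool:
    """True if value can be stored in an unsigned 32bit integer."""
    return 0 <= value <= U32_MAX


# 10G / 64K = 163840 stripes, far below the u32 limit.
max_stripe_nr = stripes_in_chunk(SZ_10G)
```

      Since the largest possible stripe number is several orders of magnitude
      below U32_MAX, narrowing @stripe_nr loses nothing.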
      
      With such width reduction, we can get rid of the slower div64 calls and
      the extra warnings on certain 32bit architectures.
      
      This patch does the following:
      
      - Add a new tree-checker chunk validation on chunk length
        Make sure no chunk can reach 256G, which can also act as a bitflip
        checker.
      
      - Reduce the width from u64 to u32 for @stripe_nr variables
      
      - Replace unnecessary div64 calls with regular modulo and division
        32bit division and modulo are much faster than 64bit operations, and
        we are finally free of the div64 fear at least in those involved
        functions.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ded22c1
    • Q
      btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN · a97699d1
      Qu Wenruo committed on
      Currently btrfs doesn't support stripe lengths other than 64KiB.
      This is already set in the tree-checker.
      
      There is really no point in recording that fixed value in map_lookup
      for now, and all its uses can be replaced with BTRFS_STRIPE_LEN.
      
      Furthermore we can use the fixed stripe length to do the following
      optimizations:
      
      - Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
        Now we only need to do a right shift.
      
        And the value of BTRFS_STRIPE_LEN itself is already too large for bit
        shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
        a compiler warning would be triggered.
      
        Thus this bit shift optimization would be safe.
      
      - Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
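
      Both optimizations rely on the stripe length being a power of two. A
      minimal sketch, assuming a shift of 16 (which follows from the 64KiB
      stripe length) and a mask of BTRFS_STRIPE_LEN - 1; split_offset() is a
      hypothetical helper used only for illustration:

```python
BTRFS_STRIPE_LEN_SHIFT = 16                      # log2(64KiB)
BTRFS_STRIPE_LEN = 1 << BTRFS_STRIPE_LEN_SHIFT   # 64KiB
BTRFS_STRIPE_LEN_MASK = BTRFS_STRIPE_LEN - 1


def split_offset(offset: int):
    """Split a byte offset into (stripe number, offset inside the stripe)."""
    stripe_nr = offset >> BTRFS_STRIPE_LEN_SHIFT      # replaces a division
    stripe_offset = offset & BTRFS_STRIPE_LEN_MASK    # replaces a modulo
    return stripe_nr, stripe_offset
```

      The result is identical to dividing by BTRFS_STRIPE_LEN and taking the
      remainder, but needs neither a 64bit division nor a modulo.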
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a97699d1
  4. 16 Feb, 2023 1 commit
  5. 14 Feb, 2023 1 commit
    • Q
      btrfs: scrub: improve tree block error reporting · 28232909
      Qu Wenruo committed on
      [BUG]
      When debugging a scrub related metadata error, it turns out that our
      metadata error reporting is not ideal.
      
      The only 3 error messages are:
      
      - BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
        Showing we have metadata generation mismatch errors.
      
      - BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1
        Showing which tree blocks are corrupted.
      
      - BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5
        Showing which physical range the corrupted metadata is at.
      
      We have to combine the above 3 messages to know that we have corrupted
      metadata with a generation mismatch.
      
      And this is already the better case. With other problems, such as an
      fsid mismatch, we cannot even determine the cause.
      
      [CAUSE]
      The problem is caused by the fact that scrub_checksum_tree_block()
      never outputs any error message.
      
      It just returns two bits for scrub: sblock->header_error and
      sblock->generation_error.
      
      Later we report the error in scrub_print_warning(), but with only two
      bits available there is not much we can do to print any detailed
      errors.
      
      [FIX]
      This patch will do the following to enhance the error reporting of
      metadata scrub:
      
      - Add extra warning (ratelimited) for every error we hit
        This can help us to distinguish the different types of errors.
        Some errors can help us to know what's going wrong immediately,
        like bytenr mismatch.
      
      - Re-order the checks
        Currently we check bytenr first, then immediately generation.
        This can lead to false generation mismatch reports when it is
        actually the fsid that mismatches.
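
      The effect of re-ordering the checks can be sketched as below. The
      check order used here (bytenr, then fsid, then generation) is an
      assumption for illustration, as are the dictionary based header and
      the check_tree_block() name.

```python
# Assumed check order for this sketch: generation is checked last.
CHECK_ORDER = ("bytenr", "fsid", "generation")


def check_tree_block(header: dict, expected: dict):
    """Return a warning string for the first mismatching field, or None."""
    for field in CHECK_ORDER:
        if header[field] != expected[field]:
            # One specific message per error type, instead of two bare bits.
            return (f"tree block {expected['bytenr']} has bad {field}, "
                    f"has {header[field]} want {expected[field]}")
    return None
```

      A block whose fsid and generation are both wrong is now reported as
      having a bad fsid, which points at the real problem.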
      
      Here is the new output for the bug I'm debugging (we forgot to
      writeback tree blocks for commit roots):
      
       BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
       BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
      
      Now we can immediately see that some tree blocks didn't even get
      written back, rather than the original confusing generation mismatch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      28232909
  6. 06 Dec, 2022 12 commits
    • Q
      btrfs: introduce a bitmap based csum range search function · 97e38239
      Qu Wenruo committed on
      Although we have an existing function, btrfs_lookup_csums_range(), to
      find all data checksums for a range, it's based on a btrfs_ordered_sum
      list.
      
      For the upcoming RAID56 data checksum verification at RMW time, we don't
      want to waste time by allocating temporary memory.
      
      So this patch introduces a new helper, btrfs_lookup_csums_bitmap().
      It returns a bitmap based result, which will be a perfect fit for the
      later RAID56 usage.
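
      The idea of a bitmap based result can be illustrated with a small
      model: one bit per sector says whether a checksum was found, and the
      checksums land in a preallocated buffer at fixed positions. This is
      not the kernel interface; the function name, its parameters, the
      dictionary lookup and the 4 byte checksum size are all assumptions
      made for the example.

```python
def lookup_csums_bitmap(csums_by_bytenr: dict, start: int, nr_sectors: int,
                        sectorsize: int = 4096, csum_size: int = 4):
    """Return (bitmap, csum_buf) for nr_sectors sectors beginning at start."""
    bitmap = 0
    # One fixed slot per sector, so no list nodes are allocated per result.
    csum_buf = bytearray(nr_sectors * csum_size)
    for i in range(nr_sectors):
        csum = csums_by_bytenr.get(start + i * sectorsize)
        if csum is None:
            continue  # no checksum for this sector: its bit stays clear
        assert len(csum) == csum_size
        bitmap |= 1 << i
        csum_buf[i * csum_size:(i + 1) * csum_size] = csum
    return bitmap, bytes(csum_buf)
```

      A caller can then test a single bit to learn whether sector i has a
      checksum and index straight into the buffer to fetch it.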
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      97e38239
    • Q
      btrfs: use btrfs_dev_name() helper to handle missing devices better · cb3e217b
      Qu Wenruo committed on
      [BUG]
      If dev-replace failed to re-construct its data/metadata, the kernel
      message would be incorrect for the missing device:
      
       BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
       BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)
      
      Note the above "dev (efault)" of the second line.
      While the first line is properly reporting "<missing disk>".
      
      [CAUSE]
      Although dev-replace is using btrfs_dev_name(), the heavy lifting work
      is still done by scrub (scrub is reused by both dev-replace and regular
      scrub).
      
      Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
      only declared locally inside dev-replace.c.
      
      [FIX]
      Fix the output by:
      
      - Move the btrfs_dev_name() helper to volumes.h
      
      - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
        Only zoned code is not touched, as I'm not familiar with degraded
        zoned code.
      
      - Constify return value and parameter
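
      What such a helper buys can be shown with a minimal model: every
      caller gets the same placeholder for an absent or missing device
      instead of formatting an invalid name. The Device class and dev_name()
      below are illustrative stand-ins, not the kernel definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    # Hypothetical stand-in for a btrfs device, for illustration only.
    name: Optional[str]
    missing: bool = False


def dev_name(device: Optional[Device]) -> str:
    """Return a printable name, even for absent or missing devices."""
    if device is None or device.missing:
        return "<missing disk>"
    return device.name
```

      Callers that format messages through dev_name() can no longer print a
      bogus name for a missing device.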
      
      Now the output looks pretty sane:
      
       BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
       BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb3e217b
    • F
      btrfs: use a structure to pass arguments to backref walking functions · a2c8d27e
      Filipe Manana committed on
      The public backref walking functions have quite a lot of arguments that
      are passed down the call stack to find_parent_nodes(), the core function
      of the backref walking code.
      
      The next patches in the series will need to add even more arguments to
      these functions, which should be passed not only to find_parent_nodes(),
      but also to other functions used by the latter (directly or even lower
      in the call stack).
      
      So create a structure to hold all these arguments and state used by the
      main backref walking function, find_parent_nodes(), and use it as the
      argument for the public backref walking functions iterate_extent_inodes(),
      btrfs_find_all_leafs() and btrfs_find_all_roots().
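
      The pattern is a plain context object. The sketch below is a loose
      model only: the field names are invented for the example and the two
      functions are simplified stand-ins that merely share names with the
      idea described above.

```python
from dataclasses import dataclass, field


@dataclass
class BackrefWalkCtx:
    # Hypothetical fields; the real structure is defined by the patch.
    bytenr: int
    extent_item_pos: int = 0
    time_seq: int = 0
    refs: list = field(default_factory=list)  # results are stored here


def find_parent_nodes(ctx: BackrefWalkCtx) -> None:
    """Stand-in for the core walker: reads inputs from ctx, stores results."""
    ctx.refs.append(("parent-of", ctx.bytenr))


def find_all_leafs(ctx: BackrefWalkCtx) -> list:
    """Stand-in for a public entry point: it only forwards the context."""
    find_parent_nodes(ctx)
    return ctx.refs
```

      Adding another input later means adding one field to the context,
      while every function signature along the call chain stays the same.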
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a2c8d27e
    • F
      btrfs: use a single argument for extent offset in backref walking functions · 6ce6ba53
      Filipe Manana committed on
      The interface for find_parent_nodes() has two extent offset related
      arguments:
      
      1) One u64 pointer argument for the extent offset;
      
      2) One boolean argument to tell if the extent offset should be ignored or
         not.
      
      These are confusing, because the extent offset pointer can be NULL and in
      some cases callers pass a NULL value as a way to tell the backref walking
      code to ignore offsets in file extent items (and simply consider all file
      extent items that point to the target data extent).
      
      The boolean argument was added in commit c995ab3c ("btrfs: add a flag
      to iterate_inodes_from_logical to find all extent refs for uncompressed
      extents"), but it was never really necessary, it was enough if it could
      find a way to get a NULL value passed to the "extent_item_pos" argument of
      find_parent_nodes(). The arguments are also passed to functions called
      by find_parent_nodes() and respective helper functions, which further
      makes everything more complicated than needed.
      
      Then we have several backref walking related functions that end up calling
      find_parent_nodes(), either directly or through some other function that
      they call, and for many we have to use an "extent_item_pos" (u64) argument
      and a boolean "ignore_offset" argument too.
      
      This is confusing and not really necessary. So use a single argument to
      specify the extent offset, as a simple u64 and not as a pointer, but
      using a special value of (u64)-1, defined as a documented constant, to
      indicate when the extent offset should be ignored.
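
      The sentinel approach can be sketched as follows. The constant name
      and the matching helper are hypothetical, and the comparison is
      deliberately simplified to an exact match.

```python
# (u64)-1 as a documented "ignore the offset" marker (hypothetical name).
IGNORE_EXTENT_OFFSET = (1 << 64) - 1


def extent_item_matches(item_offset: int, wanted_offset: int) -> bool:
    """Decide whether a file extent item should be considered."""
    if wanted_offset == IGNORE_EXTENT_OFFSET:
        # Caller asked to consider all items pointing to the data extent.
        return True
    return item_offset == wanted_offset
```

      One plain integer argument now covers both cases that previously
      needed a nullable pointer plus a boolean.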
      
      This is also preparation work for the upcoming patches in the series that
      add other arguments to find_parent_nodes() and other related functions
      that use it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ce6ba53
    • F
      btrfs: send: optimize clone detection to increase extent sharing · c7499a64
      Filipe Manana committed on
      Currently send does not make the best decisions when it comes to
      deciding between multiple clone sources, which results in clone
      operations for partial extent ranges. This has the following
      disadvantages:
      
      1) We get fewer shared extents at the destination;
      
      2) We have to read more data during the send operation and emit more
         write commands.
      
      Besides not being optimal behaviour, it also breaks user expectations and
      is often reported by users, with a recent example in the Link tag at the
      bottom of this change log.
      
      Part of the reason for this non-optimal behaviour is that the backref
      walking code does not provide information about the length of the file
      extent items that were found for each backref, so send is blind about
      which backref is the best to choose as a cloning source.
      
      The other existing reasons are just silliness, namely always preferring
      the inode with the lowest number when multiple are found for the same
      root and, when we can clone from multiple roots, always preferring the
      send root over any of the other clone roots. This does not make any
      sense since any inode or root is fine and as good as any other
      inode/root.
      
      Fix this by making backref walking pass information about the number of
      bytes referenced by each file extent item and then have send's backref
      callback pick the inode with the highest number of bytes for each root.
      Finally, select the root from which we can clone the most bytes.
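
      The selection rule can be modelled in a few lines. This is a sketch of
      the rule as described, not of the send code: the tuple layout and the
      pick_clone_source() name are assumptions.

```python
def pick_clone_source(backrefs):
    """Pick (root, inode, num_bytes) from (root, inode, num_bytes) tuples."""
    best_per_root = {}
    for root, inode, num_bytes in backrefs:
        current = best_per_root.get(root)
        # Per root, keep the inode that references the most bytes.
        if current is None or num_bytes > current[1]:
            best_per_root[root] = (inode, num_bytes)
    if not best_per_root:
        return None
    # Across roots, pick the one from which the most bytes can be cloned.
    root = max(best_per_root, key=lambda r: best_per_root[r][1])
    inode, num_bytes = best_per_root[root]
    return root, inode, num_bytes
```

      Neither the inode number nor the identity of the root plays any role
      in the choice, only the number of bytes that can be cloned.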
      
      Example reproducer:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
         cp --reflink=always $MNT/foo $MNT/baz
         sync
      
         # Overwrite the second half of file foo.
         xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
         sync
      
         echo
         echo "*** fiemap in the original filesystem ***"
         echo
         xfs_io -c "fiemap -v" $MNT/foo
         xfs_io -c "fiemap -v" $MNT/bar
         xfs_io -c "fiemap -v" $MNT/baz
         echo
      
         btrfs filesystem du $MNT
      
         btrfs subvolume snapshot -r $MNT $MNT/snap
      
         btrfs send -f /tmp/send_stream $MNT/snap
      
         umount $MNT
         mkfs.btrfs -f $DEV &> /dev/null
         mount $DEV $MNT
      
         btrfs receive -f /tmp/send_stream $MNT
      
         echo
         echo "*** fiemap in the new filesystem ***"
         echo
         xfs_io -r -c "fiemap -v" $MNT/snap/foo
         xfs_io -r -c "fiemap -v" $MNT/snap/bar
         xfs_io -r -c "fiemap -v" $MNT/snap/baz
         echo
      
         btrfs filesystem du $MNT
      
         rm -f /tmp/send_stream
         rm -f /tmp/snap.fssum
      
         umount $MNT
      
      Before this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
      
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    32768..34815      2048   0x1
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/bar
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/baz
            6.00MiB     2.00MiB           -  /mnt/sdi/snap
            6.00MiB     2.00MiB     2.00MiB  /mnt/sdi
      
      We end up with two 1M extents that are not shared for files bar and baz.
      
      After this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB       0.00B           -  /mnt/sdi/snap/bar
            2.00MiB       0.00B           -  /mnt/sdi/snap/baz
            6.00MiB       0.00B           -  /mnt/sdi/snap
            6.00MiB       0.00B     3.00MiB  /mnt/sdi
      
      Now the sharing is much better: files bar and baz share 1M of the
      extent of file foo, and the second extent of files bar and baz is
      shared between themselves.
      
      This will later be turned into a test case for fstests.
      
      Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7499a64
    • J
      btrfs: move scrub prototypes into scrub.h · 2fc6822c
      Josef Bacik committed on
      Move these out of ctree.h into scrub.h to cut down on code in ctree.h.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2fc6822c
    • J
      btrfs: move file-item prototypes into their own header · 7c8ede16
      Josef Bacik committed on
      Move these prototypes out of ctree.h and into file-item.h.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7c8ede16
    • D
      btrfs: sink gfp_t parameter to alloc_scrub_sector · 02bc3927
      David Sterba committed on
      All callers pass GFP_KERNEL as parameter so we can use it directly in
      alloc_scrub_sector.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      02bc3927
    • D
      btrfs: switch GFP_NOFS to GFP_KERNEL in scrub_setup_recheck_block · fe10158c
      David Sterba committed on
      There's only one caller that calls scrub_setup_recheck_block in the
      memalloc_nofs_save/_restore protection so it's effectively already
      GFP_NOFS and it's safe to use GFP_KERNEL.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fe10158c
    • J
      btrfs: move accessor helpers into accessors.h · 07e81dc9
      Josef Bacik committed on
      This is a large patch, but because they're all macros it's impossible to
      split up.  Simply copy all of the item accessors in ctree.h and paste
      them in accessors.h, and then update any files to include the header so
      everything compiles.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments, style fixups ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      07e81dc9
    • J
      btrfs: move fs wide helpers out of ctree.h · c7f13d42
      Josef Bacik committed on
      We have several fs wide related helpers in ctree.h.  The bulk of these
      are the incompat flag test helpers, but there are things such as
      btrfs_fs_closing() and the read only helpers that also aren't directly
      related to the ctree code.  Move these into a fs.h header, which will
      serve as the location for file system wide related helpers.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7f13d42
    • J
      btrfs: move BTRFS_MAX_MIRRORS into scrub.c · ed4c491a
      Josef Bacik committed on
      This is only used locally in scrub.c, move it out of ctree.h into
      scrub.c.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed4c491a