1. 07 Nov 2022, 2 commits
    • btrfs: zoned: fix locking imbalance on scrub · c62f6bec
      Johannes Thumshirn authored
      If we're doing a device replace on a zoned filesystem and discover in
      scrub_enumerate_chunks() that we don't have to copy the block group, it
      is unlocked before it gets skipped.
      
      But as the block group hasn't been locked before this point, this leads
      to a locking imbalance. To fix this, simply remove the unlock.
      
      This was uncovered by fstests' testcase btrfs/163.
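      A minimal sketch of the shape of the fix (an illustrative diff, assuming
      the surrounding code tests the flag with the bit helpers; the exact
      context in scrub_enumerate_chunks() may differ):
      
        if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
                if (!test_bit(BLOCK_GROUP_FLAG_TO_COPY, &cache->runtime_flags)) {
        -               spin_unlock(&cache->lock);
                        btrfs_put_block_group(cache);
                        goto skip;
                }
        }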
      
      Fixes: 9283b9e0 ("btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY")
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Revert "btrfs: scrub: use larger block size for data extent scrub" · b75b51f8
      Qu Wenruo authored
      This reverts commit 786672e9.
      
      [BUG]
      Since commit 786672e9 ("btrfs: scrub: use larger block size for data
      extent scrub"), btrfs scrub no longer reports errors if the corruption
      is not in the first sector of a STRIPE_LEN sized range.
      
      The following script can expose the problem:
      
        mkfs.btrfs -f $dev
        mount $dev $mnt
        xfs_io -f -c "pwrite -S 0xff 0 8k" $mnt/foobar
        umount $mnt
      
        # 13631488 is the logical bytenr of the above 8K extent
        btrfs-map-logical -l 13631488 -b 4096 $dev
        mirror 1 logical 13631488 physical 13631488 device /dev/test/scratch1
      
        # Corrupt the 2nd sector of that extent
        xfs_io -f -c "pwrite -S 0x00 13635584 4k" $dev
      
        mount $dev $mnt
        btrfs scrub start -B $mnt
        scrub done for 54e63f9f-0c30-4c84-a33b-5c56014629b7
        Scrub started:    Mon Nov  7 07:18:27 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   536.00MiB
        Rate:             0.00B/s
        Error summary:    no errors found <<<
      
      [CAUSE]
      That offending commit enlarges the data extent scrub size from sector
      size to BTRFS_STRIPE_LEN, to avoid allocating extra scrub_blocks.
      
      But unfortunately the data extent scrub path still relies heavily on
      the assumption that there is only one scrub_sector per scrub_block.
      
      Thus it will only check the first sector, ignoring the remaining
      sectors.
      
      Furthermore the error reporting is not able to handle multiple sectors
      either.
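      As an illustration only (scrub_check_sector() is a hypothetical name
      for the per-sector verification, not the actual kernel code), the check
      path effectively behaved as if each block had a single sector:
      
        /* Only the first sector was ever verified. */
        ret = scrub_check_sector(sblock->sectors[0]);
        /* sblock->sectors[1 .. sector_count - 1] were never checked. */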
      
      [FIX]
      For now just revert the offending commit.
      
      The consequence is just extra memory usage during scrub.
      We will need a proper change to make the remaining data scrub path
      handle multiple sectors before we enlarge the data scrub size.
      Reported-by: Li Zhang <zhanglikernel@gmail.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 29 Sep 2022, 1 commit
  3. 26 Sep 2022, 13 commits
    • btrfs: properly abstract the parity raid bio handling · f1c29379
      Christoph Hellwig authored
      The parity raid write/recover functionality is currently not very well
      abstracted from the bio submission and completion handling in volumes.c:
      
       - the raid56 code directly completes the original btrfs_bio fed into
         btrfs_submit_bio instead of dispatching back to volumes.c
       - the raid56 code consumes the bioc and bio_counter references taken
         by volumes.c, which also leads to special casing of the calls from
         the scrub code into the raid56 code
      
      To fix this up, supply a bi_end_io handler that calls back into the
      volumes.c machinery, which then puts the bioc, decrements the
      bio_counter and completes the original bio; and update the scrub code
      to also take ownership of the bioc and bio_counter in all cases.
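      Roughly the shape of such a handler (a sketch based on the description
      above; the exact names and struct layout in the patch may differ):
      
        static void btrfs_raid56_end_io(struct bio *bio)
        {
                struct btrfs_io_context *bioc = bio->bi_private;
                struct btrfs_bio *bbio = btrfs_bio(bio);
        
                /* Drop the bio_counter reference taken at submission time. */
                btrfs_bio_counter_dec(bioc->fs_info);
                bbio->mirror_num = bioc->mirror_num;
                /* Complete the original btrfs_bio fed into btrfs_submit_bio(). */
                bio_endio(&bbio->bio);
        
                /* Consume the bioc reference taken by volumes.c. */
                btrfs_put_bioc(bioc);
        }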
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: use larger block size for data extent scrub · 786672e9
      Qu Wenruo authored
      [PROBLEM]
      The existing scrub code for data extents always limits the block size
      to sectorsize.
      
      This causes quite a few extra scrub_blocks to be allocated (here for a
      data extent at logical bytenr 298844160, length 64KiB):
      
        alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
        alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1
        alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1
        alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1
        alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1
        alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1
        alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1
        alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1
        alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1
        alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1
        alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1
        alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1
        alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1
        alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1
        alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1
        alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1
        ...
        scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1
        scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1
        scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1
        scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1
        scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1
        scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1
        scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1
        scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1
        scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1
        scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1
        scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1
        scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1
        scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1
        scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1
        scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1
        scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1
      
      This behavior wastes a lot of memory, especially after we have moved
      quite a few members from scrub_sector to scrub_block.
      
      [FIX]
      To reduce the allocation of scrub_block, and to reduce memory usage, use
      BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data
      extents.
      
      This results in only one scrub_block being allocated for the above data
      extent:
      
        alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
        scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1
      
      This greatly reduces the (transient) memory usage when scrubbing larger
      data extents.
      
      For the above example, the memory usage would be:
      
      Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector))
           16          * (408                 + 96) = 8064
      
      New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector)
           408                 + 16          * 96 = 1944
      
      A good reduction of 75.9%.
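      A quick sanity check of the arithmetic above (the struct sizes 408 and
      96 are the ones quoted in this message and depend on kernel config):
      
        #include <stdio.h>
        
        int main(void)
        {
                const unsigned int num_sectors = 16;    /* 64KiB / 4KiB sectors */
                const unsigned int sb = 408;            /* sizeof(scrub_block)  */
                const unsigned int ss = 96;             /* sizeof(scrub_sector) */
        
                printf("old: %u\n", num_sectors * (sb + ss));   /* 8064 */
                printf("new: %u\n", sb + num_sectors * ss);     /* 1944 */
                return 0;
        }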
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: move logical/physical/dev/mirror_num from scrub_sector to scrub_block · 8686c40e
      Qu Wenruo authored
      Currently we store the following members in scrub_sector:
      
      - logical
      - physical
      - physical_for_dev_replace
      - dev
      - mirror_num
      
      However the current scrub code ensures that a scrub_block never crosses
      a stripe boundary.
      This is guaranteed by the entry functions (scrub_simple_mirror(),
      scrub_simple_stripe()), thus every scrub_block stays within one stripe.
      
      This makes it possible to move those members into scrub_block rather
      than keeping them in scrub_sector.
      
      This should save quite some memory, as a scrub_block can be as large as
      64 sectors, and even for metadata it's 16 sectors by default.
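      A sketch of the resulting layout (trimmed to the members discussed
      here; the real structs carry more fields and the exact names are
      assumptions):
      
        struct scrub_block {
                u64 logical;
                u64 physical;
                u64 physical_for_dev_replace;
                struct btrfs_device *dev;
                int mirror_num;
                /* ... sectors, refs, state, ... */
        };
        
        struct scrub_sector {
                struct scrub_block *sblock;
                /* The offset inside the block is enough to derive the rest. */
                u32 offset;
                /* ... per-sector state ... */
        };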
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_sector::page and use scrub_block::pages instead · eb2fad30
      Qu Wenruo authored
      Although scrub currently works for subpage (PAGE_SIZE > sectorsize)
      cases, it allocates one page for each scrub_sector, which can cause
      extra unnecessary memory usage.
      
      Utilize scrub_block::pages[] instead of allocating a page for each
      scrub_sector; this allows us to integrate larger extents while using
      less memory.
      
      For example, if our page size is 64K, sectorsize is 4K, and we got a
      32K sized extent, we will only allocate one page for the scrub_block,
      and all 8 scrub sectors will point into that page.
      
      To do that properly, here we introduce several small helpers:
      
      - scrub_page_get_logical()
        Get the logical bytenr of a page.
        We store the logical bytenr of the page range into page::private.
        But for 32bit systems, their (void *) is not large enough to contain
        a u64, so in that case we will need to allocate extra memory for it.
      
        For 64bit systems, we can use page::private directly.
      
      - scrub_block_get_logical()
        Just get the logical bytenr of the first page.
      
      - scrub_sector_get_page()
        Return the page which the scrub_sector points to.
      
      - scrub_sector_get_page_offset()
        Return the offset inside the page which the scrub_sector points to.
      
      - scrub_sector_get_kaddr()
        Return the address which the scrub_sector points to.
        Just a wrapper using scrub_sector_get_page() and
        scrub_sector_get_page_offset()
      
      - bio_add_scrub_sector()
      
      Please note that, even with this patch, we're still allocating one page
      per sector for data extents.
      
      This is because in scrub_extent() we split the data extent using
      sectorsize.
      
      The memory usage reduction will need extra work to make scrub behave
      like the data read path and only use the correct sector(s).
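      For illustration, the two page-lookup helpers could look roughly like
      this (a sketch under the assumption that scrub_sector::offset is
      relative to the block's first page; the real code adds sanity checks):
      
        static struct page *scrub_sector_get_page(struct scrub_sector *ssector)
        {
                struct scrub_block *sblock = ssector->sblock;
                pgoff_t index = ssector->offset >> PAGE_SHIFT;
        
                return sblock->pages[index];
        }
        
        static unsigned int scrub_sector_get_page_offset(struct scrub_sector *ssector)
        {
                return offset_in_page(ssector->offset);
        }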
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce scrub_block::pages for more efficient memory usage for subpage · f3e01e0e
      Qu Wenruo authored
      [BACKGROUND]
      Currently for scrub we allocate one page for one sector. This is fine
      for the PAGE_SIZE == sectorsize case, but can waste extra memory for
      subpage support.
      
      [CODE CHANGE]
      Make scrub_block contain all the pages, so if we're scrubbing an extent
      sized 64K, and our page size is also 64K, we only need to allocate one
      page.
      
      [LIFESPAN CHANGE]
      Since scrub_sector no longer holds a page but uses scrub_block::pages[]
      instead, we have to ensure scrub_block has a long enough lifespan for
      the write bio. The lifespan for the read bio is already long enough.
      
      Now scrub_block will only be released after the write bio has finished.
      
      [COMING NEXT]
      This patch only adds scrub_block::pages[] for this purpose;
      scrub_sector is still utilizing the old scrub_sector::page.
      
      The switch will happen in the next patch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: factor out allocation and initialization of scrub_sector into helper · 5dd3d8e4
      Qu Wenruo authored
      The allocation and initialization are shared by 3 call sites, and we're
      going to change the initialization of some members in the upcoming
      patches.
      
      So factor out the allocation and initialization of scrub_sector into a
      helper, alloc_scrub_sector(), which will do the following work:
      
      - Allocate the memory for scrub_sector
      
      - Allocate a page for scrub_sector::page
      
      - Initialize scrub_sector::refs to 1
      
      - Attach the allocated scrub_sector to scrub_block
        The attachment is bidirectional, which means scrub_block::sectors[]
        will be updated and scrub_sector::sblock will also be updated.
      
      - Update scrub_block::sector_count and do extra sanity check on it
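      
      A sketch of the helper's shape following the list above (illustrative;
      the exact field names and sanity check are assumptions):
      
        static struct scrub_sector *alloc_scrub_sector(struct scrub_block *sblock,
                                                       gfp_t gfp)
        {
                struct scrub_sector *ssector;
        
                ssector = kzalloc(sizeof(*ssector), gfp);
                if (!ssector)
                        return NULL;
                ssector->page = alloc_page(gfp);
                if (!ssector->page) {
                        kfree(ssector);
                        return NULL;
                }
                atomic_set(&ssector->refs, 1);
        
                /* Bidirectional attachment to the parent block. */
                ssector->sblock = sblock;
                sblock->sectors[sblock->sector_count] = ssector;
                sblock->sector_count++;
                ASSERT(sblock->sector_count <= SCRUB_MAX_SECTORS_PER_BLOCK);
        
                return ssector;
        }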
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: factor out initialization of scrub_block into helper · 15b88f6d
      Qu Wenruo authored
      Although there are only two callers, we are going to add some members
      to scrub_block in the incoming patches.  Factoring out the
      initialization code will make later expansion easier.
      
      One thing to note is, even though scrub_handle_errored_block() doesn't
      utilize scrub_block::refs, we still use alloc_scrub_block() to
      initialize sblock::refs, allowing us to use scrub_block_put() to do the
      cleanup.
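      The factored-out helper could look roughly like this (a sketch; the
      member list is an assumption based on the surrounding commits):
      
        static struct scrub_block *alloc_scrub_block(struct scrub_ctx *sctx)
        {
                struct scrub_block *sblock;
        
                sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
                if (!sblock)
                        return NULL;
                /* One ref for the caller, released via scrub_block_put(). */
                refcount_set(&sblock->refs, 1);
                sblock->sctx = sctx;
                sblock->no_io_error_seen = 1;
        
                return sblock;
        }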
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: use pointer array to replace sblocks_for_recheck · 1dfa5005
      Qu Wenruo authored
      In function scrub_handle_errored_block(), we use the
      @sblocks_for_recheck pointer to hold one scrub_block for each mirror,
      and use kcalloc() to allocate the array.
      
      But accessing an array through a single pointer is less readable, as
      member access is done by pointer arithmetic and not [].
      
      Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS], which
      will slightly increase the stack memory usage.
      
      Since function scrub_handle_errored_block() won't get iterative calls,
      this extra cost is completely acceptable.
      
      And since we're here, also set sblock->refs and use scrub_block_put()
      to clean them up, as later we will add extra members to scrub_block
      which need scrub_block_put() to clean them up.
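      Before/after, as a sketch (the per-mirror allocation of the individual
      blocks is omitted):
      
        /* Before: one kcalloc'd chunk, addressed by pointer arithmetic. */
        struct scrub_block *sblocks_for_recheck;
        
        sblocks_for_recheck = kcalloc(BTRFS_MAX_MIRRORS,
                                      sizeof(*sblocks_for_recheck), GFP_KERNEL);
        
        /* After: a fixed on-stack array of pointers, indexed with []. */
        struct scrub_block *sblocks_for_recheck[BTRFS_MAX_MIRRORS] = { NULL };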
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove impossible sanity checks · fc65bb53
      Qu Wenruo authored
      There are several sanity checks which are no longer possible to trigger
      inside btrfs_scrub_dev().
      
      Since we have a mount time check of the super block
      nodesize/sectorsize, and our fixed macros are hardcoded to handle even
      the worst combination, those sanity checks are no longer needed and can
      be easily removed.
      
      But this patch still keeps some ASSERT()s as a safety net, just in case
      we change some features in the future and trigger those impossible
      combinations.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY · 9283b9e0
      Josef Bacik authored
      We use this during device replace for zoned devices. We were simply
      taking the lock because the flag was in a bit field and we needed the
      lock to be safe against other modifications in the bit field.  With the
      bit helpers we no longer require that locking.
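      Illustrative only (flag and field names follow the commit below that
      introduced the bit helpers):
      
        bool to_copy;
        
        /* Before: reads had to be serialized with other bit field updates. */
        spin_lock(&cache->lock);
        to_copy = cache->to_copy;
        spin_unlock(&cache->lock);
        
        /* After: test_bit() is atomic, no lock needed. */
        to_copy = test_bit(BLOCK_GROUP_FLAG_TO_COPY, &cache->runtime_flags);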
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert block group bit field to use bit helpers · 3349b57f
      Josef Bacik authored
      We use a bit field in btrfs_block_group for different flags. However,
      this is awkward because we have to hold block_group->lock for any
      modification of any of these fields, and it makes the code clunky for a
      few of these flags.  Convert these to a proper flags setup so we can
      utilize the bit helpers.
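      A sketch of the conversion (member names are illustrative, not the full
      set in the actual patch):
      
        /* Before: bit fields, every update guarded by block_group->lock. */
        struct btrfs_block_group {
                unsigned int removed:1;
                unsigned int to_copy:1;
        };
        
        /*
         * After: an enum of flag bits plus one atomic flags word, so
         * set_bit()/clear_bit()/test_bit() work without taking the lock.
         */
        enum btrfs_block_group_flags {
                BLOCK_GROUP_FLAG_REMOVED,
                BLOCK_GROUP_FLAG_TO_COPY,
        };
        
        struct btrfs_block_group {
                unsigned long runtime_flags;
        };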
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: try to fix super block errors · f9eab5f0
      Qu Wenruo authored
      [BUG]
      The following script shows that, although scrub can detect super block
      errors, it never tries to fix them:
      
      	mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
      	xfs_io -c "pwrite 67108864 4k" $dev2
      
      	mount $dev1 $mnt
      	btrfs scrub start -B $dev2
      	btrfs scrub start -Br $dev2
      	umount $mnt
      
      The first scrub reports the super error correctly:
      
        scrub done for f3289218-abd3-41ac-a630-202f766c0859
        Scrub started:    Tue Aug  2 14:44:11 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   1.26GiB
        Rate:             0.00B/s
        Error summary:    super=1
          Corrected:      0
          Uncorrectable:  0
          Unverified:     0
      
      But the second read-only scrub still reports the same super error:
      
        Scrub started:    Tue Aug  2 14:44:11 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   1.26GiB
        Rate:             0.00B/s
        Error summary:    super=1
          Corrected:      0
          Uncorrectable:  0
          Unverified:     0
      
      [CAUSE]
      The comment already claims that super block errors can be easily fixed
      by committing a transaction:
      
      	/*
      	 * If we find an error in a super block, we just report it.
      	 * They will get written with the next transaction commit
      	 * anyway
      	 */
      
      But the truth is, this assumption is not always valid, and since scrub
      should try to repair every error it finds (except for read-only scrub),
      we should really actively commit a transaction to fix this.
      
      [FIX]
      Just commit a transaction if we found any super block errors, after
      everything else is done.
      
      We cannot do this just after scrub_supers(), as
      btrfs_commit_transaction() will try to pause and wait for the running
      scrub, thus we cannot call it with scrub_lock held.
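      The rough shape of the fix (a sketch, assuming the super_errors counter
      in the scrub stats; error handling trimmed):
      
        /* After scrub_lock has been dropped and scrub has finished. */
        if (sctx->stat.super_errors) {
                struct btrfs_trans_handle *trans;
        
                trans = btrfs_start_transaction(fs_info->tree_root, 0);
                if (!IS_ERR(trans))
                        btrfs_commit_transaction(trans);
        }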
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: properly report super block errors in system log · e69bf81c
      Qu Wenruo authored
      [PROBLEM]
      Unlike data/metadata corruption, if scrub detects an error in the
      super block, the only error message is from the updated device status:
      
        BTRFS info (device dm-1): scrub: started on devid 2
        BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
      
      This is not helpful at all.
      
      [CAUSE]
      Unlike data/metadata error reporting, there is no visible report in the
      kernel dmesg for super block errors.
      
      In fact, the return value of scrub_checksum_super() is intentionally
      ignored, thus scrub_handle_errored_block() will never be called for
      super blocks.
      
      [FIX]
      Make super block errors output an error message; now the full
      dmesg looks like this:
      
        BTRFS info (device dm-1): scrub: started on devid 2
        BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
        BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
        BTRFS info (device dm-1): scrub: started on devid 2
      
      This fix involves:
      
      - Move the super_errors reporting to scrub_handle_errored_block()
        This allows the device status message to show after the super block
        error message.
        But now we no longer distinguish super block corruption and
        generation mismatch; both are now counted as corruption.
      
      - Properly check the return value from scrub_checksum_super()
      - Add extra super block error reporting for scrub_print_warning().
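      
      A sketch of the warning emitted (the helper call and format string are
      assumptions approximating the dmesg line shown above):
      
        /* In the error path for a super block sector: */
        btrfs_warn_in_rcu(fs_info,
                "super block error on device %s, physical %llu",
                rcu_str_deref(dev->name), sblock->physical);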
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 25 Jul 2022, 4 commits
  5. 16 May 2022, 19 commits
  6. 21 Apr 2022, 1 commit
    • btrfs: fix assertion failure during scrub due to block group reallocation · a692e13d
      Filipe Manana authored
      During a scrub, or device replace, we can race with block group removal
      and allocation and trigger the following assertion failure:
      
      [7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
      [7526.387351] ------------[ cut here ]------------
      [7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
      [7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      [7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
      [7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
      [7526.393520] Code: f3 48 c7 c7 20 (...)
      [7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
      [7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
      [7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
      [7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
      [7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
      [7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
      [7526.402874] FS:  00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
      [7526.404038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
      [7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7526.408169] Call Trace:
      [7526.408529]  <TASK>
      [7526.408839]  scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
      [7526.409690]  ? do_wait_intr_irq+0xb0/0xb0
      [7526.410276]  btrfs_scrub_dev+0x226/0x620 [btrfs]
      [7526.410995]  ? preempt_count_add+0x49/0xa0
      [7526.411592]  btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
      [7526.412278]  ? __fget_files+0xc9/0x1b0
      [7526.412825]  ? kvm_sched_clock_read+0x14/0x40
      [7526.413459]  ? lock_release+0x155/0x4a0
      [7526.414022]  ? __x64_sys_ioctl+0x83/0xb0
      [7526.414601]  __x64_sys_ioctl+0x83/0xb0
      [7526.415150]  do_syscall_64+0x3b/0xc0
      [7526.415675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [7526.416408] RIP: 0033:0x7f6c99d34397
      [7526.416931] Code: 3c 1c e8 1c ff (...)
      [7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
      [7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
      [7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
      [7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
      [7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
      [7526.425950]  </TASK>
      
      That assertion is relatively new, introduced with commit d04fbe19
      ("btrfs: scrub: cleanup the argument list of scrub_chunk()").
      
      The block group we get at scrub_enumerate_chunks() can actually have a
      start address that is smaller than the chunk offset we extracted from a
      device extent item we got from the commit root of the device tree.
      This is very rare, but it can happen due to a race with block group
      removal and allocation. For example, the following steps show how this
      can happen:
      
      1) We are at transaction T, and we have the following blocks groups,
         sorted by their logical start address:
      
         [ bg A, start address A, length 1G (data) ]
         [ bg B, start address B, length 1G (data) ]
         (...)
         [ bg W, start address W, length 1G (data) ]
      
           --> logical address space hole of 256M,
               there used to be a 256M metadata block group here
      
         [ bg Y, start address Y, length 256M (metadata) ]
      
            --> Y matches W's end offset + 256M
      
         Block group Y is the block group with the highest logical address in
         the whole filesystem;
      
      2) Block group Y is deleted and its extent mapping is removed by the call
         to remove_extent_mapping() made from btrfs_remove_block_group().
      
         So after this point, the last element of the mapping red black tree,
         its rightmost node, is the mapping for block group W;
      
      3) While still at transaction T, a new data block group is allocated,
         with a length of 1G. When creating the block group we do a call to
         find_next_chunk(), which returns the logical start address for the
         new block group. This call returns X, which corresponds to the
         end offset of the last block group, the rightmost node in the mapping
         red black tree (fs_info->mapping_tree), plus one.
      
         So we get a new block group that starts at logical address X and with
         a length of 1G. It spans over the whole logical range of the old block
         group Y, that was previously removed in the same transaction.
      
         However the device extent allocated to block group X is not the same
         device extent that was used by block group Y, and it also does not
         overlap that extent, which must be always the case because we allocate
         extents by searching through the commit root of the device tree
         (otherwise it could corrupt a filesystem after a power failure or
         an unclean shutdown in general), so the extent allocator is behaving
         as expected;
      
      4) We have a task running scrub, currently at scrub_enumerate_chunks().
         There it searches for device extent items in the device tree, using
         its commit root. It finds a device extent item that was used by
         block group Y, and it extracts the value Y from that item into the
         local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
      
         It then calls btrfs_lookup_block_group() to find block group for
         the logical address Y - since there's currently no block group that
         starts at that logical address, it returns block group X, because
         its range contains Y.
      
         This results in triggering the assertion:
      
            ASSERT(cache->start == chunk_offset);
      
         right before calling scrub_chunk(), as cache->start is X and
         chunk_offset is Y.
      
      This is more likely to happen on filesystems not larger than 50G, because
      for these filesystems we use a 256M size for metadata block groups and
      a 1G size for data block groups, while for filesystems larger than 50G,
      we use a 1G size for both data and metadata block groups (except for
      zoned filesystems). It could also happen on any filesystem size due to
      the fact that system block groups are always smaller (32M) than both
      data and metadata block groups, but these are not frequently deleted, so
      much less likely to trigger the race.
      
      So make scrub skip any block group with a start offset that is less than
      the value we expect, as that means it's a new block group that was created
      in the current transaction. It's pointless to continue and try to scrub
      its extents, because scrub searches for extents using the commit root, so
      it won't find any. For a device replace, skip it as well for the same
      reasons, and we don't need to worry about the possibility of extents of
      the new block group not being copied to the new device, because we have
      the write duplication setup done through btrfs_map_block().
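      
      The check then looks roughly like this (a sketch based on the
      description; variable names follow scrub_enumerate_chunks()):
      
        if (cache->start < chunk_offset) {
                /*
                 * A new block group created in the current transaction is
                 * reusing the logical range of a block group removed in the
                 * same transaction. Scrub searches for extents using the
                 * commit root, so there is nothing to do for it.
                 */
                btrfs_put_block_group(cache);
                goto skip;
        }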
      
      Fixes: d04fbe19 ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>