1. 25 7月, 2022 1 次提交
  2. 16 5月, 2022 19 次提交
  3. 21 4月, 2022 1 次提交
    • F
      btrfs: fix assertion failure during scrub due to block group reallocation · a692e13d
      Filipe Manana 提交于
      During a scrub, or device replace, we can race with block group removal
      and allocation and trigger the following assertion failure:
      
      [7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
      [7526.387351] ------------[ cut here ]------------
      [7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
      [7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      [7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
      [7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
      [7526.393520] Code: f3 48 c7 c7 20 (...)
      [7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
      [7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
      [7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
      [7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
      [7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
      [7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
      [7526.402874] FS:  00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
      [7526.404038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
      [7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7526.408169] Call Trace:
      [7526.408529]  <TASK>
      [7526.408839]  scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
      [7526.409690]  ? do_wait_intr_irq+0xb0/0xb0
      [7526.410276]  btrfs_scrub_dev+0x226/0x620 [btrfs]
      [7526.410995]  ? preempt_count_add+0x49/0xa0
      [7526.411592]  btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
      [7526.412278]  ? __fget_files+0xc9/0x1b0
      [7526.412825]  ? kvm_sched_clock_read+0x14/0x40
      [7526.413459]  ? lock_release+0x155/0x4a0
      [7526.414022]  ? __x64_sys_ioctl+0x83/0xb0
      [7526.414601]  __x64_sys_ioctl+0x83/0xb0
      [7526.415150]  do_syscall_64+0x3b/0xc0
      [7526.415675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [7526.416408] RIP: 0033:0x7f6c99d34397
      [7526.416931] Code: 3c 1c e8 1c ff (...)
      [7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
      [7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
      [7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
      [7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
      [7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
      [7526.425950]  </TASK>
      
      That assertion is relatively new, introduced with commit d04fbe19
      ("btrfs: scrub: cleanup the argument list of scrub_chunk()").
      
      The block group we get at scrub_enumerate_chunks() can actually have a
      start address that is smaller then the chunk offset we extracted from a
      device extent item we got from the commit root of the device tree.
      This is very rare, but it can happen due to a race with block group
      removal and allocation. For example, the following steps show how this
      can happen:
      
      1) We are at transaction T, and we have the following blocks groups,
         sorted by their logical start address:
      
         [ bg A, start address A, length 1G (data) ]
         [ bg B, start address B, length 1G (data) ]
         (...)
         [ bg W, start address W, length 1G (data) ]
      
           --> logical address space hole of 256M,
               there used to be a 256M metadata block group here
      
         [ bg Y, start address Y, length 256M (metadata) ]
      
            --> Y matches W's end offset + 256M
      
         Block group Y is the block group with the highest logical address in
         the whole filesystem;
      
      2) Block group Y is deleted and its extent mapping is removed by the call
         to remove_extent_mapping() made from btrfs_remove_block_group().
      
         So after this point, the last element of the mapping red black tree,
         its rightmost node, is the mapping for block group W;
      
      3) While still at transaction T, a new data block group is allocated,
         with a length of 1G. When creating the block group we do a call to
         find_next_chunk(), which returns the logical start address for the
         new block group. This calls returns X, which corresponds to the
         end offset of the last block group, the rightmost node in the mapping
         red black tree (fs_info->mapping_tree), plus one.
      
         So we get a new block group that starts at logical address X and with
         a length of 1G. It spans over the whole logical range of the old block
         group Y, that was previously removed in the same transaction.
      
         However the device extent allocated to block group X is not the same
         device extent that was used by block group Y, and it also does not
         overlap that extent, which must be always the case because we allocate
         extents by searching through the commit root of the device tree
         (otherwise it could corrupt a filesystem after a power failure or
         an unclean shutdown in general), so the extent allocator is behaving
         as expected;
      
      4) We have a task running scrub, currently at scrub_enumerate_chunks().
         There it searches for device extent items in the device tree, using
         its commit root. It finds a device extent item that was used by
         block group Y, and it extracts the value Y from that item into the
         local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
      
         It then calls btrfs_lookup_block_group() to find block group for
         the logical address Y - since there's currently no block group that
         starts at that logical address, it returns block group X, because
         its range contains Y.
      
         This results in triggering the assertion:
      
            ASSERT(cache->start == chunk_offset);
      
         right before calling scrub_chunk(), as cache->start is X and
         chunk_offset is Y.
      
      This is more likely to happen of filesystems not larger than 50G, because
      for these filesystems we use a 256M size for metadata block groups and
      a 1G size for data block groups, while for filesystems larger than 50G,
      we use a 1G size for both data and metadata block groups (except for
      zoned filesystems). It could also happen on any filesystem size due to
      the fact that system block groups are always smaller (32M) than both
      data and metadata block groups, but these are not frequently deleted, so
      much less likely to trigger the race.
      
      So make scrub skip any block group with a start offset that is less than
      the value we expect, as that means it's a new block group that was created
      in the current transaction. It's pointless to continue and try to scrub
      its extents, because scrub searches for extents using the commit root, so
      it won't find any. For a device replace, skip it as well for the same
      reasons, and we don't need to worry about the possibility of extents of
      the new block group not being to the new device, because we have the write
      duplication setup done through btrfs_map_block().
      
      Fixes: d04fbe19 ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a692e13d
  4. 14 3月, 2022 1 次提交
  5. 07 1月, 2022 8 次提交
    • Q
      btrfs: scrub: cleanup the argument list of scrub_stripe() · 2ae8ae3d
      Qu Wenruo 提交于
      The argument list of btrfs_stripe() has similar problems of
      scrub_chunk():
      
      - Duplicated and ambiguous @base argument
        Can be fetched from btrfs_block_group::bg.
      
      - Ambiguous argument @length
        It's again device extent length
      
      - Ambiguous argument @num
        The instinctive guess would be mirror number, but in fact it's stripe
        index.
      
      Fix it by:
      
      - Remove @base parameter
      
      - Rename @length to @dev_extent_len
      
      - Rename @num to @stripe_index
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2ae8ae3d
    • Q
      btrfs: scrub: cleanup the argument list of scrub_chunk() · d04fbe19
      Qu Wenruo 提交于
      The argument list of scrub_chunk() has the following problems:
      
      - Duplicated @chunk_offset
        It is the same as btrfs_block_group::start.
      
      - Confusing @length
        The most instinctive guess is chunk length, and one may want to delete
        it, but the truth is, it's the device extent length.
      
      Fix this by:
      
      - Remove @chunk_offset
        Use btrfs_block_group::start instead.
      
      - Rename @length to @dev_extent_len
        Also rename the caller to remove the ambiguous naming.
      
      - Rename @cache to @bg
        The "_cache" suffix for btrfs_block_group has been removed for a while.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d04fbe19
    • Q
      btrfs: remove reada infrastructure · f26c9238
      Qu Wenruo 提交于
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for the single user, it's not providing the correct
      functionality it needs, as scrub needs reada for commit root, which
      current readahead can't provide. (Although it's pretty easy to add such
      feature).
      
      Despite this, there are some extra problems related to metadata
      readahead:
      
      - Duplicated feature with btrfs_path::reada
      
      - Partly duplicated feature of btrfs_fs_info::buffer_radix
        Btrfs already caches its metadata in buffer_radix, while readahead
        tries to read the tree block no matter if it's already cached.
      
      - Poor layer separation
        Metadata readahead works kinda at device level.
        This is definitely not the correct layer it should be, since metadata
        is at btrfs logical address space, it should not bother device at all.
      
        This brings extra chance for bugs to sneak in, while brings
        unnecessary complexity.
      
      - Dead code
        In the very beginning of scrub.c we have #undef DEBUG, rendering all
        the debug related code useless and unable to test.
      
      Thus here I purpose to remove the metadata readahead mechanism
      completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
      The number is reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
      	Old		New		Diff
      SSD	455.3		466.332		+2.42%
      HDD	103.927 	98.012		-5.69%
      
      Comprehensive test methodology is in the cover letter of the patch.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f26c9238
    • Q
      btrfs: scrub: use btrfs_path::reada for extent tree readahead · dcf62b20
      Qu Wenruo 提交于
      For scrub, we trigger two readaheads for two trees, extent tree to get
      where to scrub, and csum tree to get the data checksum.
      
      For csum tree we already trigger readahead in
      btrfs_lookup_csums_range(), by setting path->reada.
      But for extent tree we don't have any path based readahead.
      
      Add the readahead for extent tree as well, so we can later remove the
      btrfs_reada_add() based readahead.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dcf62b20
    • Q
      btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity() · 2522dbe8
      Qu Wenruo 提交于
      In function scrub_stripe() we allocated two btrfs_path's, one @path for
      extent tree search and another @ppath for full stripe extent tree search
      for RAID56.
      
      This is totally umncessary, as the @ppath usage is completely inside
      scrub_raid56_parity(), thus we can move the path allocation into
      scrub_raid56_parity() completely.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2522dbe8
    • J
      btrfs: zoned: sink zone check into btrfs_repair_one_zone · 554aed7d
      Johannes Thumshirn 提交于
      Sink zone check into btrfs_repair_one_zone() so we don't need to do it
      in all callers.
      
      Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
      a boolean function and return false in case it got called on a non-zoned
      filesystem and true on a zoned filesystem.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      554aed7d
    • Q
      btrfs: scrub: merge SCRUB_PAGES_PER_RD_BIO and SCRUB_PAGES_PER_WR_BIO · c9d328c0
      Qu Wenruo 提交于
      These two values were introduced in commit ff023aac ("Btrfs: add code
      to scrub to copy read data to another disk") as an optimization.
      
      But the truth is, block layer scheduler can do whatever it wants to
      merge/split bios to improve performance.
      
      Doing such "optimization" is not really going to affect much, especially
      considering how good current block layer optimizations are doing.
      Remove such old and immature optimization from our code.
      
      Since we're here, also change BUG_ON()s using these two macros to use
      ASSERT()s.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c9d328c0
    • Q
      btrfs: update SCRUB_MAX_PAGES_PER_BLOCK · 0bb3acdc
      Qu Wenruo 提交于
      Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to
      calculate this value.
      
      And remove one stale comment on the value, in fact with recent subpage
      support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond
      BTRFS_STRIPE_LEN, just we don't use the full page.
      
      Also since we're here, update the BUG_ON() related to
      SCRUB_MAX_PAGES_PER_BLOCK to ASSERT().
      
      As those ASSERT() are really only for developers to catch early obvious
      bugs, not to let end users suffer.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0bb3acdc
  6. 03 1月, 2022 3 次提交
  7. 16 11月, 2021 1 次提交
  8. 27 10月, 2021 6 次提交
    • J
      btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
      Josef Bacik 提交于
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      562d7b15
    • J
      btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Josef Bacik 提交于
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      84961539
    • Q
      btrfs: remove btrfs_raid_bio::fs_info member · 6a258d72
      Qu Wenruo 提交于
      We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is
      always passed into alloc_rbio(), and only get released when the raid bio
      is released.
      
      Remove btrfs_raid_bio::fs_info member, and cleanup all the @fs_info
      parameters for alloc_rbio() callers.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6a258d72
    • Q
      btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo 提交于
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c3a3b19b
    • Q
      btrfs: remove btrfs_bio_alloc() helper · cd8e0cca
      Qu Wenruo 提交于
      The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
      except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes
      bio->bi_iter.bi_sector.
      
      However the naming itself is not using "btrfs_io_bio" to indicate its
      parameter is "strcut btrfs_io_bio" and can be easily confused with
      "struct btrfs_bio".
      
      Considering assigned bio->bi_iter.bi_sector is such a simple work and
      there are already tons of call sites doing that manually, there is no
      need to do that in a helper.
      
      Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc()
      function to provide a fail-safe value for its @nr_iovecs.
      
      And then replace all btrfs_bio_alloc() callers with
      btrfs_io_bio_alloc().
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cd8e0cca
    • Q
      btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Qu Wenruo 提交于
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records
        how many mirrors are still pending, and save the original endio
        function of the bio.
      
      - RAID56 code
        In that case, RAID56 only utilize the stripes info, and no long uses
        that to trace the pending mirrors.
      
      So btrfs_bio is not always bind to a bio, and contains more info for IO
      context, thus renaming it will make the naming less confusing.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4c664611