1. 16 5月, 2022 4 次提交
  2. 21 4月, 2022 1 次提交
    • F
      btrfs: fix assertion failure during scrub due to block group reallocation · a692e13d
      Filipe Manana 提交于
      During a scrub, or device replace, we can race with block group removal
      and allocation and trigger the following assertion failure:
      
      [7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
      [7526.387351] ------------[ cut here ]------------
      [7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
      [7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      [7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
      [7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
      [7526.393520] Code: f3 48 c7 c7 20 (...)
      [7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
      [7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
      [7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
      [7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
      [7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
      [7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
      [7526.402874] FS:  00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
      [7526.404038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
      [7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7526.408169] Call Trace:
      [7526.408529]  <TASK>
      [7526.408839]  scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
      [7526.409690]  ? do_wait_intr_irq+0xb0/0xb0
      [7526.410276]  btrfs_scrub_dev+0x226/0x620 [btrfs]
      [7526.410995]  ? preempt_count_add+0x49/0xa0
      [7526.411592]  btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
      [7526.412278]  ? __fget_files+0xc9/0x1b0
      [7526.412825]  ? kvm_sched_clock_read+0x14/0x40
      [7526.413459]  ? lock_release+0x155/0x4a0
      [7526.414022]  ? __x64_sys_ioctl+0x83/0xb0
      [7526.414601]  __x64_sys_ioctl+0x83/0xb0
      [7526.415150]  do_syscall_64+0x3b/0xc0
      [7526.415675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [7526.416408] RIP: 0033:0x7f6c99d34397
      [7526.416931] Code: 3c 1c e8 1c ff (...)
      [7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
      [7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
      [7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
      [7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
      [7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
      [7526.425950]  </TASK>
      
      That assertion is relatively new, introduced with commit d04fbe19
      ("btrfs: scrub: cleanup the argument list of scrub_chunk()").
      
      The block group we get at scrub_enumerate_chunks() can actually have a
      start address that is smaller then the chunk offset we extracted from a
      device extent item we got from the commit root of the device tree.
      This is very rare, but it can happen due to a race with block group
      removal and allocation. For example, the following steps show how this
      can happen:
      
      1) We are at transaction T, and we have the following blocks groups,
         sorted by their logical start address:
      
         [ bg A, start address A, length 1G (data) ]
         [ bg B, start address B, length 1G (data) ]
         (...)
         [ bg W, start address W, length 1G (data) ]
      
           --> logical address space hole of 256M,
               there used to be a 256M metadata block group here
      
         [ bg Y, start address Y, length 256M (metadata) ]
      
            --> Y matches W's end offset + 256M
      
         Block group Y is the block group with the highest logical address in
         the whole filesystem;
      
      2) Block group Y is deleted and its extent mapping is removed by the call
         to remove_extent_mapping() made from btrfs_remove_block_group().
      
         So after this point, the last element of the mapping red black tree,
         its rightmost node, is the mapping for block group W;
      
      3) While still at transaction T, a new data block group is allocated,
         with a length of 1G. When creating the block group we do a call to
         find_next_chunk(), which returns the logical start address for the
         new block group. This calls returns X, which corresponds to the
         end offset of the last block group, the rightmost node in the mapping
         red black tree (fs_info->mapping_tree), plus one.
      
         So we get a new block group that starts at logical address X and with
         a length of 1G. It spans over the whole logical range of the old block
         group Y, that was previously removed in the same transaction.
      
         However the device extent allocated to block group X is not the same
         device extent that was used by block group Y, and it also does not
         overlap that extent, which must be always the case because we allocate
         extents by searching through the commit root of the device tree
         (otherwise it could corrupt a filesystem after a power failure or
         an unclean shutdown in general), so the extent allocator is behaving
         as expected;
      
      4) We have a task running scrub, currently at scrub_enumerate_chunks().
         There it searches for device extent items in the device tree, using
         its commit root. It finds a device extent item that was used by
         block group Y, and it extracts the value Y from that item into the
         local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
      
         It then calls btrfs_lookup_block_group() to find block group for
         the logical address Y - since there's currently no block group that
         starts at that logical address, it returns block group X, because
         its range contains Y.
      
         This results in triggering the assertion:
      
            ASSERT(cache->start == chunk_offset);
      
         right before calling scrub_chunk(), as cache->start is X and
         chunk_offset is Y.
      
      This is more likely to happen of filesystems not larger than 50G, because
      for these filesystems we use a 256M size for metadata block groups and
      a 1G size for data block groups, while for filesystems larger than 50G,
      we use a 1G size for both data and metadata block groups (except for
      zoned filesystems). It could also happen on any filesystem size due to
      the fact that system block groups are always smaller (32M) than both
      data and metadata block groups, but these are not frequently deleted, so
      much less likely to trigger the race.
      
      So make scrub skip any block group with a start offset that is less than
      the value we expect, as that means it's a new block group that was created
      in the current transaction. It's pointless to continue and try to scrub
      its extents, because scrub searches for extents using the commit root, so
      it won't find any. For a device replace, skip it as well for the same
      reasons, and we don't need to worry about the possibility of extents of
      the new block group not being to the new device, because we have the write
      duplication setup done through btrfs_map_block().
      
      Fixes: d04fbe19 ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a692e13d
  3. 14 3月, 2022 1 次提交
  4. 07 1月, 2022 8 次提交
    • Q
      btrfs: scrub: cleanup the argument list of scrub_stripe() · 2ae8ae3d
      Qu Wenruo 提交于
      The argument list of btrfs_stripe() has similar problems of
      scrub_chunk():
      
      - Duplicated and ambiguous @base argument
        Can be fetched from btrfs_block_group::bg.
      
      - Ambiguous argument @length
        It's again device extent length
      
      - Ambiguous argument @num
        The instinctive guess would be mirror number, but in fact it's stripe
        index.
      
      Fix it by:
      
      - Remove @base parameter
      
      - Rename @length to @dev_extent_len
      
      - Rename @num to @stripe_index
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2ae8ae3d
    • Q
      btrfs: scrub: cleanup the argument list of scrub_chunk() · d04fbe19
      Qu Wenruo 提交于
      The argument list of scrub_chunk() has the following problems:
      
      - Duplicated @chunk_offset
        It is the same as btrfs_block_group::start.
      
      - Confusing @length
        The most instinctive guess is chunk length, and one may want to delete
        it, but the truth is, it's the device extent length.
      
      Fix this by:
      
      - Remove @chunk_offset
        Use btrfs_block_group::start instead.
      
      - Rename @length to @dev_extent_len
        Also rename the caller to remove the ambiguous naming.
      
      - Rename @cache to @bg
        The "_cache" suffix for btrfs_block_group has been removed for a while.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d04fbe19
    • Q
      btrfs: remove reada infrastructure · f26c9238
      Qu Wenruo 提交于
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for the single user, it's not providing the correct
      functionality it needs, as scrub needs reada for commit root, which
      current readahead can't provide. (Although it's pretty easy to add such
      feature).
      
      Despite this, there are some extra problems related to metadata
      readahead:
      
      - Duplicated feature with btrfs_path::reada
      
      - Partly duplicated feature of btrfs_fs_info::buffer_radix
        Btrfs already caches its metadata in buffer_radix, while readahead
        tries to read the tree block no matter if it's already cached.
      
      - Poor layer separation
        Metadata readahead works kinda at device level.
        This is definitely not the correct layer it should be, since metadata
        is at btrfs logical address space, it should not bother device at all.
      
        This brings extra chance for bugs to sneak in, while brings
        unnecessary complexity.
      
      - Dead code
        In the very beginning of scrub.c we have #undef DEBUG, rendering all
        the debug related code useless and unable to test.
      
      Thus here I purpose to remove the metadata readahead mechanism
      completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
      The number is reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
      	Old		New		Diff
      SSD	455.3		466.332		+2.42%
      HDD	103.927 	98.012		-5.69%
      
      Comprehensive test methodology is in the cover letter of the patch.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f26c9238
    • Q
      btrfs: scrub: use btrfs_path::reada for extent tree readahead · dcf62b20
      Qu Wenruo 提交于
      For scrub, we trigger two readaheads for two trees, extent tree to get
      where to scrub, and csum tree to get the data checksum.
      
      For csum tree we already trigger readahead in
      btrfs_lookup_csums_range(), by setting path->reada.
      But for extent tree we don't have any path based readahead.
      
      Add the readahead for extent tree as well, so we can later remove the
      btrfs_reada_add() based readahead.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dcf62b20
    • Q
      btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity() · 2522dbe8
      Qu Wenruo 提交于
      In function scrub_stripe() we allocated two btrfs_path's, one @path for
      extent tree search and another @ppath for full stripe extent tree search
      for RAID56.
      
      This is totally umncessary, as the @ppath usage is completely inside
      scrub_raid56_parity(), thus we can move the path allocation into
      scrub_raid56_parity() completely.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2522dbe8
    • J
      btrfs: zoned: sink zone check into btrfs_repair_one_zone · 554aed7d
      Johannes Thumshirn 提交于
      Sink zone check into btrfs_repair_one_zone() so we don't need to do it
      in all callers.
      
      Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
      a boolean function and return false in case it got called on a non-zoned
      filesystem and true on a zoned filesystem.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      554aed7d
    • Q
      btrfs: scrub: merge SCRUB_PAGES_PER_RD_BIO and SCRUB_PAGES_PER_WR_BIO · c9d328c0
      Qu Wenruo 提交于
      These two values were introduced in commit ff023aac ("Btrfs: add code
      to scrub to copy read data to another disk") as an optimization.
      
      But the truth is, block layer scheduler can do whatever it wants to
      merge/split bios to improve performance.
      
      Doing such "optimization" is not really going to affect much, especially
      considering how good current block layer optimizations are doing.
      Remove such old and immature optimization from our code.
      
      Since we're here, also change BUG_ON()s using these two macros to use
      ASSERT()s.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c9d328c0
    • Q
      btrfs: update SCRUB_MAX_PAGES_PER_BLOCK · 0bb3acdc
      Qu Wenruo 提交于
      Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to
      calculate this value.
      
      And remove one stale comment on the value, in fact with recent subpage
      support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond
      BTRFS_STRIPE_LEN, just we don't use the full page.
      
      Also since we're here, update the BUG_ON() related to
      SCRUB_MAX_PAGES_PER_BLOCK to ASSERT().
      
      As those ASSERT() are really only for developers to catch early obvious
      bugs, not to let end users suffer.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0bb3acdc
  5. 03 1月, 2022 3 次提交
  6. 16 11月, 2021 1 次提交
  7. 27 10月, 2021 6 次提交
    • J
      btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
      Josef Bacik 提交于
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      562d7b15
    • J
      btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Josef Bacik 提交于
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      84961539
    • Q
      btrfs: remove btrfs_raid_bio::fs_info member · 6a258d72
      Qu Wenruo 提交于
      We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is
      always passed into alloc_rbio(), and only get released when the raid bio
      is released.
      
      Remove btrfs_raid_bio::fs_info member, and cleanup all the @fs_info
      parameters for alloc_rbio() callers.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6a258d72
    • Q
      btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo 提交于
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c3a3b19b
    • Q
      btrfs: remove btrfs_bio_alloc() helper · cd8e0cca
      Qu Wenruo 提交于
      The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
      except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes
      bio->bi_iter.bi_sector.
      
      However the naming itself is not using "btrfs_io_bio" to indicate its
      parameter is "strcut btrfs_io_bio" and can be easily confused with
      "struct btrfs_bio".
      
      Considering assigned bio->bi_iter.bi_sector is such a simple work and
      there are already tons of call sites doing that manually, there is no
      need to do that in a helper.
      
      Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc()
      function to provide a fail-safe value for its @nr_iovecs.
      
      And then replace all btrfs_bio_alloc() callers with
      btrfs_io_bio_alloc().
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cd8e0cca
    • Q
      btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Qu Wenruo 提交于
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records
        how many mirrors are still pending, and save the original endio
        function of the bio.
      
      - RAID56 code
        In that case, RAID56 only utilize the stripes info, and no long uses
        that to trace the pending mirrors.
      
      So btrfs_bio is not always bind to a bio, and contains more info for IO
      context, thus renaming it will make the naming less confusing.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4c664611
  8. 22 6月, 2021 1 次提交
  9. 21 6月, 2021 3 次提交
    • Q
      btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE · 8df507cb
      Qu Wenruo 提交于
      [BUG]
      For the following file layout, scrub will not be able to repair all
      these two repairable error, but in fact make one corruption even
      unrepairable:
      
      	  inode offset 0      4k     8K
      Mirror 1               |XXXXXX|      |
      Mirror 2               |      |XXXXXX|
      
      [CAUSE]
      The root cause is the hard coded PAGE_SIZE, which makes scrub repair to
      go crazy for subpage.
      
      For above case, when reading the first sector, we use PAGE_SIZE other
      than sectorsize to read, which makes us to read the full range [0, 64K).
      In fact, after 8K there may be no data at all, we can just get some
      garbage.
      
      Then when doing the repair, we also writeback a full page from mirror 2,
      this means, we will also writeback the corrupted data in mirror 2 back
      to mirror 1, leaving the range [4K, 8K) unrepairable.
      
      [FIX]
      This patch will modify the following PAGE_SIZE use with sectorsize:
      
      - scrub_print_warning_inode()
        Remove the min() and replace PAGE_SIZE with sectorsize.
        The min() makes no sense, as csum is done for the full sector with
        padding.
      
        This fixes a bug that subpage report extra length like:
         checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
         physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
      
        Where the error is only 1 sector.
      
      - scrub_handle_errored_block()
        Comments with PAGE|page involved, all changed to sector.
      
      - scrub_setup_recheck_block()
      - scrub_repair_page_from_good_copy()
      - scrub_add_page_to_wr_bio()
      - scrub_wr_submit()
      - scrub_add_page_to_rd_bio()
      - scrub_block_complete()
        Replace PAGE_SIZE with sectorsize.
        This solves several problems where we read/write extra range for
        subpage case.
      
      RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
      usage, and is not really safe enough.
      Thus we will reject RAID56 for subpage in later commit.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8df507cb
    • D
      btrfs: scrub: factor out common scrub_stripe constraints · 7735cd75
      David Sterba 提交于
      There are common values set for the stripe constraints, some of them
      are already factored out. Do that for increment and mirror_num as well.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7735cd75
    • D
      btrfs: scrub: per-device bandwidth control · eb3b5053
      David Sterba 提交于
      Add sysfs interface to limit io during scrub. We relied on the ionice
      interface to do that, eg. the idle class let the system usable while
      scrub was running. This has changed when mq-deadline got widespread and
      did not implement the scheduling classes. That was a CFQ thing that got
      deleted. We've got numerous complaints from users about degraded
      performance.
      
      Currently only BFQ supports that but it's not a common scheduler and we
      can't ask everybody to switch to it.
      
      Alternatively the cgroup io limiting can be used but that also a
      non-trivial setup (v2 required, the controller must be enabled on the
      system). This can still be used if desired.
      
      Other ideas that have been explored: piggy-back on ionice (that is set
      per-process and is accessible) and interpret the class and classdata as
      bandwidth limits, but this does not have enough flexibility as there are
      only 8 allowed and we'd have to map fixed limits to each value. Also
      adjusting the value would need to lookup the process that currently runs
      scrub on the given device, and the value is not sticky so would have to
      be adjusted each time scrub runs.
      
      Running out of options, sysfs does not look that bad:
      
      - it's accessible from scripts, or udev rules
      - the name is similar to what MD-RAID has
        (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
      - the value is sticky at least for filesystem mount time
      - adjusting the value has immediate effect
      - sysfs is available in constrained environments (eg. system rescue)
      - the limit also applies to device replace
      
      Sysfs:
      
      - raw value is in bytes
      - values written to the file accept suffixes like K, M
      - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
      - 0 means use default priority of IO
      
      The scheduler is a simple deadline one and the accuracy is up to nearest
      128K.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eb3b5053
  10. 21 4月, 2021 1 次提交
  11. 19 4月, 2021 1 次提交
  12. 11 3月, 2021 1 次提交
  13. 23 2月, 2021 1 次提交
    • F
      btrfs: fix race between writes to swap files and scrub · 195a49ea
      Filipe Manana 提交于
      When we active a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents to be changed by operations such as balance and device
      replace/resize/remove. We also call there can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents to be
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group into RO before it processes it
      and then back again to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group, will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents that are used
      by active swap files can not be turned into RO mode, therefore making it
      not possible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a warning
      message that mentions the block group was skipped due to active swap
      files - this is the same approach we currently use for balance.
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      195a49ea
  14. 09 2月, 2021 4 次提交
    • N
      btrfs: zoned: relocate block group to repair IO failure in zoned filesystems · f7ef5287
      Naohiro Aota 提交于
      When a bad checksum is found and if the filesystem has a mirror of the
      damaged data, we read the correct data from the mirror and writes it to
      damaged blocks. This however, violates the sequential write constraints
      of a zoned block device.
      
      We can consider three methods to repair an IO failure in zoned filesystems:
      
      (1) Reset and rewrite the damaged zone
      (2) Allocate new device extent and replace the damaged device extent to
          the new extent
      (3) Relocate the corresponding block group
      
      Method (1) is most similar to a behavior done with regular devices.
      However, it also wipes non-damaged data in the same device extent, and
      so it unnecessary degrades non-damaged data.
      
      Method (2) is much like device replacing but done in the same device. It
      is safe because it keeps the device extent until the replacing finish.
      However, extending device replacing is non-trivial. It assumes
      "src_dev->physical == dst_dev->physical". Also, the extent mapping
      replacing function should be extended to support replacing device extent
      position in one device.
      
      Method (3) invokes relocation of the damaged block group and is
      straightforward to implement. It relocates all the mirrored device
      extents, so it potentially is a more costly operation than method (1) or
      (2). But it relocates only used extents which reduce the total IO size.
      
      Let's apply method (3) for now. In the future, we can extend device-replace
      and apply method (2).
      
      For protecting a block group gets relocated multiple time with multiple
      IO errors, this commit introduces "relocating_repair" bit to show it's
      now relocating to repair IO failures. Also it uses a new kthread
      "btrfs-relocating-repair", not to block IO path with relocating process.
      
      This commit also supports repairing in the scrub process.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f7ef5287
    • N
      btrfs: zoned: support dev-replace in zoned filesystems · 7db1c5d1
      Naohiro Aota 提交于
      This is 4/4 patch to implement device-replace on zoned filesystems.
      
      Even after the copying is done, the write pointers of the source device
      and the destination device may not be synchronized. For example, when
      the last allocated extent is freed before device-replace process, the
      extent is not copied, leaving a hole there.
      
      Synchronize the write pointers by writing zeroes to the destination
      device.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7db1c5d1
    • N
      btrfs: zoned: implement copying for zoned device-replace · de17addc
      Naohiro Aota 提交于
      This is 3/4 patch to implement device-replace on zoned filesystems.
      
      This commit implements copying. To do this, it tracks the write pointer
      during the device replace process. As device-replace's copy process is
      smart enough to only copy used extents on the source device, we have to
      fill the gap to honor the sequential write requirement in the target
      device.
      
      The device-replace process on zoned filesystems must copy or clone all
      the extents in the source device exactly once. So, we need to ensure
      allocations started just before the dev-replace process to have their
      corresponding extent information in the B-trees.
      finish_extent_writes_for_zoned() implements that functionality, which
      basically is the removed code in the commit 042528f8 ("Btrfs: fix
      block group remaining RO forever after error during device replace").
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de17addc
    • N
      btrfs: zoned: mark block groups to copy for device-replace · 78ce9fc2
      Naohiro Aota 提交于
      This is the 1/4 patch to support device-replace on zoned filesystems.
      
      We have two types of IOs during the device replace process. One is an IO
      to "copy" (by the scrub functions) all the device extents from the source
      device to the destination device. The other one is an IO to "clone" (by
      handle_ops_on_dev_replace()) new incoming write IOs from users to the
      source device into the target device.
      
      Cloning incoming IOs can break the sequential write rule in on target
      device. When a write is mapped in the middle of a block group, the IO is
      directed to the middle of a target device zone, which breaks the
      sequential write requirement.
      
      However, the cloning function cannot be disabled since incoming IOs
      targeting already copied device extents must be cloned so that the IO is
      executed on the target device.
      
      We cannot use dev_replace->cursor_{left,right} to determine whether a bio
      is going to a not yet copied region. Since we have a time gap between
      finishing btrfs_scrub_dev() and rewriting the mapping tree in
      btrfs_dev_replace_finishing(), we can have a newly allocated device extent
      which is never cloned nor copied.
      
      So the point is to copy only already existing device extents. This patch
      introduces mark_block_group_to_copy() to mark existing block groups as a
      target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
      check the flag to do their job.
      
      Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
      is the last stripe in the block group. With the last stripe copied,
      the to_copy flag is finally disabled. Afterwards we can safely clone
      incoming IOs on this block group.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      78ce9fc2
  15. 25 1月, 2021 1 次提交
  16. 10 12月, 2020 3 次提交
    • Q
      btrfs: scrub: allow scrub to work with subpage sectorsize · b42fe98c
      Qu Wenruo 提交于
      Since btrfs scrub is utilizing its own infrastructure to submit
      read/write, scrub is independent from all other routines.
      
      This brings one very neat feature, allow us to read 4K data into offset
      0 of a 64K page.  So is the writeback routine.
      
      This makes scrub on subpage sector size much easier to implement, and
      thanks to previous commits which just changed the implementation to
      always do scrub based on sector size, now scrub can handle subpage
      filesystem without any problem.
      
      This patch will just remove the restriction on
      (sectorsize != PAGE_SIZE), to make scrub finally work on subpage
      filesystems.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b42fe98c
    • Q
      btrfs: scrub: support subpage data scrub · b29dca44
      Qu Wenruo 提交于
      Btrfs scrub is more flexible than buffered data write path, as we can
      read an unaligned subpage data into page offset 0.
      
      This ability makes subpage support much easier, we just need to check
      each scrub_page::page_len and ensure we only calculate hash for [0,
      page_len) of a page.
      
      There is a small thing to notice: for subpage case, we still do sector
      by sector scrub.  This means we will submit a read bio for each sector
      to scrub, resulting in the same amount of read bios, just like on the 4K
      page systems.
      
      This behavior can be considered as a good thing, if we want everything
      to be the same as 4K page systems.  But this also means, we're wasting
      the possibility to submit larger bio using 64K page size.  This is
      another problem to consider in the future.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b29dca44
    • Q
      btrfs: scrub: support subpage tree block scrub · 53f3251d
      Qu Wenruo 提交于
      To support subpage tree block scrub, scrub_checksum_tree_block() only
      needs to learn 2 new tricks:
      
      - Follow sector size
        Now scrub_page only represents one sector, we need to follow it
        properly.
      
      - Run checksum on all sectors
        Since scrub_page only represents one sector, we need to run checksum
        on all sectors, not only (nodesize >> PAGE_SIZE).
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      53f3251d