1. 18 April 2023, 3 commits
    • btrfs: replace btrfs_io_context::raid_map with a fixed u64 value · 18d758a2
      Qu Wenruo committed
      In the btrfs_io_context structure, we have a pointer, raid_map, which
      indicates the logical bytenr of each stripe.
      
      But since we always call sort_parity_stripes(), the resulting
      raid_map[] is always sorted, so raid_map[0] is always the logical
      bytenr of the full stripe.
      
      So why waste the space, and the time spent sorting, on raid_map?
      
      This patch will replace btrfs_io_context::raid_map with a single u64
      number, full_stripe_start, by:
      
      - Replace btrfs_io_context::raid_map with full_stripe_start
      
      - Convert call sites using raid_map[0] to use full_stripe_start
      
      - Convert call sites using raid_map[i] to compare against
        nr_data_stripes
      
      The benefits are:
      
      - Less memory wasted on raid_map
        It's sizeof(u64) * num_stripes vs sizeof(u64).
        It'll always save at least one u64, and the benefit grows larger with
        num_stripes.
      
      - No more weird alloc_btrfs_io_context() behavior
        As there is now only one fixed-size part plus one variable-length
        array (see the sketch below).
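      A minimal sketch of the resulting structure (surrounding members are
      abbreviated, so treat the exact layout as illustrative):
      
        struct btrfs_io_context {
                /* ... refcount, fs_info, map_type, etc. (abbreviated) ... */
      
                /*
                 * Replaces the old "u64 *raid_map" array: raid_map[] was
                 * always sorted, so raid_map[0] was the only value ever
                 * needed, and the comparisons on raid_map[i] can be
                 * derived from it.
                 */
                u64 full_stripe_start;
      
                /* The single remaining variable-length array. */
                struct btrfs_io_stripe stripes[];
        };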
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use an efficient way to represent source of duplicated stripes · 1faf3885
      Qu Wenruo committed
      For btrfs dev-replace, we have to duplicate writes to the source
      device onto the target device.
      
      For non-RAID56, all writes into the same mapped range share the same
      content, so no extra mapping is needed.
      (E.g. in btrfs_submit_bio() for a non-RAID56 range we just submit
      the same write to all involved devices.)
      
      But for RAID56, the stripes contain different content, thus we must
      have a clear mapping of which stripe is duplicated from which
      original stripe.
      
      Currently we use a complex scheme based on the tgtdev_map[] array, e.g.:
      
       num_tgtdevs = 1
       tgtdev_map[0] = 0    <- Means stripes[0] is not involved in replace.
       tgtdev_map[1] = 3    <- Means stripes[1] is involved in replace,
                               and it's duplicated to stripes[3].
       tgtdev_map[2] = 0    <- Means stripes[2] is not involved in replace.
      
      But this wastes some space, and it ignores one important property of
      dev-replace: there is at most one running replace at a time.
      
      Thus we can change it to a pair of fixed members to represent the
      mapping:
      
       replace_nr_stripes = 1
       replace_stripe_src = 1    <- Means stripes[1] is involved in replace,
                                    thus the extra stripe is a copy of
                                    stripes[1].
      
      This saves some space in the bioc for RAID56 chunks with many
      devices, and gets rid of one variable-sized array in the bioc (see
      the sketch below).
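      A sketch of the two representations; the helpers below are
      hypothetical, written only to contrast the lookups, while the member
      names are from this commit:
      
        /* Old: one map entry per stripe (0 = not replaced, otherwise the
         * index of the duplicate stripe). */
        static int old_replace_target(const struct btrfs_io_context *bioc,
                                      int stripe_nr)
        {
                if (bioc->num_tgtdevs && bioc->tgtdev_map[stripe_nr])
                        return bioc->tgtdev_map[stripe_nr];
                return -1;      /* stripe not involved in replace */
        }
      
        /* New: at most one replace source; the duplicated stripe(s) are
         * simply appended after the regular ones. */
        static bool new_is_replace_source(const struct btrfs_io_context *bioc,
                                          int stripe_nr)
        {
                return bioc->replace_nr_stripes > 0 &&
                       bioc->replace_stripe_src == stripe_nr;
        }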
      
      Thus the patch involves the following changes:
      
      - Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
        and @replace_stripe_src.
      
        @num_tgtdevs is just renamed to @replace_nr_stripes, while the
        mapping is completely changed.
      
      - Add extra ASSERT()s for RAID56 code
      
      - Only add at most two extra stripes for dev-replace cases,
        as there is an upper limit on how many dev-replace stripes we can
        have.
      
      - Unify the behavior of handle_ops_on_dev_replace()
        Previously handle_ops_on_dev_replace() went down two different
        paths for WRITE and GET_READ_MIRRORS.
        Now unify them by always taking the WRITE path first (with at most
        2 replace stripes), then, if we're doing GET_READ_MIRRORS and have
        2 extra stripes, just dropping one stripe.
      
      - Remove the @real_stripes argument from alloc_btrfs_io_context()
        As we don't need the old variable-length array any more.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: no need for irqsafe locking · 74cc3600
      Christoph Hellwig committed
      These days all the operations that take locks in the raid56.c code
      run from user context (mostly workqueues).  Drop all the irqsafe
      locking that is no longer required, as sketched below.
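      An illustrative before/after of the conversion (rbio->bio_list_lock
      is one of the locks in question; the surrounding code is
      abbreviated):
      
        /* Before: irqsafe variants, needed only if the lock could also be
         * taken from interrupt context. */
        unsigned long flags;
      
        spin_lock_irqsave(&rbio->bio_list_lock, flags);
        bio_list_add(&rbio->bio_list, bio);
        spin_unlock_irqrestore(&rbio->bio_list_lock, flags);
      
        /* After: every lock holder now runs in process context
         * (workqueues), so the plain variants are sufficient. */
        spin_lock(&rbio->bio_list_lock);
        bio_list_add(&rbio->bio_list, bio);
        spin_unlock(&rbio->bio_list_lock);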
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 16 February 2023, 12 commits
  3. 27 January 2023, 1 commit
    • btrfs: raid56: make error_bitmap update atomic · a9ad4d87
      Qu Wenruo committed
      After the rework of the raid56 code, there is very limited
      concurrency in the endio context.
      
      Most of the work is done on the sector arrays, and different bios
      never touch the same sector.
      
      But there is concurrency on error_bitmap: both the read and write
      endio functions need to touch it, and we can have multiple write
      bios touching the same error bitmap if they all hit errors.
      
      Fix the unprotected bitmap operation by using atomic set_bit() calls
      in a loop, as sketched below.
      
      Since we have a very small ceiling on the number of sectors (at most
      16), such a set_bit() loop is perfectly acceptable.
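      The shape of the fix, as a sketch (the variable names here are
      assumptions for illustration):
      
        int i;
      
        /* Racy: bitmap_set() does a plain read-modify-write on whole
         * words, so two endio handlers can lose each other's bits. */
        bitmap_set(rbio->error_bitmap, total_sector_nr, nsectors);
      
        /* Safe: set_bit() is an atomic read-modify-write per bit.  With
         * at most 16 sectors the loop cost is negligible. */
        for (i = total_sector_nr; i < total_sector_nr + nsectors; i++)
                set_bit(i, rbio->error_bitmap);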
      
      Fixes: 2942a50d ("btrfs: raid56: introduce btrfs_raid_bio::error_bitmap")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 26 January 2023, 1 commit
  5. 21 December 2022, 1 commit
    • btrfs: scrub: fix uninitialized return value in recover_scrub_rbio · e7fc357e
      Josef Bacik committed
      Commit 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery
      path to use error_bitmap") introduced an uninitialized return variable.
      
      This can be caught by gcc 12.1 with -Wmaybe-uninitialized:
      
        CC [M]  fs/btrfs/raid56.o
      fs/btrfs/raid56.c: In function ‘scrub_rbio’:
      fs/btrfs/raid56.c:2801:15: warning: ‘ret’ may be used uninitialized [-Wmaybe-uninitialized]
       2801 |         ret = recover_scrub_rbio(rbio);
            |               ^~~~~~~~~~~~~~~~~~~~~~~~
      fs/btrfs/raid56.c:2649:13: note: ‘ret’ was declared here
       2649 |         int ret;
      
      The warning is disabled by default, so we hadn't caught it.
      
      Due to the bug, the raid56 scrub fstests had been failing since the
      patch was merged, so initialize the variable (see below).
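      In essence, the fix is the following (context abbreviated; treat
      this as a sketch of the actual one-line change):
      
        static int recover_scrub_rbio(struct btrfs_raid_bio *rbio)
        {
                int ret = 0;    /* was "int ret;"; not every path assigns it */
      
                /* ... recovery work; only some branches set ret ... */
                return ret;
        }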
      
      Fixes: 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap")
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 06 December 2022, 21 commits
    • btrfs: raid56: do data csum verification during RMW cycle · 7a315072
      Qu Wenruo committed
      [BUG]
      For the following small script, btrfs will be unable to recover the
      content of file1:
      
        mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
      
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
        md5sum $mnt/file1
        umount $mnt
      
        # Corrupt the above 64K data stripe.
        xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
        mount $dev1 $mnt
      
        # Write a new 64K, which should be in the other data stripe
        # And this is a sub-stripe write, which will cause RMW
        xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
        md5sum $mnt/file1
        umount $mnt
      
      The md5sum above would fail.
      
      [CAUSE]
      There is a long-standing problem for raid56 (not limited to btrfs
      raid56): if we already have some corrupted on-disk data and then
      trigger a sub-stripe write (which needs an RMW cycle), it can cause
      further damage to the P/Q stripe.
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000|
        Disk 3: parity |0xffffffffffff|
      
      In the above case, data 1 is already corrupted; the original data
      should be 64KiB of 0xff.
      
      At this stage, if we read data 1 and it has a data checksum, we can
      still recover it via the regular RAID56 recovery path.
      
      But if we now decide to write some data into data 2, then we need an
      RMW cycle.
      Let's say we want to write 64KiB of '0x00' into data 2; then we read
      the on-disk data of data 1 and calculate the new parity, resulting
      in the following layout:
      
        Disk 1: data 1 |0x000000000000| <- Corrupted
        Disk 2: data 2 |0x000000000000| <- New '0x00' writes
        Disk 3: parity |0x000000000000| <- New parity
      
      But the new parity was calculated using the *corrupted* data 1, so
      we can no longer recover the correct content of data 1.  The
      corruption is now permanent, as the byte-level example below shows.
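      A byte-level sketch in plain C, mirroring the numbers above; RAID5
      parity is simply the XOR of the data stripes, and one byte stands in
      for a whole 64KiB stripe:
      
        static void rmw_destroys_redundancy(void)
        {
                u8 d1 = 0xff, d2 = 0x00;
                u8 p = d1 ^ d2;         /* healthy parity: 0xff */
      
                d1 = 0x00;              /* silent corruption of data 1 */
                /* recovery still works: p ^ d2 == 0xff, the original d1 */
      
                d2 = 0x00;              /* sub-stripe write to data 2 -> RMW */
                p = d1 ^ d2;            /* new parity from corrupt d1: 0x00 */
                /* now p ^ d2 == 0x00: the original 0xff is gone for good */
        }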
      
      [FIX]
      To solve the above problem, this patch does a full stripe data
      checksum verification at RMW time.
      
      This involves the following changes:
      
      - Always read the full stripe (including data/P/Q) when doing RMW
        Previously we only read the missing data sectors, but since we may
        do data csum verification and recovery, we need to read everything
        out.
      
        Note that if we have a cached rbio, we don't need to read anything
        and can treat it the same as a full stripe write, as only stripes
        whose csums all match can be cached.
      
      - Verify the data csums during read
        The targets are only the rbio stripe sectors, and only if the rbio
        already has csum_buf/csum_bitmap filled.
      
        Sectors which fail csum verification will have their bit set in
        error_bitmap.
      
      - Always call recover_sectors() after we read out all the sectors
        Since error_bitmap is updated during read, recover_sectors() can
        easily find all the bad sectors and try to recover them (if still
        within tolerance).
      
        And since recover_sectors() has already been migrated to use
        error_bitmap, it can skip vertical stripes which don't have any
        error.
      
      - Verify the repaired sectors against their csums in
        recover_vertical()
      
      - Rename rmw_read_and_wait() to rmw_read_wait_recover()
        Since we will always recover the sectors, the old name is no
        longer accurate.
      
        Furthermore, since recovery is already done in
        rmw_read_wait_recover(), we no longer need to call
        recover_sectors() inside rmw_rbio() (see the condensed flow below).
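      A condensed sketch of the resulting flow; the helper names are from
      this series except write_full_stripe(), which is hypothetical
      shorthand for the rest of the RMW path, and the exact return types
      are assumptions:
      
        static int rmw_rbio_flow(struct btrfs_raid_bio *rbio)
        {
                int ret;
      
                /*
                 * Read the full stripe; the read endio verifies data
                 * csums against csum_buf/csum_bitmap, marks mismatches in
                 * error_bitmap, and recovery runs before this returns.
                 */
                ret = rmw_read_wait_recover(rbio);
                if (ret < 0)
                        return ret;
      
                /* All data sectors are trustworthy now: generate the new
                 * P/Q and submit the writes. */
                return write_full_stripe(rbio);
        }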
      
      Obviously this will have a performance impact, as we are doing more
      work during the RMW cycle:
      
      - Fetch the data checksums
      - Do checksum verification for all data stripes
      - Do checksum verification again after repair
      
      But for a full stripe write or a cached rbio we won't have the
      overhead at all, so for a fully optimized RAID56 workload (always
      full stripe writes), there should be no extra overhead.
      
      To me, the extra overhead looks reasonable, as data consistency is
      way more important than performance.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: prepare data checksums for later RMW verification · c5a41562
      Qu Wenruo committed
      This is for the later data checksum verification at RMW time.
      
      This patch tries to allocate the needed memory for a locked rbio if
      the rbio is exclusively for data (we don't want to handle mixed
      block groups yet).  The memory will be released when the rbio is
      finished.  A sketch of the preparation follows.
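      csum_buf and csum_bitmap are the members added by this series; the
      sizing logic shown here is an assumption for illustration:
      
        static int alloc_rbio_data_csums(struct btrfs_fs_info *fs_info,
                                         struct btrfs_raid_bio *rbio)
        {
                const u32 nr_data_sectors = rbio->nr_data *
                                            rbio->stripe_nsectors;
      
                /* One csum slot and one "csum present" bit per data sector. */
                rbio->csum_buf = kzalloc(nr_data_sectors * fs_info->csum_size,
                                         GFP_NOFS);
                rbio->csum_bitmap = bitmap_zalloc(nr_data_sectors, GFP_NOFS);
                if (!rbio->csum_buf || !rbio->csum_bitmap)
                        return -ENOMEM; /* both freed when the rbio finishes */
                return 0;
        }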
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: remove the old error tracking system · ad3daf1c
      Qu Wenruo committed
      Since all the recovery paths have been migrated to the new error
      bitmap based system, we can remove the old stripe-number based
      system.
      
      This cleanup involves one behavior change:
      
      - A rebuild rbio can no longer be merged
        Previously a rebuild rbio (caused by a retry after a data csum
        mismatch) could be merged if the error happened in the same
        stripe.
      
        But with the new error bitmap based solution, it's much harder to
        compare error bitmaps.
      
        So here we just don't merge rebuild rbios at all.
        This may have some performance impact in extreme corner cases,
        but we're willing to take it.
      
      Other than that, this patch cleans up the following members:
      
      - rbio::faila
      - rbio::failb
        They will be replaced by a per-vertical-stripe check, which is
        more accurate.
      
      - rbio::error
        It will be replaced by a per-vertical-stripe error bitmap check.
      
      - Allow get_rbio_vertical_errors() to accept NULL pointers for
        @faila and @failb
        Some call sites only want to check whether we have errors beyond
        the tolerance (see the sketch below).
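      A sketch of the NULL-tolerant helper; the real function lives in
      fs/btrfs/raid56.c, so treat the details here as illustrative:
      
        static int get_rbio_vertical_errors(struct btrfs_raid_bio *rbio,
                                            int sector_nr,
                                            int *faila, int *failb)
        {
                int found_errors = 0;
                int stripe_nr;
      
                if (faila) {
                        *faila = -1;
                        *failb = -1;
                }
                for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) {
                        int bit = stripe_nr * rbio->stripe_nsectors + sector_nr;
      
                        if (!test_bit(bit, rbio->error_bitmap))
                                continue;
                        found_errors++;
                        if (faila) {    /* NULL means "just count for me" */
                                if (*faila < 0)
                                        *faila = stripe_nr;
                                else if (*failb < 0)
                                        *failb = stripe_nr;
                        }
                }
                return found_errors;
        }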
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap · 75b47033
      Qu Wenruo committed
      Since we have rbio::error_bitmap to indicate exactly where the
      errors are (including read errors and csum mismatch errors), we can
      make the recovery path more accurate.
      
      For example:
      
                   0        32K       64K
           Data 1  |XXXXXXXX|         |
           Data 2  |        |XXXXXXXXX|
           Parity  |        |         |
      
      1) Get a csum mismatch when reading data 1 [0, 32K)
      
      2) Mark the corresponding range as error
         The old code marks the whole data 1 stripe as error,
         while the new code only marks data 1 [0, 32K) as error.
      
      3) Recovery path
         The old code recovers all of data 1 [0, 64K), using data 2 and
         parity.
      
         This means data 1 [32K, 64K) will contain corrupted data, as
         data 2 [32K, 64K) is already corrupted.
      
         The new code only recovers data 1 [0, 32K), as only that range
         has errors so far (see the loop sketched below).
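      The recovery loop then becomes bitmap-driven, roughly as follows;
      this is a sketch, and vertical_stripe_has_error() is hypothetical
      shorthand for the per-vertical-stripe bitmap test:
      
        for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
                if (!vertical_stripe_has_error(rbio, sector_nr))
                        continue;   /* e.g. data 1 [32K, 64K) stays untouched */
                ret = recover_vertical(rbio, sector_nr, pointers, unmap_array);
                if (ret < 0)
                        break;
        }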
      
      This new behavior avoids populating the rbio cache with incorrect
      data.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: introduce btrfs_raid_bio::error_bitmap · 2942a50d
      Qu Wenruo committed
      Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to
      indicate which stripe(s) had IO errors.
      
      But that has some problems:
      
      - If one sector fails its csum check, the whole stripe containing
        the corruption is marked as errored.
        This can reduce the chance of recovery, like this:
      
                0  4K 8K
        Data 1  |XX|  |
        Data 2  |  |XX|
        Parity  |  |  |
      
        In the above case, 0~4K in data 1 should be recovered using data 2
        and parity, while 4K~8K in data 2 should be recovered using data 1
        and parity.
      
        Currently, if we trigger a read on 0~4K of data 1, we will also
        recover 4K~8K of data 1 using the corrupted data 2 and parity,
        causing a wrong result in the rbio cache.
      
      - Harder to expand for a future M-N scheme
        As we're limited to just faila/b, i.e. two corruptions.
      
      - Harder to expand to handle extra csum errors
        This can be problematic once we start to do csum verification.
      
      This patch introduces an extra @error_bitmap, where one bit
      represents an error in the corresponding sector (see the indexing
      sketch below).
      
      The choice of introducing a new error bitmap, rather than reusing
      sector_ptr, is to avoid extra searches between
      rbio::stripe_sectors[] and rbio::bio_sectors[].
      
      Since we can submit bios using sectors from both arrays, doing a
      proper search in both would be more complex.
      
      Although the new bitmap takes extra memory, later we can remove
      things like @error and faila/b to save some of it back.
      
      Currently the new error bitmap and the faila/b mechanism coexist;
      the error bitmap is only updated at endio time and at recovery
      entry.
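      The bit layout is linear across all stripes, roughly as below; this
      is a sketch, with names assumed for illustration:
      
        /* bit index = stripe number * sectors per stripe + sector number */
        static int error_bit(const struct btrfs_raid_bio *rbio,
                             int stripe_nr, int sector_nr)
        {
                return stripe_nr * rbio->stripe_nsectors + sector_nr;
        }
      
        /* used like, e.g. in a read endio marking a csum mismatch:
         * set_bit(error_bit(rbio, stripe_nr, sector_nr), rbio->error_bitmap);
         */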
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: switch scrub path to use a single function · 6bfd0133
      Qu Wenruo committed
      This switch involves the following changes:
      
      - Make finish_parity_scrub() only submit the write bios
        It no longer calls rbio_orig_end_io(), and now it returns an
        error code.
      
      - Add a new helper, recover_scrub_rbio(), to handle recovery
        It just does extra scrub-related checks, then calls
        recover_sectors().
      
      - Rename raid56_parity_scrub_stripe() to scrub_rbio()
      - Rename scrub_parity_work() to scrub_rbio_work_locked()
        To follow the existing naming scheme.
      
      - Delete unused functions
        Including:
        * finish_rmw()
        * raid_write_end_io()
        * raid56_bio_end_io()
        * __raid_recover_end_io()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: extract scrub read bio list assembly code into a helper · cb3450b7
      Qu Wenruo committed
      Just like what we did for write/recovery, also extract the read bio
      assembly code into a helper for scrub.
      
      The differences between the three are:
      
      - rmw_assemble_read_bios() only submits reads for missing sectors
        Thus it skips cached sectors, but also reads sectors which are
        not covered by any full stripe (for cache usage).
      
      - recover_assemble_read_bios() reads every sector which has not
        failed
      
      - scrub_assemble_read_bios() has an extra check for vertical stripes
        It's mostly the same as rmw_assemble_read_bios(), but skips
        sectors which are not covered by a vertical stripe.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: switch write path to rmw_rbio() · 93723095
      Qu Wenruo committed
      This includes the following changes:
      
      - Implement new raid_unplug() functions
        Now we don't need a workqueue to run the plug, as all the plug
        does is queue an rmw_rbio_work() call, which can be executed
        without sleeping.
      
      - Implement an rmw_rbio_work_locked() helper
        This is for unlock_stripe(), which is already holding the full
        stripe lock.
      
      - Remove all the old functions
        This already shows how complex the old code was, as we ended up
        removing the following functions:
      
        * rmw_work()
        * validate_rbio_for_rmw()
        * raid56_rmw_end_io_work()
        * raid56_rmw_stripe()
        * full_stripe_write()
        * partial_stripe_write()
        * __raid56_parity_write()
        * run_plug()
        * unplug_work()
        * btrfs_raid_unplug()
        * __raid56_parity_recover()
        * raid_recover_end_io_work()
      
      - Unexport rmw_rbio()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: introduce the main entrance for RMW path · 5eb30ee2
      Qu Wenruo committed
      The new entrance will be called rmw_rbio().  It has a streamlined
      workflow using the submit-and-wait method (sketched below).
      
      Thus there are no more weird jumps between tons of functions, which
      makes it way more reader-friendly and later expansion easier: as
      it's now a straight workflow, the timing is much clearer.
      
      Unfortunately we cannot yet migrate the RMW path to use this new
      entrance, as we still need extra work to address the plug and the
      unlock_stripe() function.
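      The submit-and-wait shape, roughly; this is a sketch, and the field
      names follow the rework's style but should be treated as
      assumptions:
      
        /* Submit every read bio, then sleep until the last endio drops
         * the pending counter, instead of resuming the state machine
         * from inside the endio callback. */
        atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
        while ((bio = bio_list_pop(&bio_list)))
                submit_bio(bio);
        wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
        /* execution continues here, in process context, with all IO done */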
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: extract rmw write bios assembly into a helper · 6486d21c
      Qu Wenruo committed
      The helper will later be used to refactor the RMW write path.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: extract the rmw bio list build code into a helper · 509c27aa
      Qu Wenruo committed
      The helper will later be used to refactor the whole RMW path.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: switch recovery path to a single function · d817ce35
      Qu Wenruo committed
      Currently btrfs uses end_io functions to jump between the different
      stages of recovery.
      
      For example, we go through the following functions:
      
      - raid56_bio_end_io()
        This handles the read for all the sectors (except the missing device).
      
      - __raid_recover_end_io()
        This does the real work; it's called inside the delayed work
        function raid_recover_end_io_work().
      
      This single recovery path involves at least 3 different functions,
      which is a big burden for readers.
      
      This patch changes the behavior by:
      
      - Introducing a unified recovery entrance, recover_rbio()
      
      - Using the submit-and-wait method
        So the workflow is not interrupted by endio function jumps.
        This doesn't bring any performance change, but reduces the burden
        on reviewers.
      
      - Running the main function in the rmw_workers workqueue
        Now raid56_parity_recover() only needs to set up the work and
        queue it using start_async_work(), roughly as sketched below.
      
      Now readers only need to follow one function jump
      (start_async_work()) to find the main entrance of the recovery path.
      
      Furthermore, the recover_rbio() function can easily be reused by
      other paths.
      
      The old recovery path is still utilized by the degraded write path.
      It will be cleaned up when we have migrated the write path.
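      A condensed sketch of the queueing, not the exact signatures:
      
        static void recover_rbio_work(struct work_struct *work)
        {
                struct btrfs_raid_bio *rbio =
                        container_of(work, struct btrfs_raid_bio, work);
      
                /* The whole recovery now runs here, start to finish. */
                rbio_orig_end_io(rbio, errno_to_blk_status(recover_rbio(rbio)));
        }
      
        static void start_recovery(struct btrfs_raid_bio *rbio)
        {
                /* this is essentially what start_async_work() does */
                INIT_WORK(&rbio->work, recover_rbio_work);
                queue_work(rbio->bioc->fs_info->rmw_workers, &rbio->work);
        }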
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: extract sector recovery code into a helper · ec936b03
      Qu Wenruo committed
      This includes extra changes:
      
      - The allocation of unmap_array[] and pointers[]
        Now we allocate them in one go and free them together (see the
        sketch below).
      
      - Remove @err
        Use errno_to_blk_status(ret) instead.
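      One possible shape of the one-go allocation; this is a sketch, and
      whether the arrays actually share a single block is an assumption:
      
        /* Both arrays come from a single allocation and go away with a
         * single kfree(). */
        pointers = kcalloc(rbio->real_stripes * 2, sizeof(void *), GFP_NOFS);
        if (!pointers)
                return -ENOMEM;
        unmap_array = pointers + rbio->real_stripes;
      
        /* ... per-sector recovery using pointers[]/unmap_array[] ... */
      
        kfree(pointers);        /* frees both arrays */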
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: extract the recovery bio list build code into a helper · d31968d9
      Qu Wenruo committed
      This new helper will also be utilized in the upcoming refactor of
      the recovery path.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: extract the pq generation code into a helper · 30e3c897
      Qu Wenruo committed
      Currently finish_rmw() updates the P/Q stripes before submitting
      the writes.
      
      It's done inside a for(;;) loop and is a little congested
      indent-wise, so extract the code into a helper called
      generate_pq_vertical(), roughly shaped like the sketch below.
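      A sketch of the helper's shape; kmap_sector() is hypothetical
      shorthand for mapping one sector, while run_xor() and raid6_call are
      the existing XOR and syndrome helpers:
      
        static void generate_pq_vertical(struct btrfs_raid_bio *rbio,
                                         int sectornr)
        {
                const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
                void *pointers[16];     /* assumed upper bound for the sketch */
                int stripe;
      
                /* One sector from every stripe: data first, then P (and Q). */
                for (stripe = 0; stripe < rbio->real_stripes; stripe++)
                        pointers[stripe] = kmap_sector(rbio, stripe, sectornr);
      
                if (rbio->real_stripes - rbio->nr_data == 2) {
                        /* RAID6: lib/raid6 computes P and Q in one pass. */
                        raid6_call.gen_syndrome(rbio->real_stripes, sectorsize,
                                                pointers);
                } else {
                        /* RAID5: P is the XOR of all the data sectors. */
                        memcpy(pointers[rbio->nr_data], pointers[0], sectorsize);
                        run_xor(pointers + 1, rbio->nr_data - 1, sectorsize);
                }
        }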
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: extract the vertical stripe recovery code into recover_vertical() · 9c5ff9b4
      Qu Wenruo committed
      First, this refactor includes the following behavior change:
      
      - Don't error out if only P/Q is corrupted
      
        The old code directly errors out if only P/Q is corrupted.
        Although it is a logical error to go into the rebuild path with
        only P/Q corrupted, there is no need to error out.
      
        Just skip the rebuild and return the already good data.
      
      Then come the following refactors, which shouldn't cause behavior
      changes:
      
      - Introduce a helper to do vertical stripe recovery
      
        This not only reduces one indent level, but also paves the way
        for the later data checksum verification in RMW cycles.
      
      - Sort rbio->faila/b before recovery
      
        So we don't need to do the same swap for every vertical stripe
      
      - Replace a BUG_ON() with ASSERT()
      
        Or checkpatch won't let me pass.
      
      - Mark recovered sectors uptodate after the recovery loop
      
      - Do the cleanup for pointers unconditionally
      
        We only need to initialize @pointers and @unmap_array to NULL,
        then we can safely free them unconditionally.
      
      - Mark the repaired sector uptodate in recover_vertical()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update function comments · 43dd529a
      David Sterba committed
      Update, reformat or reword function comments. This also removes the kdoc
      marker so we don't get reports when the function name is missing.
      
      Changes made:
      
      - remove kdoc markers
      - reformat the brief description to be a proper sentence
      - reword to imperative voice
      - align parameter list
      - fix typos
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the printk helpers out of ctree.h · 9b569ea0
      Josef Bacik committed
      We have a bunch of printk helpers that live in ctree.h.  These have
      nothing to do with ctree.c, so move them into their own header.
      Subsequent patches will clean up the printk helpers.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate · 88074c8b
      Qu Wenruo committed
      For btrfs RAID56, we have a caching system for btrfs raid bios
      (rbio).
      
      We call cache_rbio_pages() to mark a qualified rbio ready for
      caching.
      
      This happens at:
      
      - finish_rmw()
      
        At this point, we have already read all the necessary sectors,
        and together with the rbio sectors they cover all data stripes.
      
      - __raid_recover_end_io()
      
        At this point, we have rebuilt the rbio, thus all data sectors
        involved (either from the stripe or the bio list) are now
        uptodate.
      
      Thus by the time cache_rbio_pages() is called, all data sectors
      should be uptodate.
      
      This patch makes it explicit that all data sectors are uptodate at
      cache_rbio_pages() time, mostly to prepare for the upcoming
      verification at RMW time.
      
      This patch adds:
      
      - Extra ASSERT()s in cache_rbio_pages()
        This is to make sure all data sectors which are not covered by a
        bio are already uptodate (sketched below).
      
      - Extra ASSERT()s in steal_rbio()
        Since only a cached rbio can be stolen, every data sector should
        already be uptodate in the source rbio.
      
      - Update __raid_recover_end_io() to update recovered
        sector->uptodate
        Previously __raid_recover_end_io() would only mark failed sectors
        uptodate if it was doing an RMW.
      
        But this can trigger the new ASSERT()s, as in the recovery case a
        recovered failed sector would not be marked uptodate, and would
        trigger an ASSERT() in a later cache_rbio_pages() call.
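      A sketch of the new invariant check (illustrative):
      
        /* Every data sector must be uptodate before the rbio is cached. */
        for (i = 0; i < rbio->nr_data * rbio->stripe_nsectors; i++)
                ASSERT(rbio->stripe_sectors[i].uptodate);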
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: allocate memory separately for rbio pointers · 797d74b7
      Qu Wenruo committed
      Currently, inside alloc_rbio(), we allocate one larger memory block
      to contain the following members:
      
      - struct btrfs_raid_bio itself
      - stripe_pages array
      - bio_sectors array
      - stripe_sectors array
      - finish_pointers array
      
      Then we update the rbio pointers to point into the extra space after
      the rbio structure itself.
      
      This required a complex CONSUME_ALLOC() macro to do the carving.
      
      That is too hacky, and is going to make later pointer expansion
      harder.
      
      This patch changes it to use a regular kcalloc() for each pointer
      inside btrfs_raid_bio, making later expansion much easier (see the
      before/after sketch below).
      
      It also introduces a helper, free_raid_bio_pointers(), to free all
      the pointer members in btrfs_raid_bio; it is used in both
      free_raid_bio() and the error path of alloc_rbio().
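      Before/after, in sketch form (the counts and error handling are
      abbreviated for illustration):
      
        /* Old: one big kzalloc(), carved up behind the rbio itself. */
        #define CONSUME_ALLOC(ptr, count)       do {                    \
                        ptr = p;                                        \
                        p = (unsigned char *)p + sizeof(*(ptr)) * (count); \
                } while (0)
        /* p = rbio + 1;
         * CONSUME_ALLOC(rbio->stripe_pages, num_pages); ... */
      
        /* New: one plain allocation per pointer, trivially extendable. */
        rbio->stripe_pages = kcalloc(num_pages, sizeof(struct page *),
                                     GFP_NOFS);
        rbio->bio_sectors = kcalloc(num_sectors, sizeof(struct sector_ptr),
                                    GFP_NOFS);
        rbio->stripe_sectors = kcalloc(num_sectors, sizeof(struct sector_ptr),
                                       GFP_NOFS);
        rbio->finish_pointers = kcalloc(real_stripes, sizeof(void *),
                                        GFP_NOFS);
        /* error handling via free_raid_bio_pointers() omitted */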
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: raid56: cleanup for function __free_raid_bio() · ff2b64a2
      Qu Wenruo committed
      The cleanup involves two things:
      
      - Remove the "__" prefix
        There is no naming conflict.
      
      - Remove the forward declaration
        There is no special function call involved.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 24 October 2022, 1 commit
    • btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed · ab4c54c6
      Qu Wenruo committed
      Currently, if full_stripe_write() fails to allocate the pages for
      parity, it calls __free_raid_bio() first, then returns -ENOMEM.
      
      But some callers of full_stripe_write() will also call
      __free_raid_bio() again, which causes a double free.
      
      And it's not logically sound either; normally we should either free
      the memory at the same level where we allocated it, or let the endio
      handler take care of everything.
      
      So this patch solves the double free by making raid56_parity_write()
      handle the error and free the rbio, just like what we do in
      raid56_parity_recover().  A sketch of the ownership change follows.
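      In sketch form; plug_or_run_rmw() is hypothetical shorthand for the
      rest of the write path:
      
        /* Before: the helper freed the rbio AND returned an error, so the
         * caller's error handling freed it a second time. */
        static int full_stripe_write(struct btrfs_raid_bio *rbio)
        {
                int ret = alloc_rbio_parity_pages(rbio);
      
                if (ret) {
                        __free_raid_bio(rbio); /* first free */
                        return ret;            /* caller frees again: bug */
                }
                return plug_or_run_rmw(rbio);
        }
      
        /* After: just report the failure; raid56_parity_write() is the
         * single owner that frees the rbio and ends the bio on error. */
        static int full_stripe_write(struct btrfs_raid_bio *rbio)
        {
                int ret = alloc_rbio_parity_pages(rbio);
      
                if (ret)
                        return ret;
                return plug_or_run_rmw(rbio);
        }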
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>