- 27 January 2023, 1 commit
-
Submitted by Qu Wenruo
In the rework of the raid56 code, there is very limited concurrency in the endio context. Most of the work is done inside the sector arrays, and different bios never touch the same sector.

But there is one shared structure, the error_bitmap. Both the read and the write endio functions need to touch it, and we can have multiple write bios updating the same error bitmap if they all hit errors.

Fix the unprotected bitmap operation by using set_bit() in a loop. Since we have a very small ceiling on the number of sectors (at most 16 per full stripe), such a set_bit() loop is perfectly acceptable.

Fixes: 2942a50d ("btrfs: raid56: introduce btrfs_raid_bio::error_bitmap")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
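A minimal sketch of the fix described above, not the exact kernel code. The point is that bitmap_set() does a non-atomic read-modify-write on the underlying words, so two endio handlers setting bits in the same word can lose updates, while per-bit atomic set_bit() closes the race:

  #include <linux/bitops.h>

  /* With at most 16 sectors per full stripe, the loop is cheap. */
  static void mark_sectors_error(unsigned long *error_bitmap,
                                 int first_sector, int nr_sectors)
  {
          int bit;

          for (bit = first_sector; bit < first_sector + nr_sectors; bit++)
                  set_bit(bit, error_bitmap);     /* atomic per-bit op */
  }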
-
- 26 January 2023, 1 commit
-
Submitted by Tanmay Bhushan
We take two stripe numbers if vertical errors are found. In case it is just the P stripe it does not matter, but in the RAID6 case it matters, as both stripes need to be fixed.

Fixes: 7a315072 ("btrfs: raid56: do data csum verification during RMW cycle")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Tanmay Bhushan <007047221b@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 21 December 2022, 1 commit
-
Submitted by Josef Bacik
Commit 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") introduced an uninitialized return variable. This can be caught by gcc 12.1 with -Wmaybe-uninitialized:

  CC [M]  fs/btrfs/raid56.o
  fs/btrfs/raid56.c: In function 'scrub_rbio':
  fs/btrfs/raid56.c:2801:15: warning: 'ret' may be used uninitialized [-Wmaybe-uninitialized]
   2801 |         ret = recover_scrub_rbio(rbio);
        |               ^~~~~~~~~~~~~~~~~~~~~~~~
  fs/btrfs/raid56.c:2649:13: note: 'ret' was declared here
   2649 |         int ret;

The warning is disabled by default, so we hadn't caught it. Due to the bug, the raid56 scrub fstests have been failing since the patch was merged, so initialize the variable.

Fixes: 75b47033 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
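A minimal reproduction of the warning pattern, not the btrfs code: the variable is only assigned on one branch, so the function can return garbage, and the fix is simply initializing it at declaration:

  int scrub_like(int need_recover)
  {
          int ret;                /* gcc 12: may be used uninitialized */

          if (need_recover)
                  ret = -5;       /* only assigned on this path */

          return ret;             /* fix: declare as "int ret = 0;" */
  }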
-
- 06 December 2022, 21 commits
-
Submitted by Qu Wenruo
[BUG]
With the following small script, btrfs is unable to recover the content of file1:

  mkfs.btrfs -f -m raid1 -d raid5 -b 1G $dev1 $dev2 $dev3
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/file1
  md5sum $mnt/file1
  umount $mnt

  # Corrupt the above 64K data stripe.
  xfs_io -f -c "pwrite -S 0x00 323026944 64K" -c sync $dev3
  mount $dev1 $mnt

  # Write a new 64K, which should be in the other data stripe.
  # This is a sub-stripe write, which will cause RMW.
  xfs_io -f -c "pwrite 0 64k" -c sync $mnt/file2
  md5sum $mnt/file1
  umount $mnt

The above md5sum would fail.

[CAUSE]
There is a long-standing problem for raid56 (not limited to btrfs raid56): if we already have some corrupted on-disk data and then trigger a sub-stripe write (which needs an RMW cycle), it can cause further damage to the P/Q stripes.

  Disk 1: data 1 |0x000000000000|  <- Corrupted
  Disk 2: data 2 |0x000000000000|
  Disk 3: parity |0xffffffffffff|

In the above case, data 1 is already corrupted; the original data should be 64KiB of 0xff. At this stage, if we read data 1 and it has a data checksum, we can still recover it via the regular RAID56 recovery path.

But if we now decide to write some data into data 2, we need to go through RMW. Say we want to write 64KiB of 0x00 into data 2: we read the on-disk data of data 1 and calculate the new parity, resulting in the following layout:

  Disk 1: data 1 |0x000000000000|  <- Corrupted
  Disk 2: data 2 |0x000000000000|  <- New 0x00 writes
  Disk 3: parity |0x000000000000|  <- New parity

Since the new parity was calculated using the *corrupted* data 1, we can no longer recover the correct content of data 1. The corruption is there forever.

[FIX]
To solve the above problem, do a full stripe data checksum verification at RMW time. This involves the following changes:

- Always read the full stripe (including data/P/Q) when doing RMW

  Before, we only read the missing data sectors, but since we may do data csum verification and recovery, we need to read everything out.

  Note that if we have a cached rbio, we don't need to read anything and can treat it the same as a full stripe write, as only a stripe with all its csums matching can be cached.

- Verify the data csums during read

  The targets are only the rbio stripe sectors, and only if the rbio already has csum_buf/csum_bitmap filled. Sectors which cannot pass csum verification will have their bit set in error_bitmap.

- Always call recover_sectors() after we have read out all the sectors

  Since error_bitmap is updated during read, recover_sectors() can easily find all the bad sectors and try to recover them (if still under the tolerance). And since recover_sectors() has already been migrated to use error_bitmap, it can skip vertical stripes which don't have any error.

- Verify the repaired sectors against their csums in recover_vertical()

- Rename rmw_read_and_wait() to rmw_read_wait_recover()

  Since we will always recover the sectors, the old name is no longer accurate. Furthermore, since recovery is already done in rmw_read_wait_recover(), we no longer need to call recover_sectors() inside rmw_rbio().

Obviously this has a performance impact, as we do more work during the RMW cycle:

- Fetch the data checksums
- Do checksum verification for all data stripes
- Do checksum verification again after repair

But for a full stripe write or a cached rbio we won't have the overhead at all; thus for a fully optimized RAID56 workload (always full stripe writes) there should be no extra overhead.

To me, the extra overhead looks reasonable, as data consistency is way more important than performance.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
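A sketch of the verification step described above, under simplified types. All names here (rbio_csum_sketch, sector_data(), calc_csum()) are hypothetical stand-ins, not the btrfs API; the real code lives in fs/btrfs/raid56.c:

  #include <linux/bitops.h>
  #include <linux/string.h>

  #define SKETCH_CSUM_SIZE 4              /* e.g. crc32c */

  struct rbio_csum_sketch {
          unsigned long *csum_bitmap;     /* which data sectors have a csum */
          unsigned long *error_bitmap;    /* output: sectors needing repair */
          u8 *csum_buf;                   /* expected csums, one per sector */
          int nr_data_sectors;
  };

  void *sector_data(struct rbio_csum_sketch *rbio, int nr);   /* hypothetical */
  void calc_csum(const void *data, u8 *csum);                 /* hypothetical */

  static void verify_data_sectors(struct rbio_csum_sketch *rbio)
  {
          int i;

          if (!rbio->csum_buf)            /* nothing to verify against */
                  return;

          for (i = 0; i < rbio->nr_data_sectors; i++) {
                  u8 csum[SKETCH_CSUM_SIZE];

                  if (!test_bit(i, rbio->csum_bitmap))
                          continue;
                  calc_csum(sector_data(rbio, i), csum);
                  if (memcmp(csum, rbio->csum_buf + i * SKETCH_CSUM_SIZE,
                             SKETCH_CSUM_SIZE))
                          set_bit(i, rbio->error_bitmap); /* recover later */
          }
  }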
-
Submitted by Qu Wenruo
This is for later data checksum verification at RMW time. This patch will try to allocate the needed memory for a locked rbio if the rbio is for data exclusively (we don't want to handle mixed bg yet). The memory will be released when the rbio is finished.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Since all the recovery paths have been migrated to the new error-bitmap-based system, we can remove the old stripe-number-based system.

This cleanup involves one behavior change:

- A rebuild rbio can no longer be merged

  Previously a rebuild rbio (caused by a retry after a data csum mismatch) could be merged if the error happened in the same stripe. But with the new error-bitmap-based solution it's much harder to compare error bitmaps, so we simply don't merge rebuild rbios at all. This may introduce some performance impact in extreme corner cases, but we're willing to take it.

Other than that, this patch cleans up the following members:

- rbio::faila
- rbio::failb

  They are replaced by the per-vertical-stripe check, which is more accurate.

- rbio::error

  It is replaced by the per-vertical-stripe error bitmap check.

Also allow get_rbio_vertical_errors() to accept NULL pointers for @faila and @failb, as some call sites only want to check whether we have errors beyond the tolerance.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Since we have rbio::error_bitmap to indicate exactly where the errors are (including read errors and csum mismatch errors), we can make the recovery path more accurate. For example:

            0        32K      64K
  Data 1    |XXXXXXXX|        |
  Data 2    |        |XXXXXXXX|
  Parity    |        |        |

1) Get a csum mismatch when reading data 1 [0, 32K)

2) Mark the corresponding range as error

   The old code marks the whole data 1 stripe as error, while the new code only marks data 1 [0, 32K) as error.

3) Recovery path

   The old code recovers data 1 [0, 64K), all using data 2 and parity. This means data 1 [32K, 64K) will be corrupted data, as data 2 [32K, 64K) is already corrupted. The new code only recovers data 1 [0, 32K), as only that range has an error so far.

This new behavior avoids populating the rbio cache with incorrect data.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to indicate which stripe(s) had IO errors. But that has some problems:

- If one sector fails its csum check, the whole stripe containing the corruption is marked as error. This can reduce our chance to recover, like this:

              0   4K   8K
    Data 1    |XX|    |
    Data 2    |    |XX|
    Parity    |    |    |

  In the above case, 0~4K in data 1 should be recovered using data 2 and parity, while 4K~8K in data 2 should be recovered using data 1 and parity. Currently, if we trigger a read on 0~4K of data 1, we will also "recover" 4K~8K of data 1 using the corrupted data 2 and parity, causing a wrong result in the rbio cache.

- Harder to expand for a future M-N scheme

  We're limited to just faila/b, i.e. two corruptions.

- Harder to expand to handle extra csum errors

  This can be problematic once we start doing csum verification.

This patch introduces an extra @error_bitmap, where one bit represents an error that happened on that sector.

The choice to introduce a new error bitmap, rather than reusing sector_ptr, is to avoid extra searches between rbio::stripe_sectors[] and rbio::bio_sectors[]. Since we can submit bios using sectors from both arrays, doing a proper search on both arrays would be more complex. Although the new bitmap takes extra memory, we can later remove things like @error and faila/b to save some memory.

Currently the new error bitmap and the faila/b mechanism coexist; the error bitmap is only updated at endio time and at the recovery entrance.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
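A sketch of the per-sector bit layout implied above, assuming one bit per sector of the full stripe with the data stripes first and P/Q after them. The index math is illustrative only; the real helpers live in fs/btrfs/raid56.c:

  #include <linux/bitops.h>

  static inline int sector_to_error_bit(int stripe_nr, int sector_nr,
                                        int sectors_per_stripe)
  {
          return stripe_nr * sectors_per_stripe + sector_nr;
  }

  /* Endio-time marking: record a failed read or a csum mismatch. */
  static void note_sector_error(unsigned long *error_bitmap, int stripe_nr,
                                int sector_nr, int sectors_per_stripe)
  {
          set_bit(sector_to_error_bit(stripe_nr, sector_nr,
                                      sectors_per_stripe), error_bitmap);
  }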
-
Submitted by Qu Wenruo
This switch involves the following changes:

- Make finish_parity_scrub() only submit the write bios

  It no longer calls rbio_orig_end_io(); instead it returns an error.

- Add a new helper, recover_scrub_rbio(), to handle recovery

  It just does extra scrub-related checks and then calls recover_sectors().

- Rename raid56_parity_scrub_stripe() to scrub_rbio()

- Rename scrub_parity_work() to scrub_rbio_work_locked(), to follow the existing naming scheme

- Delete unused functions:

  * finish_rmw()
  * raid_write_end_io()
  * raid56_bio_end_io()
  * __raid_recover_end_io()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Just like what we did for the write and recovery paths, extract the read bio assembly code into a helper for scrub.

The difference between the three helpers:

- rmw_assemble_read_bios() only submits reads for missing sectors

  Thus it skips cached sectors, but also reads sectors which are not covered by any full stripe (for cache usage).

- recover_assemble_read_bios() reads every sector which has not failed

- scrub_assemble_read_bios() has an extra check for vertical stripes

  It's mostly the same as rmw_assemble_read_bios(), but skips sectors which are not covered by a vertical stripe.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
This includes the following changes:

- Implement the new raid_unplug() functions

  Now we don't need a workqueue to run the plug, as all the work is just queuing the rmw_rbio_work() call, which can be executed without sleeping.

- Implement an rmw_rbio_work_locked() helper

  This is for unlock_stripe(), which is already holding the full stripe lock.

- Remove all the old functions

  This already shows how complex the old functions were, as we end up removing the following:

  * rmw_work()
  * validate_rbio_for_rmw()
  * raid56_rmw_end_io_work()
  * raid56_rmw_stripe()
  * full_stripe_write()
  * partial_stripe_write()
  * __raid56_parity_write()
  * run_plug()
  * unplug_work()
  * btrfs_raid_unplug()
  * __raid56_parity_recover()
  * raid_recover_end_io_work()

- Unexport rmw_rbio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
The new entrance will be called rmw_rbio(). It has a streamlined workflow using the submit-and-wait method, so there are no more weird jumps between tons of functions. This makes it far more reader friendly and will make later expansion easier, as it is now a straight workflow with much clearer timing.

Unfortunately we cannot yet migrate the RMW path to use this new entrance, as we still need extra work to address the plug and the unlock_stripe() function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
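A sketch of the submit-and-wait pattern described here, with hypothetical names rather than the btrfs code. The main path submits all bios, sleeps until the last completion fires, then continues in a straight line instead of hopping across endio callbacks:

  #include <linux/bio.h>
  #include <linux/wait.h>

  struct io_wait_sketch {
          atomic_t pending;
          wait_queue_head_t wait;         /* init_waitqueue_head() by caller */
  };

  static void sketch_end_io(struct bio *bio)
  {
          struct io_wait_sketch *ctx = bio->bi_private;

          bio_put(bio);
          if (atomic_dec_and_test(&ctx->pending))
                  wake_up(&ctx->wait);
  }

  static void submit_and_wait(struct io_wait_sketch *ctx,
                              struct bio_list *bios)
  {
          struct bio *bio;

          atomic_set(&ctx->pending, bio_list_size(bios));
          while ((bio = bio_list_pop(bios))) {
                  bio->bi_private = ctx;
                  bio->bi_end_io = sketch_end_io;
                  submit_bio(bio);
          }
          /* Sleep here, then continue the workflow in one place. */
          wait_event(ctx->wait, atomic_read(&ctx->pending) == 0);
  }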
-
Submitted by Qu Wenruo
The helper will later be used to refactor the RMW write path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
The helper will later be used to refactor the whole RMW path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Currently btrfs uses endio functions to jump between the different stages of recovery. For example, we go through the following functions:

- raid56_bio_end_io()

  This handles the read for all the sectors (except those on the missing device).

- __raid_recover_end_io()

  This does the real work; it's called inside the delayed work function raid_recover_end_io_work().

This one recovery path involves at least three different functions, which is a big burden for readers.

This patch changes the behavior by:

- Introducing a unified recovery entrance, recover_rbio()

- Using the submit-and-wait method

  So the workflow is no longer interrupted by endio function jumps. This doesn't bring a performance change, but reduces the burden on reviewers.

- Running the main function in the rmw_workers workqueue

  Now raid56_parity_recover() only needs to set up the work and queue it using start_async_work(). Readers only need to follow one function jump (start_async_work()) to find the main entrance of the recovery path. Furthermore, the recover_rbio() function can easily be reused by other paths.

The old recovery path is still used by the degraded write path; it will be cleaned up when we have migrated the write path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
This includes extra changes:

- The allocation of unmap_array[] and pointers[]

  Now we allocate them in one go, and free them together.

- Remove @err

  Use errno_to_blk_status(ret) instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
This new helper will also be utilized in the incoming refactor of the recovery path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Currently finish_rmw() updates the P/Q stripes before submitting the writes. It's done inside a for(;;) loop and is a little congested indent-wise, so extract the code into a helper called generate_pq_vertical().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
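A simplified sketch of per-vertical-stripe parity generation, RAID5 case only (the real helper also generates Q for RAID6 via the raid6 library). A vertical stripe is the same sector offset taken across every device:

  #include <linux/string.h>

  static void gen_p_vertical(void **data_sectors, void *p_sector,
                             int nr_data, size_t sectorsize)
  {
          u8 *p = p_sector;
          int stripe;
          size_t i;

          memcpy(p, data_sectors[0], sectorsize);
          for (stripe = 1; stripe < nr_data; stripe++) {
                  const u8 *d = data_sectors[stripe];

                  for (i = 0; i < sectorsize; i++)
                          p[i] ^= d[i];   /* P = XOR of all data stripes */
          }
  }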
-
Submitted by Qu Wenruo
This refactor includes the following behavior change first:

- Don't error out if only P/Q is corrupted

  The old code errors out directly if only P/Q is corrupted. Although it is a logical error to go into the rebuild path with only P/Q corrupted, there is no need to error out; just skip the rebuild and return the already good data.

Then come the following refactors, which shouldn't cause behavior changes:

- Introduce a helper to do the vertical stripe recovery

  This not only reduces one indent level, but also paves the road for the later data checksum verification in RMW cycles.

- Sort rbio->faila/b before recovery

  So we don't need to do the same swap for every vertical stripe.

- Replace a BUG_ON() with an ASSERT()

  Or checkpatch won't let me pass.

- Mark recovered sectors uptodate after the recovery loop

- Do the cleanup of the pointers unconditionally

  We only need to initialize @pointers and @unmap_array to NULL, then we can safely free them unconditionally.

- Mark the repaired sector uptodate in recover_vertical()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by David Sterba
Update, reformat or reword function comments. This also removes the kdoc markers so we don't get reports when a function name is missing.

Changes made:

- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword into imperative voice
- align parameter lists
- fix typos

Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Josef Bacik
We have a bunch of printk helpers that live in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will clean up the printk helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
For btrfs RAID56, we have a caching system for btrfs raid bios (rbio): we call cache_rbio_pages() to mark a qualified rbio ready for caching.

This happens at:

- finish_rmw()

  At this point, we have already read all the necessary sectors, and together with the rbio sectors we have covered all data stripes.

- __raid_recover_end_io()

  At this point, we have rebuilt the rbio, so all the data sectors involved (either from the stripe or the bio list) are now uptodate.

Thus at cache_rbio_pages() time we should have all data sectors uptodate. This patch makes that explicit, mostly to prepare for the incoming verification at RMW time.

This patch adds:

- Extra ASSERT()s in cache_rbio_pages()

  This is to make sure all data sectors which are not covered by a bio are already uptodate.

- Extra ASSERT()s in steal_rbio()

  Since only a cached rbio can be stolen, every data sector should already be uptodate in the source rbio.

- An update to __raid_recover_end_io() to mark recovered sectors uptodate

  Previously __raid_recover_end_io() would only mark failed sectors uptodate when doing an RMW. But this can trigger the new ASSERT()s: in the recovery case, a recovered failed sector would not be marked uptodate, triggering the ASSERT() in a later cache_rbio_pages() call.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Currently, inside alloc_rbio() we make one large allocation to contain the following members:

- struct btrfs_raid_bio itself
- the stripe_pages array
- the bio_sectors array
- the stripe_sectors array
- the finish_pointers array

and then update the rbio pointers to point into the extra space after the rbio structure itself. This required a complex CONSUME_ALLOC() macro to help with the bookkeeping.

This is too hacky, and would make the upcoming pointer expansion harder. Change it to use a regular kcalloc() for each pointer inside btrfs_raid_bio, making later expansion much easier, and introduce a helper, free_raid_bio_pointers(), to free all the pointer members of btrfs_raid_bio. It is used in both free_raid_bio() and the error path of alloc_rbio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
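A sketch of the new allocation scheme with fields abridged and names illustrative: each array gets its own kcalloc(), and one helper frees them all, callable from both the normal free path and the allocation error path since kfree(NULL) is a no-op:

  #include <linux/slab.h>

  struct rbio_ptrs_sketch {
          struct page **stripe_pages;
          void **bio_sectors;
          void **stripe_sectors;
          void **finish_pointers;
  };

  static void free_rbio_pointers(struct rbio_ptrs_sketch *rbio)
  {
          kfree(rbio->stripe_pages);
          kfree(rbio->bio_sectors);
          kfree(rbio->stripe_sectors);
          kfree(rbio->finish_pointers);
  }

  static int alloc_rbio_pointers(struct rbio_ptrs_sketch *rbio,
                                 int nr_pages, int nr_sectors, int nr_stripes)
  {
          rbio->stripe_pages = kcalloc(nr_pages, sizeof(struct page *),
                                       GFP_NOFS);
          rbio->bio_sectors = kcalloc(nr_sectors, sizeof(void *), GFP_NOFS);
          rbio->stripe_sectors = kcalloc(nr_sectors, sizeof(void *), GFP_NOFS);
          rbio->finish_pointers = kcalloc(nr_stripes, sizeof(void *), GFP_NOFS);

          if (!rbio->stripe_pages || !rbio->bio_sectors ||
              !rbio->stripe_sectors || !rbio->finish_pointers) {
                  free_rbio_pointers(rbio);       /* kfree(NULL) is a no-op */
                  return -ENOMEM;
          }
          return 0;
  }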
-
Submitted by Qu Wenruo
The cleanup involves two things:

- Remove the "__" prefix, as there is no naming conflict

- Remove the forward declaration, as there is no special function call involved

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 24 October 2022, 2 commits
-
Submitted by Qu Wenruo
Currently, if full_stripe_write() fails to allocate the pages for parity, it calls __free_raid_bio() first and then returns -ENOMEM. But some callers of full_stripe_write() also call __free_raid_bio() again, which would cause a double free.

It's not logically sound either: normally we should either free the memory at the same level where we allocated it, or let endio handle everything.

Solve the double free by making raid56_parity_write() handle the error and free the rbio, just like what we do in raid56_parity_recover().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
In raid56_alloc_missing_rbio(), if we cannot determine where the missing device is inside the full stripe, we just BUG_ON(). This is not necessary, especially since the only caller inside scrub.c already checks the return value properly and treats it as a memory allocation failure.

Fix the error handling by:

- Adding an extra warning for the reason

  Although, personally speaking, it may be better to be an ASSERT().

- Properly freeing the allocated rbio

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 26 September 2022, 1 commit
-
Submitted by Christoph Hellwig
The parity raid write/recover functionality is currently not very well abstracted from the bio submission and completion handling in volumes.c:

- the raid56 code directly completes the original btrfs_bio fed into btrfs_submit_bio instead of dispatching back to volumes.c

- the raid56 code consumes the bioc and bio_counter references taken by volumes.c, which also leads to special casing of the calls from the scrub code into the raid56 code

To fix this up, supply a bi_end_io handler that calls back into the volumes.c machinery, which then puts the bioc, decrements the bio_counter and completes the original bio, and update the scrub code to also take ownership of the bioc and bio_counter in all cases.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 25 July 2022, 13 commits
-
Submitted by Christoph Hellwig
Transfer the bio counter reference acquired by btrfs_submit_bio to raid56_parity_write and raid56_parity_recover together with the bio that the reference was acquired for, instead of acquiring another reference in those helpers and dropping the original one in btrfs_submit_bio.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Christoph Hellwig
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors.

Also use the proper bool type for the generic_io argument.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Christoph Hellwig
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Christoph Hellwig
The raid56 code assumes a fixed stripe length of BTRFS_STRIPE_LEN, yet there are functions passing it as an argument, which is not necessary. The fixed value has been used for a long time, and though the stripe length should in theory be configurable by the super block member stripesize, this hasn't been implemented and would require more changes, so we don't need to keep this code around until then.

Partially based on a patch from Qu Wenruo.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
[BUG]
There is a small workload which will always fail on recent kernels (a simplified version of the btrfs/125 test case):

  mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
  sync
  umount $mnt

  btrfs dev scan -u $dev3
  mount -o degraded $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  umount $mnt

  btrfs dev scan
  mount $dev1 $mnt
  btrfs balance start --full-balance $mnt
  umount $mnt

The failure is always a failure to read some tree blocks:

  BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
  ...

[CAUSE]
With the recently added debug output, we can see all the RAID56 operations related to full stripe 38928384:

  56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
  56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
  56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
  56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
  56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
  56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
  56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
  56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
  56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
  56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
  56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
  56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2

Before we enter raid56_parity_recover(), we have triggered some metadata writes for full stripe 38928384, which lead us to read all the sectors from disk.

Furthermore, a btrfs raid56 write caches its calculated P/Q sectors to avoid unnecessary reads. This means that, for that full stripe, after any partial write we have stale data, along with P/Q calculated using that stale data.

Thankfully, due to the patch "btrfs: only write the sectors in the vertical stripe which has data stripes", we haven't submitted all the corrupted P/Q to disk.

When we really need to recover a certain range, i.e. in raid56_parity_recover(), we use the cached rbio along with its cached sectors (the full stripe is fully cached). This explains why no raid56_scrub_read_recover() event was triggered.

Since we have cached P/Q calculated from the stale data, the recovered result will also be stale. In our particular test case, it always returns the same incorrect metadata, causing the same error message "parent transid verify failed on 39010304 wanted 9 found 7" again and again.

[BTRFS DESTRUCTIVE RMW PROBLEM]
Test case btrfs/125 (and the above workload) always has trouble with the destructive read-modify-write (RMW) cycle:

          0       32K     64K
  Data1:  | Good  | Good  |
  Data2:  | Bad   | Bad   |
  Parity: | Good  | Good  |

In the above case, if we trigger any write into Data1, we will use the bad data in Data2 to re-generate parity, killing the only chance to recover Data2; Data2 is then lost forever.

This destructive RMW cycle is not specific to btrfs RAID56, but some btrfs-specific behaviors make the case even worse:

- Btrfs will cache sectors for unrelated vertical stripes

  In the above example, if we're only writing into the 0~32K range, btrfs will still read the data range (32K ~ 64K) of Data1, and (64K ~ 128K) of Data2. This behavior is to cache sectors for later updates.

  Incidentally, commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible") has a bug which makes RAID56 never trust the cached sectors, thus slightly improving the situation for recovery. Unfortunately, the follow-up fix "btrfs: update stripe_sectors::uptodate in steal_rbio" will revert the behavior back to the old one.

- A btrfs raid56 partial write updates all P/Q sectors and caches them

  This means that even if the data at (64K ~ 96K) of Data2 is free space, and only (96K ~ 128K) of Data2 is really stale data, a write into that (96K ~ 128K) range updates all the parity sectors for the full stripe. This unnecessary behavior completely kills the chance of recovery.

  Thankfully, an unrelated optimization, "btrfs: only write the sectors in the vertical stripe which has data stripes", prevents submitting the write bios for untouched vertical sectors. That optimization keeps the on-disk P/Q untouched for a chance of later recovery.

[FIX]
Although we have no good way to completely fix the destructive RMW cycle (unless we do a full scrub for each partial write), we can still limit the damage.

With the patch "btrfs: only write the sectors in the vertical stripe which has data stripes" we no longer submit the P/Q of unrelated vertical stripes, so the on-disk P/Q should still be fine.

What we really need to do is just drop all the cached sectors when doing recovery. With this, we have a chance to read the original P/Q from disk and recover the stale data, while still keeping the cache to speed up the regular write path.

In fact, just dropping all the cache for the recovery path is good enough to allow the test case btrfs/125, along with the small script above, to pass reliably. The lack of metadata writes after the degraded mount, and the forced metadata COW, are saving us this time.

So this patch fixes the behavior by not trusting any cache in __raid56_parity_recover(), solving the problem while keeping the cache useful.

But please note that this test passing DOES NOT mean we have solved the destructive RMW problem; we just do damage control a little better.

Related patches:

- btrfs: only write the sectors in the vertical stripe
- d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
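A sketch of the damage-control idea, with field names illustrative rather than the btrfs structures: before recovery, mark every cached sector as not uptodate so the P/Q are re-read from disk instead of trusting values computed from possibly stale data:

  struct cached_sector_sketch {
          struct page *page;
          unsigned int pgoff;
          unsigned int uptodate:1;
  };

  static void distrust_cached_sectors(struct cached_sector_sketch *sectors,
                                      int nr_sectors)
  {
          int i;

          for (i = 0; i < nr_sectors; i++)
                  sectors[i].uptodate = 0;        /* force a fresh read */
  }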
-
Submitted by Qu Wenruo
Use the raid table instead of hard-coded values, and rename the helper as it is exported. This could make later extensions of RAID56-based profiles easier.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
Originally this iterates over all the sectors which have a dbitmap bit set for the vertical stripe. It can easily be converted to a sector bytenr iteration with a test_bit() call.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
This function doesn't even utilize the full stripe skip; just iterating all the data sectors is definitely enough.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
The double loop just checks whether the page for the vertical stripe is allocated. We can easily convert it to a single loop and get rid of the @stripe variable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
The double for loop can easily be converted to a single for loop, as we're really iterating the sectors in their bytenr order. The only exception is the full stripe skip; however, that can also easily be done inside the loop. Add an ASSERT() along with a comment for that specific case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Qu Wenruo
We can easily calculate the stripe number and the sector number inside the stripe, so there is not much need for a double for loop. For the only case where we want to skip a whole stripe, we can manually increase @total_sector_nr. This is not a recommended pattern, so every time the iterator gets modified there is a comment along with an ASSERT(). A sketch of the pattern follows.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
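A sketch of the single-loop pattern used by these conversions; skip_stripe() and handle_sector() are hypothetical stand-ins for whatever condition skips a stripe and whatever work is done per sector:

  int skip_stripe(int stripe);                  /* hypothetical predicate */
  void handle_sector(int stripe, int sectnr);   /* hypothetical */

  static void iterate_all_sectors(int nr_stripes, int sectors_per_stripe)
  {
          int total_sector_nr;

          for (total_sector_nr = 0;
               total_sector_nr < nr_stripes * sectors_per_stripe;
               total_sector_nr++) {
                  int stripe = total_sector_nr / sectors_per_stripe;
                  int sectnr = total_sector_nr % sectors_per_stripe;

                  if (skip_stripe(stripe)) {
                          /* Manually advance past the rest of this
                           * stripe; each such adjustment deserves a
                           * comment and an ASSERT() in the real code. */
                          total_sector_nr += sectors_per_stripe - sectnr - 1;
                          continue;
                  }
                  handle_sector(stripe, sectnr);
          }
  }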
-
Submitted by Christoph Hellwig
All the bios that index_one_bio operates on are the bios submitted by the upper layer. These are never resubmitted to an actual device by the raid56 code, and thus the iter never changes from its initial state. Thus we can always just use bi_iter directly, as it will be the same as the saved copy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Submitted by Christoph Hellwig
Instead of attaching an extra allocation and an indirect call to each low-level bio issued by the RAID code, add a work_struct to struct btrfs_raid_bio and only defer the per-rbio completion action. The per-bio actions for all the I/Os are trivial and can be safely done from interrupt context.

As a nice side effect, this also allows sharing the boilerplate code for the per-bio completions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
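A sketch of the deferral scheme with hypothetical names: the trivial per-bio accounting runs directly in interrupt context, and only the per-rbio completion action is punted to a workqueue via a work_struct embedded in the rbio:

  #include <linux/bio.h>
  #include <linux/workqueue.h>
  #include <linux/printk.h>

  struct rbio_work_sketch {
          atomic_t stripes_pending;
          struct work_struct end_io_work;
  };

  static void rbio_end_io_work(struct work_struct *work)
  {
          struct rbio_work_sketch *rbio =
                  container_of(work, struct rbio_work_sketch, end_io_work);

          /* The heavy per-rbio completion action (recover, finish the
           * original bio, ...) runs here, in process context. */
          pr_debug("rbio %p: all stripe bios completed\n", rbio);
  }

  static void sketch_raid_end_io(struct bio *bio)
  {
          struct rbio_work_sketch *rbio = bio->bi_private;

          bio_put(bio);   /* trivial per-bio action, safe in irq context */
          if (atomic_dec_and_test(&rbio->stripes_pending)) {
                  INIT_WORK(&rbio->end_io_work, rbio_end_io_work);
                  queue_work(system_unbound_wq, &rbio->end_io_work);
          }
  }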
-