1. 06 Dec 2022, 3 commits
    • btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate · 88074c8b
      Qu Wenruo committed
      For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio).
      
      We call cache_rbio_pages() to mark a qualified rbio ready for cache.
      
      The timing happens at:
      
      - finish_rmw()
      
        At this point we have already read all the necessary sectors; together
        with the sectors already in the rbio, all data stripes are covered.
      
      - __raid_recover_end_io()
      
        At this point we have rebuilt the rbio, so all data sectors
        involved (either from the stripe or the bio list) are now uptodate.

      Thus by the time cache_rbio_pages() is called, all data sectors
      should be uptodate.
      
      This patch will make it explicit that all data sectors are uptodate at
      cache_rbio_pages() timing, mostly to prepare for the incoming
      verification at RMW time.
      
      This patch will add:
      
      - Extra ASSERT()s in cache_rbio_pages()
        This is to make sure all data sectors, which are not covered by bio,
        are already uptodate.
      
      - Extra ASSERT()s in steal_rbio()
        Since only a cached rbio can be stolen, every data sector in the
        source rbio should already be uptodate.
      
      - Update __raid_recover_end_io() to mark recovered sectors uptodate
        Previously __raid_recover_end_io() would only mark failed sectors
        uptodate when doing an RMW.

        Without that, the new ASSERT()s would trigger: in the recovery case a
        recovered failed sector would not be marked uptodate, and the later
        cache_rbio_pages() call would hit the ASSERT().
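
      As a rough, self-contained illustration (not the actual btrfs code; the
      struct and field names below are simplified stand-ins), the added check
      amounts to asserting every data sector's uptodate flag before the rbio
      is cached or stolen from:

        #include <assert.h>
        #include <stdbool.h>

        /* Simplified stand-in for the per-sector state tracked by an rbio. */
        struct sector {
                bool uptodate;
        };

        struct rbio {
                int nr_data_sectors;            /* data sectors only, no P/Q */
                struct sector *stripe_sectors;
        };

        /* Model of the new check: every data sector must be uptodate before
         * the rbio is marked cached (or stolen from). */
        static void assert_rbio_data_uptodate(const struct rbio *rbio)
        {
                for (int i = 0; i < rbio->nr_data_sectors; i++)
                        assert(rbio->stripe_sectors[i].uptodate);
        }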
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      88074c8b
    • btrfs: raid56: allocate memory separately for rbio pointers · 797d74b7
      Qu Wenruo committed
      Currently inside alloc_rbio(), we allocate one larger chunk of memory to
      contain the following members:
      
      - struct btrfs_raid_bio itself
      - stripe_pages array
      - bio_sectors array
      - stripe_sectors array
      - finish_pointers array
      
      Then we update the rbio pointers to point into the extra space after the
      rbio structure itself.

      This required a complex CONSUME_ALLOC() macro to do the bookkeeping.

      It is too hacky and makes later pointer expansion harder.
      
      This patch will change it to use regular kcalloc() for each pointer
      inside btrfs_raid_bio, making the later expansion much easier.
      
      Also introduce a helper free_raid_bio_pointers() to free all the pointer
      members in btrfs_raid_bio; it is used in both free_raid_bio() and the
      error path of alloc_rbio().
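
      A simplified, stand-alone sketch of the resulting shape (field names are
      abbreviated stand-ins, not the exact btrfs definitions): each pointer
      member gets its own allocation, and one helper releases them all so the
      normal free path and the alloc_rbio() error path can share it:

        #include <stdlib.h>

        struct raid_bio {
                void **stripe_pages;
                void **bio_sectors;
                void **stripe_sectors;
                void **finish_pointers;
        };

        static void free_raid_bio_pointers(struct raid_bio *rbio)
        {
                free(rbio->stripe_pages);
                free(rbio->bio_sectors);
                free(rbio->stripe_sectors);
                free(rbio->finish_pointers);
        }

        static struct raid_bio *alloc_raid_bio(int nr_pages, int nr_sectors)
        {
                struct raid_bio *rbio = calloc(1, sizeof(*rbio));

                if (!rbio)
                        return NULL;
                rbio->stripe_pages = calloc(nr_pages, sizeof(void *));
                rbio->bio_sectors = calloc(nr_sectors, sizeof(void *));
                rbio->stripe_sectors = calloc(nr_sectors, sizeof(void *));
                rbio->finish_pointers = calloc(nr_pages, sizeof(void *));
                if (!rbio->stripe_pages || !rbio->bio_sectors ||
                    !rbio->stripe_sectors || !rbio->finish_pointers) {
                        /* Shared between the error path and free_raid_bio(). */
                        free_raid_bio_pointers(rbio);
                        free(rbio);
                        return NULL;
                }
                return rbio;
        }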
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      797d74b7
    • btrfs: raid56: cleanup for function __free_raid_bio() · ff2b64a2
      Qu Wenruo committed
      The cleanup involves two things:
      
      - Remove the "__" prefix
        There is no naming conflict.
      
      - Remove the forward declaration
        There is no special function call involved.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ff2b64a2
  2. 24 Oct 2022, 2 commits
    • btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed · ab4c54c6
      Qu Wenruo committed
      Currently, if full_stripe_write() fails to allocate the pages for
      parity, it calls __free_raid_bio() first, then returns -ENOMEM.

      But some callers of full_stripe_write() will also call __free_raid_bio()
      again, causing a double free.

      It is not logically sound either: normally we should either free the
      memory at the same level where we allocated it, or let the endio handler
      deal with everything.

      So solve the double free by making raid56_parity_write() handle the
      error and free the rbio, just like what we do in
      raid56_parity_recover().
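
      A minimal model of the ownership rule (illustrative names, not the
      btrfs API): the inner helper only reports the failure, and the
      top-level submitter is the one place that frees the rbio, so no path
      frees it twice:

        #include <errno.h>
        #include <stdlib.h>

        struct raid_bio { void *parity_pages; };

        static int full_stripe_write(struct raid_bio *rbio)
        {
                rbio->parity_pages = malloc(4096);
                if (!rbio->parity_pages)
                        return -ENOMEM; /* report only; do NOT free rbio here */
                /* ... queue the actual stripe writes ... */
                return 0;
        }

        /* The submitter is the single owner of the error path: it frees the
         * rbio exactly once. */
        static int raid56_parity_write(struct raid_bio *rbio)
        {
                int ret = full_stripe_write(rbio);

                if (ret < 0) {
                        free(rbio->parity_pages);
                        free(rbio);
                }
                return ret;
        }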
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ab4c54c6
    • btrfs: raid56: properly handle the error when unable to find the missing stripe · f15fb2cd
      Qu Wenruo committed
      In raid56_alloc_missing_rbio(), if we can not determine where the
      missing device is inside the full stripe, we just BUG_ON().
      
      This is not necessary, especially since the only caller inside scrub.c
      already checks the return value properly and will treat it as a memory
      allocation failure.
      
      Fix the error handling by:
      
      - Add an extra warning for the reason
        Although personally speaking it may be better to be an ASSERT().
      
      - Properly free the allocated rbio
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f15fb2cd
  3. 26 Sep 2022, 1 commit
  4. 25 Jul 2022, 18 commits
    • btrfs: raid56: transfer the bio counter reference to the raid submission helpers · b9af128d
      Christoph Hellwig committed
      Transfer the bio counter reference acquired by btrfs_submit_bio to
      raid56_parity_write and raid56_parity_recover together with the bio
      that the reference was acquired for, instead of acquiring another
      reference in those helpers and dropping the original one in
      btrfs_submit_bio.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b9af128d
    • btrfs: do not return errors from raid56_parity_recover · 6065fd95
      Christoph Hellwig committed
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission does and avoids any confusion on who
      needs to handle errors.
      
      Also use the proper bool type for the generic_io argument.
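
      As a simplified stand-alone model of that convention (illustrative
      names, not the btrfs or block layer API): on a setup error the
      submission path records the error in the bio and calls its end_io
      handler, so callers never branch on a return value:

        struct bio;
        typedef void (*bio_end_io_t)(struct bio *bio);

        struct bio {
                int status;     /* 0 on success, negative errno on error */
                bio_end_io_t end_io;
        };

        /* Mark the bio as failed and complete it; the bio is consumed. */
        static void bio_fail_and_end(struct bio *bio, int error)
        {
                bio->status = error;
                if (bio->end_io)
                        bio->end_io(bio);
        }

        static void submit_recover(struct bio *bio, int setup_error)
        {
                if (setup_error) {
                        bio_fail_and_end(bio, setup_error);
                        return; /* nothing left for the caller to handle */
                }
                /* ... normal recovery submission ... */
        }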
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6065fd95
    • btrfs: do not return errors from raid56_parity_write · 31683f4a
      Christoph Hellwig committed
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission does and avoids any confusion on who
      needs to handle errors.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31683f4a
    • btrfs: raid56: use fixed stripe length everywhere · ff18a4af
      Christoph Hellwig committed
      The raid56 code assumes a fixed stripe length of BTRFS_STRIPE_LEN, yet
      there are functions passing it around as an argument, which is not
      necessary. The fixed value has been used for a long time, and though the
      stripe length should be configurable via the super block member
      stripesize, this hasn't been implemented and would require more changes,
      so we don't need to keep this code around until then.
      
      Partially based on a patch from Qu Wenruo.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ff18a4af
    • btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() · f6065f8e
      Qu Wenruo committed
      [BUG]
      There is a small workload which will always fail with recent kernel:
      (A simplified version from btrfs/125 test case)
      
        mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
        sync
        umount $mnt
        btrfs dev scan -u $dev3
        mount -o degraded $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
        umount $mnt
        btrfs dev scan
        mount $dev1 $mnt
        btrfs balance start --full-balance $mnt
        umount $mnt
      
      The failure is always a failure to read some tree blocks:
      
        BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        ...
      
      [CAUSE]
      With the recently added debug output, we can see all RAID56 operations
      related to full stripe 38928384:
      
        56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
        56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
        56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
        56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
        56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
        56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
      
      Before we enter raid56_parity_recover(), we have triggered some metadata
      writes for the full stripe 38928384, which leads us to read all the
      sectors from disk.
      
      Furthermore, the btrfs raid56 write path caches its calculated P/Q
      sectors to avoid unnecessary reads.

      This means that for that full stripe, after any partial write, we will
      have stale data, along with P/Q calculated using that stale data.
      
      Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
      which has data stripes" we haven't submitted all the corrupted P/Q to disk.
      
      When we really need to recover a certain range, i.e. in
      raid56_parity_recover(), we will use the cached rbio along with its
      cached sectors (the full stripe is all cached).

      This explains why no raid56_scrub_read_recover() event was triggered.
      
      Since we have the cached P/Q which is calculated using the stale data,
      the recovered one will just be stale.
      
      In our particular test case, it will always return the same incorrect
      metadata, thus causing the same error message "parent transid verify
      failed on 39010304 wanted 9 found 7" again and again.
      
      [BTRFS DESTRUCTIVE RMW PROBLEM]
      
      Test case btrfs/125 (and above workload) always has its trouble with
      the destructive read-modify-write (RMW) cycle:
      
              0       32K     64K
      Data1:  | Good  | Good  |
      Data2:  | Bad   | Bad   |
      Parity: | Good  | Good  |
      
      In the above case, if we trigger any write into Data1, we will use the
      bad data in Data2 to re-generate parity, killing the only chance to
      recover Data2; the data in Data2 is lost forever.
      
      This destructive RMW cycle is not specific to btrfs RAID56, but there
      are some btrfs specific behaviors making the case even worse:
      
      - Btrfs will cache sectors for unrelated vertical stripes.
      
        In the above example, if we're only writing into the 0~32K range,
        btrfs will still read the data range (32K ~ 64K) of Data1, and
        (64K~128K) of Data2.  This behavior is to cache sectors for a later
        update.

        Incidentally, commit d4e28d9b ("btrfs: raid56: make steal_rbio()
        subpage compatible") has a bug which makes RAID56 never trust the
        cached sectors, which slightly improves the situation for recovery.

        Unfortunately, the follow-up fix "btrfs: update stripe_sectors::uptodate
        in steal_rbio" reverts the behavior back to the old one.
      
      - Btrfs raid56 partial write will update all P/Q sectors and cache them
      
        This means that even if the data at (64K ~ 96K) of Data2 is free space
        and only (96K ~ 128K) of Data2 is really stale data, a write into that
        (96K ~ 128K) range will update all the parity sectors of the full
        stripe.
      
        This unnecessary behavior will completely kill the chance of recovery.
      
        Thankfully, an unrelated optimization "btrfs: only write the sectors
        in the vertical stripe which has data stripes" will prevent
        submitting the write bio for untouched vertical sectors.
      
        That optimization will keep the on-disk P/Q untouched for a chance for
        later recovery.
      
      [FIX]
      Although we have no good way to completely fix the destructive RMW
      (unless we go full scrub for each partial write), we can still limit the
      damage.
      
      With patch "btrfs: only write the sectors in the vertical stripe which
      has data stripes" now we won't really submit the P/Q of unrelated
      vertical stripes, so the on-disk P/Q should still be fine.
      
      Now all we really need to do is drop all the cached sectors when doing
      recovery.

      This way we have a chance to read the original P/Q from disk and recover
      the stale data, while still keeping the cache to speed up the regular
      write path.

      In fact, just dropping all the cache for the recovery path is good
      enough to allow the test case btrfs/125, along with the small script
      above, to pass reliably.

      The lack of metadata writes after the degraded mount, together with the
      forced metadata COW, is what saves us this time.

      So fix the behavior by not trusting any cached sectors in
      __raid56_parity_recover(), solving the problem while still keeping the
      cache useful.

      But please note that this passing test DOES NOT mean we have solved the
      destructive RMW problem; we just do damage control a little better.
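
      In effect the recovery path now starts like the following sketch
      (simplified, illustrative field names, not the real btrfs structures):
      every cached sector has its uptodate flag cleared before the rebuild,
      forcing a fresh read of data and P/Q from disk:

        #include <stdbool.h>

        struct sector { bool uptodate; };

        /* Do not trust any cached sector content when recovering: clear the
         * uptodate flags so data and P/Q are re-read from disk first. */
        static void distrust_cached_sectors(struct sector *sectors, int nr_sectors)
        {
                for (int i = 0; i < nr_sectors; i++)
                        sectors[i].uptodate = false;
        }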
      
      Related patches:
      
      - btrfs: only write the sectors in the vertical stripe
      - d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      - btrfs: update stripe_sectors::uptodate in steal_rbio
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f6065f8e
    • btrfs: use btrfs_raid_array to calculate number of parity stripes · 0b30f719
      Qu Wenruo committed
      Use the raid table instead of hard coded values and rename the helper as
      it is exported.  This could make later extension on RAID56 based
      profiles easier.
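
      Conceptually the helper becomes a table lookup instead of hard-coded
      values; a minimal stand-alone sketch (the table below is a simplified
      stand-in, not the real btrfs_raid_array):

        enum raid_profile { RAID_RAID5, RAID_RAID6, RAID_NR };

        /* Per-profile attributes, indexed by profile. */
        static const struct { int nparity; } raid_table[RAID_NR] = {
                [RAID_RAID5] = { .nparity = 1 },
                [RAID_RAID6] = { .nparity = 2 },
        };

        static int nr_parity_stripes(enum raid_profile profile)
        {
                return raid_table[profile].nparity;
        }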
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0b30f719
    • btrfs: raid56: avoid double for loop inside raid56_parity_scrub_stripe() · 1c10702e
      Qu Wenruo committed
      Originally it iterates over all the sectors whose vertical stripe has a
      bit set in dbitmap.

      This can easily be converted to iteration by sector bytenr with a
      test_bit() call.
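
      A sketch of the converted loop (simplified: dbitmap is modelled as a
      plain unsigned long and the names are stand-ins): one counter walks all
      sectors in bytenr order and sectors whose vertical stripe has no dbitmap
      bit set are skipped:

        /* One flat loop over all sectors; skip sectors whose vertical stripe
         * (sector number inside the stripe) is not marked in dbitmap. */
        static void iterate_marked_sectors(unsigned long dbitmap,
                                           int nr_stripes, int stripe_nsectors)
        {
                for (int total = 0; total < nr_stripes * stripe_nsectors; total++) {
                        int sectornr = total % stripe_nsectors;

                        if (!(dbitmap & (1UL << sectornr)))
                                continue;  /* no data in this vertical stripe */
                        /* ... process sector (total / stripe_nsectors, sectornr) ... */
                }
        }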
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1c10702e
    • btrfs: raid56: avoid double for loop inside raid56_rmw_stripe() · 550cdeb3
      Qu Wenruo committed
      This function doesn't even utilize the full stripe skip; just iterating
      over all the data sectors is definitely enough.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      550cdeb3
    • btrfs: raid56: avoid double for loop inside alloc_rbio_essential_pages() · aee35e4b
      Qu Wenruo committed
      The double loop is just checking if the page for the vertical stripe
      is allocated.
      
      We can easily convert it to a single loop and get rid of the @stripe
      variable.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      aee35e4b
    • btrfs: raid56: avoid double for loop inside __raid56_parity_recover() · ef340fcc
      Qu Wenruo committed
      The double for loop can easily be converted to a single for loop, as
      we're really iterating over the sectors in their bytenr order.
      
      The only exception is the full stripe skip, however that can also easily
      be done inside the loop.  Add an ASSERT() along with a comment for that
      specific case.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ef340fcc
    • btrfs: raid56: avoid double for loop inside finish_rmw() · 36920044
      Qu Wenruo committed
      We can easily calculate the stripe number and sector number inside the
      stripe.  Thus there is not much need for a double for loop.
      
      For the only case where we want to skip the whole stripe, we can
      manually increase @total_sector_nr.
      This is not recommended behavior, thus every time the iterator gets
      modified there will be a comment along with an ASSERT() for it.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36920044
    • btrfs: stop looking at btrfs_bio->iter in index_one_bio · 5eecef71
      Christoph Hellwig committed
      All the bios that index_one_bio operates on are the bios submitted by the
      upper layer.  These are never resubmitted to an actual device by the
      raid56 code, and thus the iter never changes from the initial state.
      Thus we can always just use bi_iter directly as it will be the same as
      the saved copy.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5eecef71
    • btrfs: defer I/O completion based on the btrfs_raid_bio · d34e123d
      Christoph Hellwig committed
      Instead of attaching an extra allocation and an indirect call to each
      low-level bio issued by the RAID code, add a work_struct to struct
      btrfs_raid_bio and only defer the per-rbio completion action.  The
      per-bio action for all the I/Os is trivial and can be safely done
      from interrupt context.

      As a nice side effect this also allows sharing the boilerplate code
      for the per-bio completions.
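
      A rough stand-alone model of the split (simplified types, not the
      kernel's bio or workqueue API): the per-bio completion only counts down
      and records errors, which is interrupt-safe, and the last completion
      hands the heavy per-rbio action to a deferred work handler:

        #include <stdatomic.h>
        #include <stdbool.h>

        struct raid_bio {
                atomic_int bios_pending;        /* low-level bios still in flight */
                atomic_bool error_seen;
                void (*end_io_work)(struct raid_bio *rbio); /* deferred action */
        };

        /* Trivial per-bio completion, safe from interrupt context. */
        static void raid_bio_end_io(struct raid_bio *rbio, bool failed)
        {
                if (failed)
                        atomic_store(&rbio->error_seen, true);
                /* Last bio in flight: run the deferred per-rbio completion
                 * (in the kernel this would be queued to a workqueue). */
                if (atomic_fetch_sub(&rbio->bios_pending, 1) == 1)
                        rbio->end_io_work(rbio);
        }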
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d34e123d
    • btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo committed
      Add tracepoints for better insight into how the RAID56 data is submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:	the logical bytenr of the full stripe
        devid:	btrfs devid
        type:		raid stripe type.
               	DATA1:	the first data stripe
               	DATA2:	the second data stripe
               	PQ1:	the P stripe
               	PQ2:	the Q stripe
        offset:	the offset inside the stripe.
        opf:		the bio op type
        physical:	the physical offset the bio is for
        len:		the length of the bio
      
      The first two lines are from partial RMW read, which is reading the
      remaining data stripes from disks.
      
      The last two lines are for full stripe RMW write, which is writing the
      involved two 16K stripes (one for DATA1 stripe, one for P stripe).
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included in super.c, we have to export
      the needed structure definitions into 'raid56.h' and include that header
      in super.c, otherwise we cannot access those members.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      b8bea09a
    • btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo committed
      [BUG]
      With added debugging, it turns out the following write sequence causes
      unnecessary extra reads:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still a partial read, even though 389152768 was already cached by
         the first write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still a partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K write, even within the same full stripe, still
      triggers a read for previously cached data.

      This causes extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but during
      that conversion, there is one thing missing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but on
      rbio->stripe_sectors[i].uptodate, to determine if a sector is uptodate.

      This means that previously, once we switched the page pointers,
      everything was done, as the PageUptodate flag was still bound to those
      pages.

      But now we have to manually mark the involved sectors uptodate, or a
      later raid56_rmw_stripe() will find the stolen sectors not uptodate and
      assemble read bios for them, wasting IO.
      
      [FIX]
      We can easily fix the bug by also updating
      rbio->stripe_sectors[].uptodate in steal_rbio().
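
      A minimal sketch of the idea (simplified structures, not the real btrfs
      types): when the page pointers are stolen into the destination rbio, the
      matching per-sector uptodate flags are now set as well so the RMW path
      does not re-read them from disk:

        #include <stdbool.h>
        #include <stddef.h>

        struct sector { bool uptodate; };

        struct rbio {
                void **stripe_pages;
                struct sector *stripe_sectors;
                int nr_pages;
                int nr_sectors;
        };

        static void steal_rbio(struct rbio *src, struct rbio *dest)
        {
                /* Hand over the already-written pages. */
                for (int i = 0; i < src->nr_pages; i++) {
                        dest->stripe_pages[i] = src->stripe_pages[i];
                        src->stripe_pages[i] = NULL;
                }
                /* The missing piece of the original conversion: mark the
                 * corresponding destination sectors uptodate too. */
                for (int i = 0; i < src->nr_sectors; i++)
                        dest->stripe_sectors[i].uptodate = true;
        }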
      
      With this fixed, now the same write pattern no longer leads to the same
      unnecessary read:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d100466
    • btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo committed
      If we have only an 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that although we won't write any sectors to disk 2, we will still
      write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we waste quite some space to journal the unused parity
      stripes).
      
      So here we also utilize btrfs_raid_bio::dbitmap: anytime we queue a
      higher level bio into an rbio, we update rbio::dbitmap to indicate which
      vertical stripes we need to write back.

      And at finish_rmw(), we also check dbitmap to see whether we need to
      write any sector of a given vertical stripe, as sketched below.
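
      A simplified stand-alone sketch of the bookkeeping (one bit per
      vertical stripe in a plain unsigned long; names are illustrative):

        #include <stdbool.h>

        /* Mark the vertical stripes covered by [offset, offset + len) of the
         * data stripes, one bit per sectorsize-sized vertical stripe. */
        static void rbio_update_dbitmap(unsigned long *dbitmap, unsigned int offset,
                                        unsigned int len, unsigned int sectorsize)
        {
                unsigned int first = offset / sectorsize;
                unsigned int last = (offset + len - 1) / sectorsize;

                for (unsigned int bit = first; bit <= last; bit++)
                        *dbitmap |= 1UL << bit;
        }

        /* At write-back time, only touch vertical stripes that carry data. */
        static bool rbio_sector_needs_write(unsigned long dbitmap,
                                            unsigned int sectornr)
        {
                return dbitmap & (1UL << sectornr);
        }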
      
      So after the patch, the above example will only lead to the following
      writeback pattern:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd8f7e62
    • btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo committed
      Previously we used "unsigned long *" for those two bitmaps.

      But since we only support a fixed stripe length (64KiB, already checked
      in the tree-checker), "unsigned long *" is a waste of memory when we can
      just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
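
      For example, with a 64KiB stripe and 4KiB sectors there are only 16
      vertical stripes, which comfortably fits in one unsigned long.  A sketch
      of the safety check (illustrative, not the exact btrfs assertion):

        #include <assert.h>
        #include <limits.h>

        #define STRIPE_LEN (64 * 1024)

        static unsigned int calc_stripe_nsectors(unsigned int sectorsize)
        {
                unsigned int stripe_nsectors = STRIPE_LEN / sectorsize;

                /* dbitmap/pbitmap are plain unsigned long now, so every
                 * vertical stripe must map to one bit of it. */
                assert(stripe_nsectors <= sizeof(unsigned long) * CHAR_BIT);
                return stripe_nsectors;
        }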
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c67c68eb
    • btrfs: fix typos in comments · 143823cf
      David Sterba committed
      Codespell has found a few typos.
      Signed-off-by: David Sterba <dsterba@suse.com>
      143823cf
  5. 15 Jul 2022, 1 commit
  6. 16 May 2022, 15 commits