1. 06 Dec 2022, 3 commits
    • btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate · 88074c8b
      Qu Wenruo committed
      For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio).
      
      We call cache_rbio_pages() to mark a qualified rbio ready for cache.
      
      The timing happens at:
      
      - finish_rmw()
      
        At this point we have already read all the necessary sectors; together
        with the sectors already in the rbio, all data stripes are covered.
      
      - __raid_recover_end_io()
      
        At this point we have rebuilt the rbio, so all data sectors
        involved (either from the stripe or the bio list) are now uptodate.

      Thus by the time cache_rbio_pages() is called, all data sectors
      should be uptodate.
      
      This patch will make it explicit that all data sectors are uptodate at
      cache_rbio_pages() timing, mostly to prepare for the incoming
      verification at RMW time.
      
      This patch will add:
      
      - Extra ASSERT()s in cache_rbio_pages()
        This is to make sure all data sectors, which are not covered by bio,
        are already uptodate.
      
      - Extra ASSERT()s in steal_rbio()
        Since only a cached rbio can be stolen, every data sector in the
        source rbio should already be uptodate.
      
      - Update __raid_recover_end_io() to mark recovered sectors uptodate
        Previously __raid_recover_end_io() would only mark failed sectors
        uptodate when doing an RMW.

        Without that, the new ASSERT()s would trigger: in the recovery case a
        recovered failed sector would not be marked uptodate, and the later
        cache_rbio_pages() call would hit the ASSERT().
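
      As a rough, self-contained illustration (not the actual btrfs code; the
      struct and field names below are simplified stand-ins), the added check
      amounts to asserting every data sector's uptodate flag before the rbio
      is cached or stolen from:

        #include <assert.h>
        #include <stdbool.h>

        /* Simplified stand-in for the per-sector state tracked by an rbio. */
        struct sector {
                bool uptodate;
        };

        struct rbio {
                int nr_data_sectors;            /* data sectors only, no P/Q */
                struct sector *stripe_sectors;
        };

        /* Model of the new check: every data sector must be uptodate before
         * the rbio is marked cached (or stolen from). */
        static void assert_rbio_data_uptodate(const struct rbio *rbio)
        {
                for (int i = 0; i < rbio->nr_data_sectors; i++)
                        assert(rbio->stripe_sectors[i].uptodate);
        }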
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      88074c8b
    • btrfs: raid56: allocate memory separately for rbio pointers · 797d74b7
      Qu Wenruo committed
      Currently inside alloc_rbio(), we allocate one larger chunk of memory to
      contain the following members:
      
      - struct btrfs_raid_bio itself
      - stripe_pages array
      - bio_sectors array
      - stripe_sectors array
      - finish_pointers array
      
      Then we update the rbio pointers to point into the extra space after the
      rbio structure itself.

      This required a complex CONSUME_ALLOC() macro to do the bookkeeping.

      It is too hacky and makes later pointer expansion harder.
      
      This patch will change it to use regular kcalloc() for each pointer
      inside btrfs_raid_bio, making the later expansion much easier.
      
      Also introduce a helper free_raid_bio_pointers() to free all the pointer
      members in btrfs_raid_bio; it is used in both free_raid_bio() and the
      error path of alloc_rbio().
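
      A simplified, stand-alone sketch of the resulting shape (field names are
      abbreviated stand-ins, not the exact btrfs definitions): each pointer
      member gets its own allocation, and one helper releases them all so the
      normal free path and the alloc_rbio() error path can share it:

        #include <stdlib.h>

        struct raid_bio {
                void **stripe_pages;
                void **bio_sectors;
                void **stripe_sectors;
                void **finish_pointers;
        };

        static void free_raid_bio_pointers(struct raid_bio *rbio)
        {
                free(rbio->stripe_pages);
                free(rbio->bio_sectors);
                free(rbio->stripe_sectors);
                free(rbio->finish_pointers);
        }

        static struct raid_bio *alloc_raid_bio(int nr_pages, int nr_sectors)
        {
                struct raid_bio *rbio = calloc(1, sizeof(*rbio));

                if (!rbio)
                        return NULL;
                rbio->stripe_pages = calloc(nr_pages, sizeof(void *));
                rbio->bio_sectors = calloc(nr_sectors, sizeof(void *));
                rbio->stripe_sectors = calloc(nr_sectors, sizeof(void *));
                rbio->finish_pointers = calloc(nr_pages, sizeof(void *));
                if (!rbio->stripe_pages || !rbio->bio_sectors ||
                    !rbio->stripe_sectors || !rbio->finish_pointers) {
                        /* Shared between the error path and free_raid_bio(). */
                        free_raid_bio_pointers(rbio);
                        free(rbio);
                        return NULL;
                }
                return rbio;
        }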
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      797d74b7
    • btrfs: raid56: cleanup for function __free_raid_bio() · ff2b64a2
      Qu Wenruo committed
      The cleanup involves two things:
      
      - Remove the "__" prefix
        There is no naming conflict.
      
      - Remove the forward declaration
        There is no special function call involved.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ff2b64a2
  2. 24 Oct 2022, 2 commits
    • btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed · ab4c54c6
      Qu Wenruo committed
      Currently, if full_stripe_write() fails to allocate the pages for
      parity, it calls __free_raid_bio() first, then returns -ENOMEM.

      But some callers of full_stripe_write() will also call __free_raid_bio()
      again, causing a double free.

      It is not logically sound either: normally we should either free the
      memory at the same level where we allocated it, or let the endio handler
      deal with everything.

      So solve the double free by making raid56_parity_write() handle the
      error and free the rbio, just like what we do in
      raid56_parity_recover().
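
      A minimal model of the ownership rule (illustrative names, not the
      btrfs API): the inner helper only reports the failure, and the
      top-level submitter is the one place that frees the rbio, so no path
      frees it twice:

        #include <errno.h>
        #include <stdlib.h>

        struct raid_bio { void *parity_pages; };

        static int full_stripe_write(struct raid_bio *rbio)
        {
                rbio->parity_pages = malloc(4096);
                if (!rbio->parity_pages)
                        return -ENOMEM; /* report only; do NOT free rbio here */
                /* ... queue the actual stripe writes ... */
                return 0;
        }

        /* The submitter is the single owner of the error path: it frees the
         * rbio exactly once. */
        static int raid56_parity_write(struct raid_bio *rbio)
        {
                int ret = full_stripe_write(rbio);

                if (ret < 0) {
                        free(rbio->parity_pages);
                        free(rbio);
                }
                return ret;
        }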
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ab4c54c6
    • btrfs: raid56: properly handle the error when unable to find the missing stripe · f15fb2cd
      Qu Wenruo committed
      In raid56_alloc_missing_rbio(), if we can not determine where the
      missing device is inside the full stripe, we just BUG_ON().
      
      This is not necessary, especially since the only caller inside scrub.c
      already checks the return value properly and will treat it as a memory
      allocation failure.
      
      Fix the error handling by:
      
      - Add an extra warning for the reason
        Although personally speaking it may be better to be an ASSERT().
      
      - Properly free the allocated rbio
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f15fb2cd
  3. 26 Sep 2022, 1 commit
  4. 25 Jul 2022, 18 commits
    • btrfs: raid56: transfer the bio counter reference to the raid submission helpers · b9af128d
      Christoph Hellwig committed
      Transfer the bio counter reference acquired by btrfs_submit_bio to
      raid56_parity_write and raid56_parity_recover together with the bio
      that the reference was acquired for, instead of acquiring another
      reference in those helpers and dropping the original one in
      btrfs_submit_bio.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b9af128d
    • btrfs: do not return errors from raid56_parity_recover · 6065fd95
      Christoph Hellwig committed
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission does and avoids any confusion on who
      needs to handle errors.
      
      Also use the proper bool type for the generic_io argument.
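
      As a simplified stand-alone model of that convention (illustrative
      names, not the btrfs or block layer API): on a setup error the
      submission path records the error in the bio and calls its end_io
      handler, so callers never branch on a return value:

        struct bio;
        typedef void (*bio_end_io_t)(struct bio *bio);

        struct bio {
                int status;     /* 0 on success, negative errno on error */
                bio_end_io_t end_io;
        };

        /* Mark the bio as failed and complete it; the bio is consumed. */
        static void bio_fail_and_end(struct bio *bio, int error)
        {
                bio->status = error;
                if (bio->end_io)
                        bio->end_io(bio);
        }

        static void submit_recover(struct bio *bio, int setup_error)
        {
                if (setup_error) {
                        bio_fail_and_end(bio, setup_error);
                        return; /* nothing left for the caller to handle */
                }
                /* ... normal recovery submission ... */
        }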
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6065fd95
    • btrfs: do not return errors from raid56_parity_write · 31683f4a
      Christoph Hellwig committed
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission does and avoids any confusion on who
      needs to handle errors.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31683f4a
    • btrfs: raid56: use fixed stripe length everywhere · ff18a4af
      Christoph Hellwig committed
      The raid56 code assumes a fixed stripe length of BTRFS_STRIPE_LEN, yet
      there are functions passing it around as an argument, which is not
      necessary. The fixed value has been used for a long time, and though the
      stripe length should be configurable via the super block member
      stripesize, this hasn't been implemented and would require more changes,
      so we don't need to keep this code around until then.
      
      Partially based on a patch from Qu Wenruo.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ff18a4af
    • btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() · f6065f8e
      Qu Wenruo committed
      [BUG]
      There is a small workload which will always fail with recent kernel:
      (A simplified version from btrfs/125 test case)
      
        mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
        sync
        umount $mnt
        btrfs dev scan -u $dev3
        mount -o degraded $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
        umount $mnt
        btrfs dev scan
        mount $dev1 $mnt
        btrfs balance start --full-balance $mnt
        umount $mnt
      
      The failure is always a failure to read some tree blocks:
      
        BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        ...
      
      [CAUSE]
      With the recently added debug output, we can see all RAID56 operations
      related to full stripe 38928384:
      
        56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
        56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
        56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
        56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
        56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
        56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
      
      Before we enter raid56_parity_recover(), we have triggered some metadata
      writes for the full stripe 38928384, which leads us to read all the
      sectors from disk.
      
      Furthermore, the btrfs raid56 write path caches its calculated P/Q
      sectors to avoid unnecessary reads.

      This means that for that full stripe, after any partial write, we will
      have stale data, along with P/Q calculated using that stale data.
      
      Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
      which has data stripes" we haven't submitted all the corrupted P/Q to disk.
      
      When we really need to recover a certain range, i.e. in
      raid56_parity_recover(), we will use the cached rbio along with its
      cached sectors (the full stripe is all cached).

      This explains why no raid56_scrub_read_recover() event was triggered.
      
      Since we have the cached P/Q which is calculated using the stale data,
      the recovered one will just be stale.
      
      In our particular test case, it will always return the same incorrect
      metadata, thus causing the same error message "parent transid verify
      failed on 39010304 wanted 9 found 7" again and again.
      
      [BTRFS DESTRUCTIVE RMW PROBLEM]
      
      Test case btrfs/125 (and above workload) always has its trouble with
      the destructive read-modify-write (RMW) cycle:
      
              0       32K     64K
      Data1:  | Good  | Good  |
      Data2:  | Bad   | Bad   |
      Parity: | Good  | Good  |
      
      In the above case, if we trigger any write into Data1, we will use the
      bad data in Data2 to re-generate parity, killing the only chance to
      recover Data2; the data in Data2 is lost forever.
      
      This destructive RMW cycle is not specific to btrfs RAID56, but there
      are some btrfs specific behaviors making the case even worse:
      
      - Btrfs will cache sectors for unrelated vertical stripes.
      
        In the above example, if we're only writing into the 0~32K range,
        btrfs will still read the data range (32K ~ 64K) of Data1, and
        (64K~128K) of Data2.  This behavior is to cache sectors for a later
        update.

        Incidentally, commit d4e28d9b ("btrfs: raid56: make steal_rbio()
        subpage compatible") has a bug which makes RAID56 never trust the
        cached sectors, which slightly improves the situation for recovery.

        Unfortunately, the follow-up fix "btrfs: update stripe_sectors::uptodate
        in steal_rbio" reverts the behavior back to the old one.
      
      - Btrfs raid56 partial write will update all P/Q sectors and cache them
      
        This means that even if the data at (64K ~ 96K) of Data2 is free space
        and only (96K ~ 128K) of Data2 is really stale data, a write into that
        (96K ~ 128K) range will update all the parity sectors of the full
        stripe.
      
        This unnecessary behavior will completely kill the chance of recovery.
      
        Thankfully, an unrelated optimization "btrfs: only write the sectors
        in the vertical stripe which has data stripes" will prevent
        submitting the write bio for untouched vertical sectors.
      
        That optimization will keep the on-disk P/Q untouched for a chance for
        later recovery.
      
      [FIX]
      Although we have no good way to completely fix the destructive RMW
      (unless we go full scrub for each partial write), we can still limit the
      damage.
      
      With patch "btrfs: only write the sectors in the vertical stripe which
      has data stripes" now we won't really submit the P/Q of unrelated
      vertical stripes, so the on-disk P/Q should still be fine.
      
      Now all we really need to do is drop all the cached sectors when doing
      recovery.

      This way we have a chance to read the original P/Q from disk and recover
      the stale data, while still keeping the cache to speed up the regular
      write path.

      In fact, just dropping all the cache for the recovery path is good
      enough to allow the test case btrfs/125, along with the small script
      above, to pass reliably.

      The lack of metadata writes after the degraded mount, together with the
      forced metadata COW, is what saves us this time.

      So fix the behavior by not trusting any cached sectors in
      __raid56_parity_recover(), solving the problem while still keeping the
      cache useful.

      But please note that this passing test DOES NOT mean we have solved the
      destructive RMW problem; we just do damage control a little better.
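
      In effect the recovery path now starts like the following sketch
      (simplified, illustrative field names, not the real btrfs structures):
      every cached sector has its uptodate flag cleared before the rebuild,
      forcing a fresh read of data and P/Q from disk:

        #include <stdbool.h>

        struct sector { bool uptodate; };

        /* Do not trust any cached sector content when recovering: clear the
         * uptodate flags so data and P/Q are re-read from disk first. */
        static void distrust_cached_sectors(struct sector *sectors, int nr_sectors)
        {
                for (int i = 0; i < nr_sectors; i++)
                        sectors[i].uptodate = false;
        }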
      
      Related patches:
      
      - btrfs: only write the sectors in the vertical stripe
      - d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      - btrfs: update stripe_sectors::uptodate in steal_rbio
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f6065f8e
    • btrfs: use btrfs_raid_array to calculate number of parity stripes · 0b30f719
      Qu Wenruo committed
      Use the raid table instead of hard coded values and rename the helper as
      it is exported.  This could make later extension on RAID56 based
      profiles easier.
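
      Conceptually the helper becomes a table lookup instead of hard-coded
      values; a minimal stand-alone sketch (the table below is a simplified
      stand-in, not the real btrfs_raid_array):

        enum raid_profile { RAID_RAID5, RAID_RAID6, RAID_NR };

        /* Per-profile attributes, indexed by profile. */
        static const struct { int nparity; } raid_table[RAID_NR] = {
                [RAID_RAID5] = { .nparity = 1 },
                [RAID_RAID6] = { .nparity = 2 },
        };

        static int nr_parity_stripes(enum raid_profile profile)
        {
                return raid_table[profile].nparity;
        }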
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0b30f719
    • btrfs: raid56: avoid double for loop inside raid56_parity_scrub_stripe() · 1c10702e
      Qu Wenruo committed
      Originally it iterates over all the sectors whose vertical stripe has a
      bit set in dbitmap.

      This can easily be converted to iteration by sector bytenr with a
      test_bit() call.
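
      A sketch of the converted loop (simplified: dbitmap is modelled as a
      plain unsigned long and the names are stand-ins): one counter walks all
      sectors in bytenr order and sectors whose vertical stripe has no dbitmap
      bit set are skipped:

        /* One flat loop over all sectors; skip sectors whose vertical stripe
         * (sector number inside the stripe) is not marked in dbitmap. */
        static void iterate_marked_sectors(unsigned long dbitmap,
                                           int nr_stripes, int stripe_nsectors)
        {
                for (int total = 0; total < nr_stripes * stripe_nsectors; total++) {
                        int sectornr = total % stripe_nsectors;

                        if (!(dbitmap & (1UL << sectornr)))
                                continue;  /* no data in this vertical stripe */
                        /* ... process sector (total / stripe_nsectors, sectornr) ... */
                }
        }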
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1c10702e
    • btrfs: raid56: avoid double for loop inside raid56_rmw_stripe() · 550cdeb3
      Qu Wenruo committed
      This function doesn't even utilize the full stripe skip; just iterating
      over all the data sectors is definitely enough.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      550cdeb3
    • btrfs: raid56: avoid double for loop inside alloc_rbio_essential_pages() · aee35e4b
      Qu Wenruo committed
      The double loop is just checking if the page for the vertical stripe
      is allocated.
      
      We can easily convert it to a single loop and get rid of the @stripe
      variable.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      aee35e4b
    • btrfs: raid56: avoid double for loop inside __raid56_parity_recover() · ef340fcc
      Qu Wenruo committed
      The double for loop can easily be converted to a single for loop, as
      we're really iterating over the sectors in their bytenr order.
      
      The only exception is the full stripe skip, however that can also easily
      be done inside the loop.  Add an ASSERT() along with a comment for that
      specific case.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ef340fcc
    • btrfs: raid56: avoid double for loop inside finish_rmw() · 36920044
      Qu Wenruo committed
      We can easily calculate the stripe number and sector number inside the
      stripe.  Thus there is not much need for a double for loop.
      
      For the only case where we want to skip the whole stripe, we can
      manually increase @total_sector_nr.
      This is not recommended behavior, thus every time the iterator gets
      modified there will be a comment along with an ASSERT() for it.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36920044
    • btrfs: stop looking at btrfs_bio->iter in index_one_bio · 5eecef71
      Christoph Hellwig committed
      All the bios that index_one_bio operates on are the bios submitted by the
      upper layer.  These are never resubmitted to an actual device by the
      raid56 code, and thus the iter never changes from the initial state.
      Thus we can always just use bi_iter directly as it will be the same as
      the saved copy.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5eecef71
    • btrfs: defer I/O completion based on the btrfs_raid_bio · d34e123d
      Christoph Hellwig committed
      Instead of attaching an extra allocation and an indirect call to each
      low-level bio issued by the RAID code, add a work_struct to struct
      btrfs_raid_bio and only defer the per-rbio completion action.  The
      per-bio action for all the I/Os is trivial and can be safely done
      from interrupt context.

      As a nice side effect this also allows sharing the boilerplate code
      for the per-bio completions.
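
      A rough stand-alone model of the split (simplified types, not the
      kernel's bio or workqueue API): the per-bio completion only counts down
      and records errors, which is interrupt-safe, and the last completion
      hands the heavy per-rbio action to a deferred work handler:

        #include <stdatomic.h>
        #include <stdbool.h>

        struct raid_bio {
                atomic_int bios_pending;        /* low-level bios still in flight */
                atomic_bool error_seen;
                void (*end_io_work)(struct raid_bio *rbio); /* deferred action */
        };

        /* Trivial per-bio completion, safe from interrupt context. */
        static void raid_bio_end_io(struct raid_bio *rbio, bool failed)
        {
                if (failed)
                        atomic_store(&rbio->error_seen, true);
                /* Last bio in flight: run the deferred per-rbio completion
                 * (in the kernel this would be queued to a workqueue). */
                if (atomic_fetch_sub(&rbio->bios_pending, 1) == 1)
                        rbio->end_io_work(rbio);
        }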
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d34e123d
    • btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo committed
      Add tracepoints for better insight into how the RAID56 data is submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:	the logical bytenr of the full stripe
        devid:	btrfs devid
        type:		raid stripe type.
               	DATA1:	the first data stripe
               	DATA2:	the second data stripe
               	PQ1:	the P stripe
               	PQ2:	the Q stripe
        offset:	the offset inside the stripe.
        opf:		the bio op type
        physical:	the physical offset the bio is for
        len:		the length of the bio
      
      The first two lines are from partial RMW read, which is reading the
      remaining data stripes from disks.
      
      The last two lines are for full stripe RMW write, which is writing the
      involved two 16K stripes (one for DATA1 stripe, one for P stripe).
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included in super.c, we have to export
      the needed structure definitions into 'raid56.h' and include that header
      in super.c, otherwise we cannot access those members.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      b8bea09a
    • btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo committed
      [BUG]
      With added debugging, it turns out the following write sequence causes
      unnecessary extra reads:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still a partial read, even though 389152768 was already cached by
         the first write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still a partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K write, even within the same full stripe, still
      triggers a read for previously cached data.

      This causes extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but during
      that conversion, there is one thing missing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but on
      rbio->stripe_sectors[i].uptodate, to determine if a sector is uptodate.

      This means that previously, once we switched the page pointers,
      everything was done, as the PageUptodate flag was still bound to those
      pages.

      But now we have to manually mark the involved sectors uptodate, or a
      later raid56_rmw_stripe() will find the stolen sectors not uptodate and
      assemble read bios for them, wasting IO.
      
      [FIX]
      We can easily fix the bug by also updating
      rbio->stripe_sectors[].uptodate in steal_rbio().
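
      A minimal sketch of the idea (simplified structures, not the real btrfs
      types): when the page pointers are stolen into the destination rbio, the
      matching per-sector uptodate flags are now set as well so the RMW path
      does not re-read them from disk:

        #include <stdbool.h>
        #include <stddef.h>

        struct sector { bool uptodate; };

        struct rbio {
                void **stripe_pages;
                struct sector *stripe_sectors;
                int nr_pages;
                int nr_sectors;
        };

        static void steal_rbio(struct rbio *src, struct rbio *dest)
        {
                /* Hand over the already-written pages. */
                for (int i = 0; i < src->nr_pages; i++) {
                        dest->stripe_pages[i] = src->stripe_pages[i];
                        src->stripe_pages[i] = NULL;
                }
                /* The missing piece of the original conversion: mark the
                 * corresponding destination sectors uptodate too. */
                for (int i = 0; i < src->nr_sectors; i++)
                        dest->stripe_sectors[i].uptodate = true;
        }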
      
      With this fixed, now the same write pattern no longer leads to the same
      unnecessary read:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d100466
    • btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo committed
      If we have only an 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that although we won't write any sectors to disk 2, we will still
      write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we waste quite some space to journal the unused parity
      stripes).
      
      So here we also utilize btrfs_raid_bio::dbitmap: anytime we queue a
      higher level bio into an rbio, we update rbio::dbitmap to indicate which
      vertical stripes we need to write back.

      And at finish_rmw(), we also check dbitmap to see whether we need to
      write any sector of a given vertical stripe, as sketched below.
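
      A simplified stand-alone sketch of the bookkeeping (one bit per
      vertical stripe in a plain unsigned long; names are illustrative):

        #include <stdbool.h>

        /* Mark the vertical stripes covered by [offset, offset + len) of the
         * data stripes, one bit per sectorsize-sized vertical stripe. */
        static void rbio_update_dbitmap(unsigned long *dbitmap, unsigned int offset,
                                        unsigned int len, unsigned int sectorsize)
        {
                unsigned int first = offset / sectorsize;
                unsigned int last = (offset + len - 1) / sectorsize;

                for (unsigned int bit = first; bit <= last; bit++)
                        *dbitmap |= 1UL << bit;
        }

        /* At write-back time, only touch vertical stripes that carry data. */
        static bool rbio_sector_needs_write(unsigned long dbitmap,
                                            unsigned int sectornr)
        {
                return dbitmap & (1UL << sectornr);
        }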
      
      So after the patch, the above example will only lead to the following
      writeback pattern:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd8f7e62
    • btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo committed
      Previously we used "unsigned long *" for those two bitmaps.

      But since we only support a fixed stripe length (64KiB, already checked
      in the tree-checker), "unsigned long *" is a waste of memory when we can
      just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
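
      For example, with a 64KiB stripe and 4KiB sectors there are only 16
      vertical stripes, which comfortably fits in one unsigned long.  A sketch
      of the safety check (illustrative, not the exact btrfs assertion):

        #include <assert.h>
        #include <limits.h>

        #define STRIPE_LEN (64 * 1024)

        static unsigned int calc_stripe_nsectors(unsigned int sectorsize)
        {
                unsigned int stripe_nsectors = STRIPE_LEN / sectorsize;

                /* dbitmap/pbitmap are plain unsigned long now, so every
                 * vertical stripe must map to one bit of it. */
                assert(stripe_nsectors <= sizeof(unsigned long) * CHAR_BIT);
                return stripe_nsectors;
        }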
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c67c68eb
    • btrfs: fix typos in comments · 143823cf
      David Sterba committed
      Codespell has found a few typos.
      Signed-off-by: David Sterba <dsterba@suse.com>
      143823cf
  5. 15 Jul 2022, 1 commit
  6. 16 May 2022, 15 commits