1. 25 7月, 2022 12 次提交
    • Q
      btrfs: raid56: avoid double for loop inside raid56_parity_scrub_stripe() · 1c10702e
      Qu Wenruo 提交于
      Originally it's iterating all the sectors which has dbitmap sector for
      the vertical stripe.
      
      It can be easily converted to sector bytenr iteration with an test_bit()
      call.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1c10702e
    • Q
      btrfs: raid56: avoid double for loop inside raid56_rmw_stripe() · 550cdeb3
      Qu Wenruo 提交于
      This function doesn't even utilize full stripe skip, just iterate all
      the data sectors is definitely enough.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      550cdeb3
    • Q
      btrfs: raid56: avoid double for loop inside alloc_rbio_essential_pages() · aee35e4b
      Qu Wenruo 提交于
      The double loop is just checking if the page for the vertical stripe
      is allocated.
      
      We can easily convert it to single loop and get rid of @stripe variable.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      aee35e4b
    • Q
      btrfs: raid56: avoid double for loop inside __raid56_parity_recover() · ef340fcc
      Qu Wenruo 提交于
      The double for loop can be easily converted to single for loop as we're
      really iterating the sectors in their bytenr order.
      
      The only exception is the full stripe skip, however that can also easily
      be done inside the loop.  Add an ASSERT() along with a comment for that
      specific case.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ef340fcc
    • Q
      btrfs: raid56: avoid double for loop inside finish_rmw() · 36920044
      Qu Wenruo 提交于
      We can easily calculate the stripe number and sector number inside the
      stripe.  Thus there is not much need for a double for loop.
      
      For the only case we want to skip the whole stripe, we can manually
      increase @total_sector_nr.
      This is not a recommended behavior, thus every time the iterator gets
      modified there will be a comment along with an ASSERT() for it.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      36920044
    • C
      btrfs: stop looking at btrfs_bio->iter in index_one_bio · 5eecef71
      Christoph Hellwig 提交于
      All the bios that index_one_bio operates on are the bios submitted by the
      upper layer.  These are never resubmitted to an actual device by the
      raid56 code, and thus the iter never changes from the initial state.
      Thus we can always just use bi_iter directly as it will be the same as
      the saved copy.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5eecef71
    • C
      btrfs: defer I/O completion based on the btrfs_raid_bio · d34e123d
      Christoph Hellwig 提交于
      Instead of attaching an extra allocation an indirect call to each
      low-level bio issued by the RAID code, add a work_struct to struct
      btrfs_raid_bio and only defer the per-rbio completion action.  The
      per-bio action for all the I/Os are trivial and can be safely done
      from interrupt context.
      
      As a nice side effect this also allows sharing the boilerplate code
      for the per-bio completions
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d34e123d
    • Q
      btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo 提交于
      Add tracepoint for better insight to how the RAID56 data are submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:	the logical bytenr of the full stripe
        devid:	btrfs devid
        type:		raid stripe type.
               	DATA1:	the first data stripe
               	DATA2:	the second data stripe
               	PQ1:	the P stripe
               	PQ2:	the Q stripe
        offset:	the offset inside the stripe.
        opf:		the bio op type
        physical:	the physical offset the bio is for
        len:		the length of the bio
      
      The first two lines are from partial RMW read, which is reading the
      remaining data stripes from disks.
      
      The last two lines are for full stripe RMW write, which is writing the
      involved two 16K stripes (one for DATA1 stripe, one for P stripe).
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included at super.c, we have to export
      needed structure definitions to 'raid56.h' and include the header in
      super.c, or we're unable to access those members.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b8bea09a
    • Q
      btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo 提交于
      [BUG]
      With added debugging, it turns out the following write sequence would
      cause extra read which is unnecessary:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still partial read, even 389152768 is already cached by the first.
         write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K writes, even they are in the same full stripe,
      still trigger read for previously cached data.
      
      This would cause extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but during
      that conversion, there is one thing missing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but
      rbio->stripe_nsectors[i].uptodate to determine if a sector is uptodate.
      
      This means, previously if we switch the pointer, everything is done,
      as the PageUptodate flag is still bound to that page.
      
      But now we have to manually mark the involved sectors uptodate, or later
      raid56_rmw_stripe() will find the stolen sector is not uptodate, and
      assemble the read bio for it, wasting IO.
      
      [FIX]
      We can easily fix the bug, by also update the
      rbio->stripe_sectors[].uptodate in steal_rbio().
      
      With this fixed, now the same write pattern no longer leads to the same
      unnecessary read:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4d100466
    • Q
      btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo 提交于
      If we have only 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that, although we won't write any sectors from disk 2, but we will
      write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we waste quite some space to journal the unused parity
      stripes).
      
      So here we will also utilize the btrfs_raid_bio::dbitmap, anytime we
      queue a higher level bio into an rbio, we will update rbio::dbitmap to
      indicate which vertical stripes we need to writeback.
      
      And at finish_rmw(), we also check dbitmap to see if we need to write
      any sector in the vertical stripe.
      
      So after the patch, above example will only lead to the following
      writeback pattern:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bd8f7e62
    • Q
      btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo 提交于
      Previsouly we use "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c67c68eb
    • D
      btrfs: fix typos in comments · 143823cf
      David Sterba 提交于
      Codespell has found a few typos.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      143823cf
  2. 16 5月, 2022 22 次提交
  3. 27 10月, 2021 3 次提交
    • Q
      btrfs: remove btrfs_raid_bio::fs_info member · 6a258d72
      Qu Wenruo 提交于
      We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is
      always passed into alloc_rbio(), and only get released when the raid bio
      is released.
      
      Remove btrfs_raid_bio::fs_info member, and cleanup all the @fs_info
      parameters for alloc_rbio() callers.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6a258d72
    • Q
      btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo 提交于
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c3a3b19b
    • Q
      btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Qu Wenruo 提交于
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records
        how many mirrors are still pending, and save the original endio
        function of the bio.
      
      - RAID56 code
        In that case, RAID56 only utilize the stripes info, and no long uses
        that to trace the pending mirrors.
      
      So btrfs_bio is not always bind to a bio, and contains more info for IO
      context, thus renaming it will make the naming less confusing.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4c664611
  4. 23 8月, 2021 2 次提交
    • D
      btrfs: constify and cleanup variables in comparators · 214cc184
      David Sterba 提交于
      Comparators just read the data and thus get const parameters. This
      should be also preserved by the local variables, update all comparators
      passed to sort or bsearch.
      
      Cleanups:
      
      - unnecessary casts are dropped
      - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
        and 'inline' is dropped as the function address is taken
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      214cc184
    • D
      btrfs: drop from __GFP_HIGHMEM all allocations · b0ee5e1e
      David Sterba 提交于
      The highmem flag is used for allocating pages for compression and for
      raid56 pages. The high memory makes sense on 32bit systems but is not
      without problems. On 64bit system's it's just another layer of wrappers.
      
      The time the pages are allocated for compression or raid56 is relatively
      short (about a transaction commit), so the pages are not blocked
      indefinitely. As the number of pages depends on the amount of data being
      written/read, there's a theoretical problem. A fast device on a 32bit
      system could use most of the low memory pool, while with the highmem
      allocation that would not happen. This was possibly the original idea
      long time ago, but nowadays we optimize for 64bit systems.
      
      This patch removes all usage of the __GFP_HIGHMEM flag for page
      allocation, the kmap/kunmap are still in place and will be removed in
      followup patches. Remaining is masking out the bit in
      alloc_extent_state and __lookup_free_space_inode, that can safely stay.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b0ee5e1e
  5. 19 4月, 2021 1 次提交
    • I
      btrfs: raid56: convert kmaps to kmap_local_page · 94a0b58d
      Ira Weiny 提交于
      These kmaps are thread local and don't need to be atomic.  So they can use
      the more efficient kmap_local_page().  However, the mapping of pages in
      the stripes and the additional parity and qstripe pages are a bit
      trickier because the unmapping must occur in the opposite order from the
      mapping.  Furthermore, the pointer array in __raid_recover_end_io() may
      get reordered.
      
      Convert these calls to kmap_local_page() taking care to reverse the
      unmappings of any page arrays as well as being careful with the mappings
      of any special pages such as the parity and qstripe pages.
      Signed-off-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      94a0b58d