1. 25 Jul, 2022 5 commits
    • Q
      btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo committed
      Add tracepoint for better insight to how the RAID56 data are submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:	the logical bytenr of the full stripe
        devid:	btrfs devid
        type:		raid stripe type.
               	DATA1:	the first data stripe
               	DATA2:	the second data stripe
               	PQ1:	the P stripe
               	PQ2:	the Q stripe
        offset:	the offset inside the stripe.
        opf:		the bio op type
        physical:	the physical offset the bio is for
        len:		the length of the bio
      
      The first two lines are from partial RMW read, which is reading the
      remaining data stripes from disks.
      
      The last two lines are for full stripe RMW write, which is writing the
      two involved 32K ranges (one in the DATA1 stripe, one in the P stripe),
      matching len=32768 in the output above.
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included in super.c, we have to export
      the needed structure definitions to 'raid56.h' and include the header in
      super.c, or we're unable to access those members.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo committed
      [BUG]
      With added debugging, it turns out that the following write sequence
      causes an unnecessary extra read:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still partial read, even though 389152768 is already cached by the
         first write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K write, even when it is in the same full stripe,
      still triggers a read for previously cached data.
      
      This would cause extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but during
      that conversion, there is one thing missing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but on
      rbio->stripe_sectors[i].uptodate to determine if a sector is uptodate.
      
      This means, previously if we switch the pointer, everything is done,
      as the PageUptodate flag is still bound to that page.
      
      But now we have to manually mark the involved sectors uptodate, or later
      raid56_rmw_stripe() will find the stolen sector is not uptodate, and
      assemble the read bio for it, wasting IO.
      
      [FIX]
      We can easily fix the bug by also updating
      rbio->stripe_sectors[].uptodate in steal_rbio().
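
The idea of the fix can be sketched in a small userspace model. This is not the kernel code: `model_rbio`, `model_sector`, `model_steal_rbio()` and the two size constants are hypothetical names chosen for illustration, and the rule that only fully uptodate pages are stolen is an assumption of this model. It only shows that when a cached page pointer is moved to the destination, the per-sector uptodate flags have to be carried over explicitly, because the flag no longer lives on the page.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES         4   /* hypothetical: pages tracked per rbio */
#define MODEL_SECTORS_PER_PAGE 4   /* hypothetical: subpage case, 4 sectors per page */

struct model_sector {
	bool uptodate;
};

struct model_rbio {
	void *stripe_pages[MODEL_NR_PAGES];
	struct model_sector stripe_sectors[MODEL_NR_PAGES * MODEL_SECTORS_PER_PAGE];
};

/* Return true only if every sector backed by this page is uptodate. */
static bool model_page_fully_uptodate(const struct model_rbio *rbio, size_t page_nr)
{
	for (size_t s = 0; s < MODEL_SECTORS_PER_PAGE; s++) {
		if (!rbio->stripe_sectors[page_nr * MODEL_SECTORS_PER_PAGE + s].uptodate)
			return false;
	}
	return true;
}

/* Move fully uptodate cached pages from src to dest. */
static void model_steal_rbio(struct model_rbio *src, struct model_rbio *dest)
{
	for (size_t i = 0; i < MODEL_NR_PAGES; i++) {
		if (!src->stripe_pages[i] || !model_page_fully_uptodate(src, i))
			continue;

		dest->stripe_pages[i] = src->stripe_pages[i];
		src->stripe_pages[i] = NULL;

		/*
		 * The step corresponding to the fix: switching the pointer is
		 * not enough, the sector flags must be marked uptodate too.
		 */
		for (size_t s = 0; s < MODEL_SECTORS_PER_PAGE; s++)
			dest->stripe_sectors[i * MODEL_SECTORS_PER_PAGE + s].uptodate = true;
	}
}
```

Without the inner loop in `model_steal_rbio()`, a later check of the destination's sector flags would report the stolen data as not uptodate, which corresponds to the extra read described in this commit.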
      
      With this fixed, now the same write pattern no longer leads to the same
      unnecessary read:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo committed
      If we have only an 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                          0  8K           32K             64K
      Disk 1  (data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that, although we won't write any sectors from disk 2, we will
      write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we waste quite some space to journal the unused parity
      stripes).
      
      So here we will also utilize the btrfs_raid_bio::dbitmap, anytime we
      queue a higher level bio into an rbio, we will update rbio::dbitmap to
      indicate which vertical stripes we need to writeback.
      
      And at finish_rmw(), we also check dbitmap to see if we need to write
      any sector in the vertical stripe.
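
A minimal userspace sketch of this bookkeeping follows, assuming 4K sectors and the fixed 64KiB stripe length. The `model_*` names and constants are hypothetical and do not come from the kernel source. Each write marks the vertical stripes (sector positions within a stripe) it touches, and the writeback step consults the bitmap before writing a parity sector.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SECTOR_SIZE     4096u   /* assumed sector size */
#define MODEL_STRIPE_LEN      65536u  /* fixed 64KiB stripe length */
#define MODEL_STRIPE_NSECTORS (MODEL_STRIPE_LEN / MODEL_SECTOR_SIZE)

/*
 * Mark the vertical stripes touched by a write of @len bytes starting at
 * byte @offset inside the data area of the full stripe. Both values are
 * assumed to be sector aligned.
 */
static unsigned long model_mark_dbitmap(unsigned long dbitmap,
					unsigned int offset, unsigned int len)
{
	assert(offset % MODEL_SECTOR_SIZE == 0 && len % MODEL_SECTOR_SIZE == 0);

	for (unsigned int cur = offset; cur < offset + len; cur += MODEL_SECTOR_SIZE) {
		/* Position of this sector inside its own data stripe. */
		unsigned int sector_nr = (cur % MODEL_STRIPE_LEN) / MODEL_SECTOR_SIZE;

		dbitmap |= 1UL << sector_nr;
	}
	return dbitmap;
}

/* Does the vertical stripe at @sector_nr need to be written back? */
static bool model_need_write(unsigned long dbitmap, unsigned int sector_nr)
{
	return (dbitmap >> sector_nr) & 1UL;
}

/* Count how many parity sectors would be written for this bitmap. */
static unsigned int model_count_parity_sectors(unsigned long dbitmap)
{
	unsigned int count = 0;

	for (unsigned int nr = 0; nr < MODEL_STRIPE_NSECTORS; nr++) {
		if (model_need_write(dbitmap, nr))
			count++;
	}
	return count;
}
```

For the 8K write at offset 0 used as the example in this commit, the model marks two vertical stripes, so only two 4K parity sectors (8K) are written instead of the full 64KiB.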
      
      So after the patch, above example will only lead to the following
      writeback pattern:
      
                          0  8K           32K             64K
      Disk 1  (data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo committed
      Previously we used "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
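
The change can be illustrated with a small userspace sketch. The struct and helper names below are hypothetical, and the bound check uses `<=` because one word can hold exactly that many bits; the exact form of the kernel ASSERT() is not shown in this message. The point is that with a fixed 64KiB stripe the number of sectors per stripe fits into a single `unsigned long`, so the bitmaps can be embedded as plain words.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the two bitmap members of the raid bio. */
struct model_rbio_bitmaps {
	unsigned long dbitmap;
	unsigned long finish_pbitmap;
};

/* Initialize both bitmaps, checking that one word is wide enough. */
static void model_init_bitmaps(struct model_rbio_bitmaps *b,
			       unsigned int stripe_len, unsigned int sectorsize)
{
	unsigned int stripe_nsectors = stripe_len / sectorsize;

	/* Safety net: every sector of a stripe needs its own bit. */
	assert(stripe_nsectors <= MODEL_BITS_PER_LONG);

	b->dbitmap = 0;
	b->finish_pbitmap = 0;
}

static void model_set_bit(unsigned long *map, unsigned int nr)
{
	*map |= 1UL << nr;
}

static bool model_test_bit(unsigned long map, unsigned int nr)
{
	return (map >> nr) & 1UL;
}
```

With a 64KiB stripe and 4K sectors there are 16 sectors per stripe, which fits in an `unsigned long` on both 32bit and 64bit builds.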
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: fix typos in comments · 143823cf
      David Sterba committed
      Codespell has found a few typos.
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 16 May, 2022 22 commits
  3. 27 Oct, 2021 3 commits
    • Q
      btrfs: remove btrfs_raid_bio::fs_info member · 6a258d72
      Qu Wenruo committed
      We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is
      always passed into alloc_rbio(), and is only released when the raid bio
      is released.
      
      Remove btrfs_raid_bio::fs_info member, and cleanup all the @fs_info
      parameters for alloc_rbio() callers.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo committed
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Qu Wenruo committed
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records
        how many mirrors are still pending, and saves the original endio
        function of the bio.
      
      - RAID56 code
        In that case, RAID56 only utilizes the stripes info, and no longer uses
        it to trace the pending mirrors.
      
      So btrfs_bio is not always bound to a bio, and contains more info for IO
      context, thus renaming it will make the naming less confusing.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 23 Aug, 2021 2 commits
    • D
      btrfs: constify and cleanup variables in comparators · 214cc184
      David Sterba committed
      Comparators just read the data and thus get const parameters. This
      should be also preserved by the local variables, update all comparators
      passed to sort or bsearch.
      
      Cleanups:
      
      - unnecessary casts are dropped
      - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
        and 'inline' is dropped as the function address is taken
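
The pattern described above can be sketched in plain C with the standard `qsort()`. The struct and comparator below are hypothetical stand-ins, not the btrfs definitions, and the descending sort order is an assumption of this example. The comparator takes `const void *` parameters, keeps the constness in its local variables, needs no casts, and is not marked `inline` because its address is passed to the sort routine.

```c
#include <stdlib.h>

/* Hypothetical per-device record used only for this example. */
struct model_device_info {
	unsigned long long max_avail;
	unsigned long long total_avail;
};

/* Sort by max_avail, then total_avail, both in descending order. */
static int model_cmp_device_info(const void *a, const void *b)
{
	/* Implicit conversion from const void *, no cast and no lost const. */
	const struct model_device_info *di_a = a;
	const struct model_device_info *di_b = b;

	if (di_a->max_avail > di_b->max_avail)
		return -1;
	if (di_a->max_avail < di_b->max_avail)
		return 1;
	if (di_a->total_avail > di_b->total_avail)
		return -1;
	if (di_a->total_avail < di_b->total_avail)
		return 1;
	return 0;
}

/* Convenience wrapper showing how the comparator is passed by address. */
static void model_sort_devices(struct model_device_info *devs, size_t nr)
{
	qsort(devs, nr, sizeof(*devs), model_cmp_device_info);
}
```

Because the locals are `const`-qualified, an accidental write through `di_a` or `di_b` inside the comparator would be rejected by the compiler.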
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: drop from __GFP_HIGHMEM all allocations · b0ee5e1e
      David Sterba committed
      The highmem flag is used for allocating pages for compression and for
      raid56 pages. The high memory makes sense on 32bit systems but is not
      without problems. On 64bit systems it's just another layer of wrappers.
      
      The time the pages are allocated for compression or raid56 is relatively
      short (about a transaction commit), so the pages are not blocked
      indefinitely. As the number of pages depends on the amount of data being
      written/read, there's a theoretical problem. A fast device on a 32bit
      system could use most of the low memory pool, while with the highmem
      allocation that would not happen. This was possibly the original idea
      a long time ago, but nowadays we optimize for 64bit systems.
      
      This patch removes all usage of the __GFP_HIGHMEM flag for page
      allocation, the kmap/kunmap are still in place and will be removed in
      followup patches. Remaining is masking out the bit in
      alloc_extent_state and __lookup_free_space_inode, that can safely stay.
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 19 Apr, 2021 3 commits
    • I
      btrfs: raid56: convert kmaps to kmap_local_page · 94a0b58d
      Ira Weiny committed
      These kmaps are thread local and don't need to be atomic.  So they can use
      the more efficient kmap_local_page().  However, the mapping of pages in
      the stripes and the additional parity and qstripe pages are a bit
      trickier because the unmapping must occur in the opposite order from the
      mapping.  Furthermore, the pointer array in __raid_recover_end_io() may
      get reordered.
      
      Convert these calls to kmap_local_page() taking care to reverse the
      unmappings of any page arrays as well as being careful with the mappings
      of any special pages such as the parity and qstripe pages.
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • I
      btrfs: convert kmap to kmap_local_page, simple cases · 58c1a35c
      Ira Weiny committed
      Use a simple coccinelle script to help convert the most common
      kmap()/kunmap() patterns to kmap_local_page()/kunmap_local().
      
      Note that some kmaps which were caught by this script needed to be
      handled by hand because of the strict unmapping order of kunmap_local()
      so they are not included in this patch.  But this script got us started.
      
      There's another temp variable added for the final length write to the
      first page so it does not interfere with cpage_out that is used for
      mapping other pages.
      
      The development of this patch was aided by the following script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap and replace with kmap_local_page then mark kunmap
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      
      @ catch_all @
      expression e, e2;
      @@
      
      (
      -kmap(e)
      +kmap_local_page(e)
      )
      ...
      (
      -kunmap(...)
      +kunmap_local()
      )
      
      // </smpl>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: remove duplicated in_range() macro · cea62800
      Johannes Thumshirn committed
      The in_range() macro is defined twice in btrfs' source, once in ctree.h
      and once in misc.h.
      
      Remove the definition in ctree.h and include misc.h in the files depending
      on it.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 09 Apr, 2021 1 commit
  7. 26 Feb, 2021 1 commit
    • I
      btrfs: use copy_highpage() instead of 2 kmaps() · 80cc8384
      Ira Weiny committed
      There are many places where kmap/memmove/kunmap patterns occur.
      
      This pattern exists in the core common function copy_highpage().
      
      Use copy_highpage() to avoid open coding the use of kmap, and leverage
      the core function's use of kmap_local_page().
      
      Development of this patch was aided by the following coccinelle script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap/copypage/kunmap pattern and replace with copy_highpage calls
      //
      // NOTE: The expressions in the copy page version of this kmap pattern are
      // overly complex and so these all need individual attention.
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      // Comments:
      // Options:
      
      //
      // Then a copy_page where we have 2 pages involved.
      //
      @ copy_page_rule @
      expression page, page2, To, From, Size;
      identifier ptr, ptr2;
      type VP, VP2;
      @@
      
      /* kmap */
      (
      -VP ptr = kmap(page);
      ...
      -VP2 ptr2 = kmap(page2);
      |
      -VP ptr = kmap_atomic(page);
      ...
      -VP2 ptr2 = kmap_atomic(page2);
      |
      -ptr = kmap(page);
      ...
      -ptr2 = kmap(page2);
      |
      -ptr = kmap_atomic(page);
      ...
      -ptr2 = kmap_atomic(page2);
      )
      
      // 1 or more copy versions of the entire page
      <+...
      (
      -copy_page(To, From);
      +copy_highpage(To, From);
      |
      -memmove(To, From, Size);
      +memmoveExtra(To, From, Size);
      )
      ...+>
      
      /* kunmap */
      (
      -kunmap(page2);
      ...
      -kunmap(page);
      |
      -kunmap(page);
      ...
      -kunmap(page2);
      |
      -kmap_atomic(ptr2);
      ...
      -kmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on copy_page_rule
      @
      identifier copy_page_rule.ptr;
      identifier copy_page_rule.ptr2;
      type VP, VP1;
      type VP2, VP21;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      -VP2 ptr2;
      	... when != ptr2;
      ? VP21 ptr2;
      
      // </smpl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 23 Feb, 2021 1 commit
    • I
      btrfs: fix raid6 qstripe kmap · d70cef0d
      Ira Weiny committed
      When a qstripe is required an extra page is allocated and mapped.  There
      were 3 problems:
      
      1) There is no corresponding call of kunmap() for the qstripe page.
      2) There is no reason to map the qstripe page more than once if the
         number of bits set in rbio->dbitmap is greater than one.
      3) There is no reason to map the parity page and unmap it each time
         through the loop.
      
      The page memory can continue to be reused with a single mapping on each
      iteration by raid6_call.gen_syndrome() without remapping.  So map the
      page for the duration of the loop.
      
      Similarly, improve the algorithm by mapping the parity page just 1 time.
      
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4.x: c17af965: btrfs: raid56: simplify tracking of Q stripe presence
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 09 Feb, 2021 1 commit
  10. 25 Jan, 2021 1 commit