1. 25 Jul 2022, 35 commits
    • btrfs: do not BUG_ON() on failure to reserve metadata for delayed item · 3bae13e9
      Filipe Manana committed
      At btrfs_insert_delayed_dir_index(), we don't expect the metadata
      reservation for the delayed dir index item insertion to fail, because the
      caller is supposed to have reserved 1 unit of metadata space for that.
      All callers are able to deal with an error in case that happens, so there
      is no need for something so drastic as a BUG_ON() in case of failure.
      Instead just emit a warning, so that's easily noticed during development
      (fstests in particular), and return the error to the caller.
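
      As a rough illustration of the pattern described above (hypothetical
      helper name, not the exact kernel diff), the change turns a fatal
      assertion into a warning plus error propagation:

        /* reserve_delayed_item_metadata() is a placeholder name */
        ret = reserve_delayed_item_metadata(trans, item);
        if (ret < 0) {
                /* The caller should have reserved 1 unit of metadata
                 * space, so this is unexpected: warn instead of
                 * BUG_ON() and let the caller deal with the error. */
                WARN_ON(ret);
                return ret;
        }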
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3bae13e9
    • btrfs: improve batch insertion of delayed dir index items · 06ac264f
      Filipe Manana committed
      Currently we group delayed dir index items for insertion as a single batch
      (a single btree operation) as long as their keys are sequential in the key
      space.
      
      For example we have delayed index items for the following index keys:
      
         10, 11, 12, 15, 16, 20, 21
      
      We end up building three batches:
      
      1) First one for index keys 10, 11 and 12;
      2) Second one for index keys 15 and 16;
      3) Third one for index keys 20 and 21.
      
      However, since the dir index numbers come from a monotonically increasing
      counter and are never reused, we could group all these items into a single
      batch. The existence of holes in the sequence happens only when we had
      delayed dir index items for insertion that got deleted before they were
      flushed to the subvolume's tree.
      
      The delayed items are stored in an rbtree based on their key order, so
      we can just group items into a batch as long as they all fit in a leaf,
      and ignore if there's a gap (key offset, index number) between two
      consecutive items. This is more efficient and reduces the amount of
      time spent when running delayed items if there are gaps between dir
      index items.
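
      The grouping logic can be pictured with the following simplified
      sketch (hypothetical helper names, not the actual kernel code): keep
      extending the batch while the next delayed item still fits in the
      leaf, ignoring any gaps between index numbers.

        curr = first_delayed_insertion_item(node);   /* placeholder */
        batch_size = 0;
        nitems = 0;
        while (curr) {
                u32 item_size = curr->data_len + sizeof(struct btrfs_item);

                if (batch_size + item_size > leaf_free_space)
                        break;
                batch_size += item_size;
                nitems++;
                curr = next_delayed_item(curr);      /* rbtree successor */
        }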
      
      For example running the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=100
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
             echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        sync
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nsync took $dur milliseconds"
      
        umount $MNT
      
      While having the following bpftrace script running in another shell:
      
        $ cat bpf-delayed-items-inserts.sh
        #!/usr/bin/bpftrace
      
        /* Must add 'noinline' to btrfs_insert_delayed_items(). */
        k:btrfs_insert_delayed_items
        {
            @start_insert_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_insert_empty_items
        /@start_insert_delayed_items[tid]/
        {
           @insert_batches = count();
        }
      
        kr:btrfs_insert_delayed_items
        /@start_insert_delayed_items[tid]/
        {
            $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000;
            @btrfs_insert_delayed_items_total_time = sum($dur);
            delete(@start_insert_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_insert_delayed_items_total_time: 576
      @insert_batches: 51
      
      After this change:
      
      @btrfs_insert_delayed_items_total_time: 174
      @insert_batches: 2
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      06ac264f
    • btrfs: assert that delayed item is a dir index item when adding it · a176affe
      Filipe Manana committed
      All delayed items are for dir index items; we don't support any other
      item types at the moment. So simplify __btrfs_add_delayed_item() and
      add an assertion for checking the item's key type. This also allows the
      next change to be simpler and avoid having to check key types. In case
      we add support for different item types in the future, we'll hit the
      assertion during development and be able to adjust any code that
      assumes delayed items are always associated with dir index items.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a176affe
    • btrfs: improve batch deletion of delayed dir index items · 4bd02d90
      Filipe Manana committed
      Currently we group delayed dir index items for deletion in a single batch
      (single btree operation) as long as they all exist in the same leaf and as
      long as their keys are sequential in the key space. For example if we have
      a leaf that has dir index items with offsets:
      
          2, 3, 4, 6, 7, 10
      
      And we have delayed dir index items for deleting all these indexes, and
      no delayed items for any other index keys in between, then we end up
      deleting in 3 batches:
      
      1) First batch for indexes 2, 3 and 4;
      2) Second batch for indexes 6 and 7;
      3) Third batch for index 10.
      
      This is a waste because we can delete all the index keys in a single
      batch. What matters is that each consecutive delayed index key matches
      each consecutive dir index key in a leaf.
      
      So update the logic at btrfs_batch_delete_items() to check only for a
      key match between delayed dir index items and dir index items in a leaf.
      Also avoid the useless first iteration on comparing the key of the
      first slot to delete with the key of the first delayed item, as it's
      silly since they always match, as the delayed item's key was used for
      the btree search that gave us the path we have.
      
      This is more efficient and reduces runtime of running delayed items, as
      well as lock contention on the subvolume's tree.
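
      Conceptually, the updated batching loop looks like this simplified
      sketch (hypothetical variable names, not the literal kernel code):
      keep extending the batch while the next delayed item's key matches
      the next key in the leaf.

        slot = path->slots[0] + 1;
        next = next_delayed_item(curr);              /* placeholder */
        while (slot < btrfs_header_nritems(leaf) && next) {
                struct btrfs_key key;

                btrfs_item_key_to_cpu(leaf, &key, slot);
                if (btrfs_comp_cpu_keys(&next->key, &key) != 0)
                        break;
                nitems++;
                slot++;
                next = next_delayed_item(next);
        }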
      
      For example, the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=1000
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # Sync to force any delayed items to be flushed to the tree.
        sync
      
        start=$(date +%s%N)
        rm -fr $MNT/testdir
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nrm -fr took $dur milliseconds"
      
        umount $MNT
      
      Running that test script while having the following bpftrace script
      running in another shell:
      
        $ cat bpf-measure.sh
        #!/usr/bin/bpftrace
      
        /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */
        k:btrfs_delete_delayed_items
        {
            @start_delete_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_del_items
        /@start_delete_delayed_items[tid]/
        {
            @delete_batches = count();
        }
      
        kr:btrfs_delete_delayed_items
        /@start_delete_delayed_items[tid]/
        {
            $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000;
            @btrfs_delete_delayed_items_total_time = sum($dur);
            delete(@start_delete_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_delete_delayed_items_total_time: 9563
      @delete_batches: 1001
      
      After this change:
      
      @btrfs_delete_delayed_items_total_time: 7328
      @delete_batches: 509
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4bd02d90
    • btrfs: refactor the delayed item deletion entry point · 36baa2c7
      Filipe Manana committed
      The delayed item deletion entry point, btrfs_delete_delayed_items(), is a
      bit convoluted for a few reasons:
      
      1) It's really a loop disguised with labels and goto statements;
      
      2) There's a 'delete_fail' label which isn't only for error cases, we can
         jump to that label even if no error happened, if we simply don't have
         more delayed items to delete;
      
      3) Unnecessarily keeps track of the current and previous items for no
         good reason, as after getting the next item and releasing the current
         one, it just jumps to the 'again' label just to look again for the
         first delayed item;
      
      4) When a delayed item is not in the tree (because it was already deleted
         before), it releases the item while holding a path locked, which is
         not necessary and adds more contention to the tree, especially taking
         into account that the path came from a deletion search, meaning we have
         write locks for nodes at levels 2, 1 and 0. And releasing the item is
         not computationally trivial (rb tree deletion, a kfree() and some
         trivial things).
      
      So refactor it to use a while loop and add some comments to make it more
      obvious why we can have delayed items without a matching item in the tree
      as well as why we do not keep the delayed node locked all the time when running
      all its deletion items. This is also a preparation for some upcoming work
      involving delayed items.
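
      A rough sketch of the refactored loop shape (placeholder helper names,
      not the literal kernel code):

        item = first_delayed_deletion_item(node);    /* placeholder */
        while (item) {
                ret = delete_matching_items_from_tree(trans, root, path, item);
                btrfs_release_path(path);            /* drop tree locks first */
                release_delayed_item(item);          /* then free the item */
                if (ret)
                        break;
                item = first_delayed_deletion_item(node);
        }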
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36baa2c7
    • btrfs: deal with deletion errors when deleting delayed items · 2b1d260d
      Filipe Manana committed
      Currently, btrfs_delete_delayed_items() ignores any errors returned from
      btrfs_batch_delete_items(). This looks fishy but it's not a problem at
      the moment because:
      
      1) Two of the errors returned from btrfs_batch_delete_items() are for
         impossible cases, cases where a delayed item does not match any item
         in the leaf the path points to - btrfs_delete_delayed_items() always
         calls btrfs_batch_delete_items() with a path that points to a leaf
         that contains an item matching a delayed item;
      
      2) btrfs_batch_delete_items() may return an error from btrfs_del_items(),
         in which case it does not release the delayed items of the batch.
      
         At the moment this is harmless because btrfs_del_items() is in fact
         always able to delete items, even when it returns an error: it
         returns an error when it ends up with a leaf that is mostly empty
         (less than 1/3 full) and fails to migrate items from that leaf into
         its neighbour leaves. This is not critical, as all the items were
         deleted; we just left the tree a bit unbalanced, but it is still a
         valid tree and causes no harm, and future operations on the tree
         will eventually balance it.
      
         So even if we get an error from btrfs_del_items(), the delayed items
         will not be released but the next time we run delayed items we will
         find out, at btrfs_delete_delayed_items(), that they are not present
         in the tree anymore and then release them.
      
      This is all a bit subtle, and it's certainly prone to be a disaster in
      case btrfs_del_items() changes one day and may return errors before being
      able to delete all the requested items, in which case we could leave the
      filesystem in an inconsistent state as we would commit a transaction
      despite a failure from deleting items from the tree.
      
      So make btrfs_delete_delayed_items() check for any errors from the call
      to btrfs_batch_delete_items().
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2b1d260d
    • btrfs: add assertions when deleting batches of delayed items · 659192e6
      Filipe Manana committed
      There are a few impossible cases that btrfs_batch_delete_items() tries to
      deal with:
      
      1) Getting a path pointing to a NULL leaf;
      2) The leaf slot is pointing beyond the last item in the leaf;
      3) We can't find a single item to delete.
      
      The first case is impossible because the given path was returned by a
      successful call to btrfs_search_slot(). Replace the BUG_ON() with an
      ASSERT for this.
      
      The second case is impossible because we are always called when a delayed
      item matches an item in the given leaf. So add an ASSERT() for that and
      if that condition is not satisfied, trigger a warning and return an error.
      
      The third case is impossible exactly because of the same reason as the
      second case. The given delayed item matches one item in the leaf, so we
      know that our batch always has at least one item. Add an ASSERT to check
      that, trigger a warning if that expectation fails and return an error.
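
      The resulting pattern for the second and third cases could look
      roughly like this (illustrative only; the exact error code and
      warning message are not shown here):

        if (slot >= btrfs_header_nritems(leaf)) {
                /* Should be impossible: the delayed item matched an item
                 * in this leaf, so the slot must be valid. */
                ASSERT(0);
                WARN_ON(1);
                return -ENOENT;    /* hypothetical error value */
        }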
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      659192e6
    • btrfs: balance btree dirty pages and delayed items after clone and dedupe · 6fe81a3a
      Filipe Manana committed
      When reflinking extents (clone and deduplication), we need to touch the
      btree of the destination inode's subvolume, as well as potentially
      create a delayed inode for the destination inode (if it was not created
      before). However we are neither balancing the btree dirty pages nor the
      delayed items after such operations, so if we have a task that is doing
      a long series of clone or deduplication operations, it can result in
      accumulation of too many btree dirty pages and delayed items.
      
      So just call btrfs_btree_balance_dirty() after clone and deduplication,
      just like we do for every other system call that results in modifying a
      btree and adding delayed items.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6fe81a3a
    • btrfs: free the path earlier when creating a new inode · 814e7718
      Filipe Manana committed
      When creating an inode, through btrfs_create_new_inode(), we release the
      path we allocated before once we don't need it anymore. But we keep it
      allocated until we return from that function, which is wasteful because
      after we release the path we do several things that can allocate yet
      another path: inheriting properties, setting the xattrs used by ACLs and
      security modules, adding an orphan item (O_TMPFILE case) or adding a
      dir item (for the non-O_TMPFILE case).
      
      So instead of releasing the path once we don't need it anymore, free it
      instead. This way we avoid having two paths allocated until we return
      from btrfs_create_new_inode().
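
      In btrfs API terms the difference is between releasing and freeing
      the path (a minimal sketch of the idea, not the exact diff):

        /* Before: only drop the references/locks held by the path... */
        btrfs_release_path(path);
        /* ...the structure itself is freed much later, at function exit. */

        /* After: drop the references and free the path right away. */
        btrfs_free_path(path);
        path = NULL;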
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      814e7718
    • btrfs: balance btree dirty pages and delayed items after a rename · ca6dee6b
      Filipe Manana committed
      A rename operation modifies a subvolume's btree, to remove the old dir
      item, add the new dir item, remove an inode ref and add a new inode ref.
      It can also create the delayed inode for the inodes involved in the
      operation, and it creates two delayed dir index items, one to delete
      the old name and another one to add the new name.
      
      However we are neither balancing the btree dirty pages nor the delayed
      items after a rename, which can result in accumulation of too many
      btree dirty pages and delayed items, especially if a task is doing a
      series of rename operations (for example it can happen for package
      installations/upgrades through the zypper tool).
      
      So just call btrfs_btree_balance_dirty() after a rename, just like we
      do for every other system call that results in modifying a btree and
      adding delayed items.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ca6dee6b
    • btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo committed
      Add tracepoints for better insight into how the RAID56 data is submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:	the logical bytenr of the full stripe
        devid:	btrfs devid
        type:		raid stripe type.
               	DATA1:	the first data stripe
               	DATA2:	the second data stripe
               	PQ1:	the P stripe
               	PQ2:	the Q stripe
        offset:	the offset inside the stripe.
        opf:		the bio op type
        physical:	the physical offset the bio is for
        len:		the length of the bio
      
      The first two lines are from partial RMW read, which is reading the
      remaining data stripes from disks.
      
      The last two lines are for full stripe RMW write, which is writing the
      involved two 16K stripes (one for DATA1 stripe, one for P stripe).
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included in super.c, we have to export
      the needed structure definitions to 'raid56.h' and include that header
      in super.c, otherwise we cannot access those members.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      b8bea09a
    • btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo committed
      [BUG]
      With added debugging, it turns out that the following write sequence
      causes an extra read which is unnecessary:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still partial read, even though 389152768 is already cached by the
         first write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K write, even when it lands in the same full stripe,
      still triggers a read for previously cached data.
      
      This would cause extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but during
      that conversion, there is one thing missing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but on
      rbio->stripe_sectors[i].uptodate, to determine if a sector is uptodate.
      
      This means that, previously, if we switched the page pointer, everything
      was done, as the PageUptodate flag is still bound to that page.
      
      But now we have to manually mark the involved sectors uptodate, or later
      raid56_rmw_stripe() will find the stolen sector is not uptodate, and
      assemble the read bio for it, wasting IO.
      
      [FIX]
      We can easily fix the bug by also updating
      rbio->stripe_sectors[].uptodate in steal_rbio().
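
      A minimal sketch of the idea (field names taken from the description
      above; the exact helper in the kernel may differ): when steal_rbio()
      moves the stripe pages from the source rbio to the destination rbio,
      also carry over the per-sector uptodate state.

        /* after stealing the cached full-stripe pages from 'src' into 'dest' */
        for (i = 0; i < dest->nr_sectors; i++)
                dest->stripe_sectors[i].uptodate = true;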
      
      With this fixed, now the same write pattern no longer leads to the same
      unnecessary read:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d100466
    • btrfs: remove redundant calls to flush_dcache_page · 21a8935e
      David Sterba committed
      Both memzero_page and memcpy_to_page already call flush_dcache_page so
      we can remove the calls from btrfs code.
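
      For illustration (a sketch of the pattern being removed, not a
      specific call site):

        /* Before: explicit flush after the copy */
        memcpy_to_page(page, pg_offset, src, len);
        flush_dcache_page(page);

        /* After: memcpy_to_page() already flushes the dcache */
        memcpy_to_page(page, pg_offset, src, len);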
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      21a8935e
    • btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo committed
      If we have only an 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that, although we won't write any sectors from disk 2, we will
      still write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we waste quite some space to journal the unused parity
      stripes).
      
      So here we will also utilize btrfs_raid_bio::dbitmap: anytime we
      queue a higher level bio into an rbio, we will update rbio::dbitmap to
      indicate which vertical stripes we need to write back.
      
      And at finish_rmw(), we also check dbitmap to see if we need to write
      any sector in the vertical stripe.
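
      As a rough sketch of the mechanism (shown with the generic bitmap
      helpers; the exact kernel code may differ):

        /* when a higher level bio is queued into the rbio */
        bitmap_set(&rbio->dbitmap, sector_nr, sector_cnt);

        /* at finish_rmw(), skip vertical stripes with no dirty data */
        if (!test_bit(sectornr, &rbio->dbitmap))
                continue;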
      
      So after the patch, the above example will only lead to the following
      writeback pattern:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd8f7e62
    • btrfs: use integrated bitmaps for scrub_parity::dbitmap and ebitmap · 381b9b4c
      Qu Wenruo committed
      Previously we used "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for scrub_parity.
      
      To be extra safe, add an ASSERT() making sure the calculated @nsectors is
      always smaller than BITS_PER_LONG.
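
      A minimal sketch of the change described above (simplified; the real
      struct has many more members):

        struct scrub_parity {
                unsigned long dbitmap;  /* was: unsigned long *dbitmap */
                unsigned long ebitmap;  /* was: unsigned long *ebitmap */
        };

        /* @nsectors must fit in a single unsigned long */
        ASSERT(nsectors <= BITS_PER_LONG);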
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      381b9b4c
    • btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo committed
      Previously we used "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c67c68eb
    • btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance · 099aa972
      Nikolay Borisov committed
      This eliminates 2 labels and makes the code generally more streamlined.
      Also rename the 'out_bargs' label to 'out_unlock' since bargs is going
      to be freed under the 'out' label. This also fixes a memory leak, since
      bargs wasn't correctly freed in one of the conditions that are now
      moved into btrfs_try_lock_balance.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      099aa972
    • btrfs: introduce btrfs_try_lock_balance · 7fb10ed8
      Nikolay Borisov committed
      This function contains the factored out locking sequence of
      btrfs_ioctl_balance. Having this piece of code separate helps to
      simplify btrfs_ioctl_balance, which has become too complicated. This
      will be used in the next patch to streamline the logic in
      btrfs_ioctl_balance.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7fb10ed8
    • btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio · 1e87770c
      Christoph Hellwig committed
      Use the new btrfs_bio_for_each_sector iterator to simplify
      btrfs_check_read_dio_bio.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e87770c
    • btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks · 261d812b
      Qu Wenruo committed
      Add a helper that works similar to __bio_for_each_segment, but instead of
      iterating over PAGE_SIZE chunks it iterates over each sector.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: split from a larger patch, and iterate over the offset instead of
            the offset bits]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add parameter comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      261d812b
    • btrfs: factor out a btrfs_csum_ptr helper · a89ce08c
      Christoph Hellwig committed
      Add a helper to find the csum for a byte offset into the csum buffer.
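
      Such a helper boils down to simple index arithmetic. A sketch of the
      idea, assuming one checksum of csum_size bytes per sectorsize block
      (the actual kernel helper may differ in details):

        static u8 *btrfs_csum_ptr(const struct btrfs_fs_info *fs_info,
                                  u8 *csums, u64 offset)
        {
                u64 index = offset >> fs_info->sectorsize_bits;

                return csums + index * fs_info->csum_size;
        }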
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a89ce08c
    • btrfs: refactor end_bio_extent_readpage code flow · 97861cd1
      Christoph Hellwig committed
      Untangle the goto and move the code it jumps to so it goes in the order
      of the most likely states first.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      97861cd1
    • btrfs: factor out a helper to end a single sector buffer I/O · a5aa7ab6
      Christoph Hellwig committed
      Add a helper to end I/O on a single sector, which will come in handy
      with the new read repair code.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a5aa7ab6
    • btrfs: remove duplicated parameters from submit_data_read_repair() · fd5a6f63
      Qu Wenruo committed
      The function submit_data_read_repair() is only called for the buffered
      data read path, thus those parameters can be calculated using the bvec
      directly:
      
      - start
        start = page_offset(bvec->bv_page) + bvec->bv_offset;
      
      - end
        end = start + bvec->bv_len - 1;
      
      - page
        page = bvec->bv_page;
      
      - pgoff
        pgoff = bvec->bv_offset;
      
      Thus we can safely replace those 4 parameters with just one bio_vec.
      
      Also remove the unused return value.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: also remove the return value]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fd5a6f63
    • btrfs: introduce a data checksum checking helper · ae643a74
      Qu Wenruo committed
      Although we have several pieces of data csum verification code, we never
      had a function that just verifies the checksum of one sector.

      check_data_csum() does extra work for error reporting, thus it requires
      a lot of extra things like the file offset, bio_offset, etc.

      btrfs_verify_data_csum() is even worse: it relies on the page Checked
      flag, which means it cannot be used for direct IO pages.

      Here we introduce a new helper, btrfs_check_sector_csum(), which only
      accepts a sector inside a page and the expected checksum pointer.

      We use this function to implement check_data_csum(), and export it for
      an incoming patch.
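
      Conceptually such a helper hashes one sector worth of data and
      compares it with the expected checksum. A hypothetical sketch, using
      the fs_info->csum_shash transform (the real helper may differ in
      signature and details):

        static int check_sector_csum(struct btrfs_fs_info *fs_info,
                                     struct page *page, u32 pgoff,
                                     u8 *csum, const u8 *csum_expected)
        {
                SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
                char *kaddr;

                shash->tfm = fs_info->csum_shash;
                kaddr = kmap_local_page(page);
                crypto_shash_digest(shash, kaddr + pgoff, fs_info->sectorsize,
                                    csum);
                kunmap_local(kaddr);

                return memcmp(csum, csum_expected, fs_info->csum_size) ? -EIO : 0;
        }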
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: keep passing the csum array as an argument, as the callers want
            to print it, rename per request]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae643a74
    • btrfs: quit early if the fs has no RAID56 support for raid56 related checks · b036f479
      Qu Wenruo committed
      The following functions do special handling for RAID56 chunks:
      
      - btrfs_is_parity_mirror()
        Check if the range is in RAID56 chunks.
      
      - btrfs_full_stripe_len()
        Either return sectorsize for non-RAID56 profiles or full stripe length
        for RAID56 chunks.
      
      But if a filesystem has no RAID56 chunks at all, it will not have the
      RAID56 incompat flag set, and we can skip the chunk tree lookup
      completely.
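
      The early exit can be expressed with the existing incompat flag test,
      roughly like this (illustrative placement):

        /* No RAID56 chunks can exist without the incompat bit set. */
        if (!btrfs_fs_incompat(fs_info, RAID56))
                return false;   /* or sectorsize, for btrfs_full_stripe_len() */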
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b036f479
    • btrfs: use PAGE_ALIGNED instead of IS_ALIGNED · 1280d2d1
      Fanjun Kong committed
      <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's use it
      instead of calling IS_ALIGNED with PAGE_SIZE directly.
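
      The two forms are equivalent; PAGE_ALIGNED is just the more readable
      spelling (illustrative example, not a specific call site):

        /* Before */
        ASSERT(IS_ALIGNED(offset, PAGE_SIZE));

        /* After */
        ASSERT(PAGE_ALIGNED(offset));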
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1280d2d1
    • btrfs: zoned: fix comment description for sb_write_pointer logic · 31f37269
      Pankaj Raghav committed
      Fix the comment to represent the actual logic used for sb_write_pointer:
      
      - Empty[0] && In use[1] should be an invalid state instead of returning
        zone 0 wp
      - Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp
      - In use[0] && Empty[1] should be returning zone 0 wp instead of being an
        invalid state
      - In use[0] && Full[1] should be returning zone 0 wp instead of returning
        zone 1 wp
      - Full[0] && Empty[1] should be returning zone 1 wp instead of returning
        zone 0 wp
      - Full[0] && In use[1] should be returning zone 1 wp instead of returning
        zone 0 wp
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31f37269
    • btrfs: fix typos in comments · 143823cf
      David Sterba committed
      Codespell has found a few typos.
      Signed-off-by: David Sterba <dsterba@suse.com>
      143823cf
    • Linux 5.19-rc8 · e0dccc3b
      Linus Torvalds committed
      e0dccc3b
    • certs: make system keyring depend on x509 parser · e9088629
      Adam Borowski committed
      This code requires x509_load_certificate_list() to be built-in.
      
      Fixes: 60050ffe ("certs: Move load_certificate_list() to be with the asymmetric keys code")
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Link: https://lore.kernel.org/all/202206221515.DqpUuvbQ-lkp@intel.com/
      Link: https://lore.kernel.org/all/20220712104554.408dbf42@gandalf.local.home/
      Signed-off-by: Adam Borowski <kilobyte@angband.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9088629
    • Merge tag 'perf_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · af2c9ac2
      Linus Torvalds committed
      Pull perf fix from Borislav Petkov:
      
       - Reorganize the perf LBR init code so that a TSX quirk is applied
         early enough in order for the LBR MSR access to not #GP
      
      * tag 'perf_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel/lbr: Fix unchecked MSR access error on HSW
      af2c9ac2
    • Merge tag 'sched_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c2602a7c
      Linus Torvalds committed
      Pull scheduler fix from Borislav Petkov:
       "A single fix to correct a wrong BUG_ON() condition for deboosted
        tasks"
      
      * tag 'sched_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/deadline: Fix BUG_ON condition for deboosted tasks
      c2602a7c
    • Merge tag 'x86_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 05017fed
      Linus Torvalds committed
      Pull x86 fixes from Borislav Petkov:
       "A couple more retbleed fallout fixes.
      
        It looks like their urgency is decreasing so it seems like we've
        managed to catch whatever snafus the limited -rc testing has exposed.
        Maybe we're getting ready... :)
      
         - Make retbleed mitigations 64-bit only (32-bit will need a bit more
           work if even needed, at all).
      
         - Prevent return thunks patching of the LKDTM modules as it is not
           needed there
      
         - Avoid writing the SPEC_CTRL MSR on every kernel entry on eIBRS
           parts
      
         - Enhance error output of apply_returns() when it fails to patch a
           return thunk
      
         - A sparse fix to the sev-guest module
      
         - Protect EFI fw calls by issuing an IBPB on AMD"
      
      * tag 'x86_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/speculation: Make all RETbleed mitigations 64-bit only
        lkdtm: Disable return thunks in rodata.c
        x86/bugs: Warn when "ibrs" mitigation is selected on Enhanced IBRS parts
        x86/alternative: Report missing return thunk details
        virt: sev-guest: Pass the appropriate argument type to iounmap()
        x86/amd: Use IBPB for firmware calls
      05017fed
    • Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 714b82c1
      Linus Torvalds committed
      Pull clk fix from Stephen Boyd:
       "One more fix to set the correct IO mapping for a clk gate in the
        lan966x driver"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: lan966x: Fix the lan966x clock gate register address
      714b82c1
  2. 24 Jul 2022, 2 commits
  3. 23 Jul 2022, 3 commits