1. 25 Jul, 2022 40 commits
    •
      btrfs: split btrfs_submit_data_bio to read and write parts · c93104e7
      Christoph Hellwig committed
      Split btrfs_submit_data_bio into one helper for reads and one for writes.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: simplify code flow in btrfs_submit_dio_bio · e6484bd4
      Christoph Hellwig committed
      There is no exit block or cleanup, and the function is reasonably short,
      so we can return inline instead of using goto. This makes the function
      more straightforward.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: move more work into btrfs_end_bioc · b4c46bde
      Christoph Hellwig committed
      Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of
      duplicating the logic in the callers.  Also remove the bio argument as
      it always must be bioc->orig_bio and the now pointless bioc_error that
      did nothing but assign bi_sector to the same value just sampled in the
      caller.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: send: enable support for stream v2 and compressed writes · d6815592
      Omar Sandoval committed
      Now that the new support is implemented, allow the ioctl to accept v2
      and the compressed flag, and update the version in sysfs.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: send: send compressed extents with encoded writes · 3ea4dc5b
      Omar Sandoval committed
      Now that all of the pieces are in place, we can use the ENCODED_WRITE
      command to send compressed extents when appropriate.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: send: get send buffer pages for protocol v2 · a4b333f2
      Omar Sandoval committed
      For encoded writes in send v2, we will get the encoded data with
      btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
      pages. To avoid extra buffers and copies, we should read directly into
      the send buffer. Therefore, we need the raw pages for the send buffer.
      
      We currently allocate the send buffer with kvmalloc(), which may return
      a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
      pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
      However, the buffer size we use (144K) is not a power of two, which in
      theory is not guaranteed to return a page-aligned buffer, and in
      practice would waste a lot of memory due to rounding up to the next
      power of two. 144K is large enough that it usually gets allocated with
      vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
      and save the pages in an array.
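
The sizing argument above can be checked with a little arithmetic. This is only a sketch: it assumes 4 KiB pages and a simplified allocator model that rounds a request up to the next power of two, as the message describes for the kmalloc case. The helper names are hypothetical and are not kernel functions.

```python
# Rough arithmetic behind the 144K send buffer discussion.
# Assumptions (not taken from kernel source): 4 KiB pages, and a simplified
# model in which a kmalloc-style allocator rounds up to a power of two.

KIB = 1024
PAGE_SIZE = 4 * KIB          # assumed page size
SEND_BUF_SIZE = 144 * KIB    # buffer size quoted in the commit message


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


def pages_needed(size: int, page_size: int = PAGE_SIZE) -> int:
    """Number of whole pages required to hold `size` bytes."""
    return -(-size // page_size)  # ceiling division


rounded = next_power_of_two(SEND_BUF_SIZE)
wasted = rounded - SEND_BUF_SIZE
print(f"power-of-two allocation: {rounded // KIB}K, wasted: {wasted // KIB}K")
print(f"pages in a page-granular allocation: {pages_needed(SEND_BUF_SIZE)}")
```

Under these assumptions a power-of-two allocation would take 256K for a 144K buffer, while a page-granular allocation needs 36 pages and wastes nothing.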
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: send: write larger chunks when using stream v2 · 356bbbb6
      Omar Sandoval committed
      The length field of the send stream TLV header is 16 bits. This means
      that the maximum amount of data that can be sent for one write is 64K
      minus one. However, encoded writes must be able to send the maximum
      compressed extent (128K) in one command, or more. To support this, send
      stream version 2 encodes the DATA attribute differently: it has no
      length field, and the length is implicitly up to the end of containing
      command (which has a 32-bit length field). Although this is necessary
      for encoded writes, normal writes can benefit from it, too.
      
      Also add a check to enforce that the DATA attribute is last. It is only
      strictly necessary for v2, but we might as well make v1 consistent with
      it.
      
      For v2, let's bump up the send buffer to the maximum compressed extent
      size plus 16K for the other metadata (144K total). Since this will most
      likely be vmalloc'd (and always will be after the next commit), we round
      it up to the next page since we might as well use the rest of the page
      on systems with >16K pages.
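
The difference between the two DATA encodings can be sketched in user space. This is an illustration only, not the kernel implementation: the attribute numbers are hypothetical, and the 16-bit little-endian type field is an assumption (the message above only states that the length field is 16 bits).

```python
import struct
from typing import List

# Hypothetical attribute numbers, used only for illustration.
ATTR_PATH = 15
ATTR_DATA = 19

MAX_U16 = 0xFFFF  # largest value a 16-bit length field can hold


def encode_tlv(attr_type: int, payload: bytes) -> bytes:
    """v1-style attribute: 16-bit type, 16-bit length, then the payload."""
    if len(payload) > MAX_U16:
        raise ValueError("payload does not fit in a 16-bit length field")
    return struct.pack("<HH", attr_type, len(payload)) + payload


def encode_data_v2(payload: bytes) -> bytes:
    """v2-style DATA attribute: type only, no length field."""
    return struct.pack("<H", ATTR_DATA) + payload


def build_command_body_v2(attrs: List[bytes], data: bytes) -> bytes:
    """Place DATA last so its length is implied by the end of the command."""
    return b"".join(attrs) + encode_data_v2(data)


def decode_data_v2(body: bytes, data_offset: int) -> bytes:
    """Return everything after the DATA type field, up to the end of the body."""
    (attr_type,) = struct.unpack_from("<H", body, data_offset)
    if attr_type != ATTR_DATA:
        raise ValueError("expected a DATA attribute at this offset")
    return body[data_offset + 2:]


path_attr = encode_tlv(ATTR_PATH, b"dir/file")
payload = bytes(128 * 1024)  # a 128K extent, too large for a v1 TLV
body = build_command_body_v2([path_attr], payload)
assert decode_data_v2(body, len(path_attr)) == payload
```

Because the DATA length is implied by the end of the command, the attribute has to come last, which is why the check described above is needed.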
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: send: add stream v2 definitions · b7c14f23
      Omar Sandoval committed
      This adds the definitions of the new commands for send stream version 2
      and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a.
      chattr), and encoded writes. It also documents two changes to the send
      stream format in v2: the receiver shouldn't assume a maximum command
      size, and the DATA attribute is encoded differently to allow for writes
      larger than 64k. These will be implemented in subsequent changes, and
      then the ioctl will accept the new version and flag.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: send: explicitly number commands and attributes · 54cab6af
      Omar Sandoval committed
      Commit e77fbf99 ("btrfs: send: prepare for v2 protocol") added
      __BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
      version plus 1, but as written this creates gaps in the number space.
      
      The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
      accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
      has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
      23 and 24 are valid commands.
      
      Instead, let's explicitly number all of the commands, attributes, and
      sentinel MAX constants.
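
The gap described above follows from how C numbers enum entries that have no explicit value. The sketch below mimics that rule in Python. The entry names are shortened, hypothetical stand-ins, and the explicit values in the second layout are only one possible choice, not necessarily what the patch uses.

```python
def number_enum(entries):
    """Mimic C enum numbering: an entry without a value gets previous + 1."""
    values = {}
    previous = -1
    for name, explicit in entries:
        value = previous + 1 if explicit is None else explicit
        values[name] = value
        previous = value
    return values


# Implicit sentinels: each one silently consumes the next number.
implicit = number_enum([
    ("LAST_V1_COMMAND", 22),  # hypothetical name for the highest command
    ("MAX_V1", None),
    ("MAX_V2", None),
    ("MAX", None),
])
print(implicit)
# {'LAST_V1_COMMAND': 22, 'MAX_V1': 23, 'MAX_V2': 24, 'MAX': 25}

# Explicit sentinels (assumed values): nothing is implied about 23 or 24.
explicit = number_enum([
    ("LAST_V1_COMMAND", 22),
    ("MAX_V1", 22),
    ("MAX_V2", 22),
    ("MAX", 22),
])
print(explicit)
```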
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: send: remove unused send_ctx::{total,cmd}_send_size · ca182acc
      Omar Sandoval committed
      We collect these statistics but have never exposed them in any way. I
      also didn't find any patches that ever attempted to make use of them.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: sysfs: add force_chunk_alloc trigger to force allocation · 22c55e3b
      Stefan Roesch committed
      Add a write-only trigger to force new chunk allocation for a given block
      group type. It is at
      
        /sys/fs/btrfs/<uuid>/allocation/<type>/force_chunk_alloc
      
      Note: this is now only for debugging and testing and is enabled with the
            CONFIG_BTRFS_DEBUG configuration option. The transaction is
            started from sysfs context and can be problematic in some cases.
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ Changes from the original submission:
        - update changelog
        - drop unnecessary error messages
        - switch value to bool and use kstrtobool
        - move BTRFS_ATTR_W definition
        - add comment for using transaction
      ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: sysfs: export chunk size in space infos · 19fc516a
      Stefan Roesch committed
      Add a new sysfs knob:

        /sys/fs/btrfs/<uuid>/allocation/<type>/chunk_size

      This allows querying and setting the chunk size.
      
      Constraints:
      
      - can be changed by root only
      - system chunk size can't be set
      - maximum chunk size is 10% of the filesystem size
      - final value is rounded down to a multiple of 256M
      - cannot be set on zoned filesystem
      
      Note that rounding and the 10% clamp will result in a different value
      on filesystems smaller than 10G, typically 768M.
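
The interaction between the 10% clamp and the 256M rounding can be sketched as follows. This only models the two constraints listed above, in the order the note suggests (clamp, then round down). The function name is hypothetical, and the real sysfs handler may apply further limits.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB
SZ_256M = 256 * MIB


def effective_chunk_size(requested: int, fs_size: int) -> int:
    """Clamp to 10% of the filesystem size, then round down to 256M."""
    clamped = min(requested, fs_size // 10)
    return clamped - (clamped % SZ_256M)


# Asking for 1G chunks on an 8G filesystem: 10% is about 819M,
# which rounds down to three 256M units.
print(effective_chunk_size(1 * GIB, 8 * GIB) // MIB)   # 768

# On a 10G filesystem the same request survives unchanged.
print(effective_chunk_size(1 * GIB, 10 * GIB) // MIB)  # 1024
```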
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ Changes to original submission:
        - document setting constraints
        - drop read-only requirement
        - drop unnecessary error messages
        - fix return values of _store callback
        - use memparse for the value
        - fix rounding down to 256M
      ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: store chunk size in space-info struct · f6fca391
      Stefan Roesch committed
      The chunk size is stored in the btrfs_space_info structure.  It is
      initialized at the start and is then used.
      
      A new API is added to update the current chunk size.  This API is used
      to be able to expose the chunk_size as a sysfs setting.
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename and merge helpers, switch atomic type to u64, style fixes ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: do not batch insert non-consecutive dir indexes during log replay · 71b68e9e
      Josef Bacik committed
      While running generic/475 in a loop I got the following error:
      
      BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
      <snip>
       item 65 key (263 96 517) itemoff 14132 itemsize 33
       item 66 key (263 96 523) itemoff 14099 itemsize 33
       item 67 key (263 96 525) itemoff 14066 itemsize 33
       item 68 key (263 96 531) itemoff 14033 itemsize 33
       item 69 key (263 96 524) itemoff 14000 itemsize 33
      
      As you can see here we have 3 dir index keys with the dir index value of
      523, 525, and 531 inserted between 517 and 524.  This occurs because our
      dir index insertion code will bulk insert all dir index items on the
      node regardless of their actual key value.
      
      This makes sense on a normally running system, because if there's a gap
      in between the items there was a deletion before the item was inserted,
      so there's not going to be an overlap of the dir index items that need
      to be inserted and what exists on disk.
      
      However, during log replay this isn't necessarily true: we could have any
      number of dir indexes in the tree already.
      
      Fix this by seeing if we're replaying the log, and if we are simply skip
      batching if there's a gap in the key space.
      
      The fstest left this file system broken. I tested this patch against
      the broken fs to make sure it replayed the log properly, and then ran
      btrfs check on the file system after the log replay to verify
      everything was ok.
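
The "bad key order" report in the leaf dump above can be reproduced with a small ordering check. This is a user-space sketch, not the kernel's tree checker: it models keys as (objectid, type, offset) tuples and relies on Python's tuple comparison, which compares the fields in that order. The function name is hypothetical.

```python
from typing import List, Optional, Tuple

Key = Tuple[int, int, int]  # (objectid, type, offset)


def first_bad_key_order(
    keys: List[Key], first_slot: int = 0
) -> Optional[Tuple[int, Key, Key]]:
    """Return (slot, prev, current) for the first key that is not strictly
    greater than its predecessor, or None if the keys are ordered."""
    for i in range(1, len(keys)):
        if keys[i] <= keys[i - 1]:
            return first_slot + i, keys[i - 1], keys[i]
    return None


# Keys of items 65..69 from the leaf dump in the commit message.
leaf = [
    (263, 96, 517),
    (263, 96, 523),
    (263, 96, 525),
    (263, 96, 531),
    (263, 96, 524),
]
print(first_bad_key_order(leaf, first_slot=65))
# (69, (263, 96, 531), (263, 96, 524))
```

The result matches the slot and the prev/current keys in the error line: 524 landed after 531 because the batch was inserted without regard to keys already in the tree.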
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: reduce amount of reserved metadata for delayed item insertion · 763748b2
      Filipe Manana committed
      Whenever we want to create a new dir index item (when creating an inode,
      creating a hard link, or renaming a file) we reserve 1 unit of metadata space
      for it in a transaction (that's 256K for a node/leaf size of 16K), and
      then create a delayed insertion item for it to be added later to the
      subvolume's tree. That unit of metadata is kept until the delayed item
      is inserted into the subvolume tree, which may take a while to happen
      (in the worst case, it's done only when the transaction commits). If we
      have multiple dir index items to insert for the same directory, say N
      index items, and they all fit in a single leaf of metadata, then we are
      holding N units of reserved metadata space when all we need is 1 unit.
      
      This change addresses that: whenever a new delayed dir index item is
      added, we release the unit of metadata the caller has reserved when it
      started the transaction if adding that new dir index item does not
      result in touching one more metadata leaf, otherwise the reservation
      is kept by transferring it from the transaction block reserve to the
      delayed items block reserve, just like before. Given that with a leaf
      size of 16K we can have a few hundred dir index items in a single leaf
      (the exact value depends on file name lengths), this reduces pressure on
      metadata reservation by releasing unnecessary space much sooner.
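
The saving can be estimated with a short calculation. This is a sketch under stated assumptions: one unit is modeled as nodesize x maximum tree height (8) x 2, which reproduces the 256K figure quoted above for a 16K node size, and the items-per-leaf value is a made-up example of "a few hundred". The function names are hypothetical.

```python
import math

KIB = 1024
BTRFS_MAX_LEVEL = 8  # assumed maximum btree height


def insert_metadata_size(nodesize: int, num_items: int = 1) -> int:
    """Bytes reserved for inserting num_items items, one unit per item."""
    return nodesize * BTRFS_MAX_LEVEL * 2 * num_items


def reserved_before(nodesize: int, n_items: int) -> int:
    """Old behaviour: every delayed dir index item keeps its own unit."""
    return insert_metadata_size(nodesize, n_items)


def reserved_after(nodesize: int, n_items: int, items_per_leaf: int) -> int:
    """New behaviour: keep one unit per leaf the items will touch."""
    leaves = math.ceil(n_items / items_per_leaf)
    return insert_metadata_size(nodesize, leaves)


nodesize = 16 * KIB
n_items = 300          # hypothetical number of pending dir index items
items_per_leaf = 300   # hypothetical; depends on file name lengths

print(insert_metadata_size(nodesize) // KIB)                     # 256
print(reserved_before(nodesize, n_items) // KIB)                 # 76800
print(reserved_after(nodesize, n_items, items_per_leaf) // KIB)  # 256
```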
      
      The following fs_mark test showed some improvement when creating many
      files in parallel on a machine running a non-debug kernel (Debian's default
      kernel config) with 12 cores:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        FILES=100000
        THREADS=$(nproc --all)
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     225991.3          5465891
           4      2400000            0     345728.1          5512106
           4      3600000            0     346959.5          5557653
           8      4800000            0     329643.0          5587548
           8      6000000            0     312657.4          5606717
           8      7200000            0     281707.5          5727985
          12      8400000            0      88309.8          5020422
          12      9600000            0      85835.9          5207496
          16     10800000            0      81039.2          5404964
          16     12000000            0      58548.6          5842468
      
      After:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     230604.5          5778375
           4      2400000            0     348908.3          5508072
           4      3600000            0     357028.7          5484337
           6      4800000            0     342898.3          5565703
           6      6000000            0     314670.8          5751555
           8      7200000            0     282548.2          5778177
          12      8400000            0      90844.9          5306819
          12      9600000            0      86963.1          5304689
          16     10800000            0      89113.2          5455248
          16     12000000            0      86693.5          5518933
      
      The "after" results are after applying this patch and all the other
      patches in the same patchset, which is comprised of the following
      changes:
      
        btrfs: balance btree dirty pages and delayed items after a rename
        btrfs: free the path earlier when creating a new inode
        btrfs: balance btree dirty pages and delayed items after clone and dedupe
        btrfs: add assertions when deleting batches of delayed items
        btrfs: deal with deletion errors when deleting delayed items
        btrfs: refactor the delayed item deletion entry point
        btrfs: improve batch deletion of delayed dir index items
        btrfs: assert that delayed item is a dir index item when adding it
        btrfs: improve batch insertion of delayed dir index items
        btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
        btrfs: set delayed item type when initializing it
        btrfs: reduce amount of reserved metadata for delayed item insertion
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: set delayed item type when initializing it · c9d02ab4
      Filipe Manana committed
      Currently we set the type of a delayed item only after successfully
      inserting it into its respective rbtree. This is fine, as the type
      is not used anywhere before that point, but for the next patch in the
      series, there will be the need to check the type of a delayed item
      before inserting it into a rbtree.
      
      So set the type of a delayed item immediately after allocating it.
      This also makes the trivial wrappers for adding insertion and deletion
      useless, so it removes them as well.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: do not BUG_ON() on failure to reserve metadata for delayed item · 3bae13e9
      Filipe Manana committed
      At btrfs_insert_delayed_dir_index(), we don't expect the metadata
      reservation for the delayed dir index item insertion to fail, because the
      caller is supposed to have reserved 1 unit of metadata space for that.
      All callers are able to deal with an error in case that happens, so there
      is no need for something so drastic as a BUG_ON() in case of failure.
      Instead just emit a warning, so that's easily noticed during development
      (fstests in particular), and return the error to the caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: improve batch insertion of delayed dir index items · 06ac264f
      Filipe Manana committed
      Currently we group delayed dir index items for insertion as a single batch
      (a single btree operation) as long as their keys are sequential in the key
      space.
      
      For example we have delayed index items for the following index keys:
      
         10, 11, 12, 15, 16, 20, 21
      
      We end up building three batches:
      
      1) First one for index keys 10, 11 and 12;
      2) Second one for index keys 15 and 16;
      3) Third one for index keys 20 and 21.
      
      However, since the dir index numbers come from a monotonically increasing
      counter and are never reused, we could group all these items into a single
      batch. The existence of holes in the sequence happens only when we had
      delayed dir index items for insertion that got deleted before they were
      flushed to the subvolume's tree.
      
      The delayed items are stored in a rbtree based on their key order, so
      we can just group items into a batch as long as they all fit in a leaf,
      and ignore if there's a gap (key offset, index number) between two
      consecutive items. This is more efficient and reduces the amount of
      time spent when running delayed items if there are gaps between dir
      index items.
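
The two grouping rules described above can be compared with a small model. This is a simplification, not the kernel code: it counts items rather than bytes when deciding whether a batch still fits in a leaf, the capacity value is arbitrary, and the function names are hypothetical.

```python
def batch_sequential(keys):
    """Old rule: start a new batch whenever the index is not previous + 1."""
    batches = []
    for key in keys:
        if batches and key == batches[-1][-1] + 1:
            batches[-1].append(key)
        else:
            batches.append([key])
    return batches


def batch_by_leaf_capacity(keys, max_items_per_leaf):
    """New rule: ignore gaps, split only when a batch would exceed a leaf."""
    batches = []
    for key in keys:
        if batches and len(batches[-1]) < max_items_per_leaf:
            batches[-1].append(key)
        else:
            batches.append([key])
    return batches


# Index keys from the example in the commit message (already sorted,
# as they would be when walking the rbtree in key order).
keys = [10, 11, 12, 15, 16, 20, 21]

print(batch_sequential(keys))
# [[10, 11, 12], [15, 16], [20, 21]]
print(batch_by_leaf_capacity(keys, max_items_per_leaf=100))
# [[10, 11, 12, 15, 16, 20, 21]]
```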
      
      For example running the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=100
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
             echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        sync
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nsync took $dur milliseconds"
      
        umount $MNT
      
      While having the following bpftrace script running in another shell:
      
        $ cat bpf-delayed-items-inserts.sh
        #!/usr/bin/bpftrace
      
        /* Must add 'noinline' to btrfs_insert_delayed_items(). */
        k:btrfs_insert_delayed_items
        {
            @start_insert_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_insert_empty_items
        /@start_insert_delayed_items[tid]/
        {
           @insert_batches = count();
        }
      
        kr:btrfs_insert_delayed_items
        /@start_insert_delayed_items[tid]/
        {
            $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000;
            @btrfs_insert_delayed_items_total_time = sum($dur);
            delete(@start_insert_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_insert_delayed_items_total_time: 576
      @insert_batches: 51
      
      After this change:
      
      @btrfs_insert_delayed_items_total_time: 174
      @insert_batches: 2
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: assert that delayed item is a dir index item when adding it · a176affe
      Filipe Manana committed
      All delayed items are for dir index items; we don't support any other item
      types at the moment. So simplify __btrfs_add_delayed_item() and add an
      assertion for checking the item's key type. This also allows the next
      change to be simpler and avoid checking key types. In case we add support
      for different item types in the future, we'll hit the assertion
      during development and be able to adjust any code that assumes delayed
      items are always associated with dir index items.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: improve batch deletion of delayed dir index items · 4bd02d90
      Filipe Manana committed
      Currently we group delayed dir index items for deletion in a single batch
      (single btree operation) as long as they all exist in the same leaf and as
      long as their keys are sequential in the key space. For example if we have
      a leaf that has dir index items with offsets:
      
          2, 3, 4, 6, 7, 10
      
      And we have delayed dir index items for deleting all these indexes, and
      no delayed items for any other index keys in between, then we end up
      deleting in 3 batches:
      
      1) First batch for indexes 2, 3 and 4;
      2) Second batch for indexes 6 and 7;
      3) Third batch for index 10.
      
      This is a waste because we can delete all the index keys in a single
      batch. What matters is that each consecutive delayed index key matches
      each consecutive dir index key in a leaf.
      
      So update the logic at btrfs_batch_delete_items() to check only for a
      key match between delayed dir index items and dir index items in a leaf.
      Also avoid the useless first iteration on comparing the key of the
      first slot to delete with the key of the first delayed item, as it's
      silly since they always match, as the delayed item's key was used for
      the btree search that gave us the path we have.
      
      This is more efficient and reduces runtime of running delayed items, as
      well as lock contention on the subvolume's tree.
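
The matching rule for deletion batches can be sketched like this. It is a simplified model rather than the logic of btrfs_batch_delete_items() itself: keys are plain index numbers, the leaf is a sorted list, and the function name is hypothetical.

```python
def delete_batch_size(leaf_keys, start_slot, delayed_keys):
    """Count how many consecutive leaf items, starting at start_slot,
    match the consecutive delayed deletion keys."""
    count = 0
    while (
        start_slot + count < len(leaf_keys)
        and count < len(delayed_keys)
        and leaf_keys[start_slot + count] == delayed_keys[count]
    ):
        count += 1
    return count


delayed = [2, 3, 4, 6, 7, 10]

# Leaf from the commit message: gaps in the key space no longer matter,
# so all six items go in one batch.
print(delete_batch_size([2, 3, 4, 6, 7, 10], 0, delayed))     # 6

# If the leaf also holds index 5, which is not being deleted,
# the batch has to stop in front of it.
print(delete_batch_size([2, 3, 4, 5, 6, 7, 10], 0, delayed))  # 3
```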
      
      For example, the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=1000
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # Sync to force any delayed items to be flushed to the tree.
        sync
      
        start=$(date +%s%N)
        rm -fr $MNT/testdir
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nrm -fr took $dur milliseconds"
      
        umount $MNT
      
      Running that test script while having the following bpftrace script
      running in another shell:
      
        $ cat bpf-measure.sh
        #!/usr/bin/bpftrace
      
        /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */
        k:btrfs_delete_delayed_items
        {
            @start_delete_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_del_items
        /@start_delete_delayed_items[tid]/
        {
            @delete_batches = count();
        }
      
        kr:btrfs_delete_delayed_items
        /@start_delete_delayed_items[tid]/
        {
            $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000;
            @btrfs_delete_delayed_items_total_time = sum($dur);
            delete(@start_delete_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_delete_delayed_items_total_time: 9563
      @delete_batches: 1001
      
      After this change:
      
      @btrfs_delete_delayed_items_total_time: 7328
      @delete_batches: 509
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: refactor the delayed item deletion entry point · 36baa2c7
      Filipe Manana committed
      The delayed item deletion entry point, btrfs_delete_delayed_items(), is a
      bit convoluted for a few reasons:
      
      1) It's really a loop disguised with labels and goto statements;
      
      2) There's a 'delete_fail' label which isn't only for error cases, we can
         jump to that label even if no error happened, if we simply don't have
         more delayed items to delete;
      
      3) Unnecessarily keeps track of the current and previous items for no
         good reason, as after getting the next item and releasing the current
         one, it just jumps to the 'again' label just to look again for the
         first delayed item;
      
      4) When a delayed item is not in the tree (because it was already deleted
         before), it releases the item while holding a path locked, which is
         not necessary and adds more contention to the tree, especially taking
         into account that the path came from a deletion search, meaning we have
         write locks for nodes at levels 2, 1 and 0. And releasing the item is
         not computationally trivial (rb tree deletion, a kfree() and some
         trivial things).
      
      So refactor it to use a while loop and add some comments to make it more
      obvious why we can have delayed items without a matching item in the tree,
      as well as why we do not keep the delayed node locked all the time when
      running all its deletion items. This is also a preparation for some upcoming work
      involving delayed items.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: deal with deletion errors when deleting delayed items · 2b1d260d
      Filipe Manana committed
      Currently, btrfs_delete_delayed_items() ignores any errors returned from
      btrfs_batch_delete_items(). This looks fishy but it's not a problem at
      the moment because:
      
      1) Two of the errors returned from btrfs_batch_delete_items() are for
         impossible cases, cases where a delayed item does not match any item
         in the leaf the path points to - btrfs_delete_delayed_items() always
         calls btrfs_batch_delete_items() with a path that points to a leaf
         that contains an item matching a delayed item;
      
      2) btrfs_batch_delete_items() may return an error from btrfs_del_items(),
         in which case it does not release the delayed items of the batch.
      
         At the moment this is harmless because btrfs_del_items() actually is
         always able to delete items, even if it returns an error - when it
         returns an error it's because it ended up with a leaf mostly empty
         (less than 1/3 full) and failed to migrate items from that leaf into
         its neighbour leaves - this is not critical, as all the items were
         deleted, we just left the tree a bit unbalanced, but it's still a
         valid tree and causes no harm, and future operations on the tree will
         eventually balance it.
      
         So even if we get an error from btrfs_del_items(), the delayed items
         will not be released but the next time we run delayed items we will
         find out, at btrfs_delete_delayed_items(), that they are not present
         in the tree anymore and then release them.
      
      This is all a bit subtle, and it's certainly prone to be a disaster in
      case btrfs_del_items() changes one day and may return errors before being
      able to delete all the requested items, in which case we could leave the
      filesystem in an inconsistent state as we would commit a transaction
      despite a failure from deleting items from the tree.
      
      So make btrfs_delete_delayed_items() check for any errors from the call
      to btrfs_batch_delete_items().
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    •
      btrfs: add assertions when deleting batches of delayed items · 659192e6
      Filipe Manana committed
      There are a few impossible cases that btrfs_batch_delete_items() tries to
      deal with:
      
      1) Getting a path pointing to a NULL leaf;
      2) The leaf slot is pointing beyond the last item in the leaf;
      3) We can't find a single item to delete.
      
      The first case is impossible because the given path was returned by a
      successful call to btrfs_search_slot(). Replace the BUG_ON() with an
      ASSERT for this.
      
      The second case is impossible because we are always called when a delayed
      item matches an item in the given leaf. So add an ASSERT() for that and
      if that condition is not satisfied, trigger a warning and return an error.
      
      The third case is impossible exactly because of the same reason as the
      second case. The given delayed item matches one item in the leaf, so we
      know that our batch always has at least one item. Add an ASSERT to check
      that, trigger a warning if that expectation fails and return an error.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      659192e6
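The three checks described in the commit above can be illustrated with a small userspace sketch. It is not the kernel code: `struct sketch_leaf`, `SKETCH_ASSERT`, `sketch_warn_on()` and `sketch_batch_count()` are hypothetical names, the debug-only macro merely stands in for a compiled-out assertion, and the `-ENOENT` return value is an assumption rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a btree leaf: only the item count matters here. */
struct sketch_leaf {
	int nritems;
};

/* Debug-only assertion, compiled out unless SKETCH_DEBUG is defined. */
#ifdef SKETCH_DEBUG
#define SKETCH_ASSERT(cond) assert(cond)
#else
#define SKETCH_ASSERT(cond) ((void)0)
#endif

/* Print a warning when cond is true and pass the truth value through. */
static int sketch_warn_on(int cond, const char *what)
{
	if (cond)
		fprintf(stderr, "warning: %s\n", what);
	return cond;
}

/*
 * Return how many items starting at 'slot' fit in one batch of at most
 * 'wanted' items, or -ENOENT (an assumed error code) for an "impossible"
 * state.
 */
static int sketch_batch_count(const struct sketch_leaf *leaf, int slot, int wanted)
{
	int nitems;

	/* Case 1: the path came from a successful search, so assert only. */
	SKETCH_ASSERT(leaf != NULL);

	/* Case 2: the slot must point at an existing item. */
	SKETCH_ASSERT(slot < leaf->nritems);
	if (sketch_warn_on(slot >= leaf->nritems, "slot beyond last item"))
		return -ENOENT;

	/* Case 3: the batch must contain at least one item. */
	nitems = leaf->nritems - slot;
	if (nitems > wanted)
		nitems = wanted;
	SKETCH_ASSERT(nitems > 0);
	if (sketch_warn_on(nitems <= 0, "empty batch"))
		return -ENOENT;

	return nitems;
}
```

With assertions compiled out, the second and third cases degrade to a warning plus an error return instead of a crash, which is the behaviour the commit message describes.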
    • F
      btrfs: balance btree dirty pages and delayed items after clone and dedupe · 6fe81a3a
      Filipe Manana committed
      When reflinking extents (clone and deduplication), we need to touch the
      btree of the destination inode's subvolume, as well as potentially
      create a delayed inode for the destination inode (if it was not created
      before). However we are neither balancing the btree dirty pages nor the
      delayed items after such operations, so if we have a task that is doing
      a long series of clone or deduplication operations, it can result in
      accumulation of too many btree dirty pages and delayed items.
      
      So just call btrfs_btree_balance_dirty() after clone and deduplication,
      just like we do for every other system call that results in modifying a
      btree and adding delayed items.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6fe81a3a
    • F
      btrfs: free the path earlier when creating a new inode · 814e7718
      Filipe Manana committed
      When creating an inode, through btrfs_create_new_inode(), we release the
      path we allocated before once we don't need it anymore. But we keep it
      allocated until we return from that function, which is wasteful because
      after we release the path we do several things that can allocate yet
      another path: inheriting properties, setting the xattrs used by ACLs and
      security modules, adding an orphan item (O_TMPFILE case) or adding a
      dir item (for the non-O_TMPFILE case).
      
      So rather than only releasing the path once we don't need it anymore,
      free it. This way we avoid having two paths allocated until we return
      from btrfs_create_new_inode().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      814e7718
    • F
      btrfs: balance btree dirty pages and delayed items after a rename · ca6dee6b
      Filipe Manana committed
      A rename operation modifies a subvolume's btree, to remove the old dir
      item, add the new dir item, remove an inode ref and add a new inode ref.
      It can also create the delayed inode for the inodes involved in the
      operation, and it creates two delayed dir index items, one to delete
      the old name and another one to add the new name.
      
      However we are neither balancing the btree dirty pages nor the delayed
      items after a rename, which can result in accumulation of too many
      btree dirty pages and delayed items, especially if a task is doing a
      series of rename operations (for example it can happen for package
      installations/upgrades through the zypper tool).
      
      So just call btrfs_btree_balance_dirty() after a rename, just like we
      do for every other system call that results in modifying a btree and
      adding delayed items.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ca6dee6b
    • Q
      btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo committed
      Add tracepoints for better insight into how RAID56 data is submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:  the logical bytenr of the full stripe
        devid:        btrfs devid
        type:         raid stripe type
                      DATA1:  the first data stripe
                      DATA2:  the second data stripe
                      PQ1:    the P stripe
                      PQ2:    the Q stripe
        offset:       the offset inside the stripe
        opf:          the bio op type
        physical:     the physical offset the bio is for
        len:          the length of the bio
      
      The first two lines are from the partial RMW read, which reads the
      remaining data stripes from disk.
      
      The last two lines are from the full stripe RMW write, which writes the
      two affected 32K ranges (one in the DATA1 stripe, one in the P stripe).
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included in super.c, we have to export
      the needed structure definitions to 'raid56.h' and include the header in
      super.c, or we would be unable to access those members.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      b8bea09a
    • Q
      btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo committed
      [BUG]
      With added debugging, it turns out that the following write sequence
      causes an unnecessary extra read:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still a partial read, even though 389152768 is already cached by the
         first write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K write, even within the same full stripe, still
      triggers a read of previously cached data.
      
      This causes extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but that
      conversion missed one thing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but on
      rbio->stripe_sectors[i].uptodate, to determine if a sector is uptodate.
      
      Previously, switching the page pointer was enough, as the PageUptodate
      flag stays bound to that page.
      
      But now we have to manually mark the involved sectors uptodate, or later
      raid56_rmw_stripe() will find that the stolen sector is not uptodate and
      assemble a read bio for it, wasting IO.
      
      [FIX]
      We can easily fix the bug by also updating
      rbio->stripe_sectors[].uptodate in steal_rbio().
      
      With this fixed, the same write pattern no longer leads to the
      unnecessary reads:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d100466
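The cause and fix described above can be modelled in a few lines. This is a simplified userspace sketch, not the kernel's steal_rbio(): `struct sk_rbio`, `sk_steal_rbio()` and `sk_sectors_to_read()` are hypothetical, one sector per page is assumed, and the destination is assumed to hold no pages of its own before stealing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_NR_PAGES 4 /* assumption: one sector per page, four pages */

/* Hypothetical, heavily reduced model of a raid bio. */
struct sk_rbio {
	void *stripe_pages[SK_NR_PAGES];
	bool sector_uptodate[SK_NR_PAGES];
};

/* Move every cached, uptodate page from src to dest. */
static void sk_steal_rbio(struct sk_rbio *src, struct sk_rbio *dest)
{
	for (int i = 0; i < SK_NR_PAGES; i++) {
		if (!src->stripe_pages[i] || !src->sector_uptodate[i])
			continue;

		dest->stripe_pages[i] = src->stripe_pages[i];
		src->stripe_pages[i] = NULL;

		/*
		 * The step the fix adds: the uptodate state lives in the
		 * per-sector flag, so it has to move along with the page.
		 */
		dest->sector_uptodate[i] = true;
		src->sector_uptodate[i] = false;
	}
}

/* Count the sectors a later read-modify-write would still have to read. */
static int sk_sectors_to_read(const struct sk_rbio *rbio)
{
	int count = 0;

	for (int i = 0; i < SK_NR_PAGES; i++)
		if (!rbio->sector_uptodate[i])
			count++;
	return count;
}
```

Without the two flag assignments, `sk_sectors_to_read()` on the destination would still report every sector, which corresponds to the extra partial reads in the first log.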
    • D
      btrfs: remove redundant calls to flush_dcache_page · 21a8935e
      David Sterba committed
      Both memzero_page and memcpy_to_page already call flush_dcache_page so
      we can remove the calls from btrfs code.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      21a8935e
    • Q
      btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo committed
      If we have only an 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                                0  8K           32K             64K
      Disk 1  (data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that although we won't write any sectors from disk 2, we will
      still write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we would waste quite some space journaling the unused parity
      stripes).
      
      So here we also utilize btrfs_raid_bio::dbitmap: any time we queue a
      higher level bio into an rbio, we update rbio::dbitmap to indicate
      which vertical stripes we need to write back.
      
      And at finish_rmw(), we also check dbitmap to see if we need to write
      any sector in the vertical stripe.
      
      So after this patch, the example above leads only to the following
      writeback pattern:
      
                                0  8K           32K             64K
      Disk 1  (data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd8f7e62
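A rough sketch of the bookkeeping described in the commit above, assuming 4K sectors and a 64K stripe length. The helper names and the `SK_*` constants are hypothetical, and offsets are assumed to be sector aligned; the point is only that each written data sector sets the bit of its vertical stripe, and parity is written just for the set bits.

```c
#include <assert.h>

#define SK_SECTOR_SIZE     4096u
#define SK_STRIPE_LEN      65536u
#define SK_STRIPE_NSECTORS (SK_STRIPE_LEN / SK_SECTOR_SIZE) /* 16 */

/*
 * Mark the vertical stripes touched by a write of 'len' bytes at byte
 * offset 'off' inside the full stripe's data area. The vertical stripe
 * number is the sector index inside whichever data stripe the byte lands in.
 */
static unsigned long sk_mark_write(unsigned long dbitmap, unsigned int off,
				   unsigned int len)
{
	for (unsigned int cur = off; cur < off + len; cur += SK_SECTOR_SIZE) {
		unsigned int sector_nr = (cur % SK_STRIPE_LEN) / SK_SECTOR_SIZE;

		dbitmap |= 1UL << sector_nr;
	}
	return dbitmap;
}

/* Number of parity sectors that need writeback for this bitmap. */
static unsigned int sk_parity_sectors_to_write(unsigned long dbitmap)
{
	unsigned int count = 0;

	for (unsigned int nr = 0; nr < SK_STRIPE_NSECTORS; nr++)
		if (dbitmap & (1UL << nr))
			count++;
	return count;
}
```

For the 8K write in the diagrams, only bits 0 and 1 are set, so two parity sectors are written instead of all sixteen.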
    • Q
      btrfs: use integrated bitmaps for scrub_parity::dbitmap and ebitmap · 381b9b4c
      Qu Wenruo committed
      Previously we used "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for scrub_parity.
      
      To be extra safe, add an ASSERT() making sure calculated @nsectors is
      always smaller than BITS_PER_LONG.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      381b9b4c
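The idea of keeping the bitmaps inline can be sketched as follows. The struct and helpers are hypothetical and only model the storage choice; the sketch checks `nsectors <= bits per long`, which is what is strictly needed for every sector to get its own bit, and returns -1 where the kernel would use an assertion.

```c
#include <assert.h>
#include <limits.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical reduced parity context with both bitmaps stored inline. */
struct sk_scrub_parity {
	unsigned int nsectors;
	unsigned long dbitmap; /* sectors that hold data */
	unsigned long ebitmap; /* sectors that hit errors */
};

/* Returns 0 on success, -1 if one long cannot cover all sectors. */
static int sk_parity_init(struct sk_scrub_parity *sp, unsigned int stripe_len,
			  unsigned int sectorsize)
{
	unsigned int nsectors = stripe_len / sectorsize;

	if (nsectors > SK_BITS_PER_LONG)
		return -1;

	sp->nsectors = nsectors;
	sp->dbitmap = 0;
	sp->ebitmap = 0;
	return 0;
}

static void sk_set_bit(unsigned long *map, unsigned int nr)
{
	*map |= 1UL << nr;
}

static int sk_test_bit(unsigned long map, unsigned int nr)
{
	return (int)((map >> nr) & 1UL);
}
```

With a 64KiB stripe and 4KiB sectors there are 16 sectors, so a single `unsigned long` is enough on both 32-bit and 64-bit builds.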
    • Q
      btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo committed
      Previously we used "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c67c68eb
    • N
      btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance · 099aa972
      Nikolay Borisov committed
      This eliminates 2 labels and makes the code generally more streamlined.
      Also rename the 'out_bargs' label to 'out_unlock' since bargs is going
      to be freed under the 'out' label. This also fixes a memory leak since
      bargs wasn't correctly freed in one of the conditions which are now
      moved into btrfs_try_lock_balance.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      099aa972
    • N
      btrfs: introduce btrfs_try_lock_balance · 7fb10ed8
      Nikolay Borisov committed
      This function contains the factored out locking sequence of
      btrfs_ioctl_balance. Having this piece of code separate helps to
      simplify btrfs_ioctl_balance, which has become too complicated. This
      will be used in the next patch to streamline the logic in
      btrfs_ioctl_balance.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7fb10ed8
    • C
      btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio · 1e87770c
      Christoph Hellwig committed
      Use the new btrfs_bio_for_each_sector iterator to simplify
      btrfs_check_read_dio_bio.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e87770c
    • Q
      btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks · 261d812b
      Qu Wenruo committed
      Add a helper that works similarly to __bio_for_each_segment, but instead
      of iterating over PAGE_SIZE chunks it iterates over each sector.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: split from a larger patch, and iterate over the offset instead of
            the offset bits]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add parameter comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      261d812b
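To illustrate what iterating a bio in sector sized chunks means, here is a userspace sketch built on a callback rather than on the kernel's iterator macro. `struct sk_bvec`, `sk_for_each_sector()` and `sk_count_sector()` are hypothetical, and every segment length is assumed to be a multiple of the sector size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced segment: offset and length inside one page. */
struct sk_bvec {
	unsigned int bv_offset;
	unsigned int bv_len;
};

typedef void (*sk_sector_fn)(size_t vec_idx, unsigned int off_in_page,
			     unsigned int bio_offset, void *ctx);

/*
 * Call fn once per sector of every segment, tracking the running byte
 * offset into the whole bio. Returns the total number of bytes visited.
 */
static unsigned int sk_for_each_sector(const struct sk_bvec *vecs,
				       size_t nr_vecs, unsigned int sectorsize,
				       sk_sector_fn fn, void *ctx)
{
	unsigned int bio_offset = 0;

	for (size_t i = 0; i < nr_vecs; i++) {
		for (unsigned int off = 0; off < vecs[i].bv_len; off += sectorsize) {
			fn(i, vecs[i].bv_offset + off, bio_offset, ctx);
			bio_offset += sectorsize;
		}
	}
	return bio_offset;
}

/* Example callback: count the sectors seen. */
static void sk_count_sector(size_t vec_idx, unsigned int off_in_page,
			    unsigned int bio_offset, void *ctx)
{
	(void)vec_idx;
	(void)off_in_page;
	(void)bio_offset;
	(*(unsigned int *)ctx)++;
}
```

An 8K segment followed by a 4K segment is visited as three 4K sectors, regardless of how the segments map to pages.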
    • C
      btrfs: factor out a btrfs_csum_ptr helper · a89ce08c
      Christoph Hellwig committed
      Add a helper to find the csum for a byte offset into the csum buffer.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a89ce08c
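The lookup such a helper performs can be sketched as plain pointer arithmetic: convert the byte offset to a sector index, then scale by the checksum size. `struct sk_fs_info` and `sk_csum_ptr()` are hypothetical stand-ins, and the layout (one fixed-size checksum per sector, stored back to back) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the filesystem parameters the lookup needs. */
struct sk_fs_info {
	uint32_t sectorsize_bits; /* e.g. 12 for 4K sectors */
	uint32_t csum_size;       /* bytes per checksum, e.g. 4 for crc32c */
};

/* Return the checksum slot covering byte 'offset' of the data. */
static uint8_t *sk_csum_ptr(const struct sk_fs_info *fs_info, uint8_t *csums,
			    uint64_t offset)
{
	uint64_t offset_in_sectors = offset >> fs_info->sectorsize_bits;

	return csums + offset_in_sectors * fs_info->csum_size;
}
```

Any offset inside the same sector resolves to the same slot, so callers do not need to align the offset first.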
    • C
      btrfs: refactor end_bio_extent_readpage code flow · 97861cd1
      Christoph Hellwig committed
      Untangle the goto and move the code it jumps to so it goes in the order
      of the most likely states first.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      97861cd1
    • C
      btrfs: factor out a helper to end a single sector buffer I/O · a5aa7ab6
      Christoph Hellwig committed
      Add a helper to end I/O on a single sector, which will come in handy
      with the new read repair code.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a5aa7ab6
    • Q
      btrfs: remove duplicated parameters from submit_data_read_repair() · fd5a6f63
      Qu Wenruo committed
      The function submit_data_read_repair() is only called for the buffered
      data read path, thus those members can be calculated using bvec directly:
      
      - start
        start = page_offset(bvec->bv_page) + bvec->bv_offset;
      
      - end
        end = start + bvec->bv_len - 1;
      
      - page
        page = bvec->bv_page;
      
      - pgoff
        pgoff = bvec->bv_offset;
      
      Thus we can safely replace those 4 parameters with just one bio_vec.
      
      Also remove the unused return value.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: also remove the return value]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fd5a6f63
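The four derivations listed in the commit above can be checked with a small userspace model. The `sk_` types are hypothetical stand-ins for the kernel's page and bio_vec, and a 4K page size is assumed so that the file offset of a page is its index shifted by 12.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12 /* assumption: 4K pages */

/* Hypothetical stand-ins for struct page and struct bio_vec. */
struct sk_page {
	uint64_t index; /* page index inside the file */
};

struct sk_bio_vec {
	struct sk_page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* The four former parameters, now derived from a single bvec. */
struct sk_range {
	uint64_t start;
	uint64_t end;
	struct sk_page *page;
	unsigned int pgoff;
};

static uint64_t sk_page_offset(const struct sk_page *page)
{
	return page->index << SK_PAGE_SHIFT;
}

static struct sk_range sk_range_from_bvec(const struct sk_bio_vec *bvec)
{
	struct sk_range r;

	r.start = sk_page_offset(bvec->bv_page) + bvec->bv_offset;
	r.end = r.start + bvec->bv_len - 1; /* inclusive end */
	r.page = bvec->bv_page;
	r.pgoff = bvec->bv_offset;
	return r;
}
```

Because all four values follow from the bvec alone, passing them separately only creates room for them to disagree.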