1. 14 Feb 2023, 1 commit
  2. 06 Dec 2022, 1 commit
  3. 29 Sep 2022, 1 commit
  4. 26 Sep 2022, 1 commit
  5. 25 Jul 2022, 1 commit
  6. 14 Mar 2022, 2 commits
  7. 23 Aug 2021, 1 commit
  8. 22 Jul 2021, 1 commit
  9. 21 Jun 2021, 2 commits
    • btrfs: introduce btrfs_lookup_first_ordered_range() · c095f333
      Committed by Qu Wenruo
      Although we already have btrfs_lookup_first_ordered_extent() and
      btrfs_lookup_ordered_extent(), they both have their own limitations:
      
      - btrfs_lookup_ordered_extent() can't do an extra range check

        It's only designed to look up any ordered extent before a certain bytenr.
      
      - btrfs_lookup_first_ordered_extent() may not return the first ordered
        extent in the range
      
        It doesn't ensure the first ordered extent is returned.
        The existing callers are only interested in exhausting all ordered
        extents in a range; the order is not important.
      
      For the incoming btrfs_invalidatepage() refactoring, we need a way to
      properly iterate over all ordered extents of a range in bytenr order.
      
      So this patch introduces a new function,
      btrfs_lookup_first_ordered_range(), to look up ordered extents with
      bytenr order awareness and an extra range check.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c095f333
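      To make the lookup semantics described above concrete, here is a minimal
      user-space sketch of a range-aware, bytenr-ordered lookup; the struct and
      function names are illustrative, not the kernel API:

        #include <stdio.h>
        #include <stdint.h>
        #include <stddef.h>

        struct ordered_range { uint64_t start; uint64_t len; };

        /* Return the first range (lowest start) overlapping [offset, offset + len),
         * or NULL if none overlaps.  Assumes @ranges is sorted by start. */
        static const struct ordered_range *
        lookup_first_ordered_range(const struct ordered_range *ranges, size_t n,
                                   uint64_t offset, uint64_t len)
        {
                for (size_t i = 0; i < n; i++) {
                        if (ranges[i].start >= offset + len)
                                break;                  /* past the queried range */
                        if (ranges[i].start + ranges[i].len > offset)
                                return &ranges[i];      /* first overlap in bytenr order */
                }
                return NULL;
        }

        int main(void)
        {
                const struct ordered_range ranges[] = {
                        { 0, 4096 }, { 8192, 4096 }, { 16384, 8192 },
                };
                const struct ordered_range *r =
                        lookup_first_ordered_range(ranges, 3, 6144, 16384);
                if (r)
                        printf("first overlap: start=%llu len=%llu\n",
                               (unsigned long long)r->start, (unsigned long long)r->len);
                return 0;
        }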
    • btrfs: refactor how we finish ordered extent io for endio functions · e65f152e
      Committed by Qu Wenruo
      Btrfs has two endio functions to mark certain io range finished for
      ordered extents:
      
      - __endio_write_update_ordered()
        This is for direct IO
      
      - btrfs_writepage_endio_finish_ordered()
        This is for buffered IO.

      However they follow different routines to handle ordered extent io:
      
      - Whether to iterate through all ordered extents
        __endio_write_update_ordered() will, but
        btrfs_writepage_endio_finish_ordered() will not.

        In fact, iterating through all ordered extents will benefit later
        subpage support, while for the current PAGE_SIZE == sectorsize case
        this behavior makes no difference.
      
      - Whether to update page Private2 flag
        __endio_write_update_ordered() will not update the page Private2 flag,
        as for iomap direct IO the page may not even be mapped.
        btrfs_writepage_endio_finish_ordered(), on the other hand, will clear
        Private2 to prevent double accounting against btrfs_invalidatepage().
      
      Those differences are pretty subtle, and the ordered extent iteration
      code in the callers makes the code much harder to read.
      
      So this patch will introduce a new function,
      btrfs_mark_ordered_io_finished(), to do the heavy lifting:
      
      - Iterate through all ordered extents in the range
      - Do the ordered extent accounting
      - Queue the work for finished ordered extent
      
      This function has two new features:
      
      - Proper underflow detection and recovery
        The old underflow detection would only detect the problem and then
        continue.  It printed no useful info (root, inode, or ordered extent
        range), and was not noisy enough to be caught by fstests.

        Furthermore, when an underflow happens, the ordered extent will never
        finish.

        The new error detection resets bytes_left to 0, emits a proper kernel
        warning, and outputs extra info including the root, ino, ordered
        extent range, and the underflow value.
      
      - Prevent double accounting based on Private2 flag
        Now, if we find a range without the Private2 flag, we skip to the next
        range, as that means someone else has already finished the accounting
        for that ordered extent.

        This makes no difference for the current code, but will be a critical
        part of the incoming subpage support, as we can call
        btrfs_mark_ordered_io_finished() for multiple sectors if they are
        beyond the inode size.
        Thus double accounting prevention is a key feature for subpage.
      
      Now both endio functions only need to call that new function.
      
      And since the only caller of btrfs_dec_test_first_ordered_pending() is
      removed, also remove btrfs_dec_test_first_ordered_pending() completely.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e65f152e
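      A small user-space model of the accounting behavior described above
      (underflow clamping plus skipping already-accounted ranges); the names
      and the page-flag parameter are illustrative, not the kernel code:

        #include <stdio.h>
        #include <stdint.h>
        #include <stdbool.h>

        struct ordered_extent { uint64_t bytes_left; };

        /* Account @len finished bytes against @oe, with underflow detection:
         * instead of silently wrapping, clamp to 0 and warn (modeled after
         * the behavior described above, not the kernel implementation). */
        static bool finish_ordered_bytes(struct ordered_extent *oe, uint64_t len,
                                         bool page_flag_set)
        {
                if (!page_flag_set)     /* someone else already accounted this range */
                        return false;
                if (len > oe->bytes_left) {
                        fprintf(stderr, "underflow: left=%llu len=%llu\n",
                                (unsigned long long)oe->bytes_left,
                                (unsigned long long)len);
                        oe->bytes_left = 0;
                } else {
                        oe->bytes_left -= len;
                }
                return oe->bytes_left == 0;   /* true: queue the completion work */
        }

        int main(void)
        {
                struct ordered_extent oe = { .bytes_left = 8192 };
                finish_ordered_bytes(&oe, 4096, true);   /* 4096 left */
                finish_ordered_bytes(&oe, 4096, false);  /* skipped, still 4096 left */
                if (finish_ordered_bytes(&oe, 4096, true))
                        printf("ordered extent finished\n");
                return 0;
        }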
  10. 19 Apr 2021, 1 commit
  11. 09 Feb 2021, 4 commits
    • btrfs: zoned: use ZONE_APPEND write for zoned mode · d8e3fb10
      Committed by Naohiro Aota
      Enable zone append writing for zoned mode. When using zone append, a
      bio is issued to the start of a target zone and the device decides
      where inside the zone to place it. Upon completion the device reports
      the actual written position back to the host.
      
      Three parts are necessary to enable zone append mode. First, modify the
      bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
      bi_sector to point to the beginning of the zone.
      
      Second, record the returned physical address (and disk/partno) to the
      ordered extent in end_bio_extent_writepage() after the bio has been
      completed. We cannot resolve the physical address to the logical address
      because we can neither take locks nor allocate a buffer in this end_bio
      context. So, we need to record the physical address to resolve it later
      in btrfs_finish_ordered_io().
      
      And finally, rewrite the logical addresses of the extent mapping and
      checksum data according to the physical address using btrfs_rmap_block.
      If the returned address matches the originally allocated address, we can
      skip this rewriting process.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8e3fb10
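      A toy model of the zone-append idea: the caller targets the zone and
      the device reports where the data actually landed, which the caller
      then records (as the ordered extent does above). All names are
      illustrative, not the block layer API:

        #include <stdio.h>
        #include <stdint.h>

        struct zone { uint64_t start; uint64_t wp; };   /* wp = write pointer */

        /* Model of a zone-append write: the caller targets the zone start and
         * the "device" returns the sector it actually wrote to. */
        static uint64_t zone_append(struct zone *z, uint64_t nr_sectors)
        {
                uint64_t written_at = z->wp;

                z->wp += nr_sectors;
                return written_at;      /* caller records this position */
        }

        int main(void)
        {
                struct zone z = { .start = 1024, .wp = 1024 };
                uint64_t first = zone_append(&z, 8);
                uint64_t second = zone_append(&z, 8);

                printf("bio 1 landed at %llu, bio 2 at %llu\n",
                       (unsigned long long)first, (unsigned long long)second);
                return 0;
        }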
    • btrfs: zoned: split ordered extent when bio is sent · d22002fd
      Committed by Naohiro Aota
      For a zone append write, the device decides the location the data is being
      written to. Therefore we cannot ensure that two bios are written
      consecutively on the device. In order to ensure that an ordered extent
      maps to a contiguous region on disk, we need to maintain a "one bio ==
      one ordered extent" rule.
      
      Implement splitting of an ordered extent and extent map on bio submission
      to adhere to the rule.
      
      extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
      corresponding ordered extent so that the ordered extent's region fits into
      one bio and within the corresponding device limits.
      
      Several sanity checks need to be done in extract_ordered_extent(), e.g.:

      - We cannot split an ordered extent that has already been end_bio'd,
        because we cannot divide ordered->bytes_left between the split parts
      - We do not expect a compressed ordered extent
      - We should not have a checksum list because we omit the list splitting.
        Since the function is called before btrfs_wq_submit_bio() or
        btrfs_csum_one_bio(), this should always be the case.
      
      We also need to split an extent map by creating a new one. If not,
      unpin_extent_cache() complains about the difference between the start of
      the extent map and the file's logical offset.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d22002fd
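      A minimal sketch of the one-bio-equals-one-ordered-extent split; it only
      models the range arithmetic, not extract_ordered_extent() itself, and
      the names are illustrative:

        #include <stdio.h>
        #include <stdint.h>

        struct ordered_extent { uint64_t start; uint64_t len; };

        /* Split @oe so that its first part fits in one bio of @bio_len bytes;
         * the remainder becomes a new ordered extent.  The caller must ensure
         * bio_len < oe->len. */
        static struct ordered_extent split_ordered(struct ordered_extent *oe,
                                                   uint64_t bio_len)
        {
                struct ordered_extent tail = {
                        .start = oe->start + bio_len,
                        .len   = oe->len - bio_len,
                };

                oe->len = bio_len;
                return tail;
        }

        int main(void)
        {
                struct ordered_extent oe = { .start = 0, .len = 1 << 20 };
                struct ordered_extent tail = split_ordered(&oe, 256 << 10);

                printf("head: [%llu, %llu)  tail: [%llu, %llu)\n",
                       (unsigned long long)oe.start,
                       (unsigned long long)(oe.start + oe.len),
                       (unsigned long long)tail.start,
                       (unsigned long long)(tail.start + tail.len));
                return 0;
        }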
    • btrfs: rework the order of btrfs_ordered_extent::flags · 3c198fe0
      Committed by Qu Wenruo
      [BUG]
      There is a long-standing bug in the last parameter of
      btrfs_add_ordered_extent(), introduced in commit 771ed689 ("Btrfs:
      Optimize compressed writeback and reads") back in 2008.
      
      In that ancient commit btrfs_add_ordered_extent() expects the @type
      parameter to be one of the following:
      
      - BTRFS_ORDERED_REGULAR
      - BTRFS_ORDERED_NOCOW
      - BTRFS_ORDERED_PREALLOC
      - BTRFS_ORDERED_COMPRESSED
      
      But we pass 0 in cow_file_range(), which means BTRFS_ORDERED_IO_DONE.
      
      Ironically, an extra check in __btrfs_add_ordered_extent() won't set the
      bit if we see (type == IO_DONE || type == IO_COMPLETE), which avoids any
      obvious bug.
      
      But this still leads to regular COW ordered extents having no bit to
      indicate their type in various trace events, rendering the REGULAR bit
      useless.
      
      [FIX]
      Change the following aspects to avoid such problems:
      
      - Reorder btrfs_ordered_extent::flags
        Now the type bits go first (REGULAR/NOCOW/PREALLOC/COMPRESSED), then
        the DIRECT bit, and finally extra status bits like IO_DONE/COMPLETE/IOERR.
      
      - Add extra ASSERT() for btrfs_add_ordered_extent_*()
      
      - Remove @type parameter for btrfs_add_ordered_extent_compress()
        As the only valid @type here is BTRFS_ORDERED_COMPRESSED.
      
      - Remove the unnecessary special check for IO_DONE/COMPLETE in
        __btrfs_add_ordered_extent()
        That check was just there to make the code work; with the extra
        ASSERT(), only a limited set of values can be passed in.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3c198fe0
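      An illustrative layout of the reordered flag bits and the ASSERT() that
      replaces the old special-case check; the enum names and values below are
      assumptions for illustration, not the kernel's definitions:

        #include <assert.h>

        /* Type bits first, then DIRECT, then status bits. */
        enum {
                ORDERED_REGULAR,
                ORDERED_NOCOW,
                ORDERED_PREALLOC,
                ORDERED_COMPRESSED,
                ORDERED_DIRECT,
                ORDERED_IO_DONE,
                ORDERED_COMPLETE,
                ORDERED_IOERR,
        };

        static void add_ordered_extent(unsigned int type)
        {
                /* With the reordered flags, a plain ASSERT() replaces the old
                 * special-case check for IO_DONE/COMPLETE. */
                assert(type == ORDERED_REGULAR || type == ORDERED_NOCOW ||
                       type == ORDERED_PREALLOC || type == ORDERED_COMPRESSED);
                /* ... set the type bit on the new ordered extent ... */
        }

        int main(void)
        {
                add_ordered_extent(ORDERED_REGULAR);   /* 0 is now a real type */
                return 0;
        }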
    • btrfs: refactor btrfs_dec_test_* functions for ordered extents · 58f74b22
      Committed by Qu Wenruo
      The refactoring involves the following modifications:
      
      - Return bool instead of int
      
      - Parameter update for @cached of btrfs_dec_test_first_ordered_pending()
        For btrfs_dec_test_first_ordered_pending(), @cached is only used to
        return the finished ordered extent.
        Rename it to @finished_ret.
      
      - Comment updates
      
        * Fix one stale comment, which still refers to
          btrfs_dec_test_ordered_pending() although the context is calling
          btrfs_dec_test_first_ordered_pending().
        * Follow the common comment style for both functions
          Add more detailed descriptions for parameters and the return value
        * Move the reason why test_and_set_bit() is used into the call sites
      
      - Change how the return value is calculated
        The most confusing part of the return value is:

            if (...)
                    ret = 1;
            ...
            return ret == 0;

        This means that when we set ret to 1, the function returns 0.
        Change the local variable name to @finished, and directly return its
        value.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      58f74b22
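      A hypothetical before/after of the return-value change, showing why a
      bool named 'finished' reads better than 'return ret == 0'; the function
      bodies are simplified stand-ins, not the kernel code:

        #include <stdbool.h>
        #include <stdio.h>

        static int dec_test_old(unsigned long bytes_left, unsigned long len)
        {
                int ret = 0;

                if (bytes_left > len)
                        ret = 1;            /* not finished yet */
                return ret == 0;            /* confusing: ret = 1 means "return 0" */
        }

        static bool dec_test_new(unsigned long bytes_left, unsigned long len)
        {
                bool finished = (bytes_left <= len);

                return finished;            /* the name says what is returned */
        }

        int main(void)
        {
                printf("old: %d, new: %d\n", dec_test_old(4096, 4096),
                       (int)dec_test_new(4096, 4096));
                return 0;
        }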
  12. 10 Dec 2020, 1 commit
    • btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums · 9e46458a
      Committed by Qu Wenruo
      The function btrfs_lookup_bio_sums() is only called for read bios,
      while btrfs_find_ordered_sum() searches ordered extent sums, which
      exist only on the write path.
      
      This means that to read a page we either:

      - Submit a read bio if it's not uptodate
        In that case we only need to search the csum tree for checksums.

      - Find the page already uptodate
        It can be uptodate from a previous read, or because it was marked
        dirty (we always mark a dirty page uptodate).
        In that case we don't need to submit a read bio at all, and thus
        there is no need to search for any checksums.
      
      Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
      And since btrfs_lookup_bio_sums() is the only caller of
      btrfs_find_ordered_sum(), also remove its implementation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9e46458a
  13. 08 Dec 2020, 3 commits
  14. 07 Oct 2020, 6 commits
    • btrfs: remove inode argument from btrfs_start_ordered_extent · c0a43603
      Committed by Nikolay Borisov
      The passed-in ordered_extent struct is always well-formed and contains
      the inode, making the explicit argument redundant.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c0a43603
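      A small sketch of the API simplification: derive the inode from the
      ordered extent instead of passing it separately. The struct layout and
      field names shown here are assumptions for illustration, not the
      kernel's definitions:

        #include <stdio.h>

        struct btrfs_inode { unsigned long long ino; };
        struct btrfs_ordered_extent { struct btrfs_inode *inode; };

        /* After the change, the function derives the inode from the ordered
         * extent itself instead of taking a redundant argument. */
        static void start_ordered_extent(struct btrfs_ordered_extent *ordered)
        {
                struct btrfs_inode *inode = ordered->inode;

                printf("starting ordered extent for inode %llu\n", inode->ino);
        }

        int main(void)
        {
                struct btrfs_inode inode = { .ino = 257 };
                struct btrfs_ordered_extent ordered = { .inode = &inode };

                start_ordered_extent(&ordered);
                return 0;
        }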
    • btrfs: make fast fsyncs wait only for writeback · 48778179
      Committed by Filipe Manana
      Currently, regardless of a full or a fast fsync, we always wait for
      ordered extents to complete and then start logging the inode after
      that. However, for fast fsyncs we can just wait for the writeback to
      complete; we don't need to wait for the ordered extents to complete,
      since we use the list of modified extent maps to figure out which
      extents we must log and we can get their checksums directly from the
      ordered extents that are still in flight, otherwise looking them up in
      the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs, we used to start logging without even
      waiting for the writeback to complete first; we would wait for it to
      complete after logging, while holding a transaction open, which led to
      performance issues when using cgroups and probably for other cases too,
      as waiting for IO while holding a transaction handle should be avoided
      as much as possible. After that, for fast fsyncs, we started to wait
      for ordered extents to complete before starting to log, which adds some
      latency to fsyncs, and we even got at least one report about a
      performance drop which bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and for the ordered extents to complete. This brings back part of
      the logic we had that extracts checksums from in flight ordered extents,
      which are not yet in the checksums tree, and makes sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, the ordered extent completion is pretty fast, typically taking
      100 to 200 microseconds only in my observations. However when there are
      other tasks accessing these btrees, ordered extent completion can take a
      lot more time due to lock contention on nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular when we do have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since we have many tasks
      writing the checksums into the btree and other tasks that already started
      the logging phase are doing lookups for checksums in the btree.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES feature or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered only for a part of an extent, where we
      end up logging the full extent but only checksums for the written
      range, which results in missing checksums after log replay. Allowing
      ranged fsyncs to operate again only in the original range, when using
      the NO_HOLES feature and doing a fast fsync, is doable but requires
      some non-trivial changes to the writeback path, which can always be
      worked on later if needed, but I don't think it is a very common use
      case.
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal, no virtualization,
      on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
      with a kernel configuration that is the default of typical distributions
      (debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8%, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      48778179
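      A rough model of the fast fsync flow described above, assuming
      illustrative helper names; it only shows the decision of where the
      checksums come from, not the real logging code:

        #include <stdbool.h>
        #include <stdio.h>

        struct extent { unsigned long long start; bool ordered_in_flight; };

        static void wait_for_writeback(void) { /* filemap-level wait only */ }

        static void csum_from_ordered(unsigned long long start)
        {
                printf("csum of %llu from in-flight ordered extent\n", start);
        }

        static void csum_from_tree(unsigned long long start)
        {
                printf("csum of %llu from csum tree\n", start);
        }

        static void fast_fsync(const struct extent *mods, int n)
        {
                wait_for_writeback();   /* no longer wait for ordered extent completion */
                for (int i = 0; i < n; i++) {
                        if (mods[i].ordered_in_flight)
                                csum_from_ordered(mods[i].start);
                        else
                                csum_from_tree(mods[i].start);
                }
                /* log the inode, then make the transaction commit wait for any
                 * ordered extents logged here that are still in flight */
        }

        int main(void)
        {
                const struct extent mods[] = { { 0, true }, { 65536, false } };

                fast_fsync(mods, 2);
                return 0;
        }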
  15. 27 Jul 2020, 8 commits
    • btrfs: make btrfs_add_ordered_extent_dio take btrfs_inode · c1e09520
      Committed by Nikolay Borisov
      It simply forwards its argument, so let's get rid of one extra BTRFS_I()
      call by taking btrfs_inode directly.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c1e09520
    • btrfs: make btrfs_dec_test_first_ordered_pending take btrfs_inode · 7095821e
      Committed by Nikolay Borisov
      It doesn't really need the vfs inode, only the btrfs_inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7095821e
    • btrfs: make btrfs_add_ordered_extent_compress take btrfs_inode · 4cc61209
      Committed by Nikolay Borisov
      It simply forwards its inode argument to __btrfs_add_ordered_extent(),
      which already takes btrfs_inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4cc61209
    • btrfs: make btrfs_add_ordered_extent take btrfs_inode · e7fbf604
      Committed by Nikolay Borisov
      Preparation for converting its callers to take btrfs_inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e7fbf604
    • btrfs: make btrfs_lookup_ordered_extent take btrfs_inode · c3504372
      Committed by Nikolay Borisov
      It doesn't use the generic vfs inode for anything, so use btrfs_inode
      directly.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3504372
    • btrfs: remove no longer used trans_list member of struct btrfs_ordered_extent · 3ef64143
      Committed by Filipe Manana
      The 'trans_list' member of an ordered extent was used to keep track of the
      ordered extents for which a transaction commit had to wait. These were
      ordered extents that were started and logged by an fsync. However we don't
      do that anymore and before we stopped doing it we changed the approach to
      wait for the ordered extents in commit 161c3549 ("Btrfs: change how
      we wait for pending ordered extents"), which stopped using that list and
      therefore the 'trans_list' member is not used anymore since that commit.
      So just remove it since it's doing nothing and making each ordered extent
      structure waste memory (2 pointers).
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3ef64143
    • btrfs: remove no longer used log_list member of struct btrfs_ordered_extent · cd8d39f4
      Committed by Filipe Manana
      The 'log_list' member of an ordered extent was used to keep track of
      which ordered extents we needed to wait for after logging metadata, but
      it is not used
      anymore since commit 5636cf7d ("btrfs: remove the logged extents
      infrastructure"), as we now always wait on ordered extent completion
      before logging metadata. So just remove it since it's doing nothing and
      making each ordered extent structure waste more memory (2 pointers).
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd8d39f4
    • btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak · 7dbeaad0
      Committed by Qu Wenruo
      [BUG]
      The following simple workload from fsstress can lead to qgroup reserved
      data space leak:
        0/0: creat f0 x:0 0 0
        0/0: creat add id=0,parent=-1
        0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
        0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
        0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
      
      This would cause btrfs qgroup to leak 20480 bytes of data reserved
      space.  If the btrfs qgroup limit is enabled, such a leak can lead to
      unexpected early EDQUOT and unusable space.
      
      [CAUSE]
      When doing direct IO, the kernel will try to write back the existing
      buffered page cache, then invalidate those pages:
        generic_file_direct_write()
        |- filemap_write_and_wait_range();
        |- invalidate_inode_pages2_range();
      
      However for btrfs, the bi_end_io hook doesn't finish all its heavy work
      right after the bio ends.  In fact, it delays its work further:
      
        submit_extent_page(end_io_func=end_bio_extent_writepage);
        end_bio_extent_writepage()
        |- btrfs_writepage_endio_finish_ordered()
           |- btrfs_init_work(finish_ordered_fn);
      
        <<< Work queue execution >>>
        finish_ordered_fn()
        |- btrfs_finish_ordered_io();
           |- Clear qgroup bits
      
      This means that, when filemap_write_and_wait_range() returns,
      btrfs_finish_ordered_io() is not guaranteed to have been executed, thus
      the qgroup bits for the related range are not cleared.

      Now on to how the leak happens; this will only focus on the part where
      the buffered and direct IO ranges overlap.
      
      1. After buffered write
         The inode had the following range with QGROUP_RESERVED bit:
   	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      2. Writeback part for range [596K, 616K)
         Writeback finished, but btrfs_finish_ordered_io() has not been
         called yet.
         So we still have:
         	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      3. Pages for range [596K, 616K) get released
         This will clear all qgroup bits, but doesn't update the reserved
         data space.
         So we have:
         	596K		616K
      	|		|
         Qgroup reserved data space: 20K
         That number doesn't match the qgroup bit range anymore.
      
      4. Dio prepare space for range [596K, 700K)
         Qgroup reserved data space for that range, we got:
         	596K		616K			700K
      	|///////////////|///////////////////////|
         Qgroup reserved data space: 20K + 104K = 124K
      
      5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
         Qgroup free reserved space for that range, we got:
         	596K		616K			700K
      	|		|///////////////////////|
         We need to free that range of reserved space.
         Qgroup reserved data space: 124K - 20K = 104K
      
      6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
         However qgroup bit for range [596K, 616K) is already cleared in
         previous step, so we only free 84K for qgroup reserved space.
         	596K		616K			700K
      	|		|			|
         We need to free that range of reserved space.
         Qgroup reserved data space: 104K - 84K = 20K
      
         Now there is no way to release that 20K unless disabling qgroup or
         unmounting the fs.
      
      [FIX]
      This patch changes the timing of the btrfs_qgroup_release/free_data()
      call.  Buffered COW write is used as an example here.
      
      	The new timing			|	The old timing
      ----------------------------------------+---------------------------------------
       btrfs_buffered_write()			| btrfs_buffered_write()
       |- btrfs_qgroup_reserve_data() 	| |- btrfs_qgroup_reserve_data()
      					|
       btrfs_run_delalloc_range()		| btrfs_run_delalloc_range()
       |- btrfs_add_ordered_extent()  	|
          |- btrfs_qgroup_release_data()	|
             The reserved is passed into	|
             btrfs_ordered_extent structure	|
      					|
       btrfs_finish_ordered_io()		| btrfs_finish_ordered_io()
       |- The reserved space is passed to 	| |- btrfs_qgroup_release_data()
    btrfs_qgroup_record			|    The reserved space is passed
					|    to btrfs_qgroup_record
      					|
       btrfs_qgroup_account_extents()		| btrfs_qgroup_account_extents()
       |- btrfs_qgroup_free_refroot()		| |- btrfs_qgroup_free_refroot()
      
      The point of this change is to ensure that, when ordered extents are
      submitted, the qgroup reserved space is already released, to keep the
      timing aligned with filemap_write_and_wait_range().

      This way the qgroup data reserved space is all bound to the
      btrfs_ordered_extent, solving the timing mismatch.
      
      Fixes: f695fdce ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7dbeaad0
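      The leak arithmetic from steps 1-6 above, condensed into a tiny model;
      values are in KiB, and the counter stands in for the qgroup reserved
      space that is tracked separately from the qgroup bits:

        #include <stdio.h>

        int main(void)
        {
                unsigned int reserved = 0;

                reserved += 20;         /* 1. buffered write reserves [596K, 616K)  */
                                        /* 3. page release clears the qgroup bits   */
                                        /*    but does NOT touch the counter        */
                reserved += 104;        /* 4. dio reserves [596K, 700K)             */
                reserved -= 20;         /* 5. ordered IO finishes [596K, 616K)      */
                reserved -= 84;         /* 6. only 84K still has qgroup bits to free */

                printf("leaked reservation: %uK\n", reserved);   /* 20K */
                return 0;
        }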
  16. 24 Mar 2020, 2 commits
  17. 20 Jan 2020, 1 commit
  18. 19 Nov 2019, 1 commit
    • Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Committed by Filipe Manana
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop without calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      042528f8
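      A sketch of the invariant behind the bug described above: once the
      inc-RO step succeeded, every exit path must pair it with a decrement.
      The names are illustrative (the actual fix removes the wait that made
      the early break possible), so this only models the pairing:

        #include <stdbool.h>
        #include <stdio.h>

        static int  inc_block_group_ro(void) { return 0; }          /* 0 = now RO */
        static void dec_block_group_ro(void) { puts("back to RW"); }
        static int  join_transaction(void)   { return -1; }         /* simulate error */

        int main(void)
        {
                for (int chunk = 0; chunk < 3; chunk++) {
                        bool ro = (inc_block_group_ro() == 0);

                        if (join_transaction() < 0) {
                                if (ro)
                                        dec_block_group_ro();   /* the pairing that was missing */
                                break;
                        }
                        /* ... scrub the chunk ... */
                        if (ro)
                                dec_block_group_ro();
                }
                return 0;
        }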
  19. 01 Jul 2019, 2 commits