1. 08 Dec 2020, 2 commits
  2. 07 Oct 2020, 6 commits
    • btrfs: remove inode argument from btrfs_start_ordered_extent · c0a43603
      Authored by Nikolay Borisov
      The passed in ordered_extent struct is always well-formed and contains
      the inode making the explicit argument redundant.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
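
      A sketch of what the interface change implies (the exact prototypes are
      an assumption based on the description, not a verbatim copy of the diff):

        /* Before: callers had to pass the inode alongside the ordered extent. */
        void btrfs_start_ordered_extent(struct inode *inode,
                                        struct btrfs_ordered_extent *entry,
                                        int wait);

        /* After: the inode is recovered from the ordered extent itself. */
        void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
                                        int wait)
        {
                struct inode *inode = entry->inode; /* always well-formed */
                /* ... */
        }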
    • btrfs: make fast fsyncs wait only for writeback · 48778179
      Authored by Filipe Manana
      Currently, regardless of a full or a fast fsync, we always wait for
      ordered extents to complete and only then start logging the inode.
      However, for fast fsyncs we can just wait for the writeback to complete:
      we don't need to wait for the ordered extents to complete, since we use
      the list of modified extent maps to figure out which extents we must log,
      and we can get their checksums directly from the ordered extents that are
      still in flight, or otherwise look them up in the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs we used to start logging without even
      waiting for the writeback to complete first; we would wait for it to
      complete after logging, while holding a transaction open, which led to
      performance issues when using cgroups and probably for other cases too,
      as waiting for IO while holding a transaction handle should be avoided as
      much as possible. After that, for fast fsyncs, we started to wait for
      ordered extents to complete before starting to log, which adds some
      latency to fsyncs, and we even got at least one report about a
      performance drop which was bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and the ordered extents to complete. It brings back part of the
      logic we had that extracts checksums from in-flight ordered extents,
      which are not yet in the checksums tree, and makes sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, ordered extent completion is pretty fast, typically taking only
      100 to 200 microseconds in my observations. However, when other tasks are
      accessing these btrees, ordered extent completion can take a lot more
      time due to lock contention on the nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular, when we have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since many tasks are
      writing checksums into the btree while other tasks that already started
      the logging phase are doing checksum lookups in it.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES feature or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered for only part of an extent, where we
      would end up logging the full extent but only the checksums for the
      written range, resulting in missing checksums after log replay. Allowing
      ranged fsyncs to operate again only on the original range, when using the
      NO_HOLES feature and doing a fast fsync, is doable but requires some
      non-trivial changes to the writeback path; that can always be worked on
      later if needed, but I don't think it is a very common use case.
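
      A rough sketch of the checksum logic described above; the helpers marked
      as assumed are illustrative names, not the patch's literal code:

        /* For each modified extent map being logged: prefer checksums from
         * ordered extents still in flight, else read the checksums tree. */
        static int log_extent_csums_sketch(struct btrfs_inode *inode,
                                           u64 start, u64 len)
        {
                struct btrfs_ordered_extent *ordered;

                ordered = btrfs_lookup_ordered_range(inode, start, len);
                if (ordered) {
                        /* Checksums not yet in the tree: copy them from the
                         * ordered extent's list of btrfs_ordered_sum items. */
                        copy_csums_from_ordered(ordered);       /* assumed helper */
                        btrfs_put_ordered_extent(ordered);
                        return 0;
                }
                /* Ordered IO already completed: csums are in the csum tree. */
                return lookup_csums_from_tree(inode, start, len); /* assumed helper */
        }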
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal (no virtualization)
      on a box with 12 cores (Intel i7-8700), 64GiB of RAM and an NVMe device,
      with a kernel configuration that is the default of typical distributions
      (Debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8% throughput, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 27 Jul 2020, 8 commits
    • btrfs: make btrfs_add_ordered_extent_dio take btrfs_inode · c1e09520
      Authored by Nikolay Borisov
      It simply forwards its argument, so let's get rid of one extra BTRFS_I()
      call by taking btrfs_inode directly.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
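
      The pattern for this and the following commits in the series, sketched
      with trimmed parameter lists (not the literal kernel prototypes):

        /* Before: took the VFS inode and converted immediately. */
        int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
                                         u64 num_bytes)
        {
                return __btrfs_add_ordered_extent(BTRFS_I(inode), file_offset,
                                                  num_bytes);
        }

        /* After: takes btrfs_inode directly, so callers that already hold a
         * btrfs_inode no longer round-trip through the VFS inode. */
        int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode,
                                         u64 file_offset, u64 num_bytes)
        {
                return __btrfs_add_ordered_extent(inode, file_offset, num_bytes);
        }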
    • btrfs: make btrfs_dec_test_first_ordered_pending take btrfs_inode · 7095821e
      Authored by Nikolay Borisov
      It doesn't really need the VFS inode, only btrfs_inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_add_ordered_extent_compress take btrfs_inode · 4cc61209
      Authored by Nikolay Borisov
      It simply forwards its inode argument to __btrfs_add_ordered_extent,
      which already takes btrfs_inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_add_ordered_extent take btrfs_inode · e7fbf604
      Authored by Nikolay Borisov
      Preparation for converting its callers to take btrfs_inode.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_lookup_ordered_extent take btrfs_inode · c3504372
      Authored by Nikolay Borisov
      It doesn't use the generic VFS inode for anything, so use btrfs_inode
      directly.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove no longer used trans_list member of struct btrfs_ordered_extent · 3ef64143
      Authored by Filipe Manana
      The 'trans_list' member of an ordered extent was used to keep track of the
      ordered extents for which a transaction commit had to wait. These were
      ordered extents that were started and logged by an fsync. However we don't
      do that anymore, and before we stopped doing it we changed the approach to
      waiting for ordered extents in commit 161c3549 ("Btrfs: change how
      we wait for pending ordered extents"), which stopped using that list;
      the 'trans_list' member has therefore been unused since that commit.
      So just remove it, since it's doing nothing and making each ordered extent
      structure waste memory (two pointers).
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove no longer used log_list member of struct btrfs_ordered_extent · cd8d39f4
      Authored by Filipe Manana
      The 'log_list' member of an ordered extent was used to keep track of which
      ordered extents we needed to wait for after logging metadata, but it has
      been unused since commit 5636cf7d ("btrfs: remove the logged extents
      infrastructure"), as we now always wait on ordered extent completion
      before logging metadata. So just remove it, since it's doing nothing and
      making each ordered extent structure waste more memory (two pointers);
      see the sketch below.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
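
      The two members removed by this pair of commits, sketched in place (a
      list_head is two pointers, so together this saves 32 bytes per ordered
      extent on 64-bit; surrounding fields elided):

        struct btrfs_ordered_extent {
                /* ... */
                struct list_head log_list;      /* removed by cd8d39f4 */
                struct list_head trans_list;    /* removed by 3ef64143 */
                /* ... */
        };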
    • btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak · 7dbeaad0
      Authored by Qu Wenruo
      [BUG]
      The following simple workload from fsstress can lead to qgroup reserved
      data space leak:
        0/0: creat f0 x:0 0 0
        0/0: creat add id=0,parent=-1
        0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
        0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
        0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
      
      This would cause btrfs qgroup to leak 20480 bytes of data reserved
      space.  If the btrfs qgroup limit is enabled, such a leak can lead to
      unexpected early EDQUOT errors and unusable space.
      
      [CAUSE]
      When doing direct IO, the kernel first writes back any existing buffered
      page cache for the range and then invalidates it:
        generic_file_direct_write()
        |- filemap_write_and_wait_range();
        |- invalidate_inode_pages2_range();
      
      However for btrfs, the bi_end_io hook doesn't finish all of its heavy
      work right when the bio ends.  In fact, it defers the work even further:
      
        submit_extent_page(end_io_func=end_bio_extent_writepage);
        end_bio_extent_writepage()
        |- btrfs_writepage_endio_finish_ordered()
           |- btrfs_init_work(finish_ordered_fn);
      
        <<< Work queue execution >>>
        finish_ordered_fn()
        |- btrfs_finish_ordered_io();
           |- Clear qgroup bits
      
      This means that when filemap_write_and_wait_range() returns,
      btrfs_finish_ordered_io() is not guaranteed to have been executed, thus
      the qgroup bits for the related range are not yet cleared.
      
      Now, on to how the leak happens; this focuses only on the range where
      the buffered and direct IO overlap.
      
      1. After buffered write
         The inode had the following range with QGROUP_RESERVED bit:
         	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      2. Writeback part for range [596K, 616K)
         Write back finished, but btrfs_finish_ordered_io() not get called
         yet.
         So we still have:
         	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      3. Pages for range [596K, 616K) get released
         This clears all the qgroup bits but does not update the reserved
         data space.
         So we have:
         	596K		616K
      	|		|
         Qgroup reserved data space: 20K
         That number doesn't match the qgroup bit range anymore.
      
      4. Dio prepares space for range [596K, 700K)
         Qgroup reserves data space for that range, and we get:
         	596K		616K			700K
      	|///////////////|///////////////////////|
         Qgroup reserved data space: 20K + 104K = 124K
      
      5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
         Qgroup frees the reserved space for that range, and we get:
         	596K		616K			700K
      	|		|///////////////////////|
         We need to free that range of reserved space.
         Qgroup reserved data space: 124K - 20K = 104K
      
      6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
         However the qgroup bits for range [596K, 616K) were already cleared
         in the previous step, so we only free 84K of qgroup reserved space.
         	596K		616K			700K
      	|		|			|
         We need to free that range of reserved space.
         Qgroup reserved data space: 104K - 84K = 20K
      
         Now there is no way to release that 20K, short of disabling qgroups
         or unmounting the filesystem.
      
      [FIX]
      This patch changes the timing of the btrfs_qgroup_release/free_data()
      calls, using a buffered COW write as an example.
      
      	The new timing			|	The old timing
      ----------------------------------------+---------------------------------------
       btrfs_buffered_write()			| btrfs_buffered_write()
       |- btrfs_qgroup_reserve_data() 	| |- btrfs_qgroup_reserve_data()
      					|
       btrfs_run_delalloc_range()		| btrfs_run_delalloc_range()
       |- btrfs_add_ordered_extent()  	|
          |- btrfs_qgroup_release_data()	|
       The reserved space is passed	|
       into the btrfs_ordered_extent	|
      					|
       btrfs_finish_ordered_io()		| btrfs_finish_ordered_io()
       |- The reserved space is passed to 	| |- btrfs_qgroup_release_data()
    btrfs_qgroup_record			|    The reserved space is passed
					|    to btrfs_qgroup_record
      					|
       btrfs_qgroup_account_extents()		| btrfs_qgroup_account_extents()
       |- btrfs_qgroup_free_refroot()		| |- btrfs_qgroup_free_refroot()
      
      The point of this change is to ensure that, when ordered extents are
      submitted, the qgroup reserved space has already been released, keeping
      the timing aligned with file_write_and_wait_range().

      That way all qgroup data reserved space is bound to the
      btrfs_ordered_extent, which solves the timing mismatch.
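
      A minimal sketch of the new flow, assuming a field on the ordered extent
      (named qgroup_rsv here) that carries the released amount; names and
      parameter types are simplified from the description, not a verbatim diff:

        /* At ordered extent creation (btrfs_add_ordered_extent): release the
         * reservation right away and stash the amount in the ordered extent,
         * matching the point where filemap_write_and_wait_range() considers
         * the write done. */
        released = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
        if (released > 0)
                entry->qgroup_rsv = released;   /* assumed field */

        /* At btrfs_finish_ordered_io(): pass entry->qgroup_rsv on to the
         * qgroup record instead of releasing the range there, where the
         * QGROUP_RESERVED bits may already be gone. */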
      
      Fixes: f695fdce ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 24 Mar 2020, 2 commits
  5. 20 Jan 2020, 1 commit
  6. 19 Nov 2019, 1 commit
    • Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Authored by Filipe Manana
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop without calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
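
      A simplified sketch of the problematic pattern in
      scrub_enumerate_chunks(), condensed from the description above (not the
      literal kernel code):

        ret = btrfs_inc_block_group_ro(cache);
        /* ... */
        ret = btrfs_wait_ordered_roots(fs_info, ...);   /* overwrites ret */
        if (ret > 0) {
                trans = btrfs_join_transaction(root);
                if (IS_ERR(trans))
                        ret = PTR_ERR(trans);
                else
                        ret = btrfs_commit_transaction(trans);
                if (ret)
                        break;  /* leaves the block group RO forever, since
                                   btrfs_dec_block_group_ro() is never called */
        }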
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 01 Jul 2019, 2 commits
  8. 30 Apr 2019, 1 commit
  9. 17 Dec 2018, 2 commits
  10. 06 Aug 2018, 2 commits
  11. 12 Apr 2018, 1 commit
  12. 26 Mar 2018, 2 commits
    • btrfs: add more __cold annotations · e67c718b
      Authored by David Sterba
      The __cold functions are placed in a special section, as they're
      expected to be called rarely. This can help i-cache prefetches and helps
      the compiler decide which branches are more or less likely to be taken,
      without any other annotations needed.
      
      Though we can't add more __exit annotations, it's still possible to add
      __cold (which is also implied by __exit). That way the following function
      categories are tagged:
      
      - printf wrappers, error messages
      - exit helpers
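
      As applied to one of the printf wrappers, the annotation looks roughly
      like this (btrfs_printk is a real tagged wrapper; treat the exact
      prototype as an assumption):

        __printf(2, 3)
        __cold
        void btrfs_printk(const struct btrfs_fs_info *fs_info,
                          const char *fmt, ...);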
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Don't hardcode the csum size in btrfs_ordered_sum_size · af89e0dc
      Authored by Nikolay Borisov
      Currently the function uses a hardcoded value for the checksum size of
      a sector. This is fine, given that we currently support only a single
      algorithm, whose checksum is 4 bytes == sizeof(u32). Despite not
      having other algorithms, btrfs' design supports using a different
      algorithm with different space requirements. To future-proof the code,
      query the size of the currently used algorithm from the in-memory copy
      of the super block. No functional changes.
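
      A sketch of the resulting helper (close to the description above; treat
      the exact body as an assumption):

        static inline int btrfs_ordered_sum_size(struct btrfs_fs_info *fs_info,
                                                 unsigned long bytes)
        {
                int num_sectors = (int)DIV_ROUND_UP(bytes, fs_info->sectorsize);
                /* was: num_sectors * sizeof(u32), hardcoding the crc32c size */
                int csum_size = btrfs_super_csum_size(fs_info->super_copy);

                return sizeof(struct btrfs_ordered_sum) + num_sectors * csum_size;
        }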
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 30 Jun 2017, 1 commit
  14. 18 Apr 2017, 1 commit
  15. 28 Feb 2017, 1 commit
  16. 17 Feb 2017, 1 commit
  17. 14 Feb 2017, 1 commit
  18. 06 Dec 2016, 1 commit
  19. 30 May 2016, 1 commit
    • Btrfs: fix race setting block group readonly during device replace · f0e9b7d6
      Authored by Filipe Manana
      When we do a device replace, for each device extent we find on the
      source device we set the corresponding block group to readonly mode, to
      prevent writes into it from happening while we are copying the device
      extent from the source to the target device. However, just before we set
      the block group to readonly mode, some concurrent task might have already
      allocated an extent from it, or decided it could perform a nocow write
      into one of its extents, which can make the device replace process miss
      copying an extent. That is because device replace uses the extent tree's
      commit root to search for extents, and only once it finishes searching
      for all extents belonging to the block group does it set the left cursor
      to the logical end address of the block group. This is a problem if the
      respective ordered extents finish while we are searching for extents
      using the extent tree's commit root and no transaction commit happens
      while we iterate the tree, since it's the delayed references created by
      the ordered extents (when they complete) that insert the extent items
      into the extent tree (using the non-commit root, of course).
      Example:
      
                CPU 1                                            CPU 2
      
       btrfs_dev_replace_start()
         btrfs_scrub_dev()
           scrub_enumerate_chunks()
             --> finds device extent belonging
                 to block group X
      
                                     <transaction N starts>
      
                                                            starts buffered write
                                                            against some inode
      
                                                            writepages is run against
                                                             that inode forcing delalloc
                                                            to run
      
                                                            btrfs_writepages()
                                                              extent_writepages()
                                                                extent_write_cache_pages()
                                                                  __extent_writepage()
                                                                    writepage_delalloc()
                                                                      run_delalloc_range()
                                                                        cow_file_range()
                                                                          btrfs_reserve_extent()
                                                                            --> allocates an extent
                                                                                from block group X
                                                                                (which is not yet
                                                                                 in RO mode)
                                                                          btrfs_add_ordered_extent()
                                                                            --> creates ordered extent Y
                                                              flush_epd_write_bio()
                                                                --> bio against the extent from
                                                                    block group X is submitted
      
             btrfs_inc_block_group_ro(bg X)
               --> sets block group X to readonly
      
             scrub_chunk(bg X)
               scrub_stripe(device extent from srcdev)
                 --> keeps searching for extent items
                     belonging to the block group using
                     the extent tree's commit root
                 --> it never blocks due to
                     fs_info->scrub_pause_req as no
                     one tries to commit transaction N
                 --> copies all extents found from the
                     source device into the target device
                 --> finishes search loop
      
                                                              bio completes
      
                                                              ordered extent Y completes
                                                              and creates delayed data
                                                              reference which will add an
                                                              extent item to the extent
                                                              tree when run (typically
                                                              at transaction commit time)
      
                                                                --> so the task doing the
                                                                    scrub/device replace
                                                                    at CPU 1 misses this
                                                                    and does not copy this
                                                                    extent into the new/target
                                                                    device
      
             btrfs_dec_block_group_ro(bg X)
               --> turns block group X back to RW mode
      
             dev_replace->cursor_left is set to the
             logical end offset of block group X
      
      So fix this by waiting for all cow and nocow writes after setting a block
      group to readonly mode.
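
      A sketch of the fix in scrub_enumerate_chunks(); the exact call and its
      arguments are assumptions based on the description:

        ret = btrfs_inc_block_group_ro(root, cache);
        if (ret)
                break;
        /* Wait for in-flight cow and nocow writes that allocated from this
         * block group before it went RO, so their delayed references insert
         * the extent items before we scan the extent tree's commit root. */
        btrfs_wait_ordered_roots(fs_info, -1, cache->key.objectid,
                                 cache->key.offset);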
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
  20. 26 May 2016, 1 commit
  21. 13 May 2016, 1 commit
  22. 22 Oct 2015, 1 commit
    • Btrfs: change how we wait for pending ordered extents · 161c3549
      Authored by Josef Bacik
      We have a mechanism to make sure we don't lose updates for ordered extents
      that were logged in the transaction that is currently running.  We add the
      ordered extent to a transaction list and then the transaction waits on all
      the ordered extents in that list.  However on substantially large file
      systems this list can be extremely large, and can give us soft lockups,
      since the ordered extents don't remove themselves from the list when they
      complete.

      To fix this we simply add a counter to the transaction that is incremented
      any time we have a logged extent that needs to be completed in the current
      transaction.  Then when the ordered extent finally completes it decrements
      the per-transaction counter and wakes up the transaction if we are the
      last ones.  This will eliminate the soft lockups.  Thanks,
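
      A sketch of the counter approach; the field names (pending_ordered,
      pending_wait) are assumptions matching the description:

        /* When an ordered extent is logged in the running transaction: */
        atomic_inc(&trans->pending_ordered);

        /* When that ordered extent completes: */
        if (atomic_dec_and_test(&trans->pending_ordered))
                wake_up(&trans->pending_wait);

        /* At transaction commit time, instead of walking a list: */
        wait_event(trans->pending_wait,
                   atomic_read(&trans->pending_ordered) == 0);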
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>