1. 07 10月, 2020 5 次提交
    • F
      btrfs: reschedule if necessary when logging directory items · bb56f02f
      Filipe Manana 提交于
      Logging directories with many entries can take a significant amount of
      time, and in some cases monopolize a cpu/core for a long time if the
      logging task doesn't happen to block often enough.
      
      Johannes and Lu Fengqi reported test case generic/041 triggering a soft
      lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
      case we log an inode with 3002 hard links, and because the test removed
      one hard link before fsyncing the file, the inode logging causes the
      parent directory do be logged as well, which has 6004 directory items to
      log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
      so it can take a significant amount of time and trigger the soft lockup.
      
      So just make tree-log.c:log_dir_items() reschedule when necessary,
      releasing the current search path before doing so and then resume from
      where it was before the reschedule.
      
      The stack trace produced when the soft lockup happens is the following:
      
      [10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
      [10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
      [10480.284915] irq event stamp: 29646366
      [10480.285987] hardirqs last  enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
      [10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
      [10480.290856] softirqs last  enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
      [10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
      [10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
      [10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
      [10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
      [10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
      [10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
      [10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
      [10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
      [10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
      [10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
      [10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
      [10480.321104] FS:  00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
      [10480.323559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
      [10480.327259] Call Trace:
      [10480.328286]  ? overwrite_item+0x1f0/0x5a0 [btrfs]
      [10480.329784]  __kmalloc+0x831/0xa20
      [10480.331009]  ? btrfs_get_32+0xb0/0x1d0 [btrfs]
      [10480.332464]  overwrite_item+0x1f0/0x5a0 [btrfs]
      [10480.333948]  log_dir_items+0x2ee/0x570 [btrfs]
      [10480.335413]  log_directory_changes+0x82/0xd0 [btrfs]
      [10480.336926]  btrfs_log_inode+0xc9b/0xda0 [btrfs]
      [10480.338374]  ? init_once+0x20/0x20 [btrfs]
      [10480.339711]  btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
      [10480.341257]  ? dget_parent+0x97/0x2e0
      [10480.342480]  btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
      [10480.343977]  btrfs_sync_file+0x24b/0x5e0 [btrfs]
      [10480.345381]  do_fsync+0x38/0x70
      [10480.346483]  __x64_sys_fsync+0x10/0x20
      [10480.347703]  do_syscall_64+0x2d/0x70
      [10480.348891]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [10480.350444] RIP: 0033:0x7fa1dc80970b
      [10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
      [10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
      [10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
      [10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
      [10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
      [10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
      [10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60
      
      Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/Reported-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      CC: stable@vger.kernel.org # 4.4+
      Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bb56f02f
    • F
      btrfs: make fast fsyncs wait only for writeback · 48778179
      Filipe Manana 提交于
      Currently regardless of a full or a fast fsync we always wait for ordered
      extents to complete, and then start logging the inode after that. However
      for fast fsyncs we can just wait for the writeback to complete, we don't
      need to wait for the ordered extents to complete since we use the list of
      modified extents maps to figure out which extents we must log and we can
      get their checksums directly from the ordered extents that are still in
      flight, otherwise look them up from the checksums tree.
      
      Until commit b5e6c3e1 ("btrfs: always wait on ordered extents at
      fsync time"), for fast fsyncs, we used to start logging without even
      waiting for the writeback to complete first, we would wait for it to
      complete after logging, while holding a transaction open, which lead to
      performance issues when using cgroups and probably for other cases too,
      as wait for IO while holding a transaction handle should be avoided as
      much as possible. After that, for fast fsyncs, we started to wait for
      ordered extents to complete before starting to log, which adds some
      latency to fsyncs and we even got at least one report about a performance
      drop which bisected to that particular change:
      
      https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
      
      This change makes fast fsyncs only wait for writeback to finish before
      starting to log the inode, instead of waiting for both the writeback to
      finish and for the ordered extents to complete. This brings back part of
      the logic we had that extracts checksums from in flight ordered extents,
      which are not yet in the checksums tree, and making sure transaction
      commits wait for the completion of ordered extents previously logged
      (by far most of the time they have already completed by the time a
      transaction commit starts, resulting in no wait at all), to avoid any
      data loss if an ordered extent completes after the transaction used to
      log an inode is committed, followed by a power failure.
      
      When there are no other tasks accessing the checksums and the subvolume
      btrees, the ordered extent completion is pretty fast, typically taking
      100 to 200 microseconds only in my observations. However when there are
      other tasks accessing these btrees, ordered extent completion can take a
      lot more time due to lock contention on nodes and leaves of these btrees.
      I've seen cases over 2 milliseconds, which starts to be significant. In
      particular when we do have concurrent fsyncs against different files there
      is a lot of contention on the checksums btree, since we have many tasks
      writing the checksums into the btree and other tasks that already started
      the logging phase are doing lookups for checksums in the btree.
      
      This change also turns all ranged fsyncs into full ranged fsyncs, which
      is something we already did when not using the NO_HOLES features or when
      doing a full fsync. This is to guarantee we never miss checksums due to
      writeback having been triggered only for a part of an extent, and we end
      up logging the full extent but only checksums for the written range, which
      results in missing checksums after log replay. Allowing ranged fsyncs to
      operate again only in the original range, when using the NO_HOLES feature
      and doing a fast fsync is doable but requires some non trivial changes to
      the writeback path, which can always be worked on later if needed, but I
      don't think they are a very common use case.
      
      Several tests were performed using fio for different numbers of concurrent
      jobs, each writing and fsyncing its own file, for both sequential and
      random file writes. The tests were run on bare metal, no virtualization,
      on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
      with a kernel configuration that is the default of typical distributions
      (debian in this case), without debug options enabled (kasan, kmemleak,
      slub debug, debug of page allocations, lock debugging, etc).
      
      The following script that calls fio was used:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 5 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
        WRITE_MODE=$5
      
        if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
          echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
          exit 1
        fi
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=$WRITE_MODE
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The results were the following:
      
      *************************
      *** sequential writes ***
      *************************
      
      ==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
      
      After patch:
      
      WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
      (+9.8%, -8.8% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
      
      After patch:
      
      WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
      (+21.5% throughput, -17.8% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
      
      After patch:
      
      WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
      (+28.7% throughput, -22.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
      
      After patch:
      
      WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
      (+35.6% throughput, -25.2% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
      
      After patch:
      
      WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
      (+34.1% throughput, -25.6% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
      
      After patch:
      
      WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
      (+19.1% throughput, -16.4% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
      
      Before patch:
      
      WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
      
      After patch:
      
      WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
      (+23.1% throughput, -18.7% runtime)
      
      ************************
      ***   random writes  ***
      ************************
      
      ==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
      
      After patch:
      
      WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
      (+0.9% throughput, -1.7% runtime)
      
      ==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
      
      After patch:
      
      WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
      (+2.3% throughput, -2.0% runtime)
      
      ==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
      
      After patch:
      
      WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
      (+15.6% throughput, -13.3% runtime)
      
      ==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
      
      After patch:
      
      WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
      (+11.6% throughput, -10.7% runtime)
      
      ==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
      
      After patch:
      
      WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
      (+3.8% throughput, -3.8% runtime)
      
      ==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
      
      After patch:
      
      WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
      (+12.7% throughput, -11.2% runtime)
      
      ==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
      
      Before patch:
      
      WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
      
      After patch:
      
      WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
      (+6.3% throughput, -6.0% runtime)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      48778179
    • F
      btrfs: do not commit logs and transactions during link and rename operations · 75b463d2
      Filipe Manana 提交于
      Since commit d4682ba0 ("Btrfs: sync log after logging new name") we
      started to commit logs, and fallback to transaction commits when we failed
      to log the new names or commit the logs, after link and rename operations
      when the target inodes (or their parents) were previously logged in the
      current transaction. This was to avoid losing directories despite an
      explicit fsync on them when they are ancestors of some inode that got a
      new named logged, due to a link or rename operation. However that adds the
      cost of starting IO and waiting for it to complete, which can cause higher
      latencies for applications.
      
      Instead of doing that, just make sure that when we log a new name for an
      inode we don't mark any of its ancestors as logged, so that if any one
      does an fsync against any of them, without doing any other change on them,
      the fsync commits the log. This way we only pay the cost of a log commit
      (or a transaction commit if something goes wrong or a new block group was
      created) if the application explicitly asks to fsync any of the parent
      directories.
      
      Using dbench, which mixes several filesystems operations including renames,
      revealed some significant latency gains. The following script that uses
      dbench was used to test this:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=16
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -t 300 -D $MNT $THREADS
      
        umount $MNT
      
      The test was run on bare metal, no virtualization, on a box with 12 cores
      (Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
      configuration that is the default of typical distributions (debian in this
      case), without debug options enabled (kasan, kmemleak, slub debug, debug
      of page allocations, lock debugging, etc).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    10750455     0.011   155.088
       Close         7896674     0.001     0.243
       Rename         455222     2.158  1101.947
       Unlink        2171189     0.067   121.638
       Deltree           256     2.425     7.816
       Mkdir             128     0.002     0.003
       Qpathinfo     9744323     0.006    21.370
       Qfileinfo     1707092     0.001     0.146
       Qfsinfo       1786756     0.001    11.228
       Sfileinfo      875612     0.003    21.263
       Find          3767281     0.025     9.617
       WriteX        5356924     0.011   211.390
       ReadX        16852694     0.003     9.442
       LockX           35008     0.002     0.119
       UnlockX         35008     0.001     0.138
       Flush          753458     4.252  1102.249
      
      Throughput 1128.35 MB/sec  16 clients  16 procs  max_latency=1102.255 ms
      
      Results after this patch:
      
      16 clients, after
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    11471098     0.012   448.281
       Close         8426396     0.001     0.925
       Rename         485746     0.123   267.183
       Unlink        2316477     0.080    63.433
       Deltree           288     2.830    11.144
       Mkdir             144     0.003     0.010
       Qpathinfo    10397420     0.006    10.288
       Qfileinfo     1822039     0.001     0.169
       Qfsinfo       1906497     0.002    14.039
       Sfileinfo      934433     0.004     2.438
       Find          4019879     0.026    10.200
       WriteX        5718932     0.011   200.985
       ReadX        17981671     0.003    10.036
       LockX           37352     0.002     0.076
       UnlockX         37352     0.001     0.109
       Flush          804018     5.015   778.033
      
      Throughput 1201.98 MB/sec  16 clients  16 procs  max_latency=778.036 ms
      (+6.5% throughput, -29.4% max latency, -75.8% rename latency)
      
      Test case generic/498 from fstests tests the scenario that the previously
      mentioned commit fixed.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      75b463d2
    • F
      btrfs: do not take the log_mutex of the subvolume when pinning the log · 5522a27e
      Filipe Manana 提交于
      During a rename we pin the log to make sure no one commits a log that
      reflects an ongoing rename operation, as it might result in a committed
      log where it recorded the unlink of the old name without having recorded
      the new name. However we are taking the subvolume's log_mutex before
      incrementing the log_writers counter, which is not necessary since that
      counter is atomic and we only remove the old name from the log and add
      the new name to the log after we have incremented log_writers, ensuring
      that no one can commit the log after we have removed the old name from
      the log and before we added the new name to the log.
      
      By taking the log_mutex lock we are just adding unnecessary contention on
      the lock, which can become visible for workloads that mix renames with
      fsyncs, writes for files opened with O_SYNC and unlink operations (if the
      inode or its parent were fsynced before in the current transaction).
      
      So just remove the lock and unlock of the subvolume's log_mutex at
      btrfs_pin_log_trans().
      
      Using dbench, which mixes different types of operations that end up taking
      that mutex (fsyncs, renames, unlinks and writes into files opened with
      O_SYNC) revealed some small gains. The following script that calls dbench
      was used:
      
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd -o space_cache=v2"
        MKFS_OPTIONS="-m single -d single"
        THREADS=32
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -s -t 600 -D $MNT $THREADS
      
        umount $MNT
      
      The test was run on bare metal, no virtualization, on a box with 12 cores
      (Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
      configuration that is the default of typical distributions (debian in this
      case), without debug options enabled (kasan, kmemleak, slub debug, debug
      of page allocations, lock debugging, etc).
      
      Results before this patch:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4410848     0.017   738.640
       Close        3240222     0.001     0.834
       Rename        186850     7.478  1272.476
       Unlink        890875     0.128   785.018
       Deltree          128     2.846    12.081
       Mkdir             64     0.002     0.003
       Qpathinfo    3997659     0.009    11.171
       Qfileinfo     701307     0.001     0.478
       Qfsinfo       733494     0.002     1.103
       Sfileinfo     359362     0.004     3.266
       Find         1546226     0.041     4.128
       WriteX       2202803     7.905  1376.989
       ReadX        6917775     0.003     3.887
       LockX          14392     0.002     0.043
       UnlockX        14392     0.001     0.085
       Flush         309225     0.128  1033.936
      
      Throughput 231.555 MB/sec (sync open)  32 clients  32 procs  max_latency=1376.993 ms
      
      Results after this patch:
      
      Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4603244     0.017   232.776
       Close        3381299     0.001     1.041
       Rename        194871     7.251  1073.165
       Unlink        929730     0.133   119.233
       Deltree          128     2.871    10.199
       Mkdir             64     0.002     0.004
       Qpathinfo    4171343     0.009    11.317
       Qfileinfo     731227     0.001     1.635
       Qfsinfo       765079     0.002     3.568
       Sfileinfo     374881     0.004     1.220
       Find         1612964     0.041     4.675
       WriteX       2296720     7.569  1178.204
       ReadX        7213633     0.003     3.075
       LockX          14976     0.002     0.076
       UnlockX        14976     0.001     0.061
       Flush         322635     0.102   579.505
      
      Throughput 241.4 MB/sec (sync open)  32 clients  32 procs  max_latency=1178.207 ms
      (+4.3% throughput, -14.4% max latency)
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5522a27e
    • R
      btrfs: delete duplicated words + other fixes in comments · 260db43c
      Randy Dunlap 提交于
      Delete repeated words in fs/btrfs/.
      {to, the, a, and old}
      and change "into 2 part" to "into 2 parts".
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      260db43c
  2. 21 8月, 2020 1 次提交
    • J
      btrfs: check the right error variable in btrfs_del_dir_entries_in_log · fb2fecba
      Josef Bacik 提交于
      With my new locking code dbench is so much faster that I tripped over a
      transaction abort from ENOSPC.  This turned out to be because
      btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
      function sets err on error, and returns err.  So instead of properly
      marking the inode as needing a full commit, we were returning -ENOSPC
      and aborting in __btrfs_unlink_inode.  Fix this by checking the proper
      variable so that we return the correct thing in the case of ENOSPC.
      
      The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
      can return -ENOENT if the dir item isn't in the tree log (which would
      happen if we hadn't fsync'ed this guy).  We actually handle that case in
      __btrfs_unlink_inode, so it's an expected error to get back.
      
      Fixes: 4a500fd1 ("Btrfs: Metadata ENOSPC handling for tree log")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add note and comment about ENOENT ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fb2fecba
  3. 11 8月, 2020 1 次提交
  4. 27 7月, 2020 6 次提交
    • F
      btrfs: reduce contention on log trees when logging checksums · 3ebac17c
      Filipe Manana 提交于
      The possibility of extents being shared (through clone and deduplication
      operations) requires special care when logging data checksums, to avoid
      having a log tree with different checksum items that cover ranges which
      overlap (which resulted in missing checksums after replaying a log tree).
      Such problems were fixed in the past by the following commits:
      
      commit 40e046ac ("Btrfs: fix missing data checksums after replaying a
                            log tree")
      
      commit e289f03e ("btrfs: fix corrupt log due to concurrent fsync of
                            inodes with shared extents")
      
      Test case generic/588 exercises the scenario solved by the first commit
      (purely sequential and deterministic) while test case generic/457 often
      triggered the case fixed by the second commit (not deterministic, requires
      specific timings under concurrency).
      
      The problems were addressed by deleting, from the log tree, any existing
      checksums before logging the new ones. And also by doing the deletion and
      logging of the cheksums while locking the checksum range in an extent io
      tree (root->log_csum_range), to deal with the case where we have concurrent
      fsyncs against files with shared extents.
      
      That however causes more contention on the leaves of a log tree where we
      store checksums (and all the nodes in the paths leading to them), even
      when we do not have shared extents, or all the shared extents were created
      by past transactions. It also adds a bit of contention on the spin lock of
      the log_csums_range extent io tree of the log root.
      
      This change adds a 'last_reflink_trans' field to the inode to keep track
      of the last transaction where a new extent was shared between inodes
      (through clone and deduplication operations). It is updated for both the
      source and destination inodes of reflink operations whenever a new extent
      (created in the current transaction) becomes shared by the inodes. This
      field is kept in memory only, not persisted in the inode item, similar
      to other existing fields (last_unlink_trans, logged_trans).
      
      When logging checksums for an extent, if the value of 'last_reflink_trans'
      is smaller then the current transaction's generation/id, we skip locking
      the extent range and deletion of checksums from the log tree, since we
      know we do not have new shared extents. This reduces contention on the
      log tree's leaves where checksums are stored.
      
      The following script, which uses fio, was used to measure the impact of
      this change:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following (the last line of fio's output was
      pasted). Starting with 16 jobs is where a significant difference is
      observable in this particular setup and hardware (differences highlighted
      below). The very small differences for tests with less than 16 jobs are
      possibly just noise and random.
      
          **** 1 job, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
      
      after this change:
      
      WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
      
          **** 2 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
      
      after this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
      
          **** 4 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
      
      after this change:
      
      WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
      
          **** 8 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
      
      after this change:
      
      WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
      
          **** 16 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
      
      after this change:
      
      WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
      (+40.1% throughput, -28.5% runtime)
      
          **** 32 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
      
      after this change:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
      (+29.4% throughput, -22.7% runtime)
      
          **** 64 jobs, file size 512M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
      
      after this change:
      
      WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
      (+29.3% throughput, -22.7% runtime)
      
          **** 128 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
      
      after this change:
      
      WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
      (+121.2% throughput, -54.6% runtime)
      
          **** 256 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
      
      after this change:
      
      WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
      (+74.9% throughput, -42.8% runtime)
      
          **** 512 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
      
      after this change:
      
      WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
      (+65.1% throughput, -39.3% runtime)
      
          **** 1024 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
      
      after this change:
      
      WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
      (+48.7% throughput, -33.1% runtime)
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ebac17c
    • F
      btrfs: remove no longer needed use of log_writers for the log root tree · a93e0168
      Filipe Manana 提交于
      When syncing the log, we used to update the log root tree without holding
      neither the log_mutex of the subvolume root nor the log_mutex of log root
      tree.
      
      We used to have two critical sections delimited by the log_mutex of the
      log root tree, so in the first one we incremented the log_writers of the
      log root tree and on the second one we decremented it and waited for the
      log_writers counter to go down to zero. This was because the update of
      the log root tree happened between the two critical sections.
      
      The use of two critical sections allowed a little bit more of parallelism
      and required the use of the log_writers counter, necessary to make sure
      we didn't miss any log root tree update when we have multiple tasks trying
      to sync the log in parallel.
      
      However after commit 06989c79 ("Btrfs: fix race updating log root
      item during fsync") the log root tree update was moved into a critical
      section delimited by the subvolume's log_mutex. Later another commit
      moved the log tree update from that critical section into the second
      critical section delimited by the log_mutex of the log root tree. Both
      commits addressed different bugs.
      
      The end result is that the first critical section delimited by the
      log_mutex of the log root tree became pointless, since there's nothing
      done between it and the second critical section, we just have an unlock
      of the log_mutex followed by a lock operation. This means we can merge
      both critical sections, as the first one does almost nothing now, and we
      can stop using the log_writers counter of the log root tree, which was
      incremented in the first critical section and decremented in the second
      criticial section, used to make sure no one in the second critical section
      started writeback of the log root tree before some other task updated it.
      
      So just remove the mutex_unlock() followed by mutex_lock() of the log root
      tree, as well as the use of the log_writers counter for the log root tree.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a93e0168
    • F
      btrfs: stop incremening log_batch for the log root tree when syncing log · 28a95795
      Filipe Manana 提交于
      We are incrementing the log_batch atomic counter of the root log tree but
      we never use that counter, it's used only for the log trees of subvolume
      roots. We started doing it when we moved the log_batch and log_write
      counters from the global, per fs, btrfs_fs_info structure, into the
      btrfs_root structure in commit 7237f183 ("Btrfs: fix tree logs
      parallel sync").
      
      So just stop doing it for the log root tree and add a comment over the
      field declaration so inform it's used only for log trees of subvolume
      roots.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      28a95795
    • F
      btrfs: only commit delayed items at fsync if we are logging a directory · 5aa7d1a7
      Filipe Manana 提交于
      When logging an inode we are committing its delayed items if either the
      inode is a directory or if it is a new inode, created in the current
      transaction.
      
      We need to do it for directories, since new directory indexes are stored
      as delayed items of the inode and when logging a directory we need to be
      able to access all indexes from the fs/subvolume tree in order to figure
      out which index ranges need to be logged.
      
      However for new inodes that are not directories, we do not need to do it
      because the only type of delayed item they can have is the inode item, and
      we are guaranteed to always log an up to date version of the inode item:
      
      *) for a full fsync we do it by committing the delayed inode and then
         copying the item from the fs/subvolume tree with
         copy_inode_items_to_log();
      
      *) for a fast fsync we always log the inode item based on the contents of
         the in-memory struct btrfs_inode. We guarantee this is always done since
         commit e4545de5 ("Btrfs: fix fsync data loss after append write").
      
      So stop running delayed items for a new inodes that are not directories,
      since that forces committing the delayed inode into the fs/subvolume tree,
      wasting time and adding contention to the tree when a full fsync is not
      required. We will only do it in case a fast fsync is needed.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5aa7d1a7
    • F
      btrfs: only commit the delayed inode when doing a full fsync · 8c8648dd
      Filipe Manana 提交于
      Commit 2c2c452b ("Btrfs: fix fsync when extend references are added
      to an inode") forced a commit of the delayed inode when logging an inode
      in order to ensure we would end up logging the inode item during a full
      fsync. By committing the delayed inode, we updated the inode item in the
      fs/subvolume tree and then later when copying items from leafs modified in
      the current transaction into the log tree (with copy_inode_items_to_log())
      we ended up copying the inode item from the fs/subvolume tree into the log
      tree. Logging an up to date version of the inode item is required to make
      sure at log replay time we get the link count fixup triggered among other
      things (replay xattr deletes, etc). The test case generic/040 from fstests
      exercises the bug which that commit fixed.
      
      However for a fast fsync we don't need to commit the delayed inode because
      we always log an up to date version of the inode item based on the struct
      btrfs_inode we have in-memory. We started doing this for fast fsyncs since
      commit e4545de5 ("Btrfs: fix fsync data loss after append write").
      
      So just stop committing the delayed inode if we are doing a fast fsync,
      we are only wasting time and adding contention on fs/subvolume tree.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c8648dd
    • N
      btrfs: make __btrfs_drop_extents take btrfs_inode · 906c448c
      Nikolay Borisov 提交于
      It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
      interface with the non  __ prefixed version. Will also makes converting
      its callers to btrfs_inode easier.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      906c448c
  5. 17 6月, 2020 1 次提交
    • F
      btrfs: check if a log root exists before locking the log_mutex on unlink · e7a79811
      Filipe Manana 提交于
      This brings back an optimization that commit e678934c ("btrfs:
      Remove unnecessary check from join_running_log_trans") removed, but in
      a different form. So it's almost equivalent to a revert.
      
      That commit removed an optimization where we avoid locking a root's
      log_mutex when there is no log tree created in the current transaction.
      The affected code path is triggered through unlink operations.
      
      That commit was based on the assumption that the optimization was not
      necessary because we used to have the following checks when the patch
      was authored:
      
        int btrfs_del_dir_entries_in_log(...)
        {
              (...)
              if (dir->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
         int btrfs_del_inode_ref_in_log(...)
         {
              (...)
              if (inode->logged_trans < trans->transid)
                  return 0;
      
              ret = join_running_log_trans(root);
              (...)
         }
      
      However before that patch was merged, another patch was merged first which
      replaced those checks because they were buggy.
      
      That other patch corresponds to commit 803f0f64 ("Btrfs: fix fsync
      not persisting dentry deletions due to inode evictions"). The assumption
      that if the logged_trans field of an inode had a smaller value then the
      current transaction's generation (transid) meant that the inode was not
      logged in the current transaction was only correct if the inode was not
      evicted and reloaded in the current transaction. So the corresponding bug
      fix changed those checks and replaced them with the following helper
      function:
      
        static bool inode_logged(struct btrfs_trans_handle *trans,
                                 struct btrfs_inode *inode)
        {
              if (inode->logged_trans == trans->transid)
                      return true;
      
              if (inode->last_trans == trans->transid &&
                  test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
                  !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
                      return true;
      
              return false;
        }
      
      So if we have a subvolume without a log tree in the current transaction
      (because we had no fsyncs), every time we unlink an inode we can end up
      trying to lock the log_mutex of the root through join_running_log_trans()
      twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
      and once for the parent directory (with btrfs_del_dir_entries_in_log()).
      
      This means if we have several unlink operations happening in parallel for
      inodes in the same subvolume, and the those inodes and/or their parent
      inode were changed in the current transaction, we end up having a lot of
      contention on the log_mutex.
      
      The test robots from intel reported a -30.7% performance regression for
      a REAIM test after commit e678934c ("btrfs: Remove unnecessary check
      from join_running_log_trans").
      
      So just bring back the optimization to join_running_log_trans() where we
      check first if a log root exists before trying to lock the log_mutex. This
      is done by checking for a bit that is set on the root when a log tree is
      created and removed when a log tree is freed (at transaction commit time).
      
      Commit e678934c ("btrfs: Remove unnecessary check from
      join_running_log_trans") was merged in the 5.4 merge window while commit
      803f0f64 ("Btrfs: fix fsync not persisting dentry deletions due to
      inode evictions") was merged in the 5.3 merge window. But the first
      commit was actually authored before the second commit (May 23 2019 vs
      June 19 2019).
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
      Fixes: e678934c ("btrfs: Remove unnecessary check from join_running_log_trans")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e7a79811
  6. 25 5月, 2020 7 次提交
    • F
      btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents · e289f03e
      Filipe Manana 提交于
      When we have extents shared amongst different inodes in the same subvolume,
      if we fsync them in parallel we can end up with checksum items in the log
      tree that represent ranges which overlap.
      
      For example, consider we have inodes A and B, both sharing an extent that
      covers the logical range from X to X + 64KiB:
      
      1) Task A starts an fsync on inode A;
      
      2) Task B starts an fsync on inode B;
      
      3) Task A calls btrfs_csum_file_blocks(), and the first search in the
         log tree, through btrfs_lookup_csum(), returns -EFBIG because it
         finds an existing checksum item that covers the range from X - 64KiB
         to X;
      
      4) Task A checks that the checksum item has not reached the maximum
         possible size (MAX_CSUM_ITEMS) and then releases the search path
         before it does another path search for insertion (through a direct
         call to btrfs_search_slot());
      
      5) As soon as task A releases the path and before it does the search
         for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
         too, because there is an existing checksum item that has an end
         offset that matches the start offset (X) of the checksum range we want
         to log;
      
      6) Task B releases the path;
      
      7) Task A does the path search for insertion (through btrfs_search_slot())
         and then verifies that the checksum item that ends at offset X still
         exists and extends its size to insert the checksums for the range from
         X to X + 64KiB;
      
      8) Task A releases the path and returns from btrfs_csum_file_blocks(),
         having inserted the checksums into an existing checksum item that got
         its size extended. At this point we have one checksum item in the log
         tree that covers the logical range from X - 64KiB to X + 64KiB;
      
      9) Task B now does a search for insertion using btrfs_search_slot() too,
         but it finds that the previous checksum item no longer ends at the
         offset X, it now ends at an of offset X + 64KiB, so it leaves that item
         untouched.
      
         Then it releases the path and calls btrfs_insert_empty_item()
         that inserts a checksum item with a key offset corresponding to X and
         a size for inserting a single checksum (4 bytes in case of crc32c).
         Subsequent iterations end up extending this new checksum item so that
         it contains the checksums for the range from X to X + 64KiB.
      
         So after task B returns from btrfs_csum_file_blocks() we end up with
         two checksum items in the log tree that have overlapping ranges, one
         for the range from X - 64KiB to X + 64KiB, and another for the range
         from X to X + 64KiB.
      
      Having checksum items that represent ranges which overlap, regardless of
      being in the log tree or in the chekcsums tree, can lead to problems where
      checksums for a file range end up not being found. This type of problem
      has happened a few times in the past and the following commits fixed them
      and explain in detail why having checksum items with overlapping ranges is
      problematic:
      
        27b9a812 "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
        b84b8390 "Btrfs: fix file read corruption after extent cloning and fsync"
        40e046ac "Btrfs: fix missing data checksums after replaying a log tree"
      
      Since this specific instance of the problem can only happen when logging
      inodes, because it is the only case where concurrent attempts to insert
      checksums for the same range can happen, fix the issue by using an extent
      io tree as a range lock to serialize checksum insertion during inode
      logging.
      
      This issue could often be reproduced by the test case generic/457 from
      fstests. When it happens it produces the following trace:
      
       BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
       BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
       BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
            item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
            item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
            item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
            item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
            item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
            item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
       (...)
       BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Modules linked in: btrfs dm_thin_pool ...
       CPU: 1 PID: 15884 Comm: fsx Tainted: G        W         5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
       Code: c7 c7 ...
       RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
       RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
       RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
       R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
       FS:  00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btree_submit_bio_hook+0x67/0xc0 [btrfs]
        submit_one_bio+0x31/0x50 [btrfs]
        btree_write_cache_pages+0x2db/0x4b0 [btrfs]
        ? __filemap_fdatawrite_range+0xb1/0x110
        do_writepages+0x23/0x80
        __filemap_fdatawrite_range+0xd2/0x110
        btrfs_write_marked_extents+0x15e/0x180 [btrfs]
        btrfs_sync_log+0x206/0x10a0 [btrfs]
        ? kmem_cache_free+0x315/0x3b0
        ? btrfs_log_inode+0x1e8/0xf90 [btrfs]
        ? __mutex_unlock_slowpath+0x45/0x2a0
        ? lockref_put_or_lock+0x9/0x30
        ? dput+0x2d/0x580
        ? dput+0xb5/0x580
        ? btrfs_sync_file+0x464/0x4d0 [btrfs]
        btrfs_sync_file+0x464/0x4d0 [btrfs]
        do_fsync+0x38/0x60
        __x64_sys_fsync+0x10/0x20
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fb41953a6d0
       Code: 48 3d ...
       RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
       RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
       RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
       R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
       R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace d543fc76f5ad7fd8 ]---
      
      In that trace the tree checker detected the overlapping checksum items at
      the time when we triggered writeback for the log tree when syncing the
      log.
      
      Another trace that can happen is due to BUG_ON() when deleting checksum
      items while logging an inode:
      
       BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
       BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
       BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
        item 0 key (257 1 0) itemoff 16123 itemsize 160
                inode generation 7 size 262144 mode 100600
        item 1 key (257 12 256) itemoff 16103 itemsize 20
        item 2 key (257 108 0) itemoff 16050 itemsize 53
                extent data disk bytenr 13631488 nr 4096
                extent data offset 0 nr 131072 ram 131072
       (...)
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:3153!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
       Code: 0f b6 ...
       RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
       RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
       RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
       R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
       R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
       FS:  00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_del_csums+0x2f4/0x540 [btrfs]
        copy_items+0x4b5/0x560 [btrfs]
        btrfs_log_inode+0x910/0xf90 [btrfs]
        btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        ? dget_parent+0x5/0x370
        btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        btrfs_sync_file+0x42b/0x4d0 [btrfs]
        __x64_sys_msync+0x199/0x200
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7fe586c65760
       Code: 00 f7 ...
       RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
       RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
       RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
       R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
       R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
       Modules linked in: dm_log_writes ...
       ---[ end trace c92a7f447a8515f5 ]---
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e289f03e
    • D
      btrfs: simplify iget helpers · 0202e83f
      David Sterba 提交于
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only inode number, renaming it to 'ino'
      instead of 'objectid'. This allows to remove temporary variables key,
      saving some stack space.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0202e83f
    • D
      btrfs: simplify root lookup by id · 56e9357a
      David Sterba 提交于
      The main function to lookup a root by its id btrfs_get_fs_root takes the
      whole key, while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root that does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56e9357a
    • D
      btrfs: don't use set/get token for single assignment in overwrite_item · 60d48e2e
      David Sterba 提交于
      The set/get token is supposed to cache the last page that was accessed
      so it speeds up subsequential access to the eb. It does not make sense
      to use that for just one change, which is the case of inode size in
      overwrite_item.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      60d48e2e
    • D
      btrfs: drop eb parameter from set/get token helpers · cc4c13d5
      David Sterba 提交于
      Now that all set/get helpers use the eb from the token, we don't need to
      pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some
      stack space.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cc4c13d5
    • F
      btrfs: remove useless check for copy_items() return value · 0bc2d3c0
      Filipe Manana 提交于
      At btrfs_log_prealloc_extents() we are checking if copy_items() returns a
      value greater than 0. That used to happen in the past to signal the caller
      that the path given to it was released and reused for other searches, but
      as of commit 0e56315c ("Btrfs: fix missing hole after hole punching
      and fsync when using NO_HOLES"), the copy_items() function does not have
      that behaviour anymore and always returns 0 or a negative value. So just
      remove that check at btrfs_log_prealloc_extents(), which the previously
      mentioned commit forgot to remove.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0bc2d3c0
    • Q
      btrfs: remove the redundant parameter level in btrfs_bin_search() · e3b83361
      Qu Wenruo 提交于
      All callers pass the eb::level so we can get read it directly inside the
      btrfs_bin_search and key_search.
      
      This is inspired by the work of Marek in U-boot.
      
      CC: Marek Behun <marek.behun@nic.cz>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e3b83361
  7. 27 4月, 2020 1 次提交
    • F
      btrfs: fix partial loss of prealloc extent past i_size after fsync · f135cea3
      Filipe Manana 提交于
      When we have an inode with a prealloc extent that starts at an offset
      lower than the i_size and there is another prealloc extent that starts at
      an offset beyond i_size, we can end up losing part of the first prealloc
      extent (the part that starts at i_size) and have an implicit hole if we
      fsync the file and then have a power failure.
      
      Consider the following example with comments explaining how and why it
      happens.
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our test file with 2 consecutive prealloc extents, each with a
        # size of 128Kb, and covering the range from 0 to 256Kb, with a file
        # size of 0.
        $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
        $ xfs_io -c "falloc -k 128K 128K" /mnt/foo
      
        # Fsync the file to record both extents in the log tree.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now do a redudant extent allocation for the range from 0 to 64Kb.
        # This will merely increase the file size from 0 to 64Kb. Instead we
        # could also do a truncate to set the file size to 64Kb.
        $ xfs_io -c "falloc 0 64K" /mnt/foo
      
        # Fsync the file, so we update the inode item in the log tree with the
        # new file size (64Kb). This also ends up setting the number of bytes
        # for the first prealloc extent to 64Kb. This is done by the truncation
        # at btrfs_log_prealloc_extents().
        # This means that if a power failure happens after this, a write into
        # the file range 64Kb to 128Kb will not use the prealloc extent and
        # will result in allocation of a new extent.
        $ xfs_io -c "fsync" /mnt/foo
      
        # Now set the file size to 256K with a truncate and then fsync the file.
        # Since no changes happened to the extents, the fsync only updates the
        # i_size in the inode item at the log tree. This results in an implicit
        # hole for the file range from 64Kb to 128Kb, something which fsck will
        # complain when not using the NO_HOLES feature if we replay the log
        # after a power failure.
        $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
      
      So instead of always truncating the log to the inode's current i_size at
      btrfs_log_prealloc_extents(), check first if there's a prealloc extent
      that starts at an offset lower than the i_size and with a length that
      crosses the i_size - if there is one, just make sure we truncate to a
      size that corresponds to the end offset of that prealloc extent, so
      that we don't lose the part of that extent that starts at i_size if a
      power failure happens.
      
      A test case for fstests follows soon.
      
      Fixes: 31d11b83 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f135cea3
  8. 09 4月, 2020 1 次提交
    • F
      btrfs: make full fsyncs always operate on the entire file again · 7af59743
      Filipe Manana 提交于
      This is a revert of commit 0a8068a3 ("btrfs: make ranged full
      fsyncs more efficient"), with updated comment in btrfs_sync_file.
      
      Commit 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      made full fsyncs operate on the given range only as it assumed it was safe
      when using the NO_HOLES feature, since the hole detection was simplified
      some time ago and no longer was a source for races with ordered extent
      completion of adjacent file ranges.
      
      However it's still not safe to have a full fsync only operate on the given
      range, because extent maps for new extents might not be present in memory
      due to inode eviction or extent cloning. Consider the following example:
      
      1) We are currently at transaction N;
      
      2) We write to the file range [0, 1MiB);
      
      3) Writeback finishes for the whole range and ordered extents complete,
         while we are still at transaction N;
      
      4) The inode is evicted;
      
      5) We open the file for writing, causing the inode to be loaded to
         memory again, which sets the 'full sync' bit on its flags. At this
         point the inode's list of modified extent maps is empty (figuring
         out which extents were created in the current transaction and were
         not yet logged by an fsync is expensive, that's why we set the
         'full sync' bit when loading an inode);
      
      6) We write to the file range [512KiB, 768KiB);
      
      7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
         This correctly flushes this range and logs its extent into the log
         tree. When the writeback started an extent map for range [512KiB, 768KiB)
         was added to the inode's list of modified extents, and when the fsync()
         finishes logging it removes that extent map from the list of modified
         extent maps. This fsync also clears the 'full sync' bit;
      
      8) We do a regular fsync() (full ranged). This fsync() ends up doing
         nothing because the inode's list of modified extents is empty and
         no other changes happened since the previous ranged fsync(), so
         it just returns success (0) and we end up never logging extents for
         the file ranges [0, 512KiB) and [768KiB, 1MiB).
      
      Another scenario where this can happen is if we replace steps 2 to 4 with
      cloning from another file into our test file, as that sets the 'full sync'
      bit in our inode's flags and does not populate its list of modified extent
      maps.
      
      This was causing test case generic/457 to fail sporadically when using the
      NO_HOLES feature, as it exercised this later case where the inode has the
      'full sync' bit set and has no extent maps in memory to represent the new
      extents due to extent cloning.
      
      Fix this by reverting commit 0a8068a3 ("btrfs: make ranged full fsyncs
      more efficient") since there is no easy way to work around it.
      
      Fixes: 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7af59743
  9. 24 3月, 2020 15 次提交
  10. 24 1月, 2020 1 次提交
    • F
      Btrfs: fix infinite loop during fsync after rename operations · b5e4ff9d
      Filipe Manana 提交于
      Recently fsstress (from fstests) sporadically started to trigger an
      infinite loop during fsync operations. This turned out to be because
      support for the rename exchange and whiteout operations was added to
      fsstress in fstests. These operations, unlike any others in fsstress,
      cause file names to be reused, whence triggering this issue. However
      it's not necessary to use rename exchange and rename whiteout operations
      trigger this issue, simple rename operations and file creations are
      enough to trigger the issue.
      
      The issue boils down to when we are logging inodes that conflict (that
      had the name of any inode we need to log during the fsync operation), we
      keep logging them even if they were already logged before, and after
      that we check if there's any other inode that conflicts with them and
      then add it again to the list of inodes to log. Skipping already logged
      inodes fixes the issue.
      
      Consider the following example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir                           # inode 257
      
        $ touch /mnt/testdir/zz                        # inode 258
        $ ln /mnt/testdir/zz /mnt/testdir/zz_link
      
        $ touch /mnt/testdir/a                         # inode 259
      
        $ sync
      
        # The following 3 renames achieve the same result as a rename exchange
        # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
      
        $ mv /mnt/testdir/a /mnt/testdir/a/tmp
        $ mv /mnt/testdir/zz_link /mnt/testdir/a
        $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link
      
        # The following rename and file creation give the same result as a
        # rename whiteout operation (<rename_whiteout> zz to a2).
      
        $ mv /mnt/testdir/zz /mnt/testdir/a2
        $ touch /mnt/testdir/zz                        # inode 260
      
        $ xfs_io -c fsync /mnt/testdir/zz
          --> results in the infinite loop
      
      The following steps happen:
      
      1) When logging inode 260, we find that its reference named "zz" was
         used by inode 258 in the previous transaction (through the commit
         root), so inode 258 is added to the list of conflicting indoes that
         need to be logged;
      
      2) After logging inode 258, we find that its reference named "a" was
         used by inode 259 in the previous transaction, and therefore we add
         inode 259 to the list of conflicting inodes to be logged;
      
      3) After logging inode 259, we find that its reference named "zz_link"
         was used by inode 258 in the previous transaction - we add inode 258
         to the list of conflicting inodes to log, again - we had already
         logged it before at step 3. After logging it again, we find again
         that inode 259 conflicts with him, and we add again 259 to the list,
         etc - we end up repeating all the previous steps.
      
      So fix this by skipping logging of conflicting inodes that were already
      logged.
      
      Fixes: 6b5fc433 ("Btrfs: fix fsync after succession of renames of different files")
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b5e4ff9d
  11. 20 1月, 2020 1 次提交
    • N
      btrfs: Remove redundant WARN_ON in walk_down_log_tree · 36ee0b44
      Nikolay Borisov 提交于
      level <0 and level >= BTRFS_MAX_LEVEL are already performed upon
      extent buffer read by tree checker in btrfs_check_node.
      go. As far as 'level <= 0'  we are guaranteed that level is '> 0'
      because the value of level _before_ reading 'next' is larger than 1
      (otherwise we wouldn't have executed that code at all) this in turn
      guarantees that 'level' after btrfs_read_buffer is 'level - 1' since
      we verify this invariant in:
      
          btrfs_read_buffer
           btree_read_extent_buffer_pages
            btrfs_verify_level_key
      
      This guarantees that level can never be '<= 0' so the warn on is
      never triggered.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      36ee0b44