1. 07 10月, 2020 29 次提交
  2. 21 8月, 2020 1 次提交
    • B
      btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov 提交于
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
      detecting a must cow condition which is not exactly accurate, but saves
      unnecessary tree traversal. The incorrect assumption is that if the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is competely cleaned after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must until 'btrfs subvolu sync' finished before
      activating the swapfile swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a84d5d42
  3. 20 8月, 2020 1 次提交
    • J
      btrfs: handle errors from async submission · c965d640
      Johannes Thumshirn 提交于
      Btrfs' async submit mechanism is able to handle errors in the submission
      path and the meta-data async submit function correctly passes the error
      code to the caller.
      
      In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're
      not handling the errors returned by btrfs_csum_one_bio() correctly though
      and simply call BUG_ON(). This is unnecessary as the caller of these two
      functions - run_one_async_start - correctly checks for the return values
      and sets the status of the async_submit_bio. The actual bio submission
      will be handled later on by run_one_async_done only if
      async_submit_bio::status is 0, so the data won't be written if we
      encountered an error in the checksum process.
      
      Simply return the error from btrfs_csum_one_bio() to the async submitters,
      like it's done in btree_submit_bio_start().
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c965d640
  4. 11 8月, 2020 2 次提交
    • P
      btrfs: fix return value mixup in btrfs_get_extent · 881a3a11
      Pavel Machek 提交于
      btrfs_get_extent() sets variable ret, but out: error path expect error
      to be in variable err so the error code is lost.
      
      Fixes: 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NPavel Machek (CIP) <pavel@denx.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      881a3a11
    • Q
      btrfs: inode: fix NULL pointer dereference if inode doesn't need compression · 1e6e238c
      Qu Wenruo 提交于
      [BUG]
      There is a bug report of NULL pointer dereference caused in
      compress_file_extent():
      
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
        NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
        LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
        Call Trace:
        [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
        [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
        [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
        [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
        [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
        [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
        [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
        ---[ end trace f16954aa20d822f6 ]---
      
      [CAUSE]
      For the following execution route of compress_file_range(), it's
      possible to hit NULL pointer dereference:
      
       compress_file_extent()
       |- pages = NULL;
       |- start = async_chunk->start = 0;
       |- end = async_chunk = 4095;
       |- nr_pages = 1;
       |- inode_need_compress() == false; <<< Possible, see later explanation
       |  Now, we have nr_pages = 1, pages = NULL
       |- cont:
       |- 		ret = cow_file_range_inline();
       |- 		if (ret <= 0) {
       |-		for (i = 0; i < nr_pages; i++) {
       |-			WARN_ON(pages[i]->mapping);	<<< Crash
      
      To enter above call execution branch, we need the following race:
      
          Thread 1 (chattr)     |            Thread 2 (writeback)
      --------------------------+------------------------------
                                | btrfs_run_delalloc_range
                                | |- inode_need_compress = true
                                | |- cow_file_range_async()
      btrfs_ioctl_set_flag()    |
      |- binode_flags |=        |
         BTRFS_INODE_NOCOMPRESS |
                                | compress_file_range()
                                | |- inode_need_compress = false
                                | |- nr_page = 1 while pages = NULL
                                | |  Then hit the crash
      
      [FIX]
      This patch will fix it by checking @pages before doing accessing it.
      This patch is only designed as a hot fix and easy to backport.
      
      More elegant fix may make btrfs only check inode_need_compress() once to
      avoid such race, but that would be another story.
      Reported-by: NLuciano Chavez <chavez@us.ibm.com>
      Fixes: 4d3a800e ("btrfs: merge nr_pages input and output parameter in compress_pages")
      CC: stable@vger.kernel.org # 4.14.x: cecc8d90: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1e6e238c
  5. 27 7月, 2020 7 次提交
    • F
      btrfs: reduce contention on log trees when logging checksums · 3ebac17c
      Filipe Manana 提交于
      The possibility of extents being shared (through clone and deduplication
      operations) requires special care when logging data checksums, to avoid
      having a log tree with different checksum items that cover ranges which
      overlap (which resulted in missing checksums after replaying a log tree).
      Such problems were fixed in the past by the following commits:
      
      commit 40e046ac ("Btrfs: fix missing data checksums after replaying a
                            log tree")
      
      commit e289f03e ("btrfs: fix corrupt log due to concurrent fsync of
                            inodes with shared extents")
      
      Test case generic/588 exercises the scenario solved by the first commit
      (purely sequential and deterministic) while test case generic/457 often
      triggered the case fixed by the second commit (not deterministic, requires
      specific timings under concurrency).
      
      The problems were addressed by deleting, from the log tree, any existing
      checksums before logging the new ones. And also by doing the deletion and
      logging of the cheksums while locking the checksum range in an extent io
      tree (root->log_csum_range), to deal with the case where we have concurrent
      fsyncs against files with shared extents.
      
      That however causes more contention on the leaves of a log tree where we
      store checksums (and all the nodes in the paths leading to them), even
      when we do not have shared extents, or all the shared extents were created
      by past transactions. It also adds a bit of contention on the spin lock of
      the log_csums_range extent io tree of the log root.
      
      This change adds a 'last_reflink_trans' field to the inode to keep track
      of the last transaction where a new extent was shared between inodes
      (through clone and deduplication operations). It is updated for both the
      source and destination inodes of reflink operations whenever a new extent
      (created in the current transaction) becomes shared by the inodes. This
      field is kept in memory only, not persisted in the inode item, similar
      to other existing fields (last_unlink_trans, logged_trans).
      
      When logging checksums for an extent, if the value of 'last_reflink_trans'
      is smaller then the current transaction's generation/id, we skip locking
      the extent range and deletion of checksums from the log tree, since we
      know we do not have new shared extents. This reduces contention on the
      log tree's leaves where checksums are stored.
      
      The following script, which uses fio, was used to measure the impact of
      this change:
      
        $ cat test-fsync.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 3 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=write
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=64k
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were performed for different numbers of jobs, file sizes and
      fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
      12 cores, with cpu governance set to performance mode on all cores), 16GiB
      of ram (the host has 64GiB) and using a NVMe device directly (without an
      intermediary filesystem in the host). While running the tests, the host
      was not used for anything else, to avoid disturbing the tests.
      
      The obtained results were the following (the last line of fio's output was
      pasted). Starting with 16 jobs is where a significant difference is
      observable in this particular setup and hardware (differences highlighted
      below). The very small differences for tests with less than 16 jobs are
      possibly just noise and random.
      
          **** 1 job, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
      
      after this change:
      
      WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
      
          **** 2 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
      
      after this change:
      
      WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
      
          **** 4 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
      
      after this change:
      
      WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
      
          **** 8 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
      
      after this change:
      
      WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
      
          **** 16 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
      
      after this change:
      
      WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
      (+40.1% throughput, -28.5% runtime)
      
          **** 32 jobs, file size 1G, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
      
      after this change:
      
      WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
      (+29.4% throughput, -22.7% runtime)
      
          **** 64 jobs, file size 512M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
      
      after this change:
      
      WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
      (+29.3% throughput, -22.7% runtime)
      
          **** 128 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
      
      after this change:
      
      WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
      (+121.2% throughput, -54.6% runtime)
      
          **** 256 jobs, file size 256M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
      
      after this change:
      
      WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
      (+74.9% throughput, -42.8% runtime)
      
          **** 512 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
      
      after this change:
      
      WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
      (+65.1% throughput, -39.3% runtime)
      
          **** 1024 jobs, file size 128M, fsync frequency 1 ****
      
      before this change:
      
      WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
      
      after this change:
      
      WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
      (+48.7% throughput, -33.1% runtime)
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3ebac17c
    • N
      btrfs: increment device corruption error in case of checksum error · 814723e0
      Nikolay Borisov 提交于
      Now that btrfs_io_bio have access to btrfs_device we can safely
      increment the device corruption counter on error. There is one notable
      exception - repair bios for raid. Since those don't go through the
      normal submit_stripe_bio callpath but through raid56_parity_recover thus
      repair bios won't have their device set.
      
      Scrub increments the corruption counter for checksum mismatch as well
      but does not call this function.
      
      Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      814723e0
    • Q
      btrfs: free anon block device right after subvolume deletion · 082b6c97
      Qu Wenruo 提交于
      [BUG]
      When a lot of subvolumes are created, there is a user report about
      transaction aborted caused by slow anonymous block device reclaim:
      
        BTRFS: Transaction aborted (error -24)
        WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
        RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
        Call Trace:
         create_pending_snapshots+0x82/0xa0 [btrfs]
         btrfs_commit_transaction+0x275/0x8c0 [btrfs]
         btrfs_mksubvol+0x4b9/0x500 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0x11a4/0x2da0 [btrfs]
         do_vfs_ioctl+0xa9/0x640
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x5a/0x110
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 33f2f83f3d5250e9 ]---
        BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
        BTRFS info (device sda1): forced readonly
        BTRFS warning (device sda1): Skipping commit of aborted transaction.
        BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
      
      [CAUSE]
      The anonymous device pool is shared and its size is 1M. It's possible to
      hit that limit if the subvolume deletion is not fast enough and the
      subvolumes to be cleaned keep the ids allocated.
      
      [WORKAROUND]
      We can't avoid the anon device pool exhaustion but we can shorten the
      time the id is attached to the subvolume root once the subvolume becomes
      invisible to the user.
      Reported-by: NGreed Rong <greedrong@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      082b6c97
    • N
      btrfs: make btrfs_qgroup_check_reserved_leak take btrfs_inode · cfdd4592
      Nikolay Borisov 提交于
      vfs_inode is used only for the inode number everything else requires
      btrfs_inode.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ use btrfs_ino ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cfdd4592
    • N
      btrfs: make btrfs_set_inode_last_trans take btrfs_inode · d9094414
      Nikolay Borisov 提交于
      Instead of making multiple calls to BTRFS_I simply take btrfs_inode as
      an input paramter.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d9094414
    • N
      btrfs: remove BTRFS_I calls in btrfs_writepage_fixup_worker · 65d87f79
      Nikolay Borisov 提交于
      All of its children functions use btrfs_inode.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      65d87f79
    • N
      btrfs: make btrfs_delalloc_reserve_space take btrfs_inode · e5b7231e
      Nikolay Borisov 提交于
      All of its children take btrfs_inode so bubble up this requirement to
      btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I
      internally.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e5b7231e