1. 21 Oct 2021 (2 commits)
    • D
      btrfs: reset replace target device to allocation state on close · 442fd787
      Desmond Cheong Zhi Xi committed
      stable inclusion
      from stable-5.10.67
      commit c1b249e02a80347fe519b761a8c66008e5e7dcbf
      bugzilla: 182619 https://gitee.com/openeuler/kernel/issues/I4EWO7
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c1b249e02a80347fe519b761a8c66008e5e7dcbf
      
      --------------------------------
      
      commit 0d977e0e upstream.
      
      This crash was observed with a failed assertion on device close:
      
        BTRFS: Transaction aborted (error -28)
        WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
        CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
        R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
        FS:  0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
        Call Trace:
         flush_space+0x197/0x2f0 [btrfs]
         btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
         process_one_work+0x262/0x5e0
         worker_thread+0x4c/0x320
         ? process_one_work+0x5e0/0x5e0
         kthread+0x144/0x170
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x1f/0x30
        irq event stamp: 19334989
        hardirqs last  enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
        hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
        softirqs last  enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
        softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
        ---[ end trace 45939e308e0dd3c7 ]---
        BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
        BTRFS info (device vdd): forced readonly
        BTRFS warning (device vdd): failed setting block group ro: -30
        BTRFS info (device vdd): suspending dev_replace for unmount
        assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3431!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 1 PID: 3982 Comm: umount Tainted: G        W         5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
        RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
        RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
        R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
        FS:  00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
        Call Trace:
         btrfs_close_one_device.cold+0x11/0x55 [btrfs]
         close_fs_devices+0x44/0xb0 [btrfs]
         btrfs_close_devices+0x48/0x160 [btrfs]
         generic_shutdown_super+0x69/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x2c/0xa0
         cleanup_mnt+0x144/0x1b0
         task_work_run+0x59/0xa0
         exit_to_user_mode_loop+0xe7/0xf0
         exit_to_user_mode_prepare+0xaf/0xf0
         syscall_exit_to_user_mode+0x19/0x50
         do_syscall_64+0x4a/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This happens when close_ctree is called while a dev_replace hasn't
      completed. In close_ctree, we suspend the dev_replace, but keep the
      replace target around so that we can resume the dev_replace procedure
      when we mount the root again. This is the call trace:
      
        close_ctree():
          btrfs_dev_replace_suspend_for_unmount();
          btrfs_close_devices():
            btrfs_close_fs_devices():
              btrfs_close_one_device():
                ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
                       &device->dev_state));
      
      However, since the replace target sticks around, there is a device
      with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
      assertion in btrfs_close_one_device.
      
      To fix this, if we come across the replace target device when
      closing, we should properly reset it back to allocation state. This
      fix also ensures that if a non-target device has a corrupted state and
      has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
      catch the error.
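
The close-time behavior described above can be modeled in userspace as a minimal sketch. The struct, the bit value, and the helper name are hypothetical stand-ins rather than the kernel's definitions, and identifying the replace target by a reserved device id is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's device state bit and id. */
#define MODEL_DEV_STATE_REPLACE_TGT (1u << 0)
#define MODEL_REPLACE_TARGET_DEVID  0u /* assumption: id reserved for the target */

struct model_device {
	uint64_t devid;
	unsigned int dev_state;
};

/*
 * Model of closing one device: the replace target has its flag cleared,
 * while any other device still carrying the flag is reported as being in
 * a corrupted state. Returns true when the device ends up clean.
 */
static bool model_close_one_device(struct model_device *dev)
{
	assert(dev != NULL);

	if (dev->devid == MODEL_REPLACE_TARGET_DEVID)
		dev->dev_state &= ~MODEL_DEV_STATE_REPLACE_TGT;

	/* Stand-in for the kernel ASSERT(): must hold for every device. */
	return !(dev->dev_state & MODEL_DEV_STATE_REPLACE_TGT);
}
```

In this model a suspended replace target closes cleanly, while a non-target device with the bit set still fails the check, which is the property the commit message asks for.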

      Reported-by: David Sterba <dsterba@suse.com>
      Fixes: b2a61667 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      btrfs: wake up async_delalloc_pages waiters after submit · 00f0b637
      Josef Bacik committed
      stable inclusion
      from stable-5.10.67
      commit 0901af53da8f4cfee3bb364a35f66d0d4a9b93ba
      bugzilla: 182619 https://gitee.com/openeuler/kernel/issues/I4EWO7
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0901af53da8f4cfee3bb364a35f66d0d4a9b93ba
      
      --------------------------------
      
      commit ac98141d upstream.
      
      We use the async_delalloc_pages mechanism to make sure that we've
      completed our async work before trying to continue our delalloc
      flushing.  The reason for this is we need to see any ordered extents
      that were created by our delalloc flushing.  However we're waking up
      before we do the submit work, which is before we create the ordered
      extents.  This is a pretty wide race window where we could potentially
      think there are no ordered extents and thus exit shrink_delalloc
      prematurely.  Fix this by waking us up after we've done the work to
      create ordered extents.
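
The ordering problem can be illustrated with a single-threaded model. This is only a sketch: the struct and function names are hypothetical, and real waiters run concurrently, whereas here the waiter simply samples the ordered extent count at the moment it is woken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one async delalloc work item. */
struct async_model {
	int ordered_extents; /* ordered extents created so far */
	int seen_at_wakeup;  /* what a woken waiter would observe */
};

/* The waiter samples the ordered extent count when it is woken. */
static void model_wake_waiter(struct async_model *m)
{
	m->seen_at_wakeup = m->ordered_extents;
}

/* Submitting the work is what creates the ordered extent. */
static void model_submit_work(struct async_model *m)
{
	m->ordered_extents++;
}

/* Run one work item, waking the waiter either before or after submit. */
static int model_run_async_work(bool wake_after_submit)
{
	struct async_model m = { 0, 0 };

	if (!wake_after_submit)
		model_wake_waiter(&m);
	model_submit_work(&m);
	if (wake_after_submit)
		model_wake_waiter(&m);

	assert(m.ordered_extents == 1);
	return m.seen_at_wakeup;
}
```

Waking before submit lets the waiter observe zero ordered extents; waking after submit lets it observe the one that was just created.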
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 19 Oct 2021 (4 commits)
    • F
      btrfs: fix race between marking inode needs to be logged and log syncing · 6b99745e
      Filipe Manana committed
      mainline inclusion
      from mainline-5.10.62
      commit d845f89d59fc3f17ea4e86321b82d8edf6c1719f
      bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d845f89d59fc3f17ea4e86321b82d8edf6c1719f
      
      --------------------------------
      
      commit bc0939fc upstream.
      
      We have a race between marking that an inode needs to be logged, either
      at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and
      btrfs_sync_log(). The following steps describe how the race happens.
      
      1) We are at transaction N;
      
      2) Inode I was previously fsynced in the current transaction so it has:
      
          inode->logged_trans set to N;
      
      3) The inode's root currently has:
      
         root->log_transid set to 1
         root->last_log_commit set to 0
      
         Which means only one log transaction was committed so far, log
         transaction 0. When a log tree is created we set ->log_transid and
         ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
      
      4) One more range of pages is dirtied in inode I;
      
      5) Some task A starts an fsync against some other inode J (same root), and
         so it joins log transaction 1.
      
         Before task A calls btrfs_sync_log()...
      
      6) Task B starts an fsync against inode I, which currently has the full
         sync flag set, so it starts delalloc and waits for the ordered extent
         to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
      
      7) During ordered extent completion we have btrfs_update_inode() called
         against inode I, which in turn calls btrfs_set_inode_last_trans(),
         which does the following:
      
           spin_lock(&inode->lock);
           inode->last_trans = trans->transaction->transid;
           inode->last_sub_trans = inode->root->log_transid;
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         So ->last_trans is set to N and ->last_sub_trans set to 1.
         But before setting ->last_log_commit...
      
      8) Task A is at btrfs_sync_log():
      
         - it increments root->log_transid to 2
         - starts writeback for all log tree extent buffers
         - waits for the writeback to complete
         - writes the super blocks
         - updates root->last_log_commit to 1
      
         It's a lot of slow steps between updating root->log_transid and
         root->last_log_commit;
      
      9) The task doing the ordered extent completion, currently at
         btrfs_set_inode_last_trans(), then finally runs:
      
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         Which results in inode->last_log_commit being set to 1.
         The ordered extent completes;
      
      10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
          true because we have all the following conditions met:
      
          inode->logged_trans == N which matches fs_info->generation &&
          inode->last_sub_trans (1) <= inode->last_log_commit (1) &&
          inode->last_sub_trans (1) <= root->last_log_commit (1) &&
          list inode->extent_tree.modified_extents is empty
      
          And as a consequence we return without logging the inode, so the
          existing logged version of the inode does not point to the extent
          that was written after the previous fsync.
      
      It should be impossible in practice for one task to be able to make so
      much progress in btrfs_sync_log() while another task is at
      btrfs_set_inode_last_trans() right after it reads root->log_transid and
      before it reads root->last_log_commit. Even if kernel preemption is enabled
      we know the task at btrfs_set_inode_last_trans() can not be preempted
      because it is holding the inode's spinlock.
      
      However there is another place where we do the same without holding the
      spinlock, which is in the memory mapped write path at:
      
        vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
        {
           (...)
           BTRFS_I(inode)->last_trans = fs_info->generation;
           BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
           BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
           (...)
      
      So with preemption happening after setting ->last_sub_trans and before
      setting ->last_log_commit, it is less of a stretch to have another task
      make enough progress at btrfs_sync_log() such that the task doing the memory
      mapped write ends up with ->last_sub_trans and ->last_log_commit set to
      the same value. It is still a big stretch to get there, as the task doing
      btrfs_sync_log() has to start writeback, wait for its completion and write
      the super blocks.
      
      So fix this in two different ways:
      
      1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
         value of ->last_sub_trans minus 1;
      
      2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
         like we do for buffered and direct writes at btrfs_file_write_iter(),
         which is all we need to make sure multiple writes and fsyncs to an
         inode in the same transaction never result in an fsync missing that
         the inode changed and needs to be logged. Turn this into a helper
         function and use it both at btrfs_page_mkwrite() and at
         btrfs_file_write_iter() - this also fixes the problem that at
         btrfs_page_mkwrite() we were setting those fields without the
         protection of the inode's spinlock.
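
The interleaving from steps 7 to 10, and the effect of the first fix, can be replayed in a small userspace model. This is a sketch under simplifying assumptions: the structs are hypothetical, only the two log-related comparisons from step 10 are modeled, and the concurrent tasks are replaced by a fixed sequence of assignments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the root and inode fields. */
struct model_root {
	int log_transid;
	int last_log_commit;
};

struct model_inode {
	int last_sub_trans;
	int last_log_commit;
};

/* Only the two log-related comparisons from step 10 are modeled. */
static bool model_inode_in_log(const struct model_inode *inode,
			       const struct model_root *root)
{
	return inode->last_sub_trans <= inode->last_log_commit &&
	       inode->last_sub_trans <= root->last_log_commit;
}

/*
 * Replay steps 7 to 10: the log sync (task A) runs between the two
 * assignments done on behalf of inode I. With use_fix set, the second
 * assignment is derived from ->last_sub_trans instead of re-reading
 * the root.
 */
static bool model_replay_race(bool use_fix)
{
	struct model_root root = { .log_transid = 1, .last_log_commit = 0 };
	struct model_inode inode = { 0, 0 };

	inode.last_sub_trans = root.log_transid;  /* step 7: reads 1 */

	root.log_transid++;                       /* step 8: now 2 */
	root.last_log_commit = 1;                 /* step 8: log 1 committed */

	if (use_fix)
		inode.last_log_commit = inode.last_sub_trans - 1;
	else
		inode.last_log_commit = root.last_log_commit; /* step 9: reads 1 */

	assert(inode.last_sub_trans == 1);
	return model_inode_in_log(&inode, &root);  /* step 10 */
}
```

Without the fix the model reports the inode as already logged, so the fsync would be skipped; with the fix ->last_log_commit stays below ->last_sub_trans and the inode is logged.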
      
      This is an extremely unlikely race to happen in practice.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • Q
      Revert "btrfs: compression: don't try to compress if we don't have enough pages" · 6a0544e0
      Qu Wenruo committed
      mainline inclusion
      from mainline-5.10.62
      commit 3134292a8e79e089f0f19f28b8b20eb1f961575c
      bugzilla: 182217 https://gitee.com/openeuler/kernel/issues/I4EFOS
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3134292a8e79e089f0f19f28b8b20eb1f961575c
      
      --------------------------------
      
      commit 4e965576 upstream.
      
      This reverts commit f2165627.
      
      [BUG]
      It's no longer possible to create a compressed inline extent after commit
      f2165627 ("btrfs: compression: don't try to compress if we don't
      have enough pages").
      
      [CAUSE]
      For compression code, there are several possible reasons we have a range
      that needs to be compressed while it's no more than one page.
      
      - Compressed inline write
        The data is always smaller than one sector and the test lacks the
        condition to properly recognize a non-inline extent.
      
      - Compressed subpage write
        For the incoming subpage compressed write support, we require page
        alignment of the delalloc range.
        And for 64K page size, we can compress just one page into smaller
        sectors.
      
      For those reasons, the requirement for the data to be more than one page
      is not correct, and is already causing a regression for compressed inline
      data writeback.  The idea of skipping one page to avoid wasting CPU time
      could be revisited in the future.
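
To see why a "more than one page" requirement excludes both cases, consider the sketch below. It assumes the reverted check was a gate on the number of pages a range touches; the exact kernel condition is not quoted in this message, so the helper names and the gate itself are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of pages the byte range [start, start + len) touches. */
static size_t model_pages_spanned(size_t start, size_t len, size_t page_size)
{
	assert(page_size > 0 && len > 0);
	return (start + len - 1) / page_size - start / page_size + 1;
}

/* Hypothetical gate: refuse compression unless more than one page. */
static bool model_gate_allows_compression(size_t start, size_t len,
					  size_t page_size)
{
	return model_pages_spanned(start, len, page_size) > 1;
}
```

Under this gate a 2048-byte write with 4K pages (an inline candidate) and a full 64K page on a 64K-page system are both rejected, even though each could be compressed into fewer sectors.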
      
      [FIX]
      Fix it by reverting the offending commit.

      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net
      Fixes: f2165627 ("btrfs: compression: don't try to compress if we don't have enough pages")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • Q
      btrfs: fix NULL pointer dereference when deleting device by invalid id · 6bb7ddd9
      Qu Wenruo committed
      mainline inclusion
      from mainline
      commit e4571b8c
      bugzilla: 181007 https://gitee.com/openeuler/kernel/issues/I4DDEL
      CVE: CVE-2021-3739
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e4571b8c5e9ffa1e85c0c671995bd4dcc5c75091
      
      --------------------------------
      
      [BUG]
      It's easy to trigger a NULL pointer dereference, just by removing a
      non-existing device id:
      
       # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
      				     /dev/test/scratch2
       # mount /dev/test/scratch1 /mnt/btrfs
       # btrfs device remove 3 /mnt/btrfs
      
      Then we have the following kernel NULL pointer dereference:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
        btrfs_ioctl+0x18bb/0x3190 [btrfs]
        ? lock_is_held_type+0xa5/0x120
        ? find_held_lock.constprop.0+0x2b/0x80
        ? do_user_addr_fault+0x201/0x6a0
        ? lock_release+0xd2/0x2d0
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      [CAUSE]
      Commit a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return
      btrfs_device directly") moves the "missing" device path check into
      btrfs_rm_device().
      
      But btrfs_rm_device() itself can have a case where it only receives
      @devid, with NULL as @device_path.
      
      In that case, calling strcmp() on NULL will trigger the NULL pointer
      dereference.
      
      Before that commit, we handle the "missing" case inside
      btrfs_find_device_by_devspec(), which will not check @device_path at all
      if @devid is provided, thus no way to trigger the bug.
      
      [FIX]
      Before calling strcmp(), also make sure @device_path is not NULL.
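
The guard amounts to checking the pointer before the comparison. A minimal userspace sketch follows; the helper name is hypothetical and the surrounding kernel error handling is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: true only when a path was given and it is the
 * literal "missing". A NULL path (removal by devid) is never passed to
 * strcmp().
 */
static bool model_path_requests_missing(const char *device_path)
{
	return device_path != NULL && strcmp(device_path, "missing") == 0;
}

/* Usage: removal by devid passes NULL and must not crash. */
static void model_path_examples(void)
{
	assert(!model_path_requests_missing(NULL));
	assert(model_path_requests_missing("missing"));
	assert(!model_path_requests_missing("/dev/test/scratch2"));
}
```

The short-circuit `&&` guarantees that strcmp() is only reached with a non-NULL argument.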
      
      Fixes: a27a94c2 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
      CC: stable@vger.kernel.org # 5.4+
      Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • N
      btrfs: prevent rename2 from exchanging a subvol with a directory from different parents · 1dbbf013
      NeilBrown committed
      stable inclusion
      from stable-5.10.61
      commit 67fece6289a9dab4e2700a93004db45ac834523b
      bugzilla: 177029 https://gitee.com/openeuler/kernel/issues/I4EAXD
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=67fece6289a9dab4e2700a93004db45ac834523b
      
      --------------------------------
      
      [ Upstream commit 3f79f6f6 ]
      
      Cross-rename lacks a check that would prevent exchanging a
      directory and a subvolume from different parent subvolumes. This causes
      data inconsistencies and is caught before commit by tree-checker,
      turning the filesystem to read-only.
      
      Calling renameat2 with the RENAME_EXCHANGE flag, like
      
        renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))
      
      on two paths:
      
        namesrc = dir1/subvol1/dir2
       namedest = subvol2/subvol3
      
      will cause a key order problem with the following write-time tree-checker
      report:
      
        [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
        [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
        [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
        [1194842.338772]        item 0 key (256 1 0) itemoff 16123 itemsize 160
        [1194842.338793]                inode generation 3 size 16 mode 40755
        [1194842.338801]        item 1 key (256 12 256) itemoff 16111 itemsize 12
        [1194842.338809]        item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
        [1194842.338817]                dir oid 258 type 2
        [1194842.338823]        item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
        [1194842.338830]                dir oid 257 type 2
        [1194842.338836]        item 4 key (256 96 2) itemoff 16009 itemsize 34
        [1194842.338843]        item 5 key (256 96 3) itemoff 15975 itemsize 34
        [1194842.338852]        item 6 key (257 1 0) itemoff 15815 itemsize 160
        [1194842.338863]                inode generation 6 size 8 mode 40755
        [1194842.338869]        item 7 key (257 12 256) itemoff 15801 itemsize 14
        [1194842.338876]        item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
        [1194842.338883]                dir oid 256 type 2
        [1194842.338888]        item 9 key (257 96 2) itemoff 15733 itemsize 34
        [1194842.338895]        item 10 key (258 12 256) itemoff 15719 itemsize 14
        [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
        [1194842.339245] ------------[ cut here ]------------
        [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
        [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
        [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
        [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
        [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
        [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
        [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
        [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
        [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
        [1194842.512092] FS:  0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
        [1194842.512095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
        [1194842.512103] Call Trace:
        [1194842.512128]  ? run_one_async_free+0x10/0x10 [btrfs]
        [1194842.631729]  btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
        [1194842.631837]  run_one_async_start+0x18/0x30 [btrfs]
        [1194842.631938]  btrfs_work_helper+0xd5/0x1d0 [btrfs]
        [1194842.647482]  process_one_work+0x262/0x5e0
        [1194842.647520]  worker_thread+0x4c/0x320
        [1194842.655935]  ? process_one_work+0x5e0/0x5e0
        [1194842.655946]  kthread+0x135/0x160
        [1194842.655953]  ? set_kthread_struct+0x40/0x40
        [1194842.655965]  ret_from_fork+0x1f/0x30
        [1194842.672465] irq event stamp: 1729
        [1194842.672469] hardirqs last  enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
        [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
        [1194842.672482] softirqs last  enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
        [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0
      
      The corrupted data will not be written, and the filesystem can be unmounted
      and mounted again (all changes since the last commit will be lost).
      
      Add the missing check for new_ino so that all non-subvolumes must reside
      under the same parent subvolume. There's an exception that allows
      exchanging two subvolumes from any parents, as the directory representing a
      subvolume is only a logical link and does not have any other structures
      related to the parent subvolume, unlike files, directories etc, that
      are always in the inode namespace of the parent subvolume.
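
The rule described above can be written as a small predicate. This is a userspace sketch, not the kernel code: the function and its parameters are hypothetical, and it assumes that the top directory of a subvolume is recognizable by a fixed inode number (256 here).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: inode number used by the top directory of a subvolume. */
#define MODEL_SUBVOL_ROOT_INO 256u

/*
 * Model of the cross-rename check: old_root and new_root identify the
 * parent subvolumes of the two entries. Returns 0 when the exchange is
 * allowed and -EXDEV otherwise.
 */
static int model_exchange_check(uint64_t old_ino, uint64_t old_root,
				uint64_t new_ino, uint64_t new_root)
{
	bool both_subvols = old_ino == MODEL_SUBVOL_ROOT_INO &&
			    new_ino == MODEL_SUBVOL_ROOT_INO;

	/* Anything that is not a subvolume must stay within its parent. */
	if (!both_subvols && old_root != new_root)
		return -EXDEV;
	return 0;
}

/* Usage: the three cases discussed in the text. */
static void model_exchange_examples(void)
{
	/* directory vs. subvolume, different parents: rejected */
	assert(model_exchange_check(258, 5, 256, 257) == -EXDEV);
	/* two subvolumes, different parents: allowed */
	assert(model_exchange_check(256, 5, 256, 257) == 0);
	/* two directories under the same parent: allowed */
	assert(model_exchange_check(258, 5, 259, 5) == 0);
}
```

Checking only the source inode would let the first case through, which matches the missing check on new_ino that the message describes.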
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.7+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  3. 15 Oct 2021 (5 commits)
    • F
      btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction · 198d11fe
      Filipe Manana committed
      stable inclusion
      from stable-5.10.57
      commit 9e55b9278c47ded8508fbb436a8a7e9148e4faed
      bugzilla: 176179 https://gitee.com/openeuler/kernel/issues/I4DZ64
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9e55b9278c47ded8508fbb436a8a7e9148e4faed
      
      --------------------------------
      
      [ Upstream commit ecc64fab ]
      
      When checking if we need to log the new name of a renamed inode, we are
      checking if the inode and its parent inode have been logged before, and if
      not we don't log the new name. The check however is buggy, as it directly
      compares the logged_trans field of the inodes versus the ID of the current
      transaction. The problem is that logged_trans is a transient field, only
      stored in memory and never persisted in the inode item, so if an inode
      was logged before, evicted and reloaded, its logged_trans field is set to
      a value of 0, meaning the check will return false and the new name of the
      renamed inode is not logged. If the old parent directory was previously
      fsynced and we deleted the logged directory entries corresponding to the
      old name, we end up with a log that when replayed will delete the renamed
      inode.
      
      The following example triggers the problem:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ echo -n "hello world" > /mnt/A/foo
      
        $ sync
      
        # Add some new file to A and fsync directory A.
        $ touch /mnt/A/bar
        $ xfs_io -c "fsync" /mnt/A
      
        # Now trigger inode eviction. We are only interested in triggering
        # eviction for the inode of directory A.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Move foo from directory A to directory B.
        # This deletes the directory entries for foo in A from the log, and
        # does not add the new name for foo in directory B to the log, because
        # logged_trans of A is 0, which is less than the current transaction ID.
        $ mv /mnt/A/foo /mnt/B/foo
      
        # Now make an fsync to anything except A, B or any file inside them,
        # like for example create a file at the root directory and fsync this
        # new file. This syncs the log that contains all the changes done by
        # previous rename operation.
        $ touch /mnt/baz
        $ xfs_io -c "fsync" /mnt/baz
      
        <power fail>
      
        # Mount the filesystem and replay the log.
        $ mount /dev/sdc /mnt
      
        # Check the filesystem content.
        $ ls -1R /mnt
        /mnt/:
        A
        B
        baz
      
        /mnt/A:
        bar
      
        /mnt/B:
        $
      
        # File foo is gone, it's neither in A/ nor in B/.
      
      Fix this by using the inode_logged() helper at btrfs_log_new_name(), which
      safely checks if an inode was logged before in the current transaction.
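
The effect of eviction on a transient field can be shown with a userspace model. This is a sketch in the spirit of such a helper, not its actual body: the struct, the function names, and the conditions of the conservative check are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory inode; only last_trans is persisted on disk. */
struct model_mem_inode {
	uint64_t logged_trans; /* transient, lost on eviction */
	uint64_t last_trans;   /* reloaded from the inode item */
	bool needs_full_sync;  /* assumption: set when an inode changed in
	                          the current transaction is reloaded */
};

/* Evict and reload: transient state is reset, persisted state survives. */
static void model_evict_and_reload(struct model_mem_inode *inode,
				   uint64_t cur_transid)
{
	assert(cur_transid != 0); /* 0 means "never logged" in this model */
	inode->logged_trans = 0;
	inode->needs_full_sync = (inode->last_trans == cur_transid);
}

/* The buggy direct comparison described above. */
static bool model_logged_direct(const struct model_mem_inode *inode,
				uint64_t cur_transid)
{
	return inode->logged_trans == cur_transid;
}

/* Conservative check: also answer "maybe logged" for a reloaded inode. */
static bool model_logged_conservative(const struct model_mem_inode *inode,
				      uint64_t cur_transid)
{
	if (inode->logged_trans == cur_transid)
		return true;
	return inode->last_trans == cur_transid && inode->needs_full_sync;
}
```

After eviction the direct comparison claims that directory A was never logged, while the conservative check still treats it as possibly logged, so the new name would be written to the log.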
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • F
      btrfs: fix race causing unnecessary inode logging during link and rename · cd3b81c8
      Filipe Manana committed
      stable inclusion
      from stable-5.10.57
      commit e2419c570986fe374b01a6db4ddd7a3b2483ab49
      bugzilla: 176179 https://gitee.com/openeuler/kernel/issues/I4DZ64
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e2419c570986fe374b01a6db4ddd7a3b2483ab49
      
      --------------------------------
      
      [ Upstream commit de53d892 ]
      
      When we are doing a rename or a link operation for an inode that was logged
      in the previous transaction and that transaction is still committing, we
      have a time window where we incorrectly consider that the inode was logged
      previously in the current transaction and therefore decide to log it to
      update it in the log. The following steps give an example of how this
      happens during a link operation:
      
      1) Inode X is logged in transaction 1000, so its logged_trans field is set
         to 1000;
      
      2) Task A starts to commit transaction 1000;
      
      3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;
      
      4) Task B starts a link operation for inode X, and as a consequence it
         starts transaction 1001;
      
      5) Task A is still committing transaction 1000, therefore the value stored
         at fs_info->last_trans_committed is still 999;
      
      6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
         fs_info->last_trans_committed and because the logged_trans field of
         inode X has a value of 1000, the function does not return immediately,
         instead it proceeds to logging the inode, which should not happen
         because the inode was logged in the previous transaction (1000) and
         not in the current one (1001).
      
      This is not a functional problem, just wasted time and space logging an
      inode that does not need to be logged, contributing to higher latency
      for link and rename operations.
      
      So fix this by comparing the inode's logged_trans field with the
      generation of the current transaction instead of comparing with the value
      stored in fs_info->last_trans_committed.
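      
      As an illustration only, the two checks can be sketched in user-space C.
      The struct and function names below are hypothetical stand-ins, not the
      kernel's definitions, and the sketch ignores the old_dir handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_inode {
	uint64_t logged_trans;	/* transaction in which the inode was last logged */
};

struct sketch_fs {
	uint64_t last_trans_committed;	/* updated only when a commit finishes */
	uint64_t current_generation;	/* generation of the running transaction */
};

/* Old check: anything newer than the last committed transaction counts. */
bool needs_log_update_old(const struct sketch_inode *inode,
			  const struct sketch_fs *fs)
{
	return inode->logged_trans > fs->last_trans_committed;
}

/* Fixed check: only an inode logged in the running transaction counts. */
bool needs_log_update_new(const struct sketch_inode *inode,
			  const struct sketch_fs *fs)
{
	/* The running transaction is always newer than the last committed one. */
	assert(fs->current_generation > fs->last_trans_committed);
	return inode->logged_trans == fs->current_generation;
}
```

      With logged_trans = 1000, last_trans_committed = 999 and a running
      generation of 1001, as in steps 1) to 6), the old check asks for the
      inode to be logged while the new one does not.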
      
      This case is often hit when running dbench for a long enough duration, as
      it does lots of rename operations.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      cd3b81c8
    • G
      btrfs: mark compressed range uptodate only if all bio succeed · cf1eea5b
      Goldwyn Rodrigues committed
      stable inclusion
      from stable-5.10.56
      commit 0a421a2fc516f39caf3d253c04b76a12fe632011
      bugzilla: 176004 https://gitee.com/openeuler/kernel/issues/I4DYZ4
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0a421a2fc516f39caf3d253c04b76a12fe632011
      
      --------------------------------
      
      commit 240246f6 upstream.
      
      In the compressed write endio sequence, the range that the compressed_bio
      writes is marked as uptodate if the last of the compressed (sub)bios
      completes successfully. An earlier bio may have failed, though, and
      such failures are recorded in cb->errors.
      
      Set the writeback range as uptodate only if cb->errors is zero, as opposed
      to checking only the last bio's status.
      
      Backporting notes: in all versions up to 4.4 the last argument is always
      replaced by "!cb->errors".
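      
      A minimal user-space sketch of the difference, assuming a simplified
      bookkeeping struct (sketch_cb and the function names are hypothetical,
      not the kernel's compressed_bio API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the compressed_bio bookkeeping. */
struct sketch_cb {
	int pending_bios;	/* sub-bios still in flight */
	int errors;		/* number of sub-bios that failed */
};

/* Record completion of one sub-bio; returns true when it was the last one. */
bool sketch_end_bio(struct sketch_cb *cb, int bio_status)
{
	assert(cb->pending_bios > 0);
	if (bio_status)
		cb->errors++;
	return --cb->pending_bios == 0;
}

/* Old behaviour: only the status of the last bio decides. */
bool range_uptodate_old(const struct sketch_cb *cb, int last_bio_status)
{
	(void)cb;
	return last_bio_status == 0;
}

/* Fixed behaviour: any recorded failure leaves the range not uptodate. */
bool range_uptodate_new(const struct sketch_cb *cb)
{
	return cb->errors == 0;
}
```

      If the first of two sub-bios fails and the second succeeds, the old
      check still reports the range as uptodate, while the check on
      cb->errors does not.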
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      cf1eea5b
    • D
      btrfs: fix rw device counting in __btrfs_free_extra_devids · 30fe79cf
      Desmond Cheong Zhi Xi committed
      stable inclusion
      from stable-5.10.56
      commit 4e1a57d75264dd4f10f3497c35dda521947368df
      bugzilla: 176004 https://gitee.com/openeuler/kernel/issues/I4DYZ4
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4e1a57d75264dd4f10f3497c35dda521947368df
      
      --------------------------------
      
      commit b2a61667 upstream.
      
      When removing a writeable device in __btrfs_free_extra_devids, the rw
      device count should be decremented.
      
      This error was caught by Syzbot which reported a warning in
      close_fs_devices:
      
        WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        Modules linked in:
        CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
        RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
        RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
        R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
        R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
        FS:  00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
         open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
         btrfs_fill_super fs/btrfs/super.c:1382 [inline]
         btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         fc_mount fs/namespace.c:993 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
         btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         do_new_mount fs/namespace.c:2905 [inline]
         path_mount+0x196f/0x2be0 fs/namespace.c:3235
         do_mount fs/namespace.c:3248 [inline]
         __do_sys_mount fs/namespace.c:3456 [inline]
         __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The warning fired because fs_devices->rw_devices was not 0 after
      closing all devices. Here is the call trace that was observed:
      
        btrfs_mount_root():
          btrfs_scan_one_device():
            device_list_add();   <---------------- device added
          btrfs_open_devices():
            open_fs_devices():
              btrfs_open_one_device();   <-------- writable device opened,
      	                                     rw device count ++
          btrfs_fill_super():
            open_ctree():
              btrfs_free_extra_devids():
      	  __btrfs_free_extra_devids();  <--- writable device removed,
      	                              rw device count not decremented
      	  fail_tree_roots:
      	    btrfs_close_devices():
      	      close_fs_devices();   <------- rw device count off by 1
      
      As a note, prior to commit cf89af14 ("btrfs: dev-replace: fail
      mount if we don't have replace item with target device"), rw_devices
      was decremented on removing a writable device in
      __btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
      was not set for the device. However, this check does not need to be
      reinstated as it is now redundant and incorrect.
      
      In __btrfs_free_extra_devids, we skip removing the device if it is the
      target for replacement. This is done by checking whether device->devid
      == BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
      only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
      should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
      and so it's redundant to test for that bit.
      
      Additionally, following commit 82372bc8 ("Btrfs: make
      the logic of source device removing more clear"), rw_devices is
      incremented whenever a writeable device is added to the alloc
      list (including the target device in btrfs_dev_replace_finishing), so
      all removals of writable devices from the alloc list should also be
      accompanied by a decrement to rw_devices.
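      
      The accounting rule can be sketched in user-space C as follows. The
      types and helpers are hypothetical and far simpler than the real device
      list; the point is only that every removal of a writeable device undoes
      the increment made when it was opened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_DEVS 8

struct sketch_device {
	bool present;	/* still on the device list */
	bool writeable;	/* counted in rw_devices while present */
};

struct sketch_fs_devices {
	struct sketch_device devs[SKETCH_MAX_DEVS];
	int rw_devices;
};

/* Opening a writeable device bumps the counter. */
void sketch_open_device(struct sketch_fs_devices *fsd, size_t i, bool writeable)
{
	assert(i < SKETCH_MAX_DEVS && !fsd->devs[i].present);
	fsd->devs[i].present = true;
	fsd->devs[i].writeable = writeable;
	if (writeable)
		fsd->rw_devices++;
}

/* Removing an extra device must undo that accounting. */
void sketch_free_extra_device(struct sketch_fs_devices *fsd, size_t i)
{
	assert(i < SKETCH_MAX_DEVS && fsd->devs[i].present);
	if (fsd->devs[i].writeable)
		fsd->rw_devices--;
	fsd->devs[i].present = false;
}

/* Close every remaining device; the counter should end at zero. */
int sketch_close_all(struct sketch_fs_devices *fsd)
{
	for (size_t i = 0; i < SKETCH_MAX_DEVS; i++)
		if (fsd->devs[i].present)
			sketch_free_extra_device(fsd, i);
	return fsd->rw_devices;
}
```

      Without the decrement in sketch_free_extra_device(), the value returned
      by sketch_close_all() would be off by one for each writeable device
      removed early, which is the condition the warning caught.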
      
      Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Fixes: cf89af14 ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
      CC: stable@vger.kernel.org # 5.10+
      Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      30fe79cf
    • A
      btrfs: check for missing device in btrfs_trim_fs · 4c361563
      Anand Jain committed
      stable inclusion
      from stable-5.10.54
      commit 755971dc7ee84fb5d0b6373aa9537c4f62b9e0b4
      bugzilla: 175586 https://gitee.com/openeuler/kernel/issues/I4DVDU
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=755971dc7ee84fb5d0b6373aa9537c4f62b9e0b4
      
      --------------------------------
      
      commit 16a200f6 upstream.
      
      A fstrim on a degraded raid1 can trigger the following null pointer
      dereference:
      
        BTRFS info (device loop0): allowing degraded mounts
        BTRFS info (device loop0): disk space caching is enabled
        BTRFS info (device loop0): has skinny extents
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS info (device loop0): enabling ssd optimizations
        BUG: kernel NULL pointer dereference, address: 0000000000000620
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 4574 Comm: fstrim Not tainted 5.13.0-rc7+ #31
        Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
        RIP: 0010:btrfs_trim_fs+0x199/0x4a0 [btrfs]
        RSP: 0018:ffff959541797d28 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: ffff946f84eca508 RCX: a7a67937adff8608
        RDX: ffff946e8122d000 RSI: 0000000000000000 RDI: ffffffffc02fdbf0
        RBP: ffff946ea4615000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: ffff946e8122d960 R12: 0000000000000000
        R13: ffff959541797db8 R14: ffff946e8122d000 R15: ffff959541797db8
        FS:  00007f55917a5080(0000) GS:ffff946f9bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000620 CR3: 000000002d2c8001 CR4: 00000000000706f0
        Call Trace:
        btrfs_ioctl_fitrim+0x167/0x260 [btrfs]
        btrfs_ioctl+0x1c00/0x2fe0 [btrfs]
        ? selinux_file_ioctl+0x140/0x240
        ? syscall_trace_enter.constprop.0+0x188/0x240
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
      
      Reproducer:
      
        $ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1
        $ mount /dev/loop0 /btrfs
        $ umount /btrfs
        $ btrfs dev scan --forget
        $ mount -o degraded /dev/loop0 /btrfs
      
        $ fstrim /btrfs
      
      The reason is we call btrfs_trim_free_extents() for the missing device,
      which uses device->bdev (NULL for missing device) to find if the device
      supports discard.
      
      The fix is to check whether the device is missing before calling
      btrfs_trim_free_extents().
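      
      A minimal sketch of that check in user-space C, assuming simplified,
      hypothetical device structures in which a missing device has a NULL
      bdev:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_bdev {
	bool supports_discard;
};

struct sketch_device {
	struct sketch_bdev *bdev;	/* NULL when the device is missing */
	bool missing;
};

/* Dereferences bdev, so it must never be called for a missing device. */
int sketch_trim_free_extents(const struct sketch_device *dev)
{
	assert(dev->bdev != NULL);
	return dev->bdev->supports_discard ? 1 : 0;
}

/* Walk all devices, skipping missing ones; returns how many were trimmed. */
int sketch_trim_fs(const struct sketch_device *devs, size_t n)
{
	int trimmed = 0;

	for (size_t i = 0; i < n; i++) {
		if (devs[i].missing)
			continue;	/* the check this fix adds */
		trimmed += sketch_trim_free_extents(&devs[i]);
	}
	return trimmed;
}
```

      On a two-device array where the second device is missing, only the
      first device is trimmed and the NULL bdev is never touched.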
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      4c361563
  4. 13 Oct, 2021 9 commits
  5. 03 Jul, 2021 2 commits
  6. 15 Jun, 2021 11 commits
    • A
      btrfs: fix unmountable seed device after fstrim · 3ea08020
      Anand Jain committed
      stable inclusion
      from stable-5.10.43
      commit fe910d20e2d8e0736bbea9c1efe6a49535e807ea
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit 5e753a81 upstream.
      
      The following test case reproduces an issue of wrongly freeing in-use
      blocks on the readonly seed device when fstrim is called on the rw sprout
      device, as shown below.
      
      Create a seed device and add a sprout device to it:
      
        $ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
        $ btrfstune -S 1 /dev/loop0
        $ mount /dev/loop0 /btrfs
        $ btrfs dev add -f /dev/loop1 /btrfs
        BTRFS info (device loop0): relocating block group 290455552 flags system
        BTRFS info (device loop0): relocating block group 1048576 flags system
        BTRFS info (device loop0): disk added /dev/loop1
        $ umount /btrfs
      
      Mount the sprout device and run fstrim:
      
        $ mount /dev/loop1 /btrfs
        $ fstrim /btrfs
        $ umount /btrfs
      
      Now try to mount the seed device, and it fails:
      
        $ mount /dev/loop0 /btrfs
        mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
      
      Block 5292032 is missing on the readonly seed device:
      
       $ dmesg -kt | tail
       <snip>
       BTRFS error (device loop0): bad tree block start, want 5292032 have 0
       BTRFS warning (device loop0): couldn't read-tree root
       BTRFS error (device loop0): open_ctree failed
      
      From the dump-tree of the seed device (taken before the fstrim). Block
      5292032 belonged to the block group starting at 5242880:
      
        $ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
        <snip>
        item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
        	block group used 114688 chunk_objectid 256 flags METADATA
        <snip>
      
      From the dump-tree of the sprout device (taken before the fstrim).
      fstrim used block-group 5242880 to find the related free space to free:
      
        $ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
        <snip>
        item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
        	block group used 32768 chunk_objectid 256 flags METADATA
        <snip>
      
      BPF kernel tracing the fstrim command finds the missing block 5292032
      within the range of the discarded blocks as below:
      
        kprobe:btrfs_discard_extent {
        	printf("freeing start %llu end %llu num_bytes %llu:\n",
        		arg1, arg1+arg2, arg2);
        }
      
        freeing start 5259264 end 5406720 num_bytes 147456
        <snip>
      
      Fix this by avoiding the discard command to the readonly seed device.
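      
      As a sketch under simplified assumptions, the behaviour can be modelled
      in user-space C like this. The struct and helpers are hypothetical; the
      numbers in the usage note are the ones from the trace output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_device {
	bool writeable;		/* false for the read-only seed device */
	uint64_t discarded;	/* bytes discarded so far */
};

/* True when [start, start + len) covers the given block address. */
bool sketch_range_contains(uint64_t start, uint64_t len, uint64_t block)
{
	return block >= start && block - start < len;
}

/* Issue a discard only to writeable devices; returns bytes discarded. */
uint64_t sketch_discard_extent(struct sketch_device *dev, uint64_t start,
			       uint64_t len)
{
	(void)start;
	assert(len > 0);
	if (!dev->writeable)
		return 0;	/* seed device: leave its blocks alone */
	dev->discarded += len;
	return len;
}
```

      The discarded range 5259264 to 5406720 contains block 5292032, so
      skipping the discard for the non-writeable seed device keeps that block
      intact, while the sprout device is still trimmed.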
      
      Reported-by: Chris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      3ea08020
    • F
      btrfs: fix deadlock when cloning inline extents and low on available space · 7abe8761
      Filipe Manana committed
      stable inclusion
      from stable-5.10.43
      commit baa6763123e2b63b8289943c7211ba0e3220432f
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit 76a6d5cd upstream.
      
      There are a few cases where cloning an inline extent requires copying data
      into a page of the destination inode. For these cases we are allocating
      the required data and metadata space while holding a leaf locked. This can
      result in a deadlock when we are low on available space because allocating
      the space may flush delalloc and two deadlock scenarios can happen:
      
      1) When starting writeback for an inode with a very small dirty range that
         fits in an inline extent, we deadlock during the writeback when trying
         to insert the inline extent, at cow_file_range_inline(), if the extent
         is going to be located in the leaf for which we are already holding a
         read lock;
      
      2) After successfully starting writeback, for non-inline extent cases,
         the async reclaim thread will hang waiting for an ordered extent to
         complete if the ordered extent completion needs to modify the leaf
         for which the clone task is holding a read lock (for adding or
         replacing file extent items). So the cloning task will wait forever
         on the async reclaim thread to make progress, which in turn is
         waiting for the ordered extent completion which in turn is waiting
         to acquire a write lock on the same leaf.
      
      So fix this by making sure we release the path (and therefore the leaf)
      every time we need to copy the inline extent's data into a page of the
      destination inode, as by that time we do not need to have the leaf locked.
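      
      The ordering can be sketched in user-space C. This is a toy model with
      hypothetical names: a real deadlock cannot be shown in a few lines, so
      the sketch reports "would deadlock" as an error code instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_EDEADLK 35	/* illustrative error value for this sketch only */

struct sketch_path {
	bool leaf_locked;	/* true while the task holds the leaf's read lock */
};

void sketch_release_path(struct sketch_path *path)
{
	path->leaf_locked = false;
}

/*
 * Reserving space may flush delalloc, which can need the same leaf.
 * The sketch reports that situation as an error instead of hanging.
 */
int sketch_reserve_space(const struct sketch_path *path)
{
	return path->leaf_locked ? -SKETCH_EDEADLK : 0;
}

/* Fixed ordering: drop the leaf before reserving space for the page copy. */
int sketch_copy_inline_to_page(struct sketch_path *path)
{
	sketch_release_path(path);
	assert(!path->leaf_locked);
	return sketch_reserve_space(path);
}
```

      Calling sketch_reserve_space() with the leaf still locked models the
      old ordering; sketch_copy_inline_to_page() models the fixed one.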
      
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      7abe8761
    • J
      btrfs: abort in rename_exchange if we fail to insert the second ref · 21714f16
      Josef Bacik committed
      stable inclusion
      from stable-5.10.43
      commit 0df50d47d17401f9f140dfbe752a65e5d72f9932
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit dc09ef35 upstream.
      
      Error injection stress uncovered a problem where we'd leave a dangling
      inode ref if we failed during a rename_exchange.  This happens because
      we insert the inode ref for one side of the rename, and then for the
      other side.  If this second inode ref insert fails we'll leave the first
      one dangling and leave a corrupt file system behind.  Fix this by
      aborting if we did the insert for the first inode ref.
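      
      A simplified user-space sketch of the control flow, with hypothetical
      names and an injectable error in place of the real inode ref insert:

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_trans {
	bool aborted;
	int refs_inserted;
};

/* Stand-in for an inode ref insert that can be made to fail. */
int sketch_insert_ref(struct sketch_trans *trans, int inject_error)
{
	if (inject_error)
		return inject_error;
	trans->refs_inserted++;
	return 0;
}

/* Insert both refs; abort if the second fails after the first went in. */
int sketch_rename_exchange(struct sketch_trans *trans, int err_first,
			   int err_second)
{
	int ret;

	assert(!trans->aborted);	/* an aborted transaction is not reused */

	ret = sketch_insert_ref(trans, err_first);
	if (ret)
		return ret;	/* nothing inserted yet, plain error return */

	ret = sketch_insert_ref(trans, err_second);
	if (ret) {
		trans->aborted = true;	/* first ref would otherwise dangle */
		return ret;
	}
	return 0;
}
```

      A failure on the first insert needs no abort because nothing has been
      modified; only a failure after the first ref went in marks the
      transaction as aborted.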
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      21714f16
    • J
      btrfs: fixup error handling in fixup_inode_link_counts · f7a793f2
      Josef Bacik committed
      stable inclusion
      from stable-5.10.43
      commit 48568f3944ee7357e8fed394804745bd981e978a
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit 011b28ac upstream.
      
      This function has the following pattern
      
      	while (1) {
      		ret = whatever();
      		if (ret)
      			goto out;
      	}
      	ret = 0;
      out:
      	return ret;
      
      However, in several places in this while loop we simply break when there's
      a problem, thus clearing the return value, and in one case we do a
      return -EIO and leak the memory for the path.
      
      Fix this by re-arranging the loop to deal with ret == 1 coming from
      btrfs_search_slot, and then simply delete the
      
      	ret = 0;
      out:
      
      bit so everybody can break if there is an error, which will allow for
      proper error handling to occur.
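      
      The two loop shapes can be compared in a small user-space sketch. The
      search function is a hypothetical stand-in that replays scripted return
      values; the real loop does more work per iteration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scripted results standing in for successive search calls:
 * <0 is an error, 0 means "found", 1 means "not found, stop".
 */
struct sketch_search {
	const int *results;
	size_t count;
	size_t next;
};

int sketch_search_slot(struct sketch_search *s)
{
	assert(s->next < s->count);
	return s->results[s->next++];
}

/* Broken shape: break leaves the loop, then "ret = 0" hides the error. */
int sketch_fixup_broken(struct sketch_search *s)
{
	int ret;

	while (1) {
		ret = sketch_search_slot(s);
		if (ret)
			break;
	}
	ret = 0;
	return ret;
}

/* Fixed shape: translate "not found" to success, let errors fall through. */
int sketch_fixup_fixed(struct sketch_search *s)
{
	int ret;

	while (1) {
		ret = sketch_search_slot(s);
		if (ret) {
			if (ret == 1)
				ret = 0;
			break;
		}
	}
	return ret;
}
```

      Fed the sequence 0, 0, -5, the broken shape returns 0 and the fixed
      shape returns -5; a sequence ending in 1 returns 0 in both.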
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      f7a793f2
    • J
      btrfs: return errors from btrfs_del_csums in cleanup_ref_head · 7ef76414
      Josef Bacik committed
      stable inclusion
      from stable-5.10.43
      commit 466d83fdbbe345f3cfd5f7b2633f740ecad67853
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit 856bd270 upstream.
      
      We are unconditionally returning 0 in cleanup_ref_head, despite the fact
      that btrfs_del_csums could fail.  We need to return the error so the
      transaction gets aborted properly. Fix this by returning ret from
      btrfs_del_csums in cleanup_ref_head.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      7ef76414
    • J
      btrfs: fix error handling in btrfs_del_csums · b71bc6fb
      Josef Bacik committed
      stable inclusion
      from stable-5.10.43
      commit 5a89982fa2bba459b82323655df986945a853bbe
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit b86652be upstream.
      
      Error injection stress would sometimes fail with checksums on disk that
      did not have a corresponding extent.  This occurred because the pattern
      in btrfs_del_csums was
      
      	while (1) {
      		ret = btrfs_search_slot();
      		if (ret < 0)
      			break;
      	}
      	ret = 0;
      out:
      	btrfs_free_path(path);
      	return ret;
      
      If we got an error from btrfs_search_slot we'd clear the error because
      we were breaking instead of goto out.  Instead of using goto out, simply
      handle the cases where we may leave a random value in ret, and get rid
      of the
      
      	ret = 0;
      out:
      
      pattern and simply allow break to have the proper error reporting.  With
      this fix we properly abort the transaction and do not commit thinking we
      successfully deleted the csum.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      b71bc6fb
    • J
      btrfs: mark ordered extent and inode with error if we fail to finish · d005f5ff
      Josef Bacik committed
      stable inclusion
      from stable-5.10.43
      commit b547a16b24918edd63042f9d81c0d310212d2e94
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit d61bec08 upstream.
      
      While doing error injection testing I saw that sometimes we'd get an
      abort that wouldn't stop the current transaction commit from completing.
      This abort was coming from finish ordered IO, but at this point in the
      transaction commit we should have gotten an error and stopped.
      
      It turns out the abort came from finish ordered io while trying to write
      out the free space cache.  It occurred to me that any failure inside of
      finish_ordered_io isn't actually raised to the person doing the writing,
      so we could have any number of failures in this path and think the
      ordered extent completed successfully and the inode was fine.
      
      Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
      marking the mapping of the inode with mapping_set_error, so any callers
      that simply call fdatawait will also get the error.
      
      With this we're seeing the IO error on the free space inode when we fail
      to do the finish_ordered_io.
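      
      A rough user-space sketch of the idea, using hypothetical types: the
      flag bit and the sticky per-mapping error below only imitate
      BTRFS_ORDERED_IOERR and mapping_set_error, they are not the kernel
      definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ORDERED_IOERR (1u << 0)	/* illustrative flag bit */

struct sketch_mapping {
	int error;	/* sticky error reported to waiters */
};

struct sketch_ordered_extent {
	unsigned int flags;
	struct sketch_mapping *mapping;
};

/* On failure, flag the ordered extent and record the error on the mapping. */
int sketch_finish_ordered_io(struct sketch_ordered_extent *oe, int ret)
{
	assert(oe->mapping != NULL);
	if (ret) {
		oe->flags |= SKETCH_ORDERED_IOERR;
		if (!oe->mapping->error)
			oe->mapping->error = ret;
	}
	return ret;
}

/* A waiter in the style of fdatawait: returns and clears the recorded error. */
int sketch_fdatawait(struct sketch_mapping *mapping)
{
	int err = mapping->error;

	mapping->error = 0;
	return err;
}
```

      A caller that never sees the return value of sketch_finish_ordered_io()
      still learns about the failure when it later waits on the mapping.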
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      d005f5ff
    • J
      btrfs: tree-checker: do not error out if extent ref hash doesn't match · 60d38948
      Josef Bacik committed
      stable inclusion
      from stable-5.10.43
      commit 1d62b7ac83e0791200563b9c2deb658c2d196dcf
      bugzilla: 109284
      CVE: NA
      
      --------------------------------
      
      commit 1119a72e upstream.
      
      The tree checker checks the extent ref hash at read and write time to
      make sure we do not corrupt the file system.  Generally extent
      references go inline, but if we have enough of them we need to make an
      item, which looks like
      
      key.objectid	= <bytenr>
      key.type	= <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
      key.offset	= hash(tree, owner, offset)
      
      However, if key.offset collides with an unrelated extent reference we'll
      simply key.offset++ until we get something that doesn't collide.
      Obviously this doesn't match at tree checker time, and thus we error
      while writing out the transaction.  This is relatively easy to
      reproduce; simply do something like the following:
      
        xfs_io -f -c "pwrite 0 1M" file
        offset=2
      
        for i in {0..10000}
        do
      	  xfs_io -c "reflink file 0 ${offset}M 1M" file
      	  offset=$(( offset + 2 ))
        done
      
        xfs_io -c "reflink file 0 17999258914816 1M" file
        xfs_io -c "reflink file 0 35998517829632 1M" file
        xfs_io -c "reflink file 0 53752752058368 1M" file
      
        btrfs filesystem sync
      
      And the sync will error out because we'll abort the transaction.  The
      magic values above are used because they generate hash collisions with
      the first file in the main subvol.
      
      The fix for this is to remove the hash value check from tree checker, as
      we have no idea which offset ours should belong to.
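      
      Why the check could not work can be sketched in user-space C. The tree
      is modelled as a flat array of used key.offset values and every insert
      is treated as a distinct reference; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_KEYS 16

struct sketch_tree {
	uint64_t offsets[SKETCH_MAX_KEYS];	/* key.offset values in use */
	size_t count;
};

bool sketch_offset_in_use(const struct sketch_tree *t, uint64_t offset)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->offsets[i] == offset)
			return true;
	return false;
}

/* Start at the hash and bump key.offset until a free slot is found. */
uint64_t sketch_insert_ref(struct sketch_tree *t, uint64_t hash)
{
	uint64_t offset = hash;

	assert(t->count < SKETCH_MAX_KEYS);
	while (sketch_offset_in_use(t, offset))
		offset++;
	t->offsets[t->count++] = offset;
	return offset;
}

/* The removed check: it insisted that key.offset equals the hash. */
bool sketch_checker_accepts(uint64_t key_offset, uint64_t hash)
{
	return key_offset == hash;
}
```

      Two references that hash to the same value end up at hash and hash + 1.
      The second is a valid item, yet a check of key.offset against the hash
      rejects it.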
      
      Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
      Fixes: 0785a9aa ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      60d38948
    • J
      btrfs: do not BUG_ON in link_to_fixup_dir · be0a0aa3
      Josef Bacik committed
      stable inclusion
      from stable-5.10.42
      commit 7e13db503918820e6333811cdc6f151dcea5090a
      bugzilla: 55093
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 91df99a6 ]
      
      While doing error injection testing I got the following panic
      
        kernel BUG at fs/btrfs/tree-log.c:1862!
        invalid opcode: 0000 [#1] SMP NOPTI
        CPU: 1 PID: 7836 Comm: mount Not tainted 5.13.0-rc1+ #305
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:link_to_fixup_dir+0xd5/0xe0
        RSP: 0018:ffffb5800180fa30 EFLAGS: 00010216
        RAX: fffffffffffffffb RBX: 00000000fffffffb RCX: ffff8f595287faf0
        RDX: ffffb5800180fa37 RSI: ffff8f5954978800 RDI: 0000000000000000
        RBP: ffff8f5953af9450 R08: 0000000000000019 R09: 0000000000000001
        R10: 000151f408682970 R11: 0000000120021001 R12: ffff8f5954978800
        R13: ffff8f595287faf0 R14: ffff8f5953c77dd0 R15: 0000000000000065
        FS:  00007fc5284c8c40(0000) GS:ffff8f59bbd00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc5287f47c0 CR3: 000000011275e002 CR4: 0000000000370ee0
        Call Trace:
         replay_one_buffer+0x409/0x470
         ? btree_read_extent_buffer_pages+0xd0/0x110
         walk_up_log_tree+0x157/0x1e0
         walk_log_tree+0xa6/0x1d0
         btrfs_recover_log_trees+0x1da/0x360
         ? replay_one_extent+0x7b0/0x7b0
         open_ctree+0x1486/0x1720
         btrfs_mount_root.cold+0x12/0xea
         ? __kmalloc_track_caller+0x12f/0x240
         legacy_get_tree+0x24/0x40
         vfs_get_tree+0x22/0xb0
         vfs_kern_mount.part.0+0x71/0xb0
         btrfs_mount+0x10d/0x380
         ? vfs_parse_fs_string+0x4d/0x90
         legacy_get_tree+0x24/0x40
         vfs_get_tree+0x22/0xb0
         path_mount+0x433/0xa10
         __x64_sys_mount+0xe3/0x120
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      We can get -EIO or any number of legitimate errors from
      btrfs_search_slot(), so panicking here is not the appropriate response.  The
      error path for this code handles errors properly, so simply return the
      error.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      be0a0aa3
    • F
      btrfs: release path before starting transaction when cloning inline extent · 69c0df05
      Filipe Manana committed
      stable inclusion
      from stable-5.10.42
      commit 88f566beb1cf843876cffc4153640e5267d1171b
      bugzilla: 55093
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 6416954c ]
      
      When cloning an inline extent there are a few cases, such as when we have
      an implicit hole at file offset 0, where we start a transaction while
      holding a read lock on a leaf. Starting the transaction results in a call
      to sb_start_intwrite(), which results in doing a read lock on a percpu
      semaphore. Lockdep doesn't like this and complains about it:
      
        [46.580704] ======================================================
        [46.580752] WARNING: possible circular locking dependency detected
        [46.580799] 5.13.0-rc1 #28 Not tainted
        [46.580832] ------------------------------------------------------
        [46.580877] cloner/3835 is trying to acquire lock:
        [46.580918] c00000001301d638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
        [46.581167]
        [46.581167] but task is already holding lock:
        [46.581217] c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
        [46.581293]
        [46.581293] which lock already depends on the new lock.
        [46.581293]
        [46.581351]
        [46.581351] the existing dependency chain (in reverse order) is:
        [46.581410]
        [46.581410] -> #1 (btrfs-tree-00){++++}-{3:3}:
        [46.581464]        down_read_nested+0x68/0x200
        [46.581536]        __btrfs_tree_read_lock+0x70/0x1d0
        [46.581577]        btrfs_read_lock_root_node+0x88/0x200
        [46.581623]        btrfs_search_slot+0x298/0xb70
        [46.581665]        btrfs_set_inode_index+0xfc/0x260
        [46.581708]        btrfs_new_inode+0x26c/0x950
        [46.581749]        btrfs_create+0xf4/0x2b0
        [46.581782]        lookup_open.isra.57+0x55c/0x6a0
        [46.581855]        path_openat+0x418/0xd20
        [46.581888]        do_filp_open+0x9c/0x130
        [46.581920]        do_sys_openat2+0x2ec/0x430
        [46.581961]        do_sys_open+0x90/0xc0
        [46.581993]        system_call_exception+0x3d4/0x410
        [46.582037]        system_call_common+0xec/0x278
        [46.582078]
        [46.582078] -> #0 (sb_internal#2){.+.+}-{0:0}:
        [46.582135]        __lock_acquire+0x1e90/0x2c50
        [46.582176]        lock_acquire+0x2b4/0x5b0
        [46.582263]        start_transaction+0x3cc/0x950
        [46.582308]        clone_copy_inline_extent+0xe4/0x5a0
        [46.582353]        btrfs_clone+0x5fc/0x880
        [46.582388]        btrfs_clone_files+0xd8/0x1c0
        [46.582434]        btrfs_remap_file_range+0x3d8/0x590
        [46.582481]        do_clone_file_range+0x10c/0x270
        [46.582558]        vfs_clone_file_range+0x1b0/0x310
        [46.582605]        ioctl_file_clone+0x90/0x130
        [46.582651]        do_vfs_ioctl+0x874/0x1ac0
        [46.582697]        sys_ioctl+0x6c/0x120
        [46.582733]        system_call_exception+0x3d4/0x410
        [46.582777]        system_call_common+0xec/0x278
        [46.582822]
        [46.582822] other info that might help us debug this:
        [46.582822]
        [46.582888]  Possible unsafe locking scenario:
        [46.582888]
        [46.582942]        CPU0                    CPU1
        [46.582984]        ----                    ----
        [46.583028]   lock(btrfs-tree-00);
        [46.583062]                                lock(sb_internal#2);
        [46.583119]                                lock(btrfs-tree-00);
        [46.583174]   lock(sb_internal#2);
        [46.583212]
        [46.583212]  *** DEADLOCK ***
        [46.583212]
        [46.583266] 6 locks held by cloner/3835:
        [46.583299]  #0: c00000001301d448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
        [46.583382]  #1: c00000000f6d3768 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
        [46.583477]  #2: c00000000f6d72a8 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
        [46.583574]  #3: c00000000f6d7138 (&ei->i_mmap_lock){+.+.}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
        [46.583657]  #4: c00000000f6d35f8 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
        [46.583743]  #5: c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
        [46.583828]
        [46.583828] stack backtrace:
        [46.583872] CPU: 1 PID: 3835 Comm: cloner Not tainted 5.13.0-rc1 #28
        [46.583931] Call Trace:
        [46.583955] [c0000000167c7200] [c000000000c1ee78] dump_stack+0xec/0x144 (unreliable)
        [46.584052] [c0000000167c7240] [c000000000274058] print_circular_bug.isra.32+0x3a8/0x400
        [46.584123] [c0000000167c72e0] [c0000000002741f4] check_noncircular+0x144/0x190
        [46.584191] [c0000000167c73b0] [c000000000278fc0] __lock_acquire+0x1e90/0x2c50
        [46.584259] [c0000000167c74f0] [c00000000027aa94] lock_acquire+0x2b4/0x5b0
        [46.584317] [c0000000167c75e0] [c000000000a0d6cc] start_transaction+0x3cc/0x950
        [46.584388] [c0000000167c7690] [c000000000af47a4] clone_copy_inline_extent+0xe4/0x5a0
        [46.584457] [c0000000167c77c0] [c000000000af525c] btrfs_clone+0x5fc/0x880
        [46.584514] [c0000000167c7990] [c000000000af5698] btrfs_clone_files+0xd8/0x1c0
        [46.584583] [c0000000167c7a00] [c000000000af5b58] btrfs_remap_file_range+0x3d8/0x590
        [46.584652] [c0000000167c7ae0] [c0000000005d81dc] do_clone_file_range+0x10c/0x270
        [46.584722] [c0000000167c7b40] [c0000000005d84f0] vfs_clone_file_range+0x1b0/0x310
        [46.584793] [c0000000167c7bb0] [c00000000058bf80] ioctl_file_clone+0x90/0x130
        [46.584861] [c0000000167c7c10] [c00000000058c894] do_vfs_ioctl+0x874/0x1ac0
        [46.584922] [c0000000167c7d10] [c00000000058db4c] sys_ioctl+0x6c/0x120
        [46.584978] [c0000000167c7d60] [c0000000000364a4] system_call_exception+0x3d4/0x410
        [46.585046] [c0000000167c7e10] [c00000000000d45c] system_call_common+0xec/0x278
        [46.585114] --- interrupt: c00 at 0x7ffff7e22990
        [46.585160] NIP:  00007ffff7e22990 LR: 00000001000010ec CTR: 0000000000000000
        [46.585224] REGS: c0000000167c7e80 TRAP: 0c00   Not tainted  (5.13.0-rc1)
        [46.585280] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000244  XER: 00000000
        [46.585374] IRQMASK: 0
        [46.585374] GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f17100 0000000000000004
        [46.585374] GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
        [46.585374] GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
        [46.585374] GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
        [46.585374] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        [46.585374] GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
        [46.585374] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
        [46.585374] GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
        [46.585919] NIP [00007ffff7e22990] 0x7ffff7e22990
        [46.585964] LR [00000001000010ec] 0x1000010ec
        [46.586010] --- interrupt: c00
      
      This should be a false positive, as both locks are acquired in read mode.
      Nevertheless, we don't need to hold a leaf locked when we start the
      transaction, so just release the leaf (path) before starting it.

      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/linux-btrfs/20210513214404.xks77p566fglzgum@riteshh-domain/
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      69c0df05
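
The lockdep report in this commit comes down to a cycle in the lock-order graph: one path takes the tree lock while sb_internal is held, and the clone path takes sb_internal while a tree lock is held. The following is a minimal Python sketch of that idea only. It is not lockdep's algorithm (which also tracks lock classes and read/write modes), and `LockOrderGraph` and its methods are hypothetical names introduced for illustration.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy lock-order validator: records 'held -> acquired' edges."""

    def __init__(self):
        self.edges = defaultdict(set)

    def _reachable(self, src, dst):
        # Depth-first search over the dependencies recorded so far.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, held, new):
        """Record that `new` is taken while `held` is held.

        Returns True when the new edge closes a cycle, meaning `held`
        was earlier acquired (directly or transitively) under `new`.
        """
        cycle = self._reachable(new, held)
        self.edges[held].add(new)
        return cycle


graph = LockOrderGraph()
# Chain #1 in the report: a tree lock is taken inside a transaction.
first = graph.acquire("sb_internal", "btrfs-tree-00")
# Chain #0: the clone path starts a transaction while holding a leaf lock.
second = graph.acquire("btrfs-tree-00", "sb_internal")
print(first, second)  # False True
```

In this model the fix corresponds to never recording the second edge: the leaf is released before the transaction starts, so the clone path holds no tree lock when sb_internal is taken.
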
    • B
      btrfs: return whole extents in fiemap · 71b80734
      Boris Burkov committed on
      stable inclusion
      from stable-5.10.42
      commit c7e0c6047c4f2a9664e48e3671af80c513220a44
      bugzilla: 55093
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 15c7745c ]
      
        `xfs_io -c 'fiemap <off> <len>' <file>`
      
      can give surprising results on btrfs that differ from xfs.
      
      btrfs prints out extents trimmed to fit the user input. If the user's
      fiemap request has an offset, then rather than returning each whole
      extent which intersects that range, we also trim the start extent to not
      have start < off.
      
      Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
      that returning the whole extent is expected.
      
      Some cases which all yield the same fiemap in xfs, but not btrfs:
        dd if=/dev/zero of=$f bs=4k count=1
        sudo xfs_io -c 'fiemap 0 1024' $f
          0: [0..7]: 26624..26631
        sudo xfs_io -c 'fiemap 2048 1024' $f
          0: [4..7]: 26628..26631
        sudo xfs_io -c 'fiemap 2048 4096' $f
          0: [4..7]: 26628..26631
        sudo xfs_io -c 'fiemap 3584 512' $f
          0: [7..7]: 26631..26631
        sudo xfs_io -c 'fiemap 4091 5' $f
          0: [7..6]: 26631..26630
      
      I believe this is a consequence of the logic for merging contiguous
      extents represented by separate extent items. That logic needs to track
      the last offset as it loops through the extent items, which happens to
      pick up the start offset on the first iteration, and trim off the
      beginning of the full extent. To fix it, start `off` at 0 rather than
      `start` so that we keep the iteration/merging intact without cutting off
      the start of the extent.
      
      After the fix, all the above commands give:
      
        0: [0..7]: 26624..26631
      
      The merging logic is exercised by fstest generic/483, and I have written
      a new fstest for checking we don't have backwards or zero-length fiemaps
      for cases like those above.

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      71b80734
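
The trimmed results listed in this commit message can be reproduced with a small model. This is a Python sketch, not the kernel code. It assumes a single 4 KiB extent at byte offset 0 whose physical start is block 26624, a request that intersects the extent, and a tool that truncates byte values to 512-byte blocks when printing. `fiemap_line` and its parameters are hypothetical names.

```python
SECTOR = 512  # the ranges in the examples are in 512-byte blocks


def fiemap_line(ext_start, ext_len, ext_phys, off, trim_start):
    """Format one extent in the style of the examples above.

    ext_start, ext_len and ext_phys describe the extent in bytes and
    `off` is the start offset of the request. With trim_start=True the
    reported start is clamped to `off` (the behaviour before the fix);
    otherwise the whole extent is reported.
    """
    start = max(ext_start, off) if trim_start else ext_start
    length = ext_start + ext_len - start
    phys = ext_phys + (start - ext_start)
    first = start // SECTOR
    last = first + length // SECTOR - 1
    pfirst = phys // SECTOR
    plast = pfirst + length // SECTOR - 1
    return f"0: [{first}..{last}]: {pfirst}..{plast}"


EXTENT = dict(ext_start=0, ext_len=4096, ext_phys=26624 * SECTOR)
for off in (0, 2048, 3584, 4091):
    print(off, fiemap_line(off=off, trim_start=True, **EXTENT))
    print(off, fiemap_line(off=off, trim_start=False, **EXTENT))
```

With trimming, offset 4091 leaves only 5 bytes of the extent, which truncates to zero blocks and produces the backwards range `[7..6]`. Without trimming, every offset yields `0: [0..7]: 26624..26631`.
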
  7. 03 Jun, 2021 7 commits
    • J
      btrfs: avoid RCU stalls while running delayed iputs · 7bbdf491
      Josef Bacik committed on
      stable inclusion
      from stable-5.10.40
      commit 56001dda032f84116c3b16d5140d64d77ae5a367
      bugzilla: 51882
      CVE: NA
      
      --------------------------------
      
      commit 71795ee5 upstream.
      
      Generally a delayed iput is added when we might do the final iput, so
      usually we'll end up sleeping while processing the delayed iputs
      naturally.  However there's no guarantee of this, especially for small
      files.  In production we noticed 5 instances of RCU stalls while testing
      a kernel release overnight across 1000 machines, so this is relatively
      common:
      
        host count: 5
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: ....: (20998 ticks this GP) idle=59e/1/0x4000000000000002 softirq=12333372/12333372 fqs=3208
         	(t=21031 jiffies g=27810193 q=41075) NMI backtrace for cpu 1
        CPU: 1 PID: 1713 Comm: btrfs-cleaner Kdump: loaded Not tainted 5.6.13-0_fbk12_rc1_5520_gec92bffc1ec9 #1
        Call Trace:
          <IRQ> dump_stack+0x50/0x70
          nmi_cpu_backtrace.cold.6+0x30/0x65
          ? lapic_can_unplug_cpu.cold.30+0x40/0x40
          nmi_trigger_cpumask_backtrace+0xba/0xca
          rcu_dump_cpu_stacks+0x99/0xc7
          rcu_sched_clock_irq.cold.90+0x1b2/0x3a3
          ? trigger_load_balance+0x5c/0x200
          ? tick_sched_do_timer+0x60/0x60
          ? tick_sched_do_timer+0x60/0x60
          update_process_times+0x24/0x50
          tick_sched_timer+0x37/0x70
          __hrtimer_run_queues+0xfe/0x270
          hrtimer_interrupt+0xf4/0x210
          smp_apic_timer_interrupt+0x5e/0x120
          apic_timer_interrupt+0xf/0x20 </IRQ>
         RIP: 0010:queued_spin_lock_slowpath+0x17d/0x1b0
         RSP: 0018:ffffc9000da5fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
         RAX: 0000000000000000 RBX: ffff889fa81d0cd8 RCX: 0000000000000029
         RDX: ffff889fff86c0c0 RSI: 0000000000080000 RDI: ffff88bfc2da7200
         RBP: ffff888f2dcdd768 R08: 0000000001040000 R09: 0000000000000000
         R10: 0000000000000001 R11: ffffffff82a55560 R12: ffff88bfc2da7200
         R13: 0000000000000000 R14: ffff88bff6c2a360 R15: ffffffff814bd870
         ? kzalloc.constprop.57+0x30/0x30
         list_lru_add+0x5a/0x100
         inode_lru_list_add+0x20/0x40
         iput+0x1c1/0x1f0
         run_delayed_iput_locked+0x46/0x90
         btrfs_run_delayed_iputs+0x3f/0x60
         cleaner_kthread+0xf2/0x120
         kthread+0x10b/0x130
      
      Fix this by adding a cond_resched_lock() to the loop processing delayed
      iputs so we can avoid this sort of stall.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      7bbdf491
    • F
      btrfs: fix race leading to unpersisted data and metadata on fsync · 3ba0e7ac
      Filipe Manana committed on
      stable inclusion
      from stable-5.10.38
      commit bccb7dd137adea29ba406a936445dccc078e36cb
      bugzilla: 51875
      CVE: NA
      
      --------------------------------
      
      commit 626e9f41 upstream.
      
      When doing a fast fsync on a file, there is a race which can result in the
      fsync returning success to user space without logging the inode and without
      durably persisting new data.
      
      The following example shows one possible scenario for this:
      
         $ mkfs.btrfs -f /dev/sdc
         $ mount /dev/sdc /mnt
      
         $ touch /mnt/bar
         $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz
      
         # Now we have:
         # file bar == inode 257
         # file baz == inode 258
      
         $ mv /mnt/baz /mnt/foo
      
         # Now we have:
         # file bar == inode 257
         # file foo == inode 258
      
         $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
         # fsync bar before foo; this ordering is needed to trigger the race.
         $ xfs_io -c "fsync" /mnt/bar
         $ xfs_io -c "fsync" /mnt/foo
      
         # After this:
         # inode 257, file bar, is empty
         # inode 258, file foo, has 1M filled with 0xcd
      
         <power failure>
      
         # Replay the log:
         $ mount /dev/sdc /mnt
      
         # After this point file foo should have 1M filled with 0xcd and not 0xab
      
      The following steps explain how the race happens:
      
      1) Before the first fsync of inode 258, when it has the "baz" name, its
         ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
         The inode also has the full sync flag set;
      
      2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
         the generation of the current transaction, and set ->last_log_commit
         to 0, which is the current value of ->last_sub_trans (done at
         btrfs_log_inode()).
      
         The full sync flag is cleared from the inode during the fsync.
      
         The log sub transaction that was committed had an ID of 0 and when we
         synced the log, at btrfs_sync_log(), we incremented root->log_transid
         from 0 to 1;
      
      3) During the rename:
      
         We update inode 258, through btrfs_update_inode(), and that causes its
         ->last_sub_trans to be set to 1 (the current log transaction ID), and
         ->last_log_commit remains with a value of 0.
      
         After updating inode 258, because we logged the inode in the previous
         fsync, we log the inode again through the call to
         btrfs_log_new_name(). This results in updating the inode's
         ->last_log_commit from 0 to 1 (the current value of its
         ->last_sub_trans).
      
         The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
         the next log transaction;
      
      4) Then a buffered write against inode 258 is made. This leaves the value
         of ->last_sub_trans as 1 (the ID of the current log transaction, stored
         at root->log_transid);
      
      5) Then an fsync against inode 257 (or any inode other than 258)
         happens. This results in committing the log transaction with ID 1,
         which results in updating root->last_log_commit to 1 and bumping
         root->log_transid from 1 to 2;
      
      6) Then an fsync against inode 258 starts. We flush delalloc and wait only
         for writeback to complete, since the full sync flag is not set in the
         inode's runtime flags - we do not wait for ordered extents to complete.
      
         Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
         ordered extent completes. The call returns true:
      
           static inline bool btrfs_inode_in_log(...)
           {
               bool ret = false;
      
               spin_lock(&inode->lock);
               if (inode->logged_trans == generation &&
                   inode->last_sub_trans <= inode->last_log_commit &&
                   inode->last_sub_trans <= inode->root->last_log_commit)
                       ret = true;
               spin_unlock(&inode->lock);
               return ret;
           }
      
         generation has a value of 6 (fs_info->generation), ->logged_trans also
         has a value of 6 (set when we logged the inode during the first fsync
         and when logging it during the rename), ->last_sub_trans has a value
         of 1, set during the rename (step 3), ->last_log_commit also has a
         value of 1 (set in step 3) and root->last_log_commit has a value of 1,
         which was set in step 5 when fsyncing inode 257.
      
         As a consequence we do not log the inode or any new extents, and we
         do not sync the log, resulting in data loss if a power failure happens
         after the fsync and before the current transaction commits.
         Also, because we do not log the inode, after a power failure the mtime
         and ctime of the inode do not match those we had before.
      
         When the ordered extent completes before we call btrfs_inode_in_log(),
         then the call returns false and we log the inode and sync the log,
         since at the end of ordered extent completion we update the inode and
         set ->last_sub_trans to 2 (the value of root->log_transid) and
         ->last_log_commit to 1.
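
The check quoted in step 6 can be replayed with the values from the steps above. This is a Python model of the condition only, not kernel code. `InodeState`, `can_skip_logging` and `ordered_extents_in_flight` are hypothetical names; the second function sketches the rule stated later in this message (never skip logging while ordered extents are in flight) rather than the actual helper.

```python
from dataclasses import dataclass


@dataclass
class InodeState:
    logged_trans: int
    last_sub_trans: int
    last_log_commit: int


def inode_in_log(inode, generation, root_last_log_commit):
    """Model of the three-part condition in btrfs_inode_in_log()."""
    return (inode.logged_trans == generation
            and inode.last_sub_trans <= inode.last_log_commit
            and inode.last_sub_trans <= root_last_log_commit)


def can_skip_logging(inode, generation, root_last_log_commit,
                     ordered_extents_in_flight):
    """Hypothetical helper: pending ordered extents always force logging."""
    if ordered_extents_in_flight:
        return False
    return inode_in_log(inode, generation, root_last_log_commit)


# Inode 258 at step 6, before the ordered extent completes.
racy = InodeState(logged_trans=6, last_sub_trans=1, last_log_commit=1)
# Same inode after ordered extent completion set ->last_sub_trans to 2.
completed = InodeState(logged_trans=6, last_sub_trans=2, last_log_commit=1)

print(inode_in_log(racy, 6, 1))            # True: the fsync skips logging
print(inode_in_log(completed, 6, 1))       # False: the inode gets logged
print(can_skip_logging(racy, 6, 1, True))  # False: the race is closed
```

The first result is the race: every field claims the inode is already in the log, although the new data has not been accounted for yet.
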
      
      This problem is found after removing the check for the emptiness of the
      inode's list of modified extents in the recent commit 209ecbb8
      ("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
      added in the 5.13 merge window. However checking the emptiness of the
      list is not really the way to solve this problem, and was never intended
      to, because while that solves the problem for COW writes, the problem
      persists for NOCOW writes because in that case the list is always empty.
      
      In the case of NOCOW writes, even though we wait for the writeback to
      complete before returning from btrfs_sync_file(), we end up not logging
      the inode, which has a new mtime/ctime, and because we don't sync the log,
      we never issue disk barriers (send REQ_PREFLUSH to the device) since that
      only happens when we sync the log (when we write super blocks at
      btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
      btrfs_sync_file() to user space, we are not guaranteeing that the data is
      durably persisted on disk.
      
      Also, while the example above uses a rename to show how the
      problem happens, it is not the only way to trigger it. An alternative
      could be adding a new hard link to inode 258, since that also results
      in calling btrfs_log_new_name() and updating the inode in the log.
      An example reproducer using the addition of a hard link instead of a
      rename operation:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ touch /mnt/bar
        $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo
      
        $ ln /mnt/foo /mnt/foo_link
        $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
        $ xfs_io -c "fsync" /mnt/bar
        $ xfs_io -c "fsync" /mnt/foo
      
        <power failure>
      
        # Replay the log:
        $ mount /dev/sdc /mnt
      
        # After this point file foo often has 1M filled with 0xab and not 0xcd
      
      The reasons leading to the final fsync of file foo, inode 258, not
      persisting the new data are the same as for the previous example with
      a rename operation.
      
      So fix this by never skipping logging and log syncing when there are still
      any ordered extents in flight. To avoid making the conditional if statement
      that checks if logging an inode is needed harder to read, place all the
      logic into a helper function with separate if statements to make it more
      manageable and easier to read.
      
      A test case for fstests will follow soon.
      
      For NOCOW writes, the problem existed before commit b5e6c3e1
      ("btrfs: always wait on ordered extents at fsync time"), introduced in
      kernel 4.19, then it went away with that commit since we started to always
      wait for ordered extent completion before logging.
      
      The problem came back again once the fast fsync path was changed again to
      avoid waiting for ordered extent completion, in commit 48778179
      ("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.
      
      However, for COW writes, the race only happens after the recent
      commit 209ecbb8 ("btrfs: remove stale comment and logic from
      btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
      writes, the bug existed before that commit. So tag 5.10+ as the release
      for stable backports.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      3ba0e7ac
    • F
      btrfs: fix race when picking most recent mod log operation for an old root · 1226f171
      Filipe Manana committed on
      stable inclusion
      from stable-5.10.36
      commit 1d852d6bb4d44baac57452be5c2857741139fc59
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit f9690f42 ]
      
      Commit dbcc7d57 ("btrfs: fix race when cloning extent buffer during
      rewind of an old root"), fixed a race when we need to rewind the extent
      buffer of an old root. It was caused by picking a new mod log operation
      for the extent buffer while getting a cloned extent buffer with an outdated
      number of items (off by -1), because we cloned the extent buffer without
      locking it first.
      
      However there is still another similar race, but in the opposite direction.
      The cloned extent buffer has a number of items that does not match the
      number of tree mod log operations that are going to be replayed. This is
      because right after we got the last (most recent) tree mod log operation to
      replay and before locking and cloning the extent buffer, another task adds
      a new pointer to the extent buffer, which results in adding a new tree mod
      log operation and incrementing the number of items in the extent buffer.
      So after cloning we have a mismatch between the number of items in the extent
      buffer and the number of mod log operations we are going to apply to it.
      This results in hitting a BUG_ON() that produces the following stack trace:
      
         ------------[ cut here ]------------
         kernel BUG at fs/btrfs/tree-mod-log.c:675!
         invalid opcode: 0000 [#1] SMP KASAN PTI
         CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G        W         5.12.0-7d1efdf501f8-misc-next+ #99
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
         RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
         Code: 05 48 8d 74 10 (...)
         RSP: 0018:ffffc90001027090 EFLAGS: 00010293
         RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
         RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
         RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
         R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
         R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
         FS:  00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
         Call Trace:
          btrfs_get_old_root+0x16a/0x5c0
          ? lock_downgrade+0x400/0x400
          btrfs_search_old_slot+0x192/0x520
          ? btrfs_search_slot+0x1090/0x1090
          ? free_extent_buffer.part.61+0xd7/0x140
          ? free_extent_buffer+0x13/0x20
          resolve_indirect_refs+0x3e9/0xfc0
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? add_prelim_ref.part.11+0x150/0x150
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? lock_acquired+0xbb/0x620
          ? __kasan_check_write+0x14/0x20
          ? do_raw_spin_unlock+0xa8/0x140
          ? rb_insert_color+0x340/0x360
          ? prelim_ref_insert+0x12d/0x430
          find_parent_nodes+0x5c3/0x1830
          ? stack_trace_save+0x87/0xb0
          ? resolve_indirect_refs+0xfc0/0xfc0
          ? fs_reclaim_acquire+0x67/0xf0
          ? __kasan_check_read+0x11/0x20
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? fs_reclaim_acquire+0x67/0xf0
          ? __kasan_check_read+0x11/0x20
          ? ___might_sleep+0x10f/0x1e0
          ? __kasan_kmalloc+0x9d/0xd0
          ? trace_hardirqs_on+0x55/0x120
          btrfs_find_all_roots_safe+0x142/0x1e0
          ? find_parent_nodes+0x1830/0x1830
          ? trace_hardirqs_on+0x55/0x120
          ? ulist_free+0x1f/0x30
          ? btrfs_inode_flags_to_xflags+0x50/0x50
          iterate_extent_inodes+0x20e/0x580
          ? tree_backref_for_extent+0x230/0x230
          ? release_extent_buffer+0x225/0x280
          ? read_extent_buffer+0xdd/0x110
          ? lock_downgrade+0x400/0x400
          ? __kasan_check_read+0x11/0x20
          ? lock_acquired+0xbb/0x620
          ? __kasan_check_write+0x14/0x20
          ? do_raw_spin_unlock+0xa8/0x140
          ? _raw_spin_unlock+0x22/0x30
          ? release_extent_buffer+0x225/0x280
          iterate_inodes_from_logical+0x129/0x170
          ? iterate_inodes_from_logical+0x129/0x170
          ? btrfs_inode_flags_to_xflags+0x50/0x50
          ? iterate_extent_inodes+0x580/0x580
          ? __vmalloc_node+0x92/0xb0
          ? init_data_container+0x34/0xb0
          ? init_data_container+0x34/0xb0
          ? kvmalloc_node+0x60/0x80
          btrfs_ioctl_logical_to_ino+0x158/0x230
          btrfs_ioctl+0x2038/0x4360
          ? __kasan_check_write+0x14/0x20
          ? mmput+0x3b/0x220
          ? btrfs_ioctl_get_supported_features+0x30/0x30
          ? __kasan_check_read+0x11/0x20
          ? __kasan_check_read+0x11/0x20
          ? lock_release+0xc8/0x650
          ? __might_fault+0x64/0xd0
          ? __kasan_check_read+0x11/0x20
          ? lock_downgrade+0x400/0x400
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? lockdep_hardirqs_on_prepare+0x13/0x210
          ? _raw_spin_unlock_irqrestore+0x51/0x63
          ? __kasan_check_read+0x11/0x20
          ? do_vfs_ioctl+0xfc/0x9d0
          ? ioctl_file_clone+0xe0/0xe0
          ? lock_downgrade+0x400/0x400
          ? lockdep_hardirqs_on_prepare+0x210/0x210
          ? __kasan_check_read+0x11/0x20
          ? lock_release+0xc8/0x650
          ? __task_pid_nr_ns+0xd3/0x250
          ? __kasan_check_read+0x11/0x20
          ? __fget_files+0x160/0x230
          ? __fget_light+0xf2/0x110
          __x64_sys_ioctl+0xc3/0x100
          do_syscall_64+0x37/0x80
          entry_SYSCALL_64_after_hwframe+0x44/0xae
         RIP: 0033:0x7f29ae85b427
         Code: 00 00 90 48 8b (...)
         RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
         RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
         RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
         RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
         R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
         R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
         Modules linked in:
         ---[ end trace 85e5fce078dfbe04 ]---
      
        (gdb) l *(tree_mod_log_rewind+0x3b1)
        0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
        670                      * the modification. As we're going backwards, we do the
        671                      * opposite of each operation here.
        672                      */
        673                     switch (tm->op) {
        674                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        675                             BUG_ON(tm->slot < n);
        676                             fallthrough;
        677                     case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        678                     case BTRFS_MOD_LOG_KEY_REMOVE:
        679                             btrfs_set_node_key(eb, &tm->key, tm->slot);
        (gdb) quit
      
      The following steps explain in more detail how it happens:
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
         This is task A;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
      
      3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
         of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
         pointing to ebX->start. We only have one item in eb X, so we create
         only one tree mod log operation, and store in the "tm_list" array;
      
      4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
         log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
         to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
         set to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         "tm_list" as argument. After that, tree_mod_log_free_eb() calls
         tree_mod_log_insert(). This inserts the mod log operation of type
         BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
         with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
      
      6) Then, after inserting the "tm_list" single element into the tree mod
         log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
         gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
      
      7) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      8) Later some other task B allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      9) The tree mod log user task calls btrfs_search_old_slot(), which calls
         btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
         with time_seq == 1 and eb_root == eb Y;
      
      10) The first iteration of the while loop finds the tree mod log element
          with sequence number 3, for the logical address of eb Y and of type
          BTRFS_MOD_LOG_ROOT_REPLACE;
      
      11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
          break out of the loop, and set root_logical to point to
          tm->old_root.logical, which corresponds to the logical address of
          eb X;
      
      12) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
          corresponds to the old slot 0 of eb X (eb X had only 1 item in it
          before being freed at step 7);
      
      13) We then break out of the while loop and return the tree mod log
          operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
          for slot 0 of eb X, to btrfs_get_old_root();
      
      14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
          operation and set "logical" to the logical address of eb X, which was
          the old root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      15) But before calling tree_mod_log_search(), task B locks eb X, adds a
          key to eb X, which results in adding a tree mod log operation of type
          BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
          log, and increments the number of items in eb X from 0 to 1.
          Now fs_info->tree_mod_seq has a value of 4;
      
      16) Task A then calls tree_mod_log_search(), which returns the most recent
          tree mod log operation for eb X, which is the one just added by task B
          at the previous step, with a sequence number of 4, a type of
          BTRFS_MOD_LOG_KEY_ADD and for slot 0;
      
      17) Before task A locks and clones eb X, task B adds another key to eb X,
          which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
          with a sequence number of 5, for slot 1 of eb X, increments the
          number of items in eb X from 1 to 2, and unlocks eb X.
          Now fs_info->tree_mod_seq has a value of 5;
      
      18) Task A then locks eb X and clones it. The clone has a value of 2 for
          the number of items and the pointer "tm" points to the tree mod log
          operation with sequence number 4, not the most recent one with a
          sequence number of 5, so there is a mismatch between the number of
          mod log operations that are going to be applied to the cloned version
          of eb X and the number of items in the clone;
      
      19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
          tree mod log operation with sequence number 4 and a type of
          BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
      
      20) At tree_mod_log_rewind(), we set the local variable "n" with a value
          of 2, which is the number of items in the clone of eb X.
      
          Then in the first iteration of the while loop, we process the mod log
          operation with sequence number 4, which is targeted at slot 0 and has
          a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
          2 to 1.
      
          Then we pick the next tree mod log operation for eb X, which is the
          tree mod log operation with a sequence number of 2, a type of
          BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0; it is the one
          added in step 5 to the tree mod log tree.
      
          We go back to the top of the loop to process this mod log operation,
          and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
      
              (...)
              switch (tm->op) {
              case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                      BUG_ON(tm->slot < n);
                      fallthrough;
              (...)
      
      Fix this by checking for a more recent tree mod log operation after locking
      and cloning the extent buffer of the old root node, and use it as the first
      operation to apply to the cloned extent buffer when rewinding it.
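
      The mismatch from steps 18 to 20 can be reproduced with a small
      userspace model. This is only a sketch under simplifying assumptions:
      the struct, enum and function names below are hypothetical, the rbtree
      is replaced by an array sorted from newest to oldest, and only the two
      operation types from the walkthrough are modelled. It is not the
      kernel's tree_mod_log_rewind().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the tree mod log operation types in the text. */
enum op_type { OP_KEY_ADD, OP_KEY_REMOVE_WHILE_FREEING };

struct mod_op {
	unsigned long seq;	/* sequence number of the operation */
	enum op_type op;
	int slot;
};

/*
 * Replay operations from newest to oldest, starting at index `first`,
 * for as long as their sequence number is >= time_seq.
 * `ops` must be sorted from newest to oldest.
 * Returns the rewound item count, or -1 where the kernel would hit BUG_ON.
 */
int rewind_model(const struct mod_op *ops, size_t count, size_t first,
		 int nritems, unsigned long time_seq)
{
	int n = nritems;

	assert(first < count);
	for (size_t i = first; i < count && ops[i].seq >= time_seq; i++) {
		switch (ops[i].op) {
		case OP_KEY_REMOVE_WHILE_FREEING:
			if (ops[i].slot < n)
				return -1;	/* models BUG_ON(tm->slot < n) */
			n++;			/* undo the removal */
			break;
		case OP_KEY_ADD:
			n--;			/* undo the addition */
			break;
		}
	}
	return n;
}

/* Operations logged for eb X in the walkthrough, newest first. */
const struct mod_op eb_x_ops[] = {
	{ 5, OP_KEY_ADD, 1 },
	{ 4, OP_KEY_ADD, 0 },
	{ 2, OP_KEY_REMOVE_WHILE_FREEING, 0 },
};
```

      With the 2-item clone and time_seq == 1, starting the replay at the
      stale operation (index 1, sequence number 4) ends in the modelled
      BUG_ON, while starting at the most recent operation (index 0, sequence
      number 5) rewinds cleanly to 1 item, which is what eb X held before it
      was freed.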
      
      Stable backport notes: due to moved code and renames, in <= 5.11 the
      change should be applied to ctree.c:get_old_root.

      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
      Fixes: 834328a8 ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      1226f171
    • J
      btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s · 5b953513
      Josef Bacik committed
      stable inclusion
      from stable-5.10.36
      commit 9c60c881d662a8aa3c70717d53eccbbe951c979f
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 7a9213a9 ]
      
      A few BUG_ON()'s in replace_path are purely to keep us from making
      logical mistakes, so replace them with ASSERT()'s.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      5b953513
    • J
      btrfs: do proper error handling in btrfs_update_reloc_root · 2893b014
      Josef Bacik committed
      stable inclusion
      from stable-5.10.36
      commit f32b84d7c977e1906a4781b93b3c93090b6cd675
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 592fbcd5 ]
      
      We call btrfs_update_root in btrfs_update_reloc_root, which can fail for
      all sorts of reasons, including IO errors.  Instead of panicking the box,
      let's return the error, now that all callers properly handle those
      errors.
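
      The pattern described here can be sketched in userspace as follows.
      Both function names are hypothetical stand-ins and the failure is
      simulated with a flag; this only illustrates handing an error back to
      the caller instead of crashing, not the actual relocation code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for btrfs_update_root(): 0 or a negative errno. */
int update_root_sketch(int simulate_io_error)
{
	return simulate_io_error ? -EIO : 0;
}

/*
 * Sketch of the caller after the change: a failure of the update is
 * returned to the caller rather than treated as fatal.
 */
int update_reloc_root_sketch(int simulate_io_error)
{
	int ret;

	ret = update_root_sketch(simulate_io_error);
	assert(ret <= 0);	/* sketch convention: 0 or negative errno */
	if (ret)
		return ret;	/* propagate instead of panicking */
	return 0;
}
```

      A caller can then check the return value and abort its own operation
      cleanly, which is only safe because, as the message notes, all callers
      already handle such errors.
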
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      2893b014
    • J
      btrfs: do proper error handling in create_reloc_root · cc47c520
      Josef Bacik committed
      stable inclusion
      from stable-5.10.36
      commit 224c654a2eca6a29009b80c887bcf3ac4b2cab30
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 84c50ba5 ]
      
      We do memory allocations here, read blocks from disk, all sorts of
      operations that could easily fail at any given point.  Instead of
      panicking the box, simply return the error back up the chain; all callers
      at this point have proper error handling.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      cc47c520
    • F
      btrfs: fix race between transaction aborts and fsyncs leading to use-after-free · 8eff4d5b
      Filipe Manana committed
      stable inclusion
      from stable-5.10.36
      commit a4794be7b00b7eda4b45fffd283ab7d76df7e5d6
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      commit 061dde82 upstream.
      
      There is a race between a task aborting a transaction during a commit,
      a task doing an fsync and the transaction kthread, which leads to a
      use-after-free of the log root tree. When this happens, it results in a
      stack trace like the following:
      
        BTRFS info (device dm-0): forced readonly
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure
        BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5)
        BTRFS warning (device dm-0): Skipping commit of aborted transaction.
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10
        BTRFS error (device dm-0): error writing primary super block to device 1
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10
        BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10
        BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers)
        BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure
        general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        RIP: 0010:__mutex_lock+0x139/0xa40
        Code: c0 74 19 (...)
        RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202
        RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002
        RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040
        R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358
        FS:  00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
         ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
         ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
         btrfs_sync_log+0x7c1/0xf20 [btrfs]
         btrfs_sync_file+0x40c/0x580 [btrfs]
         do_fsync+0x38/0x70
         __x64_sys_fsync+0x10/0x20
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fa9142a55c3
        Code: 8b 15 09 (...)
        RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
        RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3
        RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005
        RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340
        R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0
        Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
        ---[ end trace ee2f1b19327d791d ]---
      
      The steps that lead to this crash are the following:
      
      1) We are at transaction N;
      
      2) We have two tasks with a transaction handle attached to transaction N.
         Task A and Task B. Task B is doing an fsync;
      
      3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree
         into a local variable named 'log_root_tree' at the top of
         btrfs_sync_log(). Task B is about to call write_all_supers(), but
         before that...
      
      4) Task A calls btrfs_commit_transaction(), and after it sets the
         transaction state to TRANS_STATE_COMMIT_START, an error happens before
         it waits for the transaction's 'num_writers' counter to reach a value
         of 1 (no one else attached to the transaction), so it jumps to the
         label "cleanup_transaction";
      
      5) Task A then calls cleanup_transaction(), where it aborts the
         transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state,
         setting the ->aborted field of the transaction and the handle to an
         errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state.
      
         After that, at cleanup_transaction(), it deletes the transaction from
         the list of transactions (fs_info->trans_list), sets the transaction
         to the state TRANS_STATE_COMMIT_DOING and then waits for the number
         of writers to go down to 1, as it's currently 2 (1 for task A and 1
         for task B);
      
      6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR
         is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction().
      
         There it sees the list fs_info->trans_list is empty, and then proceeds
         to call btrfs_drop_all_logs(), which frees the log root tree with
         a call to btrfs_free_log_root_tree();
      
      7) Task B calls write_all_supers() and, shortly after, under the label
         'out_wake_log_root', it dereferences the pointer stored in
         'log_root_tree', which was already freed in the previous step by the
         transaction kthread. This results in a use-after-free leading to a
         crash.
      
      Fix this by deleting the transaction from the list of transactions at
      cleanup_transaction() only after setting the transaction state to
      TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are
      attached to the transaction to release their transaction handles.
      This makes the transaction kthread wait for all the tasks attached to
      the transaction to be done with the transaction before dropping the
      log roots and doing other cleanups.
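
      The effect of the reordering can be illustrated with a sequential toy
      model that replays the interleaving from the steps above. This is a
      sketch only: the struct, its fields and both functions are
      hypothetical, there is no real concurrency or locking, and "freeing"
      is just a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state; field names are illustrative, not the kernel's. */
struct fs_model {
	bool trans_on_list;	/* transaction N still on the transaction list */
	int  num_writers;	/* handles attached to transaction N */
	bool log_root_freed;	/* log root tree released by the kthread */
};

/* Transaction kthread: drops the logs only once the list is empty. */
void kthread_cleanup(struct fs_model *fs)
{
	if (!fs->trans_on_list)
		fs->log_root_freed = true;
}

/*
 * Replays the interleaving from the steps above and reports whether
 * task B touches the log root tree after it was freed.
 */
bool task_b_hits_use_after_free(bool fixed_order)
{
	struct fs_model fs = { true, 2, false };
	bool uaf;

	/* Task A, old order: unlink the transaction first, then wait. */
	if (!fixed_order)
		fs.trans_on_list = false;

	/* The kthread runs while task A waits for the writers to drop to 1. */
	kthread_cleanup(&fs);

	/* Task B dereferences its saved log root tree pointer. */
	uaf = fs.log_root_freed;
	fs.num_writers--;		/* task B releases its handle */

	/* Fixed order: unlink only after the other writers are gone. */
	if (fixed_order) {
		assert(fs.num_writers == 1);
		fs.trans_on_list = false;
		kthread_cleanup(&fs);	/* now safe to drop the logs */
	}
	return uaf;
}
```

      In this model, task_b_hits_use_after_free(false) reports the
      use-after-free, while task_b_hits_use_after_free(true) does not,
      because the kthread cannot see an empty list until task B has released
      its handle.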
      
      Fixes: ef67963d ("btrfs: drop logs when we've aborted a transaction")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      8eff4d5b