1. 05 12月, 2019 4 次提交
  2. 01 12月, 2019 3 次提交
  3. 21 11月, 2019 1 次提交
    • F
      Btrfs: fix log context list corruption after rename exchange operation · 47d06a15
      Filipe Manana 提交于
      commit e6c617102c7e4ac1398cb0b98ff1f0727755b520 upstream.
      
      During rename exchange we might have successfully log the new name in the
      source root's log tree, in which case we leave our log context (allocated
      on stack) in the root's list of log contextes. However we might fail to
      log the new name in the destination root, in which case we fallback to
      a transaction commit later and never sync the log of the source root,
      which causes the source root log context to remain in the list of log
      contextes. This later causes invalid memory accesses because the context
      was allocated on stack and after rename exchange finishes the stack gets
      reused and overwritten for other purposes.
      
      The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
      detect this and report something like the following:
      
        [  691.489929] ------------[ cut here ]------------
        [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
        [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
        (...)
        [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
        [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
        (...)
        [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
        [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
        [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
        [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
        [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
        [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
        [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
        [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
        [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  691.490029] Call Trace:
        [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
        [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
        [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
        [  691.490111]  ? dget_parent+0xb8/0x460
        [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
        [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
        [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
        [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
        [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
        [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
        [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
        [  691.490204]  ? __do_sys_newstat+0x88/0xd0
        [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
        [  691.490218]  ? do_fsync+0x38/0x60
        [  691.490220]  do_fsync+0x38/0x60
        [  691.490224]  __x64_sys_fdatasync+0x32/0x40
        [  691.490228]  do_syscall_64+0x9f/0x540
        [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [  691.490235] RIP: 0033:0x7f4b253ad5f0
        (...)
        [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
        [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
        [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
        [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
        [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
      
      This started happening recently when running some test cases from fstests
      like btrfs/004 for example, because support for rename exchange was added
      last week to fsstress from fstests.
      
      So fix this by deleting the log context for the source root from the list
      if we have logged the new name in the source root.
      Reported-by: NSu Yue <Damenly_Su@gmx.com>
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: NSu Yue <Damenly_Su@gmx.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      47d06a15
  4. 06 11月, 2019 5 次提交
    • F
      Btrfs: fix deadlock on tree root leaf when finding free extent · cb38a17c
      Filipe Manana 提交于
      [ Upstream commit 4222ea7100c0e37adace2790c8822758bbeee179 ]
      
      When we are writing out a free space cache, during the transaction commit
      phase, we can end up in a deadlock which results in a stack trace like the
      following:
      
       schedule+0x28/0x80
       btrfs_tree_read_lock+0x8e/0x120 [btrfs]
       ? finish_wait+0x80/0x80
       btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
       btrfs_search_slot+0xf6/0x9f0 [btrfs]
       ? evict_refill_and_join+0xd0/0xd0 [btrfs]
       ? inode_insert5+0x119/0x190
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_iget+0x113/0x690 [btrfs]
       __lookup_free_space_inode+0xd8/0x150 [btrfs]
       lookup_free_space_inode+0x5b/0xb0 [btrfs]
       load_free_space_cache+0x7c/0x170 [btrfs]
       ? cache_block_group+0x72/0x3b0 [btrfs]
       cache_block_group+0x1b3/0x3b0 [btrfs]
       ? finish_wait+0x80/0x80
       find_free_extent+0x799/0x1010 [btrfs]
       btrfs_reserve_extent+0x9b/0x180 [btrfs]
       btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
       __btrfs_cow_block+0x11d/0x500 [btrfs]
       btrfs_cow_block+0xdc/0x180 [btrfs]
       btrfs_search_slot+0x3bd/0x9f0 [btrfs]
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_update_inode_item+0x46/0x100 [btrfs]
       cache_save_setup+0xe4/0x3a0 [btrfs]
       btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
       btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
      
      At cache_save_setup() we need to update the inode item of a block group's
      cache which is located in the tree root (fs_info->tree_root), which means
      that it may result in COWing a leaf from that tree. If that happens we
      need to find a free metadata extent and while looking for one, if we find
      a block group which was not cached yet we attempt to load its cache by
      calling cache_block_group(). However this function will try to load the
      inode of the free space cache, which requires finding the matching inode
      item in the tree root - if that inode item is located in the same leaf as
      the inode item of the space cache we are updating at cache_save_setup(),
      we end up in a deadlock, since we try to obtain a read lock on the same
      extent buffer that we previously write locked.
      
      So fix this by using the tree root's commit root when searching for a
      block group's free space cache inode item when we are attempting to load
      a free space cache. This is safe since block groups once loaded stay in
      memory forever, as well as their caches, so after they are first loaded
      we will never need to read their inode items again. For new block groups,
      once they are created they get their ->cached field set to
      BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
      Reported-by: NAndrew Nelson <andrew.s.nelson@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
      Fixes: 9d66e233 ("Btrfs: load free space cache if it exists")
      Tested-by: NAndrew Nelson <andrew.s.nelson@gmail.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      cb38a17c
    • Q
      btrfs: tracepoints: Fix wrong parameter order for qgroup events · d8ab4185
      Qu Wenruo 提交于
      [ Upstream commit fd2b007eaec898564e269d1f478a2da0380ecf51 ]
      
      [BUG]
      For btrfs:qgroup_meta_reserve event, the trace event can output garbage:
      
        qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2
      
      The diff should always be alinged to sector size (4k), so there is
      definitely something wrong.
      
      [CAUSE]
      For the wrong @diff, it's caused by wrong parameter order.
      The correct parameters are:
      
        struct btrfs_root, s64 diff, int type.
      
      However the parameters used are:
      
        struct btrfs_root, int type, s64 diff.
      
      Fixes: 4ee0d883 ("btrfs: qgroup: Update trace events for metadata reservation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d8ab4185
    • Q
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 6bcbe350
      Qu Wenruo 提交于
      [ Upstream commit 8702ba9396bf7bbae2ab93c94acd4bd37cfa4f09 ]
      
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means, if btrfs_delalloc_release_extents() determines to free some
      space, then those space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6bcbe350
    • F
      Btrfs: fix memory leak due to concurrent append writes with fiemap · 96b9b946
      Filipe Manana 提交于
      [ Upstream commit c67d970f0ea8dcc423e112137d34334fa0abb8ec ]
      
      When we have a buffered write that starts at an offset greater than or
      equals to the file's size happening concurrently with a full ranged
      fiemap, we can end up leaking an extent state structure.
      
      Suppose we have a file with a size of 1Mb, and before the buffered write
      and fiemap are performed, it has a single extent state in its io tree
      representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
      
      The following sequence diagram shows how the memory leak happens if a
      fiemap a buffered write, starting at offset 1Mb and with a length of
      4Kb, are performed concurrently.
      
                CPU 1                                                  CPU 2
      
        extent_fiemap()
          --> it's a full ranged fiemap
              range from 0 to LLONG_MAX - 1
              (9223372036854775807)
      
          --> locks range in the inode's
              io tree
            --> after this we have 2 extent
                states in the io tree:
                --> 1 for range [0, 1Mb[ with
                    the bits EXTENT_LOCKED and
                    EXTENT_DELALLOC_BITS set
                --> 1 for the range
                    [1Mb, LLONG_MAX[ with
                    the EXTENT_LOCKED bit set
      
                                                        --> start buffered write at offset
                                                            1Mb with a length of 4Kb
      
                                                        btrfs_file_write_iter()
      
                                                          btrfs_buffered_write()
                                                            --> cached_state is NULL
      
                                                            lock_and_cleanup_extent_if_need()
                                                              --> returns 0 and does not lock
                                                                  range because it starts
                                                                  at current i_size / eof
      
                                                            --> cached_state remains NULL
      
                                                            btrfs_dirty_pages()
                                                              btrfs_set_extent_delalloc()
                                                                (...)
                                                                __set_extent_bit()
      
                                                                  --> splits extent state for range
                                                                      [1Mb, LLONG_MAX[ and now we
                                                                      have 2 extent states:
      
                                                                      --> one for the range
                                                                          [1Mb, 1Mb + 4Kb[ with
                                                                          EXTENT_LOCKED set
                                                                      --> another one for the range
                                                                          [1Mb + 4Kb, LLONG_MAX[ with
                                                                          EXTENT_LOCKED set as well
      
                                                                  --> sets EXTENT_DELALLOC on the
                                                                      extent state for the range
                                                                      [1Mb, 1Mb + 4Kb[
                                                                  --> caches extent state
                                                                      [1Mb, 1Mb + 4Kb[ into
                                                                      @cached_state because it has
                                                                      the bit EXTENT_LOCKED set
      
                                                          --> btrfs_buffered_write() ends up
                                                              with a non-NULL cached_state and
                                                              never calls anything to release its
                                                              reference on it, resulting in a
                                                              memory leak
      
      Fix this by calling free_extent_state() on cached_state if the range was
      not locked by lock_and_cleanup_extent_if_need().
      
      The same issue can happen if anything else other than fiemap locks a range
      that covers eof and beyond.
      
      This could be triggered, sporadically, by test case generic/561 from the
      fstests suite, which makes duperemove run concurrently with fsstress, and
      duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
      the leak is reported in dmesg/syslog when removing the btrfs module with
      a message like the following:
      
        [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
      
      Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
      
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      96b9b946
    • F
      Btrfs: fix inode cache block reserve leak on failure to allocate data space · 692aa7d5
      Filipe Manana 提交于
      [ Upstream commit 29d47d00e0ae61668ee0c5d90bef2893c8abbafa ]
      
      If we failed to allocate the data extent(s) for the inode space cache, we
      were bailing out without releasing the previously reserved metadata. This
      was triggering the following warnings when unmounting a filesystem:
      
        $ cat -n fs/btrfs/inode.c
        (...)
        9268  void btrfs_destroy_inode(struct inode *inode)
        9269  {
        (...)
        9276          WARN_ON(BTRFS_I(inode)->block_rsv.reserved);
        9277          WARN_ON(BTRFS_I(inode)->block_rsv.size);
        (...)
        9281          WARN_ON(BTRFS_I(inode)->csum_bytes);
        9282          WARN_ON(BTRFS_I(inode)->defrag_bytes);
        (...)
      
      Several fstests test cases triggered this often, such as generic/083,
      generic/102, generic/172, generic/269 and generic/300 at least, producing
      stack traces like the following in dmesg/syslog:
      
        [82039.079546] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9276 btrfs_destroy_inode+0x203/0x270 [btrfs]
        (...)
        [82039.081543] CPU: 2 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.081912] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.082673] RIP: 0010:btrfs_destroy_inode+0x203/0x270 [btrfs]
        (...)
        [82039.083913] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010206
        [82039.084320] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002
        [82039.084736] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8dde29b34660
        [82039.085156] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.085578] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0
        [82039.086000] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000
        [82039.086416] FS:  00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000
        [82039.086837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.087253] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0
        [82039.087672] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.088089] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.088504] Call Trace:
        [82039.088918]  destroy_inode+0x3b/0x70
        [82039.089340]  btrfs_free_fs_root+0x16/0xa0 [btrfs]
        [82039.089768]  btrfs_free_fs_roots+0xd8/0x160 [btrfs]
        [82039.090183]  ? wait_for_completion+0x65/0x1a0
        [82039.090607]  close_ctree+0x172/0x370 [btrfs]
        [82039.091021]  generic_shutdown_super+0x6c/0x110
        [82039.091427]  kill_anon_super+0xe/0x30
        [82039.091832]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.092233]  deactivate_locked_super+0x3a/0x70
        [82039.092636]  cleanup_mnt+0x3b/0x80
        [82039.093039]  task_work_run+0x93/0xc0
        [82039.093457]  exit_to_usermode_loop+0xfa/0x100
        [82039.093856]  do_syscall_64+0x162/0x1d0
        [82039.094244]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.094634] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.095876] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.096290] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.096700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.097110] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.097522] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.097937] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.098350] irq event stamp: 0
        [82039.098750] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.099150] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.099545] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.099925] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.100292] ---[ end trace f2521afa616ddccc ]---
        [82039.100707] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9277 btrfs_destroy_inode+0x1ac/0x270 [btrfs]
        (...)
        [82039.103050] CPU: 2 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.103428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.104203] RIP: 0010:btrfs_destroy_inode+0x1ac/0x270 [btrfs]
        (...)
        [82039.105461] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010206
        [82039.105866] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002
        [82039.106270] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8dde29b34660
        [82039.106673] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.107078] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0
        [82039.107487] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000
        [82039.107894] FS:  00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000
        [82039.108309] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.108723] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0
        [82039.109146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.109567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.109989] Call Trace:
        [82039.110405]  destroy_inode+0x3b/0x70
        [82039.110830]  btrfs_free_fs_root+0x16/0xa0 [btrfs]
        [82039.111257]  btrfs_free_fs_roots+0xd8/0x160 [btrfs]
        [82039.111675]  ? wait_for_completion+0x65/0x1a0
        [82039.112101]  close_ctree+0x172/0x370 [btrfs]
        [82039.112519]  generic_shutdown_super+0x6c/0x110
        [82039.112988]  kill_anon_super+0xe/0x30
        [82039.113439]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.113861]  deactivate_locked_super+0x3a/0x70
        [82039.114278]  cleanup_mnt+0x3b/0x80
        [82039.114685]  task_work_run+0x93/0xc0
        [82039.115083]  exit_to_usermode_loop+0xfa/0x100
        [82039.115476]  do_syscall_64+0x162/0x1d0
        [82039.115863]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.116254] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.117463] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.117882] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.118330] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.118743] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.119159] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.119574] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.119987] irq event stamp: 0
        [82039.120387] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.120787] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.121182] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.121563] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.121933] ---[ end trace f2521afa616ddccd ]---
        [82039.122353] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9278 btrfs_destroy_inode+0x1bc/0x270 [btrfs]
        (...)
        [82039.124606] CPU: 2 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.125008] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.125801] RIP: 0010:btrfs_destroy_inode+0x1bc/0x270 [btrfs]
        (...)
        [82039.126998] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010202
        [82039.127399] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002
        [82039.127803] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8dde29b34660
        [82039.128206] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.128611] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0
        [82039.129020] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000
        [82039.129428] FS:  00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000
        [82039.129846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.130261] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0
        [82039.130684] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.131142] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.131561] Call Trace:
        [82039.131990]  destroy_inode+0x3b/0x70
        [82039.132417]  btrfs_free_fs_root+0x16/0xa0 [btrfs]
        [82039.132844]  btrfs_free_fs_roots+0xd8/0x160 [btrfs]
        [82039.133262]  ? wait_for_completion+0x65/0x1a0
        [82039.133688]  close_ctree+0x172/0x370 [btrfs]
        [82039.134157]  generic_shutdown_super+0x6c/0x110
        [82039.134575]  kill_anon_super+0xe/0x30
        [82039.134997]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.135415]  deactivate_locked_super+0x3a/0x70
        [82039.135832]  cleanup_mnt+0x3b/0x80
        [82039.136239]  task_work_run+0x93/0xc0
        [82039.136637]  exit_to_usermode_loop+0xfa/0x100
        [82039.137029]  do_syscall_64+0x162/0x1d0
        [82039.137418]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.137812] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.139059] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.139475] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.139890] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.140302] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.140719] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.141138] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.141597] irq event stamp: 0
        [82039.142043] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.142443] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.142839] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.143220] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.143588] ---[ end trace f2521afa616ddcce ]---
        [82039.167472] WARNING: CPU: 3 PID: 13167 at fs/btrfs/extent-tree.c:10120 btrfs_free_block_groups+0x30d/0x460 [btrfs]
        (...)
        [82039.173800] CPU: 3 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.174847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.177031] RIP: 0010:btrfs_free_block_groups+0x30d/0x460 [btrfs]
        (...)
        [82039.180397] RSP: 0018:ffffac0b426a7dd8 EFLAGS: 00010206
        [82039.181574] RAX: ffff8de010a1db40 RBX: ffff8de010a1db40 RCX: 0000000000170014
        [82039.182711] RDX: ffff8ddff4380040 RSI: ffff8de010a1da58 RDI: 0000000000000246
        [82039.183817] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.184925] R10: ffff8de036404380 R11: ffffffffb8a5ea00 R12: ffff8de010a1b2b8
        [82039.186090] R13: ffff8de010a1b2b8 R14: 0000000000000000 R15: dead000000000100
        [82039.187208] FS:  00007f8db96d12c0(0000) GS:ffff8de036b80000(0000) knlGS:0000000000000000
        [82039.188345] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.189481] CR2: 00007fb044005170 CR3: 00000002315cc006 CR4: 00000000003606e0
        [82039.190674] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.191829] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.192978] Call Trace:
        [82039.194160]  close_ctree+0x19a/0x370 [btrfs]
        [82039.195315]  generic_shutdown_super+0x6c/0x110
        [82039.196486]  kill_anon_super+0xe/0x30
        [82039.197645]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.198696]  deactivate_locked_super+0x3a/0x70
        [82039.199619]  cleanup_mnt+0x3b/0x80
        [82039.200559]  task_work_run+0x93/0xc0
        [82039.201505]  exit_to_usermode_loop+0xfa/0x100
        [82039.202436]  do_syscall_64+0x162/0x1d0
        [82039.203339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.204091] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.206360] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.207132] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.207906] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.208621] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.209285] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.209984] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.210642] irq event stamp: 0
        [82039.211306] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.211971] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.212643] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.213304] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.213875] ---[ end trace f2521afa616ddccf ]---
      
      Fix this by releasing the reserved metadata on failure to allocate data
      extent(s) for the inode cache.
      
      Fixes: 69fe2d75 ("btrfs: make the delalloc block rsv per inode")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      692aa7d5
  5. 29 10月, 2019 3 次提交
  6. 18 10月, 2019 2 次提交
    • J
      btrfs: fix uninitialized ret in ref-verify · e0805d7f
      Josef Bacik 提交于
      commit c5f4987e86f6692fdb12533ea1fc7a7bb98e555a upstream.
      
      Coverity caught a case where we could return with a uninitialized value
      in ret in process_leaf.  This is actually pretty likely because we could
      very easily run into a block group item key and have a garbage value in
      ret and think there was an errror.  Fix this by initializing ret to 0.
      Reported-by: NColin Ian King <colin.king@canonical.com>
      Fixes: fd708b81 ("Btrfs: add a extent ref verify tool")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e0805d7f
    • J
      btrfs: fix incorrect updating of log root tree · f7313de4
      Josef Bacik 提交于
      commit 4203e968947071586a98b5314fd7ffdea3b4f971 upstream.
      
      We've historically had reports of being unable to mount file systems
      because the tree log root couldn't be read.  Usually this is the "parent
      transid failure", but could be any of the related errors, including
      "fsid mismatch" or "bad tree block", depending on which block got
      allocated.
      
      The modification of the individual log root items are serialized on the
      per-log root root_mutex.  This means that any modification to the
      per-subvol log root_item is completely protected.
      
      However we update the root item in the log root tree outside of the log
      root tree log_mutex.  We do this in order to allow multiple subvolumes
      to be updated in each log transaction.
      
      This is problematic however because when we are writing the log root
      tree out we update the super block with the _current_ log root node
      information.  Since these two operations happen independently of each
      other, you can end up updating the log root tree in between writing out
      the dirty blocks and setting the super block to point at the current
      root.
      
      This means we'll point at the new root node that hasn't been written
      out, instead of the one we should be pointing at.  Thus whatever garbage
      or old block we end up pointing at complains when we mount the file
      system later and try to replay the log.
      
      Fix this by copying the log's root item into a local root item copy.
      Then once we're safely under the log_root_tree->log_mutex we update the
      root item in the log_root_tree.  This way we do not modify the
      log_root_tree while we're committing it, fixing the problem.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NChris Mason <clm@fb.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7313de4
  7. 05 10月, 2019 7 次提交
    • F
      Btrfs: fix race setting up and completing qgroup rescan workers · bacff03b
      Filipe Manana 提交于
      commit 13fc1d271a2e3ab8a02071e711add01fab9271f6 upstream.
      
      There is a race between setting up a qgroup rescan worker and completing
      a qgroup rescan worker that can lead to callers of the qgroup rescan wait
      ioctl to either not wait for the rescan worker to complete or to hang
      forever due to missing wake ups. The following diagram shows a sequence
      of steps that illustrates the race.
      
              CPU 1                                                         CPU 2                                  CPU 3
      
       btrfs_ioctl_quota_rescan()
        btrfs_qgroup_rescan()
         qgroup_rescan_init()
          mutex_lock(&fs_info->qgroup_rescan_lock)
          spin_lock(&fs_info->qgroup_lock)
      
          fs_info->qgroup_flags |=
            BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
          init_completion(
            &fs_info->qgroup_rescan_completion)
      
          fs_info->qgroup_rescan_running = true
      
          mutex_unlock(&fs_info->qgroup_rescan_lock)
          spin_unlock(&fs_info->qgroup_lock)
      
          btrfs_init_work()
           --> starts the worker
      
                                                              btrfs_qgroup_rescan_worker()
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_flags &=
                                                                 ~BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               starts transaction, updates qgroup status
                                                               item, etc
      
                                                                                                                 btrfs_ioctl_quota_rescan()
                                                                                                                  btrfs_qgroup_rescan()
                                                                                                                   qgroup_rescan_init()
                                                                                                                    mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_lock(&fs_info->qgroup_lock)
      
                                                                                                                    fs_info->qgroup_flags |=
                                                                                                                      BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                                                                                    init_completion(
                                                                                                                      &fs_info->qgroup_rescan_completion)
      
                                                                                                                    fs_info->qgroup_rescan_running = true
      
                                                                                                                    mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_unlock(&fs_info->qgroup_lock)
      
                                                                                                                    btrfs_init_work()
                                                                                                                     --> starts another worker
      
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_rescan_running = false
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
      							 complete_all(&fs_info->qgroup_rescan_completion)
      
      Before the rescan worker started by the task at CPU 3 completes, if
      another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
      because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
      fs_info->qgroup_flags, which is expected and correct behaviour.
      
      However if other task calls btrfs_ioctl_quota_rescan_wait() before the
      rescan worker started by the task at CPU 3 completes, it will return
      immediately without waiting for the new rescan worker to complete,
      because fs_info->qgroup_rescan_running is set to false by CPU 2.
      
      This race is making test case btrfs/171 (from fstests) to fail often:
      
        btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
      #      --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
      #      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
      #      @@ -1,2 +1,3 @@
      #       QA output created by 171
      #      +ERROR: quota rescan failed: Operation now in progress
      #       Silence is golden
      #      ...
      #      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)
      
      That is because the test calls the btrfs-progs commands "qgroup quota
      rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
      calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
      commands 'qgroup assign' and 'qgroup remove' often call the rescan start
      ioctl after calling the qgroup assign ioctl,
      btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
      for a rescan worker to complete.
      
      Another problem the race can cause is missing wake ups for waiters,
      since the call to complete_all() happens outside a critical section and
      after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
      diagram above, if we have a waiter for the first rescan task (executed
      by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
      if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
      before it calls complete_all() against
      fs_info->qgroup_rescan_completion, the task at CPU 3 calls
      init_completion() against fs_info->qgroup_rescan_completion which
      re-initilizes its wait queue to an empty queue, therefore causing the
      rescan worker at CPU 2 to call complete_all() against an empty queue,
      never waking up the task waiting for that rescan worker.
      
      Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
      fs_info->qgroup_rescan_running to false in the same critical section,
      delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
      call to complete_all() in that same critical section. This gives the
      protection needed to avoid rescan wait ioctl callers not waiting for a
      running rescan worker and the lost wake ups problem, since setting that
      rescan flag and boolean as well as initializing the wait queue is done
      already in a critical section delimited by that mutex (at
      qgroup_rescan_init()).
      
      Fixes: 57254b6e ("Btrfs: add ioctl to wait for qgroup rescan completion")
      Fixes: d2c609b8 ("btrfs: properly track when rescan worker is running")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bacff03b
    • Q
      btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls · b5c42ef0
      Qu Wenruo 提交于
      commit d4e204948fe3e0dc8e1fbf3f8f3290c9c2823be3 upstream.
      
      [BUG]
      The following script can cause btrfs qgroup data space leak:
      
        mkfs.btrfs -f $dev
        mount $dev -o nospace_cache $mnt
      
        btrfs subv create $mnt/subv
        btrfs quota en $mnt
        btrfs quota rescan -w $mnt
        btrfs qgroup limit 128m $mnt/subv
      
        for (( i = 0; i < 3; i++)); do
                # Create 3 64M holes for latter fallocate to fail
                truncate -s 192m $mnt/subv/file
                xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
                xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
                sync
      
                # it's supposed to fail, and each failure will leak at least 64M
                # data space
                xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
                rm $mnt/subv/file
                sync
        done
      
        # Shouldn't fail after we removed the file
        xfs_io -f -c "falloc 0 64m" $mnt/subv/file
      
      [CAUSE]
      Btrfs qgroup data reserve code allow multiple reservations to happen on
      a single extent_changeset:
      E.g:
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
      
      Btrfs qgroup code has its internal tracking to make sure we don't
      double-reserve in above example.
      
      The only pattern utilizing this feature is in the main while loop of
      btrfs_fallocate() function.
      
      However btrfs_qgroup_reserve_data()'s error handling has a bug in that
      on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
      flag but doesn't free previously reserved bytes.
      
      This bug has a two fold effect:
      - Clearing EXTENT_QGROUP_RESERVED ranges
        This is the correct behavior, but it prevents
        btrfs_qgroup_check_reserved_leak() to catch the leakage as the
        detector is purely EXTENT_QGROUP_RESERVED flag based.
      
      - Leak the previously reserved data bytes.
      
      The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
      the last one fails, leaking space reserved in the previous ones.
      
      [FIX]
      Also free previously reserved data bytes when btrfs_qgroup_reserve_data
      fails.
      
      Fixes: 52472553 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b5c42ef0
    • Q
      btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space · c521bfa8
      Qu Wenruo 提交于
      commit bab32fc069ce8829c416e8737c119f62a57970f9 upstream.
      
      [BUG]
      Under the following case with qgroup enabled, if some error happened
      after we have reserved delalloc space, then in error handling path, we
      could cause qgroup data space leakage:
      
      From btrfs_truncate_block() in inode.c:
      
      	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
      					   block_start, blocksize);
      	if (ret)
      		goto out;
      
       again:
      	page = find_or_create_page(mapping, index, mask);
      	if (!page) {
      		btrfs_delalloc_release_space(inode, data_reserved,
      					     block_start, blocksize, true);
      		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
      		ret = -ENOMEM;
      		goto out;
      	}
      
      [CAUSE]
      In the above case, btrfs_delalloc_reserve_space() will call
      btrfs_qgroup_reserve_data() and mark the io_tree range with
      EXTENT_QGROUP_RESERVED flag.
      
      In the error handling path, we have the following call stack:
      btrfs_delalloc_release_space()
      |- btrfs_free_reserved_data_space()
         |- btrsf_qgroup_free_data()
            |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
               |- qgroup_free_reserved_data(reserved=@reserved)
                  |- clear_record_extent_bits();
                  |- freed += changeset.bytes_changed;
      
      However due to a completion bug, qgroup_free_reserved_data() will clear
      EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
      than the correct BTRFS_I(inode)->io_tree.
      Since io_failure_tree is never marked with that flag,
      btrfs_qgroup_free_data() will not free any data reserved space at all,
      causing a leakage.
      
      This type of error handling can only be triggered by errors outside of
      qgroup code. So EDQUOT error from qgroup can't trigger it.
      
      [FIX]
      Fix the wrong target io_tree.
      Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Fixes: bc42bda2 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c521bfa8
    • N
      btrfs: Relinquish CPUs in btrfs_compare_trees · 067f82a0
      Nikolay Borisov 提交于
      commit 6af112b11a4bc1b560f60a618ac9c1dcefe9836e upstream.
      
      When doing any form of incremental send the parent and the child trees
      need to be compared via btrfs_compare_trees. This  can result in long
      loop chains without ever relinquishing the CPU. This causes softlockup
      detector to trigger when comparing trees with a lot of items. Example
      report:
      
      watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
      CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      pstate: 40000005 (nZcv daif -PAN -UAO)
      pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
      lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
      sp : ffff00001273b7e0
      Call trace:
       __ll_sc_arch_atomic_sub_return+0x14/0x20
       release_extent_buffer+0xdc/0x120 [btrfs]
       free_extent_buffer.part.0+0xb0/0x118 [btrfs]
       free_extent_buffer+0x24/0x30 [btrfs]
       btrfs_release_path+0x4c/0xa0 [btrfs]
       btrfs_free_path.part.0+0x20/0x40 [btrfs]
       btrfs_free_path+0x24/0x30 [btrfs]
       get_inode_info+0xa8/0xf8 [btrfs]
       finish_inode_if_needed+0xe0/0x6d8 [btrfs]
       changed_cb+0x9c/0x410 [btrfs]
       btrfs_compare_trees+0x284/0x648 [btrfs]
       send_subvol+0x33c/0x520 [btrfs]
       btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
       btrfs_ioctl+0x199c/0x2288 [btrfs]
       do_vfs_ioctl+0x4b0/0x820
       ksys_ioctl+0x84/0xb8
       __arm64_sys_ioctl+0x28/0x38
       el0_svc_common.constprop.0+0x7c/0x188
       el0_svc_handler+0x34/0x90
       el0_svc+0x8/0xc
      
      Fix this by adding a call to cond_resched at the beginning of the main
      loop in btrfs_compare_trees.
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      067f82a0
    • F
      Btrfs: fix use-after-free when using the tree modification log · b08344be
      Filipe Manana 提交于
      commit efad8a853ad2057f96664328a0d327a05ce39c76 upstream.
      
      At ctree.c:get_old_root(), we are accessing a root's header owner field
      after we have freed the respective extent buffer. This results in an
      use-after-free that can lead to crashes, and when CONFIG_DEBUG_PAGEALLOC
      is set, results in a stack trace like the following:
      
        [ 3876.799331] stack segment: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [ 3876.799363] CPU: 0 PID: 15436 Comm: pool Not tainted 5.3.0-rc3-btrfs-next-54 #1
        [ 3876.799385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [ 3876.799433] RIP: 0010:btrfs_search_old_slot+0x652/0xd80 [btrfs]
        (...)
        [ 3876.799502] RSP: 0018:ffff9f08c1a2f9f0 EFLAGS: 00010286
        [ 3876.799518] RAX: ffff8dd300000000 RBX: ffff8dd85a7a9348 RCX: 000000038da26000
        [ 3876.799538] RDX: 0000000000000000 RSI: ffffe522ce368980 RDI: 0000000000000246
        [ 3876.799559] RBP: dae1922adadad000 R08: 0000000008020000 R09: ffffe522c0000000
        [ 3876.799579] R10: ffff8dd57fd788c8 R11: 000000007511b030 R12: ffff8dd781ddc000
        [ 3876.799599] R13: ffff8dd9e6240578 R14: ffff8dd6896f7a88 R15: ffff8dd688cf90b8
        [ 3876.799620] FS:  00007f23ddd97700(0000) GS:ffff8dda20200000(0000) knlGS:0000000000000000
        [ 3876.799643] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 3876.799660] CR2: 00007f23d4024000 CR3: 0000000710bb0005 CR4: 00000000003606f0
        [ 3876.799682] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [ 3876.799703] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [ 3876.799723] Call Trace:
        [ 3876.799735]  ? do_raw_spin_unlock+0x49/0xc0
        [ 3876.799749]  ? _raw_spin_unlock+0x24/0x30
        [ 3876.799779]  resolve_indirect_refs+0x1eb/0xc80 [btrfs]
        [ 3876.799810]  find_parent_nodes+0x38d/0x1180 [btrfs]
        [ 3876.799841]  btrfs_check_shared+0x11a/0x1d0 [btrfs]
        [ 3876.799870]  ? extent_fiemap+0x598/0x6e0 [btrfs]
        [ 3876.799895]  extent_fiemap+0x598/0x6e0 [btrfs]
        [ 3876.799913]  do_vfs_ioctl+0x45a/0x700
        [ 3876.799926]  ksys_ioctl+0x70/0x80
        [ 3876.799938]  ? trace_hardirqs_off_thunk+0x1a/0x20
        [ 3876.799953]  __x64_sys_ioctl+0x16/0x20
        [ 3876.799965]  do_syscall_64+0x62/0x220
        [ 3876.799977]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 3876.799993] RIP: 0033:0x7f23e0013dd7
        (...)
        [ 3876.800056] RSP: 002b:00007f23ddd96ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [ 3876.800078] RAX: ffffffffffffffda RBX: 00007f23d80210f8 RCX: 00007f23e0013dd7
        [ 3876.800099] RDX: 00007f23d80210f8 RSI: 00000000c020660b RDI: 0000000000000003
        [ 3876.800626] RBP: 000055fa2a2a2440 R08: 0000000000000000 R09: 00007f23ddd96d7c
        [ 3876.801143] R10: 00007f23d8022000 R11: 0000000000000246 R12: 00007f23ddd96d80
        [ 3876.801662] R13: 00007f23ddd96d78 R14: 00007f23d80210f0 R15: 00007f23ddd96d80
        (...)
        [ 3876.805107] ---[ end trace e53161e179ef04f9 ]---
      
      Fix that by saving the root's header owner field into a local variable
      before freeing the root's extent buffer, and then use that local variable
      when needed.
      
      Fixes: 30b0463a ("Btrfs: fix accessing the root pointer in tree mod log functions")
      CC: stable@vger.kernel.org # 3.10+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b08344be
    • C
      btrfs: fix allocation of free space cache v1 bitmap pages · 4874c6fe
      Christophe Leroy 提交于
      commit 3acd48507dc43eeeb0a1fe965b8bad91cab904a7 upstream.
      
      Various notifications of type "BUG kmalloc-4096 () : Redzone
      overwritten" have been observed recently in various parts of the kernel.
      After some time, it has been made a relation with the use of BTRFS
      filesystem and with SLUB_DEBUG turned on.
      
      [   22.809700] BUG kmalloc-4096 (Tainted: G        W        ): Redzone overwritten
      
      [   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
      [   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
      [   22.811193] 	__slab_alloc.constprop.26+0x44/0x70
      [   22.811345] 	kmem_cache_alloc_trace+0xf0/0x2ec
      [   22.811588] 	__load_free_space_cache+0x588/0x780 [btrfs]
      [   22.811848] 	load_free_space_cache+0xf4/0x1b0 [btrfs]
      [   22.812090] 	cache_block_group+0x1d0/0x3d0 [btrfs]
      [   22.812321] 	find_free_extent+0x680/0x12a4 [btrfs]
      [   22.812549] 	btrfs_reserve_extent+0xec/0x220 [btrfs]
      [   22.812785] 	btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
      [   22.813032] 	__btrfs_cow_block+0x150/0x5d4 [btrfs]
      [   22.813262] 	btrfs_cow_block+0x194/0x298 [btrfs]
      [   22.813484] 	commit_cowonly_roots+0x44/0x294 [btrfs]
      [   22.813718] 	btrfs_commit_transaction+0x63c/0xc0c [btrfs]
      [   22.813973] 	close_ctree+0xf8/0x2a4 [btrfs]
      [   22.814107] 	generic_shutdown_super+0x80/0x110
      [   22.814250] 	kill_anon_super+0x18/0x30
      [   22.814437] 	btrfs_kill_super+0x18/0x90 [btrfs]
      [   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
      [   22.814841] 	proc_cgroup_show+0xc0/0x248
      [   22.814967] 	proc_single_show+0x54/0x98
      [   22.815086] 	seq_read+0x278/0x45c
      [   22.815190] 	__vfs_read+0x28/0x17c
      [   22.815289] 	vfs_read+0xa8/0x14c
      [   22.815381] 	ksys_read+0x50/0x94
      [   22.815475] 	ret_from_syscall+0x0/0x38
      
      Commit 69d24804 ("btrfs: use copy_page for copying pages instead of
      memcpy") changed the way bitmap blocks are copied. But allthough bitmaps
      have the size of a page, they were allocated with kzalloc().
      
      Most of the time, kzalloc() allocates aligned blocks of memory, so
      copy_page() can be used. But when some debug options like SLAB_DEBUG are
      activated, kzalloc() may return unaligned pointer.
      
      On powerpc, memcpy(), copy_page() and other copying functions use
      'dcbz' instruction which provides an entire zeroed cacheline to avoid
      memory read when the intention is to overwrite a full line. Functions
      like memcpy() are writen to care about partial cachelines at the start
      and end of the destination, but copy_page() assumes it gets pages. As
      pages are naturally cache aligned, copy_page() doesn't care about
      partial lines. This means that when copy_page() is called with a
      misaligned pointer, a few leading bytes are zeroed.
      
      To fix it, allocate bitmaps through kmem_cache instead of using kzalloc()
      The cache pool is created with PAGE_SIZE alignment constraint.
      Reported-by: NErhard F. <erhard_f@mailbox.org>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
      Fixes: 69d24804 ("btrfs: use copy_page for copying pages instead of memcpy")
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_free_space_bitmap ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4874c6fe
    • Q
      btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type · c5dbd74f
      Qu Wenruo 提交于
      [ Upstream commit 2a28468e525f3924efed7f29f2bc5a2926e7e19a ]
      
      [BUG]
      With fuzzed image and MIXED_GROUPS super flag, we can hit the following
      BUG_ON():
      
        kernel BUG at fs/btrfs/delayed-ref.c:491!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 0 PID: 1849 Comm: sync Tainted: G           O      5.2.0-custom #27
        RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs]
        Call Trace:
         add_delayed_ref_head+0x20c/0x2d0 [btrfs]
         btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs]
         btrfs_free_tree_block+0x123/0x380 [btrfs]
         __btrfs_cow_block+0x435/0x500 [btrfs]
         btrfs_cow_block+0x110/0x240 [btrfs]
         btrfs_search_slot+0x230/0xa00 [btrfs]
         ? __lock_acquire+0x105e/0x1e20
         btrfs_insert_empty_items+0x67/0xc0 [btrfs]
         alloc_reserved_file_extent+0x9e/0x340 [btrfs]
         __btrfs_run_delayed_refs+0x78e/0x1240 [btrfs]
         ? kvm_clock_read+0x18/0x30
         ? __sched_clock_gtod_offset+0x21/0x50
         btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs]
         btrfs_run_delayed_refs+0x23/0x30 [btrfs]
         btrfs_commit_transaction+0x53/0x9f0 [btrfs]
         btrfs_sync_fs+0x7c/0x1c0 [btrfs]
         ? __ia32_sys_fdatasync+0x20/0x20
         sync_fs_one_sb+0x23/0x30
         iterate_supers+0x95/0x100
         ksys_sync+0x62/0xb0
         __ia32_sys_sync+0xe/0x20
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      This situation is caused by several factors:
      - Fuzzed image
        The extent tree of this fs missed one backref for extent tree root.
        So we can allocated space from that slot.
      
      - MIXED_BG feature
        Super block has MIXED_BG flag.
      
      - No mixed block groups exists
        All block groups are just regular ones.
      
      This makes data space_info->block_groups[] contains metadata block
      groups.  And when we reserve space for data, we can use space in
      metadata block group.
      
      Then we hit the following file operations:
      
      - fallocate
        We need to allocate data extents.
        find_free_extent() choose to use the metadata block to allocate space
        from, and choose the space of extent tree root, since its backref is
        missing.
      
        This generate one delayed ref head with is_data = 1.
      
      - extent tree update
        We need to update extent tree at run_delayed_ref time.
      
        This generate one delayed ref head with is_data = 0, for the same
        bytenr of old extent tree root.
      
      Then we trigger the BUG_ON().
      
      [FIX]
      The quick fix here is to check block_group->flags before using it.
      
      The problem can only happen for MIXED_GROUPS fs. Regular filesystems
      won't have space_info with DATA|METADATA flag, and no way to hit the
      bug.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255Reported-by: NJungyeon Yoon <jungyeon.yoon@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c5dbd74f
  8. 19 9月, 2019 1 次提交
    • F
      Btrfs: fix assertion failure during fsync and use of stale transaction · 7cbd49cf
      Filipe Manana 提交于
      commit 410f954cb1d1c79ae485dd83a175f21954fd87cd upstream.
      
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If after the eviction/iput completes the inode
         logging path hits an error or it decides that it must fallback to a
         transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction can be unpredictable, we are at
         least having a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value, or we can end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meanwhile was allocated to
         another task for a new transaction, etc, pretty much unpredictable
         what can happen.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode, that step is offloaded to a safe context
      (usually the cleaner kthread).
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running, which lead to fsync failing while logging inodes with -EIO
      errors and then trying later to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7cbd49cf
  9. 16 9月, 2019 12 次提交
    • J
      btrfs: correctly validate compression type · 1c13c9c4
      Johannes Thumshirn 提交于
      [ Upstream commit aa53e3bfac7205fb3a8815ac1c937fd6ed01b41e ]
      
      Nikolay reported the following KASAN splat when running btrfs/048:
      
      [ 1843.470920] ==================================================================
      [ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0
      [ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979
      
      [ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536
      [ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [ 1843.476322] Call Trace:
      [ 1843.476674]  dump_stack+0x7c/0xbb
      [ 1843.477132]  ? strncmp+0x66/0xb0
      [ 1843.477587]  print_address_description+0x114/0x320
      [ 1843.478256]  ? strncmp+0x66/0xb0
      [ 1843.478740]  ? strncmp+0x66/0xb0
      [ 1843.479185]  __kasan_report+0x14e/0x192
      [ 1843.479759]  ? strncmp+0x66/0xb0
      [ 1843.480209]  kasan_report+0xe/0x20
      [ 1843.480679]  strncmp+0x66/0xb0
      [ 1843.481105]  prop_compression_validate+0x24/0x70
      [ 1843.481798]  btrfs_xattr_handler_set_prop+0x65/0x160
      [ 1843.482509]  __vfs_setxattr+0x71/0x90
      [ 1843.483012]  __vfs_setxattr_noperm+0x84/0x130
      [ 1843.483606]  vfs_setxattr+0xac/0xb0
      [ 1843.484085]  setxattr+0x18c/0x230
      [ 1843.484546]  ? vfs_setxattr+0xb0/0xb0
      [ 1843.485048]  ? __mod_node_page_state+0x1f/0xa0
      [ 1843.485672]  ? _raw_spin_unlock+0x24/0x40
      [ 1843.486233]  ? __handle_mm_fault+0x988/0x1290
      [ 1843.486823]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487330]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487842]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.488442]  ? debug_lockdep_rcu_enabled+0x22/0x40
      [ 1843.489089]  ? rcu_sync_lockdep_assert+0xe/0x70
      [ 1843.489707]  ? __sb_start_write+0x158/0x200
      [ 1843.490278]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.490855]  ? __mnt_want_write+0x98/0xe0
      [ 1843.491397]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.492201]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 1843.493201]  do_syscall_64+0x6c/0x230
      [ 1843.493988]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1843.495041] RIP: 0033:0x7fa7a8a7707a
      [ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48
      [ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be
      [ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a
      [ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003
      [ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000
      [ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91
      [ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff
      
      [ 1843.505268] Allocated by task 3979:
      [ 1843.505771]  save_stack+0x19/0x80
      [ 1843.506211]  __kasan_kmalloc.constprop.5+0xa0/0xd0
      [ 1843.506836]  setxattr+0xeb/0x230
      [ 1843.507264]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.507886]  do_syscall_64+0x6c/0x230
      [ 1843.508429]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [ 1843.509558] Freed by task 0:
      [ 1843.510188] (stack is not available)
      
      [ 1843.511309] The buggy address belongs to the object at ffff888111e369e0
                      which belongs to the cache kmalloc-8 of size 8
      [ 1843.514095] The buggy address is located 2 bytes inside of
                      8-byte region [ffff888111e369e0, ffff888111e369e8)
      [ 1843.516524] The buggy address belongs to the page:
      [ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0
      [ 1843.519993] flags: 0x4404000010200(slab|head)
      [ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300
      [ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000
      [ 1843.524281] page dumped because: kasan: bad access detected
      
      [ 1843.525936] Memory state around the buggy address:
      [ 1843.526975]  ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.528479]  ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc
      [ 1843.531877]                                                        ^
      [ 1843.533287]  ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.534874]  ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.536468] ==================================================================
      
      This is caused by supplying a too short compression value ('lz') in the
      test-case and comparing it to 'lzo' with strncmp() and a length of 3.
      strncmp() read past the 'lz' when looking for the 'o' and thus caused an
      out-of-bounds read.
      
      Introduce a new check 'btrfs_compress_is_valid_type()' which not only
      checks the user-supplied value against known compression types, but also
      employs checks for too short values.
      Reported-by: NNikolay Borisov <nborisov@suse.com>
      Fixes: 272e5326c783 ("btrfs: prop: fix vanished compression property after failed set")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1c13c9c4
    • F
      Btrfs: fix race between block group removal and block group allocation · 1d064876
      Filipe Manana 提交于
      [ Upstream commit 8eaf40c0e24e98899a0f3ac9d25a33aafe13822a ]
      
      If a task is removing the block group that currently has the highest start
      offset amongst all existing block groups, there is a short time window
      where it races with a concurrent block group allocation, resulting in a
      transaction abort with an error code of EEXIST.
      
      The following diagram explains the race in detail:
      
            Task A                                                        Task B
      
       btrfs_remove_block_group(bg offset X)
      
         remove_extent_mapping(em offset X)
           -> removes extent map X from the
              tree of extent maps
              (fs_info->mapping_tree), so the
              next call to find_next_chunk()
              will return offset X
      
                                                         btrfs_alloc_chunk()
                                                           find_next_chunk()
                                                             --> returns offset X
      
                                                           __btrfs_alloc_chunk(offset X)
                                                             btrfs_make_block_group()
                                                               btrfs_create_block_group_cache()
                                                                 --> creates btrfs_block_group_cache
                                                                     object with a key corresponding
                                                                     to the block group item in the
                                                                     extent, the key is:
                                                                     (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
                                                               --> adds the btrfs_block_group_cache object
                                                                   to the list new_bgs of the transaction
                                                                   handle
      
                                                         btrfs_end_transaction(trans handle)
                                                           __btrfs_end_transaction()
                                                             btrfs_create_pending_block_groups()
                                                               --> sees the new btrfs_block_group_cache
                                                                   in the new_bgs list of the transaction
                                                                   handle
                                                               --> its call to btrfs_insert_item() fails
                                                                   with -EEXIST when attempting to insert
                                                                   the block group item key
                                                                   (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                                   because task A has not removed that key yet
                                                               --> aborts the running transaction with
                                                                   error -EEXIST
      
         btrfs_del_item()
           -> removes the block group's key from
              the extent tree, key is
              (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
      A sample transaction abort trace:
      
        [78912.403537] ------------[ cut here ]------------
        [78912.403811] BTRFS: Transaction aborted (error -17)
        [78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G        W         5.0.0-btrfs-next-46 #1
        [78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
        [78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
        [78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
        [78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
        [78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
        [78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
        [78912.409899] FS:  00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
        [78912.410285] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
        [78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [78912.411898] Call Trace:
        [78912.412318]  __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
        [78912.412746]  btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
        [78912.413179]  scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
        [78912.413622]  ? __mutex_unlock_slowpath+0x100/0x2a0
        [78912.414078]  btrfs_scrub_dev+0x2ef/0x720 [btrfs]
        [78912.414535]  ? __sb_start_write+0xd4/0x1c0
        [78912.414963]  ? mnt_want_write_file+0x24/0x50
        [78912.415403]  btrfs_ioctl+0x17fb/0x3120 [btrfs]
        [78912.415832]  ? lock_acquire+0xa6/0x190
        [78912.416256]  ? do_vfs_ioctl+0xa2/0x6f0
        [78912.416685]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [78912.417116]  do_vfs_ioctl+0xa2/0x6f0
        [78912.417534]  ? __fget+0x113/0x200
        [78912.417954]  ksys_ioctl+0x70/0x80
        [78912.418369]  __x64_sys_ioctl+0x16/0x20
        [78912.418812]  do_syscall_64+0x60/0x1b0
        [78912.419231]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [78912.419644] RIP: 0033:0x7f3880252dd7
        (...)
        [78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
        [78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
        [78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
        [78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
        [78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
        [78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---
      
      Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(),
      only at the very end, after removing the block group item key from the
      extent tree (and removing the free space tree entry if we are using the
      free space tree feature).
      
      Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1d064876
    • D
      btrfs: init csum_list before possible free · 476ecc14
      Dan Robertson 提交于
      [ Upstream commit e49be14b8d80e23bb7c53d78c21717a474ade76b ]
      
      The scrub_ctx csum_list member must be initialized before scrub_free_ctx
      is called. If the csum_list is not initialized beforehand, the
      list_empty call in scrub_free_csums will result in a null deref if the
      allocation fails in the for loop.
      
      Fixes: a2de733c ("btrfs: scrub")
      CC: stable@vger.kernel.org # 3.0+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDan Robertson <dan@dlrobertson.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      476ecc14
    • A
      btrfs: scrub: fix circular locking dependency warning · 936690bd
      Anand Jain 提交于
      [ Upstream commit 1cec3f27168d7835ff3d23ab371cd548440131bb ]
      
      This fixes a longstanding lockdep warning triggered by
      fstests/btrfs/011.
      
      Circular locking dependency check reports warning[1], that's because the
      btrfs_scrub_dev() calls the stack #0 below with, the fs_info::scrub_lock
      held. The test case leading to this warning:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /btrfs
        $ btrfs scrub start -B /btrfs
      
      In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy
      of the scrub workers are needed. So once we have incremented and decremented
      the fs_info::scrub_workers_refcnt value in the thread, its ok to drop the
      scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this
      patch drops the scrub_lock before calling btrfs_destroy_workqueue().
      
        [359.258534] ======================================================
        [359.260305] WARNING: possible circular locking dependency detected
        [359.261938] 5.0.0-rc6-default #461 Not tainted
        [359.263135] ------------------------------------------------------
        [359.264672] btrfs/20975 is trying to acquire lock:
        [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
        [359.268416]
        [359.268416] but task is already holding lock:
        [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.272418]
        [359.272418] which lock already depends on the new lock.
        [359.272418]
        [359.274692]
        [359.274692] the existing dependency chain (in reverse order) is:
        [359.276671]
        [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
        [359.278187]        __mutex_lock+0x86/0x9c0
        [359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
        [359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
        [359.281931]        close_ctree+0x30b/0x350 [btrfs]
        [359.283208]        generic_shutdown_super+0x64/0x100
        [359.284516]        kill_anon_super+0x14/0x30
        [359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
        [359.286964]        deactivate_locked_super+0x29/0x60
        [359.288242]        cleanup_mnt+0x3b/0x70
        [359.289310]        task_work_run+0x98/0xc0
        [359.290428]        exit_to_usermode_loop+0x83/0x90
        [359.291445]        do_syscall_64+0x15b/0x180
        [359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.294011]
        [359.294011] -> #2 (sb_internal#2){.+.+}:
        [359.295432]        __sb_start_write+0x113/0x1d0
        [359.296394]        start_transaction+0x369/0x500 [btrfs]
        [359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
        [359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
        [359.299698]        process_one_work+0x246/0x610
        [359.300898]        worker_thread+0x3c/0x390
        [359.302020]        kthread+0x116/0x130
        [359.303053]        ret_from_fork+0x24/0x30
        [359.304152]
        [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
        [359.306100]        process_one_work+0x21f/0x610
        [359.307302]        worker_thread+0x3c/0x390
        [359.308465]        kthread+0x116/0x130
        [359.309357]        ret_from_fork+0x24/0x30
        [359.310229]
        [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
        [359.311812]        lock_acquire+0x90/0x180
        [359.312929]        flush_workqueue+0xaa/0x540
        [359.313845]        drain_workqueue+0xa1/0x180
        [359.314761]        destroy_workqueue+0x17/0x240
        [359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
        [359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.322908]        do_vfs_ioctl+0xa2/0x6d0
        [359.324021]        ksys_ioctl+0x3a/0x70
        [359.325066]        __x64_sys_ioctl+0x16/0x20
        [359.326236]        do_syscall_64+0x54/0x180
        [359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.328772]
        [359.328772] other info that might help us debug this:
        [359.328772]
        [359.330990] Chain exists of:
        [359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
        [359.330990]
        [359.334376]  Possible unsafe locking scenario:
        [359.334376]
        [359.336020]        CPU0                    CPU1
        [359.337070]        ----                    ----
        [359.337821]   lock(&fs_info->scrub_lock);
        [359.338506]                                lock(sb_internal#2);
        [359.339506]                                lock(&fs_info->scrub_lock);
        [359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
        [359.342437]
        [359.342437]  *** DEADLOCK ***
        [359.342437]
        [359.343745] 1 lock held by btrfs/20975:
        [359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.346778]
        [359.346778] stack backtrace:
        [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
        [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [359.350501] Call Trace:
        [359.350931]  dump_stack+0x67/0x90
        [359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
        [359.353569]  check_prev_add.constprop.44+0x4f9/0x750
        [359.354849]  ? check_prev_add.constprop.44+0x286/0x750
        [359.356505]  __lock_acquire+0xb84/0xf10
        [359.357505]  lock_acquire+0x90/0x180
        [359.358271]  ? flush_workqueue+0x87/0x540
        [359.359098]  flush_workqueue+0xaa/0x540
        [359.359912]  ? flush_workqueue+0x87/0x540
        [359.360740]  ? drain_workqueue+0x1e/0x180
        [359.361565]  ? drain_workqueue+0xa1/0x180
        [359.362391]  drain_workqueue+0xa1/0x180
        [359.363193]  destroy_workqueue+0x17/0x240
        [359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
        [359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
        [359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.370186]  ? __lock_acquire+0x263/0xf10
        [359.370777]  ? kvm_clock_read+0x14/0x30
        [359.371392]  ? kvm_sched_clock_read+0x5/0x10
        [359.372248]  ? sched_clock+0x5/0x10
        [359.372786]  ? sched_clock_cpu+0xc/0xc0
        [359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
        [359.374552]  do_vfs_ioctl+0xa2/0x6d0
        [359.375378]  ? do_sigaction+0xff/0x250
        [359.376233]  ksys_ioctl+0x3a/0x70
        [359.376954]  __x64_sys_ioctl+0x16/0x20
        [359.377772]  do_syscall_64+0x54/0x180
        [359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.380422] RIP: 0033:0x7f5429296a97
      
      Backporting to older kernels: scrub_nocow_workers must be freed the same
      way as the others.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      [ update changelog ]
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      936690bd
    • D
      btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex · ff55333f
      David Sterba 提交于
      [ Upstream commit 0e94c4f45d14cf89d1f40c91b0a8517e791672a7 ]
      
      The scrub context is allocated with GFP_KERNEL and called from
      btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
      regarding reclaim that could try to flush filesystem data in order to
      get the memory. And the device_list_mutex is held during superblock
      commit, so this would cause a lockup.
      
      Move the alocation and initialization before any changes that require
      the mutex.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ff55333f
    • D
      btrfs: scrub: pass fs_info to scrub_setup_ctx · 8ba3169d
      David Sterba 提交于
      [ Upstream commit 92f7ba434f51e8e9317f1d166105889aa230abd2 ]
      
      We can pass fs_info directly as this is the only member of btrfs_device
      that's bing used inside scrub_setup_ctx.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      8ba3169d
    • Q
      btrfs: Use real device structure to verify dev extent · be77686f
      Qu Wenruo 提交于
      [ Upstream commit 1b3922a8bc74231f9a767d1be6d9a061a4d4eeab ]
      
      [BUG]
      Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel
      message:
      
        BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0
        BTRFS error (device dm-6): failed to verify dev extents against chunks: -117
        BTRFS error (device dm-6): open_ctree failed
      
      [CAUSE]
      Commit cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent
      mapping check") introduced strict check on dev extents.
      
      We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and
      only dependent on @devid to find the real device.
      
      For seed devices, we call clone_fs_devices() in open_seed_devices() to
      allow us search seed devices directly.
      
      However clone_fs_devices() just populates devices with devid and dev
      uuid, without populating other essential members, like disk_total_bytes.
      
      This makes any device returned by btrfs_find_device(fs_info, devid,
      NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev
      extents on the seed device will not pass the device boundary check.
      
      [FIX]
      This patch will try to verify the device returned by btrfs_find_device()
      and if it's a dummy then re-search in seed devices.
      
      Fixes: cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent mapping check")
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      be77686f
    • Q
      btrfs: volumes: Make sure no dev extent is beyond device boundary · a2790b99
      Qu Wenruo 提交于
      [ Upstream commit 05a37c48604c19b50873fd9663f9140c150469d1 ]
      
      Add extra dev extent end check against device boundary.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a2790b99
    • N
      btrfs: Fix error handling in btrfs_cleanup_ordered_extents · eb124aaa
      Nikolay Borisov 提交于
      [ Upstream commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 ]
      
      Running btrfs/124 in a loop hung up on me sporadically with the
      following call trace:
      
      	btrfs           D    0  5760   5324 0x00000000
      	Call Trace:
      	 ? __schedule+0x243/0x800
      	 schedule+0x33/0x90
      	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
      	 ? wait_woken+0xa0/0xa0
      	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
      	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
      	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
      	 btrfs_balance+0xbeb/0x1740 [btrfs]
      	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
      	 btrfs_ioctl+0x1691/0x3110 [btrfs]
      	 ? lockdep_hardirqs_on+0xed/0x180
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? _raw_spin_unlock+0x24/0x30
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? do_vfs_ioctl+0xa5/0x6e0
      	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      	 do_vfs_ioctl+0xa5/0x6e0
      	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
      	 ksys_ioctl+0x3a/0x70
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x60/0x1b0
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This happens because during page writeback it's valid for
      writepage_delalloc to instantiate a delalloc range which doesn't belong
      to the page currently being written back.
      
      The reason this case is valid is due to find_lock_delalloc_range
      returning any available range after the passed delalloc_start and
      ignoring whether the page under writeback is within that range.
      
      In turn ordered extents (OE) are always created for the returned range
      from find_lock_delalloc_range. If, however, a failure occurs while OE
      are being created then the clean up code in btrfs_cleanup_ordered_extents
      will be called.
      
      Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
      the case of such 'foreign' range being processed and instead it always
      assumes that the range OE are created for belongs to the page. This
      leads to the first page of such foregin range to not be cleaned up since
      it's deliberately missed and skipped by the current cleaning up code.
      
      Fix this by correctly checking whether the current page belongs to the
      range being instantiated and if so adjsut the range parameters passed
      for cleaning up. If it doesn't, then just clean the whole OE range
      directly.
      
      Fixes: 52427260 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      eb124aaa
    • N
      btrfs: Remove extent_io_ops::fill_delalloc · 1669d1d2
      Nikolay Borisov 提交于
      [ Upstream commit 5eaad97af8aeff38debe7d3c69ec3a0d71f8350f ]
      
      This callback is called only from writepage_delalloc which in turn is
      guaranteed to be called from the data page writeout path. In the end
      there is no reason to have the call to this function to be indrected via
      the extent_io_ops structure. This patch removes the callback definition,
      exports the function and calls it directly. No functional changes.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_run_delalloc_range ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1669d1d2
    • F
      Btrfs: fix deadlock with memory reclaim during scrub · 338a528b
      Filipe Manana 提交于
      [ Upstream commit a5fb11429167ee6ddeeacc554efaf5776b36433a ]
      
      When a transaction commit starts, it attempts to pause scrub and it blocks
      until the scrub is paused. So while the transaction is blocked waiting for
      scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
      otherwise we risk getting into a deadlock with reclaim.
      
      Checking for scrub pause requests is done early at the beginning of the
      while loop of scrub_stripe() and later in the loop, scrub_extent() and
      scrub_raid56_parity() are called, which in turn call scrub_pages() and
      scrub_pages_for_parity() respectively. These last two functions do memory
      allocations using GFP_KERNEL. Same problem could happen while scrubbing
      the super blocks, since it calls scrub_pages().
      
      We also can not have any of the worker tasks, created by the scrub task,
      doing GFP_KERNEL allocations, because before pausing, the scrub task waits
      for all the worker tasks to complete (also done at scrub_stripe()).
      
      So make sure GFP_NOFS is used for the memory allocations because at any
      time a scrub pause request can happen from another task that started to
      commit a transaction.
      
      Fixes: 58c4e173 ("btrfs: scrub: use GFP_KERNEL on the submission path")
      CC: stable@vger.kernel.org # 4.6+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      338a528b
    • O
      Btrfs: clean up scrub is_dev_replace parameter · fac80347
      Omar Sandoval 提交于
      [ Upstream commit 32934280967d00dc2b5c4d3b63b21a9c8638326e ]
      
      struct scrub_ctx has an ->is_dev_replace member, so there's no point in
      passing around is_dev_replace where sctx is available.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      fac80347
  10. 25 8月, 2019 1 次提交
    • F
      Btrfs: fix deadlock between fiemap and transaction commits · f833deae
      Filipe Manana 提交于
      [ Upstream commit a6d155d2e363f26290ffd50591169cb96c2a609e ]
      
      The fiemap handler locks a file range that can have unflushed delalloc,
      and after locking the range, it tries to attach to a running transaction.
      If the running transaction started its commit, that is, it is in state
      TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
      flushoncommit option or the transaction is creating a snapshot for the
      subvolume that contains the file that fiemap is operating on, we end up
      deadlocking. This happens because fiemap is blocked on the transaction,
      waiting for it to complete, and the transaction is waiting for the flushed
      dealloc to complete, which requires locking the file range that the fiemap
      task already locked. The following stack traces serve as an example of
      when this deadlock happens:
      
        (...)
        [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [404571.515956] Call Trace:
        [404571.516360]  ? __schedule+0x3ae/0x7b0
        [404571.516730]  schedule+0x3a/0xb0
        [404571.517104]  lock_extent_bits+0x1ec/0x2a0 [btrfs]
        [404571.517465]  ? remove_wait_queue+0x60/0x60
        [404571.517832]  btrfs_finish_ordered_io+0x292/0x800 [btrfs]
        [404571.518202]  normal_work_helper+0xea/0x530 [btrfs]
        [404571.518566]  process_one_work+0x21e/0x5c0
        [404571.518990]  worker_thread+0x4f/0x3b0
        [404571.519413]  ? process_one_work+0x5c0/0x5c0
        [404571.519829]  kthread+0x103/0x140
        [404571.520191]  ? kthread_create_worker_on_cpu+0x70/0x70
        [404571.520565]  ret_from_fork+0x3a/0x50
        [404571.520915] kworker/u8:6    D    0 31651      2 0x80004000
        [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
        (...)
        [404571.537000] fsstress        D    0 13117  13115 0x00004000
        [404571.537263] Call Trace:
        [404571.537524]  ? __schedule+0x3ae/0x7b0
        [404571.537788]  schedule+0x3a/0xb0
        [404571.538066]  wait_current_trans+0xc8/0x100 [btrfs]
        [404571.538349]  ? remove_wait_queue+0x60/0x60
        [404571.538680]  start_transaction+0x33c/0x500 [btrfs]
        [404571.539076]  btrfs_check_shared+0xa3/0x1f0 [btrfs]
        [404571.539513]  ? extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.539866]  extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.540170]  do_vfs_ioctl+0x526/0x6f0
        [404571.540436]  ksys_ioctl+0x70/0x80
        [404571.540734]  __x64_sys_ioctl+0x16/0x20
        [404571.540997]  do_syscall_64+0x60/0x1d0
        [404571.541279]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
        [404571.543729] btrfs           D    0 14210  14208 0x00004000
        [404571.544023] Call Trace:
        [404571.544275]  ? __schedule+0x3ae/0x7b0
        [404571.544526]  ? wait_for_completion+0x112/0x1a0
        [404571.544795]  schedule+0x3a/0xb0
        [404571.545064]  schedule_timeout+0x1ff/0x390
        [404571.545351]  ? lock_acquire+0xa6/0x190
        [404571.545638]  ? wait_for_completion+0x49/0x1a0
        [404571.545890]  ? wait_for_completion+0x112/0x1a0
        [404571.546228]  wait_for_completion+0x131/0x1a0
        [404571.546503]  ? wake_up_q+0x70/0x70
        [404571.546775]  btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
        [404571.547159]  btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
        [404571.547449]  ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
        [404571.547703]  ? remove_wait_queue+0x60/0x60
        [404571.547969]  btrfs_mksubvol+0x605/0x640 [btrfs]
        [404571.548226]  ? __sb_start_write+0xd4/0x1c0
        [404571.548512]  ? mnt_want_write_file+0x24/0x50
        [404571.548789]  btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
        [404571.549048]  btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
        [404571.549307]  btrfs_ioctl+0x133f/0x3150 [btrfs]
        [404571.549549]  ? mem_cgroup_charge_statistics+0x4c/0xd0
        [404571.549792]  ? mem_cgroup_commit_charge+0x84/0x4b0
        [404571.550064]  ? __handle_mm_fault+0xe3e/0x11f0
        [404571.550306]  ? do_raw_spin_unlock+0x49/0xc0
        [404571.550608]  ? _raw_spin_unlock+0x24/0x30
        [404571.550976]  ? __handle_mm_fault+0xedf/0x11f0
        [404571.551319]  ? do_vfs_ioctl+0xa2/0x6f0
        [404571.551659]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [404571.552087]  do_vfs_ioctl+0xa2/0x6f0
        [404571.552355]  ksys_ioctl+0x70/0x80
        [404571.552621]  __x64_sys_ioctl+0x16/0x20
        [404571.552864]  do_syscall_64+0x60/0x1d0
        [404571.553104]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
      
      If we were joining the transaction instead of attaching to it, we would
      not risk a deadlock because a join only blocks if the transaction is in a
      state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
      flush performed by a transaction is done before it reaches that state,
      when it is in the state TRANS_STATE_COMMIT_START. However a transaction
      join is intended for use cases where we do modify the filesystem, and
      fiemap only needs to peek at delayed references from the current
      transaction in order to determine if extents are shared, and, besides
      that, when there is no current transaction or when it blocks to wait for
      a current committing transaction to complete, it creates a new transaction
      without reserving any space. Such unnecessary transactions, besides doing
      unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
      rotation of the precious backup roots.
      
      So fix this by adding a new transaction join variant, named join_nostart,
      which behaves like the regular join, but it does not create a transaction
      when none currently exists or after waiting for a committing transaction
      to complete.
      
      Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f833deae
  11. 07 8月, 2019 1 次提交
    • F
      Btrfs: fix race leading to fs corruption after transaction abort · 50d70040
      Filipe Manana 提交于
      commit cb2d3daddbfb6318d170e79aac1f7d5e4d49f0d7 upstream.
      
      When one transaction is finishing its commit, it is possible for another
      transaction to start and enter its initial commit phase as well. If the
      first ends up getting aborted, we have a small time window where the second
      transaction commit does not notice that the previous transaction aborted
      and ends up committing, writing a superblock that points to btrees that
      reference extent buffers (nodes and leafs) that were not persisted to disk.
      The consequence is that after mounting the filesystem again, we will be
      unable to load some btree nodes/leafs, either because the content on disk
      is either garbage (or just zeroes) or corresponds to the old content of a
      previouly COWed or deleted node/leaf, resulting in the well known error
      messages "parent transid verify failed on ...".
      The following sequence diagram illustrates how this can happen.
      
              CPU 1                                           CPU 2
      
       <at transaction N>
      
       btrfs_commit_transaction()
         (...)
         --> sets transaction state to
             TRANS_STATE_UNBLOCKED
         --> sets fs_info->running_transaction
             to NULL
      
                                                          (...)
                                                          btrfs_start_transaction()
                                                            start_transaction()
                                                              wait_current_trans()
                                                                --> returns immediately
                                                                    because
                                                                    fs_info->running_transaction
                                                                    is NULL
                                                              join_transaction()
                                                                --> creates transaction N + 1
                                                                --> sets
                                                                    fs_info->running_transaction
                                                                    to transaction N + 1
                                                                --> adds transaction N + 1 to
                                                                    the fs_info->trans_list list
                                                              --> returns transaction handle
                                                                  pointing to the new
                                                                  transaction N + 1
                                                          (...)
      
                                                          btrfs_sync_file()
                                                            btrfs_start_transaction()
                                                              --> returns handle to
                                                                  transaction N + 1
                                                            (...)
      
         btrfs_write_and_wait_transaction()
           --> writeback of some extent
               buffer fails, returns an
      	 error
         btrfs_handle_fs_error()
           --> sets BTRFS_FS_STATE_ERROR in
               fs_info->fs_state
         --> jumps to label "scrub_continue"
         cleanup_transaction()
           btrfs_abort_transaction(N)
             --> sets BTRFS_FS_STATE_TRANS_ABORTED
                 flag in fs_info->fs_state
             --> sets aborted field in the
                 transaction and transaction
      	   handle structures, for
                 transaction N only
           --> removes transaction from the
               list fs_info->trans_list
                                                            btrfs_commit_transaction(N + 1)
                                                              --> transaction N + 1 was not
      							    aborted, so it proceeds
                                                              (...)
                                                              --> sets the transaction's state
                                                                  to TRANS_STATE_COMMIT_START
                                                              --> does not find the previous
                                                                  transaction (N) in the
                                                                  fs_info->trans_list, so it
                                                                  doesn't know that transaction
                                                                  was aborted, and the commit
                                                                  of transaction N + 1 proceeds
                                                              (...)
                                                              --> sets transaction N + 1 state
                                                                  to TRANS_STATE_UNBLOCKED
                                                              btrfs_write_and_wait_transaction()
                                                                --> succeeds writing all extent
                                                                    buffers created in the
                                                                    transaction N + 1
                                                              write_all_supers()
                                                                 --> succeeds
                                                                 --> we now have a superblock on
                                                                     disk that points to trees
                                                                     that refer to at least one
                                                                     extent buffer that was
                                                                     never persisted
      
      So fix this by updating the transaction commit path to check if the flag
      BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting
      the transaction to the TRANS_STATE_COMMIT_START we do not find any previous
      transaction in the fs_info->trans_list. If the flag is set, just fail the
      transaction commit with -EROFS, as we do in other places. The exact error
      code for the previous transaction abort was already logged and reported.
      
      Fixes: 49b25e05 ("btrfs: enhance transaction abort infrastructure")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      50d70040