1. 09 Sep, 2019 1 commit
  2. 30 Apr, 2019 21 commits
    • Q
      btrfs: ctree: Dump the leaf before BUG_ON in btrfs_set_item_key_safe · 7c15d410
      Qu Wenruo committed
      We have a long standing problem with reversed keys that's detected by
      btrfs_set_item_key_safe. This is hard to reproduce so we'd like to
      capture more information for later analysis.
      
      Let's dump the leaf content before triggering BUG_ON() so that we can
      have some clue about what's going wrong.  The output of tree locks should
      help us debug such problems.
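The idea can be illustrated with a minimal userspace sketch. All names here (`demo_key`, `demo_leaf`, `demo_print_leaf`, `demo_set_item_key_safe`) are hypothetical stand-ins, not the kernel's API, and the sketch returns -1 where the kernel would hit BUG_ON(), so that the behaviour can be observed without crashing:

```c
#include <stdio.h>

/* Hypothetical, simplified stand-ins for a btrfs key and leaf. */
struct demo_key {
    unsigned long long objectid;
    unsigned char type;
    unsigned long long offset;
};

struct demo_leaf {
    int nritems;
    struct demo_key keys[8];
};

/* Three-way comparison in (objectid, type, offset) order. */
int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Dump every key of the leaf to stderr for later analysis. */
void demo_print_leaf(const struct demo_leaf *leaf)
{
    fprintf(stderr, "leaf items %d\n", leaf->nritems);
    for (int i = 0; i < leaf->nritems; i++)
        fprintf(stderr, "\titem %d key (%llu %u %llu)\n", i,
                leaf->keys[i].objectid, (unsigned)leaf->keys[i].type,
                leaf->keys[i].offset);
}

/*
 * Replace the key at 'slot' only if the leaf stays sorted. On a reversed
 * key, dump the leaf first and then fail (the kernel would BUG here).
 */
int demo_set_item_key_safe(struct demo_leaf *leaf, int slot,
                           const struct demo_key *new_key)
{
    if (slot > 0 &&
        demo_comp_keys(&leaf->keys[slot - 1], new_key) >= 0) {
        demo_print_leaf(leaf);
        return -1;
    }
    if (slot < leaf->nritems - 1 &&
        demo_comp_keys(&leaf->keys[slot + 1], new_key) <= 0) {
        demo_print_leaf(leaf);
        return -1;
    }
    leaf->keys[slot] = *new_key;
    return 0;
}
```

The point of the ordering is that the dump happens before the failure path is taken, so the evidence is in the log even if the machine goes down right after.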
      
      Sample stacktrace:
      
       generic/522             [00:07:05]
       [26946.113381] run fstests generic/522 at 2019-04-16 00:07:05
       [27161.474720] kernel BUG at fs/btrfs/ctree.c:3192!
       [27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP
       [27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G        W         5.1.0-rc5-default+ #562
       [27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
       [27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.493845] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.495608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
       [27161.498100] Call Trace:
       [27161.498771]  __btrfs_drop_extents+0x6ec/0xdf0 [btrfs]
       [27161.499872]  btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs]
       [27161.501114]  btrfs_log_inode+0x7ff/0xdc0 [btrfs]
       [27161.502114]  ? __mutex_unlock_slowpath+0x4b/0x2b0
       [27161.503172]  btrfs_log_inode_parent+0x237/0x9c0 [btrfs]
       [27161.504348]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [27161.505374]  btrfs_sync_file+0x1b7/0x480 [btrfs]
       [27161.506371]  __x64_sys_msync+0x180/0x210
       [27161.507208]  do_syscall_64+0x54/0x180
       [27161.507932]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [27161.508839] RIP: 0033:0x7f8ea5aa9c61
       [27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       [27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61
       [27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000
       [27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000
       [27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938
       [27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000
       [27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug]
       [27161.522272] ---[ end trace d5afec7ccac6a252 ]---
       [27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.534405] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.536048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7c15d410
    • D
    • D
      179d1e6a
    • D
      c7da9597
    • D
      c71dd880
    • D
      78ac4f9e
    • D
      25263cd7
    • D
      btrfs: get fs_info from eb in __push_leaf_left · 8087c193
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
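The pattern behind this series of cleanups can be sketched as follows. This is a simplified userspace illustration with hypothetical names (`demo_fs_info`, `demo_extent_buffer`, `demo_leaf_free_space`); it only assumes, as the message states, that the extent buffer carries a pointer back to its fs_info:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_fs_info {
    unsigned int nodesize;
};

struct demo_extent_buffer {
    struct demo_fs_info *fs_info;   /* back pointer to the filesystem */
    unsigned int used;              /* bytes used in this block */
};

/* Before: the caller passes fs_info although eb already knows it. */
unsigned int demo_leaf_free_space_old(struct demo_fs_info *fs_info,
                                      struct demo_extent_buffer *eb)
{
    return fs_info->nodesize - eb->used;
}

/* After: fs_info is read from the extent buffer, one parameter fewer. */
unsigned int demo_leaf_free_space(struct demo_extent_buffer *eb)
{
    return eb->fs_info->nodesize - eb->used;
}
```

Besides shortening call sites, dropping the parameter removes the possibility of passing an fs_info that does not match the buffer.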
      Signed-off-by: David Sterba <dsterba@suse.com>
      8087c193
    • D
      btrfs: get fs_info from eb in __push_leaf_right · f72f0010
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      f72f0010
    • D
      btrfs: get fs_info from trans in copy_for_split · 94f94ad9
      David Sterba committed
      We can read fs_info from the transaction and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      94f94ad9
    • D
      btrfs: get fs_info from trans in insert_ptr · 6ad3cf6d
      David Sterba committed
      We can read fs_info from the transaction and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ad3cf6d
    • D
      btrfs: get fs_info from trans in balance_node_right · 55d32ed8
      David Sterba committed
      We can read fs_info from the transaction and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      55d32ed8
    • D
      btrfs: get fs_info from trans in push_node_left · d30a668f
      David Sterba committed
      We can read fs_info from the transaction and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      d30a668f
    • D
      btrfs: get fs_info from eb in btrfs_verify_level_key · e064d5e9
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      e064d5e9
    • D
      btrfs: get fs_info from eb in read_node_slot · d0d20b0f
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      d0d20b0f
    • D
      btrfs: get fs_info from eb in btrfs_leaf_free_space · e902baac
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      e902baac
    • D
      btrfs: get fs_info from eb in clean_tree_block · 6a884d7d
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      6a884d7d
    • D
      btrfs: get fs_info from eb in tree_mod_log_eb_copy · ed874f0d
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed874f0d
    • D
      btrfs: get fs_info from eb in leaf_data_end · 8f881e8c
      David Sterba committed
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      8f881e8c
    • A
      btrfs: use BUG() instead of BUG_ON(1) · 290342f6
      Arnd Bergmann committed
      BUG_ON(1) leads to bogus warnings from clang when
      CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
      
      fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
            [-Werror,-Wsometimes-uninitialized]
                      BUG_ON(1);
                      ^~~~~~~~~
      include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
       #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
                                         ^~~~~~~~~~~~~~~~~~~
      include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
       #  define unlikely(x)   (__branch_check__(x, 0, __builtin_constant_p(x)))
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
                                   max_chunk_size);
                                   ^~~~~~~~~~~~~~
      include/linux/kernel.h:860:36: note: expanded from macro 'min'
       #define min(x, y)       __careful_cmp(x, y, <)
                                               ^
      include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
                      __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
                                    ^
      include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
                      typeof(y) unique_y = (y);               \
                                            ^
      fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
                      BUG_ON(1);
                      ^
      include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
       #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
                                     ^
      fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
              u64 max_chunk_size;
                                ^
                                 = 0
      
      Change it to BUG() so clang can see that this code path can never
      continue.
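A userspace sketch of why the change helps, under the assumption that the compiler treats a call to a noreturn function as ending the code path. `demo_bug()` is a hypothetical stand-in for the kernel's BUG(), and the profile names and sizes are illustrative values only, not taken from fs/btrfs/volumes.c:

```c
#include <stdlib.h>

/* Hypothetical stand-in for BUG(): declared as never returning. */
_Noreturn void demo_bug(void)
{
    abort();
}

/* Illustrative block group kinds, not the kernel's flag values. */
enum demo_profile { DEMO_DATA, DEMO_METADATA, DEMO_SYSTEM };

unsigned long long demo_max_chunk_size(int type)
{
    unsigned long long max_chunk_size;

    if (type == DEMO_DATA)
        max_chunk_size = 10ULL << 30;
    else if (type == DEMO_METADATA)
        max_chunk_size = 1ULL << 30;
    else if (type == DEMO_SYSTEM)
        max_chunk_size = 32ULL << 20;
    else
        demo_bug();     /* noreturn: no path reaches the return unset */

    /*
     * With a conditional wrapper such as BUG_ON(1) whose condition the
     * compiler cannot fold, the final branch would appear to fall
     * through and max_chunk_size would look possibly uninitialized.
     */
    return max_chunk_size;
}
```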
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      290342f6
    • Q
      btrfs: Check the first key and level for cached extent buffer · 448de471
      Qu Wenruo committed
      [BUG]
      When reading a file from a fuzzed image, kernel can panic like:
      
        BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1
        assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3500!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs]
        Call Trace:
         btrfs_lookup_csum+0x52/0x150 [btrfs]
         __btrfs_lookup_bio_sums+0x209/0x640 [btrfs]
         btrfs_submit_bio_hook+0x103/0x170 [btrfs]
         submit_one_bio+0x59/0x80 [btrfs]
         extent_read_full_page+0x58/0x80 [btrfs]
         generic_file_read_iter+0x2f6/0x9d0
         __vfs_read+0x14d/0x1a0
         vfs_read+0x8d/0x140
         ksys_read+0x52/0xc0
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      The fuzzed image has a corrupted leaf whose first key doesn't match its
      parent:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
        fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
        chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
        	...
                key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19
      
        leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE
        leaf 29761536 flags 0x1(WRITTEN) backref revision 1
        fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
        chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
                item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244
                        range start 8798638964736 end 8798641262592 length 2297856
      
      When reading the above tree block, we have extent_buffer->refs = 2 in
      the context:
      
      - initial one from __alloc_extent_buffer()
        alloc_extent_buffer()
        |- __alloc_extent_buffer()
           |- atomic_set(&eb->refs, 1)
      
      - one being added to fs_info->buffer_radix
        alloc_extent_buffer()
        |- check_buffer_tree_ref()
           |- atomic_inc(&eb->refs)
      
      So even if we call free_extent_buffer() in read_tree_block() or a
      similar situation, we only decrease the refs by 1. It doesn't reach 0,
      so the buffer won't be freed right away.
      
      The stale eb and its corrupted content will still be kept cached.
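The reference counting described above can be traced with a small userspace model. The `demo_*` names are hypothetical and the model collapses the allocation and the radix tree insertion into one function; it is a sketch of the counting only, not of the kernel code:

```c
#include <stdbool.h>

/* Hypothetical, simplified model of an extent buffer. */
struct demo_eb {
    int refs;
    bool in_cache;
    bool corrupted;
};

/* Models alloc_extent_buffer(): initial ref plus the cache's ref. */
void demo_alloc_extent_buffer(struct demo_eb *eb)
{
    eb->refs = 1;           /* initial ref, as in __alloc_extent_buffer() */
    eb->corrupted = false;
    eb->refs++;             /* ref taken for the buffer cache */
    eb->in_cache = true;
}

/* Drops one ref. Returns true only if the buffer was actually freed. */
bool demo_free_extent_buffer(struct demo_eb *eb)
{
    eb->refs--;
    if (eb->refs == 0) {
        eb->in_cache = false;
        return true;
    }
    return false;
}
```

After a failed read, a single `demo_free_extent_buffer()` call leaves `refs == 1` and `in_cache == true`, which is how the corrupted content can be found again by a later lookup.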
      
      Furthermore, we have several extra cases where we either don't do first
      key check or the check is not proper for all callers:
      
      - scrub
        We just don't have first key in this context.
      
      - shared tree block
        One tree block can be shared by several snapshot/subvolume trees.
        In that case, the first key check for one subvolume doesn't apply to
        another.
      
      So for the above reasons, a corrupted extent buffer can sneak into the
      buffer cache.
      
      [FIX]
      Call verify_level_key in read_block_for_search to do another
      verification. For that purpose the function is exported.
      
      For the reasons above, although we can free a corrupted extent buffer
      from the cache, we still need the check in read_block_for_search() for
      scrub and shared tree blocks.
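A minimal sketch of such a check on a cached buffer, assuming the parent supplies the expected level and first key. The `demo_*` names and the choice of -EIO as the error code are assumptions made for illustration; the constants mirror the EXTENT_CSUM key shown in the dump above:

```c
#include <errno.h>

/* Illustrative constants for the (EXTENT_CSUM EXTENT_CSUM offset) key. */
#define DEMO_CSUM_OBJECTID (-10ULL)
#define DEMO_CSUM_KEY      128

struct demo_key {
    unsigned long long objectid;
    unsigned char type;
    unsigned long long offset;
};

/* Hypothetical, simplified view of a cached tree block. */
struct demo_eb {
    int level;
    struct demo_key first_key;
};

/*
 * Compare what the parent pointer promised (level, first key) with what
 * the cached block actually contains. Returns 0 on match, -EIO otherwise.
 */
int demo_verify_level_key(const struct demo_eb *eb, int level,
                          const struct demo_key *first_key)
{
    if (eb->level != level)
        return -EIO;
    if (eb->first_key.objectid != first_key->objectid ||
        eb->first_key.type != first_key->type ||
        eb->first_key.offset != first_key->offset)
        return -EIO;
    return 0;
}
```

With the values from the fuzzed image, the parent expects offset 79691776 while the leaf starts at 8798638964736, so the check fails and the search can return an error instead of tripping an assertion.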
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202757
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202759
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202761
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202767
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202769
      Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      448de471
  3. 25 Feb, 2019 6 commits
    • F
      Btrfs: remove assertion when searching for a key in a node/leaf · 253002f2
      Filipe Manana committed
      At ctree.c:key_search(), the assertion that verifies the first key on a
      child extent buffer corresponds to the key at a specific slot in the
      parent has a disadvantage: we effectively hit a BUG_ON() which requires
      rebooting the machine later. It also does not give any information about
      which extent buffer is affected, from which root, the expected and found
      keys, etc.
      
      However as of commit 581c1760 ("btrfs: Validate child tree block's
      level and first key"), that assertion is not needed since at the time we
      read an extent buffer from disk we validate that its first key matches the
      key, at the respective slot, in the parent extent buffer. Therefore just
      remove the assertion at key_search().
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      253002f2
    • F
      Btrfs: add missing error handling after doing leaf/node binary search · cbca7d59
      Filipe Manana committed
      The function map_private_extent_buffer() can return an -EINVAL error, and
      it is called by generic_bin_search() which will return back the error. The
      btrfs_bin_search() function in turn calls generic_bin_search() and the
      key_search() function calls btrfs_bin_search(), so both can return the
      -EINVAL error coming from the map_private_extent_buffer() function. Some
      callers of these functions were ignoring that these functions can return
      an error, so fix them to deal with error return values.
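The error propagation described above can be sketched in userspace as follows. `demo_map()` is a hypothetical stand-in for a mapping step that may fail with -EINVAL, and the return convention of `demo_bin_search()` (0 found, 1 not found, negative on error) is an assumption chosen for this illustration:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical node: a sorted array of keys. */
struct demo_node {
    const unsigned long long *keys;
    int nritems;
};

/* Stand-in for a mapping step that can fail with -EINVAL. */
int demo_map(const struct demo_node *n, int idx, unsigned long long *out)
{
    if (n->keys == NULL || idx < 0 || idx >= n->nritems)
        return -EINVAL;
    *out = n->keys[idx];
    return 0;
}

/* Returns 0 if found, 1 if not found, or a negative errno. */
int demo_bin_search(const struct demo_node *n, unsigned long long key,
                    int *slot)
{
    int low = 0, high = n->nritems;

    while (low < high) {
        int mid = low + (high - low) / 2;
        unsigned long long cur;
        int err = demo_map(n, mid, &cur);

        if (err)
            return err;     /* hand the mapping error to the caller */
        if (cur < key)
            low = mid + 1;
        else if (cur > key)
            high = mid;
        else {
            *slot = mid;
            return 0;
        }
    }
    *slot = low;
    return 1;
}

/* A caller that distinguishes "error" from "not found". */
int demo_lookup(const struct demo_node *n, unsigned long long key)
{
    int slot;
    int ret = demo_bin_search(n, key, &slot);

    if (ret < 0)
        return ret;         /* the case some callers used to ignore */
    if (ret > 0)
        return -ENOENT;
    return slot;
}
```

A caller that only tests `ret == 0` versus `ret != 0` would treat the -EINVAL case as "not found" and continue with an unset slot, which is the kind of bug the commit addresses.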
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cbca7d59
    • D
      btrfs: merge btrfs_set_lock_blocking_rw with it's caller · 766ece54
      David Sterba committed
      The last caller that does not have a fixed value of lock is
      btrfs_set_path_blocking, that actually does the same conditional switch
      by the lock type so we can merge the branches together and remove the
      helper.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      766ece54
    • D
      btrfs: open code now trivial btrfs_set_lock_blocking · 8bead258
      David Sterba committed
      btrfs_set_lock_blocking is now only a simple wrapper around
      btrfs_set_lock_blocking_write. The name does not bring any semantic
      value that could not be inferred from the new function so there's no
      point keeping it.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8bead258
    • D
      btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers · 300aa896
      David Sterba committed
      We can use the right helper where the lock type is a fixed parameter.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      300aa896
    • Q
      btrfs: qgroup: Use delayed subtree rescan for balance · f616f5cd
      Qu Wenruo committed
      Before this patch, qgroup code traces the whole subtree of subvolume and
      reloc trees unconditionally.
      
      This makes qgroup numbers consistent, but it could cause tons of
      unnecessary extent tracing, which causes a lot of overhead.
      
      However, a subtree swap during balance swaps two subtrees that have
      the same contents and tree structure, so qgroup numbers won't change.
      
      Only the race window between the subtree swap and the transaction
      commit can cause qgroup numbers to change.
      
      This patch will delay the qgroup subtree scan until COW happens for the
      subtree root.
      
      So if there are no other operations on the fs, balance won't cause extra
      qgroup overhead. (best case scenario)
      Depending on the workload, most of the subtree scan can still be
      avoided.
      
      Only for worst case scenario, it will fall back to old subtree swap
      overhead. (scan all swapped subtrees)
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
      And after file system population, there is no other activity, so it
      should be the best case scenario.
      
                           | v4.20-rc1            | w/ patchset    | diff
      -----------------------------------------------------------------------
      relocated extents    | 22615                | 22457          | -0.7%
      qgroup dirty extents | 163457               | 121606         | -25.6%
      time (sys)           | 22.884s              | 18.842s        | -17.6%
      time (real)          | 27.724s              | 22.884s        | -17.5%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f616f5cd
  4. 28 Jan, 2019 1 commit
    • F
      Btrfs: fix deadlock when allocating tree block during leaf/node split · a6279470
      Filipe Manana committed
      When splitting a leaf or node from one of the trees that are modified when
      flushing pending block groups (extent, chunk, device and free space trees),
      we need to allocate a new tree block, which in turn can result in the need
      to allocate a new block group. After allocating the new block group we may
      need to flush new block groups that were previously allocated during the
      course of the current transaction, which is what may cause a deadlock due
      to attempts to write lock twice the same leaf or node, as when splitting
      a leaf or node we are holding a write lock on it and its parent node.
      
      The same type of deadlock can also happen when increasing the tree's
      height, since we are holding a lock on the existing root while allocating
      the tree block to use as the new root node.
      
      An example trace when the deadlock happens during the leaf split path is:
      
        [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G        W         4.19.16 #1
        [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
        [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
        (...)
        [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246
        [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001
        [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48
        [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000
        [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000
        [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18
        [27175.303601] FS:  0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000
        [27175.304468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0
        [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [27175.307940] Call Trace:
        [27175.308802]  btrfs_search_slot+0x779/0x9a0 [btrfs]
        [27175.309669]  ? update_space_info+0xba/0xe0 [btrfs]
        [27175.310534]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
        [27175.311397]  btrfs_insert_item+0x60/0xd0 [btrfs]
        [27175.312253]  btrfs_create_pending_block_groups+0xee/0x210 [btrfs]
        [27175.313116]  do_chunk_alloc+0x25f/0x300 [btrfs]
        [27175.313984]  find_free_extent+0x706/0x10d0 [btrfs]
        [27175.314855]  btrfs_reserve_extent+0x9b/0x1d0 [btrfs]
        [27175.315707]  btrfs_alloc_tree_block+0x100/0x5b0 [btrfs]
        [27175.316548]  split_leaf+0x130/0x610 [btrfs]
        [27175.317390]  btrfs_search_slot+0x94d/0x9a0 [btrfs]
        [27175.318235]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
        [27175.319087]  alloc_reserved_file_extent+0x84/0x2c0 [btrfs]
        [27175.319938]  __btrfs_run_delayed_refs+0x596/0x1150 [btrfs]
        [27175.320792]  btrfs_run_delayed_refs+0xed/0x1b0 [btrfs]
        [27175.321643]  delayed_ref_async_start+0x81/0x90 [btrfs]
        [27175.322491]  normal_work_helper+0xd0/0x320 [btrfs]
        [27175.323328]  ? move_linked_works+0x6e/0xa0
        [27175.324160]  process_one_work+0x191/0x370
        [27175.324976]  worker_thread+0x4f/0x3b0
        [27175.325763]  kthread+0xf8/0x130
        [27175.326531]  ? rescuer_thread+0x320/0x320
        [27175.327284]  ? kthread_create_worker_on_cpu+0x50/0x50
        [27175.328027]  ret_from_fork+0x35/0x40
        [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]---
      
      Fix this by preventing the flushing of new block groups when splitting a
      leaf/node and when inserting a new root node for one of the trees modified
      by the flushing operation, similar to what is done when COWing a node/leaf
      from one of these trees.
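The shape of such a fix can be sketched with a per-transaction flag that suppresses flushing while a tree block is being allocated. This is a simplified userspace model with hypothetical names (`demo_trans`, `demo_alloc_tree_block_no_bg_flush`); unlike the description above, it applies the suppression unconditionally rather than only for the affected trees:

```c
#include <stdbool.h>

/* Hypothetical, simplified transaction handle. */
struct demo_trans {
    bool can_flush_pending_bgs;
    int pending_bgs;    /* block groups created but not yet flushed */
    int flushed;        /* block groups flushed so far */
};

/* Flushing inserts items into trees, so it needs tree locks. */
void demo_flush_pending_bgs(struct demo_trans *t)
{
    if (!t->can_flush_pending_bgs)
        return;
    t->flushed += t->pending_bgs;
    t->pending_bgs = 0;
}

/* Allocation may create a new block group and then try to flush. */
void demo_alloc_tree_block(struct demo_trans *t)
{
    t->pending_bgs++;
    demo_flush_pending_bgs(t);
}

/*
 * Variant for callers that already hold write locks on a leaf/node:
 * disable flushing for the duration of the allocation, then restore
 * the previous setting.
 */
void demo_alloc_tree_block_no_bg_flush(struct demo_trans *t)
{
    bool saved = t->can_flush_pending_bgs;

    t->can_flush_pending_bgs = false;
    demo_alloc_tree_block(t);
    t->can_flush_pending_bgs = saved;
}
```

The pending block groups are not lost; they are flushed by a later call made from a context that does not hold the conflicting locks.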
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383
      Reported-by: Eli V <eliventer@gmail.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a6279470
  5. 09 Jan, 2019 1 commit
  6. 17 Dec, 2018 8 commits
    • A
      btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini committed
      The typos accumulate over time so once in a while they get fixed in
      a large patch.
      Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      52042d8e
    • F
      Btrfs: send, fix race with transaction commits that create snapshots · be6821f8
      Filipe Manana committed
      If we create a snapshot of a snapshot currently being used by a send
      operation, we can end up with send failing unexpectedly (returning
      -ENOENT error to user space for example). The following diagram shows
      how this happens.
      
                  CPU 1                                   CPU2                                CPU3
      
       btrfs_ioctl_send()
        (...)
                                           create_snapshot()
                                            -> creates snapshot of a
                                               root used by the send
                                               task
                                            btrfs_commit_transaction()
                                             create_pending_snapshot()
        __get_inode_info()
         btrfs_search_slot()
          btrfs_search_slot_get_root()
           down_read commit_root_sem
      
           get reference on eb of the
           commit root
            -> eb with bytenr == X
      
           up_read commit_root_sem
      
                                              btrfs_cow_block(root node)
                                               btrfs_free_tree_block()
                                                -> creates delayed ref to
                                                   free the extent
      
                                             btrfs_run_delayed_refs()
                                              -> runs the delayed ref,
                                                 adds extent to
                                                 fs_info->pinned_extents
      
                                             btrfs_finish_extent_commit()
                                              unpin_extent_range()
                                               -> marks extent as free
                                                  in the free space cache
      
                                            transaction commit finishes
      
                                                                             btrfs_start_transaction()
                                                                              (...)
                                                                              btrfs_cow_block()
                                                                               btrfs_alloc_tree_block()
                                                                                btrfs_reserve_extent()
                                                                                 -> allocates extent at
                                                                                    bytenr == X
                                                                                btrfs_init_new_buffer(bytenr X)
                                                                                 btrfs_find_create_tree_block()
                                                                                  alloc_extent_buffer(bytenr X)
                                                                                   find_extent_buffer(bytenr X)
                                                                                    -> returns existing eb,
                                                                                       which the send task got
      
                                                                              (...)
                                                                               -> modifies content of the
                                                                                  eb with bytenr == X
      
          -> uses an eb that now
             belongs to some other
             tree and no longer matches
             the commit root of the
             snapshot, results will be
             unpredictable
      
      The consequences of this race can be various, and can lead to searches in
      the commit root performed by the send task failing unexpectedly (unable to
      find inode items, returning -ENOENT to user space, for example) or not
      failing because an inode item with the same number was added to the tree
      that reused the metadata extent, in which case send can behave incorrectly
      in the worst case or just fail later for some reason.
      
      Fix this by performing a copy of the commit root's extent buffer when doing
      a search in the context of a send operation.
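Why a private copy helps can be shown with a small userspace sketch. The `demo_*` names are hypothetical, and the sketch assumes the copy is taken while the shared buffer is still known to be valid (in the diagram above, while commit_root_sem is held):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified extent buffer with inline contents. */
struct demo_eb {
    unsigned long long bytenr;
    int refs;
    unsigned char data[64];
};

/*
 * Make a private copy of a buffer. The clone has its own contents and a
 * single reference, so later changes to 'src' do not affect it.
 * Returns NULL if the allocation fails.
 */
struct demo_eb *demo_clone_extent_buffer(const struct demo_eb *src)
{
    struct demo_eb *clone = malloc(sizeof(*clone));

    if (clone == NULL)
        return NULL;
    *clone = *src;
    clone->refs = 1;
    return clone;
}
```

If the extent at the same bytenr is later reallocated and the shared buffer rewritten by another tree, the send task keeps reading the snapshot's original commit root from its clone.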
      
      CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e: Btrfs: move get root out of btrfs_search_slot to a helper
      CC: stable@vger.kernel.org # 4.4.x: f9ddfd05: Btrfs: remove unused check of skip_locking
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      be6821f8
    • J
      btrfs: catch cow on deleting snapshots · 83354f07
      Josef Bacik committed
      When debugging some weird extent reference bug I suspected that we were
      changing a snapshot while we were deleting it, which could explain my
      bug.  This was indeed what was happening, and this patch helped me
      verify my theory.  It is never correct to modify the snapshot once it's
      being deleted, so mark the root when we are deleting it and make sure we
      complain about it when it happens.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      83354f07
    • N
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov committed
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      de37aa51
    • N
      btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov committed
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection: when the FSID is changed, the
      current UUID (the one with which the fs was created) is copied to the
      'metadata_uuid' field in the superblock and a new incompat flag,
      METADATA_UUID, is set. When the kernel detects that this flag is set it
      knows that the superblock in fact has 2 UUIDs:
      
      1. The UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         data structures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk; metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metadata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
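
The indirection described above can be modelled in a few lines. This is only a sketch in Python, not the kernel code: the `SuperBlock` class, the helper names and the flag value are hypothetical and chosen for illustration. It shows that, once the flag is set, the user-visible FSID can change while the UUID stamped into metadata stays the same.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical bit value for illustration; not the kernel's actual constant.
INCOMPAT_METADATA_UUID = 1 << 0


@dataclass
class SuperBlock:
    """Toy model of the two superblock fields discussed in the commit."""
    fsid: uuid.UUID
    incompat_flags: int = 0
    metadata_uuid: Optional[uuid.UUID] = None


def effective_metadata_uuid(sb: SuperBlock) -> uuid.UUID:
    """Return the UUID that metadata blocks are expected to carry."""
    if sb.incompat_flags & INCOMPAT_METADATA_UUID:
        return sb.metadata_uuid
    # Flag not set: both UUIDs are equal, only the FSID is stored.
    return sb.fsid


def change_fsid(sb: SuperBlock, new_fsid: uuid.UUID) -> None:
    """Change the user-visible FSID without touching metadata blocks."""
    if not sb.incompat_flags & INCOMPAT_METADATA_UUID:
        # First change: preserve the UUID the fs was created with.
        sb.metadata_uuid = sb.fsid
        sb.incompat_flags |= INCOMPAT_METADATA_UUID
    sb.fsid = new_fsid
```

A second call to `change_fsid` in this model leaves `metadata_uuid` untouched, so the metadata UUID always remains the one the filesystem was created with.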
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove extra reference count bumps in btrfs_compare_trees · 8c7eeb65
      Nikolay Borisov committed
      When the two comparison tree roots are initialised they are private to
      the function and already have a reference count of 1 each. There is no
      need to further increment the reference count since the cloned buffers
      are already accessed via struct btrfs_path. Eventually the two paths
      used for comparison are going to be released, effectively disposing of
      the cloned buffers.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewind · 24cee18a
      Nikolay Borisov committed
      When a rewound buffer is created it already has a ref count of 1 and the
      dummy flag set. Then another ref is taken bumping the count to 2.
      Finally when this buffer is released from btrfs_release_path the extra
      reference is decremented by the special handling code in
      free_extent_buffer.
      
      However, this special code is in fact redundant, because a ref count of
      1 is still correct: the buffer is only accessed via the btrfs_path
      struct. This paves the way for removing the special handling in
      free_extent_buffer.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove redundant extent_buffer_get in get_old_root · 6c122e2a
      Nikolay Borisov committed
      get_old_root is used only by btrfs_search_old_slot to initialise the
      path structure. The old root is always a cloned buffer (either via alloc
      dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1
      from allocation, 1 from the extent_buffer_get call in get_old_root.
      
      This latter explicit ref count acquire operation is in fact unnecessary
      since the semantic is such that the newly allocated buffer is handed
      over to the btrfs_path for lifetime management. Considering this just
      remove the extra extent_buffer_get in get_old_root.
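
The ownership handover argued for here can be illustrated with a toy reference-counting model. This is a sketch in Python rather than kernel C, and every name in it (`ExtentBuffer`, `Path`, `adopt`, the two `old_root_*` helpers) is hypothetical. It only demonstrates the accounting: a buffer allocated with one reference and handed to a path is freed when the path is released, whereas an extra get leaves a reference that something else must drop.

```python
class ExtentBuffer:
    """Toy buffer that starts with one reference from allocation."""

    def __init__(self):
        self.refs = 1
        self.freed = False

    def get(self):
        """Take an additional reference."""
        self.refs += 1

    def put(self):
        """Drop one reference; free the buffer when none remain."""
        assert self.refs > 0, "reference count underflow"
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


class Path:
    """Toy path that owns the references of the buffers it holds."""

    def __init__(self):
        self.nodes = []

    def adopt(self, eb):
        # Ownership handover: the path takes over the caller's
        # existing reference instead of taking a new one.
        self.nodes.append(eb)

    def release(self):
        """Drop the one reference held for each node."""
        for eb in self.nodes:
            eb.put()
        self.nodes.clear()


def old_root_with_extra_get():
    """Model of the old behaviour: allocation ref plus an explicit get."""
    eb = ExtentBuffer()
    eb.get()
    return eb


def old_root_handed_over():
    """Model of the new behaviour: only the allocation reference."""
    return ExtentBuffer()
```

In this model, releasing a path that adopted `old_root_handed_over()` frees the buffer, while one that adopted `old_root_with_extra_get()` leaves it with one outstanding reference.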
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 17 Oct, 2018 1 commit
    • F
      Btrfs: fix deadlock when writing out free space caches · 5ce55557
      Filipe Manana committed
      When writing out a block group free space cache we can end up
      deadlocking with ourselves on an extent buffer lock, resulting in a
      warning like the following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G        W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallow the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block group
      creation modifies. Once the new extent buffer is allocated, it allows
      creation of pending block groups to happen again.
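
The guard described above can be sketched as a flag that is cleared around the allocation and set again afterwards. The following Python model is illustrative only and is not the kernel implementation: the `Transaction` class, the `cow_block` function and the tree names are hypothetical, and only the `can_flush_pending_bgs` name comes from the commit message.

```python
from contextlib import contextmanager

# Trees that pending block group creation modifies, per the text above.
TREES_TOUCHED_BY_BG_CREATION = {"extent", "chunk", "device"}


class Transaction:
    """Toy transaction holding a queue of pending block groups."""

    def __init__(self):
        self.can_flush_pending_bgs = True
        self.pending_bgs = []
        self.created_bgs = []

    def create_pending_block_groups(self):
        """Finalize pending block groups, unless flushing is disallowed."""
        if not self.can_flush_pending_bgs:
            return
        self.created_bgs.extend(self.pending_bgs)
        self.pending_bgs.clear()


@contextmanager
def block_group_flush_disabled(trans, tree):
    """Disallow flushing while allocating for a tree the flush would touch."""
    guard = tree in TREES_TOUCHED_BY_BG_CREATION
    if guard:
        trans.can_flush_pending_bgs = False
    try:
        yield
    finally:
        if guard:
            trans.can_flush_pending_bgs = True


def cow_block(trans, tree):
    """Model a COW whose allocation queues a block group and tries to flush.

    Returns True if a block group was finalized during the allocation.
    """
    created_before = len(trans.created_bgs)
    with block_group_flush_disabled(trans, tree):
        trans.pending_bgs.append(f"bg-for-{tree}")
        trans.create_pending_block_groups()
        flushed = len(trans.created_bgs) > created_before
    return flushed
```

With this model, a COW in the extent tree leaves the new block group pending until a later, explicit `create_pending_block_groups()` call, while a COW in an unrelated tree finalizes it immediately.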
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
      Reported-by: E V <eliventer@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 15 Oct, 2018 1 commit
    • L
      Btrfs: kill btrfs_clear_path_blocking · 52398340
      Liu Bo committed
      Btrfs's btree locking has two modes, spinning mode and blocking mode.
      While searching the btree, locks are always acquired in spinning mode
      and then converted to blocking mode if necessary, and in some hot paths
      we may switch the locking back to spinning mode with
      btrfs_clear_path_blocking().
      
      When acquiring locks, both readers and writers need to wait for blocking
      readers and writers to complete before doing read_lock()/write_lock().
      
      The problem is that btrfs_clear_path_blocking() first needs to switch
      the nodes in the path to blocking mode (via btrfs_set_path_blocking())
      to make lockdep happy before doing its actual job of clearing the
      blocking state.
      
      Switching from spinning mode to blocking mode consists of
      
      step 1) bumping up the blocking readers counter and
      step 2) read_unlock()/write_unlock(),
      
      which causes a serious ping-pong effect when there is a large number of
      concurrent readers/writers, as waiters are woken up only to go back to
      sleep immediately.
      
      1) Killing this kind of ping-pong results in a big improvement in my
      1600k file creation script:
      
      MNT=/mnt/btrfs
      mkfs.btrfs -f /dev/sdf
      mount /dev/sdf $MNT
      time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
              -d  $MNT/0  -d  $MNT/1 \
              -d  $MNT/2  -d  $MNT/3 \
              -d  $MNT/4  -d  $MNT/5 \
              -d  $MNT/6  -d  $MNT/7 \
              -d  $MNT/8  -d  $MNT/9 \
              -d  $MNT/10  -d  $MNT/11 \
              -d  $MNT/12  -d  $MNT/13 \
              -d  $MNT/14  -d  $MNT/15
      
      w/o patch:
      real    2m27.307s
      user    0m12.839s
      sys     13m42.831s
      
      w/ patch:
      real    1m2.273s
      user    0m15.802s
      sys     8m16.495s
      
      1.1) latency histogram from funclatency[1]
      
      Overall, with the patch there are ~50% fewer write lock acquisitions,
      and the 95% max latency of taking the write lock also drops to ~100ms
      from >500ms.
      
      --------------------------------------------
      w/o patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 2385222  |****************************************|
               2 -> 3          : 37147    |                                        |
               4 -> 7          : 20452    |                                        |
               8 -> 15         : 13131    |                                        |
              16 -> 31         : 3877     |                                        |
              32 -> 63         : 3900     |                                        |
              64 -> 127        : 2612     |                                        |
             128 -> 255        : 974      |                                        |
             256 -> 511        : 165      |                                        |
             512 -> 1023       : 13       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6743860  |****************************************|
               2 -> 3          : 2146     |                                        |
               4 -> 7          : 190      |                                        |
               8 -> 15         : 38       |                                        |
              16 -> 31         : 4        |                                        |
      
      --------------------------------------------
      w/ patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 1318454  |****************************************|
               2 -> 3          : 6800     |                                        |
               4 -> 7          : 3664     |                                        |
               8 -> 15         : 2145     |                                        |
              16 -> 31         : 809      |                                        |
              32 -> 63         : 219      |                                        |
              64 -> 127        : 10       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6854317  |****************************************|
               2 -> 3          : 2383     |                                        |
               4 -> 7          : 601      |                                        |
               8 -> 15         : 92       |                                        |
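
As a sanity check on the "~50%" figure, the bucket counts for btrfs_tree_lock in the two histograms above can simply be summed and compared. The snippet below is a small Python sketch of that arithmetic; the variable and function names are made up for illustration, and the counts are copied from the histograms.

```python
# btrfs_tree_lock bucket counts, copied from the histograms above.
WRITE_LOCK_WITHOUT_PATCH = [2385222, 37147, 20452, 13131, 3877,
                            3900, 2612, 974, 165, 13]
WRITE_LOCK_WITH_PATCH = [1318454, 6800, 3664, 2145, 809, 219, 10]


def total(counts):
    """Total number of lock acquisitions recorded in a histogram."""
    return sum(counts)


def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


before = total(WRITE_LOCK_WITHOUT_PATCH)
after = total(WRITE_LOCK_WITH_PATCH)
print(f"without patch: {before} acquisitions")
print(f"with patch:    {after} acquisitions")
print(f"reduction:     {reduction_percent(before, after):.1f}%")
```

The totals are 2467493 and 1332101 acquisitions, a reduction of roughly 46%, which is consistent with the approximate figure quoted above.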
      
      2) dbench also shows the improvement:
      dbench -t 120 -D /mnt/btrfs 16
      
      w/o patch:
      Throughput 158.363 MB/sec
      
      w/ patch:
      Throughput 449.52 MB/sec
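
The reported numbers can be turned into speedup factors with a little arithmetic. This Python sketch uses hypothetical helper names and takes the wall-clock ("real") times of the file creation script and the dbench throughput figures directly from the text above.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '2m27.307s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def speedup(before, after):
    """How many times faster `after` is than `before` (durations)."""
    return before / after


# Wall-clock time of the file creation script, from the text above.
real_speedup = speedup(parse_time("2m27.307s"), parse_time("1m2.273s"))

# dbench throughput in MB/sec, from the text above (higher is better).
dbench_gain = 449.52 / 158.363

print(f"file creation: {real_speedup:.2f}x faster")
print(f"dbench:        {dbench_gain:.2f}x throughput")
```

Under these figures the file creation run finishes about 2.4 times faster and dbench throughput rises by a factor of about 2.8.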
      
      3) xfstests didn't show any additional failures.
      
      One thing to note is that callers may set path->leave_spinning to have
      all nodes in the path stay in spinning mode, which means those callers
      are prepared not to sleep before releasing the path; this causes no
      problems if they do not want to sleep in blocking mode.
      
      [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
      
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>