  1. 19 Feb, 2020 2 commits
  2. 13 Feb, 2020 1 commit
  3. 31 Jan, 2020 1 commit
    • Btrfs: fix race between adding and putting tree mod seq elements and nodes · 7227ff4d
      Filipe Manana committed
      There is a race between adding and removing elements to the tree mod log
      list and rbtree that can lead to use-after-free problems.
      
      Consider the following example that explains how and why the problem happens:
      
      1) Task A has a mod log element with sequence number 200. It is currently
         the only element in the mod log list;
      
      2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
         access the tree mod log. When it enters the function, it initializes
         'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
         before checking if there are other elements in the mod seq list.
         Since the list is empty, 'min_seq' remains set to (u64)-1. Then it
         unlocks the lock 'tree_mod_seq_lock';
      
      3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
         itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
         sequence number of 201;
      
      4) Some other task, name it task C, modifies a btree and because there
         are elements in the mod seq list, it adds a tree mod elem to the tree
         mod log rbtree. That node added to the mod log rbtree is assigned
         a sequence number of 202;
      
      5) Task B, which is doing fiemap and resolving indirect back references,
         calls get_old_root(), with 'time_seq' == 201, which in turn
         calls tree_mod_log_search() - the search returns the mod log node
         from the rbtree with sequence number 202, created by task C;
      
      6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
         the mod log rbtree and finds the node with sequence number 202. Since
         202 is less than the previously computed 'min_seq', (u64)-1, it
         removes the node and frees it;
      
      7) Task B still has a pointer to the node with sequence number 202, and
         it dereferences the pointer itself and through the call to
         __tree_mod_log_rewind(), resulting in a use-after-free problem.
      
      This issue can be triggered sporadically with the test case generic/561
      from fstests, and it happens more frequently with a higher number of
      duperemove processes. When it happens to me, it either freezes the VM or
      it produces a trace like the following before crashing:
      
        [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
        [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
        [ 1245.321307] Code: ....
        [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
        [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
        [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
        [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
        [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
        [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
        [ 1245.321539] FS:  00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
        [ 1245.321591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
        [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [ 1245.321706] Call Trace:
        [ 1245.321798]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
        [ 1245.321841]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
        [ 1245.321877]  resolve_indirect_refs+0x1eb/0xc60 [btrfs]
        [ 1245.321912]  find_parent_nodes+0x3dc/0x11b0 [btrfs]
        [ 1245.321947]  btrfs_check_shared+0x115/0x1c0 [btrfs]
        [ 1245.321980]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
        [ 1245.322029]  extent_fiemap+0x59d/0x6d0 [btrfs]
        [ 1245.322066]  do_vfs_ioctl+0x45a/0x750
        [ 1245.322081]  ksys_ioctl+0x70/0x80
        [ 1245.322092]  ? trace_hardirqs_off_thunk+0x1a/0x1c
        [ 1245.322113]  __x64_sys_ioctl+0x16/0x20
        [ 1245.322126]  do_syscall_64+0x5c/0x280
        [ 1245.322139]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 1245.322155] RIP: 0033:0x7fdee3942dd7
        [ 1245.322177] Code: ....
        [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
        [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
        [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
        [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
        [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
        [ 1245.322423] Modules linked in: ....
        [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
      
      Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
      sequence number and iterates the rbtree while holding the lock
      'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
      lock, since it is now redundant.
      
      Fixes: bd989ba3 ("Btrfs: add tree modification log functions")
      Fixes: 097b8a7c ("Btrfs: join tree mod log code with the code holding back delayed refs")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
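The fix described in the commit above can be illustrated with a small user-space model. This is only a sketch in Python, not the kernel code: the class `TreeModLog` and its methods are hypothetical names, a plain mutex stands in for 'tree_mod_log_lock' held in write mode, and a list of sequence numbers stands in for the rbtree. The point it shows is that the minimum sequence number is computed and the log is purged under a single lock hold, so no reader can register in between.

```python
import threading


class TreeModLog:
    """Toy model (hypothetical names) of the mod seq list and the mod log."""

    def __init__(self):
        # Stand-in for 'tree_mod_log_lock'; the kernel uses an rwlock, a
        # plain mutex is enough to model holding it in write mode.
        self.lock = threading.Lock()
        self.next_seq = 200
        self.active_seqs = []  # models the mod seq list (registered readers)
        self.entries = []      # models the mod log rbtree (sequence numbers only)

    def get_seq(self):
        """Register a reader and hand it a sequence number."""
        with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            self.active_seqs.append(seq)
            return seq

    def log_modification(self):
        """Log a modification only while at least one reader is registered."""
        with self.lock:
            if not self.active_seqs:
                return None
            seq = self.next_seq
            self.next_seq += 1
            self.entries.append(seq)
            return seq

    def put_seq(self, seq):
        """Unregister a reader; compute the minimum and purge in ONE critical section."""
        with self.lock:
            self.active_seqs.remove(seq)
            min_seq = min(self.active_seqs, default=None)
            if min_seq is None:
                # No readers left: every logged entry can go.
                self.entries.clear()
            else:
                # Keep everything a remaining reader may still need.
                self.entries = [s for s in self.entries if s >= min_seq]


log = TreeModLog()
task_a = log.get_seq()           # 200
task_b = log.get_seq()           # 201
node_c = log.log_modification()  # 202
log.put_seq(task_a)              # task B is still registered, so 202 survives
print(log.entries)
```

Because task B registers either entirely before or entirely after task A's `put_seq()` critical section, the stale "list is empty" decision from the commit's step 2 can no longer be applied to a node that task B is still using.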
  4. 24 Jan, 2020 1 commit
    • btrfs: free block groups after free'ing fs trees · 4e19443d
      Josef Bacik committed
      Sometimes when running generic/475 we would trip the
      WARN_ON(cache->reserved) check when free'ing the block groups on umount.
      This is because sometimes we don't commit the transaction because of IO
      errors and thus do not clean up the tree logs until umount time.
      
      These blocks are still reserved until they are cleaned up, but they
      aren't cleaned up until _after_ we do the free block groups work.  Fix
      this by moving the free after free'ing the fs roots, that way all of the
      tree logs are cleaned up and we have a properly cleaned fs.  A bunch of
      loops of generic/475 confirmed this fixes the problem.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 20 Jan, 2020 5 commits
    • btrfs: add the beginning of async discard, discard workqueue · b0643e59
      Dennis Zhou committed
      When discard is enabled, every time a pinned extent is released back to
      the block_group's free space cache, a discard is issued for the extent.
      This is an overeager approach when it comes to discarding and helping
      the SSD maintain enough free space to prevent severe garbage collection
      situations.
      
      This adds the beginning of async discard. Instead of issuing a discard
      prior to returning it to the free space, it is just marked as untrimmed.
      The block_group is then added to an LRU which then feeds into a workqueue
      to issue discards at a much slower rate. Full discarding of unused block
      groups is still done and will be addressed in a future patch of the
      series.
      
      For now, we don't persist the discard state of extents and bitmaps.
      Therefore, our failure recovery mode will be to consider extents
      untrimmed. This lets us handle failure and unmounting as one and the
      same.
      
      On a number of Facebook webservers, I collected data every minute
      accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
      and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
      is where we discard extents synchronously before returning them to the
      free space cache.
      
      discard=sync:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          ---------------------------------------------------------------
           Drive A  |           434           |          1170
           Drive B  |           880           |          2330
           Drive C  |          2943           |          3920
           Drive D  |          4763           |          5701
      
      discard=async:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          --------------------------------------------------------------
           Drive A  |           134           |           956
           Drive B  |            64           |          1972
           Drive C  |            59           |          1032
           Drive D  |            62           |          1200
      
      While it's not great that the stats are cumulative over 1m, all of these
      servers are running the same workload and the delta between the two
      is substantial. We are spending significantly less time in
      btrfs_finish_extent_commit() which is responsible for discarding.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
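To make the size of the delta in the two tables concrete, the relative reduction per drive can be computed directly from the reported p99 numbers. The snippet below is a small sketch in Python; the helper name `reduction_pct` is made up for illustration, and the only data used is what the commit message reports.

```python
# p99 total time per minute in ms, copied from the two tables above:
# (extent_commit(), commit_trans()) per drive.
sync_ms = {"A": (434, 1170), "B": (880, 2330), "C": (2943, 3920), "D": (4763, 5701)}
async_ms = {"A": (134, 956), "B": (64, 1972), "C": (59, 1032), "D": (62, 1200)}


def reduction_pct(before, after):
    """Relative reduction going from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


for drive in sync_ms:
    extent_commit = reduction_pct(sync_ms[drive][0], async_ms[drive][0])
    commit_trans = reduction_pct(sync_ms[drive][1], async_ms[drive][1])
    print(f"Drive {drive}: extent_commit() -{extent_commit:.1f}%, "
          f"commit_trans() -{commit_trans:.1f}%")
```

With these numbers the time in extent_commit() drops by roughly 69% to 99% depending on the drive, while commit_trans() drops by roughly 15% to 79%.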
    • btrfs: drop create parameter to btrfs_get_extent() · 39b07b5d
      Omar Sandoval committed
      We only pass this as 1 from __extent_writepage_io(). The parameter
      basically means "pretend I didn't pass in a page". This is silly since
      we can simply not pass in the page. Get rid of the parameter from
      btrfs_get_extent(), and since it's used as a get_extent_t callback,
      remove it from get_extent_t and btree_get_extent(), neither of which
      need it.
      
      While we're here, let's document btrfs_get_extent().
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs, merge btrfs_sysfs_add devices_kobj and fsid · bc036bb3
      Anand Jain committed
      Merge the btrfs_sysfs_add_fsid() and btrfs_sysfs_add_devices_kobj()
      functions, as both are small and they are called one after the other.
      
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs, rename btrfs_sysfs_add_device() · be2cf92e
      Anand Jain committed
      btrfs_sysfs_add_device() creates the directory
      /sys/fs/btrfs/UUID/devices but its function name is misleading. Rename
      it to btrfs_sysfs_add_devices_kobj() instead. No functional changes.
      
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs, btrfs_sysfs_add_fsid() drop unused argument parent · c6761a9e
      Anand Jain committed
      Commit 24bd69cb ("Btrfs: sysfs: add support to add parent for fsid")
      added the parent argument in preparation for showing the seed fsid under
      the sprout fsid, as in patch [1] on the mailing list.
      
       [1] Btrfs: sysfs: support seed devices in the sysfs layout
      
      But later this idea was superseded by another idea to rename the fsid as
      in the commit f93c3997 ("btrfs: factor out sysfs code for updating
      sprout fsid").
      
      So we don't need the parent argument anymore.
      
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 19 Nov, 2019 14 commits
  7. 18 Nov, 2019 10 commits
  8. 15 Oct, 2019 1 commit
  9. 09 Sep, 2019 3 commits
    • btrfs: Detect unbalanced tree with empty leaf before crashing btree operations · 62fdaa52
      Qu Wenruo committed
      [BUG]
      With a crafted image, btrfs will panic at btree operations:
      
        kernel BUG at fs/btrfs/ctree.c:3894!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1138 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:__push_leaf_left+0x6b6/0x6e0
        RSP: 0018:ffffc0bd4128b990 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffffa0a4ab8f0e38 RCX: 0000000000000000
        RDX: ffffa0a280000000 RSI: 0000000000000000 RDI: ffffa0a4b3814000
        RBP: ffffc0bd4128ba38 R08: 0000000000001000 R09: ffffc0bd4128b948
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000240
        R13: ffffa0a4b556fb60 R14: ffffa0a4ab8f0af0 R15: ffffa0a4ab8f0af0
        FS: 0000000000000000(0000) GS:ffffa0a4b7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f2461c80020 CR3: 000000022b32a006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        push_leaf_left+0x179/0x190
        btrfs_del_items+0x316/0x470
        btrfs_del_csums+0x215/0x3a0
        __btrfs_free_extent.isra.72+0x5a7/0xbe0
        __btrfs_run_delayed_refs+0x539/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace c2425e6e89b5558f ]---
      
      [CAUSE]
      The offending csum tree looks like this:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
      	  ...
      	  key (EXTENT_CSUM EXTENT_CSUM 85975040) block 29630464 gen 17
      	  key (EXTENT_CSUM EXTENT_CSUM 89911296) block 29642752 gen 17 <<<
      	  key (EXTENT_CSUM EXTENT_CSUM 92274688) block 29646848 gen 17
      	  ...
      
        leaf 29630464 items 6 free space 1 generation 17 owner CSUM_TREE
      	  item 0 key (EXTENT_CSUM EXTENT_CSUM 85975040) itemoff 3987 itemsize 8
      		  range start 85975040 end 85983232 length 8192
      	  ...
        leaf 29642752 items 0 free space 3995 generation 17 owner 0
      		      ^ empty leaf            invalid owner ^
      
        leaf 29646848 items 1 free space 602 generation 17 owner CSUM_TREE
      	  item 0 key (EXTENT_CSUM EXTENT_CSUM 92274688) itemoff 627 itemsize 3368
      		  range start 92274688 end 95723520 length 3448832
      
      So we have a corrupted csum tree where one tree leaf is completely
      empty, causing an unbalanced btree, which in turn leads to an unexpected
      btree balance error.
      
      [FIX]
      For this particular case, we catch it in two ways:
      - Check whether the tree block is empty in btrfs_verify_level_key(),
        so that invalid tree blocks won't be read out through
        btrfs_search_slot() and its variants.
      
      - Check for tree owner 0 in the tree checker.
        No tree uses 0 as its tree owner, so detect it and reject the block
        at tree block read time.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202821
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
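The two checks listed under [FIX] above can be sketched as a small validation routine. This is a simplified Python model, not the kernel's tree checker: `LeafHeader`, `check_leaf` and the `reached_via_parent` flag are hypothetical names, and the model assumes that an empty leaf is only a problem when it was reached through a parent node pointer, as with leaf 29642752 in the dump.

```python
from dataclasses import dataclass
from typing import List

CSUM_TREE = 7  # BTRFS_CSUM_TREE_OBJECTID


@dataclass
class LeafHeader:
    bytenr: int
    nritems: int
    owner: int  # tree objectid stored in the block header; 0 is never valid


def check_leaf(leaf: LeafHeader, reached_via_parent: bool) -> List[str]:
    """Return a list of problems; an empty list means the leaf passes both checks."""
    problems = []
    if leaf.owner == 0:
        problems.append("invalid owner 0")
    if reached_via_parent and leaf.nritems == 0:
        problems.append("empty leaf referenced by a parent node")
    return problems


# The healthy and the corrupted leaf from the csum tree dump above.
good = LeafHeader(bytenr=29630464, nritems=6, owner=CSUM_TREE)
bad = LeafHeader(bytenr=29642752, nritems=0, owner=0)
print(check_leaf(good, reached_via_parent=True))
print(check_leaf(bad, reached_via_parent=True))
```

Rejecting the block when it is read means the corruption surfaces as an error from the read path rather than as a BUG_ON deep inside a leaf-balancing operation.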
    • btrfs: Make reada_tree_block_flagged private · 4f84bd7f
      Nikolay Borisov committed
      This function is used only by the readahead machinery. It makes no
      sense to keep it visible outside the reada.c file. Place it above its
      sole caller and make it static. No functional changes.
      
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move basic block_group definitions to their own header · aac0023c
      Josef Bacik committed
      This is prep work for moving all of the block group cache code into its
      own file.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 07 Aug, 2019 1 commit
    • Btrfs: fix sysfs warning and missing raid sysfs directories · d7cd4dd9
      Filipe Manana committed
      In the 5.3 merge window, with commit 7c7e3014 ("btrfs: sysfs: Replace
      default_attrs in ktypes with groups"), we started using the member
      "default_groups" for the kobject type "btrfs_raid_ktype". That leads
      to a series of warnings when running some test cases of fstests, such
      as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
      warnings are like the following:
      
        [116648.059212] kernfs: can not remove 'total_bytes', no directory
        [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
        [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
        [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.077143] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.077852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.080585] Call Trace:
        [116648.081262]  remove_files+0x31/0x70
        [116648.081929]  sysfs_remove_group+0x38/0x80
        [116648.082596]  sysfs_remove_groups+0x34/0x70
        [116648.083258]  kobject_del+0x20/0x60
        [116648.083933]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.084608]  close_ctree+0x19a/0x380 [btrfs]
        [116648.085278]  generic_shutdown_super+0x6c/0x110
        [116648.085951]  kill_anon_super+0xe/0x30
        [116648.086621]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.087289]  deactivate_locked_super+0x3a/0x70
        [116648.087956]  cleanup_mnt+0xb4/0x160
        [116648.088620]  task_work_run+0x7e/0xc0
        [116648.089285]  exit_to_usermode_loop+0xfa/0x100
        [116648.089933]  do_syscall_64+0x1cb/0x220
        [116648.090567]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.091197] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.100046] ---[ end trace 22e24db328ccadf8 ]---
        [116648.100618] ------------[ cut here ]------------
        [116648.101175] kernfs: can not remove 'used_bytes', no directory
        [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
        [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
        [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.113268] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.113939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.116649] Call Trace:
        [116648.117326]  remove_files+0x31/0x70
        [116648.117997]  sysfs_remove_group+0x38/0x80
        [116648.118671]  sysfs_remove_groups+0x34/0x70
        [116648.119342]  kobject_del+0x20/0x60
        [116648.120022]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.120707]  close_ctree+0x19a/0x380 [btrfs]
        [116648.121396]  generic_shutdown_super+0x6c/0x110
        [116648.122057]  kill_anon_super+0xe/0x30
        [116648.122702]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.123335]  deactivate_locked_super+0x3a/0x70
        [116648.123961]  cleanup_mnt+0xb4/0x160
        [116648.124586]  task_work_run+0x7e/0xc0
        [116648.125210]  exit_to_usermode_loop+0xfa/0x100
        [116648.125830]  do_syscall_64+0x1cb/0x220
        [116648.126463]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.127080] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.135923] ---[ end trace 22e24db328ccadf9 ]---
      
      These happen because, during the unmount path, we call kobject_del() for
      raid kobjects that are not fully initialized, meaning that we set their
      ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
      their parent kobject, which is done through btrfs_add_raid_kobjects().
      
      We have this split raid kobject setup since commit 75cb379d
      ("btrfs: defer adding raid type kobject until after chunk relocation") in
      order to avoid triggering reclaim in contexts where we cannot
      (either we are holding a transaction handle or some lock required by
      the transaction commit path), so that we do the calls to kobject_add(),
      which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
      in contexts where it is safe to trigger reclaim. That change expected
      that a new raid kobject can only be created either when mounting the
      filesystem or after raid profile conversion through the relocation path.
      However, a new raid kobject can be created in at least two other cases:
      
      1) During device replace (or scrub) after adding a device to the
         filesystem. The replace procedure (and scrub) makes calls to
         btrfs_inc_block_group_ro() which can allocate a new block group
         with a new raid profile (because we now have more devices). This
         can be triggered by test cases btrfs/027 and btrfs/176.
      
      2) During a degraded mount through any write path. This can be triggered
         by test case btrfs/124.
      
      Fixing this by adding extra calls to btrfs_add_raid_kobjects() not only
      makes things more complex and fragile, but can also introduce deadlocks
      with reclaim in the following way:
      
      1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
         anywhere in the replace/scrub path will cause a deadlock with reclaim
         because if reclaim happens and a transaction commit is triggered,
         the transaction commit path will block at btrfs_scrub_pause().
      
      2) During degraded mounts it is essentially impossible to figure out where
         to add extra calls to btrfs_add_raid_kobjects(), because allocation of
         a block group with a new raid profile can happen anywhere, which means
         we can't safely figure out which contexts are safe for reclaim, as
         we can either hold a transaction handle or some lock needed by the
         transaction commit path.
      
      So it is too complex and error-prone to have this split setup of raid
      kobjects. Fix the issue by consolidating the setup of the kobjects in a
      single place, at link_block_group(), and set up a nofs context there in
      order to prevent reclaim from being triggered by the memory allocations
      done through the call chain of kobject_add().
      
      Besides fixing the sysfs warnings during kobject_del(), this also ensures
      the sysfs directories for the new raid profiles end up created and visible
      to users (a bug that existed before the 5.3 commit 7c7e3014
      ("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
      
      Fixes: 75cb379d ("btrfs: defer adding raid type kobject until after chunk relocation")
      Fixes: 7c7e3014 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 17 Jul, 2019 1 commit
    • btrfs: free checksum hash on in close_ctree · bfcea1c6
      Johannes Thumshirn committed
      fs_info::csum_hash gets initialized in btrfs_init_csum_hash() which is
      called by open_ctree().
      
      But it only gets freed if open_ctree() fails, not during normal operation.
      
      This leads to a memory leak like the following found by kmemleak:
      
      unreferenced object 0xffff888132cb8720 (size 96):
        comm "mount", pid 450, jiffies 4294912436 (age 17.584s)
        hex dump (first 32 bytes):
          04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000000c9643d4>] crypto_create_tfm+0x2d/0xd0
          [<00000000ae577f68>] crypto_alloc_tfm+0x4b/0xb0
          [<000000002b5cdf30>] open_ctree+0xb84/0x2060 [btrfs]
          [<0000000043204297>] btrfs_mount_root+0x552/0x640 [btrfs]
          [<00000000c99b10ea>] legacy_get_tree+0x22/0x40
          [<0000000071a6495f>] vfs_get_tree+0x1f/0xc0
          [<00000000f180080e>] fc_mount+0x9/0x30
          [<000000009e36cebd>] vfs_kern_mount.part.11+0x6a/0x80
          [<0000000004594c05>] btrfs_mount+0x174/0x910 [btrfs]
          [<00000000c99b10ea>] legacy_get_tree+0x22/0x40
          [<0000000071a6495f>] vfs_get_tree+0x1f/0xc0
          [<00000000b86e92c5>] do_mount+0x6b0/0x940
          [<0000000097464494>] ksys_mount+0x7b/0xd0
          [<0000000057213c80>] __x64_sys_mount+0x1c/0x20
          [<00000000cb689b5e>] do_syscall_64+0x43/0x130
          [<000000002194e289>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Free fs_info::csum_hash in close_ctree() to avoid the memory leak.
      
      Fixes: 6d97c6e3 ("btrfs: add boilerplate code for directly including the crypto framework")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>