1. 09 9月, 2019 5 次提交
    • J
      btrfs: migrate the block group lookup code · 2e405ad8
      Josef Bacik 提交于
      Move these bits first as they are the easiest to move.  Export two of
      the helpers so they can be moved all at once.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor style updates ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2e405ad8
    • J
      btrfs: move basic block_group definitions to their own header · aac0023c
      Josef Bacik 提交于
      This is prep work for moving all of the block group cache code into its
      own file.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      aac0023c
    • Q
      btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type · 2a28468e
      Qu Wenruo 提交于
      [BUG]
      With fuzzed image and MIXED_GROUPS super flag, we can hit the following
      BUG_ON():
      
        kernel BUG at fs/btrfs/delayed-ref.c:491!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 0 PID: 1849 Comm: sync Tainted: G           O      5.2.0-custom #27
        RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs]
        Call Trace:
         add_delayed_ref_head+0x20c/0x2d0 [btrfs]
         btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs]
         btrfs_free_tree_block+0x123/0x380 [btrfs]
         __btrfs_cow_block+0x435/0x500 [btrfs]
         btrfs_cow_block+0x110/0x240 [btrfs]
         btrfs_search_slot+0x230/0xa00 [btrfs]
         ? __lock_acquire+0x105e/0x1e20
         btrfs_insert_empty_items+0x67/0xc0 [btrfs]
         alloc_reserved_file_extent+0x9e/0x340 [btrfs]
         __btrfs_run_delayed_refs+0x78e/0x1240 [btrfs]
         ? kvm_clock_read+0x18/0x30
         ? __sched_clock_gtod_offset+0x21/0x50
         btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs]
         btrfs_run_delayed_refs+0x23/0x30 [btrfs]
         btrfs_commit_transaction+0x53/0x9f0 [btrfs]
         btrfs_sync_fs+0x7c/0x1c0 [btrfs]
         ? __ia32_sys_fdatasync+0x20/0x20
         sync_fs_one_sb+0x23/0x30
         iterate_supers+0x95/0x100
         ksys_sync+0x62/0xb0
         __ia32_sys_sync+0xe/0x20
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      This situation is caused by several factors:
      - Fuzzed image
        The extent tree of this fs missed one backref for extent tree root.
        So we can allocated space from that slot.
      
      - MIXED_BG feature
        Super block has MIXED_BG flag.
      
      - No mixed block groups exists
        All block groups are just regular ones.
      
      This makes data space_info->block_groups[] contains metadata block
      groups.  And when we reserve space for data, we can use space in
      metadata block group.
      
      Then we hit the following file operations:
      
      - fallocate
        We need to allocate data extents.
        find_free_extent() choose to use the metadata block to allocate space
        from, and choose the space of extent tree root, since its backref is
        missing.
      
        This generate one delayed ref head with is_data = 1.
      
      - extent tree update
        We need to update extent tree at run_delayed_ref time.
      
        This generate one delayed ref head with is_data = 0, for the same
        bytenr of old extent tree root.
      
      Then we trigger the BUG_ON().
      
      [FIX]
      The quick fix here is to check block_group->flags before using it.
      
      The problem can only happen for MIXED_GROUPS fs. Regular filesystems
      won't have space_info with DATA|METADATA flag, and no way to hit the
      bug.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255Reported-by: NJungyeon Yoon <jungyeon.yoon@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2a28468e
    • Q
      btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate() · 112974d4
      Qu Wenruo 提交于
      [BUG]
      Test case btrfs/156 fails since commit 302167c5 ("btrfs: don't end
      the transaction for delayed refs in throttle") with ENOSPC.
      
      [CAUSE]
      The ENOSPC is reported from btrfs_can_relocate().
      
      This function will check:
      - If this block group is empty, we can relocate
      - If we can enough free space, we can relocate
      
      Above checks are valid but the following check is vague due to its
      implementation:
      - If and only if we can allocated a new block group to contain all the
        used space, we can relocate
      
      This design itself is OK, but the way to determine if we can allocate a
      new block group is problematic.
      
      btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
      device.
      However find_free_dev_extent() only searches commit root and excludes
      dev extents allocated in current trans, this makes it unable to use dev
      extent just freed in current transaction.
      
      So for the following example, btrfs_can_relocate() will report ENOSPC:
      The example block group layout:
      1M      129M        257M       385M      513M       550M
      |///////|///////////|//////////|         |          |
      // = Used bg, consider all bg is 100% used for easy calculation.
      And all block groups are SINGLE, on-disk bytenr is the same as the
      logical bytenr.
      
      1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
      1M      129M        257M       385M      513M       550M
      |///////|           |//////////|/////////|
      In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)
      
      However transid 100 is not committed yet, so in dev commit tree, we
      still have the old dev extents layout:
      1M      129M        257M       385M      513M       550M
      |///////|///////////|//////////|         |          |
      
      2) Try to relocate bg [257M, 385M)
      We goes into btrfs_can_relocate(), no free space in current bgs, so we
      check if we can find large enough free dev extents.
      
      The first slot is [385M, 513M), but that is already used by new bg at
      [385M, 513M), so we continue search.
      
      The remaining slot is [512M, 550M), smaller than the bg's length 128M.
      So btrfs_can_relocate report ENOSPC.
      
      However this is over killed, in fact if we just skip btrfs_can_relocate()
      check, and go into regular relocation routine, at extent reservation time,
      if we can't find free extent, then we fallback to commit transaction,
      which will free up the dev extents and allow new block group to be created.
      
      [FIX]
      The fix here is to remove btrfs_can_relocate() completely.
      
      If we hit the false ENOSPC case just like btrfs/156, extent allocator
      will push harder by committing transaction and we will have space for
      new block group, avoiding the false ENOSPC.
      
      If we really ran out of space, we will hit ENOSPC at
      relocate_block_group(), and btrfs will just reports the ENOSPC error as
      usual.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      112974d4
    • Q
      btrfs: extent-tree: Add comment for inc_block_group_ro() · e9138142
      Qu Wenruo 提交于
      inc_block_group_ro() is only designed to mark one block group read-only,
      it doesn't really care if other block groups have enough free space to
      contain the used space in the block group.
      
      However due to the close connection between this function and
      relocation, sometimes we can be confused and think this function is
      responsible for balance space reservation, which is not true.
      
      Add some comment to make the functionality clear.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e9138142
  2. 07 8月, 2019 2 次提交
    • Q
      btrfs: trim: Check the range passed into to prevent overflow · 07301df7
      Qu Wenruo 提交于
      Normally the range->len is set to default value (U64_MAX), but when it's
      not default value, we should check if the range overflows.
      
      And if it overflows, return -EINVAL before doing anything.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      07301df7
    • F
      Btrfs: fix sysfs warning and missing raid sysfs directories · d7cd4dd9
      Filipe Manana 提交于
      In the 5.3 merge window, commit 7c7e3014 ("btrfs: sysfs: Replace
      default_attrs in ktypes with groups"), we started using the member
      "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
      to a series of warnings when running some test cases of fstests, such
      as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
      warnings are like the following:
      
        [116648.059212] kernfs: can not remove 'total_bytes', no directory
        [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
        [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
        [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.077143] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.077852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.080585] Call Trace:
        [116648.081262]  remove_files+0x31/0x70
        [116648.081929]  sysfs_remove_group+0x38/0x80
        [116648.082596]  sysfs_remove_groups+0x34/0x70
        [116648.083258]  kobject_del+0x20/0x60
        [116648.083933]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.084608]  close_ctree+0x19a/0x380 [btrfs]
        [116648.085278]  generic_shutdown_super+0x6c/0x110
        [116648.085951]  kill_anon_super+0xe/0x30
        [116648.086621]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.087289]  deactivate_locked_super+0x3a/0x70
        [116648.087956]  cleanup_mnt+0xb4/0x160
        [116648.088620]  task_work_run+0x7e/0xc0
        [116648.089285]  exit_to_usermode_loop+0xfa/0x100
        [116648.089933]  do_syscall_64+0x1cb/0x220
        [116648.090567]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.091197] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.100046] ---[ end trace 22e24db328ccadf8 ]---
        [116648.100618] ------------[ cut here ]------------
        [116648.101175] kernfs: can not remove 'used_bytes', no directory
        [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
        [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
        [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.113268] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.113939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.116649] Call Trace:
        [116648.117326]  remove_files+0x31/0x70
        [116648.117997]  sysfs_remove_group+0x38/0x80
        [116648.118671]  sysfs_remove_groups+0x34/0x70
        [116648.119342]  kobject_del+0x20/0x60
        [116648.120022]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.120707]  close_ctree+0x19a/0x380 [btrfs]
        [116648.121396]  generic_shutdown_super+0x6c/0x110
        [116648.122057]  kill_anon_super+0xe/0x30
        [116648.122702]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.123335]  deactivate_locked_super+0x3a/0x70
        [116648.123961]  cleanup_mnt+0xb4/0x160
        [116648.124586]  task_work_run+0x7e/0xc0
        [116648.125210]  exit_to_usermode_loop+0xfa/0x100
        [116648.125830]  do_syscall_64+0x1cb/0x220
        [116648.126463]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.127080] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.135923] ---[ end trace 22e24db328ccadf9 ]---
      
      These happen because, during the unmount path, we call kobject_del() for
      raid kobjects that are not fully initialized, meaning that we set their
      ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
      their parent kobject, which is done through btrfs_add_raid_kobjects().
      
      We have this split raid kobject setup since commit 75cb379d
      ("btrfs: defer adding raid type kobject until after chunk relocation") in
      order to avoid triggering reclaim during contextes where we can not
      (either we are holding a transaction handle or some lock required by
      the transaction commit path), so that we do the calls to kobject_add(),
      which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
      in contextes where it is safe to trigger reclaim. That change expected
      that a new raid kobject can only be created either when mounting the
      filesystem or after raid profile conversion through the relocation path.
      However, we can have new raid kobject created in other two cases at least:
      
      1) During device replace (or scrub) after adding a device a to the
         filesystem. The replace procedure (and scrub) do calls to
         btrfs_inc_block_group_ro() which can allocate a new block group
         with a new raid profile (because we now have more devices). This
         can be triggered by test cases btrfs/027 and btrfs/176.
      
      2) During a degraded mount trough any write path. This can be triggered
         by test case btrfs/124.
      
      Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
      makes things more complex and fragile, can also introduce deadlocks with
      reclaim the following way:
      
      1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
         anywhere in the replace/scrub path will cause a deadlock with reclaim
         because if reclaim happens and a transaction commit is triggered,
         the transaction commit path will block at btrfs_scrub_pause().
      
      2) During degraded mounts it is essentially impossible to figure out where
         to add extra calls to btrfs_add_raid_kobjects(), because allocation of
         a block group with a new raid profile can happen anywhere, which means
         we can't safely figure out which contextes are safe for reclaim, as
         we can either hold a transaction handle or some lock needed by the
         transaction commit path.
      
      So it is too complex and error prone to have this split setup of raid
      kobjects. So fix the issue by consolidating the setup of the kobjects in a
      single place, at link_block_group(), and setup a nofs context there in
      order to prevent reclaim being triggered by the memory allocations done
      through the call chain of kobject_add().
      
      Besides fixing the sysfs warnings during kobject_del(), this also ensures
      the sysfs directories for the new raid profiles end up created and visible
      to users (a bug that existed before the 5.3 commit 7c7e3014
      ("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
      
      Fixes: 75cb379d ("btrfs: defer adding raid type kobject until after chunk relocation")
      Fixes: 7c7e3014 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d7cd4dd9
  3. 04 7月, 2019 4 次提交
  4. 02 7月, 2019 21 次提交
  5. 01 7月, 2019 8 次提交