1. 23 February 2021, 1 commit
    • btrfs: fix race between writes to swap files and scrub · 195a49ea
      Committed by Filipe Manana
      When we activate a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents from being changed by operations such as balance and device
      replace/resize/remove. There we also call can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents being
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group RO before it processes it and
      then back to RW mode after processing it. That means an attempt to write
      into a swap file extent while scrub is processing the respective block
      group will result in COWing the extent, changing its physical location
      on disk.
      
      Fix this by making sure that block groups that have extents used by
      active swap files can not be turned RO, therefore making it impossible
      for a scrub to turn them RO. When a scrub finds a block group that can
      not be turned RO due to the existence of extents used by swap files, it
      proceeds to the next block group and logs a warning message noting that
      the block group was skipped due to active swap files - this is the same
      approach we currently use for balance.
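
      A minimal sketch of the approach (the counter name, helper, and error
      code here are illustrative, not necessarily the exact ones merged):

        /* Track extents of active swap files per block group. */
        static bool bg_inc_swap_extents(struct btrfs_block_group *bg)
        {
                bool ret = true;

                spin_lock(&bg->lock);
                if (bg->ro)
                        ret = false;    /* group already RO, refuse swap use */
                else
                        bg->swap_extents++;
                spin_unlock(&bg->lock);
                return ret;
        }

        /* In btrfs_inc_block_group_ro(): refuse RO for swap-backed groups. */
        if (bg->swap_extents) {
                btrfs_warn(bg->fs_info,
                           "skipping block group %llu: active swap file extents",
                           bg->start);
                return -ETXTBSY;
        }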
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      195a49ea
  2. 09 February 2021, 12 commits
    • btrfs: zoned: extend zoned allocator to use dedicated tree-log block group · 40ab3be1
      Committed by Naohiro Aota
      This is patch 1/3 to enable the tree log on zoned filesystems.
      
      The tree-log feature does not work on a zoned filesystem as is. Blocks
      for a tree-log tree are allocated mixed with other metadata blocks, and
      btrfs writes and syncs the tree-log blocks to devices at fsync() time,
      which has a different timing than a global transaction commit. As a
      result, both writing tree-log blocks and writing other metadata blocks
      become non-sequential writes, which zoned filesystems must avoid.
      
      Introduce a dedicated block group for tree-log blocks, so that tree-log
      blocks and other metadata blocks can be separate write streams. As a
      result, each write stream can now be written to devices separately.
      "fs_info->treelog_bg" tracks the dedicated block group and is assigned
      on demand at tree-log block allocation time.
      
      This commit extends the zoned block allocator to use the block group.
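
      A rough sketch of the allocator-side check (treelog_bg is named in the
      text above; the lock and surrounding variables here are illustrative):

        /* Tree-log blocks may only come from the dedicated block group,
         * and other metadata allocations must stay out of it. */
        spin_lock(&fs_info->treelog_bg_lock);
        if (for_treelog) {
                if (!fs_info->treelog_bg)
                        fs_info->treelog_bg = block_group->start; /* claim */
                else if (fs_info->treelog_bg != block_group->start)
                        skip = true;    /* not the tree-log group */
        } else if (fs_info->treelog_bg == block_group->start) {
                skip = true;            /* reserved for the tree-log */
        }
        spin_unlock(&fs_info->treelog_bg_lock);
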
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40ab3be1
    • btrfs: extend btrfs_rmap_block for specifying a device · 138082f3
      Committed by Naohiro Aota
      btrfs_rmap_block currently reverse-maps the physical addresses on all
      devices to the corresponding logical addresses.
      
      Extend the function to match a specified device. The old functionality
      of querying all devices is left intact by specifying NULL as the target
      device.
      
      A block_device is passed into btrfs_rmap_block instead of a
      btrfs_device, as this function is intended to reverse-map the result of
      a bio, which only has a block_device.
      
      Also export the function for later use.
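
      The extended interface then looks roughly like this (NULL bdev keeps
      the old query-all-devices behaviour):

        int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
                             struct block_device *bdev, u64 physical,
                             u64 **logical, int *naddrs, int *stripe_len);
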
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      138082f3
    • btrfs: zoned: reset zones of unused block groups · dcba6e48
      Committed by Naohiro Aota
      We must reset the zones of a deleted unused block group to rewind the
      zones' write pointers to the zones' start.
      
      To do this, we can reuse the DISCARD_SYNC code path to perform the
      reset when the filesystem is running on zoned devices.
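
      On zoned devices the "discard" of a deleted block group then becomes a
      zone reset; a sketch of the branch (the exact call site is simplified):

        if (btrfs_is_zoned(fs_info))
                /* rewind the zone's write pointer to the zone start */
                ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET,
                                       sector, nr_sectors, GFP_NOFS);
        else
                ret = blkdev_issue_discard(bdev, sector, nr_sectors,
                                           GFP_NOFS, 0);
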
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dcba6e48
    • btrfs: zoned: implement sequential extent allocation · 2eda5708
      Committed by Naohiro Aota
      Implement a sequential extent allocator for zoned filesystems. This
      allocator only needs to check if there is enough space in the block group
      after the allocation pointer to satisfy the extent allocation request.
      Therefore the allocator never manages bitmaps or clusters. Also, add
      assertions to the corresponding functions.
      
      As zone append writing is used, it would be unnecessary to track the
      allocation offset, as the allocator only needs to check available space.
      But by tracking and returning the offset as an allocated region, we can
      skip modification of ordered extents and checksum information when there
      is no IO reordering.
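
      The core of a sequential allocator is tiny compared to the clustered
      one; roughly (ffe_ctl is the find_free_extent control structure, field
      names follow the description above):

        /* Sequential allocation: only check the space past alloc_offset. */
        u64 avail = block_group->length - block_group->alloc_offset;

        if (avail < num_bytes)
                return -ENOSPC;         /* no bitmaps, no clusters, no fitting */

        ffe_ctl->found_offset = block_group->start + block_group->alloc_offset;
        block_group->alloc_offset += num_bytes;
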
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2eda5708
    • btrfs: zoned: track unusable bytes for zones · 169e0da9
      Committed by Naohiro Aota
      In a zoned filesystem, a region that was written once and then freed is
      not usable until the underlying zone has been reset. So we need to
      distinguish such unusable space from usable free space.
      
      Therefore we need to introduce a "zone_unusable" field in the block
      group structure, and "bytes_zone_unusable" in the space_info structure,
      to track the unusable space.
      
      Pinned bytes are always reclaimed to the unusable space. But when an
      allocated region is returned before being used, e.g. when the block
      group becomes read-only between allocation time and reservation time,
      we can safely return the region to the block group. For this situation,
      this commit introduces "btrfs_add_free_space_unused". It behaves the
      same as btrfs_add_free_space() on regular filesystems; on zoned
      filesystems, it rewinds the allocation offset.
      
      Because the read-only bytes counter tracks free but unusable bytes while
      the block group is read-only, we need to migrate the zone_unusable bytes
      to read-only bytes when a block group is marked read-only.
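
      The new helper is essentially a thin switch (a sketch; the zoned-side
      helper name is illustrative):

        int btrfs_add_free_space_unused(struct btrfs_block_group *block_group,
                                        u64 bytenr, u64 size)
        {
                if (btrfs_is_zoned(block_group->fs_info))
                        /* never-used reservation: just rewind alloc_offset */
                        return __btrfs_add_free_space_zoned(block_group,
                                                            bytenr, size, true);

                return btrfs_add_free_space(block_group, bytenr, size);
        }
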
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      169e0da9
    • btrfs: zoned: calculate allocation offset for conventional zones · a94794d5
      Committed by Naohiro Aota
      Conventional zones do not have a write pointer, so we cannot use it to
      determine the allocation offset for sequential allocation if a block
      group contains a conventional zone.
      
      Instead, we can use the end of the highest addressed extent in the
      block group as the allocation offset.
      
      For a new block group, we cannot calculate the allocation offset by
      consulting the extent tree, because doing so can deadlock: it takes an
      extent buffer lock after the chunk mutex, which is already held in
      btrfs_make_block_group(). Since it is a new block group anyway, we can
      simply set the allocation offset to 0.
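
      In sketch form (the helper name is illustrative):

        if (new)
                cache->alloc_offset = 0;        /* brand new, nothing written */
        else
                /* Walk the extent tree for the highest addressed extent and
                 * place the allocation offset right after it. */
                ret = calculate_alloc_pointer(cache, &cache->alloc_offset);
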
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a94794d5
    • btrfs: zoned: load zone's allocation offset · 08e11a3d
      Committed by Naohiro Aota
      A zoned filesystem must allocate blocks at the zones' write pointer.
      The device's write pointer position can be mapped to a logical address
      within a block group. To facilitate this, add an "alloc_offset" to the
      block group to track the logical address of the write pointer.
      
      This logical address is populated in btrfs_load_block_group_zone_info()
      from the write pointers of the corresponding zones.
      
      For now, zoned filesystems support only the single profile. Supporting
      non-single profiles with zone append writing is not trivial. For
      example, in the DUP profile, we send a zone append write IO to two
      zones on a device. The device replies with the written LBAs for the
      IOs. If the offsets of the returned addresses from the beginning of the
      zones differ, we end up with different logical addresses.
      
      We would need a fine-grained logical-to-physical mapping to support such
      separated physical addresses. Since that would require an additional
      metadata type, disable non-single profiles for now.
      
      This commit supports the case where all the zones in a block group are
      sequential. The next patch will handle the case of a block group
      containing a conventional zone.
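
      For a single-profile block group backed by one sequential zone, the
      mapping is a simple subtraction (a sketch; zone and map come from the
      surrounding zone-report and chunk-map context):

        /* zone.wp is the device write pointer in 512-byte sectors;
         * map->stripes[0].physical is the zone start on that device. */
        cache->alloc_offset = (zone.wp << SECTOR_SHIFT) -
                              map->stripes[0].physical;
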
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      08e11a3d
    • btrfs: release path before calling to btrfs_load_block_group_zone_info · 4afd2fe8
      Committed by Johannes Thumshirn
      Since we have no write pointer in conventional zones, we cannot
      determine the allocation offset from it. Instead, we set the allocation
      offset after the highest addressed extent. This is done by reading the
      extent tree in btrfs_load_block_group_zone_info().
      
      However, this function is called from btrfs_read_block_groups(), so the
      read lock for the tree node could be recursively taken.
      
      To avoid this unsafe locking scenario, release the path before reading
      the extent tree to get the allocation offset.
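
      The fix itself is small (a sketch):

        /* Drop the tree locks held by the search path first. */
        btrfs_release_path(path);
        ret = btrfs_load_block_group_zone_info(cache);
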
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4afd2fe8
    • btrfs: do not block on deleted bgs mutex in the cleaner · ddfd08cb
      Committed by Josef Bacik
      While running some stress tests I started getting hung task messages.
      This is because the delete unused block groups code has to take the
      delete_unused_bgs_mutex to do its work, which is taken by balance to
      make sure we don't delete block groups while we're balancing.
      
      The problem is that balance can take a while, so we were getting hung
      task warnings. We don't need to block here, and the cleaner is needed
      for other work, so trylock on this mutex and just bail if we can't
      acquire it right away.
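
      The change boils down to (a sketch):

        /* In btrfs_delete_unused_bgs(): don't block behind balance. */
        if (!mutex_trylock(&fs_info->delete_unused_bgs_mutex))
                return;         /* the cleaner will retry on its next cycle */
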
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ddfd08cb
    • btrfs: splice remaining dirty_bg's onto the transaction dirty bg list · 938fcbfb
      Committed by Josef Bacik
      While doing error injection testing with my relocation patches I hit the
      following assert:
      
        assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3357!
        invalid opcode: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 24351 Comm: umount Tainted: G        W         5.10.0-rc3+ #193
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:assertfail.constprop.0+0x18/0x1a
        RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
        RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
        RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
        RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
        R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
        R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
        FS:  00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
        Call Trace:
         btrfs_free_block_groups.cold+0x55/0x55
         close_ctree+0x2c5/0x306
         ? fsnotify_destroy_marks+0x14/0x100
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20
         deactivate_locked_super+0x36/0xa0
         cleanup_mnt+0x12d/0x190
         task_work_run+0x5c/0xa0
         exit_to_user_mode_prepare+0x1b1/0x1d0
         syscall_exit_to_user_mode+0x54/0x280
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happened because I injected an error in btrfs_cow_block() while
      running the dirty block groups.  When we run the dirty block groups, we
      splice the list onto a local list to process.  However, if an error
      occurs, we only clean up the transaction's dirty block group list, not
      any pending block groups we have on our locally spliced list.
      
      In fact, if we fail to allocate a path in this function we'll also fail
      to clean up the spliced list.
      
      Fix this by splicing the list back onto the transaction's dirty block
      group list so that the block groups are cleaned up.  Then add an 'out'
      label and have the error conditions jump to it so that the errors are
      handled properly.  This also has the side effect of fixing a problem
      where we would clear 'ret' on error because we unconditionally ran
      btrfs_run_delayed_refs().
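
      The error path then looks roughly like this:

        out:
                if (ret < 0) {
                        /* Give our local entries back to the transaction so
                         * the normal cleanup path can see and release them. */
                        spin_lock(&cur_trans->dirty_bgs_lock);
                        list_splice_init(&dirty, &cur_trans->dirty_bgs);
                        spin_unlock(&cur_trans->dirty_bgs_lock);
                        btrfs_cleanup_dirty_bgs(cur_trans, fs_info);
                }
                btrfs_free_path(path);
                return ret;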
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      938fcbfb
    • btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself · 2187374f
      Committed by Josef Bacik
      Currently we pass things around to figure out if we may be freeing data
      based on the state of the delayed refs head.  This makes the accounting
      sort of confusing and hard to follow, as it's distinctly separate from
      the delayed ref heads stuff, but also depends on it entirely.
      
      Fix this by explicitly adjusting space_info->total_bytes_pinned in the
      delayed refs code.  We now have two places where we modify this
      counter: once when we create and destroy the delayed refs, and once
      when we pin and unpin the extents.  This means there is a slight
      overlap between delayed refs and the pin/unpin mechanisms, but this is
      simply used by the ENOSPC infrastructure to determine if we need to
      commit the transaction, so there's no adverse effect from this; we
      might simply commit thinking it will give us enough space when it might
      not.
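
      The accounting can then live behind a single helper (the helper name
      here is illustrative):

        static inline void mod_total_bytes_pinned(struct btrfs_space_info *sinfo,
                                                  s64 num_bytes)
        {
                percpu_counter_add_batch(&sinfo->total_bytes_pinned, num_bytes,
                                         BTRFS_TOTAL_BYTES_PINNED_BATCH);
        }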
      
      CC: stable@vger.kernel.org # 5.10
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2187374f
    • btrfs: document fs_info in btrfs_rmap_block · 9ee9b979
      Committed by Nikolay Borisov
      Fixes fs/btrfs/block-group.c:1570: warning: Function parameter or member 'fs_info' not described in 'btrfs_rmap_block'
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9ee9b979
  3. 26 January 2021, 1 commit
    • btrfs: fix possible free space tree corruption with online conversion · 2f96e402
      Committed by Josef Bacik
      While running btrfs/011 in a loop I would often hit an ASSERT() while
      trying to add a new free space entry that already existed, or get an
      EEXIST while adding a new block to the extent tree, which is another
      indication of double allocation.
      
      This occurs because when we do the free space tree population, we
      create the new root, then populate the tree and commit the transaction.
      The problem is that when you create a new root, the root node and
      commit root node are the same.  During this initial transaction commit
      we will run all of the delayed refs that were paused during the free
      space tree generation, and thus begin to cache block groups.  While
      caching block groups, the caching thread will be reading from the main
      root of the free space tree, so as we make allocations we'll be
      changing the free space tree, which can cause us to add the same range
      twice.  This results in either the ASSERT(ret != -EEXIST); in
      __btrfs_add_free_space, or in a variety of different errors when
      running delayed refs because of a double allocation.
      
      Fix this by marking the fs_info as unsafe to load the free space tree,
      and fall back on the old slow caching method.  We could be smarter than
      this, for example caching the block group while we're populating the
      free space tree, but since this is a serious problem I've opted for the
      simplest solution.
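
      A sketch of the idea (the flag name is assumed here):

        /* While the free space tree is being built, its contents cannot be
         * trusted by concurrent readers of the commit root: */
        set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);

        /* ...so block group caching falls back to the slow path: */
        if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
            !test_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags))
                ret = load_free_space_tree(caching_ctl);
        else
                ret = load_extent_tree_free(caching_ctl);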
      
      CC: stable@vger.kernel.org # 4.9+
      Fixes: a5ed9182 ("Btrfs: implement the free space B-tree")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2f96e402
  4. 18 January 2021, 1 commit
  5. 10 December 2020, 4 commits
    • btrfs: skip space_cache v1 setup when not using it · af456a2c
      Committed by Boris Burkov
      If we are not using space cache v1, we should not create the free space
      object or free space inodes. This comes up when we delete the existing
      free space objects/inodes when migrating to v2, only to see them get
      recreated for every dirtied block group.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      af456a2c
    • btrfs: remove free space items when disabling space cache v1 · 36b216c8
      Committed by Boris Burkov
      When the filesystem transitions from space cache v1 to v2 or to
      nospace_cache, it removes the old cached data, but does not remove
      the FREE_SPACE items nor the free space inodes they point to. This
      doesn't cause any issues besides being a bit inefficient, since these
      items no longer do anything useful.
      
      To fix it, when we are mounting and plan to disable the space cache,
      destroy each block group's free space item and free space inode.
      The code to remove the items is lifted from the existing use case of
      removing the block group, with a light adaptation to handle whether or
      not we have already looked up the free space inode.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36b216c8
    • btrfs: only mark bg->needs_free_space if free space tree is on · 997e3e2e
      Committed by Boris Burkov
      If we attempt to create a free space tree while any block groups have
      needs_free_space set, we will double-add the new free space item and
      hit EEXIST. Previously, we only created the free space tree on a fresh
      mount, so we never hit this case, but if we try to create it on a
      remount, such block groups could exist and trip us up.
      
      We don't do anything with this field unless the free space tree is
      enabled, so there is no harm in not setting it.
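
      The gate is a single compat check (a sketch):

        if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
                cache->needs_free_space = 1;
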
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      997e3e2e
    • btrfs: implement log-structured superblock for ZONED mode · 12659251
      Committed by Naohiro Aota
      The superblock (and its copies) is the only data structure in btrfs
      that has a fixed location on a device. Since we cannot overwrite it in
      a sequential write required zone, we cannot place the superblock in
      such a zone. One easy solution is limiting the superblock and its
      copies to conventional zones. However, this method has two downsides.
      One is a reduced number of superblock copies: the location of the
      second copy of the superblock is at 256GB, which is in a sequential
      write required zone on typical devices on the market today, so the
      number of superblock copies would be limited to two. The second
      downside is that we cannot support devices which have no conventional
      zones at all.
      
      To solve these two problems, we employ superblock log writing. It uses
      two adjacent zones as a circular buffer to write updated superblocks.
      Once the first zone is filled up, start writing into the second one.
      Then, when both zones are filled up and before starting to write to the
      first zone again, reset the first zone.
      
      We can determine the position of the latest superblock by reading the
      write pointer information from the device. One corner case is when both
      zones are full. For this situation, we read out the last superblock of
      each zone and compare them to determine which one holds the latest
      superblock.
      
      The following zones are reserved as the circular buffer on zoned btrfs:
      
      - The primary superblock: zones 0 and 1
      - The first copy: zones 16 and 17
      - The second copy: zone 1024, or the zone at 256GB, whichever is
        smaller, and the zone next to it
      
      If these reserved zones are conventional, the superblock is written at
      a fixed location at the start of the zone, without logging.
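
      Mapping a superblock mirror to its first log zone then looks roughly
      like this (a sketch; shift is log2 of the device zone size in bytes):

        static inline u32 sb_zone_number(int shift, int mirror)
        {
                switch (mirror) {
                case 0: return 0;                       /* zones 0 and 1   */
                case 1: return 16;                      /* zones 16 and 17 */
                case 2: return min_t(u64, btrfs_sb_offset(mirror) >> shift,
                                     1024);             /* capped at 1024  */
                }
                return 0;
        }
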
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      12659251
  6. 08 December 2020, 5 commits
    • btrfs: protect fs_info->caching_block_groups by block_group_cache_lock · bbb86a37
      Committed by Josef Bacik
      I got the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0+ #101 Not tainted
        ------------------------------------------------------
        btrfs-cleaner/3445 is trying to acquire lock:
        ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
      
        but task is already holding lock:
        ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
      	 down_write+0x3d/0x70
      	 btrfs_cache_block_group+0x2d5/0x510
      	 find_free_extent+0xb6e/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&space_info->groups_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 find_free_extent+0x2ed/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1167/0x2150
      	 lock_acquire+0xb9/0x3d0
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x32/0x170
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x614/0x9d0
      	 btrfs_find_root+0x35/0x1b0
      	 btrfs_read_tree_root+0x61/0x120
      	 btrfs_get_root_ref+0x14b/0x600
      	 find_parent_nodes+0x3e6/0x1b30
      	 btrfs_find_all_roots_safe+0xb4/0x130
      	 btrfs_find_all_roots+0x60/0x80
      	 btrfs_qgroup_trace_extent_post+0x27/0x40
      	 btrfs_add_delayed_data_ref+0x3fd/0x460
      	 btrfs_free_extent+0x42/0x100
      	 __btrfs_mod_ref+0x1d7/0x2f0
      	 walk_up_proc+0x11c/0x400
      	 walk_up_tree+0xf0/0x180
      	 btrfs_drop_snapshot+0x1c7/0x780
      	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
      	 cleaner_kthread+0xd4/0x140
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->commit_root_sem);
      				 lock(&space_info->groups_sem);
      				 lock(&fs_info->commit_root_sem);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        3 locks held by btrfs-cleaner/3445:
         #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
         #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
         #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        stack backtrace:
        CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         __lock_acquire+0x1167/0x2150
         ? __bfs+0x42/0x210
         lock_acquire+0xb9/0x3d0
         ? __btrfs_tree_read_lock+0x32/0x170
         down_read_nested+0x43/0x130
         ? __btrfs_tree_read_lock+0x32/0x170
         __btrfs_tree_read_lock+0x32/0x170
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x614/0x9d0
         ? find_held_lock+0x2b/0x80
         btrfs_find_root+0x35/0x1b0
         ? do_raw_spin_unlock+0x4b/0xa0
         btrfs_read_tree_root+0x61/0x120
         btrfs_get_root_ref+0x14b/0x600
         find_parent_nodes+0x3e6/0x1b30
         btrfs_find_all_roots_safe+0xb4/0x130
         btrfs_find_all_roots+0x60/0x80
         btrfs_qgroup_trace_extent_post+0x27/0x40
         btrfs_add_delayed_data_ref+0x3fd/0x460
         btrfs_free_extent+0x42/0x100
         __btrfs_mod_ref+0x1d7/0x2f0
         walk_up_proc+0x11c/0x400
         walk_up_tree+0xf0/0x180
         btrfs_drop_snapshot+0x1c7/0x780
         ? btrfs_clean_one_deleted_snapshot+0x73/0x110
         btrfs_clean_one_deleted_snapshot+0xfb/0x110
         cleaner_kthread+0xd4/0x140
         ? btrfs_alloc_root+0x50/0x50
         kthread+0x13a/0x150
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      while testing another lockdep fix.  This happens because we're using
      the commit_root_sem to protect fs_info->caching_block_groups, which
      creates a groups_sem -> commit_root_sem dependency.  This is
      problematic because we can allocate blocks while holding tree root
      locks.  Fix this by protecting the list with
      fs_info->block_group_cache_lock instead.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bbb86a37
    • btrfs: load free space cache asynchronously · e747853c
      Committed by Josef Bacik
      While documenting the usage of the commit_root_sem, I noticed that we do
      not actually take the commit_root_sem in the case of the free space
      cache.  This is problematic because we're supposed to hold that sem
      while we're reading the commit roots, which is what we do for the free
      space cache.
      
      The reason I did it inline when I originally wrote the code was the
      unpinning case, where we need to make sure that the free space cache is
      loaded if we're going to use it.  But we can accomplish the same thing
      by simply waiting for the cache to be loaded.
      
      Rework this code to load the free space cache asynchronously.  This
      allows us to greatly clean up the caching code, because now it's all
      shared by the various caching methods.  We are also now in a position
      to hold the commit_root_sem while we're loading the free space cache.
      And finally, our modification of ->last_byte_to_unpin is removed,
      because it can be handled in the proper way on commit.
      
      Some care must be taken when replaying the log: we expect that the free
      space cache will be read entirely before we start excluding space to
      replay, since getting this wrong could lead to overwriting space during
      replay.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e747853c
    • btrfs: load free space cache into a temporary ctl · cd79909b
      Committed by Josef Bacik
      The free space cache has been special in that we would load it right
      away instead of farming the work off to a worker thread.  This resulted
      in some weirdness that had to be taken into account, namely that if we
      ever found a block group being cached the fast way, we had to wait for
      it to finish, because we could get the cache before it had been
      validated and we might throw the cache away.
      
      To handle this particular case, instead create a temporary
      btrfs_free_space_ctl to load the free space cache into.  Then, once
      we've validated that it makes sense, copy its contents into the actual
      block_group->free_space_ctl.  This allows us to avoid the problem of
      needing to wait for the caching to complete, we can clean up the
      discard extent handling in __load_free_space_cache, and we no longer
      need to do the merge_space_tree(), because the space is added one by
      one into the real free_space_ctl.  This will allow further reworks of
      how we handle loading the free space cache.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd79909b
    • btrfs: introduce mount option rescue=ignorebadroots · 42437a63
      Committed by Josef Bacik
      In the face of extent root corruption, or any other core fs-wide root
      corruption, we will fail to mount the file system.  This makes recovery
      kind of a pain, because you need to fall back to userspace tools to
      scrape off data.  Instead, provide a mechanism to gracefully handle bad
      roots, so we can at least mount read-only and possibly recover data
      from the file system.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42437a63
  7. 26 October 2020, 1 commit
    • btrfs: drop the path before adding block group sysfs files · 7837fa88
      Committed by Josef Bacik
      Dave reported a problem with my rwsem conversion patch where we got the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-default+ #1297 Not tainted
        ------------------------------------------------------
        kswapd0/76 is trying to acquire lock:
        ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}-{0:0}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 fs_reclaim_acquire.part.0+0x25/0x30
      	 kmem_cache_alloc+0x30/0x9c0
      	 alloc_inode+0x81/0x90
      	 iget_locked+0xcd/0x1a0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x210
      	 sysfs_get_tree+0x1a/0x50
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (kernfs_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x58/0x80
      	 sysfs_create_dir_ns+0x70/0xd0
      	 kobject_add_internal+0xbb/0x2d0
      	 kobject_add+0x7a/0xd0
      	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
      	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
      	 open_ctree+0x981/0x1108 [btrfs]
      	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (btrfs-extent-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 check_committed_ref+0x69/0x200 [btrfs]
      	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
      	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
      	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
      	 writepage_delalloc+0xae/0x160 [btrfs]
      	 __extent_writepage+0x262/0x420 [btrfs]
      	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
      	 extent_writepages+0x43/0x90 [btrfs]
      	 do_writepages+0x40/0xe0
      	 __writeback_single_inode+0x62/0x610
      	 writeback_sb_inodes+0x20f/0x500
      	 wb_writeback+0xef/0x4a0
      	 wb_do_writeback+0x49/0x2e0
      	 wb_workfn+0x81/0x340
      	 process_one_work+0x233/0x5d0
      	 worker_thread+0x50/0x3b0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-fs-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
      	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
      	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
      	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
      	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
      	 btrfs_mksubvol+0x2db/0x470 [btrfs]
      	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
      	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
      	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
      	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
      	 __x64_sys_ioctl+0x6f/0xa0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 check_prev_add+0x91/0xc60
      	 validate_chain+0xa6e/0x2a20
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
      	 evict+0xd6/0x1c0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x54/0x80
      	 super_cache_scan+0x121/0x1a0
      	 do_shrink_slab+0x16d/0x3b0
      	 shrink_slab+0xb1/0x2e0
      	 shrink_node+0x230/0x6a0
      	 balance_pgdat+0x325/0x750
      	 kswapd+0x206/0x4d0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/76:
         #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
         #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x77/0x97
         check_noncircular+0xff/0x110
         ? save_trace+0x50/0x470
         check_prev_add+0x91/0xc60
         validate_chain+0xa6e/0x2a20
         ? save_trace+0x50/0x470
         __lock_acquire+0x582/0xac0
         lock_acquire+0xca/0x430
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __mutex_lock+0xa0/0xaf0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? __lock_acquire+0x582/0xac0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? btrfs_evict_inode+0x30b/0x560 [btrfs]
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x3cc/0x560 [btrfs]
         evict+0xd6/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x121/0x1a0
         do_shrink_slab+0x16d/0x3b0
         shrink_slab+0xb1/0x2e0
         shrink_node+0x230/0x6a0
         balance_pgdat+0x325/0x750
         kswapd+0x206/0x4d0
         ? finish_wait+0x90/0x90
         ? balance_pgdat+0x750/0x750
         kthread+0x137/0x150
         ? kthread_mod_delayed_work+0xc0/0xc0
         ret_from_fork+0x1f/0x30
      
      This happens because we are still holding the path open when we start
      adding the sysfs files for the block groups, which creates a dependency
      on fs_reclaim via the tree lock.  Fix this by dropping the path before
      we start doing anything with sysfs.
      Reported-by: David Sterba <dsterba@suse.com>
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7837fa88
  8. 07 October 2020, 6 commits
    • btrfs: do not create raid sysfs entries under any locks · 49ea112d
      Committed by Josef Bacik
      While running xfstests btrfs/177 I got the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #5 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x7a/0xb0
      	 sysfs_create_dir_ns+0x60/0xb0
      	 kobject_add_internal+0xc0/0x2c0
      	 kobject_add+0x6e/0x90
      	 btrfs_sysfs_add_block_group_type+0x102/0x160
      	 btrfs_make_block_group+0x167/0x230
      	 btrfs_alloc_chunk+0x54f/0xb80
      	 btrfs_chunk_alloc+0x18e/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_new_inode+0x225/0x730
      	 btrfs_create+0xab/0x1f0
      	 lookup_open.isra.0+0x52d/0x690
      	 path_openat+0x2a7/0x9e0
      	 do_filp_open+0x75/0x100
      	 do_sys_openat2+0x7b/0x130
      	 __x64_sys_openat+0x46/0x70
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_lookup_inode+0x2a/0x8f
      	 __btrfs_update_delayed_inode+0x80/0x240
      	 btrfs_commit_inode_delayed_inode+0x119/0x120
      	 btrfs_evict_inode+0x357/0x500
      	 evict+0xcf/0x1f0
      	 do_unlinkat+0x1a9/0x2b0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because when we link in a block group with a new raid
      index type, we'll create the corresponding sysfs entries for it.  This
      is problematic because while restriping we're holding the chunk_mutex,
      and while mounting we're holding the tree locks.
      
      Fixing this isn't pretty: we move the call to the sysfs stuff into the
      btrfs_create_pending_block_groups() work, where we're not holding any
      locks.  This creates a slight race where other threads could see that
      there's no sysfs kobj for that raid type and race to create the sysfs
      dir.  Fix this by wrapping the creation in space_info->lock, so we only
      get one thread calling kobject_add() for the new directory.  We don't
      worry about the lock on cleanup, as it only gets deleted on unmount.
      
      On mount it's more straightforward: we already loop through the
      space_infos, so just check every raid index in each space_info and add
      the sysfs entries for the corresponding block groups.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      49ea112d
    • btrfs: kill the RCU protection for fs_info->space_info · 72804905
      Committed by Josef Bacik
      We have this wrapped in an RCU lock, but it's really not needed.  We
      create all the space_infos on mount, and we destroy them on unmount.
      The list never changes, and we're protected from messing with it by the
      normal mount/umount path, so kill the RCU stuff around it.
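
      Iteration becomes a plain list walk (a sketch):

        /* No rcu_read_lock()/list_for_each_entry_rcu() needed: the list
         * only changes in the mount/umount paths. */
        list_for_each_entry(space_info, &fs_info->space_info, list) {
                if (space_info->flags == flags)
                        return space_info;
        }
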
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      72804905
    • btrfs: make read_block_group_item return void · 4c448ce8
      Committed by Marcos Paulo de Souza
      Since its inclusion in 9afc6649 ("btrfs: block-group: refactor how
      we read one block group item"), this function has always returned 0, so
      there is no need to check the returned value.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c448ce8
    • btrfs: call btrfs_try_granting_tickets when reserving space · 99ffb43e
      Committed by Josef Bacik
      If we have compression on, we could free up more space than we
      reserved, and thus be able to satisfy a space reservation.  Add the
      call for this scenario.
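
      A sketch of the added call, which must run under the space_info lock
      (the surrounding bookkeeping is elided):

        spin_lock(&space_info->lock);
        /* ...update bytes_reserved / bytes_may_use accounting... */
        btrfs_try_granting_tickets(fs_info, space_info);
        spin_unlock(&space_info->lock);
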
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      99ffb43e
    • btrfs: call btrfs_try_granting_tickets when freeing reserved bytes · 3308234a
      Committed by Josef Bacik
      We were missing a call to btrfs_try_granting_tickets in
      btrfs_free_reserved_bytes, so add it to handle the case where we're able
      to satisfy an allocation because we've freed a pending reservation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3308234a
    • btrfs: delete duplicated words + other fixes in comments · 260db43c
      Committed by Randy Dunlap
      Delete repeated words ({to, the, a, and old}) in fs/btrfs/ and change
      "into 2 part" to "into 2 parts".
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      260db43c
  9. 27 August 2020, 1 commit
    • btrfs: block-group: fix free-space bitmap threshold · e3e39c72
      Committed by Marcos Paulo de Souza
      [BUG]
      After commit 9afc6649 ("btrfs: block-group: refactor how we read one
      block group item"), cache->length is assigned after calling
      btrfs_create_block_group_cache. This causes a problem, since
      set_free_space_tree_thresholds calculates the free-space threshold used
      to decide if the free-space tree should convert from extents to
      bitmaps.
      
      The current code calls set_free_space_tree_thresholds with
      cache->length being 0, which then makes cache->bitmap_high_thresh zero.
      This implies the system will always use bitmaps instead of extents,
      which is not desired if the block group is not fragmented.
      
      This behavior can be seen in a test that expects to repair systems with
      both FREE_SPACE_EXTENT and FREE_SPACE_BITMAP entries, where the current
      code only created FREE_SPACE_BITMAP entries.
      
      [FIX]
      Call set_free_space_tree_thresholds after setting cache->length. There
      is now a WARN_ON in set_free_space_tree_thresholds to help prevent the
      same mistake from happening again in the future.
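
      The fix is an ordering change in the read path (a sketch):

        cache = btrfs_create_block_group_cache(fs_info, key->objectid);
        cache->length = key->offset;
        /* thresholds depend on the length, so compute them only now */
        set_free_space_tree_thresholds(cache);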
      
      Link: https://github.com/kdave/btrfs-progs/issues/251
      Fixes: 9afc6649 ("btrfs: block-group: refactor how we read one block group item")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e3e39c72
  10. 27 July 2020, 8 commits