1. 26 Sep, 2022 11 commits
  2. 13 Sep, 2022 2 commits
    • btrfs: fix hang during unmount when stopping a space reclaim worker · a362bb86
      Committed by Filipe Manana
      Often when running generic/562 from fstests we can hang during unmount,
      resulting in a trace like this:
      
        Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
        Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
        Sep 07 11:55:32 debian9 kernel:       Not tainted 6.0.0-rc2-btrfs-next-122 #1
        Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        Sep 07 11:55:32 debian9 kernel: task:umount          state:D stack:    0 pid:49438 ppid: 25683 flags:0x00004000
        Sep 07 11:55:32 debian9 kernel: Call Trace:
        Sep 07 11:55:32 debian9 kernel:  <TASK>
        Sep 07 11:55:32 debian9 kernel:  __schedule+0x3c8/0xec0
        Sep 07 11:55:32 debian9 kernel:  ? rcu_read_lock_sched_held+0x12/0x70
        Sep 07 11:55:32 debian9 kernel:  schedule+0x5d/0xf0
        Sep 07 11:55:32 debian9 kernel:  schedule_timeout+0xf1/0x130
        Sep 07 11:55:32 debian9 kernel:  ? lock_release+0x224/0x4a0
        Sep 07 11:55:32 debian9 kernel:  ? lock_acquired+0x1a0/0x420
        Sep 07 11:55:32 debian9 kernel:  ? trace_hardirqs_on+0x2c/0xd0
        Sep 07 11:55:32 debian9 kernel:  __wait_for_common+0xac/0x200
        Sep 07 11:55:32 debian9 kernel:  ? usleep_range_state+0xb0/0xb0
        Sep 07 11:55:32 debian9 kernel:  __flush_work+0x26d/0x530
        Sep 07 11:55:32 debian9 kernel:  ? flush_workqueue_prep_pwqs+0x140/0x140
        Sep 07 11:55:32 debian9 kernel:  ? trace_clock_local+0xc/0x30
        Sep 07 11:55:32 debian9 kernel:  __cancel_work_timer+0x11f/0x1b0
        Sep 07 11:55:32 debian9 kernel:  ? close_ctree+0x12b/0x5b3 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  ? __trace_bputs+0x10b/0x170
        Sep 07 11:55:32 debian9 kernel:  close_ctree+0x152/0x5b3 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  ? evict_inodes+0x166/0x1c0
        Sep 07 11:55:32 debian9 kernel:  generic_shutdown_super+0x71/0x120
        Sep 07 11:55:32 debian9 kernel:  kill_anon_super+0x14/0x30
        Sep 07 11:55:32 debian9 kernel:  btrfs_kill_super+0x12/0x20 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  deactivate_locked_super+0x2e/0xa0
        Sep 07 11:55:32 debian9 kernel:  cleanup_mnt+0x100/0x160
        Sep 07 11:55:32 debian9 kernel:  task_work_run+0x59/0xa0
        Sep 07 11:55:32 debian9 kernel:  exit_to_user_mode_prepare+0x1a6/0x1b0
        Sep 07 11:55:32 debian9 kernel:  syscall_exit_to_user_mode+0x16/0x40
        Sep 07 11:55:32 debian9 kernel:  do_syscall_64+0x48/0x90
        Sep 07 11:55:32 debian9 kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
        Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
        Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
        Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
        Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
        Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
        Sep 07 11:55:32 debian9 kernel:  </TASK>
      
      What happens is the following:
      
      1) The cleaner kthread tries to start a transaction to delete an unused
         block group, but the metadata reservation can not be satisfied right
         away, so a reservation ticket is created and it starts the async
         metadata reclaim task (fs_info->async_reclaim_work);
      
      2) Writeback for all the filler inodes with an i_size of 2K starts
         (generic/562 creates a lot of 2K files with the goal of filling
         metadata space). We try to create an inline extent for them, but we
         fail when trying to insert the inline extent with -ENOSPC (at
         cow_file_range_inline()) - since this is not critical, we fall back
         to non-inline mode (back to cow_file_range()), reserve extents, create
         extent maps and create the ordered extents;
      
      3) An unmount starts, enters close_ctree();
      
      4) The async reclaim task is flushing stuff, entering the flush states one
         by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
         delayed iputs.
      
         After running the delayed iputs and before calling
         btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
         and btrfs_add_delayed_iput() is called for each one through
         btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
         in bumping fs_info->nr_delayed_iputs from 0 to some positive value.
      
         So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
         for fs_info->nr_delayed_iputs to become 0;
      
      5) The current transaction is committed by the transaction kthread, we then
         start unpinning extents and end up calling btrfs_try_granting_tickets()
         through unpin_extent_range(), since we released some space.
         This results in satisfying the ticket created by the cleaner kthread at
         step 1, waking up the cleaner kthread;
      
      6) At close_ctree() we ask the cleaner kthread to park;
      
      7) The cleaner kthread starts the transaction, deletes the unused block
         group, and then calls kthread_should_park(), which returns true, so it
         parks. And at this point we have the delayed iputs added by the
         completion of the ordered extents still pending;
      
      8) Then later at close_ctree(), when we call:
      
             cancel_work_sync(&fs_info->async_reclaim_work);
      
         We hang forever, since the cleaner was parked and no one else can run
         delayed iputs after that, while the reclaim task is waiting for the
         remaining delayed iputs to be completed.
      
      Fix this by waiting for all ordered extents to complete and running the
      delayed iputs before attempting to stop the async reclaim tasks. Note that
      we can not wait for ordered extents with btrfs_wait_ordered_roots() (or
      other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
      flag to be set on an ordered extent, but the delayed iput is added after
      that, when doing the final btrfs_put_ordered_extent(). So instead wait for
      the work queues used for executing ordered extent completion to be empty,
      which works because we do the final put on an ordered extent at
      btrfs_finish_ordered_io() (while we are in the unmount context).
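
      The ordering argument above can be checked with a small userspace model.
      This is only a sketch under stated assumptions: FsModel, unmount() and the
      method names are hypothetical stand-ins for the kernel state, and the model
      assumes that only the cleaner kthread runs delayed iputs during unmount.

```python
class FsModel:
    """Toy model of the counters involved in the hang; not kernel code."""

    def __init__(self, pending_ordered_extents):
        self.pending_ordered_extents = pending_ordered_extents
        self.nr_delayed_iputs = 0
        self.cleaner_parked = False

    def flush_ordered_extent_workers(self):
        # Completing an ordered extent does the final put, which queues
        # one delayed iput per ordered extent.
        self.nr_delayed_iputs += self.pending_ordered_extents
        self.pending_ordered_extents = 0

    def run_delayed_iputs(self):
        self.nr_delayed_iputs = 0

    def cancel_reclaim_work_sync(self):
        """Return True if the reclaim worker can finish, False if it hangs."""
        # Ordered extents still in flight complete while reclaim is waiting.
        self.flush_ordered_extent_workers()
        if self.nr_delayed_iputs == 0:
            return True
        # Someone must run the delayed iputs; here that is only the cleaner.
        return not self.cleaner_parked


def unmount(fs, fixed):
    """Model the unmount sequence, with or without the fix."""
    fs.cleaner_parked = True
    if fixed:
        # The fix: drain ordered extent completion, then run delayed iputs,
        # before trying to stop the async reclaim work.
        fs.flush_ordered_extent_workers()
        fs.run_delayed_iputs()
    return fs.cancel_reclaim_work_sync()
```

      With three ordered extents still pending, unmount(FsModel(3), fixed=False)
      reports a hang, while unmount(FsModel(3), fixed=True) completes.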
      
      Fixes: d6fd0ae2 ("Btrfs: fix missing delayed iputs on unmount")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix hang during unmount when stopping block group reclaim worker · 8a1f1e3d
      Committed by Filipe Manana
      During early unmount, at close_ctree(), we try to stop the block group
      reclaim task with cancel_work_sync(), but that may hang if the block group
      reclaim task is currently at btrfs_relocate_block_group() waiting for the
      flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During
      unmount we only clear that flag later, after trying to stop the block
      group reclaim task.
      
      Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the
      block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that
      if the reclaim task is waiting on that bit, it will stop immediately after
      being woken, because it sees the filesystem is closing (with a call to
      btrfs_fs_closing()), and then returns immediately with -EINTR.
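
      A minimal sketch of why the order of the two flag updates matters. The bit
      positions and both function names are assumptions made for illustration;
      only the flag names and the -EINTR result come from the description above.

```python
import errno

# Hypothetical bit positions, chosen only for this model.
CLOSING_START = 1 << 0
UNFINISHED_DROPS = 1 << 1


def relocate_wait_result(flags):
    """Model what the relocation waiter does for a given flags value.

    Returns None while it keeps sleeping, -EINTR when it gives up because
    the filesystem is closing, and 0 when it may proceed.
    """
    if flags & UNFINISHED_DROPS:
        return None  # bit still set: the waiter keeps sleeping
    if flags & CLOSING_START:
        return -errno.EINTR  # woken during unmount: bail out immediately
    return 0  # normal case: go on with the relocation


def close_ctree_flags(flags, fixed):
    """Model the flag updates done early in unmount."""
    flags |= CLOSING_START
    if fixed:
        # The fix: clear the bit before stopping the reclaim task.
        flags &= ~UNFINISHED_DROPS
    return flags
```

      Starting from a state with UNFINISHED_DROPS set, the unfixed sequence
      leaves the waiter asleep (None), and the fixed sequence makes it return
      -EINTR.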
      
      Fixes: 31e70e52 ("btrfs: fix hang during unmount when block group reclaim task is running")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 05 Sep, 2022 1 commit
    • btrfs: zoned: fix API misuse of zone finish waiting · d5b81ced
      Committed by Naohiro Aota
      The commit 2ce543f4 ("btrfs: zoned: wait until zone is finished when
      allocation didn't progress") implemented a zone finish waiting mechanism
      to the write path of zoned mode. However, using
      wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong and
      wait_var_event() just hangs because no one ever wakes it up once it goes
      into sleep.
      
      Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit()
      on fs_info->flags with a proper barrier installed.
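
      The wait-on-bit pattern can be illustrated in userspace. This is a rough
      analogue, not the kernel API: BitFlags and demo() are hypothetical, and a
      threading.Condition stands in for the bit waitqueue and the barrier.

```python
import threading


class BitFlags:
    """Userspace analogue of a flags word with wait/wake on a single bit."""

    def __init__(self):
        self._flags = 0
        self._cond = threading.Condition()

    def set_bit(self, bit):
        with self._cond:
            self._flags |= 1 << bit

    def clear_and_wake_up_bit(self, bit):
        # Clearing and waking happen under one lock, so a waiter can never
        # miss the wakeup between testing the bit and going to sleep.
        with self._cond:
            self._flags &= ~(1 << bit)
            self._cond.notify_all()

    def wait_on_bit(self, bit, timeout=None):
        """Sleep until the bit is clear; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: not self._flags & (1 << bit), timeout)


def demo():
    need_zone_finish = 0  # hypothetical bit number
    flags = BitFlags()
    flags.set_bit(need_zone_finish)
    # Another thread "finishes a zone" shortly after the writer starts waiting.
    finisher = threading.Timer(
        0.05, flags.clear_and_wake_up_bit, args=(need_zone_finish,))
    finisher.start()
    woken = flags.wait_on_bit(need_zone_finish, timeout=5)
    finisher.join()
    return woken
```

      The point of the sketch is that the waiter and the waker agree on the
      same object (the bit in the flags word), which is what the original
      wait_var_event()/wake_up_all() pairing failed to do.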
      
      Fixes: 2ce543f4 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 17 Aug, 2022 1 commit
  5. 03 Aug, 2022 2 commits
  6. 25 Jul, 2022 16 commits
    • btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Committed by Naohiro Aota
      When the allocated position doesn't progress, we cannot submit IOs to
      finish a block group, but there should be ongoing IOs that will finish a
      block group. So, in that case, we wait for a zone to be finished and retry
      the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate that we need a zone finish to proceed. The flag is set when the
      allocator detects that it cannot activate a new block group, and it is
      cleared once a zone is finished.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Committed by Naohiro Aota
      On a zoned filesystem, data writeout is limited by max_zone_append_size,
      and a large ordered extent is split according to the size of a bio. On
      the other hand, the number of extents to be written is calculated using
      BTRFS_MAX_EXTENT_SIZE, and that estimate is used to reserve the metadata
      bytes needed to update and/or create the metadata items.

      The metadata reservation is made at, e.g., btrfs_buffered_write() and is
      then released as the estimate changes. Thus, if the number of extents
      increases massively, the reserved metadata can run out.

      Such an increase easily occurs on a zoned filesystem if
      BTRFS_MAX_EXTENT_SIZE > max_zone_append_size, and it causes the following
      warning in a low-memory environment with metadata over-commit disabled
      (in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, we need to introduce fs_info->max_extent_size to
      replace BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for
      regular and zoned filesystems.
      
      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
      filesystem, it is set to fs_info->max_zone_append_size.
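
      The effect on the extent estimate can be shown with a little arithmetic.
      The sketch below assumes BTRFS_MAX_EXTENT_SIZE is 128 MiB and uses a
      purely hypothetical max_zone_append_size of 1 MiB; count_max_extents()
      here is a plain ceiling division written for illustration.

```python
SZ_1M = 1024 * 1024
BTRFS_MAX_EXTENT_SIZE = 128 * SZ_1M  # assumed value of the constant


def count_max_extents(size, max_extent_size):
    """Upper bound on the number of extents needed for `size` bytes."""
    return (size + max_extent_size - 1) // max_extent_size


write_size = 256 * SZ_1M
zone_append_size = 1 * SZ_1M  # hypothetical device limit

# Estimate with the fixed constant: too low for a zoned filesystem.
old_estimate = count_max_extents(write_size, BTRFS_MAX_EXTENT_SIZE)
# Estimate with a per-filesystem limit equal to the zone append size.
new_estimate = count_max_extents(write_size, zone_append_size)
```

      Under these assumptions the old estimate reserves metadata for 2 extents
      while the write is actually split into 256, which is the mismatch that
      exhausts the reservation.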
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: set the objectid of the btree inode's location key · adac5584
      Committed by Filipe Manana
      We currently don't use the location key of the btree inode, its content
      is set to zeroes, as it's a special inode that is not persisted (it has
      no inode item stored in any btree).
      
      At btrfs_ino(), an inline function used extensively in btrfs, we have
      this special check if the given inode's location objectid is 0, and if it
      is, we return the value stored in the VFS' inode i_ino field instead
      (which is BTRFS_BTREE_INODE_OBJECTID for the btree inode).
      
      To reduce the code at btrfs_ino(), we can simply set the objectid of the
      btree inode to the value BTRFS_BTREE_INODE_OBJECTID. This eliminates the
      need to check for the special case of the objectid being zero, with the
      side effect of reducing the overall code size and having less code to
      execute, as btrfs_ino() is an inline function.
      
      Before:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1620502	 189240	  29032	1838774	 1c0eb6	fs/btrfs/btrfs.ko
      
      After:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1617487	 189240	  29032	1835759	 1c02ef	fs/btrfs/btrfs.ko
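
      The simplification can be sketched as follows. The Inode class is a
      hypothetical stand-in for the in-memory inode, and the value 1 for
      BTRFS_BTREE_INODE_OBJECTID is an assumption of this model.

```python
from dataclasses import dataclass

BTRFS_BTREE_INODE_OBJECTID = 1  # assumed value, for illustration only


@dataclass
class Inode:
    location_objectid: int  # objectid of the inode's location key
    i_ino: int              # inode number stored in the VFS inode


def btrfs_ino_before(inode):
    """Old behaviour: fall back to i_ino when the objectid is zero."""
    ino = inode.location_objectid
    if ino == 0:
        ino = inode.i_ino
    return ino


def btrfs_ino_after(inode):
    """New behaviour: the objectid is always set, so no special case."""
    return inode.location_objectid


# The btree inode before and after the change to how it is initialized.
btree_inode_old = Inode(location_objectid=0, i_ino=BTRFS_BTREE_INODE_OBJECTID)
btree_inode_new = Inode(location_objectid=BTRFS_BTREE_INODE_OBJECTID,
                        i_ino=BTRFS_BTREE_INODE_OBJECTID)
```

      Both variants return the same inode number for the btree inode; the new
      one simply has no branch, which matters because the function is inlined
      at every call site.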

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully · ea1f0ced
      Committed by Christoph Hellwig
      btrfs_wq_submit_bio is used for writeback under memory pressure.
      Instead of failing the I/O when we can't allocate the async_submit_bio,
      just punt back to the synchronous submission path.
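
      The fallback can be sketched like this. submit_write() and its callback
      parameters are hypothetical names for this model; the real function and
      its callers are not reproduced here.

```python
def submit_write(bio, try_alloc_async, async_queue, submit_sync):
    """Queue the bio for async submission, or fall back to the sync path.

    try_alloc_async returns a context object, or None when the allocation
    fails (for example under memory pressure).
    """
    ctx = try_alloc_async(bio)
    if ctx is None:
        # Allocation failed: do not fail the I/O, submit it synchronously.
        submit_sync(bio)
        return "sync"
    async_queue.append(ctx)
    return "async"
```

      The I/O makes progress in both branches, which is the property that
      matters for writeback under memory pressure.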

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not return errors from btrfs_map_bio · 1a722d8f
      Committed by Christoph Hellwig
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches
      what the block layer submission does and avoids any confusion on who
      needs to handle errors.
      
      As this requires touching all the callers, rename the function to
      btrfs_submit_bio, which describes the functionality much better.
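
      The calling convention can be modelled as below. This is a simplified
      sketch: submit_bio(), map_blocks and end_io are hypothetical names, and
      the success path calls end_io right away, whereas real completion happens
      later when the I/O finishes.

```python
import errno


def submit_bio(bio, map_blocks, end_io):
    """Always consume the bio; report any error through end_io.

    Nothing is returned, so the caller has no error path of its own and
    cannot complete the bio a second time by mistake.
    """
    try:
        map_blocks(bio)
    except OSError as err:
        # Error while setting up the I/O: complete the bio with that status.
        end_io(bio, -err.errno)
        return
    # Simplification: treat the I/O as completing successfully right away.
    end_io(bio, 0)
```

      Every bio handed to the function is completed exactly once, whether the
      mapping step succeeds or fails.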

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't print 'has skinny extents' anymore on mount · 49f468c9
      Committed by Nikolay Borisov
      Skinny extents have been a default mkfs feature since version 3.18
      (introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs:
      make skinny-metadata default")). The message doesn't bring any value
      to users, so simply remove it.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't print 'flagging with big metadata' anymore on mount · 6b769dac
      Committed by Nikolay Borisov
      The message was added in commit 727011e0 ("Btrfs: allow metadata blocks
      larger than the page size") in 2010, and big metadata has been the
      default for mkfs since 3.12 (2013).  The message doesn't really convey
      any useful information to users. Remove it.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: properly flag filesystem with BTRFS_FEATURE_INCOMPAT_BIG_METADATA · e26b04c4
      Committed by Nikolay Borisov
      Commit 6f93e834 seemingly inadvertently moved the code responsible
      for flagging the filesystem as having BIG_METADATA to a place where
      setting the flag was essentially lost. This means that
      filesystems created with kernels containing this bug (starting with 5.15)
      can potentially be mounted by older (pre-3.4) kernels. In reality
      chances for this happening are low because there are other incompat
      flags introduced in the mean time. Still the correct behavior is to set
      INCOMPAT_BIG_METADATA flag and persist this in the superblock.
      
      Fixes: 6f93e834 ("btrfs: fix upper limit for max_inline for page size 64K")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print checksum type and implementation at mount time · c8a5f8ca
      Committed by David Sterba
      Per user request, print the checksum type and implementation at mount
      time among the messages. The checksum is user configurable and the
      actual crypto implementation is useful to see for performance reasons.
      The same information is also available after mount in the
      /sys/fs/btrfs/FSID/checksum file.
      
      Example:
      
        [25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm
      
      Link: https://github.com/kdave/btrfs-progs/issues/483
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: output mirror number for bad metadata · 8f0ed7d4
      Committed by Qu Wenruo
      When handling a real world transid mismatch image, it's hard to know
      which copy is corrupted, as the error messages just look like this:
      
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
      
      We don't even know if the retry is caused by btrfs or the VFS retry.
      
      To make things a little easier to read, add mirror number for all
      related tree block read errors.
      
      So the above messages would look like this:
      
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
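
      As an illustration of the new message layout, the hypothetical helper
      below rebuilds the line from its fields. The 8-digit hex width is an
      assumption that fits the 4-byte checksum in the example above; it is not
      taken from the kernel's format string.

```python
def csum_mismatch_message(device, logical, mirror, wanted, found, level):
    """Build a checksum-mismatch warning that names the failing mirror."""
    return (
        f"BTRFS warning (device {device}): checksum verify failed on "
        f"logical {logical} mirror {mirror} "
        f"wanted 0x{wanted:08x} found 0x{found:08x} level {level}"
    )


# Two reads of the same logical address, one per mirror.
messages = [
    csum_mismatch_message("dm-3", 30408704, mirror, 0xCDCDCDCD, 0x3C0ADC8E, 0)
    for mirror in (1, 2)
]
```

      With the mirror number in each line, repeated failures on the same
      logical address can be told apart by copy.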

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update messages, add "logical" ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reject log replay if there is unsupported RO compat flag · dc4d3168
      Committed by Qu Wenruo
      [BUG]
      If we have a btrfs image with dirty log, along with an unsupported RO
      compatible flag:
      
      log_root		30474240
      ...
      compat_flags		0x0
      compat_ro_flags		0x40000003
      			( FREE_SPACE_TREE |
      			  FREE_SPACE_TREE_VALID |
      			  unknown flag: 0x40000000 )
      
      Then even if we can only mount it RO, we will still cause metadata
      update for log replay:
      
        BTRFS info (device dm-1): flagging fs with big metadata feature
        BTRFS info (device dm-1): using free space tree
        BTRFS info (device dm-1): has skinny extents
        BTRFS info (device dm-1): start tree-log replay
      
      This is definitely against the RO compat flag requirement.

      [CAUSE]
      An RO compat flag only forces us to do an RO mount, but we will still do
      log replay for a plain RO mount.

      Thus this will result in us doing log replay and updating metadata.

      This can be very problematic for a new RO compat flag. For example, an
      older kernel cannot understand the v2 cache, and if we allow metadata
      updates on an RO mount, we can invalidate/corrupt the v2 cache.
      
      [FIX]
      Just reject the mount unless rescue=nologreplay is provided:
      
        BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
      
      We don't want to set rescue=nologreplay directly, as this would make the
      end user read the old data and cause confusion.

      Since such a case is really rare, we're mostly fine to just reject the
      mount with an error message, which also includes the proper workaround.
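
      The decision described above can be sketched as a small predicate. The
      bit values for the two known flags match the 0x40000003 example in this
      message; the function name and the supported mask are assumptions of this
      model, not the kernel's definitions.

```python
FREE_SPACE_TREE = 1 << 0
FREE_SPACE_TREE_VALID = 1 << 1
# Flags this hypothetical kernel understands.
SUPPORTED_COMPAT_RO = FREE_SPACE_TREE | FREE_SPACE_TREE_VALID


def must_reject_mount(compat_ro_flags, log_root, nologreplay):
    """Return True when replaying the log would violate an RO compat flag."""
    unsupported = compat_ro_flags & ~SUPPORTED_COMPAT_RO
    has_dirty_log = log_root != 0
    # Log replay writes metadata, so it is unsafe with unknown RO compat
    # flags unless the user explicitly skips the replay.
    return bool(unsupported) and has_dirty_log and not nologreplay
```

      For the image shown above (compat_ro_flags 0x40000003, log_root 30474240)
      the mount is rejected unless rescue=nologreplay is given.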
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_end_io_wq · d7b9416f
      Committed by Christoph Hellwig
      All read bios that go through btrfs_map_bio need to be completed in
      user context.  And read I/Os are the most common and timing critical
      in almost any file system workload.

      Embed a work_struct into struct btrfs_bio and use it to complete all
      read bios submitted through btrfs_map_bio, using the REQ_META flag to
      decide which workqueue they are placed on.

      This removes the need for a separate 128 byte allocation (typically
      rounded up to 192 bytes by slab) for all reads with a size increase
      of 24 bytes for struct btrfs_bio.  Future patches will reorganize
      struct btrfs_bio to make use of this extra space for writes as well.

      (All sizes are based on a typical 64-bit non-debug build)

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: centralize setting REQ_META · 08a6f464
      Committed by Christoph Hellwig
      Set REQ_META in btrfs_submit_metadata_bio instead of the various callers.
      We'll start relying on this flag inside of btrfs in a bit, and this
      ensures it is always set correctly.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't use btrfs_bio_wq_end_io for compressed writes · fed8a72d
      Committed by Christoph Hellwig
      Compressed write bio completion is the only user of btrfs_bio_wq_end_io
      for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal
      here as we only really need user context for the final completion of a
      compressed_bio structure, and not every single bio completion.

      Add a work_struct to struct compressed_bio instead and use that to call
      finish_compressed_bio_write.  This allows removing all handling of
      write bios in the btrfs_bio_wq_end_io infrastructure.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defer I/O completion based on the btrfs_raid_bio · d34e123d
      Committed by Christoph Hellwig
      Instead of attaching an extra allocation and an indirect call to each
      low-level bio issued by the RAID code, add a work_struct to struct
      btrfs_raid_bio and only defer the per-rbio completion action.  The
      per-bio action for all the I/Os is trivial and can be safely done
      from interrupt context.

      As a nice side effect this also allows sharing the boilerplate code
      for the per-bio completions.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos in comments · 143823cf
      Committed by David Sterba
      Codespell has found a few typos.

      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 16 Jul, 2022 3 commits
  8. 06 Jun, 2022 1 commit
    • btrfs: fix hang during unmount when block group reclaim task is running · 31e70e52
      Committed by Filipe Manana
      When we start an unmount, at close_ctree(), if we have the reclaim task
      running and in the middle of a data block group relocation, we can trigger
      a deadlock when stopping an async reclaim task, producing a trace like the
      following:
      
      [629724.498185] task:kworker/u16:7   state:D stack:    0 pid:681170 ppid:     2 flags:0x00004000
      [629724.499760] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [629724.501267] Call Trace:
      [629724.501759]  <TASK>
      [629724.502174]  __schedule+0x3cb/0xed0
      [629724.502842]  schedule+0x4e/0xb0
      [629724.503447]  btrfs_wait_on_delayed_iputs+0x7c/0xc0 [btrfs]
      [629724.504534]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [629724.505442]  flush_space+0x423/0x630 [btrfs]
      [629724.506296]  ? rcu_read_unlock_trace_special+0x20/0x50
      [629724.507259]  ? lock_release+0x220/0x4a0
      [629724.507932]  ? btrfs_get_alloc_profile+0xb3/0x290 [btrfs]
      [629724.508940]  ? do_raw_spin_unlock+0x4b/0xa0
      [629724.509688]  btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
      [629724.510922]  process_one_work+0x252/0x5a0
      [629724.511694]  ? process_one_work+0x5a0/0x5a0
      [629724.512508]  worker_thread+0x52/0x3b0
      [629724.513220]  ? process_one_work+0x5a0/0x5a0
      [629724.514021]  kthread+0xf2/0x120
      [629724.514627]  ? kthread_complete_and_exit+0x20/0x20
      [629724.515526]  ret_from_fork+0x22/0x30
      [629724.516236]  </TASK>
      [629724.516694] task:umount          state:D stack:    0 pid:719055 ppid:695412 flags:0x00004000
      [629724.518269] Call Trace:
      [629724.518746]  <TASK>
      [629724.519160]  __schedule+0x3cb/0xed0
      [629724.519835]  schedule+0x4e/0xb0
      [629724.520467]  schedule_timeout+0xed/0x130
      [629724.521221]  ? lock_release+0x220/0x4a0
      [629724.521946]  ? lock_acquired+0x19c/0x420
      [629724.522662]  ? trace_hardirqs_on+0x1b/0xe0
      [629724.523411]  __wait_for_common+0xaf/0x1f0
      [629724.524189]  ? usleep_range_state+0xb0/0xb0
      [629724.524997]  __flush_work+0x26d/0x530
      [629724.525698]  ? flush_workqueue_prep_pwqs+0x140/0x140
      [629724.526580]  ? lock_acquire+0x1a0/0x310
      [629724.527324]  __cancel_work_timer+0x137/0x1c0
      [629724.528190]  close_ctree+0xfd/0x531 [btrfs]
      [629724.529000]  ? evict_inodes+0x166/0x1c0
      [629724.529510]  generic_shutdown_super+0x74/0x120
      [629724.530103]  kill_anon_super+0x14/0x30
      [629724.530611]  btrfs_kill_super+0x12/0x20 [btrfs]
      [629724.531246]  deactivate_locked_super+0x31/0xa0
      [629724.531817]  cleanup_mnt+0x147/0x1c0
      [629724.532319]  task_work_run+0x5c/0xa0
      [629724.532984]  exit_to_user_mode_prepare+0x1a6/0x1b0
      [629724.533598]  syscall_exit_to_user_mode+0x16/0x40
      [629724.534200]  do_syscall_64+0x48/0x90
      [629724.534667]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [629724.535318] RIP: 0033:0x7fa2b90437a7
      [629724.535804] RSP: 002b:00007ffe0b7e4458 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [629724.536912] RAX: 0000000000000000 RBX: 00007fa2b9182264 RCX: 00007fa2b90437a7
      [629724.538156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000555d6cf20dd0
      [629724.539053] RBP: 0000555d6cf20ba0 R08: 0000000000000000 R09: 00007ffe0b7e3200
      [629724.539956] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [629724.540883] R13: 0000555d6cf20dd0 R14: 0000555d6cf20cb0 R15: 0000000000000000
      [629724.541796]  </TASK>
      
      This happens because:
      
      1) Before entering close_ctree() we have the async block group reclaim
         task running and relocating a data block group;
      
      2) There's an async metadata (or data) space reclaim task running;
      
      3) We enter close_ctree() and park the cleaner kthread;
      
      4) The async space reclaim task is at flush_space() and runs all the
         existing delayed iputs;
      
      5) Before the async space reclaim task calls
         btrfs_wait_on_delayed_iputs(), the block group reclaim task which is
         doing the data block group relocation, creates a delayed iput at
         replace_file_extents() (called when COWing leaves that have file extent
         items pointing to relocated data extents, during the merging phase
         of relocation roots);
      
      6) The async space reclaim task blocks at
         btrfs_wait_on_delayed_iputs(), since we have a new delayed iput;
      
      7) The task at close_ctree() then calls cancel_work_sync() to stop the
         async space reclaim task, but it blocks since that task is waiting for
         the delayed iput to be run;
      
      8) The delayed iput is never run because the cleaner kthread is parked,
         and no one else runs delayed iputs, resulting in a hang.
      
      So fix this by stopping the async block group reclaim task before we
      park the cleaner kthread.
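
      The ordering constraint can be expressed as a worst-case model. The step
      names and unmount_hangs() are hypothetical; the model assumes that a
      running block group reclaim task may add a delayed iput at any time, and
      that only the cleaner kthread runs delayed iputs.

```python
def unmount_hangs(steps):
    """Return True if the given unmount step order can deadlock."""
    bg_reclaim_running = True
    stray_delayed_iput = False
    for step in steps:
        if step == "stop_bg_reclaim":
            bg_reclaim_running = False
        elif step == "park_cleaner":
            # Worst case: relocation adds a delayed iput right after the
            # cleaner is parked, and nobody is left to run it.
            if bg_reclaim_running:
                stray_delayed_iput = True
        elif step == "cancel_space_reclaim":
            # The space reclaim task waits for all delayed iputs to be run.
            if stray_delayed_iput:
                return True
    return False


# Parking the cleaner first leaves a window for a stray delayed iput.
unsafe_order = ["park_cleaner", "cancel_space_reclaim", "stop_bg_reclaim"]
# The fix: stop block group reclaim before parking the cleaner.
safe_order = ["stop_bg_reclaim", "park_cleaner", "cancel_space_reclaim"]
```

      In this model unmount_hangs(unsafe_order) is True and
      unmount_hangs(safe_order) is False, matching the fix of stopping the
      block group reclaim task before the cleaner is parked.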
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 18 May, 2022 1 commit
  10. 16 May, 2022 2 commits